For years, we have seen overcapacity and under-utilization within data centers. Until recently, it was not uncommon for me to visit a site with a design load of 10 megavolt-amperes (MVA) running only a 2 MVA IT load. Even more common was server utilization under 10 percent, and not just on the odd machine or rack, but across entire IT systems.
Within the energy efficiency and optimization communities, we have spent years trying to get customers to right-size from day one by implementing modular growth strategies and ensuring critical physical infrastructure, like power and cooling, closely matches the ultimate demand of the compute load. Even when we did this, the requirements of the IT equipment were often greatly overestimated, leaving a gap between design expectation and built reality.
With the evolution of the traditional enterprise data center, the practice of dedicating a single server or processor to just one application is all but gone. The move to more software-defined platforms has also meant better use of virtual machines and grid-style systems. Back when I started at IBM, we would have called these systems mainframes.
One of the main reasons we see this lack of utilization is that systems have been designed and built to support a customer’s largest need, on their busiest day, with some extra capacity built in just in case. Those customers would also have a fully redundant system, again just in case. Online retailers, for example, would plan around large sales events such as Black Friday, Cyber Monday, or Boxing Day. They had to size for that peak load even though, most of the time, they operate nowhere near it.
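Put into rough numbers, purely for illustration (nothing here is drawn from a real site), the effect of that sizing approach looks something like this:

```python
# A back-of-the-envelope sketch of how peak-day sizing plus full (2N)
# redundancy leads to the low utilization described above.
# All figures are invented for illustration only.
peak_it_load_mw    = 2.0    # busiest day of the year (e.g., Black Friday)
growth_headroom    = 1.25   # "just in case" margin on top of that peak
redundancy_factor  = 2.0    # a fully redundant second system (2N)
average_it_load_mw = 0.8    # what the site actually draws on a normal day

installed_capacity_mw = peak_it_load_mw * growth_headroom * redundancy_factor
utilization = average_it_load_mw / installed_capacity_mw

print(f"Installed capacity: {installed_capacity_mw:.1f} MW")   # 5.0 MW
print(f"Typical utilization: {utilization:.0%}")                # 16%
```

That is broadly the pattern behind the 10 MVA site carrying a 2 MVA load mentioned earlier: each sensible-looking margin multiplies the others, and the everyday load ends up a small fraction of what was built.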
This overcapacity/under-utilization scenario can also be compared to a car. Most are designed to reach 120 miles per hour (mph), but rarely go over 80 mph and spend most of their lives traveling at 0-40 mph. Sites with good levels of redundancy, meanwhile, are like a driver who has two cars to choose from for the same journey.
The very large data center players, the hyperscalers, essentially operate fleets of trucks with supercar performance. They are 100 percent software-defined. They run almost identical hardware platforms, one for processing and one for storage, in every data center they own. That means tens of thousands of grid-connected processors and storage devices that effectively operate as one. The loss of a single site causes only a minor drop in performance, not a loss of service.
Did you know that when you perform a Google search, your device sends that request to three different data centers located in different geographies? The data center that wins the race serves you the result. Should one of those sites be offline, there are still two left! The networks we use to transmit all of this data are built in much the same way, but with even more spare capacity designed in from day one.
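That “first response wins” approach is a well-known pattern, often called request hedging or racing. The sketch below is a minimal illustration in Python, not Google’s actual implementation; the replica URLs are placeholders invented for the example. The same request goes to every replica at once, the fastest healthy reply is used, and an offline site simply loses the race.

```python
import concurrent.futures
import urllib.request

# Hypothetical replicas of the same service in three geographies.
REPLICAS = [
    "https://us.example.com/search?q=data+centers",
    "https://eu.example.com/search?q=data+centers",
    "https://asia.example.com/search?q=data+centers",
]

def fetch(url: str) -> bytes:
    """Fetch one replica; any single failure is tolerated by the race."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read()

def raced_request(urls) -> bytes:
    """Send the same request to every replica and return the first answer."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(urls))
    futures = [pool.submit(fetch, u) for u in urls]
    try:
        for fut in concurrent.futures.as_completed(futures):
            try:
                return fut.result()   # fastest healthy replica wins the race
            except Exception:
                continue              # an offline site simply loses the race
        raise RuntimeError("all replicas failed")
    finally:
        pool.shutdown(wait=False)     # let slower requests finish in the background

if __name__ == "__main__":
    print(raced_request(REPLICAS)[:200])
```

The point of the design is that losing any single replica barely slows the race; the service keeps answering as long as at least one site does.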
It’s very expensive to put cables in the ground and under the sea, so when planning new routes, network providers typically plan 10 years or more into the future. As demand grows over time, they simply “light up” an unused, or dark, fiber connection to add huge amounts of capacity almost instantly.
In essence, software platforms, the data centers that support them, and the networks connecting everything are designed to offer very high levels of capacity with even higher levels of redundancy.
All of this means that when there is a sudden and urgent demand, like Black Friday sales or a new YouTube video from Ed Sheeran, the entire ecosystem simply lifts its head and says, “Hold my beer.”
The demands we are now seeing from the mass migration of professionals and students to working from home are unprecedented, but maybe not in terms of total peak capacity. Spinal Tap’s speaker systems went all the way to 11; the global data center network is only just at nine. That level of demand is not unusual. Sustaining it for such a prolonged period of time, however, is.
In a recent press release, BT’s chief technology officer stated that the UK network, while extremely busy, was not close to capacity. He said peak traffic usually occurs on Sunday nights for a few hours; now that peak is constant and higher than ever before.
Of course, there are some bottlenecks, certain services need to be adjusted, and some users at the far edge of the network will see poor performance. Thankfully, this is not indicative of the ecosystem as a whole.
The entire ecosystem is elastic enough to handle what it’s being asked to do at present. I know from personal experience that everyone is looking very closely at the bottlenecks and working rapidly to relieve them. Most of this urgently needed expansion is about bringing already-installed equipment online or re-purposing redundant hardware.
Where new capacity is required, it’s being provisioned in a matter of days. Due to the aforementioned evolution of enterprise facilities, there is a huge amount of data center space readily available.
It’s a busy time for us here at Vertiv, but rest assured: from what I can see, our industry is being tested, and it is earning a grade of A+.