By Nick Schweissguth
A CPU’s clock speed indicates how fast it can process data. It measures how many cycles per second the CPU can execute and is expressed in gigahertz (GHz), or billions of cycles per second.
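As a small illustration of the unit, the sketch below converts a clock speed in GHz into the duration of one cycle and the number of cycles in a time window. The function names are hypothetical and chosen only for this example.

```python
def cycle_time_ns(clock_ghz: float) -> float:
    """Return the duration of one clock cycle in nanoseconds."""
    if clock_ghz <= 0:
        raise ValueError("clock speed must be positive")
    # 1 GHz is 1e9 cycles per second, so one cycle lasts 1 / clock_ghz ns.
    return 1.0 / clock_ghz


def cycles_in(seconds: float, clock_ghz: float) -> float:
    """Return how many cycles elapse in the given number of seconds."""
    return seconds * clock_ghz * 1e9


if __name__ == "__main__":
    print(cycle_time_ns(2.0))   # one cycle at 2 GHz, in nanoseconds
    print(cycles_in(1.0, 3.0))  # cycles per second at 3 GHz
```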
CPUs typically operate at various clock speeds, depending on the workload. They may run at their base speed for everyday tasks and temporarily raise it for more demanding workloads. “Overclocking” is when a CPU runs at a clock speed higher than the manufacturer-specified level. Manufacturers allow servers to overclock and operate above their base speed within a specified range to provide competitive advantages.
However, the higher the clock speed, the more heat is generated. There is a limit to how much heat a server can generate while still functioning. As a protective measure to prevent overheating, processors will automatically reduce clock speeds when they get too warm. Enabling higher clock speeds for prolonged periods of time therefore requires sufficient cooling mechanisms.
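The protective behavior described above can be pictured as a simple feedback rule. The following is a minimal sketch, not how any particular processor implements throttling: the temperature limit, frequency bounds, and step size are all assumed values chosen for illustration.

```python
def next_clock_ghz(
    current_ghz: float,
    temp_c: float,
    min_ghz: float = 2.0,    # assumed lowest frequency
    max_ghz: float = 3.5,    # assumed highest frequency
    limit_c: float = 90.0,   # assumed temperature limit
    step_ghz: float = 0.1,   # assumed adjustment per control step
) -> float:
    """Step the clock down when the chip is too warm, otherwise step it up."""
    if temp_c >= limit_c:
        # Too warm: reduce the clock, but never below the floor.
        return max(min_ghz, round(current_ghz - step_ghz, 3))
    # Cool enough: raise the clock, but never above the ceiling.
    return min(max_ghz, round(current_ghz + step_ghz, 3))


if __name__ == "__main__":
    clock = 3.5
    for temp in (85.0, 92.0, 95.0, 88.0):
        clock = next_clock_ghz(clock, temp)
        print(f"{temp:.0f} C -> {clock:.1f} GHz")
```

Under this rule, better cooling keeps the temperature below the limit more often, so the clock spends more time near its ceiling.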
Traditional Cooling Methods No Longer Adequate for Overclocking
Cloud providers typically use mechanical air cooling systems for servers in datacenters. However, air cooling is insufficient for chips with ever-increasing transistor counts.
Twenty years ago, CPUs drew 30-40 W per chip. Today, that number is 400 W, and the increase is accelerating. The Open Platform Communications (OPC) standards were set to handle up to 700 W per chip. Now, that standard is up to 1000 W, which can generate enough heat to boil a kettle of water. Increasing transistor counts, coupled with the mainstream adoption of data-heavy technologies such as artificial intelligence (AI), machine learning, edge computing, internet of things (IoT) and blockchain mining, will impose unprecedented thermal constraints on CPUs, severely limiting overclocking or preventing it entirely.
Because air cooling cannot enable overclocking in this new technology landscape, organizations are exploring liquid immersion cooling solutions, which reject substantially more heat. Microsoft’s Zissou recently published the results of its comprehensive research exploring 2-phase immersion cooling (2-PIC) technology and how it enables overclocking without adversely affecting the hardware or compromising its performance or lifespan.
Zissou’s 2-phase Immersion Cooling Research
Zissou’s research argues that 2-phase immersion cooling (2-PIC) has significant advantages over 1-phase immersion cooling (1-PIC), direct-to-chip cooling (cold plates) and other liquid cooling methods.
In 2-PIC, servers are immersed in tanks filled with dielectric (electrically non-conductive) fluids specifically engineered to transfer heat away from electronics. These fluids are non-toxic, repel moisture, and do not mix with oxygen or air contaminants. As the server heats up, the fluid eventually boils and changes from liquid to gas. This phase change removes heat from the chips. The vapor then collects on a condenser coil placed just above the surface, where it is converted back into liquid in a second phase change and falls back into the tank. The cycle repeats continuously, providing uninterrupted cooling.
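To give a feel for the phase change, the sketch below estimates how much fluid must boil off each second to carry away a chip's heat, using heat = mass flow x latent heat of vaporization. The latent heat value is an assumption picked purely for illustration (the article does not name a fluid or its properties), and the calculation ignores heat absorbed while the liquid warms up to its boiling point.

```python
def vapor_flow_g_per_s(heat_w: float, latent_heat_kj_per_kg: float) -> float:
    """Return grams of fluid vaporized per second to remove heat_w watts."""
    if latent_heat_kj_per_kg <= 0:
        raise ValueError("latent heat must be positive")
    latent_heat_j_per_kg = latent_heat_kj_per_kg * 1000.0
    kg_per_s = heat_w / latent_heat_j_per_kg  # W is J/s, so J/s over J/kg is kg/s
    return kg_per_s * 1000.0


if __name__ == "__main__":
    ASSUMED_LATENT_HEAT = 88.0  # kJ/kg, illustrative assumption only
    for chip_w in (400.0, 700.0, 1000.0):  # per-chip figures quoted in the article
        flow = vapor_flow_g_per_s(chip_w, ASSUMED_LATENT_HEAT)
        print(f"{chip_w:.0f} W -> {flow:.2f} g/s of vapor")
```

In steady state the condenser coil has to turn vapor back into liquid at the same rate, which is why the loop can run continuously without losing fluid.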
“We argue that because immersion cooling offers high thermal dissipation and low junction temperatures, it is possible to operate server parts at higher frequencies (i.e., overclocking) for longer periods of time than ever possible before. In fact, the capability to overclock opens up many new directions to enhance system performance at scale.” - Zissou
For its research, Zissou built three prototype 2-PIC tanks to explore how cloud providers can use 2-PIC to enable overclocking. It then compared the results with air cooling across several factors, including power consumption, component lifetime and total cost of ownership (TCO).
Power is an important datacenter consideration because higher power draw increases infrastructure costs. Overclocking increases power consumption substantially. However, immersion cooling provides power savings that offset the higher energy requirements.
Zissou’s research showed that 2-PIC reduced power draw by 182 W per server. These savings can offset a substantial portion of the additional power used for overclocking. 2-PIC is also significantly more environmentally friendly than air cooling.
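The offset can be worked through with simple arithmetic. In the sketch below, the 182 W saving comes from the article, while the extra power drawn by overclocking is a hypothetical input, since the article gives no figure for it.

```python
SAVINGS_W = 182.0  # per-server reduction reported for 2-PIC in the article


def net_power_change_w(overclock_extra_w: float, savings_w: float = SAVINGS_W) -> float:
    """Return the net per-server power change; a negative value is a net saving."""
    return overclock_extra_w - savings_w


def offset_fraction(overclock_extra_w: float, savings_w: float = SAVINGS_W) -> float:
    """Return the share of the overclocking power covered by the saving (max 1.0)."""
    if overclock_extra_w <= 0:
        raise ValueError("extra overclocking power must be positive")
    return min(1.0, savings_w / overclock_extra_w)


if __name__ == "__main__":
    extra_w = 200.0  # hypothetical extra power from overclocking one server
    print(net_power_change_w(extra_w))
    print(offset_fraction(extra_w))
```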
Overclocking increases the operating frequency and consequently voltage, which over time can reduce the lifespan of servers. In fact, one of the most often cited disadvantages of overclocking is the reduced lifespan of hardware components. However, immersion cooling can compensate for degradation caused by overclocking. Zissou’s research showed that when overclocking with 2-PIC, the server lifetime equals that of an air-cooled server with no overclocking (around 5 years).
Zissou worked with several teams across Azure on a complete TCO analysis of a 2-PIC datacenter. The analysis compared a non-overclockable air-cooled datacenter with an overclockable 2-PIC datacenter.
Although 2-PIC adds some costs, such as the tanks and fluid, these costs are easily offset by the savings. The savings come primarily from 2-PIC lowering the datacenter PUE by 14%, which allows the reclaimed power to be used for additional servers, amortizing all costs (construction, energy, IT, operations) across more cores.
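The link between PUE and server count follows from the definition of PUE as total facility power divided by IT power. The sketch below assumes the 14% figure is a relative reduction in PUE and that total facility power stays fixed. The baseline PUE and facility size are hypothetical values used only to make the arithmetic concrete.

```python
def it_power_kw(facility_kw: float, pue: float) -> float:
    """Return the power available to IT equipment for a given facility power and PUE."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return facility_kw / pue


def it_capacity_gain(pue_reduction: float) -> float:
    """Return the fractional gain in IT power when PUE drops by a relative fraction."""
    if not 0.0 <= pue_reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - pue_reduction) - 1.0


if __name__ == "__main__":
    FACILITY_KW = 1200.0  # hypothetical facility power budget
    BASELINE_PUE = 1.20   # hypothetical air-cooled baseline
    new_pue = BASELINE_PUE * (1.0 - 0.14)  # 14% relative reduction, per the article
    before = it_power_kw(FACILITY_KW, BASELINE_PUE)
    after = it_power_kw(FACILITY_KW, new_pue)
    print(f"IT power: {before:.0f} kW -> {after:.0f} kW")
    print(f"Capacity gain: {it_capacity_gain(0.14):.1%}")
```

Under these assumptions the gain in IT power does not depend on the baseline PUE, only on the size of the relative reduction.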
Overall, Zissou found that overclockable 2-PIC datacenters reduce TCO by 4% when compared to the non-overclockable air-cooled datacenter.
Overclocking Can Continue with 2-phase Immersion Cooling
The ability to safely overclock in cloud platforms gives enterprises many competitive benefits, including higher-performance virtual machines (VMs), oversubscribed servers, and improved VM auto-scaling. As data volumes, chip densities and heat generation continue to rise, servers in facilities that are cooled with air and other traditional methods such as water vapor and cold plates will not have this tool in their arsenal.
As Zissou has demonstrated, 2-PIC empowers cloud service providers to continue leveraging overclocking without traditional thermal limitations.
“We conclude that two-phase immersion and overclocking have enormous potential for next-generation cloud platforms.”