Data Center Infrastructure and the Future of CXL
For the first time ever, physical infrastructure needs to keep up with the pace of software adoption. This piece details the critical transition that data centers are currently undergoing, and what makes it so difficult.
Data Centers Today
Data centers are large buildings full of networked computers that represent the fundamental infrastructure layer of our world’s digital economy. Often located on the outskirts of cities, these robust facilities provide large-scale organizations with the compute and networking power required to store and process their data and run their applications. Since data centers contain an organization's most critical and proprietary assets, it is imperative that they are secure, reliable, and efficient. Like the center on a football team, data centers are often overlooked relative to how vital they are to a high-functioning offense; watch closely, though, and you see that they are the backbone of the play, and that without them the team (in this case, our economy) cannot operate efficiently.
As large language models (LLMs) continue to take the world by storm and drive large and small enterprises alike toward mass AI adoption, they are creating unprecedented demand for compute resources. In addition to accommodating our growing data and internet streaming needs, data centers must also onboard the training and inference of various multi-trillion-parameter foundational models. According to a McKinsey study, demand in the US market alone—measured by power consumption to reflect the number of servers a data center can house—is expected to reach 35 gigawatts (GW) by 2030, up from 17 GW in 2022. One H100 GPU at full tilt consumes roughly as much energy as the average American household. So, from a power consumption standpoint, building a modern AI data center (with thousands of GPUs) is equivalent to bringing a new city onto the grid. This mass data center buildout will be one of the most critical infrastructure development projects in recent U.S. history. For the first time ever, physical infrastructure needs to be built at the pace of software adoption.
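For a rough sense of scale, here is a quick back-of-the-envelope comparison in code. The wattage and household figures are our own assumptions (roughly 700 W for an H100 SXM and about 10,500 kWh per year for an average US household), not numbers from the McKinsey study, and the 16,000-GPU cluster is purely hypothetical:

```python
# Back-of-the-envelope check: one H100 vs. one US household (annual energy).
# Assumed figures: ~700 W full-tilt draw for an H100 SXM and ~10,500 kWh/year
# for the average American household.
H100_WATTS = 700
HOURS_PER_YEAR = 24 * 365
HOUSEHOLD_KWH_PER_YEAR = 10_500

h100_kwh_per_year = H100_WATTS * HOURS_PER_YEAR / 1000
print(f"One H100, running 24/7: {h100_kwh_per_year:,.0f} kWh/year")   # ~6,100 kWh
print(f"Average US household:   {HOUSEHOLD_KWH_PER_YEAR:,.0f} kWh/year")

# Scale to a hypothetical 16,000-GPU training cluster (GPU draw only, before
# cooling, networking, CPUs, and power-delivery losses).
cluster_mw = 16_000 * H100_WATTS / 1_000_000
print(f"16,000 GPUs: ~{cluster_mw:.1f} MW of continuous GPU draw")
```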
The Case for a Cohort of New Cloud Leaders
Legacy data centers will eventually be rendered obsolete: CPU-native servers were built for serialized workloads and cannot support the constant, parallelized workloads of LLMs while also meeting uptime requirements, water usage and power sourcing constraints, and the strict service level agreements (SLAs) demanded by buyers. An entirely new server architecture will be required across data centers.
Our investment in CoreWeave last year was grounded in this very premise: that the AI revolution will produce a new leader in cloud services. The existing data center infrastructure will need to be reconfigured to accommodate the power and compute requirements of the future, but legacy operators will struggle to adapt quickly due to the complexity and cost of retrofitting. CoreWeave, through its innovative product architecture, first-mover advantage, signed data center capacity, and myriad strategic relationships (Nvidia, Microsoft, etc.), has already become the largest accelerated compute provider in the United States, and it operates four of the ten largest supercomputers in the world.
As hyperscalers gradually retrofit their data centers to match today’s needs, there will be opportunities in the interim for new entrants that offer creative, efficient accelerated cloud offerings, like CoreWeave, Lambda Labs, Together AI, and The San Francisco Computer Company.
Memory as a Roadblock to Data Center Server Improvement
Despite the emergence of next generation data center providers like CoreWeave and Lambda Labs, most of the landscape—including the hyperscalers—is still in the process of adapting. The core components of data centers are (1) servers & networking, (2) storage, (3) cooling, and (4) uninterruptible power supplies (UPS). While each of these components needs to see step-change improvements, perhaps the most important is the server architecture itself.
Bleeding-edge server architecture today revolves around the integration of Nvidia’s flagship H100 graphics processing units (GPUs). Because these chips are so performant relative to the competition, Nvidia has become the industry-standard chip manufacturer for this stage of AI. To fully utilize these high-performance chips, the ancillary components within servers, like the CPU, accelerator, memory, storage, and network interfaces, must also be upgraded. Improving server performance means solving the significant memory issues that data centers face today:
- Large latency gap between DRAM and solid-state drive (SSD) storage – if a processor exhausts its memory capacity, it must wait to retrieve data from the SSD, and the time the processor spends idle significantly throttles performance.
- Core counts in multi-core processors are scaling faster than main memory channels – once core counts outgrow the available memory channels, each additional core is underutilized (a rough per-core bandwidth sketch follows below).
- Growth of accelerators with attached DRAM – memory that cannot be shared with the rest of the system, which translates to more stranded memory resources.
None of this is easy to solve, as each component (CPU, memory, storage, accelerator, interconnect, etc.) is produced by a different manufacturer and cannot be fully optimized together out of the box.
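To illustrate the second point above, a rough sketch with assumed figures (an 8-channel server platform running DDR5-4800, at roughly 38.4 GB/s per channel) shows per-core memory bandwidth shrinking as cores are added while the channel count stays fixed:

```python
# Illustrative only: per-core memory bandwidth as core counts grow while the
# number of memory channels stays fixed. The channel count and DDR5-4800 speed
# are assumptions, not figures from this piece.
CHANNELS = 8
GBPS_PER_CHANNEL = 38.4          # DDR5-4800: 4800 MT/s * 8 bytes per transfer
total_bw = CHANNELS * GBPS_PER_CHANNEL

for cores in (32, 64, 96, 128):
    print(f"{cores:>3} cores -> {total_bw / cores:4.1f} GB/s per core")
# Total bandwidth stays flat at ~307 GB/s, so every added core gets a thinner slice.
```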
A case study that speaks to the benefit of standardization in computer hardware is Apple’s transition away from Intel processors in favor of its own dedicated silicon. With its debut M-series system on a chip (SoC) working with its in-house flash storage and multi-core CPUs, Apple was able to show a 100%+ performance improvement over the previous generation of Intel-powered MacBooks [see graphic below]. This was a step-change improvement, rather than the incremental gains we were used to seeing in laptops.
In the context of data center servers, the near-term solution would be for leading manufacturers to agree on a server architecture standard in which each component can communicate with the others without friction – something modular, scalable, and relatively “future-proofed.” Thankfully, such a standard for server interconnects exists: Compute Express Link (CXL), a step in the right direction.
Understanding Compute Express Link (CXL)
The Compute Express Link (CXL) consortium was formed in 2019 by leading computer component manufacturers including Intel, Google, Cisco, Nvidia, AMD, and Microsoft. The goal was to develop an open interconnect standard in which processors, expanded memory, and accelerators could communicate with low latency and maintain memory coherence, even within heterogeneous system architectures. In a server with CXL interconnects, the host CPU and external devices effectively share each other’s memory – this solves the previously highlighted issue of exhausted memory capacity on today’s servers.
CXL builds on top of the existing Peripheral Component Interconnect Express (PCIe) standard that is most widely used in the industry today and extends its capabilities with three main protocols—the combination of these protocols allows every component in the server to utilize the others’ memory and even expand memory capacity when needed.
- CXL.io: Similar to PCIe, the existing standard interface for motherboard components.
- CXL.cache: Protocol enabling accelerators to access host CPU’s memory for added performance.
- CXL.memory: Allows host CPU to access device attached memory.
The illustrations below show how these protocols work together to facilitate memory sharing.
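To make device-attached memory a bit more concrete for software: on a Linux server where a CXL Type-3 memory expander has been onlined as system RAM, the added capacity typically shows up as a CPU-less NUMA node. The sketch below simply lists NUMA nodes and flags the memory-only ones; the sysfs paths are standard Linux, but the presence of a CXL expander on the machine is an assumption made for illustration:

```python
# Minimal sketch: enumerate NUMA nodes on a Linux host and flag CPU-less
# (memory-only) nodes, which is how CXL-attached memory commonly appears
# once it has been onlined as system RAM.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpulist = (node / "cpulist").read_text().strip()
    mem_total_kb = 0
    for line in (node / "meminfo").read_text().splitlines():
        if "MemTotal" in line:
            mem_total_kb = int(line.split()[-2])   # "... MemTotal: N kB"
    kind = "CPU-less (e.g. CXL expander)" if not cpulist else f"CPUs {cpulist}"
    print(f"{node.name}: {mem_total_kb / 1024 / 1024:.1f} GiB, {kind}")
```

From there, ordinary NUMA tooling (numactl, for example) can steer workloads toward or away from that node without the application knowing the memory sits behind a CXL link.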
The Evolution of CXL Interconnect Standards
Since 2019, Compute Express Link has evolved considerably, introducing step-change improvements in speed and functionality with each new revision. Below is a summary:
CXL 1.0/1.1
- Allows for only one host CPU.
- Leverages PCIe 5.0 physical & electrical interface.
- Allows data transfers at 32 GT/s in each direction over a 16-lane link.
- Devices can only be utilized by one host processor at a time.
CXL 2.0
- Introduces CXL switches that allow up to 16 CPUs to simultaneously access all memory in the system.
- Devices can be utilized by multiple host processors at once.
CXL 3.1
- Leverages PCIe 6.1 physical & electrical interface.
- Increases data transfer speeds to 64 GT/s in each direction over a 16-lane link (see the quick bandwidth arithmetic after this list).
- Peer to peer memory access—devices can communicate with each other without involving the host CPU.
- Allows memory allocations to be dynamically reconfigured without having to reboot the host CPU.
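For a rough sense of what those transfer rates mean in bytes, the arithmetic below converts the quoted per-lane rates into raw 16-lane bandwidth, ignoring encoding and protocol overhead (real throughput is somewhat lower):

```python
# Raw x16 link bandwidth per direction, ignoring encoding/protocol overhead.
LANES = 16
for label, gt_per_s in (("CXL 1.x / PCIe 5.0", 32), ("CXL 3.x / PCIe 6.x", 64)):
    gb_per_s = gt_per_s * LANES / 8     # GT/s ~ Gb/s per lane; divide by 8 for bytes
    print(f"{label}: {gb_per_s:.0f} GB/s per direction over x16")
# -> 64 GB/s and 128 GB/s respectively.
```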
Summarizing the Benefits of CXL-Enabled Servers
- Enables low-latency connectivity between all server components.
- Cache coherency ensures that the host processor and CXL devices (GPU, FPGA, storage, SmartNIC) all see the same data.
- Allows the host processor to access expansion memory devices when its own memory is at capacity.
- Creates an “as-needed” memory paradigm, where all memory in a CXL-enabled system can be utilized by the underlying host processors.
- All three CXL protocols are secured via Integrity and Data Encryption (IDE) which provides confidentiality, integrity, and replay protection.
- CXL enables composable server architecture: servers are broken apart into their component resources, which are pooled and dynamically assigned to workloads on the fly.
Standardizing cache-coherent interconnects enables each manufacturer in the value chain to fully optimize its product without having to worry about compatibility with other components. The benefits of the CXL interconnect standard are clear – its rapid development (three generations since 2019) and industry-wide adoption are a testament to the tremendous value it has delivered so far.
Today: Living in an NVIDIA World (updated April 2024)
In theory, having an industry standard interconnect should have created a thriving ecosystem of compatible, high-performance infrastructure products. However, this is not the reality today. Instead, Nvidia’s chips and their surrounding protocols (Infiniband, NVLink, etc.) are so much faster than the competition that they are exposing the inherent limitations of PCIe, the underlying infrastructure that CXL products are built on.
Real estate on Nvidia’s H100 chips is highly sought after, and PCIe is struggling to justify the space it occupies. For some context, a 16-lane PCIe interface has 128 GB/s of bandwidth, while NVLink provides a whopping 900 GB/s of bandwidth for GPUs. As a result of this delta, Nvidia is limiting the number of PCIe lanes on its chips (the infrastructure that CXL is built upon) in favor of NVLink. Therefore, by building on PCIe, chip designers are already underwriting a baseline performance/IO efficiency loss. This is why many chip manufacturers are quietly shelving their CXL products. The hyperscalers are following the path Nvidia is paving—Google’s TPUs, Nvidia’s GPUs, Meta’s upcoming MTIA Gen 3 AI accelerator, and Microsoft’s Maia 100 (Athena) and Maia 200 (Braga) chips are all expected to halve PCIe lane counts to 8 lanes instead of 16.
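Using the figures quoted above (128 GB/s for a 16-lane PCIe interface, 900 GB/s for NVLink), the size of the gap, and what halving lane counts does to it, is easy to quantify:

```python
# Bandwidth gap between NVLink and PCIe, using the figures cited in this piece.
PCIE_X16_GBPS = 128     # 16-lane PCIe interface
NVLINK_GBPS = 900       # NVLink bandwidth for H100-class GPUs

print(f"NVLink advantage over PCIe x16: ~{NVLINK_GBPS / PCIE_X16_GBPS:.1f}x")
print(f"PCIe x8 (halved lanes): ~{PCIE_X16_GBPS / 2:.0f} GB/s, "
      f"a ~{NVLINK_GBPS / (PCIE_X16_GBPS / 2):.1f}x gap")
```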
To summarize, pad space on chips is extremely limited, and Nvidia is providing the steering instructions for what peripheral products should look like. Instead of CXL being the standard interconnect, Nvidia itself is becoming the standard.
Final Thoughts & Our Perspective
In the short and mid-term, it is hard to envision Nvidia falling out of favor as the de facto chip hardware and software provider for AI data centers. The input/output (IO) performance difference between PCIe and NVLink is too great for both customers and chip developers to ignore. With the high performance and low latency requirements of AI workloads, CXL just doesn’t offer enough to be the standard right now.
However, we do believe that CXL will play a role in addressing the memory bottleneck unfolding with data center CPUs. Unlike GPUs, server CPUs need many PCIe lanes, as PCIe is their primary device-to-device communication channel—and it delivers roughly 4x the bandwidth of the competing DDR interface. As previously mentioned in this paper, memory bandwidth is a huge bottleneck for CPUs, and PCIe presents a medium to expand that bandwidth at the cost of some latency. For GPU workloads this latency is detrimental, but for CPU workloads it is not. We believe there is a chance that CXL itself will transition away from PCIe infrastructure and toward Ethernet-style SerDes. If you are building CXL-enabled components or software specifically for CPUs, or any AI infrastructure product, please reach out. Our deep relationships with the largest tech manufacturers in the world (Samsung, SK, Toyota, and more) put us in a unique position to accelerate your path to commercialization.
Sources
- https://www.networkworld.com/article/969402/what-are-data-centers-how-they-work-and-how-they-are-changing-in-size-and-scope.html
- https://www.grcooling.com/blog/data-centers-history/
- https://www.paloaltonetworks.com/cyberpedia/what-is-a-data-center
- https://adrianco.medium.com/supercomputing-predictions-custom-cpus-cxl3-0-and-petalith-architectures-b67cc324588f
- https://www.rambus.com/blogs/compute-express-link/
- https://www.rambus.com/blogs/new-cxl-3-1-controller-ip-for-next-generation-data-centers/
- https://dgtlinfra.com/top-data-center-companies/
- https://www.semianalysis.com/p/cxl-is-dead-in-the-ai-era