TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

Published 4 Apr 2023 in cs.AR, cs.AI, cs.LG, and cs.PF | (2304.01433v3)

Abstract: In response to innovations in ML models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than Infiniband, OCSes and underlying optical components are <5% of system cost and <3% of system power. Each TPU v4 includes SparseCores, dataflow processors that accelerate models that rely on embeddings by 5x-7x yet use only 5% of die area and power. Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips and thus ~10x faster overall, which along with OCS flexibility helps LLMs. For similar sized systems, it is ~4.3x-4.5x faster than the Graphcore IPU Bow and is 1.2x-1.7x faster and uses 1.3x-1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers of Google Cloud use ~3x less energy and produce ~20x less CO2e than contemporary DSAs in a typical on-premise data center.


Authors (14)

Citations (253)


Summary

The paper introduces TPU v4 with optical circuit switches and SparseCores that enhance flexibility and efficiency in complex machine learning workloads.
TPU v4 is 2.1x faster than TPU v3 with 2.7x better performance per Watt; for similar-sized systems it is ~4.3x–4.5x faster than the Graphcore IPU Bow and 1.2x–1.7x faster than the Nvidia A100 while using 1.3x–1.9x less power.
Running in Google Cloud's energy-optimized warehouse-scale computers, TPU v4 uses ~3x less energy and produces ~20x less CO2e than contemporary DSAs in a typical on-premise data center.

Overview of TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning

The paper presents TPU v4, Google's fifth domain-specific architecture (DSA) for machine learning and its third supercomputer built for such workloads. Modern ML workloads keep growing in scale and in algorithmic diversity, and they require correspondingly efficient computing infrastructure. TPU v4 is distinguished by two features, Optical Circuit Switches (OCSes) and SparseCores, which improve performance, scalability, and energy efficiency.

Key Architectural Features

Optical Circuit Switches (OCSes): OCSes dynamically reconfigure the TPU v4 interconnect topology, which improves the supercomputer's scale, availability, utilization, modularity, and power efficiency. Because the topology is configurable, users can pick a twisted 3D torus when it suits the workload, for example communication patterns such as all-to-all, which embeddings in large-scale models rely on. The paper reports that OCSes and their underlying optical components account for less than 5% of system cost and less than 3% of system power, and describes them as much cheaper, lower power, and faster than Infiniband.
SparseCores: Each TPU v4 includes SparseCores, specialized dataflow processors that accelerate embedding-reliant models by 5x–7x while using only 5% of die area and power. They are tailored to the high memory bandwidth demands of embedding operations, which are typical of deep learning recommendation models (DLRMs).
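
As a rough illustration of the two features above, the following Python sketch shows how neighbors wrap around in a plain 3D torus and the gather-and-sum pattern behind an embedding lookup. It is not TPU v4 code: the 4x4x4 dimensions, the tiny table, and the function names torus_neighbors and embedding_bag_sum are assumptions made for this example, and the twisted variant of the torus is not modeled.

```python
def torus_neighbors(coord, dims):
    """Return the 6 neighbors of `coord` in a 3D torus of shape `dims`."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            nxt = list(coord)
            # Modulo arithmetic provides the wraparound links of a torus.
            nxt[axis] = (nxt[axis] + step) % dims[axis]
            neighbors.append(tuple(nxt))
    return neighbors


def embedding_bag_sum(table, ids):
    """Gather the rows named by `ids` and sum them elementwise."""
    pooled = [0.0] * len(table[0])
    for row_id in ids:
        for d, value in enumerate(table[row_id]):
            pooled[d] += value
    return pooled


# Hypothetical 4x4x4 torus: the corner chip reaches (3, 0, 0) via wraparound.
corner_links = torus_neighbors((0, 0, 0), (4, 4, 4))

# Hypothetical 4-row, 2-wide embedding table and one multi-valued feature.
table = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.5, 0.5]]
pooled = embedding_bag_sum(table, [0, 2, 3])

print(corner_links)
print(pooled)
```

Each lookup touches a few scattered rows and does very little arithmetic per byte read, which is consistent with the memory bandwidth demands the summary attributes to embedding operations.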

Performance and Energy Efficiency

TPU v4 improves substantially on its predecessor: it is reported to be 2.1x faster than TPU v3 with 2.7x better performance per Watt. The supercomputer scales to 4096 chips, 4x larger than its predecessor, which together with OCS flexibility helps it handle large models such as LLMs. For similar-sized systems, TPU v4 is ~4.3x–4.5x faster than Graphcore's IPU Bow, and it is 1.2x–1.7x faster than Nvidia's A100 while using 1.3x–1.9x less power. Average training performance is reported at approximately 60% of peak FLOPS, which indicates how much of the hardware's capability reaches practical ML workloads.
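
The headline ratios can be combined with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not a result from the paper: it assumes the 2.1x speedup and the 2.7x performance-per-Watt gain were measured on the same workloads, and that system performance scales linearly with chip count. The variable names are chosen for this example.

```python
# Figures quoted in the text (TPU v4 relative to TPU v3).
speedup_per_chip = 2.1
perf_per_watt_gain = 2.7
chip_count_ratio = 4.0  # the 4096-chip system is 4x larger, per the abstract

# power = performance / (performance per Watt), so the ratios divide.
relative_power_per_chip = speedup_per_chip / perf_per_watt_gain

# Naive system-level estimate: assumes perfectly linear scaling with chip count.
naive_system_speedup = chip_count_ratio * speedup_per_chip

print(f"{relative_power_per_chip:.2f}")
print(f"{naive_system_speedup:.1f}")
```

Under these assumptions a TPU v4 chip would draw roughly 0.78x the power of a TPU v3 chip, and the naive system-level product comes to 8.4x, somewhat below the ~10x the abstract quotes. The sketch does not attempt to explain that gap.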

Implications and Future Prospects

Deploying TPU v4 in the cloud also has implications for energy consumption and environmental sustainability. According to the paper, TPU v4s inside the energy-optimized warehouse-scale computers of Google Cloud use ~3x less energy and produce ~20x less CO2e than contemporary DSAs in a typical on-premise data center. This positions cloud-hosted TPU v4 as a more sustainable option for large-scale machine learning.
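
One way to read the energy and emissions figures together is sketched below, using the abstract's values of ~3x less energy and ~20x less CO2e. It assumes a simplified model in which emissions equal energy multiplied by the carbon intensity of the electricity supply; that model and the variable names are assumptions of this example, not the paper's accounting method.

```python
energy_reduction = 3.0   # ~3x less energy, per the abstract
co2e_reduction = 20.0    # ~20x less CO2e, per the abstract

# If CO2e = energy * carbon_intensity, the reductions multiply:
# co2e_reduction = energy_reduction * intensity_reduction.
implied_intensity_reduction = co2e_reduction / energy_reduction

print(f"{implied_intensity_reduction:.1f}")
```

Under this simplified model, most of the emissions gap would come from a cleaner energy supply (a factor of roughly 6.7) rather than from lower energy use alone.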

The combination of OCS infrastructure and SparseCores reflects an architectural direction that prioritizes flexibility and specialization, both important as machine learning models continue to grow in scale and variety. These results suggest that further architectural specialization can still yield gains in efficiency and performance.

Future work may extend optical circuit switching, refine SparseCore functionality, and push past the performance benchmarks set by TPU v4. As large models and recommendation systems become more central to AI applications, architectures such as TPU v4 indicate how rising AI workload demands can be met within practical and ecological limits.
