
Got CPU? Unlocking AI Speed and Flexibility with Algorithms and Virtualization

The current AI industry assumes that CPUs are inferior to GPUs and other specialized processors (such as TPUs) for heavyweight AI computations. The popular algorithms for training neural networks, developed in the 1980s, are essentially sequences of matrix multiplications. Matrix multiplication is one of the few operations whose regular memory-access patterns let GPUs (and TPUs) keep thousands of cores busy, performing the computation significantly faster than CPUs.
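To make that concrete, here is a minimal NumPy sketch (the dimensions are illustrative assumptions, not anyone's benchmark): a dense feed-forward layer is essentially one big matrix multiplication, and every input must touch every output neuron.

```python
import numpy as np

batch, d_in, d_out = 64, 1024, 4096
x = np.random.randn(batch, d_in).astype(np.float32)
W = np.random.randn(d_in, d_out).astype(np.float32)

# Full dense layer: one matrix multiplication, touching all d_out neurons.
y = x @ W
flops = 2 * batch * d_in * d_out
print(f"one dense layer: {flops:,} FLOPs")  # ~537 million for this small layer
```

It is exactly this regular, predictable structure that lets GPUs and TPUs stream the computation across thousands of cores.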

With modern advancements in AI training, it has become increasingly clear that full matrix computations are overkill for large models. Selectively computing only the parts that matter (adaptive sparse operations) should be a far more efficient alternative to full matrix multiplications. However, the overhead of adaptive sparse selection and its cache-unfriendly memory-access patterns make existing implementations prohibitively slow: the unpredictable memory accesses forfeit most of the gains that specialized hardware offers. As a result, the community remains stuck with the wasteful 1980s algorithms for training neural networks, hoping that hardware acceleration will keep scaling.
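The trade-off is easy to see in a rough sketch (the dimensions and the random stand-in selector below are assumptions for illustration, not ThirdAI's method): computing only a small, input-dependent subset of output neurons slashes the FLOP count, but the scattered column reads are precisely the irregular access pattern that specialized hardware handles poorly.

```python
import numpy as np

d_in, d_out, k = 1024, 4096, 40           # activate roughly 1% of the neurons
x = np.random.randn(d_in).astype(np.float32)
W = np.random.randn(d_in, d_out).astype(np.float32)

# Stand-in for a learned selector: in practice the active set depends on x.
active = np.random.choice(d_out, size=k, replace=False)

# Gathering scattered columns breaks the regular access pattern of a matmul.
y_sparse = x @ W[:, active]               # 2 * d_in * k FLOPs
print(f"sparse: {2 * d_in * k:,} FLOPs vs dense: {2 * d_in * d_out:,} FLOPs")
```

The arithmetic savings are enormous; the hard part, as the paragraph above notes, is selecting the active set cheaply and living with the irregular memory traffic.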

Enter ThirdAI’s efficient, brain-like algorithm

At ThirdAI, we have figured out how to leverage probabilistic data structures to design super-efficient, brain-like “associative memories.” These memories can enable selective sparse computations — analogous to sparse coding in the brain — for efficient training of neural networks. The resulting implementation, called BOLT (Big Ol’ Layer Training), requires exponentially fewer computations to train neural networks while reaching the same accuracy.
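The following toy sketch shows the general idea behind hash-based associative memories (a simplified SimHash with a single table; BOLT's actual data structures are more sophisticated and are not shown here): neurons are bucketed by a locality-sensitive hash of their weights, so hashing an input retrieves, in constant time, only the neurons likely to activate strongly on it.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d_in, d_out, n_bits = 1024, 4096, 7       # 2^7 buckets => ~0.8% of neurons each

# SimHash: the sign pattern of a few random projections. Similar vectors
# land in the same bucket with high probability.
planes = rng.standard_normal((n_bits, d_in)).astype(np.float32)

def simhash(v):
    bits = (planes @ v) > 0
    return int("".join("1" if b else "0" for b in bits), 2)

W = rng.standard_normal((d_in, d_out)).astype(np.float32)

# Build the "associative memory": bucket every neuron by the hash of its weights.
buckets = defaultdict(list)
for j in range(d_out):
    buckets[simhash(W[:, j])].append(j)

# Query: hash the input and compute only the colliding neurons.
# (Real systems use several hash tables so the candidate set is never empty.)
x = rng.standard_normal(d_in).astype(np.float32)
active = buckets[simhash(x)]
y = x @ W[:, active]
print(f"activated {len(active)} of {d_out} neurons")
```

The lookup replaces the expensive "which neurons matter for this input?" search with a handful of hash computations, which is what makes selective sparse training practical on a CPU.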

The BOLT algorithm trains neural networks with 1% or fewer of the FLOPs required by dense training, unlike standard tricks such as quantization, pruning, and structured sparsity, which offer only a modest constant-factor improvement. As a result, we don’t have to rely on any specialized instructions, and the speedups appear naturally on any commodity CPU, whether Intel, AMD, or ARM. Even older generations of commodity CPUs can be made capable of training billion-parameter models faster than A100 GPUs. (Learn more about the technology and benchmarks.)

Software integrations

ThirdAI’s software accelerator is implemented in plain C++, with readily available Python bindings. As a result, the power of ThirdAI’s algorithm can be brought into any workflow, be it TensorFlow, PyTorch, or any other. The software does not rely on specialized instructions, making the speedup readily available on any CPU architecture.
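As a purely hypothetical sketch of what that looks like in practice (the module, class, and method names below are illustrative assumptions, not ThirdAI's published API): a pybind11-style binding exposes the C++ engine as an ordinary Python object that consumes NumPy arrays, so it can sit next to a TensorFlow or PyTorch data pipeline unchanged.

```python
import numpy as np

# import thirdai_bolt  # hypothetical pybind11 module name; stand-in below
class BoltModelStandIn:
    """Placeholder mimicking the shape of a C++-backed training engine."""
    def __init__(self, layer_dims):
        self.layer_dims = layer_dims
    def train_batch(self, x, y):
        # The real engine would run sparse forward/backward passes in C++.
        return float(np.square(x.mean(axis=1) - y).mean())

model = BoltModelStandIn(layer_dims=[1024, 4096, 10])
for step in range(3):
    # Any framework's data pipeline works, as long as it yields arrays.
    x = np.random.randn(32, 1024).astype(np.float32)
    y = np.random.randn(32).astype(np.float32)
    print(f"step {step}: loss {model.train_batch(x, y):.3f}")
```

Because the interface is just arrays in, losses and predictions out, no framework-specific integration work is needed.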

From datacenters to edge: democratizing AI training

CPUs are readily available: the cheapest, most architecturally consistent, and most fully virtualized compute option that exists today. The advantages of enabling CPUs, via ThirdAI’s BOLT, to achieve top-of-the-line AI performance cannot be overstated: it provides both acceleration and availability. Hardware accelerators are expensive and require significant changes to infrastructure. BOLT on CPUs enables AI for everyone without resorting to costly infrastructural changes.

ThirdAI’s algorithm is efficient enough (it requires only a handful of CPU cores to achieve GPU-like acceleration) to enable AI training even on edge devices. This capability could completely change the economics of AI and IoT, which currently assume that AI training is a job for the cloud.

Virtualization via Radium: push-button AI acceleration, all in one place  

Application and infrastructure management can get messy where specialized acceleration is concerned: each of the many available AI acceleration options presents end users with its own integration challenges. VMware’s Project Radium offers a virtualized solution for AI/ML applications without requiring decades of systems enablement and ecosystem evolution. Radium will not only unify different hardware accelerators but will also support software-acceleration backends. With Radium and ThirdAI’s BOLT integrated, CPU-only servers will see significant AI speedups, offering dramatic acceleration on existing infrastructure without the need for hardware accelerators.

The bottom line

The integration of Radium with ThirdAI’s software backend will automatically turn existing CPUs, old and new generations alike, into AI powerhouses. The solution is a no-brainer first step before investing in any expensive infrastructural changes. Radium’s seamless developer experience, paired with ThirdAI’s BOLT, brings high-performance ML training to a wide variety of contexts, from the cloud to the far edge, supporting everything from high-end multi-CPU servers to low-end Raspberry Pi-class systems. Our CPU-driven approach will also help ensure the availability of high-performance ML for everyone amid the current global chip shortage.
