Waterloo Business Review

The Hardware Bottleneck For Artificial Intelligence

Research in AI has yielded models whose complexity grows by orders of magnitude in order to solve the world’s pressing machine learning problems. The current state of the art ranges from Facebook’s M2M-100 model, which translates over 100 languages, to Google’s BERT embeddings and more. However, this growth also raises the demand for processing power, since the algorithms must learn a larger number of variables, and therefore for better hardware infrastructure. With the demise of Moore’s Law constraining the speed gains available from adding transistors to circuitry, to what extent does the future of AI research face a fundamental hardware bottleneck? We explore emerging trends that may come to dominate mainstream research as a consequence: the intersection with quantum computing to gain new hardware capabilities and design hybrid quantum-based models, and tinyML, which miniaturizes existing models for compatibility with low-power embedded systems.

AI’s history can be traced to Pitts and McCulloch’s work on simple artificial neurons performing logic functions, which culminated in Minsky’s basic neural net machine, SNARC, in 1951. In the aftermath of the Dartmouth Conference of 1956, more complex architectures were designed, ranging from STUDENT and ELIZA, which used large “semantic nets” to converse in English, to Gerald Tesauro’s TD-Gammon, which leveraged temporal difference learning to analyze gaming environments. The complexity stems from the evolution of the “layers” used in creating such models, from simpler single-input neural networks (Ivakhnenko and Lapa, 1967) to convolutions that process images for computer vision (Fukushima, 1980) and recurrent memory for sequenced data (Rumelhart, 1986). The depth and combination of such features in a single network have significantly increased training time, as models must learn a vast number of hidden variables.

These advancements in the size, complexity, and hyperparameters of deep neural networks have significantly outpaced research in semiconductor technology, which is core to any hardware system, as noted in studies by the University of Notre Dame. The growth of the internet has made unparalleled amounts of unstructured data available to train such models, which further compounds the problem. As Shi (2019) illustrates, the hardware platforms currently in use, such as graphics processing units (GPUs), lack the computational bandwidth and memory energy-efficiency to keep scaling AI development in the future. Notable examples of increasing hardware requirements include DeepMind’s AlphaGo Zero, the world’s premier reinforcement-learning agent for the board game Go, which was trained on 4.9 million games against itself and required 64 GPUs and 19 CPU servers, whereas the chess model Fritz-3 in 1995 ran on merely 2 CPUs. Alternatively, consider OpenAI’s GPT-3, a 175 billion-parameter language model trained on over 410 billion word tokens, which would theoretically require 355 years to train on a single GPU. The size of state-of-the-art ML models grows exponentially: training data rose from GPT-2’s 10 billion tokens in 2019 to GPT-3’s 410 billion in 2020. GPU advancements, on the other hand, are closer to linear, with NVIDIA’s flagship GPU increasing its memory size from 32 GB to 40 GB over two years. Thus, a hardware constraint exists, which can not only impair future ML research but also immensely increase model training time, and it is attributed to the degradation of Moore’s Law.
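To make the mismatch concrete, the short sketch below compares the two growth factors using only the figures quoted above. The helper name `growth_factor` is hypothetical, and the comparison is a back-of-the-envelope illustration rather than a rigorous benchmark, since training tokens and GPU memory are different kinds of quantities.

```python
def growth_factor(start: float, end: float) -> float:
    """Return how many times larger `end` is than `start`."""
    return end / start


# Figures quoted in the article.
token_growth = growth_factor(10e9, 410e9)  # GPT-2 -> GPT-3 training tokens
memory_growth = growth_factor(32, 40)      # flagship GPU memory, in GB

# How many times faster the data requirement grew than GPU memory did.
gap = token_growth / memory_growth

print(f"Training tokens grew {token_growth:.0f}x")
print(f"GPU memory grew {memory_growth:.2f}x")
print(f"Ratio between the two growth factors: {gap:.1f}")
```

On these figures, the training data grew 41-fold while GPU memory grew by a quarter, which is the gap the rest of this article refers to as the hardware bottleneck.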

Moore’s Law, central to semiconductor design, claims that the number of transistors in a circuit, used as a proxy for hardware speed, doubles every two years, as seen in the historic trend from 1970 (Figure 1). While initial innovations by Bell Labs and Toshiba shrank the size of transistors by seven orders of magnitude in under five decades, fundamental limits are now near: at the current 10 nanometer transistor size, the channel on silicon chips is not always stable for current, causing electrical leakage. Thus, it is much harder to shrink transistors any further, as leakage threatens the integrity of the chip, limiting the voltage it can carry and consequently its processing power.
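The doubling rule can be written as N(t) = N0 * 2^((t - t0) / 2), where N0 is the transistor count in a baseline year t0. The sketch below projects that idealized curve. The baseline of roughly 2,300 transistors in 1971 (the Intel 4004) is an assumption brought in for illustration and does not come from Figure 1; the function name is likewise hypothetical.

```python
def moores_law_projection(year: float,
                          base_year: float = 1971,
                          base_count: float = 2300,
                          doubling_period: float = 2.0) -> float:
    """Idealized transistor count if density doubled every `doubling_period` years.

    The default baseline (about 2,300 transistors in 1971) is an
    illustrative assumption, not a figure taken from the article.
    """
    doublings = (year - base_year) / doubling_period
    return base_count * 2 ** doublings


for year in (1971, 1981, 1991, 2001, 2011, 2021):
    print(year, f"{moores_law_projection(year):,.0f}")
```

Fifty years of doubling every two years amounts to 25 doublings, a factor of about 33 million. That compounding is what made the trend so valuable, and it is why even a modest slowdown in the doubling period matters for hardware-hungry fields like AI.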

The constraint is further manifested in the breakdown of Dennard scaling. Dennard scaling held that as transistors shrink, their power density remains constant, which keeps a chip’s power requirement in proportion with its area. At today’s transistor sizes, leakage breaks this relationship, so shrinking transistors further raises power density and creates a barrier on speed. Hence, we are unlikely to witness the rampant innovation in transistor density seen since the 1970s, meaning that current hardware is unlikely to keep pace with the requirements of AI research. This is particularly relevant because a mechanism to sustain future research is necessary to avoid an ‘AI winter’, which could otherwise stagnate advancements in the field.

Thus, newer paradigms must be imagined. They are likely to concentrate either on creating more powerful hardware with quantum computing or on designing a novel set of algorithms with lower hardware requisites with tinyML. Addressing the former: since the crux of classical hardware lies in encoding information in binary as strings of 1s and 0s, can we rethink computing without this constraint? The human brain does not approach decision-making as binary, but rather with uncertainty. Can we develop architectures that similarly integrate this probabilistic measure into information processing, effectively birthing a newer generation of hardware? Instead of bits requiring transistors to be either on (1) or off (0), can they represent a probability distribution?

This elemental question fast-tracked the Noisy Intermediate-Scale Quantum (NISQ) era, where computing leverages superposition (the ability of objects to exist in multiple energy states simultaneously) and entanglement (an object’s state instantly referencing another’s, even over large distances, to create dependencies). Such quantum phenomena are harnessed through ‘qubits’, which can exist in superpositions with varied probabilities. Hence, particles can vary their probabilities over the course of quantum processes whilst interacting with distant particles in perfect unison through entanglement, seemingly facilitating instant information processing. Such speed gains are evidenced by Google’s quantum supremacy breakthrough, in which its 54-qubit Sycamore quantum chip performed a numeric computation task in under 200 seconds, beating a traditional computer’s estimated run-time of 10,000 years.
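Superposition and entanglement can be illustrated with a few lines of arithmetic. The sketch below is a classical simulation of the underlying math in plain Python, not code for real quantum hardware, and all function names are hypothetical. It assumes the standard textbook representation: a qubit is a vector of amplitudes, and the probability of each measurement outcome is the squared magnitude of its amplitude.

```python
import math

S = 1 / math.sqrt(2)  # normalization factor used by the Hadamard gate


def probabilities(state):
    """Measurement probability of each basis state: |amplitude|^2."""
    return [abs(amplitude) ** 2 for amplitude in state]


def hadamard(state):
    """Apply a Hadamard gate to a single-qubit state (a0, a1)."""
    a0, a1 = state
    return (S * (a0 + a1), S * (a0 - a1))


def hadamard_on_first(state):
    """Apply a Hadamard gate to the first qubit of a two-qubit state.

    Amplitudes are ordered |00>, |01>, |10>, |11>.
    """
    a00, a01, a10, a11 = state
    return (S * (a00 + a10), S * (a01 + a11),
            S * (a00 - a10), S * (a01 - a11))


def cnot(state):
    """Flip the second qubit when the first is 1 (swaps |10> and |11>)."""
    a00, a01, a10, a11 = state
    return (a00, a01, a11, a10)


# Superposition: a qubit starting in |0> becomes an equal mix of 0 and 1.
plus = hadamard((1.0, 0.0))
single_probs = probabilities(plus)

# Entanglement: Hadamard then CNOT turns |00> into a Bell state.
bell = cnot(hadamard_on_first((1.0, 0.0, 0.0, 0.0)))
bell_probs = probabilities(bell)

print("single qubit:", [round(p, 3) for p in single_probs])
print("bell state:  ", [round(p, 3) for p in bell_probs])
```

In the Bell state, only the outcomes 00 and 11 have non-zero probability, so measuring one qubit fixes the result for the other. Note that the simulation stores four amplitudes for two qubits; in general the count doubles with every added qubit, which is why classical machines struggle to emulate chips of Sycamore’s size.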

So, how do such hardware advances benefit artificial intelligence? Firstly, they would address the hardware bottleneck by drastically reducing the training and inference times required to run neural networks in fields such as computer vision, natural language processing and more. This facilitates novel research into complex and resource-efficient models that scale effectively to large datasets. Hybrid “quantum-classical” models, popularized by Beer et al. (2020), offer such advantages seamlessly through TensorFlow Quantum (a novel open-source development framework). Secondly, they spearhead research into optimization problems whose sheer number of possibilities currently prevents empirical testing: what is the best order in which to assemble a Boeing airplane, consisting of millions of parts, to minimize cost and time? What is the best scheduling algorithm for traffic signals in urban neighbourhoods to reduce car wait times? How can logistics, such as multiple dependent delivery routes, be planned for the lowest time-to-consumer?
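A small classical experiment shows why such problems resist exhaustive testing. The sketch below brute-forces the shortest delivery route over a handful of stops and then counts how many orderings larger instances would require. The coordinates, the depot, and the function names are all hypothetical, and exhaustive search is used purely for illustration; it is not an algorithm proposed in this article.

```python
import math
from itertools import permutations


def route_length(depot, stops):
    """Total straight-line distance travelling from the depot through each stop in order."""
    points = (depot, *stops)
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def best_route(depot, stops):
    """Try every possible ordering of the stops and keep the shortest."""
    return min(permutations(stops), key=lambda order: route_length(depot, order))


# Hypothetical depot and delivery stops on a grid.
depot = (0, 0)
stops = [(2, 0), (1, 0), (3, 0)]

best_order = best_route(depot, stops)
best_len = route_length(depot, best_order)
print("best order:", best_order, "length:", best_len)

# The number of orderings grows factorially with the number of stops.
for n in (5, 10, 15, 20):
    print(f"{n} stops -> {math.factorial(n):,} possible routes")
```

Three stops mean only six orderings to check, but twenty stops already mean more than two quintillion. This factorial growth is what places large scheduling and assembly problems beyond exhaustive search on conventional hardware.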