diff --git a/contents/efficient_ai/efficient_ai.qmd b/contents/efficient_ai/efficient_ai.qmd index 2c92304b..49051dfd 100644 --- a/contents/efficient_ai/efficient_ai.qmd +++ b/contents/efficient_ai/efficient_ai.qmd @@ -10,53 +10,51 @@ Resources: [Slides](#sec-efficient-ai-resource), [Videos](#sec-efficient-ai-reso ![_DALL·E 3 Prompt: A conceptual illustration depicting efficiency in artificial intelligence using a shipyard analogy. The scene shows a bustling shipyard where containers represent bits or bytes of data. These containers are being moved around efficiently by cranes and vehicles, symbolizing the streamlined and rapid information processing in AI systems. The shipyard is meticulously organized, illustrating the concept of optimal performance within the constraints of limited resources. In the background, ships are docked, representing different platforms and scenarios where AI is applied. The atmosphere should convey advanced technology with an underlying theme of sustainability and wide applicability._](images/png/cover_efficient_ai.png) -Efficiency in artificial intelligence (AI) is not simply a luxury but a necessity. In this chapter, we dive into the key concepts underpinning AI systems' efficiency. The computational demands on neural networks can be daunting, even for minimal systems. For AI to be seamlessly integrated into everyday devices and essential systems, it must perform optimally within the constraints of limited resources while maintaining its efficacy. The pursuit of efficiency guarantees that AI models are streamlined, rapid, and sustainable, thereby widening their applicability across various platforms and scenarios. - +Efficiency in artificial intelligence (AI) is not merely desirable but a fundamental requirement. This chapter delves into the key concepts that underpin the efficient operation of AI systems. The computational demands of neural networks can be substantial, even for relatively simple systems. To seamlessly integrate AI into everyday devices and critical systems, it must perform optimally within the constraints of limited resources while maintaining effectiveness. The pursuit of efficiency ensures that AI models are streamlined, rapid, and sustainable, expanding their applicability across diverse platforms and scenarios. ::: {.callout-tip} ## Learning Objectives -- Recognize the need for efficient AI in TinyML/edge devices. +- Recognize the necessity for efficient AI in TinyML/edge devices. -- Understand the need for efficient model architectures like MobileNets and SqueezeNet. +- Grasp the importance of efficient model architectures like MobileNets and SqueezeNet. -- Understand why techniques for model compression are important. +- Understand the significance of techniques for model compression. -- Get an inclination for why efficient AI hardware is important. +- Gain an appreciation for the value of efficient AI hardware. -- Appreciate the significance of numerics and their representations. +- Recognize the importance of numerical representations and their precision. -- We appreciate that we need to understand the nuances of model comparison beyond accuracy. +- Understand the nuances of model comparison beyond just accuracy. - Recognize efficiency encompasses technology, costs, environment, and ethics. ::: -The focus is on gaining a conceptual understanding of the motivations and significance of the various strategies for achieving efficient AI, both in terms of techniques and a holistic perspective. Subsequent chapters will dive into the nitty-gritty details of these various concepts. +The focus is on acquiring a conceptual understanding of the motivations and significance of the various strategies employed to achieve efficient AI, both in terms of techniques and a holistic perspective. Subsequent chapters provide a more in-depth exploration of these multiple concepts. ## Introduction -Training models can consume significant energy, sometimes equivalent to the carbon footprint of sizable industrial processes. We will cover some of these sustainability details in the [AI Sustainability](../sustainable_ai/sustainable_ai.qmd) chapter. On the deployment side, if these models are not optimized for efficiency, they can quickly drain device batteries, demand excessive memory, or fall short of real-time processing needs. Through this introduction, we aim to elucidate the nuances of efficiency, setting the groundwork for a comprehensive exploration in the subsequent chapters. +Training models can consume significant energy, sometimes comparable to the carbon footprint of sizable industrial processes. The [AI Sustainability](../sustainable_ai/sustainable_ai.qmd) chapter will explore these sustainability details. On the deployment side, if these models are not optimized for efficiency, they can rapidly drain device batteries, demand excessive memory, or fall short of real-time processing requirements. Through this introduction, we aim to elucidate the nuances of efficiency, laying the groundwork for a comprehensive exploration in the subsequent chapters. ## The Need for Efficient AI -Efficiency takes on different connotations depending on where AI computations occur. Let's revisit and differentiate between Cloud, Edge, and TinyML in terms of efficiency. @fig-platforms provides a big picture comparison of the three different platforms. +Efficiency takes on different connotations depending on where AI computations occur. Regarding efficiency, let's revisit and differentiate between Cloud, Edge, and TinyML. @fig-platforms provides a big-picture comparison of the three different platforms. ![Cloud, Mobile and TinyML. Source: @schizas2022tinyml.](https://www.mdpi.com/futureinternet/futureinternet-14-00363/article_deploy/html/images/futureinternet-14-00363-g001-550.jpg){#fig-platforms} -For cloud AI, traditional AI models often run in large-scale data centers equipped with powerful GPUs and TPUs [@barroso2019datacenter]. Here, efficiency pertains to optimizing computational resources, reducing costs, and ensuring timely data processing and return. However, relying on the cloud introduced latency, especially when dealing with large data streams that must be uploaded, processed, and downloaded. +For cloud AI, traditional AI models often run in large-scale data centers equipped with powerful GPUs and TPUs [@barroso2019datacenter]. Here, efficiency pertains to optimizing computational resources, reducing costs, and ensuring timely data processing and results. However, relying on the cloud introduces latency, especially when dealing with large data streams that require uploading, processing, and downloading. -For edge AI, edge computing brings AI closer to the data source, processing information directly on local devices like smartphones, cameras, or industrial machines [@li2019edge]. Here, efficiency encompasses quick real-time responses and reduced data transmission needs. The constraints, however, are tighter—these devices, while more powerful than microcontrollers, have limited computational power compared to cloud setups. +For edge AI, edge computing brings AI closer to the data source, processing information directly on local devices like smartphones, cameras, or industrial machines [@li2019edge]. Here, efficiency encompasses quick real-time responses and reduced data transmission needs. However, the constraints are tighter—these devices, while more powerful than microcontrollers, have limited computational power compared to cloud setups. -Pushing the frontier even further is TinyML, where AI models run on microcontrollers or extremely resource-constrained environments. The difference in processor and memory performance between TinyML and cloud or mobile systems can be several orders of magnitude [@warden2019tinyml]. Efficiency in TinyML is about ensuring models are lightweight enough to fit on these devices, use minimal energy (critical for battery-powered devices), and still perform their tasks effectively. +TinyML pushes the frontier further, where AI models run on microcontrollers or extremely resource-constrained environments. The processor and memory performance difference between TinyML and cloud or mobile systems can be several orders of magnitude [@warden2019tinyml]. Efficiency in TinyML is about ensuring models are lightweight enough to fit on these devices, consume minimal energy (critical for battery-powered devices), and still perform their tasks effectively. -The spectrum from Cloud to TinyML represents a shift from vast, centralized computational resources to distributed, localized, and constrained environments. As we transition from one to the other, the challenges and strategies related to efficiency evolve, underlining the need for specialized approaches tailored to each scenario. Having underscored the need for efficient AI, especially within the context of TinyML, we will transition to exploring the methodologies devised to meet these challenges. The following sections outline the main concepts we will delve deeper into later. We will demonstrate the breadth and depth of innovation needed to achieve efficient AI as we delve into these strategies. +The spectrum from Cloud to TinyML represents a shift from vast, centralized computational resources to distributed, localized, and constrained environments. As we transition from one to the other, the challenges and strategies associated with efficiency evolve, underlining the need for specialized approaches tailored to each scenario. Having established the necessity of efficient AI, particularly within TinyML, we will explore the methodologies developed to address these challenges. The following sections outline the main concepts we will delve deeper into later. We will demonstrate the breadth and depth of innovation required to achieve efficient AI as we explore these strategies. ## Efficient Model Architectures -Choosing the right model architecture is as crucial as optimizing it. In recent years, researchers have explored some novel architectures that can have inherently fewer parameters while maintaining strong performance. - -**MobileNets:** MobileNets are efficient mobile and embedded vision application models [@howard2017mobilenets]. The key idea that led to their success is the use of depth-wise separable convolutions, which significantly reduce the number of parameters and computations in the network. MobileNetV2 and V3 further enhance this design by introducing inverted residuals and linear bottlenecks. +Selecting an optimal model architecture is as crucial as optimizing it. In recent years, researchers have made significant strides in exploring innovative architectures that can inherently have fewer parameters while maintaining strong performance. +**MobileNets:** MobileNets are efficient mobile and embedded vision application models [@howard2017mobilenets]. The key idea that led to their success is depth-wise separable convolutions, significantly reducing the number of parameters and computations in the network. MobileNetV2 and V3 further enhance this design by introducing inverted residuals and linear bottlenecks. **SqueezeNet:** SqueezeNet is a class of ML models known for its smaller size without sacrificing accuracy. It achieves this by using a "fire module" that reduces the number of input channels to 3x3 filters, thus reducing the parameters [@iandola2016squeezenet]. Moreover, it employs delayed downsampling to increase the accuracy by maintaining a larger feature map. @@ -64,13 +62,13 @@ Choosing the right model architecture is as crucial as optimizing it. In recent ## Efficient Model Compression -Model compression methods are very important for bringing deep learning models to devices with limited resources. These techniques reduce models' size, energy consumption, and computational demands without significantly losing accuracy. At a high level, the methods can briefly be binned into the following fundamental methods: +Model compression methods are essential for bringing deep learning models to devices with limited resources. These techniques reduce models' size, energy consumption, and computational demands without significantly losing accuracy. At a high level, the methods can be categorized into the following fundamental methods: -**Pruning:** This is akin to trimming the branches of a tree. This was first thought of in the [Optimal Brain Damage](https://proceedings.neurips.cc/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf) paper [@lecun1989optimal]. This was later popularized in the context of deep learning by @han2016deep. Certain weights or even entire neurons are removed from the network in pruning based on specific criteria. This can significantly reduce the model size. Various strategies include weight pruning, neuron pruning, and structured pruning. We will explore these in more detail in @sec-pruning. @fig-pruning is an examples of neural network pruning: removing some of the nodes in the inner layers (based on a specific criteria) reduces the numbers of edges between the nodes and, in turn, the size of the model. +**Pruning:** This is akin to trimming the branches of a tree. This was first thought of in the [Optimal Brain Damage](https://proceedings.neurips.cc/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf) paper [@lecun1989optimal]. This was later popularized in the context of deep learning by @han2016deep. Certain weights or entire neurons are removed from the network in pruning based on specific criteria. This can significantly reduce the model size. Various strategies include weight pruning, neuron pruning, and structured pruning. We will explore these in more detail in @sec-pruning. @fig-pruning is an example of neural network pruning: removing some of the nodes in the inner layers (based on specific criteria) reduces the number of edges between the nodes and, in turn, the model's size. ![Neural Network Pruning.](images/jpg/pruning.jpeg){#fig-pruning} -**Quantization:** quantization is the process of constraining an input from a large set to output in a smaller set, primarily in deep learning; this means reducing the number of bits that represent the weights and biases of the model. For example, using 16-bit or 8-bit representations instead of 32-bit can reduce the model size and speed up computations, with a minor trade-off in accuracy. We will explore these in more detail in @sec-quant. @fig-quantization shows an example of quantization by rounding to the closest number. The conversion from 32-bit floating point to 16-bit reduces the memory usage by 50%. And going from 32-bit to 8-bit Integer, memory is reduced by 75%. While the loss in numeric precision, and consequently model performance, is minor, the memory usage efficiency is very significant. +**Quantization:** Quantization is the process of constraining an input from a large set to output in a smaller set, primarily in deep learning; this means reducing the number of bits that represent the weights and biases of the model. For example, using 16-bit or 8-bit representations instead of 32-bit can reduce the model size and speed up computations, with a minor trade-off in accuracy. We will explore these in more detail in @sec-quant. @fig-quantization shows an example of quantization by rounding to the closest number. The conversion from 32-bit floating point to 16-bit reduces memory usage by 50%. Going from a 32-bit to an 8-bit integer reduces memory usage by 75%. While the loss in numeric precision, and consequently model performance, is minor, the memory usage efficiency is significant. ![Different forms of quantization.](images/jpg/quantization.jpeg){#fig-quantization} @@ -78,9 +76,9 @@ Model compression methods are very important for bringing deep learning models t ## Efficient Inference Hardware -[Training](../training/training.qmd): An AI model is an intensive task that requires powerful hardware and can take hours to weeks, but inference needs to be as fast as possible, especially in real-time applications. This is where efficient inference hardware comes into play. We can achieve rapid response times and power-efficient operation by optimizing the hardware specifically for inference tasks, which is especially crucial for edge devices and embedded systems. +[Training](../training/training.qmd): An AI model is an intensive task that requires powerful hardware and can take hours to weeks, but inference needs to be as fast as possible, especially in real-time applications. This is where efficient inference hardware comes into play. By optimizing the hardware specifically for inference tasks, we can achieve rapid response times and power-efficient operation, which is especially crucial for edge devices and embedded systems. -**TPUs (Tensor Processing Units):** [TPUs](https://cloud.google.com/tpu) are custom-built ASICs (Application-Specific Integrated Circuits) by Google to accelerate machine learning workloads [@jouppi2017datacenter]. They are optimized for tensor operations, offering high throughput for low-precision arithmetic, and are designed specifically for neural network machine learning. TPUs significantly accelerate model training and inference compared to general-purpose GPU/CPUs. This boost means faster model training and real-time or near-real-time inference capabilities, which are crucial for applications like voice search and augmented reality. +**TPUs (Tensor Processing Units):** [TPUs](https://cloud.google.com/tpu) are custom-built ASICs (Application-Specific Integrated Circuits) by Google to accelerate machine learning workloads [@jouppi2017datacenter]. They are optimized for tensor operations, offering high throughput for low-precision arithmetic, and are designed specifically for neural network machine learning. TPUs significantly accelerate model training and inference compared to general-purpose GPU/CPUs. This boost means faster model training and real-time or near-real-time inference capabilities, crucial for applications like voice search and augmented reality. [Edge TPUs](https://cloud.google.com/edge-tpu) are a smaller, power-efficient version of Google's TPUs tailored for edge devices. They provide fast on-device ML inferencing for TensorFlow Lite models. Edge TPUs allow for low-latency, high-efficiency inference on edge devices like smartphones, IoT devices, and embedded systems. AI capabilities can be deployed in real-time applications without communicating with a central server, thus saving bandwidth and reducing latency. Consider the table in @fig-edge-tpu-perf. It shows the performance differences between running different models on CPUs versus a Coral USB accelerator. The Coral USB accelerator is an accessory by Google's Coral AI platform that lets developers connect Edge TPUs to Linux computers. Running inference on the Edge TPUs was 70 to 100 times faster than on CPUs. @@ -88,9 +86,9 @@ Model compression methods are very important for bringing deep learning models t **NN Accelerators:** Fixed-function neural network accelerators are hardware accelerators designed explicitly for neural network computations. They can be standalone chips or part of a larger system-on-chip (SoC) solution. By optimizing the hardware for the specific operations that neural networks require, such as matrix multiplications and convolutions, NN accelerators can achieve faster inference times and lower power consumption than general-purpose CPUs and GPUs. They are especially beneficial in TinyML devices with power or thermal constraints, such as smartwatches, micro-drones, or robotics. -But these are all but the most common examples. A number of other types of hardware are emerging that have the potential to offer significant advantages for inference. These include, but are not limited to, neuromorphic hardware, photonic computing, etc. In [@sec-aihw], we will explore these in greater detail. +But these are all but the most common examples. Several other types of hardware are emerging that have the potential to offer significant advantages for inference. These include, but are not limited to, neuromorphic hardware, photonic computing, etc. In [@sec-aihw], we will explore these in greater detail. -Efficient hardware for inference speeds up the process, saves energy, extends battery life, and can operate in real-time conditions. As AI continues to be integrated into myriad applications- from smart cameras to voice assistants- the role of optimized hardware will only become more prominent. By leveraging these specialized hardware components, developers and engineers can bring the power of AI to devices and situations that were previously unthinkable. +Efficient hardware for inference speeds up the process saves energy, extends battery life, and can operate in real-time conditions. As AI continues to be integrated into myriad applications- from smart cameras to voice assistants- the role of optimized hardware will only become more prominent. By leveraging these specialized hardware components, developers and engineers can bring the power of AI to devices and situations that were previously unthinkable. ## Efficient Numerics @@ -100,11 +98,10 @@ Machine learning, and especially deep learning, involves enormous amounts of com There are many different types of numerics. Numerics have a long history in computing systems. -**Floating point:** Known as single-precision floating-point, FP32 utilizes 32 bits to represent a number, incorporating its sign, exponent, and fraction. FP32 is widely adopted in many deep learning frameworks and balances accuracy and computational requirements. It's prevalent in the training phase for many neural networks due to its sufficient precision in capturing minute details during weight updates. - +**Floating point:** Known as a single-precision floating point, FP32 utilizes 32 bits to represent a number, incorporating its sign, exponent, and fraction. FP32 is widely adopted in many deep learning frameworks and balances accuracy and computational requirements. It is prevalent in the training phase for many neural networks due to its sufficient precision in capturing minute details during weight updates. Also known as half-precision floating point, FP16 uses 16 bits to represent a number, including its sign, exponent, and fraction. It offers a good balance between precision and memory savings. FP16 is particularly popular in deep learning training on GPUs that support mixed-precision arithmetic, combining the speed benefits of FP16 with the precision of FP32 where needed. -Several other numerical formats fall into an exotic class. An exotic example is BF16 or Brain Floating Point. It is a 16-bit numerical format designed explicitly for deep learning applications. It's a compromise between FP32 and FP16, retaining the 8-bit exponent from FP32 while reducing the mantissa to 7 bits (as compared to FP32's 23-bit mantissa). This structure prioritizes range over precision. BF16 has achieved training results comparable in accuracy to FP32 while using significantly less memory and computational resources [@kalamkar2019study]. This makes it suitable not just for inference but also for training deep neural networks. +Several other numerical formats fall into an exotic class. An exotic example is BF16 or Brain Floating Point. It is a 16-bit numerical format designed explicitly for deep learning applications. It is a compromise between FP32 and FP16, retaining the 8-bit exponent from FP32 while reducing the mantissa to 7 bits (as compared to FP32's 23-bit mantissa). This structure prioritizes range over precision. BF16 has achieved training results comparable in accuracy to FP32 while using significantly less memory and computational resources [@kalamkar2019study]. This makes it suitable not just for inference but also for training deep neural networks. By retaining the 8-bit exponent of FP32, BF16 offers a similar range, which is crucial for deep learning tasks where certain operations can result in very large or very small numbers. At the same time, by truncating precision, BF16 allows for reduced memory and computational requirements compared to FP32. BF16 has emerged as a promising middle ground in the landscape of numerical formats for deep learning, providing an efficient and effective alternative to the more traditional FP32 and FP16 formats. @@ -183,25 +180,23 @@ In essence, these efficiency metrics are more than numbers dictating where and h ### Efficiency Comparisons -The ecosystem contains an abundance of models, each boasting its unique strengths and idiosyncrasies. However, pure model accuracy figures or training and inference speeds paint a partial picture. When we dive deeper into comparative analyses, several critical nuances emerge. - +The landscape of machine learning models is vast, with each model offering a unique set of strengths and implementation considerations. While raw accuracy figures or training and inference speeds might be tempting benchmarks, they provide an incomplete picture. A deeper comparative analysis reveals several critical factors influencing a model's suitability for TinyML applications. Often, we encounter the delicate balance between accuracy and efficiency. For instance, while a dense, deep learning model and a lightweight MobileNet variant might excel in image classification, their computational demands could be at two extremes. This differentiation is especially pronounced when comparing deployments on resource-abundant cloud servers versus constrained TinyML devices. In many real-world scenarios, the marginal gains in accuracy could be overshadowed by the inefficiencies of a resource-intensive model. -Moreover, the optimal model choice is only sometimes universal but often depends on the specifics of an application. Consider object detection: a model that excels in general scenarios that might falter in niche environments, such as when detecting manufacturing defects on a factory floor. This adaptability- or the lack of it- can dictate a model's real-world utility. - -Another important consideration is the relationship between model complexity and its practical benefits. Take voice-activated assistants, such as "Alexa" or "OK Google." While a complex model might demonstrate a marginally superior understanding of user speech if it's slower to respond than a simpler counterpart, the user experience could be compromised. Thus, adding layers or parameters only sometimes equates to better real-world outcomes. +Moreover, the optimal model choice is only sometimes universal but often depends on the specifics of an application. Consider object detection: a model that excels in general scenarios might falter in niche environments, such as when detecting manufacturing defects on a factory floor. This adaptability- or the lack of it- can dictate a model's real-world utility. +Another important consideration is the relationship between model complexity and its practical benefits. Take voice-activated assistants like "Alexa" or "OK Google." While a complex model might demonstrate a marginally superior understanding of user speech if it's slower to respond than a simpler counterpart, the user experience could be compromised. Thus, adding layers or parameters only sometimes equates to better real-world outcomes. Furthermore, while benchmark datasets, such as ImageNet [@russakovsky2015imagenet], COCO [@lin2014microsoft], Visual Wake Words [@chowdhery2019visual], Google Speech Commands [@warden2018speech], etc. provide a standardized performance metric, they might not capture the diversity and unpredictability of real-world data. Two facial recognition models with similar benchmark scores might exhibit varied competencies when faced with diverse ethnic backgrounds or challenging lighting conditions. Such disparities underscore the importance of robustness and consistency across varied data. For example, @fig-stoves from the Dollar Street dataset shows stove images across extreme monthly incomes. Stoves have different shapes and technological levels across different regions and income levels. A model that is not trained on diverse datasets might perform well on a benchmark but fail in real-world applications. So, if a model was trained on pictures of stoves found in wealthy countries only, it would fail to recognize stoves from poorer regions. -![Different types of stoves. Source: Dollar Street stove images.](https://pbs.twimg.com/media/DmUyPSSW0AAChGa.jpg){#fig-stoves} +![Different types of stoves. Credit: Dollar Street stove images.](images/jpg/dollar_street.jpg){#fig-stoves} In essence, a thorough comparative analysis transcends numerical metrics. It's a holistic assessment intertwined with real-world applications, costs, and the intricate subtleties that each model brings to the table. This is why having standard benchmarks and metrics widely established and adopted by the community becomes important. ## Conclusion -Efficient AI is extremely important as we push towards broader and more diverse real-world deployment of machine learning. This chapter provided an overview, exploring the various methodologies and considerations behind achieving efficient AI, starting with the fundamental need, similarities, and differences across cloud, Edge, and TinyML systems. +Efficient AI is crucial as we push towards broader and more diverse real-world deployment of machine learning. This chapter provided an overview, exploring the various methodologies and considerations behind achieving efficient AI, starting with the fundamental need, similarities, and differences across cloud, Edge, and TinyML systems. -We saw that efficient model architectures can be useful for optimizations. Model compression techniques such as pruning, quantization, and knowledge distillation exist to help reduce computational demands and memory footprint without significantly impacting accuracy. Specialized hardware like TPUs and NN accelerators offer optimized silicon for neural network operations and data flow. Efficient numerics balance precision and efficiency, enabling models to attain robust performance using minimal resources. In the subsequent chapters, we will explore these different topics in depth and in detail. +We examined efficient model architectures and their usefulness for optimization. Model compression techniques such as pruning, quantization, and knowledge distillation exist to help reduce computational demands and memory footprint without significantly impacting accuracy. Specialized hardware like TPUs and NN accelerators offer optimized silicon for neural network operations and data flow. Efficient numerics balance precision and efficiency, enabling models to attain robust performance using minimal resources. We will explore these topics in depth and detail in the subsequent chapters. Together, these form a holistic framework for efficient AI. But the journey doesn't end here. Achieving optimally efficient intelligence requires continued research and innovation. As models become more sophisticated, datasets grow, and applications diversify into specialized domains, efficiency must evolve in lockstep. Measuring real-world impact requires nuanced benchmarks and standardized metrics beyond simplistic accuracy figures. @@ -248,3 +243,4 @@ In addition to exercises, we offer a series of hands-on labs allowing students t - _Coming soon._ ::: + diff --git a/contents/efficient_ai/images/jpg/dollar_street.jpg b/contents/efficient_ai/images/jpg/dollar_street.jpg new file mode 100644 index 00000000..e4854371 Binary files /dev/null and b/contents/efficient_ai/images/jpg/dollar_street.jpg differ