AI compilers have emerged to play a vital role in accelerating training and serving, especially now that we have nearly exhausted other approaches, such as optimizing model architecture and hardware.
The good news is that ML compilers can significantly improve the efficiency of large-scale model serving. Many of them have appeared: Apache TVM, NVIDIA TensorRT, ONNX Runtime, LLVM, Google MLIR, TensorFlow XLA, Meta Glow, PyTorch nvFuser, and Intel’s PlaidML and OpenVINO.
Let’s take a closer look at each of them.
Apache TVM: Apache TVM is an open-source ML compiler framework for CPUs, GPUs, and other ML hardware accelerators. It aims to enable ML engineers to optimize and run computations efficiently on any hardware backend. TVM provides two main features: (1) compilation of deep learning models into minimal deployable modules, and (2) infrastructure to automatically generate and optimize models on more backends with better performance.
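To give a feel for the workflow, here is a minimal sketch of compiling an ONNX model with TVM’s Relay frontend. The file name, input name, and shape are placeholders, and the exact API can differ between TVM releases.

```python
# Sketch: compile an ONNX model with TVM's Relay frontend (placeholder paths/shapes).
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")
shape_dict = {"input": (1, 3, 224, 224)}  # assumed input name and shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

target = "llvm"  # generic CPU backend; could be "cuda", "metal", etc.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# The resulting module can be exported and executed with the lightweight TVM runtime.
```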
NVIDIA TensorRT: This is a high-performance deep learning inference optimizer and runtime library for NVIDIA GPUs. It can be used to optimize and deploy models developed in TensorFlow, PyTorch, or ONNX format. TensorRT can significantly improve the inference speed of LLMs by optimizing the computation graph, using reduced-precision arithmetic, and applying other techniques.
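As an illustration, the sketch below builds a TensorRT engine from an ONNX model with the Python API and enables FP16 reduced precision. The model path is a placeholder, error handling is omitted, and flag names vary somewhat across TensorRT versions.

```python
# Sketch: build a TensorRT engine from an ONNX model (TensorRT 8-style API).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # placeholder model file
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable reduced-precision arithmetic

# Serialize the optimized engine; the TensorRT runtime deserializes it for inference.
engine_bytes = builder.build_serialized_network(network, config)
```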
ONNX Runtime: ONNX (Open Neural Network Exchange) is an open-source format for representing deep learning models. It was created by Microsoft, Facebook, and other collaborators to provide a standard format that allows interoperability among different deep-learning frameworks. ONNX Runtime is a performance-focused engine for running ONNX models. It supports a wide range of hardware platforms, including CPUs, GPUs, and edge devices. ONNX Runtime is designed to optimize the execution of machine learning models, providing better performance compared to running models directly in their native framework.
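A minimal sketch of running an ONNX model with ONNX Runtime looks like this; the model path, input name, and shape are placeholders.

```python
# Sketch: run an ONNX model with ONNX Runtime on CPU (placeholder model/input).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed shape
outputs = session.run(None, {"input": dummy_input})               # assumed input name
```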
LLVM: LLVM began as a research project at the University of Illinois Urbana-Champaign (UIUC) to provide a modern, SSA-based compilation strategy supporting the static and dynamic compilation of arbitrary programming languages. Since then, LLVM has grown into an umbrella project comprising several subprojects.
Google MLIR: MLIR (Multi-Level Intermediate Representation) is a representation format and library of compiler utilities that sits between the model representation and the low-level compilers/executors that generate hardware-specific code. It is a flexible infrastructure for modern optimizing compilers: it consists of a specification for intermediate representations (IR) and a coding toolkit to perform transformations on that representation. In compiler parlance, these transformations from higher-level representations to lower-level ones are called lowerings.
TensorFlow XLA: XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. XLA takes graphs (“computations”) defined in HLO (High-Level Operations) and compiles them into machine instructions for various architectures. XLA is modular, so an alternative backend can easily be slotted in to target a novel hardware architecture.
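In practice, XLA can be enabled per function via jit_compile; the sketch below uses a toy computation purely for illustration.

```python
# Sketch: JIT-compile a TensorFlow function with XLA via jit_compile=True.
import tensorflow as tf

@tf.function(jit_compile=True)
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal((8, 128))
w = tf.random.normal((128, 64))
b = tf.zeros((64,))
y = dense_relu(x, w, b)  # the first call triggers XLA compilation of the graph
```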
Meta Glow: Glow accepts a computation graph from deep learning frameworks like PyTorch and generates highly optimized code for machine learning accelerators. It contains many machine learning and hardware optimizations like kernel fusion to accelerate model development.
PyTorch nvFuser: nvFuser is a DL compiler that just-in-time compiles fast, flexible GPU-specific code to reliably and automatically accelerate users’ networks. It provides speedups for DL networks running on Volta and later CUDA accelerators by generating fast custom “fusion” kernels at runtime. It is specifically designed to meet the unique requirements of the PyTorch community and supports diverse network architectures and programs with dynamic inputs of varying shapes and strides.
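As a rough sketch, nvFuser can be selected as the TorchScript fusion backend; this requires a CUDA device (Volta or later), and the exact mechanism has changed across PyTorch versions.

```python
# Sketch: let nvFuser fuse GPU kernels for a TorchScript function (PyTorch 1.x-style).
import torch

@torch.jit.script
def gelu_bias(x, bias):
    return torch.nn.functional.gelu(x + bias)

x = torch.randn(1024, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")

# "fuser2" selects nvFuser as the TorchScript fuser in the PyTorch versions that ship it.
with torch.jit.fuser("fuser2"):
    for _ in range(3):          # warm-up iterations trigger JIT fusion
        y = gelu_bias(x, bias)
```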
Intel PlaidML: PlaidML is an open-source tensor compiler. Combined with Intel’s nGraph graph compiler, it enables performance portability for popular DL frameworks across a variety of CPU, GPU, and other accelerator architectures.
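For instance, PlaidML can be installed as the compute backend for the older standalone Keras via the plaidml-keras package; the sketch below follows that package’s documented usage, which may differ across releases.

```python
# Sketch: route standalone Keras through the PlaidML compiler backend.
import plaidml.keras
plaidml.keras.install_backend()  # must run before importing keras

import keras
from keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(128,)),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# Subsequent training and inference calls are compiled and executed by PlaidML.
```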
OpenVINO: OpenVINO (Open Visual Inference and Neural Network Optimization) is an open-source toolkit also developed by Intel. It mainly enables fast, high-performance deep learning inference on Intel hardware, such as CPUs, integrated GPUs, FPGAs, and VPUs (Vision Processing Units). It provides a set of tools and libraries designed to optimize and accelerate deep learning models for computer vision and other AI applications. OpenVINO supports various deep learning frameworks, including TensorFlow, Caffe, ONNX, Kaldi, and others. In summary, OpenVINO is tailored for optimizing and accelerating deep learning inference on Intel hardware, while PlaidML is a more generic, hardware-agnostic deep learning compiler that allows for broader device compatibility.
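To make OpenVINO’s usage concrete, here is a minimal sketch of its Python runtime API; the IR file "model.xml", the input shape, and the target device are placeholders, and the API may vary across OpenVINO releases.

```python
# Sketch: run inference with OpenVINO's Python runtime (2022+-style API).
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")            # ONNX files can also be read directly
compiled_model = core.compile_model(model, "CPU")

dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
result = compiled_model([dummy_input])[compiled_model.output(0)]
```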
This list may omit other interesting AI compilers, but it illustrates how popular and significant they have become.