AI compilers have emerged to play a vital role in accelerating training and serving, especially now that we have nearly exhausted other approaches, such as optimizing model architecture and hardware.
The good news is that ML compilers can significantly improve the efficiency of large-scale model serving. Many of them have appeared: Apache TVM, NVIDIA TensorRT, ONNX Runtime, LLVM, Google MLIR, TensorFlow XLA, Meta Glow, PyTorch nvFuser, and Intel’s PlaidML and OpenVINO.
Let’s take a closer look at each of them.
Apache TVM: Apache TVM is an open-source ML compiler framework for CPUs, GPUs, and other ML hardware accelerators. It aims to enable ML engineers to optimize and run computations efficiently on any hardware backend. TVM provides two main features: (1) compilation of deep learning models into minimal deployable modules, and (2) infrastructure to automatically generate and optimize models on more backends with better performance.
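To give a feel for the workflow, here is a minimal sketch of compiling an ONNX model with TVM’s Relay frontend. The file name, input name, and shape are placeholders, and the exact API can differ between TVM releases.

```python
# Sketch: compile an ONNX model with TVM's Relay frontend (placeholder paths/shapes).
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")
shape_dict = {"input": (1, 3, 224, 224)}  # assumed input name and shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

target = "llvm"  # generic CPU backend; could be "cuda", "metal", etc.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# The resulting module can be exported and executed with the lightweight TVM runtime.
```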
NVIDIA TensorRT: This is a high-performance deep learning inference optimizer and runtime library for NVIDIA GPUs. It can be used to optimize and deploy models developed in TensorFlow, PyTorch, or ONNX format. TensorRT can significantly improve the inference speed of LLMs by optimizing the computation graph, using reduced-precision arithmetic, and applying other techniques.
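As an illustration, the sketch below builds a TensorRT engine from an ONNX model with the Python API and enables FP16 reduced precision. The model path is a placeholder, error handling is omitted, and flag names vary somewhat across TensorRT versions.

```python
# Sketch: build a TensorRT engine from an ONNX model (TensorRT 8-style API).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # placeholder model file
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable reduced-precision arithmetic

# Serialize the optimized engine; the TensorRT runtime deserializes it for inference.
engine_bytes = builder.build_serialized_network(network, config)
```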
ONNX Runtime: ONNX (Open Neural Network Exchange) is an open-source format for representing deep learning models. It was created by Microsoft, Facebook, and other collaborators to provide a standard format that allows interoperability among different deep-learning frameworks. ONNX Runtime is a performance-focused engine for running ONNX models. It supports a wide range of hardware platforms, including CPUs, GPUs, and edge devices. ONNX Runtime is designed to optimize the execution of machine learning models, providing better performance compared to running models directly in their native framework.
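A minimal sketch of running an ONNX model with ONNX Runtime looks like this; the model path, input name, and shape are placeholders.

```python
# Sketch: run an ONNX model with ONNX Runtime on CPU (placeholder model/input).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed shape
outputs = session.run(None, {"input": dummy_input})               # assumed input name
```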
LLVM: LLVM began as a research project at the University of Illinois Urbana-Champaign (UIUC) to provide a modern, SSA-based compilation strategy supporting the static and dynamic compilation of arbitrary programming languages. Since then, LLVM has grown into an umbrella project comprising several subprojects.
Google MLIR: MLIR (Multi-Level Intermediate Representation) is a representation format and library of compiler utilities that sits between the model representation and the low-level compilers/executors that generate hardware-specific code. It is a flexible infrastructure for modern optimizing compilers: it consists of a specification for intermediate representations (IR) and a coding toolkit to perform transformations on that representation. In compiler parlance, these transformations from higher-level representations to lower-level ones are called lowerings.
TensorFlow XLA: XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. XLA takes graphs (“computations”) defined in HLO (High-Level Operations) and compiles them into machine instructions for various architectures. XLA is modular, so an alternative backend can easily be slotted in to target a novel hardware architecture.
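In practice, XLA can be enabled per function via jit_compile; the sketch below uses a toy computation purely for illustration.

```python
# Sketch: JIT-compile a TensorFlow function with XLA via jit_compile=True.
import tensorflow as tf

@tf.function(jit_compile=True)
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal((8, 128))
w = tf.random.normal((128, 64))
b = tf.zeros((64,))
y = dense_relu(x, w, b)  # the first call triggers XLA compilation of the graph
```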
Meta Glow: Glow accepts a computation graph from deep learning frameworks like PyTorch and generates highly optimized code for machine learning accelerators. It contains many machine learning and hardware optimizations like kernel fusion to accelerate model development.
PyTorch nvFuser: nvFuser is a DL compiler that just-in-time compiles fast, flexible GPU-specific code to reliably and automatically accelerate users’ networks. It provides speedups for DL networks running on Volta and later CUDA accelerators by generating fast custom “fusion” kernels at runtime. It is specifically designed to meet the unique requirements of the PyTorch community and supports diverse network architectures and programs with dynamic inputs of varying shapes and strides.
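As a rough sketch, nvFuser can be selected as the TorchScript fusion backend; this requires a CUDA device (Volta or later), and the exact mechanism has changed across PyTorch versions.

```python
# Sketch: let nvFuser fuse GPU kernels for a TorchScript function (PyTorch 1.x-style).
import torch

@torch.jit.script
def gelu_bias(x, bias):
    return torch.nn.functional.gelu(x + bias)

x = torch.randn(1024, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")

# "fuser2" selects nvFuser as the TorchScript fuser in the PyTorch versions that ship it.
with torch.jit.fuser("fuser2"):
    for _ in range(3):          # warm-up iterations trigger JIT fusion
        y = gelu_bias(x, bias)
```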
Intel PlaidML: PlaidML is an open-source tensor compiler. Combined with Intel’s nGraph graph compiler, it enables performance portability for popular DL frameworks across a variety of CPU, GPU, and other accelerator architectures.
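For instance, PlaidML can be installed as the compute backend for the older standalone Keras via the plaidml-keras package; the sketch below follows that package’s documented usage, which may differ across releases.

```python
# Sketch: route standalone Keras through the PlaidML compiler backend.
import plaidml.keras
plaidml.keras.install_backend()  # must run before importing keras

import keras
from keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(128,)),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# Subsequent training and inference calls are compiled and executed by PlaidML.
```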
OpenVINO: OpenVINO (Open Visual Inference and Neural Network Optimization) is an open-source toolkit also developed by Intel. It mainly enables fast, high-performance deep learning inference on Intel hardware, such as CPUs, integrated GPUs, FPGAs, and VPUs (Vision Processing Units). It provides a set of tools and libraries designed to optimize and accelerate deep learning models for computer vision and other AI applications. OpenVINO supports various deep learning frameworks, including TensorFlow, Caffe, ONNX, Kaldi, and others. In summary, OpenVINO is tailored for optimizing and accelerating deep learning inference on Intel hardware, while PlaidML is a more generic, hardware-agnostic deep learning compiler that allows for broader device compatibility.
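To make OpenVINO’s usage concrete, here is a minimal sketch of its Python runtime API; the IR file "model.xml", the input shape, and the target device are placeholders, and the API may vary across OpenVINO releases.

```python
# Sketch: run inference with OpenVINO's Python runtime (2022+-style API).
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")            # ONNX files can also be read directly
compiled_model = core.compile_model(model, "CPU")

dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
result = compiled_model([dummy_input])[compiled_model.output(0)]
```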
This list may omit other interesting AI compilers, but it illustrates how popular and significant they have become.