Research Topic

Scalable and Chiplet-based Machine Learning Accelerators

Current Researchers: Yingnan Zhao, Dr. Ke Wang, and Dr. Yuan Li

As machine learning models grow in complexity and scale, traditional monolithic chip designs struggle to meet demands for compute density, energy efficiency, and scalability. Our research focuses on chiplet-based accelerator designs that address these limitations while enabling scalable deployment of machine learning systems. We target four key areas: (1) custom interconnection networks that adapt to the diverse traffic patterns of heterogeneous cores, (2) silicon interposer-aware network designs for high-bandwidth, low-latency communication in chiplet systems, (3) hardware-software co-design for optimizing machine learning (ML) inference and training across distributed compute fabrics, and (4) scalable accelerator fabrics that maintain throughput under limited area and power budgets. By embracing the modularity of chiplet-based architectures and the specialization of machine learning accelerators, we aim to create high-performance, energy-efficient, and scalable computing platforms for next-generation AI applications, from edge devices to data centers.


01.

Y. Zhao, K. Wang, and A. Louri, "OPT-GCN: A Unified and Scalable Chiplet-based Accelerator for High-Performance and Energy-Efficient GCN Computation," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), May 2024.

As the size of real-world graphs continues to grow at an exponential rate, performing the Graph Convolutional Network (GCN) inference efficiently is becoming increasingly challenging. Prior works that employ a unified computing engine with a predefined computation order lack the necessary flexibility and scalability to handle diverse input graph datasets. In this paper, we introduce OPT-GCN, a chiplet-based accelerator design that performs GCN inference efficiently while providing flexibility and scalability through an architecture-algorithm co-design. On the architecture side, the proposed design integrates a unified computing engine in each chiplet and an active interposer, both of which are adaptable to efficiently perform the GCN inference and facilitate data communication. On the algorithm side, we propose dynamic scheduling and mapping algorithms to optimize memory access and on-chip computations for diverse GCN applications.
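
To make the scheduling flexibility concrete, here is a minimal Python sketch (ours, not from the paper) of a single GCN layer, where the choice between aggregation-first and combination-first execution changes the multiply count; all names and shapes are illustrative:

```python
# A minimal sketch of one GCN layer (activation omitted); names and
# shapes are illustrative, not taken from the OPT-GCN paper.
import numpy as np

def gcn_layer(A, X, W, aggregate_first=True):
    """out = A @ X @ W, with a selectable computation order.

    A: (N, N) normalized adjacency, X: (N, F) features, W: (F, H) weights.
    Aggregation-first costs roughly N*N*F + N*F*H multiplies, while
    combination-first costs N*F*H + N*N*H, so the better order depends
    on F, H, and the sparsity of A -- hence dynamic scheduling.
    """
    if aggregate_first:
        return (A @ X) @ W   # aggregate neighbor features, then transform
    return A @ (X @ W)       # transform features, then aggregate

N, F, H = 1000, 512, 64
A = np.random.rand(N, N)
X = np.random.rand(N, F)
W = np.random.rand(F, H)
out = gcn_layer(A, X, W, aggregate_first=False)  # cheaper here since H < F
```

Because no single order wins across all graph datasets and feature dimensions, a fixed-order engine leaves performance on the table, which is the gap the dynamic scheduling and mapping algorithms address.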

02.

Y. Chen, A. Louri, F. Lombardi, and S. Liu, "Chiplet-GAN: Chiplet-based Accelerator Design for Scalable Generative Adversarial Network Inference," in IEEE Circuits and Systems Magazine, vol. 24, no. 3, pp. 19-33, DOI: 10.1109/MCAS.2024.3359571, third quarter 2024.

Generative adversarial networks (GANs) have emerged as a powerful solution for generating synthetic data when the availability of large, labeled training datasets is limited or costly in large-scale machine learning systems. Recent advancements in GAN models have extended their applications across diverse domains, including medicine, robotics, and content synthesis. These advanced GAN models have gained recognition for their excellent accuracy by scaling the model. However, existing accelerators face scalability challenges when dealing with large-scale GAN models. As the size of GAN models increases, the demand for computation and communication resources during inference continues to grow. To address this scalability issue, this article proposes Chiplet-GAN, a chiplet-based accelerator design for GAN inference. Chiplet-GAN enables scalability by adding more chiplets to the system, thereby supporting the scaling of computation capabilities. To handle the increasing communication demand as the system and model scale, a novel interconnection network with adaptive topology and passive/active network links is developed to provide adequate communication support for Chiplet-GAN. Coupled with workload partition and allocation algorithms, Chiplet-GAN reduces execution time and energy consumption for GAN inference workloads as both the model and the chiplet system scale.
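
As a rough illustration of workload partitioning, the hypothetical helper below splits a generator layer's output channels evenly across chiplets; the paper's partition and allocation algorithms are considerably more sophisticated:

```python
# A hypothetical channel-wise partition of one generator layer across
# chiplets; the paper's partition/allocation algorithms are more involved.
def partition_channels(num_channels, num_chiplets):
    """Split output channels as evenly as possible, one slice per chiplet."""
    base, extra = divmod(num_channels, num_chiplets)
    slices, start = [], 0
    for i in range(num_chiplets):
        size = base + (1 if i < extra else 0)  # spread the remainder
        slices.append((start, start + size))
        start += size
    return slices

# Six chiplets share 256 output channels of a (hypothetical) layer.
print(partition_channels(256, 6))  # [(0, 43), (43, 86), ..., (214, 256)]
```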

03.

Y. Li, A. Louri, and A. Karanth, "SPACX: Silicon Photonic-based Scalable Chiplet Accelerator for DNN Inference," in Proceedings of the 28th IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 831-845, April 2-6, 2022.

In pursuit of higher inference accuracy, deep neural network (DNN) models have significantly increased in complexity and size. To overcome the consequent computational challenges, scalable chiplet-based accelerators have been proposed. However, data communication over metallic interconnects in these chiplet-based DNN accelerators is becoming a primary obstacle to performance, energy efficiency, and scalability. Photonic interconnects can provide adequate data communication support thanks to superior properties such as low latency, high bandwidth, energy efficiency, and ease of broadcast communication. In this paper, we propose SPACX: a Silicon Photonics-based Chiplet Accelerator for DNN inference applications. Specifically, SPACX includes a photonic network design that enables seamless single-chiplet and cross-chiplet broadcast communications, and a tailored dataflow that promotes data broadcast and maximizes parallelism. Furthermore, we explore the broadcast granularities of the photonic network and their implications for system performance and energy efficiency. A flexible bandwidth allocation scheme is also proposed to dynamically adjust communication bandwidths for different types of data.
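
The sketch below illustrates one simple way such a flexible allocation could work, dividing a fixed budget of WDM wavelengths between weight and activation traffic in proportion to demand; the function and its proportional policy are our assumptions, not the scheme from the paper:

```python
# An illustrative wavelength-allocation policy (our assumption): split a
# fixed WDM budget between weight and activation traffic by demand.
def allocate_wavelengths(total, weight_demand, activation_demand):
    """Return (wavelengths for weights, wavelengths for activations)."""
    share = weight_demand / (weight_demand + activation_demand)
    for_weights = max(1, min(total - 1, round(total * share)))  # keep >= 1 each
    return for_weights, total - for_weights

# 16 wavelengths, weight traffic currently 3x the activation traffic.
print(allocate_wavelengths(16, weight_demand=3e9, activation_demand=1e9))  # (12, 4)
```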

04.

K. Shiflett, A. Karanth, A. Louri, and R. Bunescu, "Bitwise Neural Network Acceleration Using Silicon Photonics," in Proceedings of the ACM/IEEE Great Lakes Symposium on VLSI, Virtual Event, June 22-25, 2021.

Hardware accelerators provide significant speedup and improve energy efficiency for several demanding deep neural network (DNN) applications. DNNs have several hidden layers that perform concurrent matrix-vector multiplications (MVMs) between the network weights and input features. As MVMs are critical to the performance of DNNs, previous research has optimized the performance and energy efficiency of MVMs at both the architecture and algorithm levels. In this project, we propose to use emerging silicon photonics technology to improve parallelism, speed, and overall efficiency, with the goal of providing real-time inference and fast training of neural networks. We use microring resonators (MRRs) and Mach-Zehnder interferometers (MZIs) to design two versions (all-optical and partial-optical) of hybrid matrix multiplication for DNNs. Our results indicate that the partial-optical design achieves the best performance in both energy efficiency and latency, reducing the energy-delay product (EDP) by 33.1% with conservative estimates and by 76.4% with aggressive estimates.
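
For intuition, the following Python sketch shows the kind of bitwise MAC decomposition such designs exploit, where each multiply is expanded into gated, shifted partial products; it is a functional illustration only, not the optical circuit:

```python
# A functional sketch of a bit-serial MAC: each weight is decomposed
# into bits, so every partial product is a gated, shifted copy of the
# input -- the kind of operation that maps onto simple optical stages.
def bitwise_mac(xs, ws, bits=8):
    """Compute sum(x * w) using x * w = sum_b ((w >> b) & 1) * (x << b)."""
    acc = 0
    for x, w in zip(xs, ws):
        for b in range(bits):
            if (w >> b) & 1:        # weight bit gates the partial product
                acc += x << b       # shift implements the bit's place value
    return acc

assert bitwise_mac([3, 5], [7, 2]) == 3 * 7 + 5 * 2  # 31
```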

05.

K. Shiflett, A. Karanth, R. Bunescu, and A. Louri, "Scaling Deep-Learning Inference with Chiplet-based Architecture and Photonic Interconnects", in Proceedings of International Symposium on Computer Architecture (ISCA), Valencia, Spain, June 14-18, 2021.

With the end of Dennard scaling, highly parallel and specialized hardware accelerators have been proposed to improve the throughput and energy efficiency of deep neural network (DNN) models for various applications. However, collective data movement primitives such as multicast and broadcast that are required for multiply-and-accumulate (MAC) computation in DNN models are expensive, and require excessive energy and latency when implemented with electrical networks. This consequently limits the scalability and performance of electronic hardware accelerators. Emerging technology such as silicon photonics can inherently provide efficient implementations of multicast and broadcast operations, making photonics more amenable to exploiting parallelism within DNN models. Moreover, when coupled with other unique features such as low energy consumption, high channel capacity with wavelength-division multiplexing (WDM), and high speed, silicon photonics could potentially provide a viable technology for scaling DNN acceleration.
In this work, we propose Albireo, an analog photonic architecture for scaling DNN acceleration. By characterizing photonic devices such as microring resonators (MRRs) and Mach-Zehnder modulators (MZMs) using photonic simulators, we develop realistic device models and outline their capability for system-level acceleration. Using the device models, we develop an efficient broadcast combined with multicast data distribution by leveraging parameter sharing through unique WDM dot-product processing. We evaluate the energy and throughput performance of Albireo on DNN models such as ResNet18, MobileNet, and VGG16. When compared to current state-of-the-art electronic accelerators, Albireo increases throughput by 110× and improves the energy-delay product (EDP) by an average of 74× with current photonic devices. Furthermore, under moderate and aggressive photonic scaling, the proposed Albireo design reduces EDP by at least 229×.
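
As a purely functional picture of WDM dot-product processing, the toy model below puts each input element on its own wavelength, weights it with an idealized MRR transmission, and sums all channels at a photodetector; losses, noise, and quantization are ignored:

```python
# A toy model of a WDM dot product: each input rides its own wavelength,
# an MRR bank scales each channel, and the photodetector sums them all.
# Idealized: no loss, noise, or quantization.
import numpy as np

def wdm_dot(inputs, mrr_transmissions):
    """Per-wavelength scaling followed by summation at the detector."""
    weighted = np.asarray(inputs) * np.asarray(mrr_transmissions)
    return weighted.sum()  # all wavelengths integrate into one photocurrent

print(wdm_dot([0.2, 0.5, 0.9], [0.8, 0.4, 0.1]))  # 0.45
```

Because every wavelength is processed in parallel and summation is effectively free at the detector, the dot product completes in one pass, which is what makes WDM attractive for MAC-heavy DNN workloads.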

06.

J. Li, A. Louri, A. Karanth, and R. Bunescu, "CSCNN: Algorithm-Hardware Co-Design for CNN Accelerators using Centrosymmetric Filters," in Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), Seoul, Korea, February 27 - March 3, 2021.

Convolutional neural networks (CNNs) are at the core of many state-of-the-art deep learning models in computer vision, speech, and text processing. Training and deploying such CNN-based architectures usually require a significant amount of computational resources. Sparsity has emerged as an effective compression approach for reducing the amount of data and computation for CNNs. However, sparsity often results in computational irregularity, which prevents accelerators from fully taking advantage of its benefits for performance and energy improvement. In this work, we propose CSCNN, an algorithm/hardware co-design framework for CNN compression and acceleration that mitigates the effects of computational irregularity and provides better performance and energy efficiency. On the algorithmic side, CSCNN uses centrosymmetric matrices as convolutional filters. In doing so, it reduces the number of required weights by nearly 50% and enables structured computational reuse without compromising regularity and accuracy. Additionally, complementary pruning techniques are leveraged to further reduce computation by 2.8-7.2× with a marginal accuracy loss. On the hardware side, we propose a CSCNN accelerator that effectively exploits the structured computational reuse enabled by centrosymmetric filters, and further eliminates zero computations for increased performance and energy efficiency. Compared against a dense accelerator, SCNN, and SparTen, the proposed accelerator performs 3.7×, 1.6×, and 1.3× better, and improves the EDP (energy-delay product) by 8.9×, 2.8×, and 2.0×, respectively.
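
The structured reuse enabled by centrosymmetry can be seen in a few lines of Python: because the flattened weight vector satisfies w[i] = w[n-1-i], mirrored inputs can be added before multiplying, halving the multiplies. This sketch is ours, with illustrative shapes:

```python
# Our sketch of centrosymmetric reuse: a flattened centrosymmetric
# filter satisfies w[i] == w[n-1-i], so mirrored inputs can be summed
# before multiplying, halving both stored weights and multiplies.
import numpy as np

def centrosymmetric_dot(x, w_half):
    """Dot product with w = concat(w_half, reversed(w_half)):
    sum_i w[i]*x[i] = sum_j w_half[j] * (x[j] + x[n-1-j])."""
    k = len(w_half)
    paired = x[:k] + x[-k:][::-1]   # add each input to its mirror image
    return float(np.dot(w_half, paired))

x = np.arange(8, dtype=float)
w_half = np.array([1.0, 2.0, 3.0, 4.0])
w_full = np.concatenate([w_half, w_half[::-1]])
assert np.isclose(centrosymmetric_dot(x, w_half), np.dot(w_full, x))
```

Unlike unstructured pruning, this pairing is the same for every filter, so the hardware datapath stays regular while still doing roughly half the work.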

07.

K. Shiflett, D. Wright, A. Karanth, and A. Louri, "PIXEL: Photonic Neural Network Accelerator," in Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), San Diego, CA, February 22-26, 2020.

Machine learning (ML) architectures such as deep neural networks (DNNs) have achieved unprecedented accuracy on modern applications such as image classification and speech recognition. With power dissipation becoming a major concern in ML architectures, computer architects have focused on designing both energy-efficient hardware platforms and optimized ML algorithms. To dramatically reduce power consumption and increase parallelism in neural network accelerators, disruptive technology such as silicon photonics has been proposed, which can improve performance-per-Watt compared to electrical implementations. In this work, we propose PIXEL, a Photonic Neural Network Accelerator that efficiently implements the fundamental operation in neural computation, namely the multiply-and-accumulate (MAC) functionality, using photonic components such as microring resonators (MRRs) and Mach-Zehnder interferometers (MZIs). We design two versions of PIXEL: a hybrid version that multiplies optically and accumulates electrically, and a fully optical version that multiplies and accumulates optically. We perform a detailed power, area, and timing analysis of the different versions of photonic and electronic accelerators for different convolutional neural networks (AlexNet, VGG16, and others). Our results indicate a significant improvement in the energy-delay product for both PIXEL designs over traditional electrical designs (48.4% for OE and 73.9% for OO) while minimizing latency, at the cost of increased area over electrical designs.
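
At a purely functional level, the hybrid (OE) organization splits the MAC as sketched below, with an "optical" multiply stage feeding an "electrical" accumulator; this is our simplification, not PIXEL's actual datapath:

```python
# A purely functional split of the MAC into an "optical" multiply and an
# "electrical" accumulate, mirroring the hybrid (OE) organization; in
# hardware the multiply would be an MRR/MZI modulating light intensity.
def optical_multiply(x, w):
    return x * w                    # stand-in for the photonic product

def hybrid_mac(xs, ws):
    acc = 0.0                       # electrical accumulator
    for x, w in zip(xs, ws):
        acc += optical_multiply(x, w)
    return acc

print(hybrid_mac([0.5, 0.25], [2.0, 4.0]))  # 2.0
```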

HPCAT Lab
High Performance Computing Architectures & Technologies Lab

Department of Electrical and Computer Engineering
School of Engineering and Applied Science
The George Washington University


800 22nd Street NW
Washington, DC 20052
United States of America 

Contact

Ahmed Louri, IEEE Life Fellow
David and Marilyn Karlgaard Endowed Chair Professor of ECE
Director, HPCAT Lab


Email: louri@gwu.edu
Phone: +1 (202) 994 8241