Research Topic

Graph Processing & Neural Network Accelerators

Current Researchers: Dr. Jiajun Li, Dr. Hao Zheng, Dr. Ke Wang, Yingnan Zhao, and Belal Jahannia


Implementing graph processing and neural network algorithms on hardware platforms can incur problems such as poor locality, unpredictable access latency, and unbalanced workloads. These problems are even more pronounced in commercial settings, where natural graphs and data workloads are both large-scale and irregular. Domain-specific accelerators for graph processing and neural networks are therefore needed to achieve high-performance computing in hardware.

In this research, we work on minimizing the computational complexity of graph processing and deep learning algorithms by employing novel dataflow and preprocessing frameworks, which reduce redundant operations and fully exploit parallelism at the hardware level. We also explore efficient memory allocation approaches that improve data reuse, enhance computation throughput under limited bandwidth, or directly increase memory bandwidth through custom architectural layouts. Our ultimate goal is to design high-performance and energy-efficient accelerators on FPGA and ASIC platforms without sacrificing computation accuracy.
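To make the dataflow and data-reuse idea concrete, below is a minimal Python sketch (illustrative only, not taken from the publications listed on this page) of loop tiling, a standard loop optimization that improves data reuse when on-chip buffer capacity is limited; the tile size T is an assumed parameter.

# Minimal sketch: tiled matrix multiplication as a stand-in for the kind of
# loop/dataflow optimization that improves data reuse under a small on-chip buffer.
import numpy as np

def tiled_matmul(A, B, T=32):
    """Compute A @ B over T x T tiles so each loaded tile of A and B is reused
    across a whole block of the output before being evicted."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T):
        for j in range(0, N, T):
            for k in range(0, K, T):
                # Each (i, j, k) step touches only small blocks that would fit
                # in an on-chip buffer, maximizing reuse of the loaded tiles.
                C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)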

01.

Y. Zhao, K. Wang, and A. Louri, “FSA: An Efficient Fault-Tolerant Systolic Array Based DNN Accelerator,” in Proceedings of the 40th IEEE International Conference on Computer Design (ICCD), Lake Tahoe, CA, October 23-26, 2022

With the advent of Deep Neural Network (DNN) accelerators, permanent faults are increasingly becoming a serious challenge for DNN hardware accelerators, as they can severely degrade DNN inference accuracy. State-of-the-art works address this issue by adding homogeneous redundant Processing Elements (PEs) to the DNN accelerator’s central computing array, or by bypassing faulty PEs directly. However, such designs induce inference accuracy loss, extra hardware cost, and performance overhead. Moreover, current designs can only deal with a limited number of faults due to cost constraints. In this work, we propose FSA, a Fault-tolerant Systolic Array-based DNN accelerator with the goal of maintaining DNN inference accuracy in the presence of permanent faults. The key feature of the proposed FSA is a unified re-computing module (RCM) that dynamically recalculates the DNN computations assigned to faulty PEs with minimal latency and power consumption. Simulation results show that the proposed FSA reduces inference accuracy loss by 46%, improves execution time by 23%, and reduces energy consumption by 35% on average, as compared to existing designs.
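As a rough illustration of the recomputation idea, the following Python sketch models a PE array in which output elements owned by faulty PEs are deferred to a shared recompute step rather than dropped. It is a behavioral analogy under assumed conventions, not the FSA/RCM microarchitecture, and the function names are hypothetical.

# Conceptual sketch only: each PE owns one output element of a matrix multiply.
# Outputs owned by faulty PEs are recomputed by a shared unit instead of being
# bypassed, so the final result stays numerically exact.
import numpy as np

def faulty_array_matmul(A, B, faulty_pes):
    """faulty_pes: set of (row, col) output positions whose PEs are broken."""
    M, N = A.shape[0], B.shape[1]
    C = np.zeros((M, N))
    recompute_queue = []
    for i in range(M):
        for j in range(N):
            if (i, j) in faulty_pes:
                recompute_queue.append((i, j))   # defer to the recompute step
            else:
                C[i, j] = A[i, :] @ B[:, j]      # healthy PE result
    for i, j in recompute_queue:                 # shared unit redoes the work
        C[i, j] = A[i, :] @ B[:, j]
    return C

A, B = np.random.rand(4, 6), np.random.rand(6, 4)
assert np.allclose(faulty_array_matmul(A, B, {(0, 0), (2, 3)}), A @ B)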

02.

J. Li, H. Zheng, K. Wang, and A. Louri, "SGCNAX: A Scalable Graph Convolutional Neural Network Accelerator with Workload Balancing", IEEE Transactions on Parallel and Distributed Systems, 33.11 (2022): 2834-2845.

Graph Convolutional Neural Networks (GCNs) have emerged as promising tools for graph-based machine learning applications. Given that GCNs are both compute- and memory-intensive, efficiently processing large-scale GCNs poses a major challenge for the underlying hardware. In this project, we introduce SGCNAX, a scalable GCN accelerator architecture for the high-performance and energy-efficient acceleration of GCNs. Unlike prior GCN accelerators that either employ limited loop optimization techniques or determine the design variables based on random sampling, we systematically explore the loop optimization techniques for GCN acceleration and propose a flexible GCN dataflow that adapts to different GCN configurations to achieve optimal efficiency. We further propose two hardware-based techniques to address the workload imbalance problem caused by the unbalanced distribution of zeros in GCNs. Specifically, SGCNAX exploits an outer-product-based computation architecture that mitigates the intra-PE (Processing Element) workload imbalance, and employs a group-and-shuffle approach to mitigate the inter-PE workload imbalance. Simulation results show that SGCNAX performs 9.2x, 1.6x, and 1.2x better, and reduces DRAM accesses by factors of 9.7x, 2.9x, and 1.2x, compared to HyGCN, AWB-GCN, and GCNAX, respectively.
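The Python sketch below illustrates one simplified reading of the inter-PE balancing idea: sort sparse rows by nonzero count and deal them across PEs in a snake order so that each PE receives a comparable amount of work. This is an assumption about the general principle only, not the SGCNAX group-and-shuffle hardware, and the helper name is made up.

# Simplified illustration of load balancing sparse-row work across PEs.
import numpy as np

def balance_rows(row_nnz, num_pes):
    """Assign row indices to PEs; returns one list of row indices per PE."""
    order = np.argsort(row_nnz)[::-1]          # heaviest rows first
    buckets = [[] for _ in range(num_pes)]
    for rank, row in enumerate(order):
        pe = rank % (2 * num_pes)
        pe = pe if pe < num_pes else 2 * num_pes - 1 - pe   # snake order
        buckets[pe].append(int(row))
    return buckets

row_nnz = np.random.randint(1, 200, size=64)   # nonzeros per row of a sparse matrix
buckets = balance_rows(row_nnz, num_pes=4)
loads = [sum(row_nnz[b]) for b in buckets]
print("per-PE nonzero counts:", loads)         # roughly equal across PEs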

03.

J. Li, A. Louri, A. Karanth, and R. Bunescu, "GCNAX: A Flexible and Energy-Efficient Accelerator for Graph Convolutional Neural Networks", in Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), Virtual Conference, February 27 - March 3, 2021.

Graph convolutional neural networks (GCNs) have emerged as an effective approach to extend deep learning to graph data analytics. Given that graphs are usually irregular, as nodes in a graph may have a varying number of neighbors, processing GCNs efficiently poses a significant challenge for the underlying hardware. Although specialized GCN accelerators have been proposed to deliver better performance than generic processors, prior accelerators not only under-utilize the compute engine, but also impose redundant data accesses that reduce throughput and energy efficiency. Therefore, optimizing the overall flow of data between compute engines and memory, i.e., the GCN dataflow, so that it maximizes utilization and minimizes data movement, is crucial for achieving efficient GCN processing. In this project, we propose a flexible and optimized dataflow for GCNs that simultaneously improves resource utilization and reduces data movement. This is realized by fully exploring the design space of GCN dataflows and evaluating the number of execution cycles and DRAM accesses through an analysis framework. Unlike prior strategies, the proposed dataflow can reconfigure the loop order and loop fusion strategy to adapt to different GCN configurations, which results in much improved efficiency. We then introduce a novel accelerator architecture called GCNAX, which tailors the compute engine, buffer structure, and buffer size to the proposed dataflow. Evaluated on five real-world graph datasets, our simulation results show that GCNAX reduces DRAM accesses by factors of 8.1x and 2.4x, while achieving speedups of 8.9x and 1.6x and energy savings of 9.5x and 2.3x on average over HyGCN and AWB-GCN, respectively.
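To illustrate why execution order matters in a GCN dataflow, the Python sketch below compares dense multiply counts for the two orderings of a single GCN layer's chained product A·X·W; the dimensions are assumed for illustration, and the sparsity of A is ignored for simplicity. It is a back-of-the-envelope cost model, not the GCNAX analysis framework.

# Illustrative cost model: a GCN layer computes roughly A @ X @ W, and the
# order of the two matrix multiplications changes the multiply count sharply
# when the feature dimension shrinks from F_in to F_out.
N, F_in, F_out = 10_000, 512, 16     # assumed: nodes, input features, output features

cost_AX_then_W = N * N * F_in + N * F_in * F_out   # (A @ X) @ W
cost_XW_then_A = N * F_in * F_out + N * N * F_out  # A @ (X @ W)

print(f"(A X) W : {cost_AX_then_W:.3e} multiplies")
print(f"A (X W) : {cost_XW_then_A:.3e} multiplies")
# With these shapes, computing X @ W first is roughly 30x cheaper, which is the
# kind of per-layer reordering decision a flexible GCN dataflow can make.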

HPCAT Lab
High Performance Computing Architectures & Technologies Lab

Department of Electrical and Computer Engineering
School of Engineering and Applied Science
The George Washington University


800 22nd Street NW
Washington, DC 20052
United States of America 

Contact

Ahmed Louri, IEEE Fellow
David and Marilyn Karlgaard Endowed Chair Professor of ECE
Director, HPCAT Lab


Email: louri@gwu.edu                    
Phone: +1 (202) 994 8241