With technology scaling, hardware reliability has become a major obstacle to reaping the benefits of increased integration projected by Moore's law. Integrated circuit reliability is affected by many parameters, including design-related parameters such as gate oxide width, wire cross-sectional area, integration density, and chip area; process-related parameters such as defect size, distribution, and density; and operation-related parameters such as voltage, power density, and temperature. In this research project, we explore cost-effective approaches to mitigate permanent faults (induced by hardware aging) and transient errors (induced by runtime variations, overheated hardware, and transistor delay, etc.) in network-on-chips (NoCs) and deep learning accelerators in particular.
Abstract—With the advent of Deep Neural Network (DNN) accelerators, permanent faults are increasingly becoming a serious challenge for DNN hardware accelerator, as they can severely degrade DNN inference accuracy. The State-of-the-art works address this issue by adding homogeneous redundant Processing Elements (PEs) to the DNN accelerator’s central computing array, or bypassing faulty PEs directly. However, such designs induce inference loss, extra hardware cost, and performance overhead. Moreover, current designs are able to only deal with a limited number of faults due to costs. In this paper, we propose FSA, a Fault-tolerant Systolic Array-based DNN accelerator with the goal of maintaining DNN inference accuracy in the presence of permanent faults. The key feature of the proposed FSA is a unified re-computing module (RCM) that dynamically recalculates the required DNN computations that are supposed to be accomplished by faulty PEs with minimal latency and power consumption. Simulation results show that the proposed FSA reduces inference accuracy loss by 46%, improves execution time by 23%, and reduces energy consumption by 35% on average, as compared to existing designs.
We propose CURE, a deep reinforcement learning (DRL)-based NoC design framework that simultaneously reduces network latency, improves energy-efficiency, and tolerates transient errors and permanent faults. CURE has several architectural innovations and a DRL-based hardware controller to manage design complexity and optimize trade-offs. First, in CURE, we propose reversible multi-function adaptive channels (RMCs) to reduce NoC power consumption and network latency. Second, we implement a new fault-secure adaptive error correction hardware in each router to enhance reliability for both transient errors and permanent faults. Third, we propose a router power-gating and bypass design that powers off NoC components to reduce power and extend chip lifespan. Further, for the complex dynamic interactions of these techniques, we propose using DRL to train a proactive control policy to provide improved fault-tolerance, reduce power consumption, and improved performance. Simulation using the PARSEC benchmark shows that CURE reduces end-to-end packet latency by 39 percent, improves energy efficiency by 92 percent, and lowers static and dynamic power consumption by 24 and 38 percent, respectively, over conventional solutions. Using mean-time-to-failure, we show that CURE is 7.7 X more reliable than the conventional NoC design.
Network-on-Chips (NoCs), currently being used for on-chip communication in manycore architectures, face several problems including high network latency, excessive power consumption, and low reliability. Simultaneously addressing these problems is proving to be difficult due to the explosion of the design space and the complexity of handling many trade-offs. In this paper, we propose IntelliNoC, an intelligent NoC design framework which introduces architectural innovations and uses reinforcement learning to manage the design complexity and simultaneously optimize performance, energy-efficiency, and reliability in a holistic manner. IntelliNoC integrates three NoC architectural techniques: (1) multifunction adaptive channels (MFACs) to improve energy-efficiency; (2) adaptive error detection/correction and re-transmission control to enhance reliability; and (3) a stress-relaxing bypass feature which dynamically powers off NoC components to prevent overheating and fatigue. To handle the complex dynamic interactions induced by these techniques, we train a dynamic control policy using Q-learning, with the goal of providing improved fault-tolerance and performance while reducing power consumption and area overhead. Simulation using PARSEC benchmarks shows that our proposed IntelliNoC design improves energy-efficiency by 67% and mean-time-to-failure (MTTF) by 77%, and decreases end-to-end packet latency by 32% and area requirements by 25% over baseline NoC architecture.
As technology continues to scale, transistors and wires on the chip are becoming increasingly vulnerable to various fault mechanisms, especially timing errors, resulting in exacerbation of energy efficiency and performance for NoCs. Typical techniques for handling timing errors are reactive in nature, responding to the faults after their occurrence. They rely on error detection/correction techniques which have resulted in excessive power consumption and degraded performance, since the error detection/correction hardware is constantly enabled. On the other hand, indiscriminately disabling error handling hardware can induce more errors and intrusive retransmission traffic. Therefore, the challenge is to balance the trade-offs among error rate, packet retransmission, performance, and energy. In this paper, we propose a proactive fault-tolerant mechanism to optimize energy efficiency and performance with reinforcement learning (RL). First, we propose a new proactive error handling technique comprised of a dynamic scheme for enabling per-router error detection/correction hardware and an effective retransmission mechanism. Second, we propose the use of RL to train the dynamic control policy with the goals of providing increased fault-tolerance, reduced power consumption and improved performance as compared to conventional techniques. Our evaluation indicates that, on average, end-to-end packet latency is lowered by 55%, energy efficiency is improved by 64%, and retransmission caused by faults is reduced by 48% over the reactive error correction techniques.
Network-on-Chips (NoCs) are quickly becoming the standard communication paradigm for the growing number of cores on the chip. While NoCs can deliver sufficient bandwidth and enhance scalability, NoCs suffer from high power consumption due to the router microarchitecture and communication channels that facilitate inter-core communication. As technology keeps scaling down in the nanometer regime, unpredictable device behavior due to aging, infant mortality, design defects, soft errors, aggressive design, and process-voltage-temperature variations, will increase and will result in a significant increase in faults and hardware failures. In this work, we propose QORE - a fault tolerant NoC architecture with Quad-Function Channel (QFC) buffers. The use of QFC buffers and their associated control enhances fault-tolerance by allowing the NoC to dynamically adapt to faults at he link level and reverse propagation direction to avoid faulty links. Additionally, QFC buffers reduce router power and improve performance by eliminating in-router buffering. Our simulation results using real benchmarks and synthetic traffic mixes show that QORE improves speedup by 1.3X and throughput by 2.3X when compared to state-of-the-art fault tolerant NoCs designs such as Ariadne and Vicis. Moreover, using Synopsys Design Compiler, we also show that network power in QORE is reduced by 21% with minimal control overhead.
Network-on-Chips (NoCs) are quickly becoming the standard communication fabric for multi-core systems. As technology continues to scale don into the nanometer regime, device behavior will become increasingly unreliable due to a combination of aging, soft errors, aggressive transistor design, and process-voltage-temperature variations. Further, stringent timing constraints in NoCs are designed so that data can be pushed faster. The net result is an increase in errors which must be mitigated by the NoC. Typical techniques for handling faults are often reactive as they respond to faults after the error has occurred, making the recovery process inefficient in energy and time. In this work, we take a different approach wherein we propose to use proactive, fault-tolerant schemes to be employed before the fault affects the system. We propose to utilize machine learning techniques to train a decision tree which can be used to predict faults efficiently in the network. Based on the prediction model, we dynamically mitigate there predicted faults through error correction codes (ECC) and relaxed timing transmission. Our results indicate that, on average, we can accurately predict timing errors 60.6% better than a static single error correction and double error detection (SECDED) technique resulting in an average 26.8% reduction in retransmitted packets, an average net speedup of 3.31X, and an average energy savings of 60.0% over other designs for real traffic patterns.