With continued aggressive technology scaling, Network-on- Chips (NoCs) architectures are facing three major challenges including minimizing power consumption, scaling performance and providing a reliable and robust communication limited by area, power, and cost constraints.
Researchers have proposed various techniques individually tackling these challenges, while few efforts to date have simultaneously targeted improving power, performance and reliability together. Due to the complexity of the interactions among three competing objectives and explosion of design space, it is harder to manually design rules and strategies for interconnection system for optimizing power, reliability and performance.
Network-on-Chips (NoCs), currently being used for on-chip communication in manycore architectures, face several problems including high network latency, excessive power consumption, and low reliability. Simultaneously addressing these problems is proving to be difficult due to the explosion of the design space and the complexity of handling many trade-offs. In this paper, we propose IntelliNoC, an intelligent NoC design framework which introduces architectural innovations and uses reinforcement learning to manage the design complexity and simultaneously optimize performance, energy-efficiency, and reliability in a holistic manner. IntelliNoC integrates three NoC architectural techniques: (1) multifunction adaptive channels (MFACs) to improve energy-efficiency; (2) adaptive error detection/correction and re-transmission control to enhance reliability; and (3) a stress-relaxing bypass feature which dynamically powers off NoC components to prevent overheating and fatigue. To handle the complex dynamic interactions induced by these techniques, we train a dynamic control policy using Q-learning, with the goal of providing improved fault-tolerance and performance while reducing power consumption and area overhead. Simulation using PARSEC benchmarks shows that our proposed IntelliNoC design improves energy-efficiency by 67% and mean-time-to-failure (MTTF) by 77%, and decreases end-to-end packet latency by 32% and area requirements by 25% over baseline NoC architecture.
As technology continues to scale, transistors and wires on the chip are becoming increasingly vulnerable to various fault mechanisms, especially timing errors, resulting in exacerbation of energy efficiency and performance for NoCs. Typical techniques for handling timing errors are reactive in nature, responding to the faults after their occurrence. They rely on error detection/correction techniques which have resulted in excessive power consumption and degraded performance, since the error detection/correction hardware is constantly enabled. On the other hand, indiscriminately disabling error handling hardware can induce more errors and intrusive retransmission traffic. Therefore, the challenge is to balance the trade-offs among error rate, packet retransmission, performance, and energy. In this paper, we propose a proactive fault-tolerant mechanism to optimize energy efficiency and performance with reinforcement learning (RL). First, we propose a new proactive error handling technique comprised of a dynamic scheme for enabling per-router error detection/correction hardware and an effective retransmission mechanism. Second, we propose the use of RL to train the dynamic control policy with the goals of providing increased fault-tolerance, reduced power consumption and improved performance as compared to conventional techniques. Our evaluation indicates that, on average, end-to-end packet latency is lowered by 55%, energy efficiency is improved by 64%, and retransmission caused by faults is reduced by 48% over the reactive error correction techniques.
Dynamic Voltage and Frequency Scaling (DVFS) is a popular technique that allows dynamic energy to be saved, but it can potentially lead to loss in throughput. We propose LEAD - Learning-enabled Energy-Aware Dynamic voltage/frequency scaling for NoC architectures wherein we use machine learning techniques to enable energy-performance trade-offs at reduced overhead cost. LEAD enables a proactive energy management strategy that relies on an offline trained regression model and provides a wide variety of voltage/frequency pairs (modes). LEAD groups each router and the router’s outgoing links locally into the same V/F domain, allowing energy management at a finer granularity without additional timing complications and overhead. Our simulation shows an average dynamic energy savings of 17% with a minimal loss of 4% in throughput and no latency increase.
As communication energy exceeds computation energy in future technologies, traditional on-chip electrical interconnects face fundamental challenges in the many-core era. Photonic interconnects have been proposed as a disruptive technology solution due to superior performance per Watt, distance independent energy consumption and CMOS compatibility for on-chip interconnects. Static power due to the laser being always switched on, varying link utilization due to spatial and temporal traffic fluctuations and thermal sensitivity are some of the critical challenges facing photonics interconnects. We propose photonic interconnects for heterogeneous multicores using a checkerboard pattern that clusters CPU-GPU cores together and implements bandwidth reconfiguration using local router information without global coordination. To reduce the static power, we also propose a dynamic laser scaling technique that predicts the power level for the next epoch using the buffer occupancy of previous epoch. To further improve power-performance trade-offs, we also propose a regression-based machine learning technique for scaling the power of the photonic link. Our simulation results demonstrate a 34% performance improvement over a baseline electrical CMESH while consuming 25% less energy per bit when dynamically reallocating bandwidth. When dynamically scaling laser power, our buffer-based reactive and ML-based proactive prediction techniques show 40 - 65% in power savings with 0 - 14% in throughput loss depending on the reservation window size.
NoC designs suffer from excessive static and dynamic power consumption. High dynamic power consumption results from switching and storing data within routers/links while excess static power is consumed when routers and links are not utilized for communication and yet have to be powered up. We propose LESSON (Learning Enabled Sleepy Storage Links and Routers in NoCs) to reduce both static and dynamic power consumption by power-gating the links and routers at low network utilization and moving the data storage from within the routers to the links at high network utilization. As the network utilization increases from low-to- high, to accommodate more traffic, we design the same channels to flow traffic in either direction, thereby avoiding complex routing or look-ahead wake-up algorithms. Machine learning algorithms predict when to power-gate the channels and routers and when to increase the channel bandwidths such that power savings are maximized while performance penalty is minimized. Our results show that we can improve total network power consumption when compared to conventional NoC buffer designs by 85.6% and when compared with aggressive NoC buffer designs by 31.7%. Our predictor shows marginal performance penalties and by dynamically changing the direction of the links, we can improve packet latency by 14%.
We develop a new approach to proactive fault-tolerance which uses machine learning (ML) algorithms to predict and mitigate errors. We provide a comprehensive fault-prediction system in which we (a) create a methodology to obtain realistic data sets, (b) train a ML algorithm to predict timing faults on links, and (c) mitigate for soft errors. We develop a fault model, which accounts for parameter variation and device wear-out, to create training/testing data sets for the ML algorithm. Using the training data set and the ID3 algorithm, we create decision trees which can be used to accurately predict the number of errors. Finally, we dynamically mitigate the errors using a combination of error correction codes (ECC) and a relaxed transmission. Our network results indicate a 26.8% reduction in packet retransmissions, a 3.31× speedup, and an energy savings of 60.0% on average over other designs.