
Design of Energy-Efficient Processing Elements for Near-Threshold Parallel Computing
by Michael Andreas Gautschi, edited by Qiuting Huang, Mathieu Luisier, Andreas Schenk and Bernd Witzigmann

Over the last few years, the number of Internet of Things (IoT) endpoint devices has grown considerably, and this trend is expected to accelerate in the coming decade. Such systems are small, mostly battery-powered, and consist of various sensors, a wireless transmission solution, and a microcontroller unit (MCU). The varying application requirements of such IoT systems call for a programmable and scalable solution that offers performance ranging from a few kOp/s to GOp/s. Further, such systems need to be low-cost and energy-efficient, consuming only a few milliwatts. Today’s systems often consist of a low-power MCU with limited computing capabilities, which is mainly used for control tasks rather than data processing. Near-sensor data processing, on the other hand, allows for sensor fusion and feature extraction and can significantly reduce the number of transmitted bytes. We propose a programmable multi-core system that is scalable in performance and energy efficiency thanks to its parallel architecture and near-threshold (NT) operation.
This thesis focuses on the heart of this architecture, the processing elements (PEs), which can be programmed to execute various applications in parallel or to work jointly on a single application. To reach higher performance and better energy efficiency, a RISC-V processor architecture has been designed and extended with new instructions typically found in more energy-efficient digital signal processing (DSP) engines. Low-precision sensor data can be processed on average 2.3× faster through single-instruction multiple-data (SIMD) extensions, and the integration of the PEs in the multi-core platform is optimized through prefetch buffers that reduce cache contention and instruction fetch costs.
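The idea behind sub-word SIMD on low-precision data can be illustrated with a minimal Python sketch. It assumes two signed 16-bit samples packed into one 32-bit word and models a lane-wise addition and a dot-product step in software. The function names (`pack2x16`, `simd_add16`, `simd_dot16`) and the lane layout are hypothetical, chosen for illustration only; they are not the instruction names or exact semantics of the extensions described in the thesis.

```python
MASK16 = 0xFFFF


def pack2x16(lo: int, hi: int) -> int:
    """Pack two signed 16-bit values into one 32-bit word (hypothetical layout)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def _to_signed16(v: int) -> int:
    """Interpret the low 16 bits of v as a two's-complement value."""
    v &= MASK16
    return v - 0x10000 if v & 0x8000 else v


def unpack2x16(word: int) -> tuple:
    """Return the (low, high) signed 16-bit lanes of a 32-bit word."""
    return _to_signed16(word), _to_signed16(word >> 16)


def simd_add16(a: int, b: int) -> int:
    """Lane-wise wrapping addition: both 16-bit lanes are added independently."""
    lo = (a + b) & MASK16                    # carry out of the low lane is discarded
    hi = ((a >> 16) + (b >> 16)) & MASK16    # high lanes added separately
    return (hi << 16) | lo


def simd_dot16(a: int, b: int, acc: int = 0) -> int:
    """Multiply matching lanes and accumulate both products into acc."""
    a0, a1 = unpack2x16(a)
    b0, b1 = unpack2x16(b)
    return acc + a0 * b0 + a1 * b1


x = pack2x16(100, -3)
y = pack2x16(-50, 7)
print(unpack2x16(simd_add16(x, y)))  # two additions from one packed operation
print(simd_dot16(x, y))              # two multiply-accumulates from one packed operation
```

In hardware, each packed operation would occupy a single instruction slot, which is where a speed-up on 16-bit or 8-bit data comes from; the software model above only demonstrates the lane semantics, not the timing.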
Further, the feasibility of supporting high-dynamic-range (HDR) arithmetic in multi-core clusters is investigated using two number systems: the logarithmic number system (LNS) format and the traditional IEEE 754 floating-point format. The former has been explored because complex operations such as multiplication, division, and square roots reduce to simple integer operations in the logarithmic domain and can be computed very energy-efficiently. Additions and subtractions, in contrast, translate to non-linear functions, which can be interpolated in a shared unit. This LNS unit can also evaluate other complex functions such as logarithms and trigonometric functions, allowing the system to process non-linear kernels up to 4.1× more energy-efficiently than with traditional floating-point units (FPUs).
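Why LNS makes some operations cheap and others expensive can be seen in a minimal Python sketch. It assumes a positive value x is stored as the fixed-point integer round(log2(x) · 2^16); the 16 fractional bits, the function names, and the omission of sign and zero handling are simplifications made here for illustration, not the format used in the thesis. The non-linear correction term for addition is computed directly with `math.log2` as a stand-in for the interpolation that a shared hardware unit would perform.

```python
import math

FRAC_BITS = 16            # hypothetical fixed-point precision of the stored logarithm
SCALE = 1 << FRAC_BITS


def lns_encode(x: float) -> int:
    """Store a positive real as a fixed-point base-2 logarithm."""
    if x <= 0.0:
        raise ValueError("this sketch only handles positive values")
    return round(math.log2(x) * SCALE)


def lns_decode(e: int) -> float:
    """Convert a fixed-point logarithm back to a real value."""
    return 2.0 ** (e / SCALE)


def lns_mul(a: int, b: int) -> int:
    return a + b          # log2(x*y) = log2(x) + log2(y): one integer add


def lns_div(a: int, b: int) -> int:
    return a - b          # log2(x/y) = log2(x) - log2(y): one integer subtract


def lns_sqrt(a: int) -> int:
    return a >> 1         # log2(sqrt(x)) = log2(x) / 2: one arithmetic shift


def lns_add(a: int, b: int) -> int:
    """log2(x+y) = max + log2(1 + 2^(min-max)); the second term is non-linear."""
    hi, lo = max(a, b), min(a, b)
    d = (lo - hi) / SCALE                       # always <= 0
    # Stand-in for the interpolated function of a shared LNS unit (assumption).
    return hi + round(math.log2(1.0 + 2.0 ** d) * SCALE)


three, five = lns_encode(3.0), lns_encode(5.0)
print(lns_decode(lns_mul(three, five)))     # close to 15
print(lns_decode(lns_div(five, three)))     # close to 5/3
print(lns_decode(lns_sqrt(lns_encode(16.0))))
print(lns_decode(lns_add(three, five)))     # close to 8
```

Multiplication, division, and square root each cost a single integer operation, while addition needs the non-linear term log2(1 + 2^d), which is the part worth sharing among cores.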
Finally, a generalized sharing framework is introduced that allows individual operators of various latencies to be shared in a cluster of multiple PEs. A fine-grained, shared FPU of 63 kGE, which supports all RISC-V floating-point instructions, is integrated in an octa-core cluster, making HDR arithmetic available to all cores at diminishing cost. On a parallel seizure detection application, it is shown that access contention can be kept below 2%, which allows the shared unit to scale in performance while minimizing the per-core area overhead.
Implementing a four-core cluster in an advanced technology node such as 28 nm FD-SOI allows the PEs to achieve a peak energy efficiency of 193 MOp/s per mW, significantly more than commercially available MCUs achieve. At the same time the platform remains scalable, making it ready to serve more complex IoT systems, which will increasingly require HDR arithmetic.