Streaming Architectures for Extreme Energy Efficiency in High-Performance Computing

Author: Fabian Thomas Schuiki

by Fabian Thomas Schuiki, edited by Huang Qiuting, Mathieu Luisier, Andreas Schenk, and Bernd Witzigmann

Contributors
Author	Fabian Thomas Schuiki
Edited by	Huang Qiuting
Edited by	Mathieu Luisier
Edited by	Andreas Schenk
Edited by	Bernd Witzigmann

The end of Moore’s law and the breakdown of Dennard scaling have prompted a paradigm shift in the way we approach computer architecture design. Performance at low power has become the key ingredient in achieving high utilization of available hardware in order to mitigate the effect of limited frequency and overcome dark silicon. The von Neumann bottleneck is one of the key challenges in this field: instruction fetches compete with data accesses for memory bandwidth. This bottleneck also applies to the instruction pipeline of a processor, where load-store and control instructions compete with compute instructions for issue slots. A popular way to overcome this bottleneck is to implement dedicated accelerators for a specific problem, an approach that has grown ever more popular with the recent rise of machine learning. It is based on the observation that, all other things being equal, specialization in hardware always wins. However, the complementary conclusion also holds: the lack of general programmability limits the accelerator’s use to a specific problem. In a time of fast-moving algorithms, today’s hardware accelerator cannot compute tomorrow’s algorithm. General-purpose processors have evolved to mitigate the von Neumann bottleneck as well. One example is the CISC-to-RISC translation in modern processors, which can act as an instruction compression scheme. Similarly, the SIMD and SIMT paradigms offer a fixed increase in computations per instruction, while Cray-style vectorization offers a more dynamic and potentially higher increase. Among the algorithms that lend themselves particularly well to such acceleration is the class of data-oblivious algorithms. Their control flow does not depend on the data being processed, and the class comprises many relevant algorithms from linear algebra, machine learning, and scientific computing.
This thesis develops the concept of hardware address generation and direct memory streaming as a method to mitigate the von Neumann bottleneck. It applies the concept to in-order single-issue processors, allowing them to achieve full utilization of compute resources; introduces pseudo-dual-issue execution with dedicated compute hardware loops; and distills these extensions into an architectural template for high-performance computers capable of concentrating a significant part of their energy footprint in the arithmetic units.
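
As a rough illustration of why data-oblivious kernels suit address generation and streaming, the following Python sketch models the idea in software. It is not the hardware or instruction set described in the thesis: the function names `affine_addresses` and `streamed_dot`, the flat `memory` list, and the loop-bound/stride parameterization are all assumptions made for this example. Because the access pattern of a dot product is fixed by loop bounds and strides alone, the address sequence can be produced independently of the data, leaving only the multiply-accumulate in the compute loop.

```python
from itertools import product


def affine_addresses(base, bounds, strides):
    """Yield addresses of a nested-loop affine access pattern (hypothetical model).

    bounds[i] is the trip count of loop level i (outermost first) and
    strides[i] is the address increment per iteration of that level.
    The sequence depends only on these parameters, never on the data.
    """
    for idx in product(*(range(b) for b in bounds)):
        yield base + sum(i * s for i, s in zip(idx, strides))


def streamed_dot(memory, a_stream, b_stream):
    """Dot product whose loop body contains only the multiply-accumulate.

    The two address streams stand in for what would otherwise be explicit
    load and pointer-increment instructions in the loop.
    """
    acc = 0.0
    for addr_a, addr_b in zip(a_stream, b_stream):
        acc += memory[addr_a] * memory[addr_b]
    return acc


# Two 4-element vectors stored back to back in one flat memory (assumed layout).
memory = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
a = affine_addresses(base=0, bounds=[4], strides=[1])
b = affine_addresses(base=4, bounds=[4], strides=[1])
result = streamed_dot(memory, a, b)
```

In this model, the generator plays the role of an address generation unit configured once before the loop; a two-level pattern such as `affine_addresses(0, [2, 3], [4, 1])` would walk a 2x3 tile of a row-major matrix with a row pitch of 4 elements.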