Efficient hardware acceleration of recurrent neural networks

by Vladimir Rybalkin
ISBN 978-3-95974-187-3

Abstract
Each new innovation revolution arrives at a faster pace than the previous one. The Agricultural
Revolution happened approximately 10,000 years ago. The Scientific Revolution,
which brought significant advances in the natural sciences, happened 500 years ago. Less than
200 years after the Industrial Revolution, which introduced machines to replace manual labor and
revolutionized production, we are witnessing a new revolution. According to Dr. Michio
Kaku, professor of theoretical physics at the City College of New York, who is also a futurist
and a popularizer of science, "We are witnessing one of the greatest revolutions in all of
human history – a revolution driven by artificial intelligence and the Internet of Things."
Artificial Intelligence (AI) is a machine-based approach to mimicking human reasoning
in order to solve problems adaptively, thereby minimizing human involvement. The term
AI was coined in 1956, but AI has become widely popular only recently due to the deep
learning breakthrough. Deep learning, also known as deep neural learning or Deep Neural
Network (DNN), is a hierarchical composition of artificial neurons connected in
layers, whose problem-solving capability increases as more layers create a deeper
structure. Deep learning-based AI surpasses human capabilities in many applications, creating
a paradigm shift in virtually every tech industry sector and enabling decision support
systems and intelligent search systems that complement and augment human abilities.
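To make the notion of a layered composition concrete, the following minimal C++ sketch computes the forward pass of a small fully connected network, where each layer applies a weight matrix, a bias, and a nonlinearity to the output of the previous layer. All identifiers, the ReLU activation, and the plain floating-point data structures are illustrative assumptions rather than details taken from the thesis.

#include <algorithm>
#include <cstddef>
#include <vector>

// One fully connected layer: y = relu(W * x + b).
// The use of plain std::vector and the ReLU activation are illustrative
// assumptions, not details taken from the thesis.
std::vector<float> dense_relu(const std::vector<std::vector<float>>& W,
                              const std::vector<float>& b,
                              const std::vector<float>& x) {
    std::vector<float> y(W.size(), 0.0f);
    for (std::size_t i = 0; i < W.size(); ++i) {
        float acc = b[i];
        for (std::size_t j = 0; j < x.size(); ++j)
            acc += W[i][j] * x[j];
        y[i] = std::max(acc, 0.0f);  // nonlinearity
    }
    return y;
}

// A "deep" network is a composition of such layers:
// output = layerN(...(layer2(layer1(input)))...).
std::vector<float> forward(const std::vector<std::vector<std::vector<float>>>& weights,
                           const std::vector<std::vector<float>>& biases,
                           std::vector<float> x) {
    for (std::size_t l = 0; l < weights.size(); ++l)
        x = dense_relu(weights[l], biases[l], x);
    return x;
}

Adding more entries to the list of layer weights deepens the composed function, which is what "deep" in deep learning refers to.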
However, the AI revolution and the progress in the deployment of DNNs would not be
possible without the evolution of computers, namely improvements in computing power
and storage. At the dawn of the computer industry, nobody knew where this new technology
would take us. "I think there is a world market for maybe five computers." - Thomas
Watson, president of IBM, 1943. Ken Olsen, a prominent computer industry pioneer, was
quoted in 1977 as saying that "there is no reason for any individual to have a computer
in their home." However, in less than 50 years, the rapid progress of technology following
Moore's law enabled computers to evolve according to the boldest visions of futurists.
Recently, this progress has brought computers to the next leap in innovation - the Internet
of Things (IoT), which can be seen as another evolutionary step on the long road from
vacuum-tube machines occupying an entire building, to mainframes the size of a room,
to personal computers available at every desk, and finally to interrelated computing devices
communicating over a network, available in our pockets as cellphones and tablets,
as wearable devices, and as ubiquitous sensors. The deployment of revolutionary
AI on IoT devices places unprecedented demands on processing speed, power consumption,
and energy efficiency.
One of the most promising and rapidly developing platforms that can meet these requirements
is the Field Programmable Gate Array (FPGA). FPGAs are semiconductor devices based on
a matrix of configurable logic blocks connected via programmable interconnects.
According to Manoj Roge, vice president of product planning and business development
at Achronix, a paradigm shift is currently under way: FPGAs are in the third era of
programmable logic, moving from being used only as glue logic or for prototyping to serving
as independent compute engines for data acceleration. Today's FPGAs push
the 500 MHz performance barrier. FPGAs have become a compelling proposition for almost any
design due to an unprecedented increase in logic density and a host of other features.
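As a flavor of how an FPGA can serve as an independent compute engine, the hedged sketch below shows a small multiply-accumulate kernel written in C++ for a high-level synthesis flow. The PIPELINE pragma follows the Xilinx/AMD HLS convention; the kernel, its fixed vector length, and the choice of toolchain are illustrative assumptions rather than material from the thesis.

// Minimal HLS-style multiply-accumulate kernel (illustrative sketch).
// With the PIPELINE pragma, an HLS tool can schedule successive loop
// iterations in overlapping clock cycles, turning the loop into a
// streaming datapath on the FPGA fabric.
constexpr int N = 128;  // vector length (assumed for illustration)

int dot_product(const short a[N], const short b[N]) {
    int acc = 0;
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        acc += a[i] * b[i];
    }
    return acc;
}

Pipelining the loop lets successive iterations overlap in time, which is how the reconfigurable fabric turns a sequential loop into a high-throughput datapath.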
"A perfect storm" is a term for an unusually severe storm that results from
a rare combination of meteorological phenomena. By this analogy,
we are currently in the middle of a perfect technological and innovation storm bringing together
(1) revolutionary artificial intelligence with beyond-human recognition capabilities, (2) the
ubiquitous IoT with unprecedented speed, power, and energy requirements, and (3) FPGAs
emerging as an independent computing platform with a unique combination of flexibility and
efficiency.
The deployment of DNNs comes with high computational and storage requirements. One of the
most challenging neural networks to implement efficiently is the Long Short-Term Memory
(LSTM) network, which achieves high accuracy in many applications targeting sequence
recognition, such as optical character recognition, speech recognition, forecasting, and many
more (a minimal sketch of a single LSTM step is given after the contribution list below).
The main goal of this research has been to design efficient hardware architectures
for LSTM networks in applications requiring high throughput, low power, and energy efficiency.
Despite the advantages of FPGAs, CPUs and GPUs are very often preferred over
FPGAs because of their faster and easier development process. This thesis presents a holistic
design space exploration methodology and an automatic framework to facilitate fast
and efficient implementation of DNNs on FPGAs. As an additional contribution, this work
also presents a low-power, energy-efficient solution with real-time capabilities for digitizing
historical documents. In the context of communication standards for the IoT, the research
targets the design of a critical component of a hardware architecture for error-correcting codes,
enabling high reliability and suitable for high-speed, low-latency wireless communication.
The novel contributions presented in this thesis are grouped into five topics:
• The first hardware architecture and Pareto-frontier analysis of bidirectional LSTM.
• The first hardware architecture and Pareto-frontier analysis of multidimensional LSTM.
• A cross-layer design space exploration methodology and a framework for automatic
co-design and implementation of DNNs and hardware architectures on FPGAs.
• The first heterogeneous architecture for a low-power, real-time, and energy-efficient
device for highly accurate end-to-end transcription of historical documents.
• The first hardware architecture for a high-speed, low-latency Non-Binary
Low-Density Parity-Check (NB-LDPC) check node for the Galois field GF(256).
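
As referenced above, the following minimal C++ sketch shows a single time step of a standard LSTM cell to illustrate why the LSTM is computationally demanding: every element of the hidden state requires four multiply-accumulate passes over both the input and the previous hidden state, which is the workload the proposed hardware architectures accelerate. The weight layout, the absence of peephole connections, and all identifiers are illustrative assumptions rather than the exact formulation used in the thesis.

#include <cmath>
#include <cstddef>
#include <vector>

// Parameters of one LSTM layer: each of the four gates (input i, forget f,
// output o, candidate g) has an input weight matrix W (hidden x input),
// a recurrent weight matrix U (hidden x hidden), and a bias vector b.
// This layout and the absence of peephole connections are assumptions made
// for illustration; the architectures in the thesis may differ in detail.
struct LstmWeights {
    std::vector<std::vector<float>> Wi, Wf, Wo, Wg;
    std::vector<std::vector<float>> Ui, Uf, Uo, Ug;
    std::vector<float> bi, bf, bo, bg;
};

static float sigmoidf(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// Dot products of one weight row with the input x and the previous hidden
// state h, plus the bias: the core multiply-accumulate workload per gate.
static float affine(const std::vector<float>& wx, const std::vector<float>& x,
                    const std::vector<float>& uh, const std::vector<float>& h,
                    float bias) {
    float acc = bias;
    for (std::size_t j = 0; j < x.size(); ++j) acc += wx[j] * x[j];
    for (std::size_t j = 0; j < h.size(); ++j) acc += uh[j] * h[j];
    return acc;
}

// One time step: updates the cell state c and hidden state h in place.
void lstm_step(const LstmWeights& p, const std::vector<float>& x,
               std::vector<float>& h, std::vector<float>& c) {
    const std::vector<float> h_prev = h;
    for (std::size_t k = 0; k < h.size(); ++k) {
        const float i = sigmoidf(affine(p.Wi[k], x, p.Ui[k], h_prev, p.bi[k]));
        const float f = sigmoidf(affine(p.Wf[k], x, p.Uf[k], h_prev, p.bf[k]));
        const float o = sigmoidf(affine(p.Wo[k], x, p.Uo[k], h_prev, p.bo[k]));
        const float g = std::tanh(affine(p.Wg[k], x, p.Ug[k], h_prev, p.bg[k]));
        c[k] = f * c[k] + i * g;      // new cell state
        h[k] = o * std::tanh(c[k]);   // new hidden state
    }
}

Because each output depends on the hidden state of the previous step, the recurrence limits how much of this work can be parallelized across time, which is one reason dedicated hardware architectures for LSTM are worthwhile.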