on 23 Jul 2018 2:11 PM

In this message we first give an overview of processing components such as CPUs, GPUs, FPGAs and ASICs, and we highlight the main differences between them in terms of flexibility and processing power. We then discuss the benefits of FPGAs deployed in public, private and hybrid Cloud infrastructures, and we explain why you should consider ready-to-use FPGA accelerator functions to boost your Cloud applications.


1. Landscape

At one end of the spectrum shown below, general-purpose Central Processing Units (CPUs) give you a high level of flexibility when developing your software applications. At the other end, Application Specific Integrated Circuits (ASICs) implement functions in hardware, and their architecture can be tailored to process workloads orders of magnitude faster than CPUs. Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) are intermediate alternatives that can process workloads faster than CPUs and that are more flexible than ASICs.


2. CPUs and GPUs: flexible instruction set architectures

The first CPU architectures date back to the 1940s (Harvard: 1939, Von Neumann: 1945), when the number of transistors on a single chip was limited to a few hundred, and when transistors were much slower than memory.

In a CPU, executing an application or program means executing a sequence of instructions. This model applies to any processing task. The instructions are stored in memory, and a Control Unit fetches them sequentially. As illustrated on the left-hand side of the following figure, the Control Unit decodes each instruction and then drives the following sequence of operations to process it:
- load the instruction into an Arithmetic Logic Unit (ALU)
- load data into registers
- execute ALU
- store data to memory

Simple analogy: a CPU is a skilled artisan working on a workbench, with assistants bringing the tools (instructions) and the materials (data).
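
The sequence above can be sketched as a toy interpreter. This is only an illustration: the instruction format, the register names and the four opcodes are invented for this example and do not correspond to any real instruction set.

```python
# Toy model of the fetch / decode / execute cycle of a CPU.
# The instruction format and opcodes are hypothetical, chosen for illustration.

def run_cpu(program, memory):
    """Execute `program` one instruction at a time against `memory`."""
    registers = {"r0": 0, "r1": 0}
    alu_ops = {
        "add": lambda a, b: a + b,
        "mul": lambda a, b: a * b,
    }
    pc = 0  # program counter: index of the next instruction to fetch
    while pc < len(program):
        opcode, *operands = program[pc]      # fetch and decode
        if opcode == "load":                 # load data into a register
            reg, addr = operands
            registers[reg] = memory[addr]
        elif opcode in alu_ops:              # execute the ALU
            dst, src = operands
            registers[dst] = alu_ops[opcode](registers[dst], registers[src])
        elif opcode == "store":              # store data to memory
            reg, addr = operands
            memory[addr] = registers[reg]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        pc += 1                              # move on to the next instruction
    return memory


program = [
    ("load", "r0", 0),
    ("load", "r1", 1),
    ("add", "r0", "r1"),
    ("store", "r0", 2),
]
result = run_cpu(program, [2, 3, 0])
print(result)  # [2, 3, 5]
```

Note that every instruction goes through the same loop, one after the other: this is what makes the model applicable to any task, and also what limits its throughput.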

Modern processors contain several CPU cores on a single chip and can process multiple instructions in parallel. Nevertheless, a CPU or a set of CPUs can hardly process large data volumes such as images efficiently.


GPUs were invented in the 1970s, once transistors had become faster and more numerous on a single chip, to overcome the limitations of CPUs by processing different blocks of an image in parallel.

The architecture of a GPU is derived from that of a CPU by replicating the ALU (several hundred times these days), as illustrated on the right-hand side of the above figure. All ALUs execute the same sequence of instructions in parallel on different data slices:
- load common instruction into ALUs
- load data slices into registers
- execute ALUs
- store data slices to memory

Analogy: a GPU is a team of identically skilled artisans that perform the same tasks on identical pieces.
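
The same idea can be sketched in a few lines. The function name `run_simd` and the two pixel operations are assumptions made for this illustration. Python executes the lanes one after the other, so the sketch only models the "same instruction, different data" behavior, not the actual hardware parallelism.

```python
# Toy model of a GPU: one common instruction stream, many data slices.
# `run_simd`, `brighten` and `clamp` are hypothetical names for illustration.

def run_simd(instructions, data_slices):
    """Apply the same instruction sequence to every data slice."""
    lanes = list(data_slices)                    # load data slices into registers
    for op in instructions:                      # load one common instruction
        lanes = [op(value) for value in lanes]   # every lane (ALU) executes it
    return lanes                                 # store data slices to memory


def brighten(pixel):
    return pixel + 10


def clamp(pixel):
    return min(pixel, 255)


pixels = [0, 100, 250]
print(run_simd([brighten, clamp], pixels))  # [10, 110, 255]
```

Because every lane must follow the same instruction sequence, this model only pays off when the data can be cut into slices that are all processed identically.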

GPUs are more efficient than general-purpose CPUs at processing a narrow set of workloads that exhibit both data parallelism and data locality. Workloads outside this narrow set can be addressed with custom hardware components such as ASICs and FPGAs.


3. ASICs and FPGAs: custom hardware architecture

In the 1980s, ASICs and FPGAs emerged from the need to process any kind of task with millions of custom instructions, millions of registers, and thousands of local memories in a single chip. In these architectures, data flows from one instruction to the next without going back and forth to central memory, unlike in CPUs and GPUs. This leads to massively parallel architectures.
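
As a rough software analogy of this dataflow style, the sketch below chains three processing stages with Python generators: each value is handed directly from one stage to the next, and no intermediate result is written to a shared buffer. The stage names (`scale`, `offset`, `threshold`) and their parameters are hypothetical. In real hardware all stages would work at the same time on different values, whereas Python interleaves them on a single core.

```python
# Software analogy of a dataflow pipeline: each stage consumes the output
# of the previous stage directly. Stage names and constants are hypothetical.

def scale(stream, factor):
    for x in stream:
        yield x * factor


def offset(stream, delta):
    for x in stream:
        yield x + delta


def threshold(stream, limit):
    for x in stream:
        yield 1 if x > limit else 0


def pipeline(samples):
    """Wire the stages together: scale -> offset -> threshold."""
    return threshold(offset(scale(samples, 2), -3), 4)


print(list(pipeline([1, 2, 3, 4, 5])))  # [0, 0, 0, 1, 1]
```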

ASICs expose a fixed set and layout of custom instructions that are engraved in silicon. Although ASICs can run processing tasks much faster than CPUs and significantly reduce unit costs, designing an ASIC can cost tens of millions of dollars and take years, and ASICs cannot be reconfigured to implement new functions.

FPGAs expose a programmable set and layout of custom instructions. They can be reconfigured as many times as desired by designers, who specify how data must be processed by an array of elementary blocks and exchanged between these blocks, as illustrated below. Loading a new program into an FPGA only takes a few seconds.

Analogy: FPGAs are the electronic implementation of factory floors, with many assembly lines, each with many stations doing specialized tasks. The entire factory floor can be rebuilt in a matter of seconds (but it can take months to prepare one configuration of the factory floor).


Although FPGAs are more flexible than ASICs, they also have a higher unit cost. Nevertheless, FPGAs are the most efficient programmable processing devices thanks to their parallel architecture, which supports local data movements and massive I/O.


4. Is the FPGA the best programmable device to process your workload?

The number of transistors in a chip increased by 100,000,000x, from 200 in 1955 to 20 billion in 2018. Unlike in the 1980s, transistor speed is now higher than memory speed. However, transistor speed plateaued around 2004 due to physical energy density limits. These limits make it hard for manufacturers to follow Moore's law, an observation made in the 1960s that the number of transistors in an integrated circuit doubles about every two years. The 70+ year old CPU compute architecture has reached its limits. It is possible to add more cores/ALUs to CPUs/GPUs, but they don't run faster.
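
As a quick sanity check, the two data points quoted above (200 transistors in 1955, 20 billion in 2018) can be converted into an average doubling period. The calculation assumes steady exponential growth between these two points, which is a simplification.

```python
import math

# Data points quoted in the text above.
start_year, start_count = 1955, 200
end_year, end_count = 2018, 20_000_000_000

growth = end_count / start_count              # overall growth factor
doublings = math.log2(growth)                 # how many times the count doubled
years_per_doubling = (end_year - start_year) / doublings

print(f"growth factor: {growth:.0e}")                    # growth factor: 1e+08
print(f"doublings: {doublings:.1f}")                     # doublings: 26.6
print(f"years per doubling: {years_per_doubling:.2f}")   # years per doubling: 2.37
```

The result, a doubling roughly every 2.4 years on average, is consistent with the "about every two years" formulation of Moore's law.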

Is there a single winner among CPUs, GPUs and FPGAs? Absolutely not! You should use all of these processing devices, depending on your workloads.

While very inefficient, CPUs remain the most versatile processing devices, and by far the easiest to program. If a CPU is efficient enough for a given workload, it is the best target. If a CPU is not efficient enough for some workloads, then GPUs should be explored first. If these workloads fall into the narrow application segment of tasks that exhibit both data parallelism and data locality, then the GPU is the most efficient compute device. Finally, for workloads that are not fast enough on either CPU or GPU, the FPGA is the right choice. While FPGAs are the most efficient programmable devices, they remain very complex to program.
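
The selection order described above can be summarized as a small decision function. This is only a sketch of the reasoning in this section: the function name and its boolean parameters are hypothetical, and in practice each answer comes from profiling the workload rather than from a simple flag.

```python
# Decision order sketched from the paragraph above: CPU first, then GPU,
# then FPGA. Function and parameter names are hypothetical.

def choose_device(cpu_fast_enough, data_parallel=False, data_local=False,
                  gpu_fast_enough=False):
    """Return the first device, in order of programming effort, that fits."""
    if cpu_fast_enough:
        return "CPU"    # easiest to program, so preferred when sufficient
    if data_parallel and data_local and gpu_fast_enough:
        return "GPU"    # efficient only for data-parallel, data-local workloads
    return "FPGA"       # most efficient, but the most complex to program


print(choose_device(cpu_fast_enough=True))   # CPU
print(choose_device(False, data_parallel=True, data_local=True,
                    gpu_fast_enough=True))   # GPU
print(choose_device(False, data_parallel=True, data_local=False))  # FPGA
```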


5. A new world of opportunities

The industry has embraced FPGAs for general-purpose compute. Microsoft Azure started its FPGA program in 2011, and has been deploying at scale since 2015. Intel acquired Altera for $16.7B at the end of 2015. Leading Cloud Service Providers such as AWS, OVH, Alibaba, Baidu, Tencent and Huawei deployed FPGAs in 2017 and 2018, with more to come. Dell, EMC and Fujitsu also deployed FPGA-based servers in 2018.

In this public, private and hybrid Cloud context, CPU-intensive processing tasks are offloaded to FPGA cards through PCIe interfaces, which can greatly reduce both runtime and cost. This opens up a new world of opportunities to boost Video and Image Processing, Data Analytics, Machine Learning, Security, Financial, AI or Genomics applications, to name a few.


Ready-to-use FPGA accelerators are available on AccelStore for you to boost your own applications in your public Cloud infrastructure or on premises. You can easily integrate these accelerators and operate or orchestrate their execution thanks to an open source library. The library and the accelerators can be tested in a few minutes through AccelStore. Please contact Accelize to get more details.