Optimizing AI and Machine Learning with eFPGAs
1st September 2018
The market for artificial intelligence (AI) and machine learning applications has been growing substantially over the last several years. Designers have a tough row to hoe when it comes to satisfying these applications’ seemingly insatiable compute hunger. They are finding that traditional von Neumann processor architectures are not optimal for the neural networks fundamental to AI and machine learning.
When GPUs are used to train neural networks, they require floating-point math, which is very compute intensive. Inference, however, can use integer math, and designers can speed that computation by turning to FPGAs for neural network processing. Many companies are starting to recognize this; Microsoft’s Project Brainwave, which uses FPGA chips to accelerate AI, is a good example.
How FPGAs Speed AI
FPGAs have many advantages for AI. An FPGA’s many multiplier-accumulators (MACs) can accelerate AI computation with massive parallelism. FPGAs are also reconfigurable, which is critical in the AI and machine learning market because these applications are still at an early stage and algorithms are changing very quickly.
“How many MACs per second can I get?” is one of the first things chip designers start thinking about when they begin an AI project. The bulk of the computation that neural network inference requires is 8-bit (or sometimes 16-bit or even 8×16 or 16×8) integer matrix multiplies of very large matrices, driving the demand for MACs or GigaMACs, also referred to as GigaOps, where one Op equals one multiply+accumulate. Thus, the number of GigaMACs/second determines the throughput of the hardware.
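As a rough illustration of this arithmetic, the sketch below counts the MACs in one matrix-vector multiply and estimates a peak GigaMACs/second figure. The function names, the layer size, the MAC count, and the clock rate are all hypothetical values chosen for the example, and the peak figure assumes every MAC unit completes one operation per clock cycle.

```python
def macs_for_layer(num_inputs: int, num_outputs: int) -> int:
    """One multiply-accumulate per weight in a num_outputs x num_inputs matrix."""
    return num_inputs * num_outputs


def peak_gigamacs_per_second(num_macs: int, clock_hz: float) -> float:
    """Peak rate, assuming each MAC unit finishes one operation per clock cycle."""
    return num_macs * clock_hz / 1e9


# Hypothetical numbers, for illustration only.
layer_macs = macs_for_layer(num_inputs=1024, num_outputs=1024)  # 1,048,576 MACs
peak = peak_gigamacs_per_second(num_macs=400, clock_hz=500e6)   # 200.0 GigaMACs/s

# Best-case time for one layer if the MAC array is kept fully busy.
seconds_per_layer = layer_macs / (peak * 1e9)
```

In practice the achieved rate is lower than the peak whenever the MACs have to wait for data.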
Neural networks are a digital approximation of the neuron structure in the human brain (Figure 1). Each neuron gets inputs from a large number of other neurons. The output or firing of a neuron occurs when the sum of the inputs exceeds a certain value determined by the neuron’s activation function. Figure 2 shows a simple neural network for computational purposes, which roughly approximates the function of a human brain.
Each layer of the neural network is a matrix multiply. The input layer is a vector, which is multiplied by a 2-dimensional matrix of weights (determined in a training phase), to generate the values for the next layer. Then each successive layer is another matrix multiply. After each matrix multiply, the new values go through an activation phase before becoming inputs to the next step.
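The layer-by-layer computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production inference engine: the weights and inputs are made-up small integers, ReLU is assumed as the activation function (the article does not name one), and the rescaling that real 8-bit inference applies between layers is omitted.

```python
def mac_dot(weights_row, inputs):
    """Dot product built from repeated multiply-accumulate steps."""
    acc = 0
    for w, x in zip(weights_row, inputs):
        acc += w * x  # one MAC operation
    return acc


def relu(value):
    """Assumed activation: pass positive sums through, clamp negatives to 0."""
    return max(0, value)


def layer(weights, inputs, activation=relu):
    """One layer: matrix-vector multiply followed by the activation phase."""
    return [activation(mac_dot(row, inputs)) for row in weights]


def forward(layers, inputs):
    """Feed the input vector through each weight matrix in turn."""
    values = inputs
    for weights in layers:
        values = layer(weights, values)
    return values


# Made-up integer weights: 3 inputs -> 2 hidden values -> 2 outputs.
w1 = [[2, 0, 1], [-1, 1, 0]]
w2 = [[1, 1], [3, -4]]
outputs = forward([w1, w2], [1, -2, 3])  # [5, 15]
```

Every `acc += w * x` step is one MAC, so a layer with N inputs and M outputs costs N × M MACs.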
On a side note, memory bandwidth is another, separate factor, because a lot of data has to be read in order to do the matrix math (Figure 3). A full tutorial on the math is available at http://www.flex-logix.com/eflx4k-ai/
eFPGAs Help Balance Performance and Configurability
At Flex Logix, we have customers such as Harvard already designing deep learning chips using eFPGAs. AI designers seeking to balance performance and reconfigurability are finding eFPGAs optimal for this purpose. FPGAs and eFPGAs are good for AI because of the large number of built-in MACs, originally used for digital signal processing (DSP).
In a typical Altera FPGA, the ratio of logic (six-input look-up tables, or LUT6s) to MACs is about 200:1. Xilinx is about 100:1, whereas the Flex Logix EFLX4K DSP eFPGA core has a ratio of about 50:1. Thus, for a given array size, the eFPGA delivers two to four times as many MACs, which is key to AI performance.
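The two-to-four-times figure follows directly from the ratios. The sketch below works it through for a hypothetical array of 10,000 LUTs; that LUT count is an arbitrary illustration, not the size of any real device, and the ratios are the approximate ones quoted above.

```python
def macs_for_luts(num_luts: int, luts_per_mac: int) -> int:
    """Number of MACs that accompany num_luts LUTs at a given LUT:MAC ratio."""
    return num_luts // luts_per_mac


# Approximate LUT:MAC ratios quoted in the text.
LUTS_PER_MAC = {"Altera (typical)": 200, "Xilinx": 100, "EFLX4K DSP": 50}

NUM_LUTS = 10_000  # hypothetical array size, for illustration only

macs = {name: macs_for_luts(NUM_LUTS, ratio) for name, ratio in LUTS_PER_MAC.items()}
# {'Altera (typical)': 50, 'Xilinx': 100, 'EFLX4K DSP': 200}

# How many times more MACs the 50:1 fabric offers for the same LUT count.
advantage = {name: macs["EFLX4K DSP"] / count for name, count in macs.items()}
# {'Altera (typical)': 4.0, 'Xilinx': 2.0, 'EFLX4K DSP': 1.0}
```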
However, the MACs in all eFPGAs and FPGAs today are optimized for DSP. They have pre-adders, large multipliers (22×22 or 18×25, for example) and accumulators. AI prefers a MAC that has an 8×8 multiplier with accumulator with an option to configure as a 16×16, 16×8 or 8×16 MAC as well. Since an 8-bit MAC is smaller, it can run faster.
Flex Logix has architected its eFPGA to be optimized for AI in several ways. For example, it uses 8×8 MACs, which are smaller (three fit in the space of a 22×22 MAC) and run faster, and it increases the ratio of MACs to LUTs. As a result, 441 8×8 MACs fit in an EFLX4K AI eFPGA core: more than 10 times the MACs in about the same area as the EFLX4K DSP core, which already had more MACs per square millimeter than any other FPGA/eFPGA. The 8×8 MACs can also be configured to do 16×16 multiplies if preferred. Customers with Verilog/VHDL code for neural network processing on FPGAs will be able to use this new AI-optimized eFPGA without changing code while achieving 10x the throughput.
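To see why 8×8 multipliers can serve 16×16 multiplies at all, it helps to recall the textbook identity that splits each 16-bit operand into two 8-bit halves and combines four 8×8 partial products. The sketch below shows that identity for unsigned values only; it is an arithmetic illustration with hypothetical function names, and it does not describe how the EFLX4K AI hardware actually combines its MACs.

```python
def mul8(a: int, b: int) -> int:
    """Stand-in for one unsigned 8x8 multiplier."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def mul16_from_mul8(a: int, b: int) -> int:
    """Unsigned 16x16 multiply assembled from four 8x8 partial products."""
    assert 0 <= a < 65536 and 0 <= b < 65536
    a_hi, a_lo = a >> 8, a & 0xFF
    b_hi, b_lo = b >> 8, b & 0xFF
    # (a_hi*256 + a_lo) * (b_hi*256 + b_lo), expanded term by term.
    high = mul8(a_hi, b_hi) << 16
    middle = (mul8(a_hi, b_lo) + mul8(a_lo, b_hi)) << 8
    low = mul8(a_lo, b_lo)
    return high + middle + low


product = mul16_from_mul8(1234, 5678)  # same result as 1234 * 5678
```

Signed operands need extra care with the sign of the upper halves, which this sketch leaves out.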
Figure 4 shows details of the Flex Logix EFLX4K-AI eFPGA core. Just like all EFLX cores, this is a complete FPGA with input/output pins (the small yellow squares, >1000) circling the core. The core consists of two types of logic: MLUTs, or memory-LUTs, for local weight storage, and DSP blocks, which consist of three 8-bit MACs each, pipelined in long rows for high-speed vector math. This core can be arrayed to >7×7 and mixed with other EFLX4K cores such as Logic and DSP. The EFLX4K AI core can be implemented in any CMOS process in ~6 months on customer demand. A smaller EFLX1K AI core is also available for 40nm-180nm applications.
Many companies are starting to use FPGAs to implement AI, and more specifically machine learning, deep learning, and neural networks as approaches to achieving AI. Foundational for AI are matrix multipliers, which consist of arrays of MACs. In existing FPGAs and eFPGAs, the MACs are optimized for DSP, with larger multipliers, pre-adders, and other logic that is overkill for AI. For AI applications, smaller multipliers such as 16 bits or 8 bits, with the ability to support both modes with accumulators, allow more neural network processing per square millimeter.
AI chip designers want more MACs/second and more MACs/square millimeter, but they also want the flexibility of eFPGA to reconfigure designs because AI algorithms are changing rapidly. eFPGAs enable them to switch between 8- and 16-bit modes as needed and implement matrix multipliers of varying sizes to meet their applications’ performance and cost constraints.
Courtesy: EECATALOG