cpu+FPGA: applications can soon have bespoke instructions

March 21, 2016 Derek Jones 2 comments

Compiler writers are always frustrated that the cpu they are currently targeting does not contain the one instruction that would enable them to generate really efficient code. If only it were possible to add new instructions to the cpu. Well, it looks like this will soon be possible; Intel have added an on-chip FPGA to their Broadwell processor (available circa 2017).

Having custom instructions on an FPGA (they would be loaded at program startup) is not the same as having the instructions on the cpu itself; there will be communication overhead when the data operated on by the custom instruction is transferred back and forth between the cpu and the FPGA (being on-chip means this will be low). To make the exercise worthwhile, the custom instruction has to do something that takes very many cycles on the cpu, and either speed it up or reduce the power consumed (the Catapult project at Microsoft has a rack of FPGA-enhanced machines that speed up, and reduce the power consumed by, matching search engine queries to documents).
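The trade-off between transfer overhead and cycles saved can be sketched as a back-of-envelope model. This is only an illustration: the class name, its fields and the cycle counts below are all made-up assumptions, not measurements of any real cpu+FPGA part.

```python
from dataclasses import dataclass


@dataclass
class OffloadEstimate:
    """Hypothetical cost model; all values are in cpu clock cycles."""
    cpu_cycles: float       # doing the work with ordinary cpu instructions
    fpga_cycles: float      # executing the custom instruction on the FPGA
    transfer_cycles: float  # moving operands to the FPGA and results back

    @property
    def offload_cycles(self) -> float:
        # Total cost of taking the FPGA route.
        return self.fpga_cycles + self.transfer_cycles

    @property
    def speedup(self) -> float:
        return self.cpu_cycles / self.offload_cycles

    def worthwhile(self) -> bool:
        # Only considers time; power consumed would need its own model.
        return self.speedup > 1.0


# Invented numbers: a short operation is swamped by the transfer cost,
# a long-running one is not.
small = OffloadEstimate(cpu_cycles=20, fpga_cycles=2, transfer_cycles=50)
big = OffloadEstimate(cpu_cycles=5000, fpga_cycles=100, transfer_cycles=50)

for name, est in (("small", small), ("big", big)):
    print(name, round(est.speedup, 2), est.worthwhile())
```

With these invented figures the short operation runs slower when offloaded, while the long one benefits, which is the point made above about needing very many cpu cycles of work.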

A CPU+FPGA is like CPU+GPU, except that FPGAs are programmed at a much lower level, i.e., there is little in the way of abstraction between what the hardware does and what the coder sees.

Does the world need an FPGA attached to their cpu? Most people don’t, but there are probably a few customers who do, e.g., data centers with systems performing dedicated tasks and anybody into serious bit twiddling. Other considerations include Intel needing to add new bells and whistles to its product so that customers who have been trained over the years to buy the very latest product (which has the largest margins) stay on the buying treadmill. The FPGA is also a differentiator, not that Intel would ever think of AMD as a serious competitor.

Initially the obvious use case is libraries performing commonly occurring functionality. No, not matrix multiply and inverse; FPGAs are predominantly integer operation units (there are approaches using non-standard floating-point formats that can be used if your FPGA unit does not have floating-point support).

From the compiler perspective the use case is spotting cpu intensive loops, where all the data can be held on the FPGA until processing is complete. Will there be enough of these loops to make it a worthwhile implementation target? I suspect not. But then I can see many PhDs being written on this topic and one of them could produce a viable implementation that bootstraps itself into one of the popular open source compilers.

Interpreters have to do a lot of housekeeping work. Perhaps programs written in Java or R could be executed on an FPGA that uses the cpu as a slave processor. It is claimed that most R programs spend their time in library functions that have been implemented in C and Fortran, but I’m seeing more and more code that appears to be all R. For some programs an R-machine implemented in hardware could produce orders of magnitude speed improvements.

The next generation of cryptocurrency proof-of-work algorithms are being designed to be memory intensive, so that they are ASIC-proof, i.e., they cannot be efficiently implemented using ASICs (this prevents mining being concentrated in a few groups who have built bespoke mining operations). The analysis I have seen is based on ‘conventional’ cpu and ASIC designs. A cpu+FPGA is a very different kind of beast and one that might require another round of cryptocurrency design.
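To give a feel for what a memory-intensive proof-of-work looks like, here is a minimal sketch built on scrypt, a well-known memory-hard function available as `hashlib.scrypt` in the Python standard library (when Python is built against a recent OpenSSL). The function names, the way the nonce is fed in as the salt, and the parameter values are my assumptions for illustration; this is not the scheme used by any particular cryptocurrency.

```python
import hashlib


def memory_hard_hash(header: bytes, nonce: int, n: int = 2**10, r: int = 8) -> bytes:
    # scrypt needs roughly 128 * n * r bytes of working memory per call
    # (about 1 MiB with these illustrative parameters); raising n raises
    # the memory needed by anybody trying to build dedicated hardware.
    salt = nonce.to_bytes(8, "big")
    return hashlib.scrypt(header, salt=salt, n=n, r=r, p=1, dklen=32)


def verify(header: bytes, nonce: int, difficulty_bits: int) -> bool:
    # A nonce is valid when the 256-bit hash, read as an integer,
    # has at least difficulty_bits leading zero bits.
    target = 1 << (256 - difficulty_bits)
    return int.from_bytes(memory_hard_hash(header, nonce), "big") < target


def mine(header: bytes, difficulty_bits: int) -> int:
    # Brute-force search; every attempt pays the full memory cost.
    nonce = 0
    while not verify(header, nonce, difficulty_bits):
        nonce += 1
    return nonce


if __name__ == "__main__":
    nonce = mine(b"example block header", difficulty_bits=4)
    print("found nonce", nonce)
```

Each attempt has to touch the whole scratch area, so throughput is limited by memory capacity and bandwidth rather than by how many hash units can be packed onto a chip.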

These cpu+FPGA processors have the potential to dramatically upend existing approaches to structuring programs. Very interesting times ahead!

Categories: Uncategorized Tags: compiler, cpu, FPGA, R

Will IEEE 754 become a fringe representation?

December 1, 2008 Derek Jones No comments

Many people believe that, with a few historical exceptions, the IEEE 754 standard has won the floating-point value bit-representation battle. What these people have forgotten is that money rules; customers are willing to ditch standards if it increases profit. FPGA devices can be configured to perform floating-point operations faster and more cheaply than commodity cpus.

Making optimal use of an FPGA may require using a radix of 4 and, for the time being, automatically converting back and forth to an external 754 radix-2 representation. In those cases where multiplication/division operations are more common than addition/subtraction, use of a logarithmic number system has performance benefits. For specialist scientific calculations (where cpu time is measured in days) purpose-built FPGA devices are the path to significant performance improvements. In many mass market applications the full power of a 32-bit representation is not needed and a representation using fewer bits does an acceptable job using less powerful (i.e., cheaper) hardware.
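Why a logarithmic number system favours multiplication/division can be shown with a small sketch. A value is stored as a sign plus the base-2 logarithm of its magnitude, so multiply and divide become add and subtract, while addition needs an extra function evaluation (in hardware, typically a table lookup). The `LNS` class is a hypothetical illustration; it holds the logarithm in a Python float for simplicity, whereas real hardware would use a fixed-point field, and zero and mixed-sign addition are left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LNS:
    """Toy logarithmic number: value = sign * 2**log2_mag."""
    sign: int        # +1 or -1
    log2_mag: float  # log2 of the absolute value

    @classmethod
    def from_float(cls, x: float) -> "LNS":
        if x == 0:
            raise ValueError("zero needs a special encoding, omitted here")
        return cls(1 if x > 0 else -1, math.log2(abs(x)))

    def to_float(self) -> float:
        return self.sign * 2.0 ** self.log2_mag

    def __mul__(self, other: "LNS") -> "LNS":
        # Multiplication is just an addition of the logarithms.
        return LNS(self.sign * other.sign, self.log2_mag + other.log2_mag)

    def __truediv__(self, other: "LNS") -> "LNS":
        # Division is a subtraction of the logarithms.
        return LNS(self.sign * other.sign, self.log2_mag - other.log2_mag)

    def __add__(self, other: "LNS") -> "LNS":
        # Addition is the expensive case:
        # log2(a + b) = hi + log2(1 + 2**(lo - hi)) for same-sign a, b.
        if self.sign != other.sign:
            raise NotImplementedError("mixed-sign addition omitted in this sketch")
        hi = max(self.log2_mag, other.log2_mag)
        lo = min(self.log2_mag, other.log2_mag)
        return LNS(self.sign, hi + math.log2(1.0 + 2.0 ** (lo - hi)))


a = LNS.from_float(8.0)
b = LNS.from_float(-0.5)
print((a * b).to_float(), (a / b).to_float(), (a + a).to_float())
```

The asymmetry is visible in the method bodies: `__mul__` and `__truediv__` are one add or subtract each, while `__add__` has to evaluate a logarithm and a power.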

Customer demand for higher performance and lower cost will push vendors to deliver purpose designed products. IEEE 754 may be the floating-point representation that people without spending power use because it was once designed into cpus and vendors are forced to continue to support it for backwards compatibility.

Categories: floating-point Tags: floating-point, FPGA, IEEE 754, Standard

The Shape of Code

Archive

cpu+FPGA: applications can soon have bespoke instructions
