This aspect of their new chips is massively underrated. An FPGA is the future-proof solution here, not chip-level instructions for the soup-du-jour in machine learning.
Edit: which is not to say that I'm not welcoming the new instructions with open arms...
I'm not as hyped about FPGA-in-CPU so much as I am of having Intel release a specification for their FPGAs that will allow development of third-party tools to program them.
Right now the various vendors seem to insist on their own proprietary everything which makes it hard to streamline your development toolchain. Many of the tools I've used are inseparably linked to a Windows-only GUI application.
Your lab has some interesting publications. Doing good work. Appreciate you taking time to try to solve the FPGA problem. Here's a few related works in case you want to build on or integrate with them:
I'm not too familiar with FPGAs, but isn't the tradeoff that since they are flexible they are usually much slower than CPUs/GPUs and it is usually used to prototype an ASIC? How is FPGA-in-CPU going to be a good thing?
They're slower in terms of clock speed, but they're not slower in terms of results.
You can do things in an FPGA that a CPU can't even touch, it can be configured to do massively parallel computations for example.
If Bitcoin is any example, GPU is faster than CPU, FPGA is faster than GPU, and ASIC is faster than FPGA. Each one is at least an order of magnitude faster than the other.
A GPU can do thousands of calculations in parallel, but an FPGA can do even more if you have enough gates to support it.
I haven't looked too closely at the SHA256 implementations for Bitcoin, but it is possible to not only do a lot of calculations in parallel, but also have them pipelined so each round of your logic is actually executing simultaneously on different data rather than sequentially.
One of the biggest caveats of FPGAs that I don't see mentioned often is that they're slow to program. This implies some unusual restrictions about where they can be used. I.e. data centers will benefit, general purpose computing not so much.
Well, normally what you lose in terms of clockspeed when using a FPGA you make up in being able establish hardware dataflows for increased parallelism. But I don't have a sense for whether deep learning problems are amenable to that sort of thing.
It will only really take off if they can get the user experience of getting a HDL program to the FPGA to the same level as getting a shader program to the GPU. Unless they can do all that in microcode though it's going to force them to open up some stacks of Altera's closed processes. I'm hopeful but there's a lot of proprietary baggage around FPGAs that I think have kept them from truly reaching their potential.
Almost all FPGAs have entirely proprietary toolchains, from Verilog/HDL synthesis all the way down to bitstream loaders for programming your board. These tools are traditionally very poorly built as applications, scriptable only with things like TCL, terrible error messages, godawful user interfaces designed in 1999, etc etc.[1] This makes integrating with them a bit more "interesting" for doing things like build system integration, etc.
Furthermore, FPGAs are inherently much more expensive for things like reprogramming, than just recompiling your application. Synthesis and place-and-route can take excessive amounts of time even for some kinds of small designs. (And for larger designs, you tend to do post-synthesis edits on the resulting gate netlists to fix problems if you can, in order to avoid the horrid turn around time of hours or days for your tools). The "flow" of debugging, synthesis, testing etc is all fundamentally different and has a substantially different kind of cost model.
This means many kind of "dream" ideas of on-the-fly reconfiguration "Just-in-Time"-ing your logic cells, based on things like hot, active datapaths -- is a bit difficult for all but the simplest designs, due to the simple overhead of the toolchain and design phase. That said, some simple designs are still incredibly useful!
What's more likely with these systems is you'll use, or develop, pre-baked sets of bitstreams to add in custom functionality for customers or yourselves on-demand, e.g. push out a whole fleet of Xeons, configure the cells of machines on the edge to do hardware VPN offload, configure backend machines that will run your databases with a different logic that your database engine can use to accelerate queries, etc.
Whether or not that will actually pan out remains to be seen, IMO. Large enterprises can definitely get over the hurdles of shitty EDA tools and very long timescales (days) for things like synthesis, but small businesses that may still be in a datacenter will probably want to stick with commercial hardware accelerators.
Note that I'm pretty new to FPGA-land, but this is my basic understanding and impressions, given some time with the tools. I do think FPGAs have not really hit their true potential, and I do think the shit tooling -- and frankly god awful learning materials -- has a bit to do with it, which drives away many people who could learn.
On the other hand: FPGAs and hardware design is so fundamentally different for most software programmers, there are simply things that, I think, cannot so easily be abstracted over at all -- and you have to accept at some level that the design, costs, and thought process is just different.
[1] The only actual open-source Verilog->Upload-to-board flow is IceStorm, here: http://www.clifford.at/icestorm/. This is only specifically for the iCE40 Lattice FPGAs; they are incredibly primitive and very puny compared to any modern Altera or Xilinx chip (only 8,000 LUTs!), but they can drive real designs, and it's all open source, at least.
Just as an example, I've been learning using IceStorm, purely because it's so much less shitty. When I tried Lattice's iCE40 software they provided, I had to immediately fix 2 bash scripts that couldn't detect my 4.x version kernel was "Linux", and THEN sed /bin/sh to /bin/bash in 20 more scripts __which used bash features__, just so that the Verilog synthesis tool they shipped actually worked. What the fuck.
Oh! Also, this was after I had to create a fake `eth0` device with a juked MAC address (an exact MAC being required for the licensing software), because systemd has different device names for ethernet interfaces based on the device (TBF, for a decent reason - hotplugging/OOO device attachment means you really want a deterministic naming scheme based on the card), but their tool can't handle that, it only expects eth0.
As a software programmer, my mind is blown by the idea of a 3GHz CPU. That's 3,000,000,000 cycles per second. The speed of light c = 299,792,458 meters per second. Which means that, in the time it takes to execute 1 cycle, 1/3,000,000,000 seconds -- light has moved approximately 0.0999m ~= 4 inches. Unbelievable. Computers are fast!
On the other hand... Once you begin to deal with FPGAs, you often run into "timing constraints" where your design cannot physically be mapped onto a board, because it would take too long for a clock signal (read: light) to traverse the silicon, along the path of the gates chosen by the tool. You figure this out after your tool took a long amount of time to synthesize. You suddenly wish the speed of light wasn't so goddamn slow.
In one aspect of life, the speed of light, and computers, are very fast, and I hate inefficiency. In another, the speed of light is just way too slow for me sometimes, and it's irritating beyond belief because my design would just look nicer if it weren't for those timing requirements requiring extra work.
How to reconcile this kind of bizarre self-realization is an exercise left to the reader.
It's not the speed of light that's the long pole in the timing constraint, it's the "switching" time of the circuits that make up the digital logic. You can't switch from zero to one or one to zero instantaneously. It's modeled (or at least it was when I went to college) as a simple RC circuit. Your clock can't tick any faster than the time it takes to charge the slowest capacitor in your system. That's why chips have gotten faster and faster over the years, we've reduced that charge/discharge time (not because we've sped up light).
"pre-baked sets of bitstreams" is a good way to describe what I expect to be the most likely scenario. In other words, spiritually very similar to ordinary configurable hardware (PCIe cards, CPUs, DIMMs, USB devices, etc)
Don't forget the niche nature of an FPGA, too. They are incredibly slow & power hungry compared to a dedicated CPU, so similar to the CPU/GPU split you find that only particular workloads are suitable.
Your comment represents some of the best of HN (detailed, illuminating, informative), but is incredibly depressing for someone with an idle curiosity in FPGAs. This is what I've long suspected, and it seems that the barrier to entry is generally a bit too high for "software" people.
I do think there's some light at the end of the tunnel. IceStorm is a thousand times better than any of the existing EDA tools if you just want to learn, and has effectively done a full reverse engineering of the iCE40 series. It only has a Verilog frontend, but it's a great starting point.
You can get an iCE40 FPGA for ~$50 USD, or as low as $20. It'll take you probably 30 minutes to compile all the tools and install them (you'll spend at least 2x that much time just fucking with stupid, traditional EDA tools, trying to make sense of them, and you'll still do that more than once probably), and you're ready.
The learning material stuff is much more difficult... Frankly, just about every Verilog tutorial out there, IMO, totally sucks. And Verilog is its own special case of "terrible language" aside from that, which will warp your mind.
Did you just use an undeclared wire (literally the equivalent of using an undeclared variable in any programming language)? Don't worry: Verilog won't throw an error if you do that. Your design just won't work (??????).
If you're a "traditional" software programmer (I don't know how else to put it?) who's mostly worked in languages like C, Java, etc. then yes: just the conceptual change of hardware design will likely be difficult to overcome, but it is not insurmountable.
The actual "write the design" part hasn't been too bad for me - but that's because I don't write Verilog. I write my circuits in Haskell :) http://www.clash-lang.org -- it turns out Haskell is actually an OK language for describing sequential and combinatoral circuits in very concise way that embodies the problem domain very well -- and I have substantial Haskell experience. So I was able to get moving quickly, despite being 100% software person.
There are also a lot of other HDLs that will probably take the edge off, although learning and dealing with Verilog is basically inevitable, but I'd rather get accustomed to the domain than inflict self-pain.
MyHDL, which is an alternative written in Python, seems to be an oft-recommended starting point, and can output Verilog. Someone seems to even have a nice tool that will combine MyHDL with IceStorm in a tiny IDE! Seems like a great way to start -- https://github.com/nturley/synthia
HDLs such as Verilog are fine when you keep in mind what the D stands for. They are not programming languages. They are hardware description languages. Languages used to describe hardware. In order to describe hardware, you need to picture it in your head (muxes, flops, etc.). Once you do, writing the HDL that describes that hardware is reasonably easy.
What doesn't work is using an HDL to write as if it were software. It's just not.
Sure, I don't think anything I said disagrees with that. This conceptual shift in "declarative description" of the HW is a big difference in moving from software to hardware design, and no language can remove that difference. It doesn't matter if you're using MyHDL or Haskell to do it! It just so happens Haskell is a pretty declarative language, so it maps onto the idea well.
But it's not just about that... Frankly, most HDLs have absolutely poor abstraction capabilities, and are quite verbose. Sure, it isn't a software language, but that's a bit aside from the point. I can't even do things like write or use higher order functions, and most HDLs don't have very cheap things like data types to help enforce correctness (VHDL is more typed, but also pretty verbose) -- even when I reasonably understand the compilation model and how the resulting design will look, and know it's all synthesizable!
On top of that, simply due to language familiarity, it's simply much easier for me to structure hardware descriptions in terms of things like Haskell concepts, than directly in Verilog. It wouldn't be impossible for me to do it in Verilog, though -- I'm just more familiar with other things!
At some level this is a bit of laziness, but at another, it's a bit of "the devil you know". I do know enough Verilog to get by of course, and do make sure the output isn't completely insane. And more advanced designs will certainly require me going deeper into these realms where necessary.
I've recently been writing a tiny 16-bit processor with Haskell, and this kind of abstraction capability has been hugely important for me in motivation, because it's simply much easier for me to remember, describe and reason about at that level. It's simply much easier for me to think about my 2-port read, 1-port write register file, mirrored across two BRAMs, when it looks as nice and simple as this: https://gist.github.com/thoughtpolice/99202729866a865806fd6d..., and that code turns into the kind of Verilog you'd expect to write by hand, for the synthesis tool to infer the BRAMs automatically (you can, of course, also use the direct cell library primitive).
That said -- HDL choice is really only one part of the overall process... I've still got to have post-synthesis and post-PNR simulation test benches, actual synthesizable test benches to image to the board, etc... Lots of things to do. I think it's an important choice (BlueSpec is another example here, which most people I've known to use it having positive praise), but only one piece of the overall process, and grasping the whole process is still something I'm grappling with.
I haven't used PyMTL but have used MyHDL, years ago. (I'm glad for aseipp's comments because that proprietary mess is exactly why I haven't bothered keeping up with FPGA development, though IceStorm has gotten me interested...) Glancing at PyMTL it seems they're comparable -- neither require that much code (https://github.com/jandecaluwe/myhdl). I'd say without digging deeper MyHDL is older, stabler, better documented, and has been used for real world projects including ASICs. PyMTL might allow for more sophisticated modeling. I'd be concerned about PyMTL's future after the students working on it graduate, whereas MyHDL is Jan's baby.
As someone who recently got into programming FPGAs from decades of usual software development, it's not that bad. Takes a totally different mindset, but it's much more approachable than what it seems to be.
Actually a couple of YouTube videos can get you up and running pretty fast. Verilog is quite simple and doing things like writing your own VGA adapter is pretty straightforward and teaches you a lot and is a lot of fun.
Anecdote evidence, but I am currently learning FPGA and VHDL, and find it not more difficult than "normal" C. Way more more expressive, for sure: a complete 8 channel PWM controller under 150 lines. Parallelism without locking feels very cool.
I think VHDL makes things easier. Verilog has some fundamental nondeterminism: http://insights.sigasi.com/opinion/jan/verilogs-major-flaw.h... Still, HDL in general is probably on the order of difficulty as getting used to pointers. If you've already done a bunch of work with breadboards and logic gates it's even easier to grok. Of course a lot of people just instantiate a CPU core on the FPGA and program in C...
My understanding (and please correct me if I'm wrong!) is that it's pretty easy to burn out FPGAs, electrically, in a way that it's not particularly trivial to do with other sorts of circuitry. So one downside of exposing your bitstreams is people are going to blow things up and demand new chips.
There's no evidence so far that FPGAs come anywhere close to GPUs w/r to deep learning performance. All the benchmarks so far, through Arria 10, show it to be mediocre for inference, and the lack of training benchmark data IMO implies it's a disaster for that task. See also Google flat out refusing to define what processors they measured TPU performance and efficiency against.
FPGAs are best when deployed on streaming tasks. And one would think inference would be just that, yet the published performance numbers are on par with 2012 GPUs. That said, if they had as efficient a memory architecture as GPUs do, things could get interesting down the road. But by then I suspect ASICs (including one from NVIDIA) will be the new kings.
GPUs have GDDR5 and that is primarily what allows them to dominate in so many applications. Many of them are primarily memory-bound and not computation-bound. This means that the super-fast GDDR memory and the algorithms which can do a predictable linear walk through memory get an enormous speed boost over almost anything else out there.
> But by then I suspect ASICs (including one from NVIDIA) will be the new kings.
Yes, I suspect Nvidia has been developing/prototyping a Deep Learning ASIC for some time now. The power savings from an ASIC (particularly for inference) are just too massive to ignore.
Nvidia also seems to be involved in an inference only accelerator from Stanford called EIE (excellent paper here - https://arxiv.org/pdf/1602.01528v2.pdf).
Oh, these are entirely different approaches. New instructions are something I can use immediately (well, after I get the CPU). I know how to feed data to them, I can debug them, they are predictable, my tools support them, overall — I can make good use of them from day one.
As for FPGAs, anybody who actually used one, especially for more than a toy problem, will tell you that the experience is nowhere near that. Plus, at the moment, we are comparing a solution that could possibly perhaps materialise with something that will simply appear as part of the next CPU.
FPGAs sound amazing, but if you work with them you learn they can be a real PITA. The vision of the dynamically shapeshifting coprocessor FPGA is a long way in the future.
Edit: which is not to say that I'm not welcoming the new instructions with open arms...