Every fall semester, I teach a course called ‘Hardware/Software Codesign’. The course has been on the books since 2006, but its content has hardly been static. Every year there are new things to talk about, and most recently this has been driven by the availability of FPGA-SoC chips: Field Programmable Gate Arrays with a hard-core ARM processor. Such chips are programmed with ‘hardware’ and ‘software’ at the same time: they are configured with a hardware design in the FPGA fabric, while they are also running an embedded operating system on the dual-core ARM. So you get a rather sophisticated embedded software environment that can be customized with hardware acceleration. This is an incredibly interesting programming model, and it is a sign of things to come in the post-Moore world. The students in this course are a mixed population of senior undergraduate students and early graduate students. This year we had a record enrollment of 65 students.
The Hardware/Software Codesign course covers basic hardware/software interfacing techniques, combined with an in-depth discussion of the connections between the hardware and software levels of design abstraction. The course also emphasizes practical application and hands-on design. Teaching this creative aspect of design is among the most interesting features of the course, certainly to me as an instructor but also (presumably) to the students.
The final assignment in this course is the Codesign Challenge, an open-ended performance optimization assignment. The starting point is a reference implementation in C on the ARM core of the FPGA-SoC. The students then have to improve the design performance against a fixed testbench while staying within the constraints of the FPGA-SoC platform. Our platform is a DE1-SoC board, which has an 800 MHz dual-core ARM and an Altera Cyclone V FPGA fabric with 85K programmable logic elements. Performance is evaluated as wall-clock time for a given testbench running on the ARM. Speedup is measured relative to the execution time of the reference testbench. The assignment is called the Codesign Challenge because it is competitively graded: the fastest design gets the most points. To the students, this appears both fascinating and discomforting. Isn’t there simply one correct solution that leads to a full grade? No, there isn’t. The only solution that deserves a full grade is the best solution.
This year’s Codesign Challenge is a correlation problem. Correlation is a well-known signal processing operation used to identify a known signal within another signal. It is used for radar signal processing, GPS, image processing, differential power analysis, and many other applications.
The reference design uses the ARM as well as the FPGA fabric. Memory on the FPGA is filled with 1,024 waveforms of 128 samples of one byte each. Each of these 1,024 waveforms has to be correlated with a reference 16-sample waveform. The correlation operation between such a search waveform and the reference involves computing the dot product between the two waveforms at every possible time-shift. So we need 16*128 = 2,048 multiplications to compute the correlation for a single waveform, and we need 2 million multiplications to do it for all 1,024 waveforms. Finally, the assignment required a peak detection per search waveform, so that you only return the index of the correlation peak in every waveform. We thus obtain 1,024 peak index values.
The implementation of this correlation problem in C has three nested loops: an outer loop over the 1,024 waveforms, a middle loop over the 128 time-shifts within a waveform, and an inner loop to compute the dot product of the 16 samples of the reference with 16 samples of the search waveform.
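The triple-loop structure can be sketched as follows. This is a minimal illustration, not the actual reference code: the names are mine, and I assume a wrap-around convention at the end of each waveform (the real assignment may handle the boundary differently).

```c
#define NUM_WAVES 1024
#define WAVE_LEN  128
#define REF_LEN   16

/* Illustrative triple-loop correlation with peak detection.
   'search' holds the 1,024 waveforms; 'ref' is the 16-sample
   reference; 'peak_index' receives one peak position per waveform. */
static void correlate(const unsigned char search[NUM_WAVES][WAVE_LEN],
                      const unsigned char ref[REF_LEN],
                      int peak_index[NUM_WAVES])
{
    for (int w = 0; w < NUM_WAVES; w++) {          /* outer: waveforms    */
        int best = 0, best_shift = 0;
        for (int s = 0; s < WAVE_LEN; s++) {       /* middle: time-shifts */
            int acc = 0;
            for (int k = 0; k < REF_LEN; k++)      /* inner: dot product  */
                acc += ref[k] * search[w][(s + k) % WAVE_LEN];
            if (acc > best) { best = acc; best_shift = s; }
        }
        peak_index[w] = best_shift;
    }
}
```

The 16 * 128 * 1,024 = 2 million multiplications of the assignment are exactly the iterations of the innermost statement.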
The reference implementation also has this additional quirk: the 1,024 waveforms are stored in a byte-wide memory on the FPGA. This mapping does not affect the functionality of the C program. The FPGA memory is mapped into the address space of the C program through an mmap system call. This mapping introduces a significant performance bottleneck. First, the byte-wide single-port FPGA memory returns only a single sample per read operation, rather than the 32-bit or 64-bit words we would expect from a well-designed main memory. It does not support burst access or other advanced read mechanisms. Second, the memory-mapped FPGA accesses fall outside of the memory hierarchy, which means that they are not cached. Finally, the FPGA memory is attached to the ARM processor through a complex bus structure involving multiple levels of bus hierarchy and multiple standards (AXI and Avalon). These three elements in combination slow down the access speed of the FPGA memory to a couple hundred ARM clock cycles per memory access.
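The memory mapping itself is standard Linux fare. Here is a hedged sketch of how such a mapping is typically set up; the helper name is mine, and on the actual board the device would be /dev/mem with a base address determined by the Qsys configuration of the DE1-SoC project, not the placeholder values shown here.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map 'span' bytes of a memory-like device at offset 'base' into this
   process's address space. On the DE1-SoC one would pass "/dev/mem"
   and the FPGA bridge base address. Because the FPGA region is mapped
   uncached, every byte read through the returned pointer travels the
   full AXI/Avalon bridge round trip. */
static volatile unsigned char *map_region(const char *path,
                                          off_t base, size_t span)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, span, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, base);
    close(fd);                     /* the mapping stays valid after close */
    return (p == MAP_FAILED) ? NULL : (volatile unsigned char *)p;
}
```

On the board, each `ram[i]` access through such a pointer is one of those couple-hundred-cycle reads; nothing in the C source hints at the cost.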
The effect of such a slow memory on the reference implementation is a performance disaster: the execution time of the reference implementation is over 460 million clock cycles! This means that every iteration of the inner loop requires 230 ARM clock cycles – a huge number for such a simple inner loop. The detailed assignment of the Codesign Challenge, as well as the reference implementation, are available on GitHub.
The students have three weeks to work on this design problem (4 weeks if you count the Thanksgiving break). Furthermore, we spent time in class every week to discuss possible optimization strategies. More on that later. Let’s first look at the results.
This graph shows the speedup obtained by the students on a logarithmic scale. The fastest designs are over 1,000 times faster than the reference implementation. The winning design has a speedup of 4,288. A large portion of the students obtained a speedup of 1.2 times, which is what you get when you optimize the reference using compiler optimization (O3). About one-third of the class achieved a better result, reaching speedups of 10x to 1000x, depending on the ingenuity of the design and the level of optimization.
At first sight, this graph looks discouraging, since there are so few students who manage to increase the performance of the reference, while over two thirds appear to be stuck at 1.2x. However, it would be wrong to conclude that students with a small speedup failed the assignment. On the contrary, a result like this is to be expected for several reasons.
Design, and optimization, is an experience-based skill. While all of the students have the basic skill to run a C compiler or an FPGA synthesis tool, knowing how to apply a skill is much harder. Knowing how to drive a car does not make you a Formula 1 racer; knowing how to hammer a nail does not make you a carpenter. The contribution of an assignment such as this one is to show students that it’s possible to become a Formula 1 racer (if you can drive a car) or a woodworking wizard (if you can hammer a nail). But you don’t have to be one from day one!
The assignment is very difficult. This is because a performance optimization like this one is not limited by a single bottleneck but by a series of interdependent bottlenecks. If you inspect the inner loop closely, you can see that there are no data dependencies between any of the iterations. In principle, it’s possible to compute 1024 * 128 * 16 multiplications in parallel, and finish the entire problem in a single cycle! Of course, this would require an insane number of multipliers (two million) and fully parallel memory access. Both of these requirements exceed the capabilities of the FPGA. Performance optimization thus becomes a balancing act that needs to gradually improve the computation, communication, control and storage features of the architecture. This is complex and requires insight across the design abstraction levels in hardware and software.
The design time for the Codesign Challenge is limited to three weeks. For senior students, who take several additional courses, this is a very, very short time, especially at the end of the semester. In this respect, it’s not a surprise that many solutions consist of simple compiler optimization. Furthermore, the students submit a report in addition to their result, explaining their view of the optimization problem. From these reports, it’s clear that the students understand the key bottleneck in the design, but they struggle with the many practical details of design implementation.
One-third of the students found solutions that were significantly faster than the reference implementation. Here are some of the ideas they proposed.
Increase the port width of the FPGA RAM from 8 bit to 32 bit or 64 bit. This will drastically increase the memory bandwidth (by 4X or 8X).
Use multi-port memories on FPGA, and partition the FPGA RAM into several smaller modules that can be concurrently accessed.
Use a coprocessor in the FPGA that can compute correlations. This is difficult because the FPGA runs 16 times slower than the ARM (50 MHz vs. 800 MHz). To be effective, the coprocessor has to compute at least an entire reference correlation at a time (16 samples).
Use Direct Memory Access to avoid accesses on the FPGA RAM by the ARM. This strategy is especially effective with the previous one since it can eliminate the ARM entirely while an inner loop is computed.
Increase the FPGA clock, from 50 MHz to 100 MHz or higher. This optimization gives another 2X to 3X of performance at relatively low complexity cost to the design.
Use threading on the ARM, to benefit from the dual-core architecture. The throughput of the C program can be doubled by running two concurrent threads of the outer loop, one on each core.
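The last idea, splitting the outer loop over the two ARM cores, can be sketched with POSIX threads. This is an illustration under my own assumptions (array and function names are mine, and the waveforms are assumed to have been copied into ordinary cached ARM memory first); since the waveforms are independent, the two halves need no locking.

```c
#include <pthread.h>

#define NUM_WAVES 1024
#define WAVE_LEN  128
#define REF_LEN   16

static unsigned char search[NUM_WAVES][WAVE_LEN];  /* in ARM DRAM, cached */
static unsigned char ref[REF_LEN];
static int peak_index[NUM_WAVES];

struct slice { int first, last; };   /* half-open range of waveforms */

/* Each thread correlates its own slice of the 1,024 waveforms. */
static void *worker(void *arg)
{
    const struct slice *sl = arg;
    for (int w = sl->first; w < sl->last; w++) {
        int best = 0, best_shift = 0;
        for (int s = 0; s < WAVE_LEN; s++) {
            int acc = 0;
            for (int k = 0; k < REF_LEN; k++)
                acc += ref[k] * search[w][(s + k) % WAVE_LEN];
            if (acc > best) { best = acc; best_shift = s; }
        }
        peak_index[w] = best_shift;
    }
    return NULL;
}

static void correlate_threaded(void)
{
    pthread_t t;
    struct slice lo = { 0, NUM_WAVES / 2 };
    struct slice hi = { NUM_WAVES / 2, NUM_WAVES };
    pthread_create(&t, NULL, worker, &hi);   /* second core */
    worker(&lo);                             /* this core   */
    pthread_join(t, NULL);
}
```

With two cores and no shared state between slices, the expected throughput gain is close to 2X, which matches what the students reported.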
It is clear that navigating this optimization space is complex. There are many opportunities to trip up and to get stuck with a sub-optimal solution. The range of solutions identified by the best designs in the Codesign Challenge is truly impressive, and it reflects the optimization space in great detail. The fastest designs are all-hardware designs. The two winning designs, in fact, are bottom-up hardware designs that maximize the number of data multiplications that can be achieved in parallel. These designs compute 128 parallel data-multiplications per cycle and fine-tune the movement of data between memory and the coprocessor. However, the fastest all-software designs are very clever as well. The two fastest software designs reach an acceleration of about 70 times. These designs use the insight that it’s better to move the 128 Kilobyte of search waveforms from the FPGA to the ARM memory hierarchy, and optimize it there using threads and efficient programming. These software designs take advantage of the fact that the ARM executes 16 clock cycles for every FPGA clock cycle.
There were also very clever design tricks, such as using Fast Fourier Transforms to compute the correlation, and using systolic hardware architectures to parallelize the computation together with the communication.
Unlike previous years, I did not leave it to the students to figure out all of these complex issues by themselves. Instead, we spent several lectures talking about optimization strategy. For example, we discussed the design of on-chip memory communication systems, the use of Direct Memory Access, and the analysis of the data-flow and memory access patterns of the application.
This particular problem, as it turns out, is best thought of as a data movement problem, not as a computation problem. If you study the inner loop in detail, you’ll see that every complete inner loop reads 16 samples from the search waveform. This is done in an incremental fashion, such that we first read samples 0 to 15, then samples 1 to 16, then samples 2 to 17, and so forth. It’s possible to map the data structure needed for the inner loop into a shift register, which is shifted one position for each iteration of the middle loop. This is the key to implementing a coprocessor which can efficiently compute 16 data-multiplications in parallel. A sample design for this idea is available on GitHub.
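A small C model makes the shift-register idea concrete. This is my own sketch, not the sample design: in actual hardware the inner loop below would be 16 multiply-accumulate units firing in one clock, and I again assume a wrap-around convention at the end of the waveform.

```c
#define WAVE_LEN 128
#define REF_LEN  16

/* Software model of the coprocessor datapath: a 16-entry shift register
   holds the current window of the search waveform. Each middle-loop step
   shifts in one fresh sample and fires all 16 multipliers at once
   (modeled here by the inner loop). Returns the peak index. */
static int correlate_peak(const unsigned char wave[WAVE_LEN],
                          const unsigned char ref[REF_LEN])
{
    unsigned char win[REF_LEN];
    int best = 0, best_shift = 0;

    /* Preload the shift register with samples 0..14. */
    for (int k = 0; k < REF_LEN - 1; k++)
        win[k] = wave[k];

    for (int s = 0; s < WAVE_LEN; s++) {
        /* One new sample enters per step; the window is wave[s..s+15]. */
        win[REF_LEN - 1] = wave[(s + REF_LEN - 1) % WAVE_LEN];

        int acc = 0;
        for (int k = 0; k < REF_LEN; k++)   /* 16 parallel MACs in HW */
            acc += ref[k] * win[k];
        if (acc > best) { best = acc; best_shift = s; }

        /* Shift the register one position for the next time-shift. */
        for (int k = 0; k < REF_LEN - 1; k++)
            win[k] = win[k + 1];
    }
    return best_shift;
}
```

The point of the model: only one new sample per middle-loop iteration has to cross the memory interface, while the other 15 are already inside the datapath. That is exactly why the problem is a data movement problem.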
Finally, executing a design space exploration as a set of parallel designs is truly insightful. Simply by the numbers, the Codesign Challenge is able to test more solutions than any single designer would reasonably conceive. The Codesign Challenge also helps us to make sense of the value of optimization at different levels of abstraction - for example, it allows us to compare the benefit of architectural optimization versus software optimization.