A Hardware Compiler for Data-streaming Reconfigurable Architectures

Grigorios Magklis
University of Rochester


 
1 Introduction

I am currently working towards my Ph.D. at the University of Rochester with my advisor, Professor Michael Scott. My current research is on Complexity Adaptive Processing. This summer I worked with Laurent Moll on building a compiler for the Sepia board.

2 Sepia

Sepia consists of a reconfigurable device (a Xilinx FPGA) with a PCI connection to the host machine and a network interface for connecting to other machines or Sepia boards. It is currently used to merge partially rendered images into a final image. In the current configuration, each board accepts two streams of data (one from the local machine and one from the network), each representing part of an image, and outputs a data stream that is a combination of the two inputs. The combining function can be anything from alpha-channel blending to z-buffer comparison.
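As a concrete example, a z-buffer comparison combiner could be prototyped on the host roughly as follows. This is only a hypothetical sketch in plain C; the pixel layout and the "smaller depth is closer" convention are assumptions for illustration, not Sepia's actual interface.

    /* Hypothetical sketch of a z-buffer comparison combiner: for each
       pair of incoming pixels, keep the fragment that is closer to the
       viewer.  The struct layout is an assumption for illustration. */
    struct fragment {
        unsigned color;   /* packed RGBA value */
        unsigned depth;   /* z value           */
    };

    struct fragment combine(struct fragment local, struct fragment net)
    {
        /* smaller depth value is assumed to mean closer to the viewer */
        return (local.depth <= net.depth) ? local : net;
    }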

3 Hardware Configuration

The process of configuring the hardware involves many steps. First, the algorithm is described and tested in some high-level language, such as C. Then the appropriate hardware is designed by hand. After this, the hardware is described in a hardware description language (HDL) and tested again. Finally, the FPGA code is compiled and "downloaded" to the board.

The goal of this project is to simplify this process by allowing the user to specify the algorithm in a high-level language and having the compiler generate the final FPGA code automatically.
 

4 The Compiler

Our compiler understands a very simplified C-like language. The language does not have pointers or arrays, because of the difficulty of mapping them efficiently to hardware resources. It also lacks complex types (struct, union) and support for functions other than main, because there was not enough time to implement them; these features can easily be added in the future. Loops (for, while, do) are also not allowed, because of the nature of the computation: the hardware operates on streams of data and has to generate a result on every clock cycle.
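To illustrate these restrictions, the z-buffer combiner of Section 2 might be written for our compiler roughly as in the hypothetical example below: a single main function, scalar variables only, and a conditional instead of any loop, so that one output is produced for every set of inputs. The convention of passing the stream inputs as parameters of main and returning the output value is an assumption made for illustration.

    /* Hypothetical program in the simplified C-like input language. */
    int main(int local_color, int local_depth, int net_color, int net_depth)
    {
        int result;
        if (local_depth <= net_depth)
            result = local_color;
        else
            result = net_color;
        return result;
    }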

When designing the compiler we decided to follow traditional compiler techniques. The only difference is in the intermediate language's instruction set: instead of representing an ISA, it represents the available hardware resources. The first stages of the compiler are similar to those of a common C compiler. It first generates a parse tree and then builds the intermediate representation (IR) in static single assignment (SSA) form. The only interesting difference is that the generated code is one large basic block. This is possible because the IR contains a "select" instruction, similar to the C "?:" operator.
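As a sketch of what this looks like, the if/else of the example above is flattened into straight-line code: both arms are evaluated and a select chooses between their values, much as if the conditional had been rewritten with "?:". The rendering below is hypothetical C-like pseudo-code; the temporaries t1..t3 stand for SSA names and "?:" stands for the IR's select primitive.

    /* Hypothetical rendering of the single-basic-block SSA form. */
    int t1 = local_depth <= net_depth;  /* condition                  */
    int t2 = local_color;               /* value of the "then" branch */
    int t3 = net_color;                 /* value of the "else" branch */
    int result_1 = t1 ? t2 : t3;        /* select(t1, t2, t3)         */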

The next step is optimizing the IR. Here we also follow traditional compiler techniques, except that we now optimize for code size rather than speed. The optimizations performed, in order, are: copy and constant propagation, constant expression evaluation, expression simplification, and unreachable code elimination.
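As a hypothetical illustration of how these passes shrink the IR:

    /* Before optimization (hypothetical fragment): */
    int t1 = 8;
    int t2 = t1 * 4;    /* constant expression: becomes 32 after
                           propagation and folding                   */
    int t3 = x + 0;     /* simplifies to x                           */
    int t4 = t3 & t2;

    /* After copy and constant propagation, constant expression
       evaluation and expression simplification: */
    int t4 = x & 32;    /* t1..t3 are no longer referenced
                           and are removed                           */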

The final step of the compilation is to produce location information for each primitive (IR instruction). This is the most important step, since it decides the performance, i.e. the clock frequency, of the resulting hardware. Unfortunately I did not have time to implement this step. The "scheduling problem", as it is called, is NP-hard, and so far people have used a number of heuristics to solve it. We are thinking of using the datapath information we get from the IR to generate placement information. This can be done by first identifying which IR instructions belong to each pipeline stage of the resulting hardware and then scheduling (i.e. generating location information for) each stage separately. This reduces the problem from two dimensions to one.
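One possible way to identify the stages, sketched below under the assumption that every primitive takes one clock cycle, is to compute for each IR instruction its longest-path depth from the stream inputs; instructions at the same depth belong to the same pipeline stage and can then be placed as a one-dimensional column. The data structure and field names are invented for illustration; this is a sketch of the idea, not implemented code.

    /* Hypothetical sketch: assign each IR instruction a pipeline stage
       equal to its longest-path depth from the stream inputs, assuming
       every primitive takes one clock cycle.  "stage" is assumed to be
       initialized to 0 for every instruction before the pass runs. */
    #define MAX_OPS 3

    struct instr {
        struct instr *ops[MAX_OPS]; /* operand definitions, NULL if unused */
        int stage;                  /* pipeline stage, filled in here      */
    };

    int assign_stage(struct instr *i)
    {
        int k, s, max = 0;
        if (i == NULL)
            return 0;               /* operand is a stream input (stage 0) */
        if (i->stage > 0)
            return i->stage;        /* already visited                     */
        for (k = 0; k < MAX_OPS; k++) {
            s = assign_stage(i->ops[k]);
            if (s > max)
                max = s;
        }
        i->stage = max + 1;         /* one cycle after its latest operand  */
        return i->stage;
    }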