The MIPS processors, particularly the newer ones, can execute a substantial amount of real-time signal processing code. But as with all processors, writing efficient code requires an understanding of the processor and compiler architectures. This list is intended to give you some things to think about.
First, my disclaimer. I've based this list upon a number of "gotchas" that I and other audio & image-processing developers have hit. This is by no means a complete list, but it's hopefully got some useful hints. It's a new list and I'm constantly refining it and adding things. If you have your own "gotchas" or suggestions, please send them in!
Thanks to Bruce Karsh, Gints Klimanis, Chris Pirazzi, and Bryan James for their review and contributions.
    Bit-Swizzling & Logical Ops    Integer only
    Addition                       Integer faster than FP (older processors);
                                   FP faster than integer (R5K/R10K/R8K and newer)
    General Multiplication         FP faster than integer
    Division                       FP is faster.  But if you can replace
                                   division by multiplication, do so;
                                   multiplication is faster.
    Conditionals                   Integer is faster
From this, you can sort of see that numerical computations are best done in floating-point, and logical operations are best done in integer. From a computation perspective, floating-point usually has a much larger dynamic range than integer, so if you are scaling or mixing data in your computations, you seldom need to check the results for overflow.
Another thing to consider is that the FP unit has its own register set. When you use FP, you effectively get a much larger register set to use, because your logical code uses the integer registers, whereas the computations use the FP registers. With more registers, some loops may require fewer load and store operations.
Finally, when you use floating-point instructions, you take advantage of more parallelism in the processor. For example, the R5K can do an integer and floating-point operation simultaneously. If you use floating-point, the compiler will usually find an integer operation (like a load or store) to perform simultaneously. The R10K allows even more instruction-level parallelism.
In general, for the typical signal-processing application, which requires a lot of multiply/adds of variables, floating-point offers superior performance. But as always, it depends upon the architecture of the application; your mileage may vary.
This is all somewhat processor-dependent. For example, the R4400 is better at floating-point than the R4600. If your code is targeted mainly at the R4600, the optimal balance of FP to integer operations will differ from that on the R4400. But most of the newer processors provide superior floating-point; in general, that's the way to go for numerical computations.
2a. Another note: one common speed-up for integer operations is to use table-lookup, where possible. This often works well for image processing, where the input is 8- or 10-bit data and the tables are small enough to stay in the cache. For audio operations, table-lookup is less common, since audio data is typically 16-24 bits.
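As an illustration of the table-lookup idea, here is a minimal sketch: an 8-bit response curve (a simple squaring curve here, roughly gamma 2.0) is precomputed once, so each pixel costs a single table load instead of an arithmetic expression. The names and the particular curve are just illustrative.

```c
/* 256-entry lookup table: fits easily in the cache for 8-bit data. */
static unsigned char curve_lut[256];

/* Precompute the curve once, up front. */
void build_curve_lut(void)
{
    for (int i = 0; i < 256; i++)
        curve_lut[i] = (unsigned char)((i * i) / 255);  /* result fits in 0..255 */
}

/* Apply the curve: one table load per pixel in the inner loop. */
void apply_curve(unsigned char *pix, int n)
{
    for (int i = 0; i < n; i++)
        pix[i] = curve_lut[pix[i]];
}
```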
Also, be aware: in some cases, the compiler may do type conversion "behind your back," particularly by promoting single-precision to double-precision floating-point (hint #7).
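One classic case: in C, an unsuffixed constant like 0.5 has type double, so mixing it with float data promotes the whole expression to double precision on every iteration. A small sketch of the difference (function names are illustrative; both versions compute the same values, but the second stays in single precision):

```c
/* The constant 0.5 is a double, so each iteration converts
 * float -> double, multiplies, and converts back. */
void scale_slow(float *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] = buf[i] * 0.5;
}

/* The 'f' suffix keeps the constant -- and the whole
 * computation -- in single precision. */
void scale_fast(float *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] = buf[i] * 0.5f;
}
```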
    MIPS 1    R3000 instructions
    MIPS 2    R4000 instructions
    MIPS 3    R4000 instructions -- introduces 64-bit operations
    MIPS 4    R8000/R10000 instructions -- notably floating-point
              multiply/add, conditional FP move
I've noticed that one big benefit of going from MIPS1->MIPS2 is a reduction in the cost of FP->integer conversion. For MIPS1, the compiler will generate some inline code to perform this; for MIPS2, the code is replaced by a single instruction. In general, though, you should not expect a huge performance gain going MIPS1->MIPS2.
MIPS3 instructions essentially introduce native 64-bit integer operations. These can be useful in some cases. One example I've seen is a phase accumulator (e.g. from a wavetable synthesizer). The operations on this are typically add (to add the delta from one sample to the next), convert to integer (to compute the sample array index), comparisons (to determine whether or not the loop-point has passed), and subtractions (to bring the accumulator back to the beginning of the loop). Looking in the table under hint #2, we see that this probably works faster with integer data types (fixed-point). However, 32-bit fixed-point may not have enough precision for the task. Using MIPS3 would enable you to have a native 64-bit fixed-point phase accumulator (coded as a "long long" type in C).
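A sketch of such an accumulator, using a hypothetical 32.32 fixed-point layout (the integer sample index in the top 32 bits, the fractional phase in the bottom 32; all names here are illustrative):

```c
/* 32.32 fixed-point phase for a wavetable oscillator ("long long"
 * maps to a native 64-bit integer when compiling -mips3 or above). */
typedef long long fix32_32;

#define FIX_ONE ((fix32_32)1 << 32)

/* Advance the phase by delta; wrap back by the loop length when the
 * loop-point has been passed. */
fix32_32 advance_phase(fix32_32 phase, fix32_32 delta,
                       fix32_32 loop_start, fix32_32 loop_end)
{
    phase += delta;                        /* 64-bit integer add      */
    if (phase >= loop_end)                 /* comparison              */
        phase -= (loop_end - loop_start);  /* subtract loop length    */
    return phase;
}

/* Sample array index = integer part of the phase. */
int phase_to_index(fix32_32 phase)
{
    return (int)(phase >> 32);
}
```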
The MIPS4 instruction set provides some nice opportunities for signal-processing applications. It provides an FP multiply/add instruction and floating-point conditional-move instructions. These last allow one to write branch-free limiting code, which is a big win in some cases (see #5).
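For instance, limiting (clamping) a sample can be written with conditional expressions instead of if-statements; a compiler targeting MIPS4 can then emit conditional moves rather than branches. A minimal sketch:

```c
/* Clamp a sample to [-limit, limit] with no explicit branches in the
 * source.  On MIPS4, each ?: below can become a floating-point
 * conditional move instead of a conditional branch. */
float limit_sample(float x, float limit)
{
    x = (x >  limit) ?  limit : x;
    x = (x < -limit) ? -limit : x;
    return x;
}
```

Whether the compiler actually uses conditional moves depends on the compiler release and optimization level; as always, disassemble a critical loop to check.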
With each instruction set, you may get a performance gain at the cost of compatibility with older MIPS-architecture processors. In other words, if you compile -mips2, you may see some speedup, but you lose R3000 compatibility. This is a trade-off you will have to consider. Some people maintain multiple versions of a binary using different ISAs. (If your application uses "inst" for installation, you can have "inst" automatically select the appropriate binary for the end-user's machine).
MIPS3 and MIPS4 are only supported on IRIX6.X, and MIPS4 requires an R8K or R10K processor.
There's really no substitute for this. Looking at the source-code and thinking about the algorithms at a high-level is a first step, but in the end if you need to squeeze every bit of performance out of your code, you need to examine the actual instructions that the CPU will execute.
Another reason to avoid branches is that many of the compiler optimizations are limited to basic blocks, i.e. the region between two branches. Removing branches often gives the compiler a better chance to optimize the code. Finally, branches inside loops reduce the effectiveness of loop unrolling.
Organize your code and your data to optimize your use of the cache. Minimize your inner loops and working sets of data. For example, in image-processing, "tiling" is a common practice. Suppose a number of consecutive operations are to be performed on an image. The "obvious" way would be to perform each operation on the entire image, then proceed to the next operation. However, assuming the image doesn't fit in the cache, this gets the worst possible cache performance, since the image has to be reloaded from main memory every time it is touched. A far superior approach is to operate on small "tiles" which fit in the cache: perform all the operations on one tile, then proceed to the next.
There are lots of other tricks to improve cache performance, but hopefully this short example will get you thinking. If you need data in a critical loop, try to make sure it's in the cache every time the loop is executed.
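As a sketch of the tiling idea, here are two hypothetical per-pixel passes applied one cache-sized strip at a time, so the strip is still resident in the cache when the second pass touches it (the operation names and tile size are purely illustrative):

```c
#define TILE 4096   /* pick a size that fits comfortably in the cache */

/* Two toy per-pixel operations standing in for real image passes. */
void brighten(unsigned char *p, int n)
{
    for (int i = 0; i < n; i++)
        if (p[i] < 255)
            p[i]++;
}

void invert(unsigned char *p, int n)
{
    for (int i = 0; i < n; i++)
        p[i] = 255 - p[i];
}

/* Run both passes over one tile before moving to the next, instead of
 * running each pass over the whole image. */
void process_tiled(unsigned char *image, int npixels)
{
    for (int off = 0; off < npixels; off += TILE) {
        int n = (npixels - off < TILE) ? npixels - off : TILE;
        brighten(image + off, n);   /* both passes touch the same  */
        invert  (image + off, n);   /* tile while it is cached     */
    }
}
```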
Moreover, the optimal code scheduling differs from processor to processor. By writing in C (or another high-level language), and telling the compiler to generate code for the appropriate target architecture, you make it much easier to optimize for different architectures.
    for (i=0; i < 50; i++) {
        array[i]=0;
    }

    int *p=array;
    for (i=0; i < 50; i++) {
        *p++=0;
    }
Both are functionally equivalent if the value of p is not used later in the second case. Some people shy away from the first approach on the mistaken belief that the address generation for the array indexing is computationally expensive. In fact, the compiler generates code for the first version which has 20% fewer instructions per iteration than the code for the second version. Why? The compiler often generates better code for simpler expressions, because it can better understand what the code is trying to accomplish. In the first case, the compiler sees that the code is looping on an array index. It calculates the end address of the array (200 bytes past the base, since it is an array of 50 4-byte integers), and generates a loop which has only one variable -- the address in the array. "i" as such is nowhere to be seen. The loop requires four instructions per iteration:
    [test1.c:   5] 0x4009a8:  248500c8        addiu   a1,a0,200
    [test1.c:   5] 0x4009ac:  24840004        addiu   a0,a0,4
    [test1.c:   5] 0x4009b0:  0085082b        sltu    at,a0,a1
    [test1.c:   5] 0x4009b4:  1420fffd        bne     at,zero,0x4009ac
    [test1.c:   6] 0x4009b8:  ac80fffc        sw      zero,-4(a0)
The second example is too "smart" for the compiler. The compiler can't quite figure out what the programmer is trying to accomplish, so it generates two loop variables, one for "i" and one for "p" (even though "p" is never used after the loop). This example requires 5 instructions per iteration, since it has to increment two loop variables.
    [test2.c:   4] 0x4009a8:  24020032        li      v0,50
    [test2.c:   6] 0x4009ac:  00002025        move    a0,zero
    [test2.c:   6] 0x4009b0:  24840001        addiu   a0,a0,1
    [test2.c:   6] 0x4009b4:  0082082a        slt     at,a0,v0
    [test2.c:   7] 0x4009b8:  ac600000        sw      zero,0(v1)
    [test2.c:   6] 0x4009bc:  1420fffc        bne     at,zero,0x4009b0
    [test2.c:   7] 0x4009c0:  24630004        addiu   v1,v1,4
(If you're not familiar with MIPS assembly, note that the instruction after the branch [bne] is executed before the branch actually occurs. This is the so-called "branch delay slot".)
Some folks have tested numerous ways of stating loops to see for which the compiler generates the fastest code. This is an interesting experiment, but it may not hold true from compiler release to compiler release. However, the general principle holds: don't be afraid to state your loops in simple terms. Often the compiler does best with the simplest constructs.
    for (i=0; i < x; i++) {
        array[i]=0;
    }
We've seen above (#9) that the compiler generates somewhat smart code for this. For each iteration of the loop, the pointer is incremented and tested against its end value. But we can cut down on the number of instructions per array item if we write the loop like this:
    for (i=0; i < x; i+=4) {
        array[i]=0;
        array[i+1]=0;
        array[i+2]=0;
        array[i+3]=0;
    }
This loop is faster than the first one. The trick is that the pointer increment and test-against-end-value need only be done once for every 4 array elements, instead of once for every array element. There's no extra computation overhead associated with the (i+n) indices, because these fold into the MIPS base+offset addressing mode: each store uses the same base register with a different constant offset. Also, since there are fewer iterations, there are fewer branches. As noted above, a pipelined processor likes to execute instructions linearly (#5). The compiler-generated code looks like:
    [test3.c:   6] 0x4009a8:  248500c8        addiu   a1,a0,200
    [test3.c:   6] 0x4009ac:  24840010        addiu   a0,a0,16
    [test3.c:   6] 0x4009b0:  0085082b        sltu    at,a0,a1
    [test3.c:   7] 0x4009b4:  ac80fff0        sw      zero,-16(a0)
    [test3.c:   8] 0x4009b8:  ac80fff4        sw      zero,-12(a0)
    [test3.c:   9] 0x4009bc:  ac80fff8        sw      zero,-8(a0)
    [test3.c:   6] 0x4009c0:  1420fffa        bne     at,zero,0x4009ac
    [test3.c:  10] 0x4009c4:  ac80fffc        sw      zero,-4(a0)
Note that this snippet assumes that x is divisible by 4, unlike the original loop. There's a way around this for general values of x: split the loop into two loops. The first is not unrolled, and processes (x & 3) elements. The remaining number of elements is guaranteed to be divisible by 4, and the second, unrolled loop handles the bulk of the elements. If x is going to be very small, it's not worth the overhead, but for most values of x, this approach is a big win.
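A sketch of that two-loop pattern, using the same int array as above: a short scalar loop clears the first (x & 3) elements, then the unrolled loop handles the rest four at a time.

```c
void clear_array(int *array, int x)
{
    int i, head = x & 3;      /* cheap modulo, since 4 is a power of 2 */

    /* Scalar loop for the leftover elements. */
    for (i = 0; i < head; i++)
        array[i] = 0;

    /* x - head is guaranteed divisible by 4, so the unrolled loop
     * never overruns the array. */
    for (; i < x; i += 4) {
        array[i]     = 0;
        array[i + 1] = 0;
        array[i + 2] = 0;
        array[i + 3] = 0;
    }
}
```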
So why did we choose 4? First, it's a power of two, which makes the modulo operation cheap (x & 3, as seen above). It's also fairly small. As the loop-unrolling block size becomes larger, you get diminishing returns. At some point, things actually slow down because the code no longer gets nice cache behavior. Common values for the block size are 4, 8, and sometimes 16.
Experiment with your code to see what works well. As always, if it's a critical loop, disassemble it to see what the compiler is doing for you.
A critical thing to note is that the compiler will usually do loop unrolling for you, unless it determines that there is little benefit to doing so. One common reason it decides not to unroll is that pointers inside the loop may be aliased. See #12.
[...more to come...]