
   Cache Blocking
   -------------- 
   Cache blocking can be an effective technique to 
   minimize the transfer of data in/out of the data
   caches.  This program illustrates this technique.
   The first function, normal, computes a two dimensional
   vector sum on arrays arrayA and arrayB, each sized by
   the constant DIM.  The second function, blocked, blocks
   the loop indices while computing the vector sum.  
   Depending on the machine on which you are running 
   this program, the second function should be much faster.

   The cache misses in the normal function come from the
   access pattern of the vector sum: it strides through
   one array with a stride of 1 but through the other
   with a stride of DIM.  Switching rows and columns is
   therefore no fix in this case, as one of the arrays
   would still be walked with a stride of DIM.
   
   DIM needs to be even since we unroll by 2 in blocked.
   It should also be large enough that the arrays overflow
   the machine's data cache, or else you may not see much
   difference in performance.

   Modify DIM and see what effect it has on the
   difference in speed between the normal and
   blocked functions.

   This illustration is based on the example from
   "High Performance Computing", by Dowd (O'Reilly).
   Examples that demonstrate even more performance
   gain by splitting inner/outer loops or by 
   unrolling by different factors are given in this
   book.  These examples largely concentrate on ways
   to avoid TLB (translation lookaside buffer) misses.

