	ijk_loop example
	----------------

	The ijk_loop example illustrates two points:
	(1) the effect of index order on performance
	as the array is processed; and (2) how
	ideal time analysis gives different
	results from PC sampling.  Differences
	between the two can point to areas of the code
	that suffer from cache latency.

	This code was written, run, and analysed on 
	an R5000 O2 SGI computer.  Your mileage will
	vary with other machines of course.

	Let's walk through an example.  First, we
	compile the code:

		cc -o foo2 foo2.c -O2

	SGI provides a utility called ssrun, which
	collects the performance data for us.

		ssrun -ideal foo2

	and

		ssrun -usertime foo2

	will give us two output files.  The first
	ssrun command gives us the idealized
	analysis; the second gives us the actual
	performance of the code, taking memory
	latencies and similar delays into account.

	Running prof on each ssrun output file gives us a readable report:

		prof output_file > foo2.output

	Looking at the report from the idealized run, we
	see that the three functions each
	took the same amount of time.  We'd expect that, since
	they execute the same number of loop iterations.

	However, looking at the usertime output, we see that
	the kji_loop took about 37% of the overall processing
	time, while the ijk and ikj loops took much less.
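
	The source of foo2.c is not reproduced here, so the
	following is only a sketch of what the three loops might
	look like.  The array name a, the size N, and the fill
	helper are assumptions made for illustration.  C stores
	arrays in row-major order, so with k innermost, consecutive
	accesses are adjacent in memory, while with i innermost
	each access jumps N*N elements.  That would plausibly
	explain the extra cache misses seen in kji_loop.

```c
/* Hypothetical sketch: N, the array a, and fill() are assumptions,
 * not taken from foo2.c. */
#define N 64

static double a[N][N][N];

/* Fill the array with a known value so every traversal has a checkable sum. */
void fill(void)
{
    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                a[i][j][k] = 1.0;
}

/* k innermost: consecutive accesses are adjacent in memory (stride 1). */
double ijk_loop(void)
{
    int i, j, k;
    double sum = 0.0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[i][j][k];
    return sum;
}

/* j innermost: consecutive accesses are N elements apart. */
double ikj_loop(void)
{
    int i, j, k;
    double sum = 0.0;
    for (i = 0; i < N; i++)
        for (k = 0; k < N; k++)
            for (j = 0; j < N; j++)
                sum += a[i][j][k];
    return sum;
}

/* i innermost: consecutive accesses are N*N elements apart. */
double kji_loop(void)
{
    int i, j, k;
    double sum = 0.0;
    for (k = 0; k < N; k++)
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                sum += a[i][j][k];
    return sum;
}
```

	Calling fill() once and then each loop function from main
	should give the same sum, N*N*N, for all three.  Only the
	order of the memory accesses, and so the running time, differs.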

	IRIX Notes
	----------
	
	To build and profile, type:
		gmake profile

	The resulting output is found in two files:
	
		basic block results:  go.exe_ideal_results
		process time results: go.exe_usertime_results

	Linux Notes
	-----------

	To build and profile, type:
		gmake profile

	The resulting output is found in two files:

		basic block results:  main.cxx.gcov
		process time results: go.exe_usertime_results

