
High Performance Computing

University Information Services


Turing Tuning

Turing was decommissioned at the end of February 1999. The pages on Turing are not being maintained, but are being left accessible for historical interest. They may well have bad links and similar errors.


This page gives a simple example of some of the concepts involved in tuning code for a vector supercomputer such as the Hitachi S3600. It is only an example, and as such fairly shallow; and since the routine tuned here is a standard library routine, in real code it should be replaced by a library call anyway.

Tuning a 3D FFT

The first point when tuning any routine is to get a clear idea of what the routine does and how it is likely to be used. The routine considered was a complex-to-complex 3D FFT, and a three-dimensional FFT is simply a collection of 1D FFTs.

In order to FFT an NXxNYxNZ array, first NYxNZ 1D FFTs of length NX are done down the X columns, then NXxNZ of length NY down the Y columns, and finally NXxNY of length NZ down the Z columns. The expected size of the array for the application in mind would be 50x50x50 to 100x100x100, and not necessarily with all dimensions equal. However, the dimensions were assumed to have no factors other than 2, 3 and 5 for ease of doing the 1D FFTs.
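This separability can be checked directly. The sketch below uses a naive O(N^2) DFT in plain Python purely as a stand-in for the real FFT kernel (the function names and the flat column-major layout are illustrative assumptions, not the S3600 library interface):

```python
import cmath

def dft1(xs):
    """Naive O(n^2) 1D discrete Fourier transform (illustration only)."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * j / n)
                for j, x in enumerate(xs))
            for k in range(n)]

def dft3_by_axes(c, nx, ny, nz):
    """3D DFT of a flat list c (element (ix,iy,iz) at ix+iy*nx+iz*nx*ny),
    done as three passes of 1D DFTs: X columns, then Y, then Z."""
    c = list(c)
    for iy in range(ny):                      # NY*NZ transforms of length NX
        for iz in range(nz):
            base = iy * nx + iz * nx * ny
            for i, v in enumerate(dft1([c[base + ix] for ix in range(nx)])):
                c[base + i] = v
    for ix in range(nx):                      # NX*NZ transforms of length NY
        for iz in range(nz):
            idx = [ix + iy * nx + iz * nx * ny for iy in range(ny)]
            for i, v in enumerate(dft1([c[j] for j in idx])):
                c[idx[i]] = v
    for ix in range(nx):                      # NX*NY transforms of length NZ
        for iy in range(ny):
            idx = [ix + iy * nx + iz * nx * ny for iz in range(nz)]
            for i, v in enumerate(dft1([c[j] for j in idx])):
                c[idx[i]] = v
    return c
```

A delta function at the origin transforms to all ones, and a constant array to a single spike of value NX*NY*NZ, which gives a quick sanity check; keep the grids tiny, as each naive 1D transform is quadratic.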

Two issues arise in this problem:

  • How to vectorise?
  • Efficient use of memory bandwidth

There are two possible ways of doing the vectorisation. Firstly, one could write a vectorised single 1D FFT routine. This is not only difficult, but the vector length will be short: each individual 1D FFT is over only about 100 items, after all. Alternatively, one could vectorise across the different 1D FFTs being done simultaneously, as in each of the three passes over the array around 1,000 to 10,000 independent FFTs are performed simultaneously. This clearly gives a much longer vector length.

Therefore a routine was used for the 1D FFTs which could do multiple 1D FFTs simultaneously, accepting as arguments the number of FFTs to be done, the stride between elements in an individual FFT, and the stride between the sets to be FFT'd. This routine had its innermost loops looping across the FFTs to be done, rather than having the innermost loops do one FFT and an outer loop ensure that the whole set passed was FFT'd. Thus the vectorisation, which occurs most strongly on the innermost loops, was across the FFTs.

Recalling that an array element C(IX,IY,IZ) will have an address of C(IX+IY*NX+IZ*NX*NY), this leads to the following pseudocode as a first attempt:

Do transforms along X by NY*NZ simultaneous transforms of
   length NX
   unit stride
   separation between transforms NX

Do transforms along Y by the following loop:
   DO Z=1,NZ
     Do NX simultaneous transforms along Y of
       length NY
       stride NX
       separation between transforms 1
   END DO

Do transforms along Z by NX*NY simultaneous transforms of
   length NZ
   stride NX*NY
   separation between transforms 1
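The strides quoted in this pseudocode follow directly from the address formula; a quick check (0-based indices, for simplicity):

```python
# 0-based linear address of element (ix, iy, iz) in an NXxNYxNZ array
def addr(ix, iy, iz, nx, ny):
    return ix + iy * nx + iz * nx * ny

NX, NY, NZ = 50, 60, 70
# consecutive X elements are adjacent: unit stride
assert addr(1, 0, 0, NX, NY) - addr(0, 0, 0, NX, NY) == 1
# consecutive Y elements are NX apart
assert addr(0, 1, 0, NX, NY) - addr(0, 0, 0, NX, NY) == NX
# consecutive Z elements are NX*NY apart
assert addr(0, 0, 1, NX, NY) - addr(0, 0, 0, NX, NY) == NX * NY
print("strides:", 1, NX, NX * NY)  # → strides: 1 50 3000
```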

The ugly loop in the middle, which does one XY plane of the cube per iteration, is the simplest solution to the problem that the Y columns of the matrix are slightly scattered in memory: the first NX of them start at the first elements C(1), C(2), ..., C(NX), the next batch at C(1+NX*NY), C(2+NX*NY), and so on.
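The scheme above can be sketched in plain Python. Here a naive DFT stands in for the real vectorised kernel, and the routine name and signature are illustrative; the important feature is that the innermost loop runs across the independent transforms, which is the loop a vectorising compiler targets:

```python
import cmath

def multi_dft(c, nfft, length, stride, sep, base=0):
    """Do nfft independent naive DFTs in place on the flat list c.
    Element j of transform f lives at c[base + f*sep + j*stride].
    The innermost loop runs across transforms (f), not within one."""
    out = [0j] * (length * nfft)
    for k in range(length):
        for j in range(length):
            tw = cmath.exp(-2j * cmath.pi * k * j / length)  # twiddle factor
            row = k * nfft
            src = base + j * stride
            for f in range(nfft):                 # the vectorisable loop
                out[row + f] += tw * c[src + f * sep]
    for k in range(length):
        for f in range(nfft):
            c[base + f * sep + k * stride] = out[k * nfft + f]

def fft3(c, nx, ny, nz):
    """The three passes of the pseudocode above, in place."""
    multi_dft(c, ny * nz, nx, stride=1, sep=nx)              # X pass
    for iz in range(nz):                                     # Y pass, per plane
        multi_dft(c, nx, ny, stride=nx, sep=1, base=iz * nx * ny)
    multi_dft(c, nx * ny, nz, stride=nx * ny, sep=1)         # Z pass
```

Transforming a delta function at the origin (which should give all ones) confirms that the three passes with these strides and separations compose into the full 3D transform.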

Timing these transforms (FFT+IFFT) for various cubic grids gives the following data:


Grid     Time(s)   VPU/CPU

  16    0.003320    0.51
  18    0.003617    0.43
  20    0.003564    0.52
  24    0.007285    0.65
  25    0.005611    0.57
  27    0.009276    0.49
  30    0.010252    0.58
  32    0.034403    0.79
  36    0.016349    0.69
  40    0.025894    0.79
  45    0.023546    0.69
  48    0.065152    0.86
  50    0.035115    0.72
  54    0.050679    0.68
  60    0.061869    0.81
  64    0.400791    0.95
  72    0.164020    0.89
  75    0.091989    0.80
  80    0.273189    0.93
  81    0.153424    0.69
  90    0.179736    0.82
  96    1.014565    0.96
 100    0.258008    0.89
 108    0.396227    0.87
 120    0.726538    0.93
 125    0.423678    0.86
 128    4.066586    1.00
 135    0.543115    0.85

How should such data be interpreted? Firstly, it is clear that some sizes (32, 64, 96, 128) are taking anomalously long. Although the vectorisation ratio is very high for these sizes, something is clearly amiss. That something is a memory bank conflict.

In order to interface (relatively) slow memory to a fast CPU, the S3600 has a number, NB, of separate banks of memory, arranged so that sequential access to memory results in accessing the banks in a "round robin" fashion, and thus the processor sees NB times the bandwidth that a single bank could provide. Although I can find no documentation of how many banks the S3600 has, the answer is certainly a power of 2 (from hardware design considerations), and is probably about 16. The width of a bank is also unclear, but again it will be a power of two, probably around 32 bytes.

Although sequential memory access "hits" each bank in turn, and thus uses the combined bandwidth of all banks, accessing memory in a strided fashion misses some banks completely. Assuming 16 banks of 32 bytes each, accesses falling every 32 bytes hit each bank in turn, accesses falling every 64 bytes miss alternate banks, and accesses falling every 512 bytes go exclusively to a single bank, thus losing the advantage of having multiple banks.
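This effect is easy to model. The snippet below assumes the figures guessed at above (16 banks, 32 bytes wide) and the simplest plausible mapping, bank = (address / bank_width) mod n_banks; both are assumptions, not documented S3600 facts:

```python
def banks_hit(stride_bytes, n_banks=16, bank_width=32, n_accesses=64):
    """Set of banks touched by a stream of strided accesses, under the
    assumed mapping: bank of an address = (addr // bank_width) % n_banks."""
    return {(i * stride_bytes // bank_width) % n_banks
            for i in range(n_accesses)}

print(len(banks_hit(32)))    # → 16: every bank in turn, full bandwidth
print(len(banks_hit(64)))    # → 8:  alternate banks, half bandwidth
print(len(banks_hit(512)))   # → 1:  a single bank serialises every access
```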

The cure is simply to copy the array into a larger array of size (NX+1)xNYxNZ (or expand it in place), and then do the FFT. When doing the transforms along X the stride between the transforms is now NX+1, not NX. Obviously this expansion is done only if the original value of NX was "bad", i.e. had a factor of, say, 4. This expansion revises the timings as shown:
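A sketch of the expansion copy (flat 0-based layout as before; in the real code the array might instead be expanded in place). Assuming 16-byte complex elements, NX=64 gives a Y-pass element stride of 1024 bytes, which under the banking described above falls on a single bank; padding the leading dimension to 65 gives 1040 bytes, which cycles through all the banks:

```python
def pad_x(c, nx, ny, nz):
    """Copy an NXxNYxNZ array into one with leading dimension NX+1.
    The padding element of each X column is never read by the transforms;
    its only job is to shift every stride off the bad power-of-two value."""
    p = [0j] * ((nx + 1) * ny * nz)
    for iz in range(nz):
        for iy in range(ny):
            row = iy * nx + iz * nx * ny            # old column start
            prow = iy * (nx + 1) + iz * (nx + 1) * ny  # padded column start
            for ix in range(nx):
                p[prow + ix] = c[row + ix]
    return p
```

After the FFT the result is contracted back by the inverse copy, so the cost is two extra sweeps through memory, both at unit stride.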

         ----before----         ----after----
Grid    time(s)    VPU/CPU    time(s)    VPU/CPU
    
  16    0.003320    0.51      0.003053    0.46
  18    0.003617    0.43      0.003572    0.43
  20    0.003564    0.52      0.004064    0.58
  24    0.007285    0.65      0.006475    0.61
  25    0.005611    0.57      0.005610    0.57
  27    0.009276    0.49      0.009258    0.50
  30    0.010252    0.58      0.009947    0.59
  32    0.034403    0.79      0.016292    0.55
  36    0.016349    0.69      0.016145    0.68
  40    0.025894    0.79      0.020235    0.73
  45    0.023546    0.69      0.023750    0.69
  48    0.065152    0.86      0.032562    0.72
  50    0.035115    0.72      0.034986    0.73
  54    0.050679    0.68      0.050344    0.68
  60    0.061869    0.81      0.055139    0.79
  64    0.400791    0.95      0.076589    0.69
  72    0.164020    0.89      0.098706    0.80
  75    0.091989    0.80      0.090852    0.81
  80    0.273189    0.93      0.118200    0.82
  81    0.153424    0.69      0.153169    0.69
  90    0.179736    0.82      0.179788    0.82
  96    1.014565    0.96      0.223708    0.79
 100    0.258008    0.89      0.221919    0.87
 108    0.396227    0.87      0.305558    0.83
 120    0.726538    0.93      0.395230    0.88
 125    0.423678    0.86      0.423474    0.86
 128    4.066586    1.00      0.509342    0.81
 135    0.543115    0.85      0.543939    0.85

Now the anomalous results have been eliminated. The VPU/CPU ratio has declined, for the VPU was being "used" waiting for data from memory before, whereas now the waiting time has been reduced. The effect on the real time used can be quite dramatic - a factor of 8 being seen for the 128 data above. Of course this sort of optimisation would slow down most workstations, as the time to expand and contract the data would far outweigh any improvement that the memory banking on the average workstation would give to the odd stride.

If one were concerned about small data sizes, a further improvement could be made. In the 24x24x24 case, for example, two sets of transforms with a vector length of 576 are done, and 24 sets of transforms with a vector length of only 24. However, an in-place rotation of the matrix between passes could reduce this to three sets of length 576. This optimisation leads to the third set of results below:
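The idea of the rotation is shown below, out of place for clarity (the real code would rotate in place, and the function name is illustrative). Each rotation makes the axis to be transformed next the unit-stride first axis, so every pass runs as one long set of contiguous transforms, and three transform-rotate steps restore the original layout:

```python
def rotate(c, n1, n2, n3):
    """Return the axis rotation out(i2,i3,i1) = in(i1,i2,i3): the old
    second axis becomes the new unit-stride first axis.  Applying it
    three times (with the dimensions rotated to match) is the identity."""
    out = [0j] * (n1 * n2 * n3)
    for i3 in range(n3):
        for i2 in range(n2):
            for i1 in range(n1):
                out[i2 + i3 * n2 + i1 * n2 * n3] = c[i1 + i2 * n1 + i3 * n1 * n2]
    return out
```

For a 24x24x24 grid the pattern is then: transform X (vector length 576), rotate, transform Y (576), rotate, transform Z (576), rotate - at the price of three extra passes over memory.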

         ----before----         ---strides----         ---vectors---
Grid    time(s)    VPU/CPU    time(s)    VPU/CPU      time(s)    VPU/CPU
    
  16    0.003320    0.51      0.003053    0.46        0.001725     0.82
  18    0.003617    0.43      0.003572    0.43        0.001766     0.80
  20    0.003564    0.52      0.004064    0.58        0.002202     0.87
  24    0.007285    0.65      0.006475    0.61        0.004475     0.90
  25    0.005611    0.57      0.005610    0.57        0.003481     0.90
  27    0.009276    0.49      0.009258    0.50        0.004974     0.82
  30    0.010252    0.58      0.009947    0.59        0.006038     0.91
  32    0.034403    0.79      0.016292    0.55        0.008750     0.92
  36    0.016349    0.69      0.016145    0.68        0.012287     0.93
  40    0.025894    0.79      0.020235    0.73        0.015478     0.97
  45    0.023546    0.69      0.023750    0.69        0.017475     0.93
  48    0.065152    0.86      0.032562    0.72        0.023468     0.94
  50    0.035115    0.72      0.034986    0.73        0.023565     0.95
  54    0.050679    0.68      0.050344    0.68        0.034043     0.89
  60    0.061869    0.81      0.055139    0.79        0.050396     0.96
  64    0.400791    0.95      0.076589    0.69        0.050648     0.96
  72    0.164020    0.89      0.098706    0.80        0.082056     0.94
  75    0.091989    0.80      0.090852    0.81        0.074592     0.96
  80    0.273189    0.93      0.118200    0.82        0.097538     0.97
  81    0.153424    0.69      0.153169    0.69        0.115563     0.85
  90    0.179736    0.82      0.179788    0.82        0.146915     0.94
  96    1.014565    0.96      0.223708    0.79        0.179485     0.95
 100    0.258008    0.89      0.221919    0.87        0.230832     0.98
 108    0.396227    0.87      0.305558    0.83        0.363328     0.95
 120    0.726538    0.93      0.395230    0.88        0.358902     0.97
 125    0.423678    0.86      0.423474    0.86        0.374054     0.98
 128    4.066586    1.00      0.509342    0.81        0.424823     0.97
 135    0.543115    0.85      0.543939    0.85        0.493072     0.93

This final table shows the most dramatic improvements where very small vector lengths have been combined. Again, this improvement would slow down most workstations.

One final point: it is difficult to win all the time. On average, the final scheme presented is best, and the initial scheme the worst. But for specific examples (e.g. 20, 100, 108) this is not so. So take care to perform a reasonably comprehensive test before drawing conclusions!

Conclusions

  • Keep vectors long
  • Avoid large power of two strides
  • Memory copies can be cheap