After a couple of days of speed-testing the mod 2 multiplication on the St. Andrews machines (64 cores of AMD Piledriver), several parts of the code needed adjusting to make best use of the particular features of the machine.
My tests were done multiplying 160,000 x 160,000 random matrices, which took 163 seconds in 2014. The original code this time around initially took about the same time.
First off, the L2 cache, although slower on AMD than on Intel, is huge – 2MB shared between two cores. Doing 5,000 rows for each brick made, rather than 1,000 (which is the optimum on Intel Haswell), saved over 10% of the runtime. This reduced the time to 141 seconds.
A lot of fiddling with all sorts of parameters, re-ordering the assembler code and so on, produced only one further second of improvement! Nevertheless, the fact that changes to the chopping of matrices B and C (C = A x B) in the vertical direction slowed things down highlighted another issue – that of cauldron occupancy.
160,000 columns is 156.25 cauldrons (of 1024 columns each), and when I chop vertically, I chop equally. I was originally chopping into 20 parts for no good reason, and for this particular size (160,000) that happens to be the optimum! Chopping into 19 parts, for example, needs 9 cauldrons per part, so that in all 171 cauldrons of work are done instead of 160.
I need to think long and hard about this. Of course for huge matrices (a million or so) this is less important, as even a poor chop would waste only a couple of percent, but for moderate matrices (e.g. 9,000-odd) it will be very important to take note of the fact that the underlying technology works with rows of 1024 columns. It may be necessary to chop unequally. For example, chopping 160,000 equally into 20 parts does 160 cauldrons of work, but by making all parts but the last a full 8 cauldrons, the last block will only be 5 cauldrons, so doing 157 cauldrons of work all told.
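The chopping arithmetic above can be sketched in a few lines of C. This is just an illustration of the counting argument, not code from the project; the cauldron width of 1024 is from the text, and the function names are mine.

```c
#include <assert.h>

#define CAULDRON 1024L  /* columns per cauldron */

/* ceiling division for positive operands */
static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* Total cauldrons of work when ncols is chopped equally into nparts:
   each part is padded up to a whole number of cauldrons. */
static long equal_chop(long ncols, long nparts)
{
    long cols_per_part = ceil_div(ncols, nparts);
    return nparts * ceil_div(cols_per_part, CAULDRON);
}

/* Total cauldrons when all parts but the last are a full number of
   cauldrons, and only the last part carries the ragged remainder.
   (Assumes the chop leaves a non-empty last part, as in the examples.) */
static long unequal_chop(long ncols, long nparts)
{
    long full = ceil_div(ceil_div(ncols, nparts), CAULDRON);
    long head = (nparts - 1) * full;          /* cauldrons in the full parts */
    long rest = ncols - head * CAULDRON;      /* columns left for the last part */
    return head + ceil_div(rest, CAULDRON);
}
```

With these, `equal_chop(160000, 20)` gives the 160 cauldrons of the lucky equal chop, `equal_chop(160000, 19)` gives the wasteful 171, and `unequal_chop(160000, 20)` gives the 157 obtained by making the last block the short one.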
The third thing that came out of fiddling with the sequence of instructions in the assembler code was that the code seemed to execute at the same speed whatever I did. I think this means that the code is entirely limited by decode speed – it can only average two instructions per core per cycle. Hence the key thing is to reduce the instruction count, and the only way I could see to do this was to use the “andn” instruction. This saves 7 register-register moves, which I had thought wouldn’t help because they execute in zero time, but they must still be decoded. The result was a reduction of a further three seconds, to 137.2.
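The saving from “andn” can be seen even at the C level. A minimal sketch (the function name and values are mine, not from the project): with BMI1 enabled (e.g. gcc -mbmi), a compiler turns “~a & b” into a single three-operand ANDN, whereas the two-operand NOT+AND sequence needs an extra register-register move to preserve the mask – a move that costs nothing to execute but still has to be decoded.

```c
#include <assert.h>
#include <stdint.h>

/* ~mask & x: under BMI1 this compiles to one ANDN instruction.
   Without BMI1 it becomes MOV + NOT + AND, since NOT is destructive
   and the original mask value must be kept in another register. */
static inline uint64_t andnot(uint64_t mask, uint64_t x)
{
    return ~mask & x;
}
```

The semantics are unchanged either way; the win is purely in the number of instructions the shared decoder has to chew through.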
Clearly AMD noticed this issue in some way, as the next version of the chip (“Steamroller”) has one decoder per core, not one shared between two cores, so the tuning would have to be repeated for that chip.
Unfortunately the andn instruction is fairly new, depending on the BMI1 flag in the core. I must either test the CPUID bit to see whether BMI1 is available, or produce a special version of the performance-critical assembler routines for the Piledriver. I’ll probably do the latter for now. The trouble is that the optimal code for the Piledriver will not run at all on the Intel chips before Haswell (i.e. it will not even work on Sandy Bridge or Ivy Bridge).
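The runtime test for the BMI1 bit is short enough that it may be worth having regardless. A sketch using GCC/clang's cpuid.h helper (BMI1 is reported in CPUID leaf 7, sub-leaf 0, EBX bit 3); the function name is mine.

```c
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Returns 1 if the core advertises BMI1 (and hence ANDN), else 0. */
static int have_bmi1(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid_count returns 0 if leaf 7 is not supported at all */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 3) & 1;   /* EBX bit 3 = BMI1 */
}
#else
static int have_bmi1(void) { return 0; }  /* not an x86 build */
#endif
```

The result could be used once at start-up to choose between the ANDN routines and the fallback, rather than shipping a chip-specific build.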
I now need to think how best to take advantage of all these new insights.