Tuning mod 2 on AMD Piledriver.

After a couple of days of speed-testing mod 2 on the St. Andrews machines (64 cores of AMD Piledriver), several things in the code needed to be adjusted to make best use of the particular features of the machine.

My tests were done multiplying 160,000 x 160,000 random matrices, which took 163 seconds in 2014.  The original code this time around initially took about the same time.

First off, the L2 cache, although slower on AMD than Intel, is huge – 2MB shared between two cores.  It saved over 10% of the runtime to do 5,000 rows for each brick made rather than 1,000 (which is optimal on the Intel Haswell).  This reduced the time to 141 seconds.

A lot of fiddling with all sorts of parameters, re-ordering the assembler code and so on produced only one further second of improvement!  Nevertheless the fact that changes to the chopping of matrices B and C (C=AxB) in the vertical direction slowed things down highlighted another issue – that of cauldron occupancy.

160,000 columns is 156.25 cauldrons (of 1024 columns) and when I chop vertically, I chop equally.  I was originally chopping into 20 parts for no good reason, and for this particular size (160,000) this happens to be the optimum!  For example, chopping into 19 parts gives about 8,422 columns per part, needing 9 cauldrons each, so that in all 171 cauldrons of work are done instead of 160.

I need to think long and hard about this.  Of course for huge matrices (a million or so) this is less important as even a poor chop would waste only a couple of percent, but for moderate matrices (e.g. 9,000 odd) it will be very important to take note of the fact that the underlying technology is working with rows of 1024 columns.  It may be necessary to chop unequally.  For example chopping 160,000 equally into 20 parts does 160 cauldrons, but by making all parts but the last a full 8 cauldrons, the last block will only be 5 cauldrons, so doing 157 cauldrons of work all told.
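
To make the cauldron arithmetic concrete, here is a rough C sketch (invented names, not the real meataxe64 routines) comparing an equal chop with a chop that gives every part but the last a whole number of cauldrons:

    #include <stdio.h>

    #define CAULDRON 1024   /* columns per cauldron */

    /* cauldrons of work when ncols is chopped equally into nparts
       (every part charged the size of the largest part) */
    static long equal_chop(long ncols, long nparts)
    {
        long per_part  = (ncols + nparts - 1) / nparts;          /* round up */
        long cauldrons = (per_part + CAULDRON - 1) / CAULDRON;   /* round up */
        return nparts * cauldrons;
    }

    /* every part but the last takes a whole number of cauldrons
       (no attempt to handle degenerate cases) */
    static long full_chop(long ncols, long nparts)
    {
        long per_part  = (ncols + nparts - 1) / nparts;
        long cauldrons = (per_part + CAULDRON - 1) / CAULDRON;
        long last = ncols - (nparts - 1) * cauldrons * CAULDRON;
        return (nparts - 1) * cauldrons + (last + CAULDRON - 1) / CAULDRON;
    }

    int main(void)
    {
        printf("%ld %ld %ld\n",
               equal_chop(160000, 20),   /* 160 cauldrons */
               equal_chop(160000, 19),   /* 171 cauldrons */
               full_chop(160000, 20));   /* 157 cauldrons */
        return 0;
    }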

The third thing that came out of fiddling with the sequence of instructions in the assembler code was that the code seemed to execute at the same speed whatever I did.  I think this means that the code is entirely limited by the decode speed – it can only average two instructions per clock on each core.  Hence the key thing to do is to reduce the instruction count, and the only way I could see to do this was to use the “andn” instruction.  This saves 7 register-register moves, which I had thought wouldn’t help because they execute in zero time, but they must still be decoded.  The result was a reduction of a further three seconds, to 137.2 seconds.

Clearly AMD noticed this issue in some way, as the next version of the chip (“Steamroller”) has one decoder per core, not one shared between two cores, so the tuning would have to be repeated for that chip.

Unfortunately the andn instruction is fairly new, depending on the BMI1 flag in the core.  I must either test the bit to see whether BMI1 is available, or produce a special version of the performance-critical assembler routines for the Piledriver.  I’ll probably do the latter for now.  The trouble is that the optimal code for the Piledriver will not run at all on the Intel chips before Haswell (i.e. it will not even work on Sandy/Ivy Bridge).
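
For the record, testing the bit is easy enough with gcc’s <cpuid.h> (a sketch; the real code would do this once at start-up and select the routines accordingly):

    #include <cpuid.h>

    /* BMI1 (which provides andn) is CPUID leaf 7, sub-leaf 0, EBX bit 3 */
    static int have_bmi1(void)
    {
        unsigned int a, b, c, d;
        if (__get_cpuid_max(0, 0) < 7)
            return 0;
        __cpuid_count(7, 0, a, b, c, d);
        return (b >> 3) & 1;
    }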

I now need to think how best to take advantage of all these new insights.

Cannot get Haswell to run as fast as I thought.

After a very considerable amount of experimentation, I am unable to explain why the multiplication of random 50,000 x 50,000 matrices mod 2 takes about 72 seconds on my laptop using SSE, yet still takes about 50 seconds using AVX-2, despite the doubled register width.  The chip is an Intel i3 4005U.

More specifically I have changed it to use the same amount of cache memory for each stage, fiddled around with instruction sizes (thinking it might be a mu-op cache problem), checked that it really is the BrickMad that is the problem, fiddled with the order of the instructions (in case it is false dependency chains) and generally spent two days trying everything I can think of, all with no effect.

I have now given up trying.  The AVX-2 code seems to take about 70% of the time of the “legacy” SSE code, and as yet I have no clue as to how to make it run faster.

Use of L1 and L2 cache.

My timing experiments on matrices mod 2 on the Haswell chip in my laptop have led me (if you are a glass-half-empty person) to a realization that my previous analysis was oversimplified, or (if there are any glass-half-full people) to see how to do a more accurate estimation.

My current code uses 8 slices of grease level 4 on 256-byte vectors.  This occupies 32 KB (8 slices of 16 vectors of 256 bytes), cleverly (I thought) utilizing all of L1 cache.  Of course we must also fetch and store Cfmt rows (256 bytes) from time to time, expected to fit in L2, and to fetch small amounts of Afmt (40 bytes).  I propose to neglect the Afmt.  The Cfmt fetch, however, goes into L1, ejecting a line of the grease.  This line must also be re-fetched from L2, and there is a danger that it will be re-fetched before the Cfmt is the oldest line in the cache, thereby ejecting a further line, so that this also requires re-fetching.  In other words as well as the fetch and store of Cfmt, there are some (I make it 25/12) fetches of the grease.  Hence the fetch bandwidth to L2 is over three times what I thought!  😦

It is not so hard to see in general what to do about this.  Either the Cfmt fetches must be rare enough that this bandwidth is not the bottleneck, or we must reduce our usage of L1 to leave one slot available for the pollution.  It seems clear that the latter approach is best, though I haven’t tested it in practice.

On AMD, I believe the L1 cache is 4-way associative, whereas on Intel it is 8-way.  Leaving one way free for the pollution, I must therefore consider the space available for grease on the AMD chips to be 12K (three of the four ways of a 16K cache, not 16K as I previously thought) and on the Intel chips (pre-Skylake) 28K (seven of the eight ways of 32K).

This is, I must admit, a bit inconvenient.  All my address computations at present are done with logical operations (Shift, And etc.), and since these reduced sizes are no longer powers of two, somewhere a multiplication must be introduced.

It remains to decide how best to reduce the L1 occupancy.  One approach is to reduce the length of vectors used – on Haswell from 256 to 224 or 192.  224 still means that four cache-lines must be fetched for Cfmt, whereas 192 reduces the occupancy more than is necessary.  Another approach is to use 7 slices instead of 8, but this makes the alcove 28 entries, which adds a bit of complication to the Afmt generation.  A third approach is to reduce the grease level from 4 to 3, which would reduce the L2 bandwidth demands quite considerably by doing, say, 12 or 14 slices.  Probably the best will be to switch to Golay grease, where five slices reduces the Cfmt fetch by a large factor (8/15) and also reduces the L1 utilization (from 32K to 30K).  Of course at present 1/16 of the grease data is the zero vector, and there may be a cheap way to eliminate or reduce this.

All this leads to the conclusion that the HPMI code for the field of order 2 is still not in final form, and that further experimentation is necessary.

More Thoughts on Gaussian Elimination.

The immediate objective is for me to get the high-performance matrix multiplication (HPMI) working over smallish fields, while Jon Thackray concentrates on the top-level programs for chopped-up rank, null-space, invert and split.

The result of this should certainly be a vast improvement on older methods, but the issue remains that even after chopping up, the putting of a single “slab” into echelon form will not be fast, as it is done using general-purpose “Dfmt” routines, and the HPMI is approximately 100 times faster.  Unfortunately, these “Echelization” operations tend to be part of a dependency chain, limiting the amount of parallelization that we can do.  Hence the current design will not scale well, and I do not expect a Gaussian Operation to be able to use more than 10-20 cores before it gets stuck waiting for an echelization to complete.

We are therefore going to need a faster “slab echelize” before we get much older.

The key operation is the multiply-and-add.  We have two vectors V and W, and we wish to form V = V + x.W for x an arbitrary field element.  In Dfmt this is mainly done using byte-wide table look-up.  Only the field of order 2 can go much faster – even GF(4) needs a table look-up much of the time.  These routines therefore tend to operate at about 0.6 bytes/nSec, which is well within the memory budget (around 1.5 bytes/nSec), so they are computationally bound.
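
To give a feel for why this runs at byte speed, here is a much-simplified C sketch of V = V + x.W by table look-up.  The names are invented, and it stores one field element per byte, which is not the real (packed) Dfmt layout – the point is only that every byte of W goes through a table, one at a time:

    #include <stdint.h>
    #include <stddef.h>

    /* V += x*W over GF(p), one element per byte (illustration only) */
    static void scaled_add(uint8_t *v, const uint8_t *w, size_t n,
                           unsigned p, unsigned x)
    {
        uint8_t mulx[256];
        for (unsigned b = 0; b < p; b++)
            mulx[b] = (uint8_t)((x * b) % p);       /* built once per scalar x */
        for (size_t i = 0; i < n; i++) {
            unsigned s = v[i] + mulx[w[i]];         /* byte-wide table look-up */
            v[i] = (uint8_t)(s >= p ? s - p : s);   /* reduce mod p */
        }
    }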

The main idea, then, is to change the format into something like Bfmt or Cfmt so that this operation can go much faster.  For example, mod 3 we can use logic-6 which, even using SSE, should get about 8 bytes/nSec.  Yes!  We want to do this, but the result is that we will become bound by memory bandwidth.
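
To illustrate the bitsliced idea (a straightforward seven-operation circuit, not the exact logic-6 one, and with invented names), hold each mod 3 entry as one bit in each of two planes:

    #include <stdint.h>

    /* Bitsliced GF(3): value 0 = (0,0), 1 = (1,0), 2 = (0,1) in the two
       planes.  Seven bitwise operations per word; the logic-6 circuit is
       presumably a tighter variant of the same idea, and with SSE or AVX
       each "word" is 128 or 256 bits wide. */
    static void gf3_add(uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1,
                        uint64_t *r0, uint64_t *r1)
    {
        uint64_t t = (a0 | b1) ^ (a1 | b0);
        *r0 = t ^ (a1 | b1);
        *r1 = t ^ (a0 | b0);
    }

Multiplying W by 2 before the add is just a swap of its two planes (2 = -1 mod 3), which is why V = V + x.W comes almost free on top of the add.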

In the context of Gaussian Elimination, since the multiplications should be running with very little real memory bandwidth used, it may be OK to let the (typically single) thread of Gaussian use much more than its fair share of memory bandwidth, and things might go OK with a simple-minded approach (find a new independent row and then clean the rest of the matrix with it, for example).

To be honest, that is probably exactly what I will do next.  It is immediately obvious, however, that by collecting a few rows together before cleaning the rest of the matrix, things will be speeded up considerably.

Firstly more work will be done for each pass of the matrix, so that the memory bandwidth demands will be reduced enormously.  Even doing two rows at once will be a big win in this area.

Secondly, it gives us the opportunity to use grease, and pre-compute a few rows to save yet further work.  If we are working with 30,000 dimensions mod 3, each row takes 7.5 Kilobytes, so there is a limit to what we can do in this area.

Thirdly, if we also chop our rows, we can utilize some of the code used by the HPMI.  This is really only suitable at the 1,000 dimension scale, but that still leaves plenty of scope.

The correct strategy for mod 3 seems to be to take about 1,000 rows and put them into echelon form, and then use the HPMI to clean out the rest of the matrix with them.

The first phase of this is to produce the code to put 1,000 rows into echelon form (and initially to use it to put all the rows into echelon form).  We will have to wait and see whether a second phase (feeding into the HPMI) is needed.

Already the first phase throws up several new problems.  Firstly, for larger primes we wish to add several times before doing any reduction, and it may be quite challenging to predict accurately when the reduction is to be done.  Secondly, for extension fields, it is quite a complex job to do the scalar multiplication.  For both of these, the pressures on the formats are not exactly the same as they are in the HPMI and we might find ourselves inventing new formats in the end.

In the first instance, however, it seems clear that Bfmt is the way to go.  If a version of V = V + x.W can be produced (with V and W in Bfmt), along with extraction (return the i’th field element of the Bfmt vector V) and ideally non-zero location (what is the smallest value of i where the i’th entry in V is non-zero), we will be well on our way to a substantial speed improvement for Gaussian.
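
The non-zero location, at least, is cheap in a two-plane bitsliced layout of the kind sketched earlier (again an illustration with invented names, not necessarily how Bfmt will actually be laid out):

    #include <stdint.h>

    /* Index of the first non-zero entry in a two-plane bitsliced mod 3
       vector of nwords 64-bit words per plane, or -1 if all zero. */
    static long first_nonzero(const uint64_t *p0, const uint64_t *p1, long nwords)
    {
        for (long i = 0; i < nwords; i++) {
            uint64_t nz = p0[i] | p1[i];     /* entry non-zero if either bit set */
            if (nz)
                return i * 64 + __builtin_ctzll(nz);  /* gcc count-trailing-zeros */
        }
        return -1;
    }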

Progress report.

Last week I got a version working using very basic arithmetic (Dfmt) and including a multi-core multiply, but only a single core nullspace and with no split or invert at all.  This has now been handed over to Jon Thackray, and he plans to develop the multi-core Gaussian operations of rank, null-space, split and invert.

I am now concentrating on the huge increase in performance that the High Performance Meataxe Interface can bring, starting with the field of order 2.

This code, albeit to a different specification, has been running for some years now, so it should not take more than a few days to get it working again, though the changes are quite substantial.  I propose, at this stage, to stick with a simple grease level 4 – I do not expect Golay grease to be that much faster, and there are more pressing things to do first.
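
For concreteness, here is a grease level 4 slice in miniature (a C sketch with invented names; the real thing works on 256-byte rows in assembler).  Precompute all 16 sums of a group of 4 rows of B, and thereafter each 4-bit nibble of a row of A costs a single XOR of a whole row into C:

    #include <stdint.h>

    /* Build one grease slice: tab must have room for 16*w words;
       b4 holds the 4 rows of B, each w 64-bit words long. */
    static void grease4_build(uint64_t *tab, const uint64_t *b4, long w)
    {
        for (long j = 0; j < w; j++)
            tab[j] = 0;                              /* tab[0] is the zero row */
        for (int s = 1; s < 16; s++) {
            int low = s & -s;                        /* lowest set bit of s */
            int row = (low == 1) ? 0 : (low == 2) ? 1 : (low == 4) ? 2 : 3;
            for (long j = 0; j < w; j++)
                tab[s * w + j] = tab[(s ^ low) * w + j] ^ b4[row * w + j];
        }
    }

    /* One nibble of a row of A deals with 4 of its columns in one XOR. */
    static void grease4_madd(uint64_t *c, const uint64_t *tab,
                             unsigned nibble, long w)
    {
        for (long j = 0; j < w; j++)
            c[j] ^= tab[nibble * w + j];
    }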

I then propose some considerable timing tests on this to see how much the secondary routines, and in particular making the grease table and converting to Afmt, influence the overall times to decide whether to leave them in “c” or to write the key parts of these in assembler also.

The next stage will be to write the HPMI for mod 3.  As previously described, this will use the logic-6 SSE addition instead of table look-up, so it should be a lot faster.  As well as being of some small practical use, this should clear away most of the design stupidities that I may have made by accidentally relying too much on the characteristic being two.

I then propose to progress to the field of order 4, using the technology of linear forms to reduce this to matrix multiplication mod 2.  This is again as much a design problem as a programming one.
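
For example, writing A = A0 + ω.A1 and B = B0 + ω.B1 with A0, A1, B0, B1 matrices mod 2 and ω a generator of GF(4) (so ω² = ω + 1), one standard set of forms needs just three mod 2 multiplications:

    P0 = A0.B0    P1 = A1.B1    P2 = (A0+A1).(B0+B1)
    A.B = (P0 + P1) + ω.(P0 + P2)

Whether this particular set of forms is the best choice is part of the design problem.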

Hopefully this will all be completed and working by September, when I propose to rejoin Jon Thackray in Cambridge to put together our two sets of work, and hopefully produce a working, top-quality simple meataxe good for the fields of order 2, 3 and 4 and working for all other fields up to 2^64, albeit a lot more slowly.

At that stage, it is expected that the Gaussian programs may well be limited by the speed of the single core “small” Echelon-form proggy, as the multiply, although in principle the vast majority of the work, should be so much faster that it may not be the main performance limiting factor.  The trouble is that the Echelon-form steps form a dependency chain, limiting the amount of concurrency available.  We will wait and see, but I am half expecting to have to address the speed of this stage also before operations like null-space and split can run in times commensurate with multiplication.

Golay grease code for mod 2 arithmetic.

The Golay code is best known as an error correcting code, but it is also a perfect grease code, and it is looking increasingly like meataxe64 is going to have to use it.

The basic grease code consists of 23 vectors in an 11-dimensional space mod 2, such that every vector of the space is the sum of at most three of them.  It is said to be perfect because each vector has a unique representation in this form.  Numerically the “coincidence” is that

23C3 + 23C2 + 23C1 + 23C0 = 1771 + 253 + 23 + 1 = 2048.

We cannot afford a data-dependent branch, so instead we store the zero vector and add that in whenever fewer than three vectors are needed.  Hence 24 vectors are stored, and we can afford to store 5 slices of this, so storing 120 vectors, adding 3×5=15 of them to add in any vector from a 55-space.  We could, of course, store only one zero vector, but that would require us to store 7 bits for each Afmt entry, whereas we can otherwise manage with 5 bits.  With a bit of clever trickery, and by storing a zero between two slices, we could manage with storing just three zeros (118 vectors) though I’m not sure the slight reduction in L1 usage would warrant the extra complexity.

The conversion to Afmt is not too hard.  We can make a look-up table with 2048 entries, giving the three vectors (so 15 bits) that must be added to make it, and this fits comfortably into L1 cache.  Of course the Afmt conversion is done upfront, so the L1 cache is not needed during the working (BrickMad) phase.
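
A sketch of building that table, taking the 23 eleven-bit column vectors as given and relying on perfection to ensure every slot is filled exactly once.  The names are invented, and the real table would pack the three 5-bit indices into 15 bits rather than three bytes:

    #include <stdint.h>

    /* tab[v] lists the (up to three) indices of the stored vectors whose
       sum is v; unused positions hold 23, the index of the stored zero. */
    static void golay_afmt_table(const uint16_t col[23], uint8_t tab[2048][3])
    {
        for (int v = 0; v < 2048; v++)
            tab[v][0] = tab[v][1] = tab[v][2] = 23;
        for (int i = 0; i < 23; i++) {
            tab[col[i]][0] = i;                       /* weight-1 sums */
            for (int j = i + 1; j < 23; j++) {
                int v2 = col[i] ^ col[j];
                tab[v2][0] = i;  tab[v2][1] = j;      /* weight-2 sums */
                for (int k = j + 1; k < 23; k++) {
                    int v3 = v2 ^ col[k];             /* weight-3 sums */
                    tab[v3][0] = i;  tab[v3][1] = j;  tab[v3][2] = k;
                }
            }
        }
    }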

The population of the brick is, however, an outstanding issue.  The 23 vectors we store are clear, but we can use an arbitrary 11×11 matrix to transform the basis we have into the “standard” Golay basis, and our aim is then to find a sequence of additions, as short as possible, to make all 23 required vectors.  It seems natural to choose 11 of the required vectors as a basis, so we only have to make the other 12, though it is far from clear to me so far that this will provide the shortest addition chain.  If we do this, I think these 12 remaining vectors have 6 or 7 set bits each, and it seems clear that an addition chain of length around 60 or 70 can be found fairly easily, though I have not yet really had any good ideas as to how to look.

What does all this gain?  The competitive system is probably to use 8 slices of grease level 4 – 128 vectors stored to do 32 dimensions with 8 additions.  Let’s compare them.

Golay does 55 dimensions with 15 adds (30 mu-ops) and one fetch/store cycle on the answer (3 mu-ops) – 0.6 mu-ops per dimension.

Grease level 4 does 32 dimensions with 8 adds (16 mu-ops) and one fetch/store cycle (3 mu-ops) – 0.59375 mu-ops per dimension.

Not a lot in it so far!

The populated brick is used for 1024 rows of C, so with Golay the (say) 60 x 5 adds for 55 dimensions (600 mu-ops spread over 1024 rows) add another 0.01065 mu-ops per dimension, whereas grease level 4 only needs 88 adds for 32 dimensions – 0.00536 mu-ops per dimension.

Afmt conversion?  For Golay, this will be about five mu-ops for 11 dimensions, but shared amongst, say, 25600 bits of C, which using SSE is 200 registers-full, another 0.00227 mu-ops per dimension, whereas grease level 4 can do five mu-ops for 32 dimensions – 0.00078 mu-ops per dimension.

Totals.

Golay: 0.60000 + 0.01065 + 0.00227 = 0.61292 mu-ops per dim.

GL-4: 0.59375 + 0.00536 + 0.00078 = 0.59989 mu-ops per dim.

Golay is also marginally worse in that the bandwidth required to fetch the Afmt is slightly higher (15 bits instead of 11).

So why is Golay to be considered?  The answer is that it is much lighter in its use of L2 bandwidth – doing one access for each 55 dimensions instead of one for each 32 dimensions.  I am fairly confident that this will more than compensate for the 2%-3% increase in computation in most cases, and fear that I am going to have to implement this!

Steve Linton has already tried this, so it is not new, but I suspect that an attempt to really use this as fast as it will go will nevertheless offer challenges that I might yet baulk at!