Progress on Gaussian Elimination.

The three-layer Gaussian elimination system now seems to be working well.

The top layer is the multi-core chopped-up Gaussian, which has been working for some time.

Steve Linton (with minimal assistance from me!) last week implemented the single-core Gaussian elimination routine to chop yet further and use the HPMI wherever possible, as suggested by his prototype (in GAP) a few weeks ago.

I have now started to address the base-case a bit, where a smallish matrix (currently less than 300 in at least one dimension) must be echelonized, with transformation matrix, using the pool/batch system and (so far) projective level 2 grease where appropriate.  This needs to work without the benefit of the HPMI (the matrices are too small).

The result is that the single-core part is now within a factor of two of multiplication speed, so that the top layer only needs to chop to obtain concurrency – previously it needed to chop much more to avoid the bottleneck of the poor single-core Gaussian.

This complete system now passes the regression tests and – as far as I know – is working.

There remain many improvements still on the “to do” list, and many of them may stay there for some time.  I propose, though, to spend a week or two now testing (mainly performance testing) the system in various environments to see how it manages as the machine, shape, size, field etc. vary, aiming to produce a Gaussian elimination system over finite fields that is hopefully good for all purposes within its range (x86, q < 2^64).


Sieve homomorphisms.

Let r and s be two matrix representations of a group H over the same field F whose characteristic may divide the order of H.  Notice that this concept is basis-dependent – if you change the basis, I consider that a different matrix representation.  Furthermore, I consider the basis to be b0, b1, b2…bN so that the basis elements are numbered.

A homomorphism from r to s is a sieve if every basis vector of r is mapped either to zero or to a basis vector of s later in the sequence than all those that arose from earlier basis elements of r.  In other words, the map consists of deleting some basis elements of r and then inserting some basis elements of s, preserving the order of those retained.  For example, if s has basis c0, c1, c2, c3, then b0 → c1, b1 → 0, b2 → c3 is a sieve, whereas b0 → c2, b1 → 0, b2 → c1 is not, as the order of the retained basis elements is reversed.

If anyone knows of a previous definition of this sort, please let me know.

The initial motivation for this definition comes from consideration of the tensor-condense of R and S for a group G, condensing with the norm element of a subgroup H whose order may not be coprime to the characteristic.  One decomposes the restriction of R and S to H, and then tries to put the summands of r and s into a basis such that all homomorphisms from r to s are sieves.  If this can be done, the operation at the heart of the tensor condense algorithm is (I think!) a “dot-product” of rectangles – the sum over all i and j of Rij * Sij – once row- and column-selects have been applied according to the sieve.

I call two sets of matrix representations R and S sieve compatible if the set of homomorphisms from any element of R to any element of S has a basis of sieves.

The concept arises already in the absolutely irreducible case – we must put the matrices for equivalent representations into a “standard basis” so that the homomorphism is a scalar.

More generally, we can try to achieve sieve-compatibility by putting all the representations into a “very standard basis” – whatever that means!  I am hoping that it means something, and that programs can be written to convert to it, hopefully simplifying other things besides tensor-condense.

Elementary Gaussian Elimination.

Steve Linton’s recent prototyping work has made it very clear that, in meataxe64, the performance of Gaussian elimination depends almost entirely on the speed of matrix multiplication, and the performance of the “base case” – perhaps around 500 dimensions – is of little significance.

Despite this, my perfectionist nature insists that the base case be well implemented, and this post consists of my armchair thoughts as to how to do this.  Implementation and testing will come later.

There are two obvious strategies for Gaussian elimination, and (spoiler) they are both wrong!

In one approach, each row is “cleaned” (has all the pivotal columns cleared to zero) with all previous ones, and then, if non-zero, a pivot in this new row is located and we pass on to the next row.

In the other approach, if a row is non-zero, a pivot is located and all the other rows have this column cleared.  There are several variants depending on which row is taken, and whether the previous rows have the pivot cleared or just the later ones.

There are several things wrong with all of these.  Firstly, they are not cache-friendly – when doing R += y . S, either R or S is used only once, which is not good.  Secondly, they ignore the possibility of using “grease” to reduce the work when the field is small.  Thirdly, the extraction of the single (pivotal) value from each row is not very fast (again for small fields) and it is better to extract strings of consecutive pivotal values if at all possible.

Once these issues are isolated, it is not too hard to see how to address them.  Furthermore, we can do so and still use “standard” columns and rows.

The idea is to use the first method for a few rows to get a “pool” of rows in a sort-of reduced echelon form – each pivot is the only non-zero entry in its column within the pool, though later rows may have earlier pivots.  Every now and then (perhaps every five vectors) we inspect the pool to see if there is a suitable “batch” of perhaps 5-10 vectors.  The choice of a batch is aimed at being cache-friendly (so is several vectors), suitable for grease and – perhaps most tricky – suitable for fast extraction.  This batch is then greased and used to clear the pivots of the batch of rows from the entirety of the matrix.

The best form of grease to use depends on the field.  As far as I can see so far, there are three cases.

Characteristic Two.  As XOR is so much faster than the table look-up needed for extension fields in characteristic two, an “affine” grease is suggested, where all multiples of a vector (in a 1-space, 2-space or even 4-space for GF(4)) are stored, and only addition is used for the main clearing work.  For large fields of characteristic two, it may well be better to do several adds rather than a multiplication, though the PCLMUL instruction limits the scope for this somewhat.
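As a toy model of affine grease at level 2 over GF(2) (the function names here are illustrative, not meataxe64's actual API): all four GF(2)-combinations of two pivot rows are tabulated once, after which clearing both pivotal columns of any target row costs a single table lookup and XOR.

```c
#include <stdint.h>

/* Affine grease, level 2, over GF(2): tabulate the four combinations of two
   pivot rows (one 64-bit word of each row shown here), so that clearing both
   pivotal columns of a target row needs one table lookup and one XOR. */
static void grease2_table(uint64_t table[4], uint64_t r0, uint64_t r1){
    table[0] = 0;
    table[1] = r0;
    table[2] = r1;
    table[3] = r0 ^ r1;
}

/* clear pivotal columns 0 and 1 of row, assuming r0 has a 1 in column 0
   only and r1 in column 1 only (reduced echelon form within the pool) */
static uint64_t grease2_clear(uint64_t row, const uint64_t table[4]){
    return row ^ table[row & 3];
}
```

With the pivots in columns 0 and 1, `row & 3` extracts exactly the two pivotal bits, so one XOR replaces up to two row additions; at a higher grease level the same table idea replaces correspondingly more.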

Small fields of odd characteristic. In these cases, multiply-and-add is not much slower than add (both involve byte-wide table look-up(s)) so a “projective” grease is suggested, where only one vector is stored in each 1-space.  This seems appropriate for fields of order up to a few hundred.
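A scalar sketch of the projective idea, with p = 5 purely for illustration: only one representative per 1-space is stored, normalised so its pivot entry is 1, and any multiple of it is applied with a single multiply-and-add rather than being stored.

```c
#include <stdint.h>

enum { P = 5 };   /* illustrative small odd prime */

/* Projective grease: one stored row per 1-space, normalised so rep[piv] == 1.
   Clearing the pivot entry of another row is then one multiply-and-add pass,
   with no need to store the other P-2 multiples of rep. */
static void clear_pivot(uint8_t *row, const uint8_t *rep, int n, int piv){
    uint8_t s = (uint8_t)((P - row[piv]) % P);   /* scalar so pivot cancels */
    for(int i = 0; i < n; i++)
        row[i] = (uint8_t)((row[i] + s * rep[i]) % P);
}
```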

Large fields. For fields of order over a few hundred, there is no point doing grease at all.

Meataxe64 has no concept of fast extraction at the moment.  I have seen some pressure for this in at least two other circumstances – handling polynomials and the fast multiplication of a sparsified matrix – so I think the time has come to figure out how best to do this.  As an example of the problem . . . when working mod 2, the simple-minded way would extract the entries of a byte one bit at a time and then assemble them into a byte again, taking over 100 nano-seconds to load a byte!  The difficulty is to devise a robust field-independent way of managing to simply load the byte most of the time.
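To make the mod-2 problem concrete, here is the simple-minded bit-at-a-time extraction next to the byte load we would like to do most of the time; the LSB-first bit order within a byte is an assumption for illustration, not a statement about meataxe64's actual Dfmt.

```c
#include <stdint.h>
#include <stddef.h>

/* simple-minded: pick 8 consecutive mod-2 entries out one bit at a time */
static uint8_t extract_naive(const uint8_t *row, size_t col0){
    uint8_t v = 0;
    for(int i = 0; i < 8; i++){
        size_t c = col0 + i;
        v |= (uint8_t)(((row[c >> 3] >> (c & 7)) & 1u) << i);
    }
    return v;
}

/* what we would like: when the 8 pivotal columns are consecutive and
   byte-aligned, the whole extraction is a single load */
static uint8_t extract_fast(const uint8_t *row, size_t col0){
    return row[col0 >> 3];            /* requires col0 % 8 == 0 */
}
```

The robust field-independent scheme would have to notice when the aligned-consecutive case applies and fall back to the slow path only when it does not.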

Multiply timings on 64xPiledriver.

I have been testing for performance on the 64-core Piledriver machine in St. Andrews.  Here are the results for multiplying random square NxN matrices over various smallish fields.

F – dimension – time
2 – 400,000 – 30m 28s
3 – 200,000 – 16m 54s
4 – 200,000 – 11m 50s
5 – 100,000 – 10m 12s
7 – 100,000 – 14m 30s
8 – 160,000 – 12m 37s
9 – 160,000 – 26m 10s
11 – 100,000 – 14m 32s
13 – 100,000 – 14m 45s
16 – 160,000 – 18m 45s
17 – 80,000 – 18m 49s
23 – 80,000 – 18m 51s
25 – 80,000 – 16m 29s
27 – 100,000 – 14m 18s
31 – 80,000 – 18m 52s
32 – 125,000 – 14m 28s
37 – 80,000 – 18m 51s
47 – 80,000 – 18m 53s
49 – 64,000 – 11m 56s
59 – 80,000 – 18m 53s
64 – 100,000 – 11m 19s
73 – 64,000 – 11m 54s
81 – 80,000 – 11m 0s
97 – 64,000 – 14m 58s
109 – 64,000 – 14m 58s
121 – 60,000 – 10m 5s
125 – 50,000 – 6m 57s
127 – 64,000 – 15m 0s
128 – 80,000 – 8m 20s

Reorganization of Assembler code.

I have decided to move the decision as to which assembler routines to call from compile-time to run-time.  There are three reasons for this.

The first, which Klaus mentioned a couple of years back, is that it can be convenient to install the programs, already compiled, into a directory mounted on different machines.  This requires the compiled code to run on all the machines, thereby failing to take advantage of those with a more powerful architecture.  The current change will fix part of that, but still leave the number of threads and memory size compiled in – an issue which may require attention later.

The second arises from the “large fields of characteristic 2” work, which needs the CLMUL (“carry-less multiplication”) flag, and from the HPMI for 5-181 work, which needs SSE4.1 as it stands.  I have previously taken the view that “modern” machines have all these instructions, and that meataxe64 needs a modern enough machine, but I feel that approach, although probably acceptable, is not ideal.

The third reason is that Markus and now Steve have been trying to incorporate meataxe64 stuff into GAP, and the old approach is not really acceptable in that environment.

The “cpuid” function of the x86 returns a large and confusing wealth of detail on the machine, and it is a heavy job to pick through all these flags and values.  I propose, therefore, to write an assembler routine to summarize this into a few bytes of data more suitable for choosing between routines and strategies in meataxe64.  The main properties of an (x86-64) machine are given by a single byte, which I call the class.  The values I propose are as follows, where each class implies all the previous ones also.
‘a’ – All machines, including those where LAHF and CMPXCHG16 don’t work.  I do not know whether even linux or the output of gcc would run on these!
‘b’ – LAHF and CMPXCHG16 have been fixed and SSE3 is available.
‘c’ – SSSE3 (supplemental SSE3) is available.
‘d’ – SSE4.1 is available.  This is critical for primes 17-61, which ideally work in 10-bit fields but this needs PMULDQ.  Otherwise these primes must work in 16-bit fields, making them about 50% slower.
‘e’ – SSE4.2 is available.  This includes the byte-string processing instructions, which look useful, though I haven’t used them yet.
‘f’ – CLMUL is available.  This allows the “large fields of characteristic 2” stuff, speeding them up by a factor over 1,000.
‘g’ – AVX.  This allows 256-bit working, but for floating point only, which is not much use in meataxe64.  Nevertheless the VEX prefix clears the top half (allowing better mixing of SSE and AVX) and gives 3-operand instructions, which are of some minor use.
‘h’ – BMI1.  A few useful bit-manipulation instructions.
‘i’ – BMI2.  Some more bit-manipulation instructions, including PEXT/PDEP which look useful, though again I haven’t used them (yet).
‘j’ – AVX-2.  This allows the use of 256-bit registers for integral work.

To the user, the main question is whether the machine is class ‘j’ (Haswell, Broadwell, Skylake, Ryzen) or not, affecting the speed by about a factor of two in most cases.

To the programmer, this allows fingertip control over each routine separately, so that the large-fields-in-characteristic-2 code tests class >= ‘f’ and the 10-bit working tests class >= ‘d’, avoiding the previous ludicrous situation where 10-bit working was not allowed because CLMUL was not implemented.

No doubt further classes will be added later – AVX-512F is looming large on the horizon now.

After this change, an old machine can only use meataxe64 if it can at least assemble the new instructions.  I suspect this will also need to be addressed in the fullness of time.

One effect of this is that it is now worthwhile to implement the AS-stuff in 16 bits, which will speed up 5-13 and 67-181 by a few percent and also allow 17-61 to work on classes ‘a’-‘c’ – neither of great importance, but it has been irritating me for a while.

My first target is to get it all working on class ‘a’ – not that I have a class ‘a’ machine to test it on!  A glance at the top 500 supercomputers suggests that it is classes ‘i’ and ‘j’ that are actually important at the moment, so these will get particular attention.

If anyone wants it for other purposes, the “mactype” routine will be available for all to use, as I suspect the issues discussed here crop up elsewhere.  Not that I’ve actually written it yet – I’ll post again when all this works.  Unless unexpected snags arise, this shouldn’t take more than a few days.

In return, if anyone has computers I can use for testing (probably in December) I’d appreciate a logon for a day or so.  One urgent need is for an Intel i9 with AVX-512, so I can start work on those routines.  The other need is at the other end.  I have no access to a VIA 2000 nor a VIA 3000, nor the older Intel and AMD machines from 2003-2010 (core-2, Athlon, Merom, Nehalem, Westmere etc.) and I’d like to run the regression tests, and probably some performance tests as well, to check that I’ve got things right (and to fix them if I haven’t).

Large fields of characteristic 2.

In 2010, the Intel “Westmere” processor introduced the “PCLMULQDQ” instruction, which multiplies two 64-bit mod-2 polynomials together to give a 128-bit result, and in the more recent processors it is pretty fast.  Up to 16-bit fields, the use of logarithm tables has been retained, but from 17-63-bits inclusive, the instruction is now used for the one-vector-at-a-time work.  Indeed the use of this instruction looks so good that it may well be the fastest method available for fields over about 2^16, and I therefore have no current plans to write HPMI code for such fields.  For example, a 5,000 x 5,000 multiply over the field of order 2^35 = 34359738368 takes 8 mins 13 seconds on my 2-core Haswell laptop.

The new code implements h += f.g as polynomials modulo the Conway polynomial p(X) in characteristic 2, using the PCLMULQDQ instruction and Barrett (I think it is called) reduction.  Note that I have used ‘^’ below to denote exponentiation, and ‘+’ to denote exclusive-or.  The degree of the field is d.

The first step is to compute upfront, as soon as we know the field, e(X) = X^2d / p(X) (discarding the remainder) where d is the degree of the field.

To compute f(X) * g(X) modulo p(X), we first use the instruction to compute f(X)*g(X) as a polynomial.  We express this as a(X).X^d + b(X) and use the instruction a second time to multiply a(X) by e(X), and again chop into two halves a(X) * e(X) = s(X).X^d + t(X), discarding t.  A third use of the instruction is used to compute s(X).p(X) = q(X).X^d + r(X), and the answer we require is then r(X)+b(X).

To see this, we note the following . . .

1) X^2d = e(X).p(X) + u(X) with degree e(X) = d and degree u(X) < d.

2) f(X).g(X) = a(X).X^d + b(X) with degree of a(X) and b(X) < d.

3) a(X).e(X) = s(X).X^d + t(X) with degree of t(X) < d.

4) s(X).p(X) = q(X).X^d + r(X) with degree of r(X) < d.

Claim. q(X)=a(X).   Proof.

a(X).X^2d = a(X).[e(X).p(X) + u(X)]
          = a(X).e(X).p(X) + a(X).u(X)
          = [s(X).X^d + t(X)].p(X) + a(X).u(X)
          = s(X).X^d.p(X) + t(X).p(X) + a(X).u(X)
          = s(X).p(X).X^d + t(X).p(X) + a(X).u(X)
          = [q(X).X^d + r(X)].X^d + t(X).p(X) + a(X).u(X)
          = q(X).X^2d + r(X).X^d + t(X).p(X) + a(X).u(X)

As all terms except the first have degree less than 2d, the result follows.

The instruction naturally chops up a product into y(X).X^64 and z(X), so in practice we shift the things we multiply left a total of 64-d positions.
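The whole scheme can be checked in a few lines of C with a software carry-less multiply, working with d = 8 and the irreducible polynomial X^8+X^4+X^3+X^2+1 (0x11D) so that every product fits in 64 bits.  This is a model only: real meataxe64 would use the Conway polynomial, the hardware instruction, and the 64-d shift just mentioned.

```c
#include <stdint.h>

/* carry-less multiply of two mod-2 polynomials (operands below 2^32 here) */
static uint64_t clmul(uint64_t a, uint64_t b){
    uint64_t r = 0;
    for(int i = 0; i < 32; i++)
        if((b >> i) & 1) r ^= a << i;
    return r;
}

static int pdeg(uint64_t x){ int d = -1; while(x){ d++; x >>= 1; } return d; }

/* quotient of mod-2 polynomial division -- used once to precompute e(X) */
static uint64_t pdiv(uint64_t num, uint64_t den){
    uint64_t q = 0; int dd = pdeg(den);
    while(pdeg(num) >= dd){
        int s = pdeg(num) - dd;
        q   ^= 1ULL << s;
        num ^= den << s;
    }
    return q;
}

/* reference reduction by long division, for checking only */
static uint64_t pmod(uint64_t num, uint64_t den){
    int dd = pdeg(den);
    while(pdeg(num) >= dd) num ^= den << (pdeg(num) - dd);
    return num;
}

/* f.g mod p by the three-multiplication Barrett scheme; e = X^2d / p */
static uint64_t gfmul(uint64_t f, uint64_t g, uint64_t p, uint64_t e, int d){
    uint64_t mask = (1ULL << d) - 1;
    uint64_t fg = clmul(f, g);              /* first multiply             */
    uint64_t a = fg >> d, b = fg & mask;    /* f.g = a.X^d + b            */
    uint64_t s = clmul(a, e) >> d;          /* second:  a.e = s.X^d + t   */
    uint64_t r = clmul(s, p) & mask;        /* third:   s.p = q.X^d + r   */
    return r ^ b;                           /* the answer is r + b        */
}
```

An exhaustive check over GF(2^8) confirms that the three multiplications agree with long-division reduction for every pair of field elements.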


Small-scale Gaussian Elimination.

Some measurements on a single core suggest (depending a bit on the field) that the HPMI is, for matrices over 5,000 dimensions, roughly 20 times faster than doing things in Dfmt without grease.  At about 100 dimensions, however, the overhead of the HPMI and all the associated conversions means that the two methods are about the same speed.  It is therefore necessary to consider how best to perform “echelization” in Dfmt on matrices of a few hundred dimensions – any larger, and it is better to chop the matrix up and use the HPMI for the matrix multiplications.

For small fields we would like to use grease, but we would like to avoid the overhead of computing the necessary multiples.  This is possible.  The idea is to use the vectors we actually have rather than make new ones.  I call the method “auto-grease”, though it may not be new and may already have a name.

It is important to realize too that with the dimension in the hundreds, we may reasonably assume that the entire matrix fits in a low-enough cache that memory access time is not significant.

As an example, consider mod 3 doing 4 dimensions at a time with one add/subtract – grease level 4.  We proceed in three stages.

The first stage proceeds as normal to find the first four vectors that are linearly independent and to choose four columns that show this.  We then put these four vectors into reduced echelon form . . . I call these the “white” vectors.

The middle stage (the main one) then takes each remaining vector.  If a single add/subtract can clear all four pivotal columns, this is done, and otherwise the vector is simply added to the list of vectors to be used for such clearing, which I call the grey list.  There will be at most 40 vectors on the lists in all – four white and at most 36 grey.

The third stage clears the pivotal columns on the grey vectors.  This can usually be done with two add/subtracts, but this requires some care . . .

If our objective is simply to echelize, we may do any add/subtracts, but if we require “standard” rows (as in the standard basis procedure) we are never allowed to add a later vector into an earlier one unless we can be sure that this will not turn that row from a selected one into a non-selected one.  Hence this stage (only) may be a little slower in the standard rows case.

For larger fields, at least of odd characteristic, we may utilize the multiply-and-add capability, which is not much slower than a simple add/subtract.  Hence more generally we collect projective vectors, so (for example) over the field of order 27 we collect 28 projective vectors in a 2-space, halving the work.

In meataxe64, addition in characteristic 2 uses vector-length XOR even in Dfmt, so that addition is at least ten times faster than a multiply-and-add by a multiplier that is neither 0 nor 1 (so the field order must be at least 4 for this to happen).  This makes it appropriate to collect all multiples and do addition, rather than relying on multiply-and-add.  In other words, the grease is “affine”, not “projective”.

All this effectively means that we can do echelization over fields of order up to a hundred or so using grease without any extra work to make the multiples.

All “small” fields now using HPMI.

By implementing the AS-coded primes 5-181 (now including SSE so working on older computers also), generic cubic extensions (of prime fields) and a few particular small fields {16,32,64,128,256,512,81,243,625}, I think it fair to say that meataxe64 is fast for all small fields, with the smallest field not using the HPMI now being 191.

My 4 GB laptop can multiply matrices 20,000 x 20,000 up to q=16 in at most 90 seconds, up to q=181 in 6 minutes or less, and q=5929741 (181^3) took 43 minutes.  In fact, with multiplication running in memory, I don’t think there is a multiplication in an HPMI field that fits in 4 GB of memory and takes over an hour, though I haven’t looked hard.

Typical big computers these days may have at least 32 cores and at least 256GB, for which I would estimate that a multiplication filling memory would take a day or two.  Even with a Terabyte, no run taking more than a week or two would fit in memory.

It is clear, therefore, that the next thing to look at is to work with matrices that are too large to fit in memory.

HPMI now working for primes up to 181.

The AS-based scheme of the previous post now seems to be working, and at much the expected performance – for a 20,000 x 20,000 dense multiply on my 1.7 GHz 2-core laptop, taking 65 seconds mod 5, 95 seconds mod 7,11,13, 225 seconds for primes 17-61 and 345 seconds for primes 67-181.  Gaussian elimination uses it as well, and (depending on exactly what is being done) should run at comparable speeds.  Quadratic extensions all use this HPMI too (using Karatsuba), and take between 3 and 4 times as long as the prime field.  All fields under 100 except 64 now multiply 20,000 matrices in under 6 minutes on the laptop.  Ought to do 64 as well, I suppose.

It currently uses AVX-2, so runs on Haswell or later, or Ryzen, but not on older chips.  I’ll probably get around to rewriting the two assembler routines using SSE at some point soon.

HPMI for primes between 5 and 181.

I am now about half-way through the detailed design for an HPMI implementation for AVX-2 for all the primes between 5 and 181 inclusive, and it is all looking pretty good so far.  There are six main routines to write, and in order of (performance) importance they are . . .

BwaMad.  I propose to use seven slices of AS-code of 128-byte length.  Hence there are 7 adds and 7 subtracts of four registers.  Entries will be 8 bits {5-13}, 10 bits {17-61} and 16 bits {67-181}.  Afmt is 8 bytes – skip/terminate and 7 bytes, each of two 4-bit fields specifying which of 16 rows to add and which to subtract.  After all 14 arithmetic operations, an “adequate” reduction is done by ‘and’ with a mask of the top bits, ‘xor’ back in to clear those bits, ‘shift’ to bring them into the units position, ‘32-bit multiply’ by the relevant power of 2 modulo p, and ‘add’ of this back in.  Finally we add in a “bias” – a multiple of p sufficient to protect the subtractions from underflow.  The arithmetic looks like 23 mu-ops for each of the add/subtract pairs at 128-byte length, and the adequate reduction 43 mu-ops for the draft code I have written, but not yet tested.
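A scalar model of the “adequate” reduction for a single 10-bit entry may make the sequence clearer; lane widths, masks and the bias step in the real AVX-2 code will of course differ.

```c
#include <stdint.h>

/* "Adequate" (not full) reduction of one 10-bit entry: mask off the bits
   above the field, clear them with an xor, shift them to the units position,
   multiply by 2^10 mod p and add back in.  The result is congruent to x
   mod p but only boundedly reduced; the "bias" step is omitted here. */
static uint32_t adequate_reduce(uint32_t x, uint32_t p){
    uint32_t top = x & ~0x3FFu;          /* 'and' with mask of top bits  */
    uint32_t lo  = x ^ top;              /* 'xor' back in to clear them  */
    uint32_t hi  = top >> 10;            /* 'shift' to units position    */
    return lo + hi * ((1u << 10) % p);   /* 'multiply' and 'add' back in */
}
```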

BGrease. The pre-computation of all the multiples of the input vectors follows an addition chain, and (at least initially) I propose to use an assembler program to interpret a byte-code to direct this, and to input this byte code by hand.  Here the entries need to be completely reduced to the range 0 . . . p-1, and it is this reduction that takes the most time.  The first step in reduction is to ‘add’ <2^n – p> for some n then  ‘and’ with 2^n.  This gets a bit which is set if and only if we want to subtract p.  We therefore shift this bit into the units position, and subtract from the unshifted bit, resulting in either 0 if we don’t want to subtract p and all-1 if we do.  ‘and’ with p then makes the number we want to subtract, and we then subtract it.
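In scalar C (the real version would be vectorized, and the routine name here is mine), the branch-free conditional subtract described above reads, for an n-bit value x with 0 <= x < 2^n and x < 2p:

```c
#include <stdint.h>

/* Branch-free "subtract p iff x >= p", exactly as described: add 2^n - p,
   'and' with 2^n to get a bit set iff x >= p, turn that bit into a mask of
   all-1s (low n bits) or 0, 'and' with p, and subtract. */
static uint32_t cond_subtract_p(uint32_t x, uint32_t p, int n){
    uint32_t t    = x + ((1u << n) - p);   /* bit n set iff x >= p      */
    uint32_t hi   = t & (1u << n);         /* isolate that bit          */
    uint32_t mask = hi - (hi >> n);        /* all-1 if set, else 0      */
    return x - (mask & p);
}
```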

BSeed.  The conversion of Dfmt into the Brick Work Area needs to space the entries out into 8-bit {5-13}, 10-bit {17-61} or 16-bit {67-181} fields.  For {5,7,11,13} a look-up table for one byte of Dfmt into 3 {5} or 2 {7,11,13} bytes of data will be needed.  For {17-61} I propose to do the packing of three bytes into one 32-bit integer (three 10-bit fields) by shifting and adding.  For {67-181} it is sufficient to load a uint8_t and store it as a uint16_t.
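The 10-bit packing is just a couple of shifts and adds per 32-bit word; a hypothetical scalar version:

```c
#include <stdint.h>

/* BSeed for the 10-bit case {17-61}: pack three one-byte Dfmt entries into
   one 32-bit integer as three 10-bit fields by shifting and adding. */
static uint32_t pack3(uint32_t a, uint32_t b, uint32_t c){
    return a + (b << 10) + (c << 20);
}
```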

DtoA.  The conversion of Dfmt into the Afmt (skip/terminate then 7 bytes of two nybbles) is a simple byte/byte lookup table in all cases.  The AS code for p=5 has grease level 3, and for 7,11,13 has grease level 2 – a fortunate coincidence.  This table will be computed at the start of the program run.

DtoB.  Bfmt will just be Dfmt, as the expansion required would increase the memory bandwidth too much.  It is therefore left to BSeed to do the conversion work.

CtoD.  This requires the full reduction modulo p of relatively large numbers – up to 23835 in the case of p=173 (worst case).  Initially I propose to do this in ‘c’ using a Barrett-41 scheme for each number individually.  Each number is multiplied by the constant <(2^41)/p rounded up>, shifted right 41 places, multiplied by p and then subtracted from the original number.  In all cases needed this performs a complete reduction modulo p.  It may be that, for small primes e.g. {5,7,11,13} this will not be fast enough, and a vectorized assembler routine will be desirable.  Having said that, this is probably not the “perfect” way to do these primes anyway, so I may not bother.
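In C the per-entry Barrett-41 step is short; for the ranges arising here (x up to 23835) it gives the fully reduced value in 0..p-1.

```c
#include <stdint.h>

/* Barrett-41 reduction: multiply by <(2^41)/p rounded up>, shift right 41
   places, multiply by p and subtract from the original number.  For the
   CtoD ranges the quotient q comes out exactly floor(x/p), so the result
   is completely reduced. */
static uint32_t barrett41(uint32_t x, uint32_t p){
    uint64_t c = ((1ULL << 41) + p - 1) / p;   /* ceil(2^41/p), precomputed */
    uint64_t q = ((uint64_t)x * c) >> 41;
    return x - (uint32_t)(q * p);
}
```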

The design originally arose as the best way I know for primes in the 101-181 range, so in some sense needs to be done anyway, but it now seems clear that even for p=5 the result will be quite good.  Grease level 3 and working with one-byte fields means that adding and subtracting one AVX-2 register does 32 bytes, or 96 elementary operations in one clock-cycle.  Of course all the other work will take its toll on this, but I think we can hope for 48 elementary operations per clock, which equates to a 20,000 multiply in about one minute on my two core laptop.  The current (ungreased, table-lookup based) version takes 25 minutes for this.  The Logic-13 approach would, I think, be faster for p=5 (and logic-15 for p=7) but probably only twice as fast.  Anyway . . . the idea of producing an HPMI that is good for so many primes at once is too tempting.

Initially I propose to write the routines for p=67 as this is the simplest case.  Once this is working, I can check that it is as good as I hope (and maybe do something about it if not) before doing all the work needed to bring all the other primes on stream.

The instructions needed for the assembler work all exist in SSE, so a 128-bit Nehalem version shouldn’t be too hard, though it will run at about half-speed.