Multiply timings on 64xPiledriver.

Been testing for performance on the 64-core Piledriver machine in St. Andrews.  Here are the results for multiplying random square N x N matrices over various smallish fields.

F – dimension – time
2 – 400,000 – 30m 28s
3 – 200,000 – 16m 54s
4 – 200,000 – 11m 50s
5 – 100,000 – 10m 12s
7 – 100,000 – 14m 30s
8 – 160,000 – 12m 37s
9 – 160,000 – 26m 10s
11 – 100,000 – 14m 32s
13 – 100,000 – 14m 45s
16 – 160,000 – 18m 45s
17 – 80,000 – 18m 49s
23 – 80,000 – 18m 51s
25 – 80,000 – 16m 29s
27 – 100,000 – 14m 18s
31 – 80,000 – 18m 52s
32 – 125,000 – 14m 28s
37 – 80,000 – 18m 51s
47 – 80,000 – 18m 53s
49 – 64,000 – 11m 56s
59 – 80,000 – 18m 53s
64 – 100,000 – 11m 19s
73 – 64,000 – 11m 54s
81 – 80,000 – 11m 0s
97 – 64,000 – 14m 58s
109 – 64,000 – 14m 58s
121 – 60,000 – 10m 5s
125 – 50,000 – 6m 57s
127 – 64,000 – 15m 0s
128 – 80,000 – 8m 20s


Reorganization of Assembler code.

I have decided to move the decision as to which assembler routines to call from compile-time to run-time.  There are three reasons for this.

The first, which Klaus mentioned a couple of years back, is that it can be convenient to install the programs, already compiled, into a directory mounted on different machines.  This requires the compiled code to run on all the machines, thereby failing to take advantage of those with a more powerful architecture.  The current change will fix part of that, but still leave the number of threads and memory size compiled in – an issue which may require attention later.

The second arises from the “large fields of characteristic 2” work, which needs the CLMUL (“carry-less multiplication”) flag, and from the HPMI for 5-181 work, which needs SSE4.1 as it stands.  I have previously taken the view that “modern” machines have all these instructions, and that meataxe64 needs a modern enough machine, but I feel that approach, although probably acceptable, is not ideal.

The third reason is that Markus and now Steve have been trying to incorporate meataxe64 stuff into GAP, and the old approach is not really acceptable in that environment.

The “cpuid” function of the x86 returns a large and confusing wealth of detail on the machine, and it is a heavy job to pick through all these flags and values.  I propose, therefore, to write an assembler routine to summarize this into a few bytes of data more suitable for choosing between routines and strategies in meataxe64.  The main properties of an (x86-64) machine are given by a single byte, which I call the class.  The values I propose are as follows, where each class implies all the previous ones also.
‘a’ – All machines, including those where LAHF and CMPXCHG16B don’t work.  I do not know whether even Linux or the output of gcc would run on these!
‘b’ – LAHF and CMPXCHG16B have been fixed and SSE3 is available.
‘c’ – SSSE3 (Supplemental SSE3) is available.
‘d’ – SSE4.1 is available.  This is critical for primes 17-61, which ideally work in 10-bit fields but this needs PMULDQ.  Otherwise these primes must work in 16-bit fields, making them about 50% slower.
‘e’ – SSE4.2 is available.  This includes the byte-string processing instructions, which look useful, though I haven’t used them yet.
‘f’ – CLMUL is available.  This allows the “large fields of characteristic 2” stuff, speeding them up by a factor over 1,000.
‘g’ – AVX.  This allows 256-bit working, but for floating point only, which is not much use in meataxe64.  Nevertheless the VEX prefix clears the top half (allowing better mixing of SSE and AVX) and gives 3-operand instructions, which are of some minor use.
‘h’ – BMI1.  A few useful bit-manipulation instructions.
‘i’ – BMI2.  Some more bit-manipulation instructions, including PEXT/PDEP which look useful, though again I haven’t used them (yet).
‘j’ – AVX-2.  This allows the use of 256-bit registers for integral work.
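As a sketch of the idea (not the actual “mactype” routine, which is to be written in assembler against cpuid directly), GCC’s run-time feature tests can already derive the class byte.  The feature-name strings and the function name below are my assumptions, and the LAHF/CMPXCHG16B fix separating ‘a’ from ‘b’ is not probed here:

```c
/* Sketch of the proposed "class" byte using GCC's run-time CPU feature
   tests instead of raw cpuid.  Each class implies all the previous ones,
   so we walk up the list until a feature is missing.  The LAHF/CMPXCHG16B
   fix that separates 'a' from 'b' is not tested here - we conservatively
   report 'a' whenever SSE3 is absent. */
#include <assert.h>

char mactype_class(void)
{
    __builtin_cpu_init();
    if (!__builtin_cpu_supports("sse3"))   return 'a';
    if (!__builtin_cpu_supports("ssse3"))  return 'b';
    if (!__builtin_cpu_supports("sse4.1")) return 'c';
    if (!__builtin_cpu_supports("sse4.2")) return 'd';
    if (!__builtin_cpu_supports("pclmul")) return 'e';
    if (!__builtin_cpu_supports("avx"))    return 'f';
    if (!__builtin_cpu_supports("bmi"))    return 'g';
    if (!__builtin_cpu_supports("bmi2"))   return 'h';
    if (!__builtin_cpu_supports("avx2"))   return 'i';
    return 'j';
}
```

A Haswell or Ryzen machine would report ‘j’ here; anything older stops at the first missing feature.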

To the user, the main question is whether the machine is class ‘j’ (Haswell, Broadwell, Skylake, Ryzen) or not, affecting the speed by about a factor of two in most cases.

To the programmer, this allows fingertip control over each routine separately, so that the large fields in characteristic 2 test class >= ‘f’ and the 10-bit working tests class >= ‘d’, avoiding the previous ludicrous situation where 10-bit working was not allowed because CLMUL was not implemented.

No doubt further classes will be added later – AVX-512F is looming large on the horizon now.

After this change, an old machine can only use meataxe64 if it can at least assemble the new instructions.  I suspect this will also need to be addressed in the fullness of time.

One effect of this is that it is now worthwhile to implement the 16-bit AS-stuff, which will speed up 5-13 and 67-181 by a few percent and also allow 17-61 to work on classes ‘a’-‘c’ – neither of great importance, but it has been irritating me for a while.

My first target is to get it all working on class ‘a’ – not that I have a class ‘a’ machine to test it on!  A glance at the top 500 supercomputers suggests that it is classes ‘i’ and ‘j’ that are actually important at the moment, so these will get particular attention.

If anyone wants it for other purposes, the “mactype” routine will be available for all to use, as I suspect the issues discussed here crop up elsewhere.  Not that I’ve actually written it yet – I’ll post again when all this works.  Unless unexpected snags arise, this shouldn’t take more than a few days.

In return, if anyone has computers I can use for testing (probably in December) I’d appreciate a logon for a day or so.  One urgent need is for an Intel i9 with AVX-512, so I can start work on those routines.  The other need is at the other end.  I have no access to a VIA 2000 nor a VIA 3000, nor the older Intel and AMD machines from 2003-2010 (core-2, Athlon, Merom, Nehalem, Westmere etc.) and I’d like to run the regression tests, and probably some performance tests as well, to check that I’ve got things right (and to fix them if I haven’t).

Large fields of characteristic 2.

In 2010, the Intel “Westmere” processor introduced the “PCLMULQDQ” instruction, which multiplies two 64-bit mod-2 polynomials together to give a 128-bit result, and in the more recent processors it is pretty fast.  Up to 16-bit fields, the use of logarithm tables has been retained, but from 17 to 63 bits inclusive, the instruction is now used for the one-vector-at-a-time work.  Indeed the use of this instruction looks so good that it may well be the fastest method available for fields over about 2^16, and I therefore have no current plans to write HPMI code for such fields.  For example, a 5,000 x 5,000 multiply over the field of order 2^35 = 34359738368 takes 8 mins 13 seconds on my 2-core Haswell laptop.

It implements h += f.g as polynomials modulo the Conway polynomial p(X) in characteristic 2, using the PCLMULQDQ instruction and Barrett (I think it is called) reduction.  Note that I have used ‘^’ below to denote exponentiation, and ‘+’ to denote exclusive-or.  The degree of the field is d.

The first step is to compute upfront, as soon as we know the field, e(X) = X^2d / p(X) (discarding the remainder) where d is the degree of the field.

To compute f(X) * g(X) modulo p(X), we first use the instruction to compute f(X)*g(X) as a polynomial.  We express this as a(X).X^d + b(X) and use the instruction a second time to multiply a(X) by e(X), and again chop into two halves a(X) * e(X) = s(X).X^d + t(X), discarding t.  A third use of the instruction is used to compute s(X).p(X) = q(X).X^d + r(X), and the answer we require is then r(X)+b(X).

To see this, we note the following . . .

1) X^2d = e(X).p(X) + u(X) with degree e(X) = d and degree u(X) < d.

2) f(X).g(X) = a(X).X^d + b(X) with degree of a(X) and b(X) < d.

3) a(X).e(X) = s(X).X^d + t(X) with degree of t(X) < d.

4) s(X).p(X) = q(X).X^d + r(X) with degree of r(X) < d.

Claim. q(X)=a(X).   Proof.

a(X).X^2d = a(X).[e(X).p(X) + u(X)]
          = a(X).e(X).p(X) + a(X).u(X)
          = [s(X).X^d + t(X)].p(X) + a(X).u(X)
          = s(X).X^d.p(X) + t(X).p(X) + a(X).u(X)
          = s(X).p(X).X^d + t(X).p(X) + a(X).u(X)
          = [q(X).X^d + r(X)].X^d + t(X).p(X) + a(X).u(X)
          = q(X).X^2d + r(X).X^d + t(X).p(X) + a(X).u(X)

As all terms except the first have degree less than 2d, the result follows.

The instruction naturally chops up a product into y(X).X^64 and z(X), so in practice we shift the things we multiply left a total of 64-d positions.
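As a concrete model of the three-multiplication scheme, here is a minimal C sketch with a software carry-less multiply standing in for PCLMULQDQ, checked against a naive shift-and-xor reduction.  The function names are mine, and the degree-8 polynomial in the test is just an example; the scheme works for any p(X) of degree d:

```c
/* Model of the PCLMULQDQ-based scheme: h = f.g mod p(X) over GF(2),
   computed with three carry-less multiplies and no division.
   clmul() stands in for the hardware instruction. */
#include <stdint.h>
#include <assert.h>

uint64_t clmul(uint64_t x, uint64_t y)     /* carry-less multiply */
{
    uint64_t r = 0;
    while (y) {
        if (y & 1) r ^= x;
        x <<= 1;
        y >>= 1;
    }
    return r;
}

uint64_t polydiv(uint64_t num, uint64_t p, int d)  /* floor(num / p) */
{
    uint64_t q = 0;
    for (int i = 63; i >= d; i--)
        if ((num >> i) & 1) {
            q   ^= 1ULL << (i - d);
            num ^= p << (i - d);
        }
    return q;
}

uint64_t polymod(uint64_t num, uint64_t p, int d)  /* num mod p, naively */
{
    for (int i = 63; i >= d; i--)
        if ((num >> i) & 1) num ^= p << (i - d);
    return num;
}

/* Barrett-style multiply: e = X^2d / p is precomputed once per field. */
uint64_t gf2_mul(uint64_t f, uint64_t g, uint64_t p, uint64_t e, int d)
{
    uint64_t mask = (1ULL << d) - 1;
    uint64_t c = clmul(f, g);          /* f.g = a.X^d + b           */
    uint64_t a = c >> d, b = c & mask;
    uint64_t s = clmul(a, e) >> d;     /* a.e = s.X^d + t, drop t   */
    uint64_t r = clmul(s, p) & mask;   /* s.p = q.X^d + r, q = a    */
    return r ^ b;                      /* the answer is r + b       */
}
```

The test below exhaustively checks d = 8 with p(X) = X^8+X^4+X^3+X^2+1; the hardware version replaces the clmul loop and, as noted above, pre-shifts everything left by 64-d positions.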


Small-scale Gaussian Elimination.

Some measurements on a single core suggest (depending a bit on the field) that the HPMI is, for matrices of over 5,000 dimensions, roughly 20 times faster than doing things in Dfmt without grease.  At about 100 dimensions, however, the overhead of the HPMI and all the associated conversions means that the two methods are about the same speed.  It is therefore necessary to consider how best to perform “echelization” in Dfmt on matrices of a few hundred dimensions – any larger, and it is better to chop the matrix up and use the HPMI for the matrix multiplications.

For small fields we would like to use grease, but we would like to avoid the overhead of computing the necessary multiples.  This is possible.  The idea is to use the vectors we actually have rather than make new ones.  I call the method “auto-grease”, though it may not be new and may already have a name.

It is important to realize too that with the dimension in the hundreds, we may reasonably assume that the entire matrix fits in a low-enough cache that memory access time is not significant.

As an example, consider mod 3 doing 4 dimensions at a time with one add/subtract – grease level 4.  We proceed in three stages.

The first stage proceeds as normal to find the first four vectors that are linearly independent and to choose four columns that show this.  We then put these four vectors into reduced echelon form . . . I call these the “white” vectors.

The middle stage (the main one) then takes each remaining vector.  If a single add/subtract can clear all four pivotal columns, this is done, and otherwise the vector is simply added to the list of vectors to be used for such clearing, which I call the grey list.  There will be at most 40 vectors on the lists in all – four white and at most 36 grey.

The third stage clears the pivotal columns on the grey vectors.  This can usually be done with two add/subtracts, but this requires some care . . .

If our objective is simply to echelize, we may do any add/subtracts, but if we require “standard” rows (as in the standard basis procedure) we are never allowed to add a later vector into an earlier one unless we can be sure that this will not turn that row from a selected one into a non-selected one.  Hence this stage (only) may be a little slower in the standard rows case.

For larger fields, at least of odd characteristic, we may utilize the multiply-and-add capability, which is not much slower than a simple add/subtract.  Hence more generally we collect projective vectors, so (for example) over the field of order 27 we collect 28 projective vectors in a 2-space, halving the work.

In meataxe64, addition in characteristic 2 uses vector-length XOR even in Dfmt, so that addition is at least ten times faster than a multiply-and-add by a multiplier that is neither 0 nor 1 (so the field order must be at least 4 for this to happen).  This makes it appropriate to collect all multiples and do addition, rather than relying on multiply-and-add.  In other words, the grease is “affine”, not “projective”.

All this effectively means that we can do echelization over fields of order up to a hundred or so using grease without any extra work to make the multiples.

All “small” fields now using HPMI.

By implementing the AS-coded primes 5-181 (now including SSE so working on older computers also), generic cubic extensions (of prime fields) and a few particular small fields {16,32,64,128,256,512,81,243,625}, I think it fair to say that meataxe64 is fast for all small fields, with the smallest field not using the HPMI now being 191.

My 4 GB laptop can multiply matrices 20,000 x 20,000 up to q=16 in at most 90 seconds, up to q=181 in 6 minutes or less, and q=5929741 (181^3) took 43 minutes.  In fact, with multiplication running in memory, I don’t think there is a multiplication in an HPMI field that fits in 4 GB of memory and takes over an hour, though I haven’t looked hard.

Typical big computers these days may have at least 32 cores and at least 256GB, for which I would estimate that a multiplication filling memory would take a day or two.  Even with a Terabyte, no run taking more than a week or two would fit in memory.

It is clear, therefore, that the next thing to look at is to work with matrices that are too large to fit in memory.

HPMI now working for primes up to 181

The AS-based scheme of the previous post now seems to be working, and at much the expected performance – a 20,000 x 20,000 dense multiply on my 1.7 GHz 2-core laptop takes 65 seconds mod 5, 95 seconds mod 7, 11 and 13, 225 seconds for primes 17-61 and 345 seconds for primes 67-181.  Gaussian elimination uses it as well, and (depending on exactly what is being done) should run at comparable speeds.  Quadratic extensions all use this HPMI too (using Karatsuba), and take between 3 and 4 times as long as the prime field.  All fields under 100 except 64 now multiply 20,000 x 20,000 matrices in under 6 minutes on the laptop.  Ought to do 64 as well, I suppose.

It currently uses AVX-2, so runs on Haswell or later, or Ryzen, but not on older chips.  I’ll probably get around to rewriting the two assembler routines using SSE at some point soon.

HPMI for primes between 5 and 181.

I am now about half-way through the detailed design for an HPMI implementation for AVX-2 for all the primes between 5 and 181 inclusive, and it is all looking pretty good so far.  There are six main routines to write, and in order of (performance) importance they are . . .

BwaMad.  I propose to use seven slices of AS-code of 128-byte length.  Hence there are 7 adds and 7 subtracts of four registers.  Entries will be 8 bits {5-13}, 10 bits {17-61} and 16 bits {67-181}.  Afmt is 8 bytes – a skip/terminate byte and 7 bytes, each of two 4-bit fields specifying which of 16 rows to add and which to subtract.  After all 14 arithmetic operations, an “adequate” reduction is done by ‘and’ with a mask of the top bits, ‘xor’ back in to clear those bits, ‘shift’ to bring them into the units position, ‘32-bit multiply’ by the relevant power of 2 modulo p, and ‘add’ of this back in.  Finally we add in a “bias” – a multiple of p sufficient to protect the subtractions from underflow.  The arithmetic looks like 23 mu-ops for each of the add/subtract pairs at 128-byte length, and the adequate reduction 43 mu-ops for the draft code I have written, but not yet tested.
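That adequate reduction is easier to check in scalar form.  The sketch below models a single b-bit field in plain C (the real routine does this across whole AVX-2 registers of packed fields at once, and the function and parameter names are mine); it folds the top bit back in rather than reducing fully:

```c
/* Scalar model of the "adequate" reduction: clear the top bit of a
   b-bit field and fold it back in as (2^(b-1) mod p).  The result is
   congruent to x mod p and small enough for the next round of
   add/subtracts, but not fully reduced - that is CtoD's job. */
#include <assert.h>

unsigned adequate(unsigned x, unsigned p, int b)
{
    unsigned top = x & (1u << (b - 1));   /* 'and' with mask of top bit */
    x ^= top;                             /* 'xor' back in to clear it  */
    x += (top >> (b - 1))                 /* 'shift' to units position, */
         * ((1u << (b - 1)) % p);         /* times 2^(b-1) mod p, 'add' */
    return x;
}
```

For p = 67 in 16-bit fields, for example, this takes any 16-bit value down below 2^15 + 67 while preserving the residue.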

BGrease.  The pre-computation of all the multiples of the input vectors follows an addition chain, and (at least initially) I propose to use an assembler program to interpret a byte-code to direct this, and to input this byte-code by hand.  Here the entries need to be completely reduced to the range 0 . . . p-1, and it is this reduction that takes the most time.  The first step in reduction is to ‘add’ <2^n – p> for some n, then ‘and’ with 2^n.  This gets a bit which is set if and only if we want to subtract p.  We therefore shift this bit into the units position, and subtract it from the unshifted bit, resulting in either 0 if we don’t want to subtract p or all-1s if we do.  ‘and’ with p then makes the number we want to subtract, and we then subtract it.
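The bit-trick reads more easily in C than in prose.  Here is a scalar model, assuming the entry is already below 2p (the vectorized routine does the same thing in every lane at once; the names are mine):

```c
/* Scalar model of the BGrease complete reduction: for x in [0, 2p),
   return x mod p without a branch.  n is the field width, with p < 2^n. */
#include <assert.h>

unsigned complete_reduce(unsigned x, unsigned p, int n)
{
    unsigned t = x + ((1u << n) - p);  /* bit n is set iff x >= p        */
    unsigned b = t & (1u << n);
    unsigned m = b - (b >> n);         /* 0, or all-1s below bit n       */
    return x - (m & p);                /* subtract p exactly when needed */
}
```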

BSeed.  The conversion of Dfmt into the Brick Work Area needs to space the entries out into 8-bit {5-13}, 10-bit {17-61} or 16-bit {67-181} fields.  For {5,7,11,13} a look-up table taking one byte of Dfmt to 3 {5} or 2 {7,11,13} bytes of data will be needed.  For {17-61} I propose to do the packing of three bytes into one 32-bit integer (three 10-bit fields) by shifting and adding.  For {67-181} it is sufficient to load a uint8_t and store it as a uint16_t.
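The {17-61} packing really is just shifting and adding; a small sketch (the function names are mine):

```c
/* Pack three 10-bit field entries into one 32-bit word, as proposed
   for the primes 17-61, and unpack one of them again. */
#include <stdint.h>
#include <assert.h>

uint32_t pack3x10(uint32_t a, uint32_t b, uint32_t c)
{
    return a + (b << 10) + (c << 20);   /* shifting and adding */
}

uint32_t unpack10(uint32_t w, int i)    /* i = 0, 1 or 2 */
{
    return (w >> (10 * i)) & 0x3FF;
}
```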

DtoA.  The conversion of Dfmt into the Afmt (skip/terminate then 7 bytes of two nybbles) is a simple byte/byte lookup table in all cases.  The AS code for p=5 has grease level 3, and for 7,11,13 has grease level 2 – a fortunate coincidence.  This table will be computed at the start of the program run.

DtoB.  Bfmt will just be Dfmt, as the expansion required would increase the memory bandwidth too much.  It is therefore left to BSeed to do the conversion work.

CtoD.  This requires the full reduction modulo p of relatively large numbers – up to 23835 in the case of p=173 (worst case).  Initially I propose to do this in ‘c’ using a Barrett-41 scheme for each number individually.  Each number is multiplied by the constant <(2^41)/p rounded up>, shifted right 41 places, multiplied by p and then subtracted from the original number.  In all cases needed this performs a complete reduction modulo p.  It may be that, for small primes e.g. {5,7,11,13} this will not be fast enough, and a vectorized assembler routine will be desirable.  Having said that, this is probably not the “perfect” way to do these primes anyway, so I may not bother.
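A sketch of the Barrett-41 step in ‘c’.  For the ranges involved (x up to 23835, p up to 181) the quotient estimate is exact, so one multiply, shift, multiply and subtract gives the full reduction:

```c
/* Barrett-41 reduction: reduce x mod p using one precomputed constant
   m = (2^41)/p rounded up, with no division in the inner loop. */
#include <stdint.h>
#include <assert.h>

uint64_t barrett41_const(uint64_t p)
{
    return ((1ULL << 41) + p - 1) / p;  /* (2^41)/p rounded up */
}

uint64_t barrett41(uint64_t x, uint64_t p, uint64_t m)
{
    uint64_t q = (x * m) >> 41;         /* quotient estimate          */
    return x - q * p;                   /* exact for the range needed */
}
```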

The design originally arose as the best way I know for primes in the 101-181 range, so in some sense needs to be done anyway, but it now seems clear that even for p=5 the result will be quite good.  Grease level 3 and working with one-byte fields means that adding and subtracting one AVX-2 register does 32 bytes, or 96 elementary operations in one clock-cycle.  Of course all the other work will take its toll on this, but I think we can hope for 48 elementary operations per clock, which equates to a 20,000 multiply in about one minute on my two core laptop.  The current (ungreased, table-lookup based) version takes 25 minutes for this.  The Logic-13 approach would, I think, be faster for p=5 (and logic-15 for p=7) but probably only twice as fast.  Anyway . . . the idea of producing an HPMI that is good for so many primes at once is too tempting.

Initially I propose to write the routines for p=67 as this is the simplest case.  Once this is working, I can check that it is as good as I hope (and maybe do something about it if not) before doing all the work needed to bring all the other primes on stream.

The instructions needed for the assembler work all exist in SSE, so a 128-bit Nehalem version shouldn’t be too hard, though it will run at about half-speed.

“Sparsification” procedure

The need for an effective “Characteristic Polynomial” procedure in meataxe64 is clear.  Over large fields, it is an essential prerequisite for using the Holt-Rees method to locate elements of the group algebra with non-zero nullity.  Also it is much easier to compute Brauer characters from the (factorized) characteristic polynomial.  In matrix group recognition, it is an important step.  I am fairly sure that other applications will be found once we have an effective algorithm.

The main idea is to repeatedly multiply a vector by the matrix until a linear dependency is found.  As meataxe64 is mainly designed to multiply a large batch of vectors by a matrix, the fact that there is only one, and we can’t do the next one until we finish the last, poses a problem.  This post is the first phase of solving that problem.

The first phase, which I call “Sparsification”, is to change the basis to make it much faster to multiply a vector.  In order to use the HPMI effectively for this phase, it is necessary to work with a fair number of vectors at a time, ranging down from about N = 1,000 in characteristic 2.  We therefore start with the first N coordinate vectors, and multiply repeatedly by the matrix, discarding at each stage any that are linearly dependent on all previous vectors (including those earlier in the same batch) resulting in a matrix where (frequently) most of the rows consist of zeros except for a single 1 somewhere.  Several details, however, need to be sorted out.

One stupidity to avoid is that surely the zero matrix should be sparse!  Although in practice this may seldom happen, it seems sensible that if at any time a (new) basis vector is actually taken to zero, we should avoid all work for that row.

Another factor to consider is that ideally we would like to move strings of entries rather than moving them one at a time.  Hence if Vi . M = Vj in our new basis, it is highly desirable that V(i+1) . M = V(j+1) and so on, and luckily this feature often comes out naturally from the procedure.

Another thing to worry about is that as we proceed, the number of vectors being used may drop off, and in this case we can “top it up” by adding more basis vectors (ideally consecutive ones) using the list of Pivots to determine which to use.  The optimum stage to do this is unclear to me at the moment, since topping-up decreases the time for sparsification, but also decreases the sparsity of the result.

Our output format is aimed to make multiplication as fast as possible, so the next thing to consider is how that multiplication is done – not in great detail, but just enough to work out what the output of “sparsification” is.

The first phase of multiplication is to take the input vector V (to be multiplied) and to make a vector, which I will call U, initially set to zero, which then has strings moved into it from V by cut and paste.  At the end, the first part of U (the same length as V) is considered as V1, and the second part as W.  It is worth making W start on a byte boundary.  V1 is V multiplied by all the sparse rows of the matrix, and W is the vector that we want to multiply by the dense rows.

W is then converted to some sort of Afmt, ready to be multiplied by the (suitably prepared) dense part of the matrix B.  How we do all this needs to be considered carefully, and I do not yet see whether “grease” is a good idea, as it reduces the work to be done and the amount of memory that needs to be fetched for a single multiplication, yet it also increases the total size of (the dense part of) B so may make the procedure less cache-friendly.  I hope to make a further post on this subject once I get it clear.

In the meantime, however, the nature of the sparsified matrix is already clear.  It basically consists of a list of triples ( v , u , k ) where v is the starting position of a string to be moved from V, u is the starting position where it goes in U, and k is the length of the string.  There then follows the dense rows of the matrix with the rows in the same order as W (the right-hand end of U).
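In C the sparse phase of the multiply is then little more than a list of memcpy calls.  This byte-level sketch ignores packed Dfmt field boundaries, and the struct and function names are mine:

```c
/* Apply the sparse part of a sparsified matrix: move strings from the
   input vector V into U (which starts zeroed) by cut and paste.  The
   dense rows are then multiplied by W, the tail of U, separately. */
#include <string.h>
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

typedef struct { size_t v, u, k; } Triple;  /* from, to, length */

void apply_sparse(const uint8_t *V, uint8_t *U,
                  const Triple *t, size_t ntriples)
{
    for (size_t i = 0; i < ntriples; i++)
        memcpy(U + t[i].u, V + t[i].v, t[i].k);
}
```

Consecutive rows mapping to consecutive targets, as hoped for above, show up here directly as longer strings k, i.e. fewer and larger memcpy calls.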

Hopefully this “sparsification” procedure can be made to run effectively, with the aim of converting a matrix into sparsified format in a time comparable to a matrix multiplication at the same size.

Types of Gaussian Elimination

With a view to presenting the various uses of Gaussian elimination with the appropriate algorithms at the “functions” layer (which may chop on disk before calling the multi-core stuff), I needed to get some sort of classification of what the different uses actually are.  Here is my analysis so far.  I restrict myself to row operations, as they are much faster than column ones, at least in meataxe64.

Rank.  The input is a single matrix, and the output is a single number – the rank of the matrix.  Essentially any row operations can be done to the input, leaving great freedom for the implementer, and there is no need to “back-clean” – it is sufficient that the entries below (but not necessarily above) a pivot are zero.  I believe the determinant function is related to this, though I haven’t got that clear yet.

Echelon Form.  The input is a single matrix, but now we require it to compute also the list of columns where a pivot was found, and the remnant – the non-pivotal columns of the final echelon form.  There is no restriction on which columns are chosen for pivots, and there is no requirement to determine a subset of the rows that span the space.  Again essentially any row operations can be done, but in the end we do need to back-clean – a pivotal entry must be the only non-zero entry in its column.

Standard Form.  This is needed for the “standard basis” algorithm.  As well as the final echelon form, the “standard” list of rows is needed – those rows that are linearly independent of all previous rows.  There is no restriction on the choice of pivotal columns, but the need to tell, for each row, whether it is linearly independent of the previous ones limits the choice of algorithm considerably.

Invert.  Here the transformation matrix is required, but essentially any row operations can be done.  In this case, however, tricks to reduce the work compared to the “chopped-up-Gaussian” approach may well be counterproductive, since they are liable to mess up the transformation matrix more quickly, threatening to lose on the transformation matrix all that was gained on the input matrix.

It is interesting that the “echelize” procedure needed in chopped-up-Gaussian, producing row-select, column-select, multiplier, cleaner and remnant, does not seem to figure in this list!  Indeed only “Standard Form” needs a row-select output at all, and there it is restricted to being standard.

It should be appreciated that meataxe64 uses negative reduced echelon form – the pivotal entries must all be -1.

The idea then is that the “function” layer chops the matrix horizontally if necessary and then calls memory-based multi-core routines to do the work.  For meataxe work, the most important of these is Echelon Form, which is used for null-space and split.  This can readily be built to use an essentially identically specified routine that works in memory.  This routine will need, however, to do chopped-up-Gaussian internally.  The key routine that is needed, therefore, is one to put a matrix into echelon form, in memory, using multiple cores and the chopped-up-Gaussian methods.

Jon Thackray joins the “team”

The work involved in producing the next version of meataxe64 has proved a bit too heavy for me, but Jon Thackray has agreed (at least in principle) to move from a frequent advisor and assistant to a member of the team, which now numbers two!

My feeling is that my own skills are at the lower levels of the software, and I love to squeeze another 10% of performance out of almost anything. Those who have tried to use my offerings, however, will know that my skills at user interface are (English understatement) not always top-notch, and the idea is that I write the low-level functions (as I do now) and Jon takes charge of the higher levels.

More specifically, the idea is that there is a “functions” layer, where I write the low-level ones, such as multiplication, Gaussian and transpose, and Jon writes the higher-level ones (such as null-space, invariant subspace, split and standard base), along with all the programs that use them.

There remains much to do to sort out the mess it is currently in, but I feel the focus of drawing a clear line between performance and usability has a good chance of resulting in a system that has both.