New Mod 2 method only a marginal improvement.

The new HPMI method described in the previous post is now working, and it does bring a speed improvement of around 8% at the largest sizes.  Nothing amazing, but it does work.  There are several reasons why it didn’t really achieve the hoped-for improvements.

One reason is that the computation of the required linear combinations of matrix B requires more additions.  The old method needed 11 additions for each of the 8 slices – 88 in all – whereas the new one needs about 180 for the two 51-codes and a further few dozen for the Gray codes.

Another issue is the complexity of converting matrix A into the form needed.  This turned out to be 650 lines of code (mostly tables) but it does take a lot more computer time than simply moving 4 bytes after checking whether they are zero.

We can use matrix A for a larger length of vector (in B and C), but this now competes with Strassen, and the sweet spot for the new method does seem to be matrices about twice as large, losing another factor.

It is interesting that, by speeding up the central loop (brick multiply), the next stage out (brick population) becomes performance critical too, multiplying by about ten the amount of code that must be performance tuned.  Now that we have speeded it up further, we find that the conversion to Afmt becomes performance critical, multiplying the tuned code by another factor of ten.

It is not clear to me whether the Afmt conversion can be vectorized.  Using AVX-512 it seems possible that it could do some parts at least 16 at a time.

What is anyway clear is that the old method is faster for matrices under about 8,000 dimensions, and although there is a place for the new method too, it does not (sadly) render the old method obsolete.  I am therefore faced with the choice of having both methods in use, which the HPMI design cannot currently cope with, or simply going back to the old method and discarding this new method as an interesting and valuable idea, but not really good enough to warrant the upheaval that using it would cause.  I have not decided yet, but my feeling is that the new method is not worth its weight.

All is not lost, though.  In the process I have actually speeded up the old method by using assembler to populate the brick.  Only about 3%, but something, anyway.


Faster working mod 2.

This is joint work with Steve Linton.

It is now six years since the basic design of the HPMI mod 2 was laid down, and it seems that there are several tricks that could be applied, aiming (though this is not yet proven) for a speed increase of around 30%.

The first (and main) idea is to use an “accumulator”.  Even on older chips we have sixteen registers – enough to hold two 1024-bit vectors (cauldrons), and by holding one vector throughout the process, we can change it gradually (following a Gray code) and do our work in that order, rather than sequentially as at present.  That way we can use one add (and avoid the load) to deal with about 11 bits.

Once the accumulator is being added in every time, we can also fold some vectors (but not others) in via the accumulator, steering the lower bits somewhat and gaining a bit more.

Another useful idea is to put the computation of the grease-code into a data-driven approach also.  That way we only need to compute the vectors we are actually going to need, a trick that is essential to the Gray-code approach.

Using all these tricks (and a few more) the plan now is that instead of adding in 8 slices of grease level 4 (8 loads, 8 adds) we add in 5 vectors only, three via the accumulator which is added in also – 5 loads, 6 adds.

There is a fairly simple way to do this.  The idea is to divide the 32 bits of A into 11+1+10+10, done as follows.

Our first job is to make 2048 lists and counts of what values the first 11 bits take.  We then go through the 2048 slots in Gray code order, noting the top 11 bits of the xor of all pairs of adjacent ones where the count is non-zero.  These are the combinations that will need to be made.  Steve’s simulations suggest that there will typically be 39-44 of these.
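To make the planning step concrete, here is a rough sketch of the bucketing and the Gray-code walk in C.  This is not the meataxe64 code – the function name, the use of plain counts rather than full lists, and the assumption that an Afmt entry arrives as a 32-bit word are all mine – but it shows where the 39-44 combinations come from.

#include <stdint.h>

int plan_top11(const uint32_t *afmt, int n, uint16_t *needed)
{
    int count[2048] = {0};             /* counts only (the real thing keeps lists too)   */
    for (int i = 0; i < n; i++)
        count[afmt[i] >> 21]++;        /* bucket by the top 11 of the 32 bits            */
    int m = 0, have_prev = 0;
    uint16_t prev = 0;
    for (uint32_t i = 0; i < 2048; i++) {
        uint16_t g = (uint16_t)(i ^ (i >> 1));    /* visit the slots in Gray-code order  */
        if (count[g] == 0) continue;
        if (have_prev)
            needed[m++] = (uint16_t)(prev ^ g);   /* combination to build for this step  */
        prev = g;
        have_prev = 1;
    }
    return m;                          /* typically 39-44, per Steve's simulations       */
}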

Kaikkonen and Rosendahl provide a set of 51 vectors (plus zero) in a 10-space such that every vector is the sum of two of them.  We store 104 21-bit vectors – (1, x, 0) and (0, 0, x) for each x in this code – and add either an odd or an even number of the (1, x, 0) entries in via the accumulator, depending on what we want the 1-bit to be for the next value.

The result is that we need to store 143-148 [104 + (39-44)] entries – at 128 bytes per cauldron, some 18-19K – which is rather more than the 16K I usually allow for the use of L1, but it seems that enough of them are accessed very infrequently for this to be OK.

The result of this is that the BGrease and BWAMad routines get amalgamated into one two-phase code that populates the Brick Work Area in phase 1 and multiplies a 1024×32 A by a 32×1024 B adding into a 1024×1024 in phase 2.

I’m not quite clear how to share the bits in the Afmt – one idea is to use 14 bits for the row of C and 7+7+6+6+6 for the BWA entries, making a total of 48 bits – rather more than the 40 used at present.  This all represents a noticeable increase in the demand for Afmt which is a bit worrying but I think it is OK.

I am not quite at the stage of starting to code this, but I am now pretty sure that mod 2 can be made a good bit faster, and am keen to demonstrate this in the near future.

HPMI using single-precision floats.

It now seems fairly clear that a good HPMI for primes 197-1669 can be written based on the vector use of single-precision floating-point numbers.  Modern chip makers seem to give floating-point arithmetic precedence, and the result is that it tends to be very fast.  A single-precision floating-point representation of an integer between -2^24 and +2^24 is exact, and with a single instruction to perform 8 operations of the form C+=A*B, and the fact that two of these can be executed every clock cycle, a speed of 16 operations per clock can be aimed at, though probably not quite achieved.  This is approximately the same as the current AS-based HPMI for 67-193, and it is likely to be a close call as to which method is faster.

The central fragment of the inner loop holds nine AVX registers each with eight entries, which we may call C0,0 … C0,23, C1,0 … C1,23, C2,0 … C2,23.  We load Bj,0 … Bj,23 and also A0,j, A1,j, A2,j, and do nine fused multiply-and-add operations Ci,k += Ai,j * Bj,k.  The next fragment is for j+1.
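For concreteness, here is roughly what that fragment looks like with AVX/FMA intrinsics.  This is an illustrative sketch rather than the meataxe64 inner loop – the function name, the row-major layouts and the leading-dimension parameters are all assumptions – but the register usage (nine accumulators, three loads of B and three broadcasts of A per step, nine FMAs) is as described above.

#include <immintrin.h>

void kernel3x24(float *C, const float *A, const float *B,
                int k, int ldc, int lda, int ldb)
{
    __m256 c00 = _mm256_loadu_ps(C + 0*ldc +  0), c01 = _mm256_loadu_ps(C + 0*ldc +  8),
           c02 = _mm256_loadu_ps(C + 0*ldc + 16);
    __m256 c10 = _mm256_loadu_ps(C + 1*ldc +  0), c11 = _mm256_loadu_ps(C + 1*ldc +  8),
           c12 = _mm256_loadu_ps(C + 1*ldc + 16);
    __m256 c20 = _mm256_loadu_ps(C + 2*ldc +  0), c21 = _mm256_loadu_ps(C + 2*ldc +  8),
           c22 = _mm256_loadu_ps(C + 2*ldc + 16);
    for (int j = 0; j < k; j++) {
        __m256 b0 = _mm256_loadu_ps(B + j*ldb +  0);      /* Bj,0  .. Bj,7   */
        __m256 b1 = _mm256_loadu_ps(B + j*ldb +  8);      /* Bj,8  .. Bj,15  */
        __m256 b2 = _mm256_loadu_ps(B + j*ldb + 16);      /* Bj,16 .. Bj,23  */
        __m256 a0 = _mm256_set1_ps(A[0*lda + j]);         /* broadcast A0,j  */
        __m256 a1 = _mm256_set1_ps(A[1*lda + j]);
        __m256 a2 = _mm256_set1_ps(A[2*lda + j]);
        c00 = _mm256_fmadd_ps(a0, b0, c00); c01 = _mm256_fmadd_ps(a0, b1, c01);
        c02 = _mm256_fmadd_ps(a0, b2, c02);
        c10 = _mm256_fmadd_ps(a1, b0, c10); c11 = _mm256_fmadd_ps(a1, b1, c11);
        c12 = _mm256_fmadd_ps(a1, b2, c12);
        c20 = _mm256_fmadd_ps(a2, b0, c20); c21 = _mm256_fmadd_ps(a2, b1, c21);
        c22 = _mm256_fmadd_ps(a2, b2, c22);
    }
    _mm256_storeu_ps(C + 0*ldc +  0, c00); _mm256_storeu_ps(C + 0*ldc +  8, c01);
    _mm256_storeu_ps(C + 0*ldc + 16, c02);
    _mm256_storeu_ps(C + 1*ldc +  0, c10); _mm256_storeu_ps(C + 1*ldc +  8, c11);
    _mm256_storeu_ps(C + 1*ldc + 16, c12);
    _mm256_storeu_ps(C + 2*ldc +  0, c20); _mm256_storeu_ps(C + 2*ldc +  8, c21);
    _mm256_storeu_ps(C + 2*ldc + 16, c22);
}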

Adequate reduction is done by multiplying by r – the nearest float to 1/p – rounding to the nearest integer and then doing a multiply-and-subtract.  For the largest primes (about 1,200-1,669) this reduction is done every 24 operations, but for smaller primes once every 48, 72 or even more operations will suffice to avoid overflow.
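The reduction step itself is short.  A hedged sketch (illustrative names, not the meataxe64 code): q is the rounded product c*r, and a fused negative-multiply-add gives c - q*p, which lands roughly in (-p/2, p/2] and so keeps the running sums inside the exact 2^24 range.

#include <immintrin.h>

static inline __m256 reduce_mod_p(__m256 c, __m256 p, __m256 r)   /* r: nearest float to 1/p */
{
    __m256 q = _mm256_round_ps(_mm256_mul_ps(c, r),
                               _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    return _mm256_fnmadd_ps(q, p, c);            /* c - q*p, roughly in (-p/2, p/2]       */
}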

It seems likely that holding the entries (for A and B, and possibly even C) in memory as 16-bit integers rather than 32-bit floats will be preferable, since although the conversion takes compute-work, the saving of memory bandwidth is usually of greater importance.
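As an illustration of that conversion (assuming AVX2 and a plain int16 layout – not a statement about what meataxe64 will actually do), widening sixteen 16-bit entries to two registers of floats just before use looks like this:

#include <stdint.h>
#include <immintrin.h>

static inline void load16_as_floats(const int16_t *src, __m256 *lo, __m256 *hi)
{
    __m256i w  = _mm256_loadu_si256((const __m256i *)src);              /* 16 x int16    */
    __m256i e0 = _mm256_cvtepi16_epi32(_mm256_castsi256_si128(w));      /* widen low 8   */
    __m256i e1 = _mm256_cvtepi16_epi32(_mm256_extracti128_si256(w, 1)); /* widen high 8  */
    *lo = _mm256_cvtepi32_ps(e0);
    *hi = _mm256_cvtepi32_ps(e1);
}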

There are many details, not all of which I have yet worked out, but I am now convinced that the project is basically possible.  In principle there are three versions of the central loop, depending on whether SSE, AVX or AVX-with-FMA is used.

First attempt at Strassen-Winograd.

The first version of Strassen-Winograd to go into meataxe64 was not a great success.  The “stripping”, needed to make the dimensions divisible by the relevant power of two, took longer than the rest of the multiplication!  Much of this was due to poor implementation, but some rough estimations suggest that even with ideal implementation, the stripping function would still be too slow, in that it would be better to use naive methods in many cases rather than use a Strassen-Winograd that needs certain kinds of stripping.  More specifically, we certainly cannot afford to strip off a few columns of B,C (in C=A*B or C+=A*B) because we then need to multiply those columns by the entirety of A, which requires extraction of all the individual elements of A.

It may be that further investigation and implementation might rescue some of this situation and relax the requirements somewhat, but for now I propose to keep the Strassen-Winograd enabled multiplication routine so that it tests whether the dimensions are suitable for simple-minded Strassen-Winograd methods, and only uses them if all these tests pass.  In other words, getting the dimensions right is somebody else’s problem.  That means that the stripping function (if indeed it ever exists) is higher level software.

In practice, my feeling is that padding is probably the better way to go.  The multi-core multiply anyway chops the input matrix into blocks, and it seems sensible to chop it into blocks whose dimensions, except perhaps for the last column and row, are suitable for Strassen-Winograd.  This must surely be a no-brainer.  In fact, I think the last block can be padded as well, so that multiplication can use these faster methods as they (now) are throughout.  I need to have a go at the multi-core multiply anyway (so it can re-read matrix B if memory is short) so that is the next step in this project.
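By way of illustration only (the function and its parameters are invented, not part of meataxe64), the block-size arithmetic for the padding approach is simple: pick the naive block size for the requested number of chunks and round it up to a multiple of 2^levels.

#include <stdint.h>

uint64_t sw_block(uint64_t dim, unsigned chunks, unsigned levels)
{
    uint64_t unit  = 1ull << levels;               /* e.g. 8 for three Strassen levels      */
    uint64_t block = (dim + chunks - 1) / chunks;  /* naive block size for this many chunks */
    return (block + unit - 1) & ~(unit - 1);       /* round up to a multiple of 2^levels    */
}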

Multi-core Gaussian without transformation matrix can also control dimensions by careful chopping, though in this case the main cleaning job multiplies an (a x r) matrix by an (r x b) one where a and b are the chop sizes, so that r (the number of pivots found in the blocks above the (a x r) matrix) is the only dimension we cannot control completely, and there are at least four approaches to dealing with this.  More experimentation is clearly needed.  To summarize, we could choose a and b so that even a reasonably well-conditioned matrix will usually have a suitable r (and just use naive if/when it doesn’t succeed), or we could learn to do stripping of cols of A/rows of B rapidly, or we could pad out our matrices with zeros in the appropriate places, or we could just not even bother to try to get Strassen-Winograd methods into Gaussian.

Multi-core Gaussian with transformation matrix is even more problematic.  Much of the work is done on the transformation matrix, whose dimensions are usually controlled by ranks of things.  In some sense all the options from the previous paragraph still exist, but they are all (except perhaps the last one!) harder, and I propose to reconsider this after I get some experience of the other cases.  Hence the main Gaussian with transformation matrix will only use Strassen-Winograd methods when it gets lucky, and initial efforts will be to try to chop in such a way that it gets lucky quite often.

Having a go at Strassen-Winograd.

I have decided to make some use of Strassen-Winograd methods in matrix multiplication.  Prototyping work by Steve Linton a few weeks ago suggests that a speed improvement of around 30% may well be available for all fields, and if so, it is time to try to obtain it.

In the past, I have always shied away from these methods.  One problem is that they tend to destroy any sparsity that the matrices have, and it is far from clear that multiplying seven dense matrices will be any faster than multiplying eight that are sparser.  Another issue is that although the term of cubic complexity is decreased, by reducing the sizes of the blocks, the quadratic term (due to format conversions and possibly extractions and assembly for extension fields) is increased – a factor that is made worse by the extra additions and subtractions of the method itself.  Another negative issue is that the memory bandwidth requirements are substantially increased.

I have recently realized, however, that my fear of increased memory footprint seems to have been unfounded.  Although extra work areas are needed, they can be kept fairly limited in extent, albeit at the cost of yet more memory bandwidth, since the additions and subtractions are then done at matrix scale rather than row-by-row.

Another factor is Steve’s implementation of the mid-scale echelonize by recursive chopping so as to take advantage of the HPMI, which means that the blocks used in the multiplies of multi-core echelonize are now considerably larger.

It therefore seems clear that there is a size-range from perhaps 20 megabytes up to around a gigabyte where a matrix multiplication can usefully use the methods of Strassen-Winograd, and I am now giving it a try.  I propose, at least initially, to try to implement the paper “Implementation of Strassen’s Algorithm for Matrix Multiplication” (1 Aug 1996) by Steven Huss-Lederman, Elaine M. Jacobson, Jeremy R. Johnson, Anna Tsao and Thomas Turnbull.

Because of the large quadratic terms of meataxe64 – particularly the format conversions – it seems likely that the “base-case” size will be larger than other authors have used, and I am expecting 10,000 x 10,000 to be about the smallest matrix that might benefit from these methods.  Conversely, considerations of memory footprint and concurrency make it hard to see how the multi-core parallel algorithms can benefit, and this part will remain naive, at least for the time being.  I therefore currently only plan to implement Strassen-Winograd methods in the single core “slab” routine.

This necessitates the introduction of the concept of a “stride” to various routines of meataxe64.  The idea is that a matrix consists of corresponding parts of consecutive rows rather than the entire row.  This is needed recursively for multiplication, but also for the addition and subtractions of the Strassen-Winograd method.
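A minimal sketch of what such a view might look like, with invented names – the real meataxe64 interface may well differ:

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t *d;        /* first byte of the first row fragment, in Dfmt               */
    size_t   stride;   /* bytes from the start of one row of the parent to the next   */
    uint64_t nor;      /* number of rows in this view                                  */
    uint64_t len;      /* bytes of Dfmt that actually belong to each row of the view   */
} StridedView;

static inline uint8_t *view_row(const StridedView *v, uint64_t i)
{
    return v->d + i * v->stride;       /* row i of the view, addressed in place, no copy */
}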

Another problem is that these methods assume that the numbers of rows and bytes of a matrix are sufficiently even – for example, if three levels are done, these must be divisible by 8.  It would be possible to solve this problem by padding, but I propose to do it by stripping – three strips are taken off and multiplied (probably in Dfmt) separately at the start.

I propose to do Strassen at the highest level – Dfmt over the extension field.  It would be possible to do it in the prime field, and even (for matrices B and C, anyway) after format conversion, but the advantages of these ideas have not yet been demonstrated to me, and I propose to do things the simplest way, at least to start with.

Hopefully my next post will be to say that this is all working!

Progress on Gaussian Elimination.

The three-layer Gaussian elimination system now seems to be working well.

The top layer is the multi-core chopped-up Gaussian, which has been working for some time.

Steve Linton (with minimal assistance from me!) last week implemented the single-core Gaussian elimination routine to chop yet further and use the HPMI wherever possible, as suggested by his prototype (in GAP) a few weeks ago.

I have now started to address the base-case a bit, where a smallish matrix (currently less than 300 in at least one dimension) must be echelonized, with transformation matrix, using the pool/batch system and (so far) projective level 2 grease where appropriate.  This needs to work without the benefit of the HPMI (the matrices are too small).

The result is that the single-core part is now within a factor of two of multiplication speed, so that the top layer only needs to chop to obtain concurrency – previously it needed to chop much more to avoid the bottleneck of the poor single-core Gaussian.

This complete system now passes the regression tests and – as far as I know – is working.

There remain many improvements still on the “to do” list.  Many of them may stay there for some time, but I propose to spend a week or two now testing (and mainly performance testing) the system in various environments to see how it manages as the machine, shape, size, field etc. vary, aiming to produce a Gaussian elimination system over finite fields that is hopefully good for all purposes within its range (x86, q<2^64).

Sieve homomorphisms.

Let r and s be two matrix representations of a group H over the same field F whose characteristic may divide the order of H.  Notice that this concept is basis-dependent – if you change the basis, I consider that a different matrix representation.  Furthermore, I consider the basis to be b0, b1, b2…bN so that the basis elements are numbered.

A homomorphism from r to s is a sieve if every basis vector of r is mapped either to zero or to a basis vector of s that comes later in the sequence than all those arising from earlier basis elements of r – so the non-zero images occur in strictly increasing positions.  In other words, the map consists of deleting some basis elements of r and then inserting some basis elements of s, preserving the order of those retained.

If anyone knows of a previous definition of this sort, please let me know.

The initial motivation for this definition comes from consideration of the tensor-condense of R and S for a group G, condensing with the norm element of a subgroup H whose order may not be coprime to the characteristic.  One decomposes the restriction of R and S to H, and then tries to put the summands of r and s into a basis such that all homomorphisms from r to s are sieves.  If this can be done, the operation at the heart of the tensor condense algorithm is (I think!) a “dot-product” of rectangles – the sum over all i and j of Rij * Sij – once row- and column-selects have been applied according to the sieve.

I call two sets of matrix representations R and S sieve compatible if the set of homomorphisms from any element of R to any element of S has a basis of sieves.

The concept arises already in the absolutely irreducible case – we must put the matrices for equivalent representations into a “standard basis” so that the homomorphism is a scalar.

More generally, we can try to achieve sieve-compatibility by putting all the representations into a “very standard basis” – whatever that means!  I am hoping that it means something, and that programs can be written to convert to it, hopefully simplifying other things besides tensor-condense.

Elementary Gaussian Elimination.

Steve Linton’s recent prototyping work has made it very clear that, in meataxe64, the performance of Gaussian elimination depends almost entirely on the speed of matrix multiplication, and the performance of the “base case” – perhaps around 500 dimensions – is of little significance.

Despite this, my perfectionist nature insists that the base case be well implemented, and this post consists of my armchair thoughts as to how to do this.  Implementation and testing will come later.

There are two obvious strategies for Gaussian elimination, and (spoiler) they are both wrong!

In one approach, each row is “cleaned” (has all the pivotal columns cleared to zero) with all previous ones, and then, if non-zero, a pivot in this new row is located and we pass on to the next row.

In the other approach, if a row is non-zero, a pivot is located and all the other rows have this column cleared.  There are several variants depending on which row is taken, and whether the previous rows have the pivot cleared or just the later ones.

There are several things wrong with all of these.  Firstly, they are not cache-friendly – when doing R += y . S, either R or S is used only once, which is not good.  Secondly, they ignore the possibility of using “grease” to reduce the work when the field is small.  Thirdly, the extraction of the single (pivotal) value from each row is not very fast (again for small fields), and it is better to extract strings of consecutive pivotal values if at all possible.

Once these issues are isolated, it is not too hard to see how to address them.  Furthermore, we can do so and still use “standard” columns and rows.

The idea is to use the first method for a few rows to get a “pool” of rows in a sort-of reduced echelon form – each pivot is the only non-zero entry in its column within the pool, though later rows may have earlier pivots.  Every now and then (perhaps every five vectors) we inspect the pool to see if there is a suitable “batch” of perhaps 5-10 vectors.  The choice of a batch is aimed at being cache-friendly (so it consists of several vectors), suitable for grease and – perhaps most tricky – suitable for fast extraction.  This batch is then greased and used to clear the pivots of the batch of rows from the entirety of the matrix.

The best form of grease to use depends on the field.  As far as I can see so far, there are three cases.

Characteristic Two.  As XOR is so much faster than the table look-up needed for extension fields in characteristic two, an “affine” grease is suggested, where all multiples of a vector (in a 1-space, 2-space or even 4-space for GF(4)) are stored, and only addition is used for the main clearing work (a small sketch of this follows the three cases).  For large fields of characteristic two, it may well be better to do several adds rather than a multiplication, though the PCLMUL instruction limits the scope for this somewhat.

Small fields of odd characteristic. In these cases, multiply-and-add is not much slower than add (both involve byte-wide table look-up(s)), so a “projective” grease is suggested, where only one vector is stored in each 1-space.  This seems appropriate for fields of order up to a few hundred.

Large fields. For fields of order over a few hundred, there is no point doing grease at all.
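To illustrate the characteristic-two “affine” grease mentioned above, here is a toy version for a 2-space over GF(2) – invented names, fixed row width, not the meataxe64 code.  All four combinations of the two pool rows are stored once, and clearing a target row is then a single table-indexed XOR pass.

#include <stdint.h>
#include <string.h>

#define W 8                                        /* 64-bit words per row (illustrative)  */

void grease2_gf2(const uint64_t v0[W], const uint64_t v1[W], uint64_t tab[4][W])
{
    memset(tab[0], 0, W * sizeof(uint64_t));       /* 0                                     */
    memcpy(tab[1], v0, W * sizeof(uint64_t));      /* v0                                    */
    memcpy(tab[2], v1, W * sizeof(uint64_t));      /* v1                                    */
    for (int w = 0; w < W; w++)
        tab[3][w] = v0[w] ^ v1[w];                 /* v0 + v1                               */
}

void clear_row(uint64_t row[W], const uint64_t tab[4][W], unsigned sel)
{
    /* sel is the 2-bit pattern of the two pivotal bits in the target row */
    for (int w = 0; w < W; w++)
        row[w] ^= tab[sel][w];                     /* one XOR pass, no multiplication       */
}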

Meataxe64 has no concept of fast extraction at the moment.  I have seen some pressure for this in at least two other circumstances – handling polynomials and the fast multiplication of a sparsified matrix – so I think the time has come to figure out how best to do this.  As an example of the problem: when working mod 2, the simple-minded way would extract the entries of a byte one bit at a time and then assemble them into a byte again, taking over 100 nanoseconds to load a byte!  The difficulty is to devise a robust field-independent way of managing to simply load the byte most of the time.
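As a toy contrast (invented names, GF(2) only): the slow path below gathers eight pivotal bits one at a time, while the fast path – available whenever the batch’s pivotal columns happen to be eight consecutive bits starting on a byte boundary – is a single byte load.

#include <stdint.h>

/* slow: pivotal columns at arbitrary bit positions col[0..7] */
uint8_t extract_slow(const uint8_t *row, const unsigned col[8])
{
    uint8_t sel = 0;
    for (int i = 0; i < 8; i++)
        sel |= (uint8_t)(((row[col[i] >> 3] >> (col[i] & 7)) & 1u) << i);
    return sel;
}

/* fast: the pivotal columns are bits 8*b .. 8*b+7 */
uint8_t extract_fast(const uint8_t *row, unsigned b)
{
    return row[b];
}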

Multiply timings on 64xPiledriver.

I have been testing for performance on the 64 Piledriver cores in St. Andrews.  Here are the results for multiplying random square NxN matrices over various smallish fields.

F – dimension – time
2 – 400,000 – 30m 28s
3 – 200,000 – 16m 54s
4 – 200,000 – 11m 50s
5 – 100,000 – 10m 12s
7 – 100,000 – 14m 30s
8 – 160,000 – 12m 37s
9 – 160,000 – 26m 10s
11 – 100,000 – 14m 32s
13 – 100,000 – 14m 45s
16 – 160,000 – 18m 45s
17 – 80,000 – 18m 49s
23 – 80,000 – 18m 51s
25 – 80,000 – 16m 29s
27 – 100,000 – 14m 18s
31 – 80,000 – 18m 52s
32 – 125,000 – 14m 28s
37 – 80,000 – 18m 51s
47 – 80,000 – 18m 53s
49 – 64,000 – 11m 56s
59 – 80,000 – 18m 53s
64 – 100,000 – 11m 19s
73 – 64,000 – 11m 54s
81 – 80,000 – 11m 0s
97 – 64,000 – 14m 58s
109 – 64,000 – 14m 58s
121 – 60,000 – 10m 5s
125 – 50,000 – 6m 57s
127 – 64,000 – 15m 0s
128 – 80,000 – 8m 20s

Reorganization of Assembler code.

I have decided to move the decision as to which assembler routines to call from compile-time to run-time.  There are three reasons for this.

The first, which Klaus mentioned a couple of years back, is that it can be convenient to install the programs, already compiled, into a directory mounted on different machines.  This requires the compiled code to run on all the machines, thereby failing to take advantage of those with a more powerful architecture.  The current change will fix part of that, but still leave the number of threads and memory size compiled in – an issue which may require attention later.

The second arises from the “large fields of characteristic 2” work, which needs the CLMUL (“carry-less multiplication”) flag, and from the HPMI for 5-181 work, which needs SSE4.1 as it stands.  I have previously taken the view that “modern” machines have all these instructions, and that meataxe64 needs a modern enough machine, but I feel that approach, although probably acceptable, is not ideal.

The third reason is that Markus and now Steve have been trying to incorporate meataxe64 stuff into GAP, and the old approach is not really acceptable in that environment.

The “cpuid” function of the x86 returns a large and confusing wealth of detail on the machine, and it is a heavy job to pick through all these flags and values.  I propose, therefore, to write an assembler routine to summarize this into a few bytes of data more suitable for choosing between routines and strategies in meataxe64.  The main properties of an (x86-64) machine are given by a single byte, which I call the class.  The values I propose are as follows (a rough sketch of reading the relevant flags follows the list), where each class implies all the previous ones also.
‘a’ – All machines, including those where LAHF and CMPXCHG16B don’t work.  I do not know whether even Linux or the output of gcc would run on these!
‘b’ – LAHF and CMPXCHG16B have been fixed and SSE3 is available.
‘c’ – SSSE3 (supplemental SSE3) is available.
‘d’ – SSE4.1 is available.  This is critical for primes 17-61, which ideally work in 10-bit fields but this needs PMULDQ.  Otherwise these primes must work in 16-bit fields, making them about 50% slower.
‘e’ – SSE4.2 is available.  This includes the byte-string processing instructions, which look useful, though I haven’t used them yet.
‘f’ – CLMUL is available.  This allows the “large fields of characteristic 2” stuff, speeding it up by a factor of over 1,000.
‘g’ – AVX.  This allows 256-bit working, but for floating point only, which is not much use in meataxe64.  Nevertheless the VEX prefix clears the top half (allowing better mixing of SSE and AVX) and gives 3-operand instructions, which are of some minor use.
‘h’ – BMI1.  A few useful bit-manipulation instructions.
‘i’ – BMI2.  Some more bit-manipulation instructions, including PEXT/PDEP which look useful, though again I haven’t used them (yet).
‘j’ – AVX-2.  This allows the use of 256-bit registers for integral work.
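As a sketch of what the summarizing routine might do (this is not the real mactype, and it is in C rather than assembler; the bit positions are the documented CPUID feature flags, but the name and shape of the routine are illustrative), the class byte could be derived roughly like this:

#include <cpuid.h>

char machine_class(void)
{
    unsigned a, b, c, d, xc;
    char cl = 'a';
    __cpuid(0x80000001, a, b, xc, d);                        /* extended flags (LAHF)     */
    __cpuid(1, a, b, c, d);                                  /* standard feature flags    */
    if ((xc & 1u) && (c & (1u << 13)) && (c & 1u)) cl = 'b'; /* LAHF, CMPXCHG16B, SSE3    */
    if (cl >= 'b' && (c & (1u <<  9))) cl = 'c';             /* SSSE3                     */
    if (cl >= 'c' && (c & (1u << 19))) cl = 'd';             /* SSE4.1                    */
    if (cl >= 'd' && (c & (1u << 20))) cl = 'e';             /* SSE4.2                    */
    if (cl >= 'e' && (c & (1u <<  1))) cl = 'f';             /* PCLMULQDQ (CLMUL)         */
    if (cl >= 'f' && (c & (1u << 28))) cl = 'g';             /* AVX                       */
    __cpuid_count(7, 0, a, b, c, d);
    if (cl >= 'g' && (b & (1u <<  3))) cl = 'h';             /* BMI1                      */
    if (cl >= 'h' && (b & (1u <<  8))) cl = 'i';             /* BMI2                      */
    if (cl >= 'i' && (b & (1u <<  5))) cl = 'j';             /* AVX2                      */
    return cl;
}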

To the user, the main question is whether the machine is class ‘j’ (Haswell, Broadwell, Skylake, Ryzen) or not, affecting the speed by about a factor of two in most cases.

To the programmer, this allows fingertip control over each routine separately, so that the large-fields-of-characteristic-2 code tests class >= ‘f’ and the 10-bit working tests class >= ‘d’, avoiding the previous ludicrous situation where 10-bit working was not allowed because CLMUL was not implemented.

No doubt further classes will be added later – AVX-512F is looming large on the horizon now.

After this change, an old machine can only use meataxe64 if it can at least assemble the new instructions.  I suspect this will also need to be addressed in the fullness of time.

One effect of this is that it is now worthwhile to implement the 16-bit AS-stuff, which will speed up 5-13 and 67-181 by a few percent and also allow 17-61 to work on classes ‘a’-‘c’ – neither of great importance, but it has been irritating me for a while.

My first target is to get it all working on class ‘a’ – not that I have a class ‘a’ machine to test it on!  A glance at the top 500 supercomputers suggests that it is classes ‘i’ and ‘j’ that are actually important at the moment, so these will get particular attention.

If anyone wants it for other purposes, the “mactype” routine will be available for all to use, as I suspect the issues discussed here crop up elsewhere.  Not that I’ve actually written it yet – I’ll post again when all this works.  Unless unexpected snags arise, this shouldn’t take more than a few days.

In return, if anyone has computers I can use for testing (probably in December) I’d appreciate a logon for a day or so.  One urgent need is for an Intel i9 with AVX-512, so I can start work on those routines.  The other need is at the other end.  I have no access to a VIA 2000 or a VIA 3000, nor to the older Intel and AMD machines from 2003-2010 (Core 2, Athlon, Merom, Nehalem, Westmere etc.), and I’d like to run the regression tests, and probably some performance tests as well, to check that I’ve got things right (and to fix them if I haven’t).