Holt/Rees implemented (a bit)

Over Christmas, David Craven asked me to make a 12474 dimensional representation of 3.F22 over GF(25), which needed to be chopped out of a 78 x 5643 = 440,154 dimensional module, all over GF(25).  I managed to do this, albeit with a bit of luck, on the St. Andrews machine ‘lovelace’, but it highlighted the difficulty of finding singular elements of the group algebra over such a large field.

This problem was, of course, noticed – and solved – in 1994 by Holt and Rees.  The basic idea is to compute the characteristic polynomial of some element of the group algebra, then find its roots to determine a suitable scalar to subtract so that the result is singular.

I have described in this blog some – as yet unimplemented – ideas as to how to compute the characteristic polynomial at a large scale.  I still haven’t done that!  What I have done is to write a characteristic polynomial program zcp in the old, obvious way – single core, working one vector at a time – with a view to being able to play with it, at least in small dimensions.  This program will ultimately become zcp1, kept available to test a good implementation against if and when I get round to writing one.  I have also written a program zhr to find the roots of a polynomial, again with no intelligence – it just tries every element of the field in turn.
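As an aside, the “no intelligence” root search is just Horner evaluation at every field element.  A minimal sketch over a prime field (my names, not zhr’s, and assuming p < 2^31 so the arithmetic fits in 64 bits; zhr itself of course works over general GF(q)):

    #include <stdint.h>
    #include <stdio.h>

    /* Evaluate c[0] + c[1]*x + ... + c[d]*x^d at x, all mod p (Horner). */
    static uint64_t eval_mod_p(const uint64_t *c, int d, uint64_t x, uint64_t p)
    {
        uint64_t v = 0;
        for (int i = d; i >= 0; i--)
            v = (v * x + c[i]) % p;
        return v;
    }

    /* Try every element of GF(p) in turn and report the roots found. */
    static void all_roots_mod_p(const uint64_t *c, int d, uint64_t p)
    {
        for (uint64_t x = 0; x < p; x++)
            if (eval_mod_p(c, d, x, p) == 0)
                printf("root %llu\n", (unsigned long long)x);
    }

Each root y found this way gives a scalar whose subtraction makes the group algebra element singular, which is all the Holt/Rees approach needs.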

To try it out, I computed, on my laptop, a few representations of L3(193) mod 193, and found that it works just fine.  For such large fields there seems to be no need to do any multiplications to make group algebra elements.  Using two generators a1=(010 001 100) and a2=(110 010 001) it was always sufficient to use a1 + x.a2 – y, where the method is to make a1 + x.a2, find the characteristic polynomial of that, and then find its roots to determine y.  Most values of x seemed to produce a value of y or two, which was all that was needed.

Once the dimension got above a couple of thousand, the characteristic polynomial step started to take a while – 28 seconds at dimension 2160, for example – so there remains a need for a better characteristic polynomial program.  Also, finding the roots of such a polynomial over the field of order 37249 (193^2) took 2.3 seconds, so for huge fields this program will also be a problem.

Nevertheless, a start has been made.


Cygwin

Jon Thackray has ported meataxe64 to cygwin.  The main difficulty (as I understand it) is that the calling conventions are different, so that the assembler routines need to be declared as using the AMD64 (System V) calling convention rather than the Microsoft one.  If anyone wants to use it, let me (or him) know.

Direct Sum Decompose.

As a step for tensor condense, we seem to need to take a large module M, restrict it to a subgroup and decompose the restriction into a large number of smallish summands.  The operation that seems to be possible (though this can probably be generalized a bit) is that if the module has a projective quotient P – necessarily a summand – then we can find a basis that decomposes the module into P + R in roughly cubic time.

The plan is to choose a suitable element F of the group algebra (generated by Gi, say) and produce matrices T1, T2 by spinning – basically on the left and on the right.  The choice of F is further discussed below.

T1 is obtained by taking some null-vector of F and spinning it up under the Gi, as usual, taking T1 as the basis for the space.

T2 is obtained by taking a null-vector of the transpose F’, spinning it up under the transposes Gi’, then taking T2 as the transpose of a basis for that space.

Our hope is that T1 and the null-space of T2 give a direct-sum decomposition.  This will be true, I think, if and only if T1 and T2 have the same dimension N, and the rank of T1.T2 is also N.

Assuming we are looking for projective summands, we will need the null-vector of F to spin up to projectives.  This is achieved by powering up an element of the group algebra until the rank is stable.  The probability of success depends on the number of irreducibles where F has nullity, so a good choice of F is a “peak-word” – an element of the group algebra with minimum non-zero nullity on one irreducible and zero nullity on all the others.
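As a toy illustration of the “power up until the rank is stable” step, here is a sketch with naive dense arithmetic over a prime field (nothing like meataxe64’s internals, and only sensible in small dimensions – the point is just the shape of the loop):

    #include <stdint.h>
    #include <string.h>

    /* n x n matrices over GF(p), p an odd prime below 2^31, stored row-major. */

    static void mat_mul(int n, uint32_t p, const uint32_t *a,
                        const uint32_t *b, uint32_t *c)
    {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                uint64_t s = 0;
                for (int j = 0; j < n; j++)
                    s = (s + (uint64_t)a[i*n + j] * b[j*n + k]) % p;
                c[i*n + k] = (uint32_t)s;
            }
    }

    static uint32_t inv_mod(uint32_t a, uint32_t p)     /* a != 0, p prime */
    {
        uint64_t r = 1, b = a;                          /* a^(p-2) mod p   */
        for (uint32_t e = p - 2; e; e >>= 1, b = b * b % p)
            if (e & 1) r = r * b % p;
        return (uint32_t)r;
    }

    /* Rank by straightforward Gaussian elimination; w is n*n scratch. */
    static int mat_rank(int n, uint32_t p, const uint32_t *m, uint32_t *w)
    {
        memcpy(w, m, (size_t)n * n * sizeof *w);
        int rank = 0;
        for (int col = 0; col < n && rank < n; col++) {
            int piv = rank;
            while (piv < n && w[piv*n + col] == 0) piv++;
            if (piv == n) continue;
            for (int j = 0; j < n; j++) {               /* swap pivot row up */
                uint32_t t = w[rank*n + j];
                w[rank*n + j] = w[piv*n + j];
                w[piv*n + j] = t;
            }
            uint32_t v = inv_mod(w[rank*n + col], p);   /* normalise pivot   */
            for (int j = 0; j < n; j++)
                w[rank*n + j] = (uint32_t)((uint64_t)w[rank*n + j] * v % p);
            for (int r = rank + 1; r < n; r++) {        /* clear below pivot */
                uint64_t f = w[r*n + col];
                for (int j = 0; j < n; j++)
                    w[r*n + j] = (uint32_t)((w[r*n + j] +
                                  (p - f) * w[rank*n + j]) % p);
            }
            rank++;
        }
        return rank;
    }

    /* Square F until the rank stops dropping; out ends up as a stable power
       of F, whose null-vectors are the ones worth spinning up. */
    static int power_until_stable(int n, uint32_t p, const uint32_t *F,
                                  uint32_t *out, uint32_t *tmp, uint32_t *w)
    {
        memcpy(out, F, (size_t)n * n * sizeof *out);
        int r = mat_rank(n, p, out, w);
        for (;;) {
            mat_mul(n, p, out, out, tmp);
            int r2 = mat_rank(n, p, tmp, w);
            memcpy(out, tmp, (size_t)n * n * sizeof *out);
            if (r2 == r) return r;
            r = r2;
        }
    }

The same rank routine also serves to check the condition of the previous paragraph, that T1 and T2 have equal dimension N and that T1.T2 has rank N.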

As a first exercise, I formed the tensor product of the 162 and 189 dimensional irreducibles for the alternating group A9 mod 3.  The second algebra element F2 I tried had no nullity on either the 1 or the 7 dimensional irreducibles, and (considering the dimension 30618) I thought it likely that all the other projectives would be present.  The 16th and 8th powers of F2 had the same nullity (789), so I spun up the first vector to make a 4698 dimensional module, and then tried about 30 random elements in the nullspace of the transpose before finding one that also spun up to 4698 and such that the rank of T1.T2 was indeed 4698.  This all took about half an hour on my laptop.

This gives, I think, a “proof of concept” that explicit decomposition (i.e. finding a basis exhibiting the direct sum decomposition) of a suitable module into projectives and the rest should be possible at the scales at which meataxe64 can work.

This suggests a strategy for decomposing for tensor condense in at most three stages.

1. Use the above method to strip off all projectives.

2. Use class-sums (evaluated using a straight-line program) to decompose the remnants into blocks.

3. Use slower, but more powerful methods to further decompose the remaining modules.

The hope is that methods 1 and 2 can quickly reduce the dimension sufficiently to bring it into range of the slower methods, or perhaps that the modules will anyway be small enough that method 3 is not needed.

Redesign of Thread Farm

I have concluded that the time has come to redesign and rewrite the thread farm in its entirety.  For matrix multiplication and (particularly) Gaussian elimination, it has worked well, and the automatic scheduling needed to synchronize the parts must clearly be retained.  The current driver for this rewrite is really to enable a good implementation of the helper-subgroup based vector permute, and I now aim for a system that can handle all that is currently possible and vector permute as well.  Several defects have shown up over the last few years, and I feel the time has come to rectify them.

Firstly, it is not portable.  Although I have concentrated on 64-bit x86, it seems that the C++11 memory model is not so hard to work with, and I propose to have a go at using it.  Maybe I’ll find that this is an over-ambitious sandbox for learning C++11, but I’ll have a go.  I do not know whether the rest of the system will work on any other machine, but certainly the thread farm won’t, so I might as well start there.

Secondly it has no “multiplexing” capability, which seems necessary for a decent implementation of vector-permute.

Thirdly, it has poor support for lists of memory objects.  This has made the reading and writing of matrices quite messy.  A fuller integration of these lists is also needed for vector permute.

Fourthly, the difficulty of knowing when a memory object can be freed has caused several headaches and bugs, where an object is destroyed before the main program has got around to submitting further readers.

Fifthly, at present only the main program can submit jobs, and again vector permute seems to need to be able to create jobs as it goes – when new vectors are found that must be multiplied by the generators, for example.

Sixthly I find that the old thread farm is not really using the fact that it is running on a shared-memory machine, and I feel the interfaces should be designed for the distributed setting, as that is what the algorithms really use.

 

New Mod 2 method only a marginal improvement.

The new HPMI method described in the previous post is now working, and it does bring a speed improvement at the largest sizes of around 8%.  Nothing amazing, but it does work.  There are several reasons why it didn’t really achieve the hoped-for improvements.

One reason is that the computation of the required linear combinations of matrix B requires more additions.  The old method needed 11 additions for each of the 8 slices – 88 in all – whereas the new one needs about 180 for the two 51-codes and a further few dozen for the Gray codes.

Another issue is the complexity of converting matrix A into the form needed.  This turned out to be 650 lines of code (mostly tables) but it does take a lot more computer time than simply moving 4 bytes after checking whether they are zero.

We can use matrix A against a greater length of vector (in B and C), but this now competes with Strassen, and the sweet spot for the new method does seem to be matrices about twice as large, losing another factor.

It is interesting that, by speeding up the central loop (brick multiply), the next stage out (brick population) becomes performance critical too, multiplying by about ten the amount of code that must be performance tuned.  Now that we have speeded it up further, we find that the conversion to Afmt becomes performance critical, multiplying the tuned code by another factor of ten.

It is not clear to me whether the Afmt conversion can be vectorized.  Using AVX-512 it seems possible that it could do some parts at least 16 at a time.

What is anyway clear is that the old method is faster for matrices under about 8,000 dimensions, and although there is a place for the new method too, it does not (sadly) render the old method obsolete.  I am therefore faced with the choice of having both methods in use, which the HPMI design cannot currently cope with, or simply going back to the old method and discarding this new method as an interesting and valuable idea, but not really good enough to warrant the upheaval that using it would cause.  I have not decided yet, but my feeling is that the new method is not worth its weight.

All is not lost, though.  In the process I have actually speeded up the old method by using assembler to populate the brick.  Only about 3%, but something, anyway.

Faster working mod 2.

This is joint work with Steve Linton.

It is now six years since the basic design of the HPMI mod 2 was laid down, and it seems that there are several tricks that could be applied, aiming (though this is not yet proven) for a speed increase of around 30%.

The first (and main) idea is to use an “accumulator”.  Even on older chips we have sixteen registers – enough to hold two 1024-bit vectors (cauldrons), and by holding one vector throughout the process, we can change it gradually (following a Gray code) and do our work in that order, rather than sequentially as at present.  That way we can use one add (and avoid the load) to deal with about 11 bits.

Once we are adding in an accumulator every time, we can also add some vectors, but not others, in via the accumulator, steering the lower bits somewhat and gaining a bit more.

Another useful idea is to put the computation of the grease-code into a data-driven approach also.  That way we only need to compute the vectors we are actually going to need, a trick that is essential to the Gray-code approach.

Using all these tricks (and a few more) the plan now is that instead of adding in 8 slices of grease level 4 (8 loads, 8 adds) we add in 5 vectors only, three via the accumulator which is added in also – 5 loads, 6 adds.

There is a fairly simple way to do this.  The idea is to divide the 32 bits of A into 11+1+10+10, done as follows.

Our first job is to make 2048 lists and counts of what values the first 11 bits take.  We then go through the 2048 slots in Gray-code order, noting the (top 11 bits of the) xor of all pairs of adjacent slots where the count is non-zero.  These will need to be made.  Steve’s simulations suggest that there will typically be 39-44 of these.
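A toy version of the walk, with 64-bit words standing in for the 1024-bit cauldrons and xor as the mod 2 add (the real thing is assembler and keeps the accumulator in registers).  Here grease[r] is the 11-bit value taken from row r of the brick of A (at most 1024 rows), comb[t] the precomputed combination of rows of B selected by bit t, and C[r] the output row:

    #include <stdint.h>
    #include <string.h>

    #define NSLOTS 2048                        /* 2^11 grease values */

    static void accumulate_gray(const uint32_t *grease, int nrows, /* <= 1024 */
                                const uint64_t comb[11], uint64_t *C)
    {
        /* the "lists and counts": bucket the rows of A by their 11-bit value */
        int head[NSLOTS], next[1024];
        memset(head, -1, sizeof head);
        for (int r = 0; r < nrows; r++) {
            next[r] = head[grease[r]];
            head[grease[r]] = r;
        }

        /* walk the slots in Gray-code order: the accumulator changes by one
           add per step, and is added into C for every row in the slot */
        uint64_t acc = 0;
        uint32_t g = 0;
        for (uint32_t i = 0; i < NSLOTS; i++) {
            if (i) {
                uint32_t gnew = i ^ (i >> 1);            /* i-th Gray value */
                acc ^= comb[__builtin_ctz(g ^ gnew)];    /* one bit changed */
                g = gnew;
            }
            for (int r = head[g]; r >= 0; r = next[r])
                C[r] ^= acc;
        }
    }

This toy takes one step per slot; the real scheme skips the empty slots, which is exactly why the 39-44 xor-of-adjacent-slots combinations above have to be made.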

Kaikkonen and Rosendahl provide a set of 51 vectors (plus zero) in a 10-space such that every vector is the sum of two of them.  We store 104 21-bit vectors (1,x,0) and (0,0,x) for each x in this code, and add either an odd or an even number of the (1,x,0) vectors in via the accumulator, depending on what we want the 1 bit to be for the next value.

The result is that we need to store 143-148 [104+(39-44)] entries (each a 128-byte cauldron, so around 18-19KB), which is rather more than the 16K I usually allow for the use of L1, but it seems that enough of them are accessed very infrequently for this to be OK.

The result of this is that the BGrease and BWAMad routines get amalgamated into one two-phase code that populates the Brick Work Area in phase 1 and multiplies a 1024×32 A by a 32×1024 B adding into a 1024×1024 in phase 2.

I’m not quite clear how to share the bits in the Afmt – one idea is to use 14 bits for the row of C and 7+7+6+6+6 for the BWA entries, making a total of 48 bits – rather more than the 40 used at present.  This all represents a noticeable increase in the demand for Afmt which is a bit worrying but I think it is OK.

I am not quite at the stage of starting to code this, but I am now pretty sure that mod 2 can be made a good bit faster, and am keen to demonstrate this in the near future.

HPMI using single-precision floats.

It now seems fairly clear that a good HPMI for primes 197-1669 can be written based on the vector use of single precision floating point numbers.  Modern chip makers seem to give floating-point arithmetic precedence, and the result is that it tends to be very fast.  A single-precision floating-point representation of an integer between -2^24 and +2^24 is exact, and with a single instruction performing 8 operations of the form C+=A*B, and the fact that two of these can be executed every clock cycle, a speed of 16 operations per clock can be aimed at, though probably not quite achieved.  This is approximately the same as the current AS-based HPMI for 67-193, and it is likely to be a close call as to which method is faster.

The central fragment of the inner loop holds nine AVX registers, each with eight entries, which we may call C(0,0) . . . C(0,23), C(1,0) . . . C(1,23), C(2,0) . . . C(2,23).  We load B(j,0) . . . B(j,23) and also A(0,j), A(1,j), A(2,j) and do nine fused multiply-and-add operations C(i,k) += A(i,j)*B(j,k).  The next fragment is for j+1.
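In AVX/FMA intrinsics the fragment might look something like the following sketch (names and row strides are mine, and the real inner loop would be unrolled and interleaved with the reductions described next):

    #include <immintrin.h>

    /* One 3x24 block of C accumulated over K steps: nine ymm accumulators,
       three loads of B and three broadcasts of A per step, nine FMAs.
       lda, ldb, ldc are row strides in floats.  Compile with -mavx2 -mfma. */
    static void block_3x24(const float *A, const float *B, float *C,
                           int K, int lda, int ldb, int ldc)
    {
        __m256 c[3][3];
        for (int i = 0; i < 3; i++)
            for (int k = 0; k < 3; k++)
                c[i][k] = _mm256_loadu_ps(C + i*ldc + 8*k);

        for (int j = 0; j < K; j++) {
            __m256 b0 = _mm256_loadu_ps(B + j*ldb);      /* B(j,0..7)   */
            __m256 b1 = _mm256_loadu_ps(B + j*ldb + 8);  /* B(j,8..15)  */
            __m256 b2 = _mm256_loadu_ps(B + j*ldb + 16); /* B(j,16..23) */
            for (int i = 0; i < 3; i++) {
                __m256 a = _mm256_set1_ps(A[i*lda + j]); /* broadcast A(i,j) */
                c[i][0] = _mm256_fmadd_ps(a, b0, c[i][0]);
                c[i][1] = _mm256_fmadd_ps(a, b1, c[i][1]);
                c[i][2] = _mm256_fmadd_ps(a, b2, c[i][2]);
            }
        }
        for (int i = 0; i < 3; i++)
            for (int k = 0; k < 3; k++)
                _mm256_storeu_ps(C + i*ldc + 8*k, c[i][k]);
    }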

Adequate reduction is done by multiplying by r – the nearest float to 1/p – rounding to the nearest integer and then doing a multiply-and-subtract.  For the largest primes (about 1,200 – 1669) this reduction is done every 24 operations, but for smaller primes reducing once every 48, 72 or even more operations will suffice to avoid overflow.
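On one AVX register that reduction is only a few instructions; a sketch (again assuming FMA, with p in the range discussed above):

    #include <immintrin.h>

    /* For each lane x compute x - round(x * r) * p, where r is the nearest
       float to 1/p, leaving a small representative of x mod p. */
    static inline __m256 reduce_mod_p(__m256 x, float p)
    {
        __m256 vp = _mm256_set1_ps(p);
        __m256 vr = _mm256_set1_ps(1.0f / p);
        __m256 q  = _mm256_round_ps(_mm256_mul_ps(x, vr),
                                    _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        return _mm256_fnmadd_ps(q, vp, x);   /* x - q*p: the multiply-and-subtract */
    }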

It seems likely that holding the entries (for A and B, and possibly even C) in memory as 16-bit integers rather than 32-bit floats will be preferable, since although the conversion takes compute-work, the saving of memory bandwidth is probably usually of greater importance.
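One possible way (AVX2) of turning eight 16-bit entries into eight floats, just to show what the conversion involves:

    #include <immintrin.h>
    #include <stdint.h>

    /* Sign-extend eight 16-bit entries to 32 bits and convert to floats. */
    static inline __m256 load8_i16_as_ps(const int16_t *src)
    {
        __m128i v16 = _mm_loadu_si128((const __m128i *)src);
        __m256i v32 = _mm256_cvtepi16_epi32(v16);
        return _mm256_cvtepi32_ps(v32);
    }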

There are many details, not all of which I have yet worked out, but I am now convinced that the project is basically possible.  In principle there are three versions of the central loop, depending on whether SSE, AVX or AVX-with-FMA is used.

First attempt at Strassen-Winograd

The first version of Strassen-Winograd to go into meataxe64 was not a great success.  The “stripping”, needed to make the dimensions divisible by the relevant power of two, took longer than the rest of the multiplication!  Much of this was due to poor implementation, but some rough estimates suggest that even with an ideal implementation, the stripping function would still be too slow, in that it would be better to use naive methods in many cases rather than use a Strassen-Winograd that needs certain kinds of stripping.  More specifically, we certainly cannot afford to strip off a few columns of B,C (in C=A*B or C+=A*B) because we then need to multiply those columns by the entirety of A, which requires extraction of all the individual elements of A.

It may be that further investigation and implementation might rescue some of this situation and relax the requirements somewhat, but for now I propose to keep the Strassen-Winograd enabled multiplication routine so that it tests whether the dimensions are suitable for simple-minded Strassen-Winograd methods, and only uses them if all these tests pass.  In other words, getting the dimensions right is somebody else’s problem.  That means that the stripping function (if indeed it ever exists) is higher level software.

In practice, my feeling is that padding is probably the better way to go.  The multi-core multiply anyway chops the input matrix into blocks, and it seems sensible to chop it into blocks whose dimensions, except perhaps for the last column and row, are suitable for Strassen-Winograd.  This must surely be a no-brainer.  In fact, I think the last block can be padded as well, so that multiplication can use these faster methods as they (now) are throughout.  I need to have a go at the multi-core multiply anyway (so it can re-read matrix B if memory is short) so that is the next step in this project.

Multi-core Gaussian without transformation matrix can also control dimensions by careful chopping, though in this case the main cleaning job multiplies an (a x r) matrix by an (r x b) one, where a and b are the chop sizes, so that r (the number of pivots found in the blocks above the (a x r) matrix) is the only dimension we cannot control completely.  There are at least four approaches to dealing with this, and more experimentation is clearly needed.  To summarize: we could choose a and b so that even a reasonably well conditioned matrix will usually have a suitable r (and just use naive methods if/when it doesn’t); we could learn to strip columns of A / rows of B rapidly; we could pad our matrices with zeros in the appropriate places; or we could simply not bother to get Strassen-Winograd methods into Gaussian at all.

Multi-core Gaussian with transformation matrix is even more problematic.  Much of the work is done on the transformation matrix, whose dimensions are usually controlled by ranks of things.  In some sense all the options from the previous paragraph still exist, but they are all (except perhaps the last one!) harder, and I propose to reconsider this after I get some experience of the other cases.  Hence the main Gaussian with transformation matrix will only use Strassen-Winograd methods when it gets lucky, and initial efforts will be to try to chop in such a way that it gets lucky quite often.

Having a go at Strassen-Winograd.

I have decided to make some use of Strassen-Winograd methods in matrix multiplication.  Prototyping work by Steve Linton a few weeks ago suggests that a speed improvement of around 30% may well be available for all fields, and if so, it is time to try to obtain it.

In the past, I have always shied away from these methods.  One problem is that they tend to destroy any sparsity that the matrices have, and it is far from clear that multiplying seven dense matrices will be any faster than multiplying eight that are sparser.  Another issue is that although the term of cubic complexity is decreased, by reducing the sizes of the blocks, the quadratic term (due to format conversions and possibly extractions and assembly for extension fields) is increased – a factor that is made worse by the extra additions and subtractions of the method itself.  Another negative issue is that the memory bandwidth requirements are substantially increased.

I have recently realized, however, that my fear of increased memory footprint seems to have been unfounded.  Although extra work areas are needed, they can be kept fairly limited in extent, albeit by using yet more memory bandwidth by doing the additions and subtractions at matrix scale rather than row-by-row.

Another factor is Steve’s implementation of the mid-scale echelonize by recursive chopping so as to take advantage of the HPMI, which means that the blocks used in the multiplies of multi-core echelonize are now considerably larger.

It therefore seems clear that there is a size-range from perhaps 20 megabytes up to around a gigabyte where a matrix multiplication can usefully use the methods of Strassen-Winograd, and I am now giving it a try.  I propose, at least initially, to try to implement the paper Implementation of Strassen’s Algorithm for Matrix Multiplication dated 1 Aug 1996 by Steven Huss-Lederman, Elaine M. Jacobson, Jeremy R. Johnson, Anna Tsao and Thomas Turnbull.
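For reference, one level of the Winograd variant uses seven block multiplications and fifteen block additions/subtractions.  The toy sketch below shows one common form of that schedule on plain square float matrices of even dimension, recursing straight to naive multiplication; it is nothing like the meataxe64 version (no Dfmt, no strides, no base-case tuning, and allocation checks omitted), just the arithmetic shape:

    #include <stdlib.h>
    #include <string.h>

    static void mm_naive(int n, const float *a, const float *b, float *c)
    {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                float s = 0;
                for (int j = 0; j < n; j++) s += a[i*n + j] * b[j*n + k];
                c[i*n + k] = s;
            }
    }

    static void madd(int h, const float *x, const float *y, float *z)
    { for (int i = 0; i < h*h; i++) z[i] = x[i] + y[i]; }
    static void msub(int h, const float *x, const float *y, float *z)
    { for (int i = 0; i < h*h; i++) z[i] = x[i] - y[i]; }

    /* copy quadrant (qi,qj) of the n x n matrix m to/from the h x h matrix q */
    static void getq(int n, const float *m, int qi, int qj, float *q)
    {
        int h = n/2;
        for (int r = 0; r < h; r++)
            memcpy(q + r*h, m + (qi*h + r)*n + qj*h, h*sizeof(float));
    }
    static void putq(int n, float *m, int qi, int qj, const float *q)
    {
        int h = n/2;
        for (int r = 0; r < h; r++)
            memcpy(m + (qi*h + r)*n + qj*h, q + r*h, h*sizeof(float));
    }

    /* One level of Strassen-Winograd: 7 multiplications, 15 additions/subtractions. */
    void mm_winograd1(int n, const float *A, const float *B, float *C)
    {
        int h = n/2;
        size_t sz = (size_t)h*h*sizeof(float);
        float *a11=malloc(sz), *a12=malloc(sz), *a21=malloc(sz), *a22=malloc(sz);
        float *b11=malloc(sz), *b12=malloc(sz), *b21=malloc(sz), *b22=malloc(sz);
        float *s=malloc(sz), *t=malloc(sz), *u=malloc(sz), *p[7];
        for (int i = 0; i < 7; i++) p[i] = malloc(sz);
        getq(n,A,0,0,a11); getq(n,A,0,1,a12); getq(n,A,1,0,a21); getq(n,A,1,1,a22);
        getq(n,B,0,0,b11); getq(n,B,0,1,b12); getq(n,B,1,0,b21); getq(n,B,1,1,b22);

        mm_naive(h,a11,b11,p[0]);                      /* P1 = A11*B11   */
        mm_naive(h,a12,b21,p[1]);                      /* P2 = A12*B21   */
        madd(h,a21,a22,s); msub(h,b12,b11,t);          /* S1, T1         */
        mm_naive(h,s,t,p[4]);                          /* P5 = S1*T1     */
        msub(h,s,a11,s);   msub(h,b22,t,t);            /* S2, T2         */
        mm_naive(h,s,t,p[5]);                          /* P6 = S2*T2     */
        msub(h,a12,s,s);   mm_naive(h,s,b22,p[2]);     /* S4, P3 = S4*B22 */
        msub(h,t,b21,t);   mm_naive(h,a22,t,p[3]);     /* T4, P4 = A22*T4 */
        msub(h,a11,a21,s); msub(h,b22,b12,t);          /* S3, T3         */
        mm_naive(h,s,t,p[6]);                          /* P7 = S3*T3     */

        madd(h,p[0],p[1],u); putq(n,C,0,0,u);          /* C11 = P1+P2    */
        madd(h,p[0],p[5],u);                           /* U2  = P1+P6    */
        madd(h,u,p[4],s); madd(h,s,p[2],s);
        putq(n,C,0,1,s);                               /* C12 = U2+P5+P3 */
        madd(h,u,p[6],u);                              /* U3  = U2+P7    */
        msub(h,u,p[3],s); putq(n,C,1,0,s);             /* C21 = U3-P4    */
        madd(h,u,p[4],u); putq(n,C,1,1,u);             /* C22 = U3+P5    */

        free(a11); free(a12); free(a21); free(a22);
        free(b11); free(b12); free(b21); free(b22);
        free(s); free(t); free(u);
        for (int i = 0; i < 7; i++) free(p[i]);
    }

Each level saves one of the eight block multiplications at the cost of fifteen block additions, which is where the increased quadratic term and memory traffic come from.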

Because of the large quadratic terms of meataxe64 – particularly the format conversions – it seems likely that the “base-case” size will be larger than other authors have used, and I am expecting 10,000 x 10,000 to be about the smallest matrix that might benefit from these methods.  Conversely, considerations of memory footprint and concurrency make it hard to see how the multi-core parallel algorithms can benefit, and this part will remain naive, at least for the time being.  I therefore currently only plan to implement Strassen-Winograd methods in the single core “slab” routine.

This necessitates the introduction of the concept of a “stride” to various routines of meataxe64.  The idea is that a matrix may consist of corresponding parts of consecutive rows rather than entire rows.  This is needed recursively for multiplication, but also for the additions and subtractions of the Strassen-Winograd method.
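In C terms the idea is something like this (hypothetical types, not meataxe64’s actual ones): a view carries its own stride, so the four quadrants used in the recursion, and the operands of the extra additions and subtractions, all refer to the original storage without copying:

    #include <stddef.h>

    /* A strided view: rows x cols entries starting at d, with consecutive
       rows separated by stride entries (stride >= cols). */
    typedef struct {
        float  *d;
        size_t  rows, cols, stride;
    } MatView;

    /* Quadrant (qr,qc), qr,qc in {0,1}, of a view with even dimensions;
       it still points into the original storage. */
    static MatView quadrant(MatView m, int qr, int qc)
    {
        MatView q = { m.d + (size_t)qr*(m.rows/2)*m.stride + (size_t)qc*(m.cols/2),
                      m.rows/2, m.cols/2, m.stride };
        return q;
    }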

Another problem is that these methods assume that the numbers of rows and bytes of a matrix are sufficiently even – for example, if three levels are done, these must be divisible by 8.  It would be possible to solve this problem by padding, but I propose to do it by stripping – three strips are taken off and multiplied (probably in Dfmt) separately at the start.

I propose to do Strassen at the highest level – Dfmt over the extension field.  It would be possible to do it in the prime field, and even (for matrices B and C, anyway) after format conversion, but the advantages of these ideas have not yet been demonstrated to me, and I propose to do things the simplest way, at least to start with.

Hopefully my next post will be to say that this is all working!

Progress on Gaussian Elimination.

The three-layer Gaussian elimination system now seems to be working well.

The top layer is the multi-core chopped-up Gaussian, which has been working for some time.

Steve Linton (with minimal assistance from me!) last week implemented the single-core Gaussian elimination routine to chop yet further and use the HPMI wherever possible, as suggested by his prototype (in GAP) a few weeks ago.

I have now started to address the base-case a bit, where a smallish matrix (currently less than 300 in at least one dimension) must be echelonized, with transformation matrix, using the pool/batch system and (so far) projective level 2 grease where appropriate.  This needs to work without the benefit of the HPMI (the matrices are too small).

The result is that the single-core part is now within a factor of two of multiplication speed, so that the top layer only needs to chop to obtain concurrency – previously it needed to chop much more to avoid the bottleneck of the poor single-core Gaussian.

This complete system now passes the regression tests and – as far as I know – is working.

There remain many improvements still on the “to do” list.  Many of them may stay there for some time, but I propose to spend a week or two now testing (and mainly performance testing) the system in various environments to see how it manages as the machine, shape, size, field etc. vary, aiming to produce a Gaussian elimination system over finite fields that is hopefully good for all purposes within its range (x86, q<2^64).