Faster working mod 2.

This is joint work with Steve Linton.

It is now six years since the basic design of the HPMI mod 2 was laid down, and it seems that there are several tricks that could be applied, aiming (though this is not yet proven) for a speed increase of around 30%.

The first (and main) idea is to use an “accumulator”.  Even on older chips we have sixteen registers – enough to hold two 1024-bit vectors (cauldrons), and by holding one vector throughout the process, we can change it gradually (following a Gray code) and do our work in that order, rather than sequentially as at present.  That way we can use one add (and avoid the load) to deal with about 11 bits.
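
For anyone unfamiliar with Gray codes, a minimal sketch (standard binary-reflected Gray code; the names are mine, nothing meataxe64-specific).  Consecutive codes differ in exactly one bit, which is why the accumulator can follow the sequence with a single xor (addition mod 2) of one stored vector per step.

#include <stdint.h>
#include <stdio.h>

/* Binary-reflected Gray code: gray(i) and gray(i-1) differ in exactly one
   bit, so the accumulator can follow the sequence with one xor per step. */
static uint32_t gray(uint32_t i) { return i ^ (i >> 1); }

int main(void)
{
    for (uint32_t i = 1; i < 16; i++) {
        uint32_t flipped = gray(i - 1) ^ gray(i);      /* always a single bit */
        printf("step %2u: code %2u, bit %d flips\n",
               i, (unsigned)gray(i), __builtin_ctz(flipped));
    }
    return 0;
}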

Once the accumulator is being added in every time, we can also choose to route some of the other additions through it and not others, steering the lower bits somewhat and gaining a bit more.

Another useful idea is to make the computation of the grease code data-driven as well.  That way we only need to compute the vectors we are actually going to use, a trick that is essential to the Gray-code approach.

Using all these tricks (and a few more), the plan now is that instead of adding in 8 slices of grease level 4 (8 loads, 8 adds) we add in only 5 vectors, three of them via the accumulator, which is itself added in as well – 5 loads, 6 adds.

There is a fairly simple way to do this.  The idea is to divide the 32 bits of A into 11+1+10+10, as follows.

Our first job is to make 2048 lists, and counts, of the values taken by the first 11 bits.  We then go through the 2048 slots in Gray-code order, noting the top 11 bits of the xor of each pair of adjacent slots whose counts are non-zero.  These vectors will need to be made.  Steve’s simulations suggest that there will typically be 39-44 of them.
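
A rough sketch of that bookkeeping, purely to fix ideas – the data layout, array sizes and names here are my guesses, not the eventual implementation:

#include <stdint.h>
#include <stddef.h>

#define SLOTS 2048                       /* 2^11 values of the top 11 bits */

static uint32_t gray(uint32_t i) { return i ^ (i >> 1); }

/* Count how many rows of A take each value of the top 11 bits, then walk
   the 2048 slots in Gray-code order and record the xor of each pair of
   consecutive non-empty slots.  These xors are the vectors that will have
   to be made for the accumulator. */
static size_t adjacent_xors(const uint32_t a[1024], uint16_t need[SLOTS])
{
    uint16_t count[SLOTS] = {0};
    for (int r = 0; r < 1024; r++)
        count[a[r] >> 21]++;             /* top 11 bits of the 32-bit row */

    size_t n = 0;
    int have_prev = 0;
    uint16_t prev = 0;
    for (uint32_t i = 0; i < SLOTS; i++) {
        uint16_t slot = (uint16_t)gray(i);
        if (count[slot] == 0) continue;
        if (have_prev)
            need[n++] = prev ^ slot;     /* an 11-bit xor to be built */
        prev = slot;
        have_prev = 1;
    }
    return n;                            /* typically 39-44 in Steve's runs */
}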

Kaikkonen and Rosendahl provide a set of 51 vectors (plus zero) in a 10-space such that every vector is the sum of two of them.  We store 104 21-bit vectors, namely (1, x, 0) and (0, 0, x) for each x in this code, and add in either an odd or an even number of the (1, x, 0) vectors via the accumulator, depending on what we want the 1 bit to be for the next value.
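
I do not have the Kaikkonen-Rosendahl code to hand, so the fragment below uses a much cruder stand-in (split the 10 bits into high and low halves, a 63-entry “code”) just to illustrate the pattern: every 10-bit value is the xor of two table entries, so each 10-bit chunk costs at most two additions.  The real code achieves the same property with only 51 non-zero vectors, which is what makes the 104-entry table work.

#include <stdint.h>
#include <assert.h>

/* Crude stand-in: every 10-bit x is the xor of its high 5 bits and its low
   5 bits, each of which comes from a 63-element table (31 values of the
   form y00000, 31 values of the form 00000y, plus zero).  The
   Kaikkonen-Rosendahl code does the same job with only 51 non-zero vectors. */
static void two_codewords(uint16_t x, uint16_t *u, uint16_t *v)
{
    *u = x & 0x3e0;                      /* table entry of the form y00000 */
    *v = x & 0x01f;                      /* table entry of the form 00000y */
}

int main(void)
{
    for (uint16_t x = 0; x < 1024; x++) {
        uint16_t u, v;
        two_codewords(x, &u, &v);
        assert((u ^ v) == x);            /* two additions always suffice */
    }
    return 0;
}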

The result is that we need to store 143-148 entries (the 104 plus the 39-44 above), which at 128 bytes per cauldron is rather more than the 16KB I usually allow for the use of L1, but it seems that enough of them are accessed very infrequently for this to be OK.

The result of this is that the BGrease and BWAMad routines get amalgamated into a single two-phase code that populates the Brick Work Area in phase 1, and in phase 2 multiplies a 1024×32 A by a 32×1024 B, adding the product into a 1024×1024 C.

I’m not quite clear how to share out the bits in the Afmt – one idea is to use 14 bits for the row of C and 7+7+6+6+6 for the BWA entries, making a total of 48 bits – rather more than the 40 used at present.  This all represents a noticeable increase in the demand for Afmt, which is a bit worrying, but I think it is OK.
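
Purely to fix ideas, here is one hypothetical packing of such an Afmt record.  The widths listed above sum to 46 bits; I am assuming the record is padded out to a whole 6 bytes, hence 48.  Nothing here is decided.

#include <stdint.h>

/* Hypothetical Afmt entry: 14 bits for the row of C, then 7+7+6+6+6 bits
   selecting the five BWA entries to add in; 46 bits of payload held in a
   48-bit (6-byte) record. */
static uint64_t afmt_pack(unsigned crow, const unsigned bwa[5])
{
    uint64_t b = crow & 0x3fffu;                     /* 14 bits */
    b |= (uint64_t)(bwa[0] & 0x7f) << 14;            /*  7 bits */
    b |= (uint64_t)(bwa[1] & 0x7f) << 21;            /*  7 bits */
    b |= (uint64_t)(bwa[2] & 0x3f) << 28;            /*  6 bits */
    b |= (uint64_t)(bwa[3] & 0x3f) << 34;            /*  6 bits */
    b |= (uint64_t)(bwa[4] & 0x3f) << 40;            /*  6 bits */
    return b;                                        /* low 46 of 48 bits used */
}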

I am not quite at the stage of starting to code this, but I am now pretty sure that mod 2 can be made a good bit faster, and am keen to demonstrate this in the near future.

HPMI using single-precision floats.

It now seems fairly clear that a good HPMI for primes 197-1669 can be written based on vectorised use of single-precision floating-point numbers.  Modern chip makers seem to give floating-point arithmetic priority, and the result is that it tends to be very fast.  A single-precision floating-point representation of an integer between -2^24 and +2^24 is exact, and with a single instruction performing 8 operations of the form C += A*B, and two such instructions executable every clock cycle, a speed of 16 operations per clock can be aimed at, though probably not quite achieved.  This is approximately the same as the current AS-based HPMI for 67-193, and it is likely to be a close call as to which method is faster.

The central fragment of the inner loop holds nine AVX registers of eight entries each, which we may call C(0,0)...C(0,23), C(1,0)...C(1,23), C(2,0)...C(2,23).  We load B(j,0)...B(j,23) and also A(0,j), A(1,j), A(2,j), and do nine fused multiply-and-add operations C(i,k) += A(i,j)*B(j,k).  The next fragment is for j+1.
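
In AVX2/FMA intrinsics one such fragment might look roughly like this; the pointer layout and names are illustrative only, with c being the nine accumulator registers:

#include <immintrin.h>

/* One j-step of the 3 x 24 kernel: three loads of B, three broadcasts of A,
   and nine fused multiply-adds C(i,k) += A(i,j) * B(j,k) into the nine
   accumulator registers c[i][0..2]. */
static inline void kernel_step(__m256 c[3][3],
                               const float *acol,   /* A(0,j), A(1,j), A(2,j) */
                               const float *brow)   /* B(j,0) ... B(j,23)     */
{
    __m256 b0 = _mm256_loadu_ps(brow);
    __m256 b1 = _mm256_loadu_ps(brow + 8);
    __m256 b2 = _mm256_loadu_ps(brow + 16);
    for (int i = 0; i < 3; i++) {
        __m256 a = _mm256_broadcast_ss(acol + i);
        c[i][0] = _mm256_fmadd_ps(a, b0, c[i][0]);
        c[i][1] = _mm256_fmadd_ps(a, b1, c[i][1]);
        c[i][2] = _mm256_fmadd_ps(a, b2, c[i][2]);
    }
}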

Adequate reduction is done by multiplying by r – the nearest float to 1/p – rounding to the nearest integer and then doing a multiply-and-subtract.  For the largest primes (about 1200-1669) this reduction is done every 24 operations, but for smaller primes once every 48, 72 or even more operations will suffice to avoid overflow.
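
A sketch of that reduction on one accumulator register; the choice of rounding intrinsic is mine:

#include <immintrin.h>

/* q = round(c * r) with r the nearest float to 1/p, then c -= q * p.
   This does not give a canonical residue, only a value small enough to
   keep the accumulators well inside the exact range below 2^24. */
static inline __m256 reduce_step(__m256 c, __m256 p, __m256 r)
{
    __m256 q = _mm256_round_ps(_mm256_mul_ps(c, r),
                               _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    return _mm256_fnmadd_ps(q, p, c);                /* c - q*p, one FMA */
}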

It seems likely that holding the entries (for A and B, and possibly even C) in memory as 16-bit integers rather than 32-bit floats will be preferable: although the conversion costs some computation, the saving in memory bandwidth is likely to matter more in most cases.
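
The load path would then be something like the following (AVX2; again just a sketch):

#include <stdint.h>
#include <immintrin.h>

/* Load eight 16-bit integers, sign-extend to 32 bits, convert to floats.
   The conversions cost a little compute but roughly halve the memory
   traffic for A and B compared with storing 32-bit floats. */
static inline __m256 load_i16_as_ps(const int16_t *p)
{
    __m128i v16 = _mm_loadu_si128((const __m128i *)p);   /* 8 x int16   */
    __m256i v32 = _mm256_cvtepi16_epi32(v16);            /* 8 x int32   */
    return _mm256_cvtepi32_ps(v32);                      /* exact floats */
}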

There are many details, not all of which I have yet worked out, but I am now convinced that the project is basically possible.  In principle there are three versions of the central loop, depending on whether SSE, AVX or AVX-with-FMA is used.

First attempt at Strassen-Winograd.

The first version of Strassen-Winograd to go into meataxe64 was not a great success.  The “stripping” needed to make the dimensions divisible by the relevant power of two took longer than the rest of the multiplication!  Much of this was due to poor implementation, but some rough estimates suggest that even with an ideal implementation the stripping function would still be too slow, in that it would be better in many cases to use naive methods rather than a Strassen-Winograd that needs certain kinds of stripping.  More specifically, we certainly cannot afford to strip off a few columns of B,C (in C=A*B or C+=A*B), because we then need to multiply those columns by the entirety of A, which requires extraction of all the individual elements of A.

It may be that further investigation and implementation could rescue some of this situation and relax the requirements somewhat, but for now I propose that the Strassen-Winograd-enabled multiplication routine should test whether the dimensions are suitable for simple-minded Strassen-Winograd methods, and only use them if all these tests pass.  In other words, getting the dimensions right is somebody else’s problem.  That means that the stripping function (if indeed it ever exists) is higher-level software.

In practice, my feeling is that padding is probably the better way to go.  The multi-core multiply anyway chops the input matrix into blocks, and it seems sensible to chop it into blocks whose dimensions, except perhaps for the last column and row, are suitable for Strassen-Winograd.  This must surely be a no-brainer.  In fact, I think the last block can be padded as well, so that multiplication can use these faster methods as they (now) are throughout.  I need to have a go at the multi-core multiply anyway (so it can re-read matrix B if memory is short) so that is the next step in this project.
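
The chopping arithmetic for that is trivial; a sketch, with the number of Strassen-Winograd levels purely illustrative:

#include <stddef.h>

/* Round a block dimension up to a multiple of 2^levels so that every
   quadrant at every level of the recursion splits evenly; the padding
   rows and columns are simply zero and contribute nothing to the product. */
static size_t pad_dim(size_t n, unsigned levels)
{
    size_t m = (size_t)1 << levels;
    return (n + m - 1) / m * m;
}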

Multi-core Gaussian without transformation matrix can also control dimensions by careful chopping, though in this case the main cleaning job multiplies an (a x r) matrix by an (r x b) one, where a and b are the chop sizes, so that r (the number of pivots found in the blocks above the (a x r) matrix) is the only dimension we cannot control completely, and there are at least four approaches to dealing with this.  More experimentation is clearly needed.  To summarize, we could choose a and b so that even a reasonably well-conditioned matrix will usually have a suitable r (and just use naive methods if/when it doesn’t), or we could learn to strip columns of A / rows of B rapidly, or we could pad out our matrices with zeros in the appropriate places, or we could simply not bother to try to get Strassen-Winograd methods into Gaussian at all.

Multi-core Gaussian with transformation matrix is even more problematic.  Much of the work is done on the transformation matrix, whose dimensions are usually controlled by ranks of things.  In some sense all the options from the previous paragraph still exist, but they are all (except perhaps the last one!) harder, and I propose to reconsider this after I get some experience of the other cases.  Hence the main Gaussian with transformation matrix will only use Strassen-Winograd methods when it gets lucky, and initial efforts will be to try to chop in such a way that it gets lucky quite often.

Having a go at Strassen-Winograd.

I have decided to make some use of Strassen-Winograd methods in matrix multiplication.  Prototyping work by Steve Linton a few weeks ago suggests that a speed improvement of around 30% may well be available for all fields, and if so, it is time to try to obtain it.

In the past, I have always shied away from these methods.  One problem is that they tend to destroy any sparsity that the matrices have, and it is far from clear that multiplying seven dense matrices will be any faster than multiplying eight sparser ones.  Another issue is that although the cubic-complexity term is decreased, the reduction in block size increases the quadratic term (due to format conversions, and possibly extractions and assembly for extension fields), a factor made worse by the extra additions and subtractions of the method itself.  A further negative is that the memory bandwidth requirements are substantially increased.

I have recently realized, however, that my fear of an increased memory footprint seems to have been unfounded.  Although extra work areas are needed, they can be kept fairly limited in extent, albeit at the cost of yet more memory bandwidth, since the additions and subtractions are then done at matrix scale rather than row by row.

Another factor is Steve’s implementation of the mid-scale echelonize by recursive chopping so as to take advantage of the HPMI, which means that the blocks used in the multiplies of the multi-core echelonize are now considerably larger.

It therefore seems clear that there is a size range, from perhaps 20 megabytes up to around a gigabyte, where a matrix multiplication can usefully use the methods of Strassen-Winograd, and I am now giving it a try.  I propose, at least initially, to try to implement the paper “Implementation of Strassen’s Algorithm for Matrix Multiplication” (1 August 1996) by Steven Huss-Lederman, Elaine M. Jacobson, Jeremy R. Johnson, Anna Tsao and Thomas Turnbull.
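
For reference, the operation schedule of the Winograd variant comes to seven half-size multiplications and fifteen additions/subtractions.  The sketch below works on plain doubles with the half-size products done naively, purely to pin the formulas down; the buffer reuse and names are mine, not the paper’s, and nothing here is meataxe64-specific.

#include <string.h>

/* One level of the Winograd variant of Strassen's method: 7 half-size
   multiplications and 15 additions/subtractions, acting on h x h quadrants
   that the caller has already split out.  Illustrative only. */

static void add(int h, const double *x, const double *y, double *z)
{ for (int i = 0; i < h * h; i++) z[i] = x[i] + y[i]; }

static void sub(int h, const double *x, const double *y, double *z)
{ for (int i = 0; i < h * h; i++) z[i] = x[i] - y[i]; }

static void mul(int h, const double *x, const double *y, double *z)  /* naive */
{
    memset(z, 0, (size_t)h * h * sizeof *z);
    for (int i = 0; i < h; i++)
        for (int k = 0; k < h; k++)
            for (int j = 0; j < h; j++)
                z[i * h + j] += x[i * h + k] * y[k * h + j];
}

/* work must have room for 11 blocks of h x h doubles */
void winograd_step(int h,
                   const double *a11, const double *a12,
                   const double *a21, const double *a22,
                   const double *b11, const double *b12,
                   const double *b21, const double *b22,
                   double *c11, double *c12, double *c21, double *c22,
                   double *work)
{
    size_t q = (size_t)h * h;
    double *s1 = work,         *s2 = work + q,     *s3 = work + 2 * q;
    double *s4 = work + 3 * q, *t1 = work + 4 * q, *t2 = work + 5 * q;
    double *t3 = work + 6 * q, *t4 = work + 7 * q;
    double *p1 = work + 8 * q, *u2 = work + 9 * q, *px = work + 10 * q;

    add(h, a21, a22, s1);  sub(h, s1, a11, s2);     /* S1 = A21+A22, S2 = S1-A11 */
    sub(h, a11, a21, s3);  sub(h, a12, s2, s4);     /* S3 = A11-A21, S4 = A12-S2 */
    sub(h, b12, b11, t1);  sub(h, b22, t1, t2);     /* T1 = B12-B11, T2 = B22-T1 */
    sub(h, b22, b12, t3);  sub(h, t2, b21, t4);     /* T3 = B22-B12, T4 = T2-B21 */

    mul(h, a11, b11, p1);                           /* P1 = A11*B11              */
    mul(h, a12, b21, px);                           /* P2 = A12*B21              */
    add(h, p1, px, c11);                            /* C11 = P1 + P2             */

    mul(h, s2, t2, px);    add(h, p1, px, u2);      /* U2  = P1 + P6             */
    mul(h, s3, t3, px);    add(h, u2, px, c21);     /* U3  = U2 + P7 (in C21)    */
    mul(h, s1, t1, px);    add(h, u2, px, c22);     /* U4  = U2 + P5 (in C22)    */
    mul(h, s4, b22, p1);   add(h, c22, p1, c12);    /* C12 = U4 + P3             */
    add(h, c21, px, c22);                           /* C22 = U3 + P5             */
    mul(h, a22, t4, p1);   sub(h, c21, p1, c21);    /* C21 = U3 - P4             */
}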

Because of the large quadratic terms of meataxe64 – particularly the format conversions – it seems likely that the “base-case” size will be larger than other authors have used, and I am expecting 10,000 x 10,000 to be about the smallest matrix that might benefit from these methods.  Conversely, considerations of memory footprint and concurrency make it hard to see how the multi-core parallel algorithms can benefit, and that part will remain naive, at least for the time being.  I therefore currently plan to implement Strassen-Winograd methods only in the single-core “slab” routine.

This necessitates the introduction of the concept of a “stride” to various routines of meataxe64.  The idea is that a matrix (or sub-matrix) consists of corresponding parts of consecutive rows rather than entire rows.  This is needed recursively for the multiplications, but also for the additions and subtractions of the Strassen-Winograd method.
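
Concretely, something like the following is what I have in mind; the struct and field names are illustrative, not the actual meataxe64 types:

#include <stdint.h>

/* A sub-matrix is described by a pointer to its first byte, its number of
   rows, its width in bytes, and a stride: the distance in bytes from the
   start of one of its rows to the start of the next.  For a whole matrix
   stride == width; the four quadrants of a 2x2 split all keep the parent's
   stride. */
typedef struct {
    uint8_t *d;        /* first byte of row 0                        */
    uint64_t rows;     /* number of rows                             */
    uint64_t width;    /* bytes used per row                         */
    uint64_t stride;   /* bytes from row i to row i+1 in the parent  */
} subm;

static inline uint8_t *subm_row(const subm *m, uint64_t i)
{
    return m->d + i * m->stride;
}

/* quadrant (qi,qj) of m, assuming rows and width are even */
static subm quadrant(const subm *m, int qi, int qj)
{
    subm q = *m;
    q.rows  = m->rows  / 2;
    q.width = m->width / 2;
    q.d     = m->d + qi * q.rows * m->stride + qj * q.width;
    return q;
}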

Another problem is that these methods require the numbers of rows and of bytes per row of a matrix to be sufficiently divisible – for example, if three levels are done, they must be divisible by 8.  It would be possible to solve this problem by padding, but I propose to do it by stripping – three strips are taken off and multiplied (probably in Dfmt) separately at the start.

I propose to do Strassen at the highest level – Dfmt over the extension field.  It would be possible to do it in the prime field, and even (for matrices B and C, anyway) after format conversion, but the advantages of these ideas have not yet been demonstrated to me, and I propose to do things the simplest way, at least to start with.

Hopefully my next post will be to say that this is all working!