Extension fields improved.

With assistance from Martin Albrecht, meataxe64 now uses the HPMI for all non-prime fields up to order 2^16 = 65536 (primes are only up to 193).  My 2-core laptop can now multiply 7,000 x 7,000 matrices over any of these fields in less than a minute.  The slowest is actually 15625=5^5, which needs 14 matrix multiplications over GF(5) to produce the result.

My implementations entirely rely on “interpolation”, which includes interpolation over extension fields and using powers of linear polynomials . . . I have not yet grasped the practicalities of using “extension of extension” methods, and “function field” methods are also still beyond my mental capacity.  Hence GF(64) currently uses 18 matrices (where extending GF(4) would only need 15) and 19683 uses 27 (where a function field can apparently do it in 26).  Perhaps I should say “almost entirely rely” because GF(32) uses 13 matrices – a method found, I believe, by brute-force search.  As I see it, most of the fields use the same number of matrices as the best known method, and those that don’t are not too far off.

The linear forms needed, both on the input and to make the output, were all optimized using a beam search to factorize the matrices as the product of as few elementary operations (e.g. add <lambda> times one row to another row) as possible using a program (zem – elementary matrices) that may have wider applicability.

I currently do Strassen in the extension field, though perhaps not for any good reason.


AVX-512 Version now working.

All HPMI fields – characteristic 2-193 – now use AVX-512 if it is available.  The speed-up is less that I’d hoped – mod 2 takes 71%, mod 3 takes 70% and mod 5 takes 85% of the time it takes (on the same chip) using only AVX-2.  By working with 512-bit registers and using the ternary logic, the amount of “work” to be done is 50%  in characteristic 5 and less in characteristics 2 and 3.  It may be that the hardware technology is still not mature, and perhaps we’ll think of some new tricks, but as far as I know today, that is the result.

Characteristic Polynomial Kicking Off.

I have recently realized that a good <characteristic polynomial> program is possible, able to work heroically with 1,000,000 x 1,000,000 matrices, and that it will be a powerful tool for investigating condensed representations. – by computing and factorizing the characteristic polynomial of a “random” element, one gets a good view of the composition factors without needing to do all the chopping.

All the tricks of meataxe64 are basically totally useless if one only wants to multiply one long vector by a huge matrix.  Similarly doing Gaussian elimination one vector at a time is rubbish too.  The HPMI and multi-core tricks can speed things up by a factor of 100,000 or so, but only if we work with relatively large batches – perhaps a thousand or so – of vectors all over the place.

The method I am proposing so far assumes that the minimal polynomial is commensurate with the dimension, and that the “random” vectors we choose will spin up to a fair chunk of the space.  I am hoping that modifications can be made if this turns out to be an unwarranted assumption, and hopefully that a robust method will emerge, but I am not that far into things yet.

I can see no real alternative but to take a single vector and repeatedly multiply, although Steve Linton suggests we might be able to work out the answer if we spin up ten at a time, which might help a bit later.   Ignoring that for the time being, the idea is to transform the matrix into one that we can multiply by at high speed, and I propose to use the “sparsified” format for this.  We want to be able to multiply a vector by most of the rows of the matrix by just shifting – the matrix should consist mainly of long sequences of rows which are zero except have one ‘1’ in consecutive columns.  This is the sort of thing I am expecting to use . . .

0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 0 1
8 4 3 3 3 4
4 4 4 2 3 4

The first program we want therefore takes a batch of “random” vectors, big enough for the HPMI and multicore to kick in properly, but small enough so that the number of dense rows in the result is not too great, so that multiplication is fast.  We may want to do this in two stages . . . not sure yet.  There is a Gaussian step here too, again with big matrices throughout.

The other program we need takes a single vector and spins it up using this fast sparsified multiplication.  Again we must spin it up into a big batch before doing any Gaussian elimination, so that it can take advantage of the HPMI and multicore.  This will be wasteful if the chosen vector doesn’t spin up to much!

I haven’t got far with this yet.  I have defined a “sparsified” format, and written a program to read in a flat matrix and convert to this sparsified format, spotting any sparsification already present in the matrix.  A sparsified matrix can also be printed.  Finally there is a naive (slow) characteristic polynomial program and a naive (slow) program that looks for roots in a polynomial for Holt/Rees purposes, trying each field element in turn.    In other words I can test the suite – all I need to do now is to write it!

More news when I have it.

Holt/Rees implemented (a bit)

Over Christmas, David Craven asked me to make a 12474 dimensional representation of 3.F22 over GF(25), which needed to be chopped out of 78 x 5643 = 440,154, all over GF(25).  I managed to do this, albeit with a bit of luck, on the St. Andrews machine ‘lovelace’, but it highlighted the difficulty of finding singular elements of the group algebra over such a large field.

This problem was, of course, noticed – and solved – in 1994 by Holt and Rees.  The basic idea is to compute the characteristic polynomial of some element of the group algebra, then find the roots of that to determine a suitable scalar to subtract.

I have described in this blog some – as yet unimplemented – ideas as to how to compute the characteristic polynomial at a large scale.  I still haven’t done that!  What I have done is to write a characteristic polynomial program zcp in the old, obvious way – single core working one vector at a time, with a view to be able to play with it, at least in small dimensions.  Also, this program will ultimately be zcp1, available to test a good implementation if and when I get round to it.  I have also written a program zhr to find the roots of a polynomial, again with no intelligence – it just tries every root in turn.

To try it out, I computed, on my laptop, a few representations of L3(193) mod 193, and found that it works just fine.  For such large fields there seems to be no need to do any multiplications to make group algebra elemens.  Using two generators a1=(010 001 100) and a2=(110 010 001) it was always sufficient to use a1 + x.a2 – y, where the method is to make a1 + x.a2, find the characteristic polynomial of that, then find the roots of that to determine y.  Most values of x seemed to produce a value of y or two, which was all that was needed.

Once the dimension got above a couple of thousand, the characteristic polynomial step started to take a while – 28 seconds for 2160, for example – so there remains a need for a better characteristic polynomial program.  Also to find the roots of that polynomial over the field 37249 (193^2) took 2.3 seconds, so for huge fields, this program will also be a problem.

Nevertheless, a start has been made.


Jon Thackray has ported meataxe64 to cygwin.  The main difficulty (as I understand it) is that the calling conventions are different, so that the assembler routines need to be described as having the AMD ABI calling conventions, not the Microsoft ones.  If anyone wants to use it, let me (or him) know.

Direct Sum Decompose.

As a step for tensor condense, we seem to need to take a large module M, restrict it to a subgroup and decompose the restriction into a large number of smallish summands.  The operation that seems to be possible (though this can probably be generalized a bit) is that if the module has a projective quotient P – necessarily a summand – then we can find a basis that decomposes the module into P + R in roughly cubic time.

The plan is to choose a suitable element F of the group algebra (generated by Gi, say) and produce matrices T1, T2 by spinning – basically on the left and on the right.  The choice of F is further discussed below.

T1 is obtained by taking some null-vector of F and spining it up under the Gi, as usual, taking T1 as the basis for the space.

T2 is obtained by taking a null-vector of the transpose F’, spinning it up under the transposes Gi’, then taking T2 as the transpose of a basis for that space.

Our hope is that T1 and the null-space T2 give a direct-sum decomposition.  This will be true, I think, if and only if T1 and T2 have the same dimension N, and that the rank of T1.T2 is also N.

Assuming we are looking for projective summands, we will need the null-vector of F to spin up to projectives.  This is achieved by powering up an element of the group algebra until the rank is stable.  The probability of success depends on the number of irreducibles where F has nullity, so a good choice of F is a “peak-word” – an element of the group algebra with minimum non-zero nullity on one irreducible and zero nullity on all the others.

As a first exercise, I formed the tensor produce of 162 and 189 dimensional irreducibles for the alternating group A9 mod 3.  The second algebra element F2 I tried had no nullity on the 1 nor the 7 dimensional irreducibles, and (considering the dimension 30618) I thought it likely that all the other projectives would be present.  The 16th and 8th powers of F2 had the same nullity (789) so I span up the first vector to make a 4698 dimensional module, and then tried about 30 random elements in the nullspace of the transpose before finding one that also span up to 4698 and such that the rank of T1.T2 was indeed 4698.  This all took about half an hour on my laptop.

This gives, I think, a “proof of concept” that explicit decomposing (i.e. finding a basis exhibiting the direct sum decomposition) of a suitable module into projectives and the rest should be possible at the scales at which meataxe64 can work.

This suggests a strategy for decomposing for tensor condense in at most three stages.

1. Use the above method to strip off all projectives.

2. Use class-sums (evaluated using a straight-line program) to decompose the remnants into blocks.

3. Use slower, but more powerful methods to further decompose the remaining modules.

The hope is that methods 1 and 2 can quickly reduce the dimension sufficiently to bring it into range of the slower methods, or perhaps that the modules will anyway be small enough that method 3 is not needed.

Redesign of Thread Farm

I have concluded that the time has come to redesign and rewrite the thread farm in its entirety.  For matrix multiplication and (particularly) Gaussian elimination, it has worked well, and the automatic scheduling needed to synchronize the parts must clearly be retained.  The current driver for this rewrite is really to enable a good implementation of the helper-subgroup based vector permute, and I now aim for a system that can handle all that is currently possible and vector permute as well.  Several defects have shown up over the last few years, and I feel the time has come to rectify them.

Firstly, it is not portable.  Although I have concentrated on 64-bit x86, it seems that C++11 is not such a hard memory model, and I propose to have a go at it.  Maybe I’ll find that it is an over-ambitious sandbox for learning C++11, but I’ll have a go.  I do not know whether the rest of the system will work on any other machine, but certainly the thread farm won’t, so I might as well start there.

Secondly it has no “multiplexing” capability, which seems necessary for a decent implementation of vector-permute.

Thirdly, it has poor support for lists of memory objects.  This has made the reading and writing of matrices quite messy.  A fuller integration of these lists is also needed for vector permute.

Fourthly the difficulty of knowing when a memory object can be free’d has caused several headaches and bugs, where an object is destroyed before the main program got around to submitting further readers.

Fifthly, at present only the main program can submit jobs, and again vector permute seems to need to be able to create jobs as it goes – when new vectors are found that must be multiplied by the generators, for example.

Sixthly I find that the old thread farm is not really using the fact that it is running on a shared-memory machine, and I feel the interfaces should be designed for the distributed setting, as that is what the algorithms really use.


New Mod 2 method only a marginal improvement.

The new HPMI method described in the previous post is now working, and it does bring a speed improvement at the largest sizes of around 8%.  Nothing amazing, but it does work.  There are several reasons why it didn’t really achieve the hoped-for improvements.

One reason is that the computation of the required linear combinations of matrix B require more additions.  The old method needed 11 additions for each of the 8 slices – 88 in all – whereas the new one needs about 180 for the two 51-codes and a further few dozen for the Gray codes.

Another issue is the complexity of converting matrix A into the form needed.  This turned out to be 650 lines of code (mostly tables) but it does take a lot more computer time than simply moving 4 bytes after checking whether they are zero.

We can use matrix A for a larger length of vector (in B and C) but this now competes with Strassen and the sweet spot for the new method does seem to be matrices about twice as large, losing another factor.

It is interesting that, by speeding up the central loop (brick multiply) the next stage out (brick population) becomes performance critical too, multiplying by about ten the amount of code that must be performance tuned.  Now we have speeded it up further, we find that the conversion to Afmt becomes performance critical, multiplying the tuned code by another factor of ten.

It is not clear to me whether the Afmt conversion can be vectorized.  Using AVX-512 it seems possible that it could do some parts at least 16 at a time.

What is anyway clear is that the old method is faster for matrices under about 8,000 dimensions, and although there is a place for the new method too, it does not (sadly) render the old method obsolete.  I am therefore faced with the choice of having both methods in use, which the HPMI design cannot currently cope with, or simply going back to the old method and discarding this new method as an interesting and valuable idea, but not really good enough to warrant the upheaval that using it would cause.  I have not decided yet, but my feeling is that the new method is not worth its weight.

All is not lost, though.  In the process I have actually speeded up the old method by using assembler to populate the brick.  Only about 3%, but something, anyway.

Faster working mod 2.

This is joint work with Steve Linton.

It is now six years since the basic design of the HPMI mod 2 was laid down, and it seems that there are several tricks that could be applied, aiming (though this is not yet proven) for an speed increase of around 30%.

The first (and main) idea is to use an “accumulator”.  Even on older chips we have sixteen registers – enough to hold two 1024-bit vectors (cauldrons), and by holding one vector throughout the process, we can change it gradually (following a Gray code) and do our work in that order, rather than sequentially as at present.  That way we can use one add (and avoid the load) to deal with about 11 bits.

Once we are adding in an accumulator every time, we can also add some but not others in via the accumulator, steering the lower bits somewhat and gaining a bit more.

Another useful idea is to put the computation of the grease-code into a data-driven approach also.  That way we only need to compute the vectors we are actually going to need, a trick that is essential to the Gray-code approach.

Using all these tricks (and a few more) the plan now is that instead of adding in 8 slices of grease level 4 (8 loads, 8 adds) we add in 5 vectors only, three via the accumulator which is added in also – 5 loads, 6 adds.

There is a fairly simple way to do this.  The idea is to divide the 32-bits of A into 11+1+10+10, done as follows.

Our first job is to make 2048 lists and counts of what values the first 11 bits take.  We then go through the 2048 slots in Gray code order noting the (top 11 bits of) the xor of all pairs of adjacent ones where the count is non-zero.  These will need to be made.  Steve’s simulations suggest that there will typically be 39-44 of these.

Kaikkonen and Rosendahl provide a set of 51 vectors (plus zero) in a 10-space such that every vector is the sum of two of them.  We store 104 21-bit vectors 1 x 0 and 0 0 x for each x in this code, and add either an odd or an even number of 1 x 0 in via the accumulator depending on what we want the 1 bit to be for the next value.

The result is that we need to store 143-148 [104+(39-44)] entries which is rather more than the 16K I usually allow for the use of L1, but it seems that enough of them are accessed very infrequently for this to be OK.

The result of this is that the BGrease and BWAMad routines get amalgamated into one two-phase code that populates the Brick Work Area in phase 1 and multiplies a 1024×32 A by a 32×1024 B adding into a 1024×1024 in phase 2.

I’m not quite clear how to share the bits in the Afmt – one idea is to use 14 bits for the row of C and 7+7+6+6+6 for the BWA entries, making a total of 48 bits – rather more than the 40 used at present.  This all represents a noticeable increase in the demand for Afmt which is a bit worrying but I think it is OK.

I am not quite at the stage of starting to code this, but I am now pretty sure that mod 2 can be made a good bit faster, and am keen to demonstrate this in the near future.

HPMI using single-precision floats.

It now seems fairly clear that a good HPMI for primes 197-1669 can be written based on the vector use of single precision floating point numbers.  Modern chip makers seem to give floating-point arithmetic precedence, and the result is that it tends to be very fast.  A single-precision floating-point representation of an integer between -2^24 and +2^24 is exact, and with a single instruction to perform 8 operations of the form C+=A*B, and the fact that two of these can be executed every clock cycle, a speed of 16 operations per clock can be aimed at, though probably not quite achieved.  This is approximately the same as the current AS-based HPMI for 67-193, and it likely to be a close call as to which method is faster.

The central fragment of the inner loop holds nine AVX registers each with eight entries, which we may call C 0 0 . . . C 0 23, C 1 0 . . . C 1 23, C 2 0 . . . C 2 23.  We load B j 0 . . . B j 23 and also A 0 j, A 1 j, A 2 j and do nine fused multiply-and-add operations Cik+=Aij*Bjk  The next fragment is for j+1.

Adequate reduction is done by multiplying by r – the nearest float to 1/p – rounding to the nearest integer and then doing a multiply-and-subtract.  For the largest primes (about 1,200 – 1669) this reduction is done every 24 operations, but for smaller primes once every 48, 72 or even fewer will suffice to avoid overflow.

It seems likely that holding the entries (for A and B, and possibly even C) in memory as 16-bit integers rather than 32-bit floats will be preferable, since although the conversion takes compute-work, the saving of memory bandwidth is probably usually of greater importance.

There are many details, not all of which I have yet worked out, but I am now convinced that the project is basically possible.  In principle there are three versions of the central loop, depending on whether SSE, AVX or AVX-with-FMA is used.