This is joint work with Steve Linton.
It is now six years since the basic design of the HPMI mod 2 was laid down, and it seems that there are several tricks that could be applied, aiming (though this is not yet proven) for an speed increase of around 30%.
The first (and main) idea is to use an “accumulator”. Even on older chips we have sixteen registers – enough to hold two 1024-bit vectors (cauldrons), and by holding one vector throughout the process, we can change it gradually (following a Gray code) and do our work in that order, rather than sequentially as at present. That way we can use one add (and avoid the load) to deal with about 11 bits.
Once we are adding in an accumulator every time, we can also add some but not others in via the accumulator, steering the lower bits somewhat and gaining a bit more.
Another useful idea is to put the computation of the grease-code into a data-driven approach also. That way we only need to compute the vectors we are actually going to need, a trick that is essential to the Gray-code approach.
Using all these tricks (and a few more) the plan now is that instead of adding in 8 slices of grease level 4 (8 loads, 8 adds) we add in 5 vectors only, three via the accumulator which is added in also – 5 loads, 6 adds.
There is a fairly simple way to do this. The idea is to divide the 32-bits of A into 11+1+10+10, done as follows.
Our first job is to make 2048 lists and counts of what values the first 11 bits take. We then go through the 2048 slots in Gray code order noting the (top 11 bits of) the xor of all pairs of adjacent ones where the count is non-zero. These will need to be made. Steve’s simulations suggest that there will typically be 39-44 of these.
Kaikkonen and Rosendahl provide a set of 51 vectors (plus zero) in a 10-space such that every vector is the sum of two of them. We store 104 21-bit vectors 1 x 0 and 0 0 x for each x in this code, and add either an odd or an even number of 1 x 0 in via the accumulator depending on what we want the 1 bit to be for the next value.
The result is that we need to store 143-148 [104+(39-44)] entries which is rather more than the 16K I usually allow for the use of L1, but it seems that enough of them are accessed very infrequently for this to be OK.
The result of this is that the BGrease and BWAMad routines get amalgamated into one two-phase code that populates the Brick Work Area in phase 1 and multiplies a 1024×32 A by a 32×1024 B adding into a 1024×1024 in phase 2.
I’m not quite clear how to share the bits in the Afmt – one idea is to use 14 bits for the row of C and 7+7+6+6+6 for the BWA entries, making a total of 48 bits – rather more than the 40 used at present. This all represents a noticeable increase in the demand for Afmt which is a bit worrying but I think it is OK.
I am not quite at the stage of starting to code this, but I am now pretty sure that mod 2 can be made a good bit faster, and am keen to demonstrate this in the near future.