So far, the High Performance Meataxe Interface caters only for primes 2 and 3. I have been looking at the design for 5-181 using AS-codes, mainly with a view to sorting out what a forward-looking interface design should be. I should probably also look at much larger primes, which use multiplication, to get a complete view, but I thought it worthwhile to post my conclusions so far.

There are currently six major routines in the HPMI. I am aware also that there will need to be two more for Gaussian elimination (extract selected columns from Cfmt into Dfmt, and convert Dfmt to Cfmt) in the end, but for multiplication C += A x B the main ones are:

**DtoA** – convert the general Dfmt to Afmt. This is the coding routine, which takes each element (or handful of elements) of matrix A and works out how that multiple is to be added.

**DtoB** – convert matrix B from Dfmt into Bfmt – a format more suitable for BSeed, but still compact, so as not to waste memory bandwidth.

**BSeed** – take a block of Bfmt and complete its conversion to give the basic vectors of the BWA (Brick Work Area).

**BGrease** – compute all the grease multiples needed in the (already seeded) BWA.

**BwaMad** – the main (cubic) operation to use Afmt and the BWA and do the multiply-and-add into the Cfmt.

**CtoD** – convert the answer from Cfmt back into the general Dfmt.
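To show how the six routines above fit together as a pipeline, here is a toy illustration only: in the real HPMI, Afmt, Bfmt and Cfmt are chip-specific packed formats, and every name, type and signature below is my own sketch, not the actual interface. In the toy, all formats are just one byte per field element, so the conversions are trivial, but the call ordering for C += A x B is the point.

```c
#include <stdint.h>
#include <string.h>

#define P 5   /* toy prime */
#define N 4   /* toy dimension */

typedef uint8_t Mat[N][N];

/* DtoA: code up A for the multiply-and-add step (identity in this toy) */
static void DtoA(Mat d, Mat a) { memcpy(a, d, sizeof(Mat)); }

/* DtoB: re-lay B in a format more suitable for seeding (identity here) */
static void DtoB(Mat d, Mat b) { memcpy(b, d, sizeof(Mat)); }

/* BSeed + BGrease would populate a Brick Work Area with B's rows and
   their grease multiples; the toy just copies B. */
static void BrickSeedGrease(Mat b, Mat bwa) { memcpy(bwa, b, sizeof(Mat)); }

/* BwaMad: the cubic multiply-and-add into the Cfmt accumulator */
static void BwaMad(Mat a, Mat bwa, Mat c)
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] = (uint8_t)((c[i][j] + a[i][k] * bwa[k][j]) % P);
}

/* CtoD: convert the accumulator back to the general format (identity here) */
static void CtoD(Mat c, Mat d) { memcpy(d, c, sizeof(Mat)); }

/* The whole pipeline: C += A x B over GF(P) */
void hpmi_mul_toy(Mat A, Mat B, Mat C)
{
    Mat a, b, bwa, c;
    DtoA(A, a);
    DtoB(B, b);
    BrickSeedGrease(b, bwa);
    memcpy(c, C, sizeof(Mat));   /* Cfmt starts from the existing C */
    BwaMad(a, bwa, c);
    CtoD(c, C);
}
```

The real interest is that each stage can be swapped for a chip- and prime-specific variant without the callers changing.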

Of these, BwaMad is the performance-critical heart and probably needs to be written in assembler in all cases. It seems that BGrease will usually need assembler assistance also – a poor implementation would make it the slowest step. The others are more like “support” routines that need to know the nitty-gritty details of BwaMad and BGrease, but are not themselves under great performance pressure.

In practice, BSeed, BGrease and BwaMad are combined into a routine **BrickMad** that simply calls them all in turn, but for implementation it seems important to treat the three as independent.
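To make the seed/grease/mad split concrete, here is a hedged toy in C: a Brick Work Area that holds, for each seeded row, all p scalar multiples, so that the multiply-and-add step becomes pure lookup-and-add with no multiplications. The struct layout and function names are invented for illustration and do not match the real HPMI code.

```c
#include <stdint.h>

#define P 5       /* toy prime */
#define W 8       /* row width in this toy */
#define NROWS 2   /* basic vectors seeded per brick */

/* Toy Brick Work Area: for each seeded row, all P scalar multiples. */
typedef struct { uint8_t mult[NROWS][P][W]; } Bwa;

/* BSeed: install the basic vectors (multiples 0 and 1) of this brick of B */
void bseed(Bwa *bwa, uint8_t rows[NROWS][W])
{
    for (int r = 0; r < NROWS; r++)
        for (int j = 0; j < W; j++) {
            bwa->mult[r][0][j] = 0;           /* 0 x row */
            bwa->mult[r][1][j] = rows[r][j];  /* 1 x row */
        }
}

/* BGrease: fill in multiples 2..P-1 by repeated addition mod P */
void bgrease(Bwa *bwa)
{
    for (int r = 0; r < NROWS; r++)
        for (int s = 2; s < P; s++)
            for (int j = 0; j < W; j++)
                bwa->mult[r][s][j] =
                    (uint8_t)((bwa->mult[r][s-1][j] + bwa->mult[r][1][j]) % P);
}

/* BwaMad: accumulate scalars[r] x row[r] into acc, using only lookups/adds */
void bwamad(Bwa *bwa, uint8_t scalars[NROWS], uint8_t acc[W])
{
    for (int r = 0; r < NROWS; r++)
        for (int j = 0; j < W; j++)
            acc[j] = (uint8_t)((acc[j] + bwa->mult[r][scalars[r]][j]) % P);
}

/* BrickMad: simply calls the three in turn, as described in the text */
void brickmad(Bwa *bwa, uint8_t rows[NROWS][W],
              uint8_t scalars[NROWS], uint8_t acc[W])
{
    bseed(bwa, rows);
    bgrease(bwa);
    bwamad(bwa, scalars, acc);
}
```

Keeping the three steps separate is what later allows a seeded-and-greased BWA to be retained and reused against many blocks of Afmt.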

The choice of how to do all these operations depends on the chip. The routine **hpmiset**, which is called at the beginning of the program run and is different for each “technology” (GEN, NEH, PIL and HAS, with ZEN and V64 in sight), must therefore set a collection of flags and parameters to control these six routines, and it appears that it needs to be able to do this for each routine separately.
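One can imagine hpmiset filling in something like the following structure. The flag names and technology names are the ones discussed in this post; the struct layout, the enum, and the particular values are entirely my own guesses, for illustration only.

```c
/* Hypothetical sketch of per-routine control set by hpmiset. */
typedef enum { TECH_GEN, TECH_NEH, TECH_PIL, TECH_HAS,
               TECH_ZEN, TECH_V64 } Tech;

typedef struct {
    int AfmtMagic;    /* selects the DtoA variant    */
    int BfmtMagic;    /* selects the DtoB variant    */
    int CfmtMagic;    /* selects the CtoD variant    */
    int SeedMagic;    /* selects the BSeed variant   */
    int GreaseMagic;  /* selects the BGrease variant */
    int BwaMagic;     /* selects the BwaMad variant  */
} HpmiControl;

/* hpmiset-style initialisation: one routine choice per flag,
   chosen from the technology and the prime (values illustrative). */
HpmiControl hpmiset_sketch(Tech tech, unsigned long p)
{
    HpmiControl ctl = {0};
    if (p >= 5 && p <= 181) {
        ctl.AfmtMagic = 1;   /* e.g. byte->byte table-lookup coding */
        ctl.BwaMagic = (tech == TECH_HAS) ? 2 : 1;  /* invented */
    }
    return ctl;
}
```

The point of the sketch is only that each of the six routines gets its own selector, so variants can be mixed and matched per technology and per prime range.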

As each of the routines typically takes thousands of clock cycles, the main routine (e.g. DtoA) can afford to switch on its flag at run time. Certainly the flags AfmtMagic, BfmtMagic and CfmtMagic are needed to control DtoA, DtoB and CtoD respectively, and the simplest way forward seems to be to have SeedMagic, GreaseMagic and BwaMagic to control the other three in much the same way. Although the set of useful combinations of these flags looks set to grow considerably, many of the routines can be shared between combinations. For example, a version of DtoA that takes each byte and looks it up in a byte->byte table to get the Afmt looks fine for the whole range 5-181, even though other routines (particularly BwaMad) will have several different versions for this range, depending on whether 8, 10 or 16 bits are used in Cfmt, and on how the reduction is to be done.
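The byte->byte table version of DtoA mentioned above can be sketched as follows. The table contents here are only a placeholder (a bare reduction mod p), since the real Afmt coding would depend on how BwaMad consumes each scalar; the names are invented.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of a table-driven DtoA for primes 5..181: each Dfmt byte holds
   one field element, and a single 256-entry byte->byte table converts it
   to its Afmt code. */
static uint8_t dtoa_table[256];

/* Build the table once per prime; the real coding would be richer
   than this placeholder reduction mod p. */
void dtoa_table_init(unsigned p)
{
    for (unsigned b = 0; b < 256; b++)
        dtoa_table[b] = (uint8_t)(b % p);
}

/* The conversion itself: one lookup per element, no arithmetic */
void dtoa_bytes(const uint8_t *dfmt, uint8_t *afmt, size_t n)
{
    for (size_t i = 0; i < n; i++)
        afmt[i] = dtoa_table[dfmt[i]];
}
```

Because the per-prime knowledge lives entirely in the table, one routine body can serve the whole 5-181 range even while BwaMad fragments into many variants.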

A full implementation for all primes from 2 to 2^64 would certainly need a lot of versions of BwaMad, and probably quite a few BSeed versions as well, but being able to re-use the other routines as much as possible is a simplification, and so far it appears to be quite a significant one.

Not that I have any immediate plans to implement these other characteristics. My current aim is to reorganize the HPMI code and document it, so that the remaining characteristics can be implemented later without too much repetition. At the moment, the choice of what to do is far too haphazard and implicit.

Another factor that has come out of this work is that the higher-level routines need some guidance – how many rows it is best to do at once, for example. These are only recommendations – the system will work anyway – but there are several ways in which the “technology” – HAS, NEH etc. – can usefully give information to help the higher-level functions work well. In particular, I notice that the AS implementations will expand the data in Cfmt considerably, making it likely that a Cfmt block should be held constant for as long as possible, so the “space-filling curve” methods are not appropriate in this case.
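As a sketch of the kind of guidance meant here – every field name below is invented, and the values are placeholders – the technology layer might export something like:

```c
/* Hypothetical advice a technology could hand to higher-level routines.
   These are recommendations only; the system works without them. */
typedef struct {
    int rows_at_once;     /* suggested number of A rows per pass          */
    int keep_cfmt_fixed;  /* 1 if a Cfmt block should be held constant
                             for as long as possible (e.g. AS range,
                             where Cfmt is much expanded)                 */
} HpmiAdvice;

HpmiAdvice hpmi_advice_sketch(unsigned long p)
{
    HpmiAdvice a;
    a.rows_at_once = 64;            /* illustrative default only          */
    a.keep_cfmt_fixed = (p >= 5);   /* AS-code range expands Cfmt         */
    return a;
}
```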

Another improvement is that by exposing BSeed/BGrease/BwaMad, it becomes possible (later!) to hold a small B (e.g. 1,000 x 1,000) as a collection of seeded and greased BWAs, which may well be advantageous for programs like vector-permute.

My feeling overall is that the HPMI, which started many years ago as an experiment in performance programming, is heading towards a bit more maturity, hopefully forming a basis for any implementation of finite-field matrix multiplication at high performance.