Jon Thackray joins the “team”

The work involved in producing the next version of meataxe64 has proved a bit too heavy for me, but Jon Thackray has agreed (at least in principle) to move from a frequent advisor and assistant to a member of the team, which now numbers two!

My feeling is that my own skills are at the lower levels of the software, and I love to squeeze another 10% of performance out of almost anything. Those who have tried to use my offerings, however, will know that my skills at user interface are (English understatement) not always top-notch, and the idea is that I write the low-level functions (as I do now) and Jon takes charge of the higher levels.

More specifically, the idea is that there is a “functions” layer, where the low-level ones, such as multiplication, Gaussian, transpose etc. I write, and the higher level ones (such as null-space, invariant subspace, split and standard base) Jon writes, along with all the programs that use them.

There remains much to do to sort out the mess it is currently in, but I feel the focus of drawing a clear line between performance and usability has a good chance of resulting in a system that has both.


HPMI redesign.

So far, the High Performance Meataxe Interface caters only for primes 2 and 3.  I have been looking at the design for 5-181 using AS-codes, mainly with a view to sorting out what a forward-looking interface design should be.  I probably should look also at much larger primes that use multiplication for a complete view, but I thought it worthwhile to post my conclusions so far.

There are currently six major routines in the HPMI.  I am aware also that there will need to be two more for Gaussian (extract selected columns from Cfmt into Dfmt, and convert Dfmt to Cfmt) in the end, but for multiplication C += A x B the main ones are . . .

DtoA – convert the general Dfmt to Afmt.  This is the coding routine, taking each element (or handful of elements) of matrix A and working out how to add that multiple.

DtoB – convert matrix B from Dfmt into Bfmt – a more suitable format for BSeed, but without using too much space to avoid using too much memory bandwidth.

BSeed – take a block of Bfmt and complete its conversion to give the basic vectors of the BWA (Brick Work Area).

BGrease – compute all the grease multiples needed in the (already seeded) BWA.

BwaMad – the main (cubic) operation to use Afmt and the BWA and do the multiply-and-add into the Cfmt.

CtoD – convert the answer from Cfmt back into the general Dfmt.

Of these, BwaMad probably needs to be written in assembler in all cases, and is the performance critical heart.  It seems that BGrease will usually need assembler assistance also – a poor implementation would be the slowest step.  The others are more like “support” routines, that need to know the nitty-gritty details of BwaMad and BGrease, but are not themselves under great performance pressure.

In practice, BSeed, BGrease and BwaMad are combined into a routine BrickMad that simply calls them all in turn, but for implementation it seems important to treat the three as independent.

The choice as to how to do all these operations will depend on the chip, so the routine hpmiset that is called at the beginning of the program run, and is different for different “technologies” (GEN, NEH, PIL and HAS, with ZEN and V64 in sight) must set a collection of flags and parameters to control these six routines, and it appears that it needs to be able to do this for each routine separately.

As each of the routines typically takes thousands of clock cycles, the main routine (e.g. DtoA) can afford to switch on its flag.  Certainly the flags AfmtMagic, BfmtMagic and CfmtMagic are needed to control DtoA, DtoB and CtoD respectively, and the simplest way forwards seems to be to have SeedMagic, GreaseMagic and BwaMagic to control the other three in much the same way.  Although the set of useful combinations of these flags looks set to grow hugely, it seems that many functions can be used in various ways.  For example, a version of DtoA that takes each byte and looks it up in a byte->byte table to get the Afmt looks fine for 5-181, even though other operations (particularly BwaMad) will have several different versions for this range, depending on whether 8, 10 or 16 bits are used in Cfmt, and how the adequate reduction is to be done.

A full implementation for all primes 2 to 2^64 would certainly have a lot of versions of Bwa, and probably quite a few BSeed versions as well, but I think it is a simplification to be able to re-use the other routines as much as possible, and so far it appears that this simplification is quite significant.

Not that I have any immediate plans to implement these other characteristics.  My current aim is to reorganize the HPMI code and document it, so that they can be written later without too much repetition.  At the moment, the choice of what to do is far too haphazard and implicit.

Another factor that has come out of this work is that the higher-level routines need some guidance – how many rows is it best to do at once, for example.  These are only recommendations – the system will work anyway – but there are several ways in which the “technology” – HAS, NEH etc. – can usefully give information to enable the higher level functions to work well.  In particular, I notice that the AS implementations will do considerable expansion of the data in Cfmt, making it likely that a Cfmt block should be held constant for as long as possible, so that the “space-filling curve” methods are not appropriate in this case.

Another improvement is that by exposing BSeed/BGrease/BwaMad, it becomes possible (later!) to hold a small B (e.g. 1,000 x 1,000) as a collection of seeded and greased BWAs, which may well be advantageous for programs like vector-permute.

My feeling overall is that the HPMI, which started many years ago as an experiment in performance programming, is heading towards a bit more maturity, hopefully forming a basis for any implementation of finite-field matrix multiplication at high performance.

Use of L3 cache in matrix multiplication.

At the bottom end, the “block” multiply-and-add works with matrices of approximately 128K, a scale that is primarily dictated by the size of L2 cache.

Here I consider the most cache-friendly way to multiply matrices, in a single core, of “slab” scale – very approximately half a gigabyte in size, using this bottom end.

We therefore wish to do Cik += Aij x Bjk for all i,j,k each taken from a large set, and for correctness purposes the sequence in which these are done is irrelevant. I use the notation <ijk> to mean Cik += Aij x Bjk, and write the sequence in that form, so that , for example, <123><456>… means C13+=A12xB23 then C46+=A45xB56 . . .  For estimation purposes, however, I will assume here that the blocks are all the same size (irrespective of whether they are of matrix A, B or C) and that there are about 50 blocks in each of the three directions – i.e. i, j and k are all chosen from a set of about 50.  The format conversions are done at the “slab” scale also, so the points of relevance here are to make the blocks of matrices Aij/Cik as large as possible subject to the matrix Cik residing in L2 cache, so that the Brick Population (including computing the linear combinations for “grease”) is used as much as possible. In other words we divide our blocks so that the index set of i is as small as we can.

Subject to that, our aim is to make best use of L3 cache.

In the total absence of L3 cache, the best approach seems to be to pin each Cik in turn into L2 cache, so that the inner loop is over the variable j. We may represent this as <000><010><020><030><040>…, and with this method (I call it “strategy 1”) each block multiply-and-add needs to fetch both the Aij and the Bjk blocks from memory, and once every 50 blocks we must write and read matrix Cik, so the total Data Accesses per Block (DAB) is about 2.04 – 1 for A, 1 for B, 0.02 to write C and 0.02 to read C.

V1e used a method assuming space for about 10 blocks in L3 cache, and achieved a DAB=1.54 in this case. This was a worthwhile step, implemented without much thought, but I will not describe it further here as it seems strictly worse than the method below, though this is as yet unimplemented, so I cannot be sure.

It is interesting that a better approach is suggested by studying “space filling curves”, and in particular the Hilbert curve in 3 dimensions seems relevant. One difference is that, although we want our seqence of <ijk> triples to be such that the pairs ij, jk and (particularly) ik occured as recently as possible, either two values of (say) i are equal or they are not – we are not interested in how much they differ.

Looking at the tables of recent chips – both Intel and AMD – it seems that L3 cache-per-thread varies between about six and twenty times the size of L2 cache-per-thread, and ideally we would like to make optimum use of this in a “robust” way that makes good use if it is six, and better use if it is twenty.

I am not really entirely sure exactly what I am trying to optimize here. The relative importance of speed with L3=0, with L3=6xL2, with L3=20xL2 and all the values in between is unknown and will probably remain so.

After considerable experimentation, however, I have concluded that the following sequence is “good” and propose to implement it.

<000><001><101><100> <110><111><011><010> <020><021><121><120>…

where each triple is the same as the one 8 back, but with the middle (j) value increased by two.

By my calculations this achieves a DAB=1.04 with L3=11, DAB=1.54 with L3=8, DAB=2.27 with L3=5 and DAB=3.5 with L3=0.

By way of comparison, assuming mod 2 with blocks 1024, the multiplication work takes about 1,000,000 clocks, so a DAB of 1 unit must fetch, from real memory, 128K in that time. At 3 GHz with 18 cores on a chip, this requires about 6.4 GB/s, so even strategy 1 (DAB=2.04) is probably OK on all current chips. We are starting to get close, however, and with AVX512 and perhaps ever more cores on a chip, it seems likely that these methods will be needed fairly soon.

In any case, it seems best to do this as it is nearly always better than the V1e method. Also it means that meataxe64 can run alongside programs that are more demanding on their memory bandwidth.

Meataxe64 tests computers.

My attempts to get good timings for Gaussian with transformation matrices have not gone very well so far.  It seems that using so many cores, so much memory, so much disk I/O so hard for so long is asking a lot of the poor little machine.  It gets hot, and must run hot for many hours.  I have so far run into an operating system bug, disk I/O errors and memory errors and am starting to despair that credible benchmarks may not be obtainable.

It may be that the best use of meataxe64 is not to compute matrix products, invariant subspaces, ranks, inversions and suchlike, but rather to check out a computer to see if it is fit and healthy before using it for more important work.

Spinning up an invariant subspace.

An important problem – perhaps the main problem – for the meataxe is to take a space (often only one vector to start with) and to repeatedly multiply vectors in the space by the generators and include the result in that space also until the full space, invariant under all the generators, is found.

Version V1e was able to do this, though the methods used were far from efficient, and I feel that meataxe64 has now reached the point where it is worth considering this process more carefully.  Over the years, many ideas have popped up, some of them implemented, some not, and this post is simply listing those ideas.  It is far from clear what is worth doing and what isn’t. Anyway, here is a fairly long list of ideas that I am currently considering.  At some point soon I’ll have to decide which to implement for V2c.

1.  To test for inclusion in the space we already have, we naturally put vectors into some sort of echelon form, and we ought to utilize this when multiplying by the generators.  Meataxe64 naturally holds a matrix in echelon form as a bit-string and a “remnant”, so we really should be able to multiply vectors in this form.

2.  It is often possible to hold the generators in some compact form, saving on both I/O and multiplication work.  Permutations, tensors, “sparsified” matrices can all be used for this purpose, and a general-purpose vector multiply (ideally taking point 1. into account) would save a lot of time.  Even if one generator is in full-matrix form, we can often make the other(s) in a compact form, and by using them preferentially, avoid much of the I/O on the full-matrix generator.

3.  In extreme cases of compact generators (e.g. splitting a permutation representation) the multiply will be so fast that the time will be dominated by the Gaussian work.  In this case it may be appropriate to consider “rip-off” techniques where the Gaussian is only done to some randomly/cleverly chosen subset of the columns, possibly remembering the linear combinations that were zero on our subset to check later that they are indeed zero on the remaining columns.

4.  When dealing directly with a group (this may be totally useless for condensations) the generators may satisfy short relators, and we would ideally like to use these to save work.  For example, if one of the generators has order two, there is no point multiplying a vector by that generator twice in succession, saving both the multiplication and the subsequent inclusion test.

5.  Taking that idea one step further, I can’t help feeling that the relators could also be used somehow to save work at the action-computing stage.  This must surely be true with standard basis, at least.

6.  The objective is usually to produce the invariant subspace in echelon form, which allows the action on both the subspace and the quotient space to be computed fairly easily.  This enables us also to get the action on further matrices not used for the spinning, which can often be useful.

7.  Often the vector selected for spinning is taken more-or-less at random from a null-space.  During the early phases of spinning, where the time is anyway mainly spent reading the generators, we may as well spin up all the vectors in a basis for that null-space, enabling us later to make a bigger starting space for any vector we may later choose.  This opens up the prospect of learning further condensation tricks to choose that starting vector a bit more intelligently.

8.  It seems clear that the action on the subspace is actually being computed by the spinning process, if only we were clever enough to see what it was.  I can’t help feeling that we should somehow keep better notes of what we did in the spinning process, for later post-processing to get this action if it were needed.

9.  Much (but not all) of this technology is appropriate also for the “standard basis” operation.  Here it is critical that the vectors multiplied are not the result of Gaussian elimination, but there are still lots of tricks (mentioned above) that do apply.

10.  In characteristic zero, and particularly over the ordinary integers, it is critical to hold a “lattice” (i.e. a quadratic form invariant under the group) to assist with various steps.  Somehow this idea has never been used as yet in the modular case, even though quadratic and unitary forms are often readily available.  As an example, if a space is invariant, so is its perpendicular, and it is surely faster to use this than to spin up again.

Invert now working

Not too much trouble once the chopped-up Gaussian with transformation matrix stuff was working.

6 mins 19 sec to invert a random (non-singular) 50,000 x 50,000 matrix mod 3 on my 1.7 GHz 2 core (Haswell) laptop.

Just got to work out a reasonable strategy now for “split” and “standard base”, and I think version 2 will be complete and ready for the documentation, validation and shell-scripts phases.  Looking like a few weeks now – less than a year late!

Klaus got V1e working on a mac.

Klaus Lux informs me that V1e with Haswell arithmetic seems to work on a mac.  This required some edits to the assembler code, and he sent me the (working) result . . . if anyone want’s to run V1e on a mac, ask me (or him) for the version of pcritHASs.s with the (fairly minor) changes to suit the assembler syntax.

Timings for Gaussian with transformation matrices

The “full” chopped-up Gaussian with keeptrack/transformation matrix seems now to be working well. Below are some sample timings on my 2 (Haswell) core laptop (an i3 4005U @ 1.70 GHz) on random square matrices. The matrix was chopped between 10×10 and 20×20 in all cases.

field dim time

2 120,000 12m 51s

3 50,000 5m 57s

4 50,000 4m 42s

5 12,000 5m44s

8 50,000 9m 31s

13 12,000 10m 16s

16 25,000 2m 20s

125 12,000 20m 54s

32768 5,000 8m 10s

As a reminder, multiply is currently fast for {2,3,4,8 9 16}

Chopped-up Gaussian finally working!

After about three years of work, involving considerable effort from Steve Linton, Jon Thackray, Gabriele Nebe and Alice Niemeyer, the “chopped-up Gaussian” is finally completely working.  It performs the “echelize” function on a matrix (putting it into negative reduced echelon form and keeping track of the row-operations done) by chopping the matrix in both directions into little blocks, which are then operated on (on multiple, shared-memory x86 cores) at the “little block” scale throughout, collectively achieving the echelize function on a large matrix.

This heroic, joint effort means that meataxe64 now has the complete set of primitives needed for basic representation theory work, adding inverse and standard-basis at large scales to the others, which have been working for a while.

There is, as always, still a lot of tidying up to do – checking that memory is free’d, writing the inverse function itself (echelize produces its negative!) etc. etc. etc., but somehow it feels that it is now downhill all the way to version 2.

First look at Ryzen

Some first analysis (thanks Agner Fog!) and benchmarks are out for the new AMD Ryzen chip, and although I haven’t got my hands on one, I have had a look at the details.  Here are my thoughts so far.

To summarize, mod 2 and 3 on Ryzen look about 55% of the speed of Haswell etc., but this can perhaps be increased to 75% by writing code tuned for Ryzen.  For other characteristics I expect the Ryzen to run at about 115% (so 15% faster) and again this can probably be improved a bit by tuning.

The main observation is that Haswell etc. is faster at vector code, and Ryzen faster for “ordinary” code.

The first thing to notice is that the AVX2 (256-bit) work is done 128-bits at a time, both for operations and for fetching (from L1 cache).  Hence the current code mod 2 and 3 will run about half-speed compared to the Intel Haswell and later chips.

Looking more closely, however, for “ordinary” things it manages six mu-ops per clock, and the L2 cache is 512 KB and quite fast – not that much slower than the L1.  All that suggests that a new arithmetic version “ZEN” should be created for this chip.  The basic idea is . . .

  1.  We should use shorter rows, as the Ryzen can do a lot more administration (address computation) alongside the real work at vector length.  The HAS arithmetic is flat out doing the vector operations, whereas on ZEN (mod 2) one instruction/two mu-ops for the vector operation leaves three instructions/four mu-ops spare, which we should try to use.
  2.  We should consider putting the greased brick into L2 rather than L1.  It is so much bigger and not that much slower.

The idea of both of these is to increase the grease level, but unfortunately we are then in danger of spending too much time greasing the brick, so we can’t go too far down that road.  Nevertheless, mod 2, by using 256-bit rows (instead of 1024) we could keep 4096 rows of C in the same space as 1024 took, making grease level 6 entirely sensible, and fitting into L1.

Whereas mod 2 is entirely limited by bandwidth, mod 3 looks limited by the mu-ops (two for each logical operation at 256-bit length).  Here the temptation is therefore to put the greased brick into L2 and use a bigger grease level that way, on the basis that the fetches are not such a great part of the work and we can afford to fetch from L2.

Other characteristics (not specifically programmed in assembler) are based around table look-up, and it looks to me like this should be about 15% faster on Ryzen than on the Intel chips even with current code, and it may be that further tuning can improve this further.

It should be emphasized that this is all theoretical.  Anyone out there got a Ryzen I can log on to?