I have decided to move the decision as to which assembler routines to call from compile-time to run-time. There are three reasons for this.
The first, which Klaus mentioned a couple of years back, is that it can be convenient to install the programs, already compiled, into a directory mounted on different machines. This requires the compiled code to run on all the machines, thereby failing to take advantage of those with a more powerful architecture. The current change will fix part of that, but still leave the number of threads and memory size compiled in – an issue which may require attention later.
The second arises from the “large fields of characteristic 2” work, which needs the CLMUL (“carry-less multiplication”) flag, and from the HPMI for 5-181 work, which needs SSE4.1 as it stands. I have previously taken the view that “modern” machines have all these instructions, and that meataxe64 needs a modern enough machine, but I feel that approach, although probably acceptable, is not ideal.
The third reason is that Markus and now Steve have been trying to incorporate meataxe64 stuff into GAP, and the old approach is not really acceptable in that environment.
The CPUID instruction of the x86 returns a large and confusing wealth of detail about the machine, and it is a heavy job to pick through all these flags and values. I propose, therefore, to write an assembler routine to summarize this into a few bytes of data more suitable for choosing between routines and strategies in meataxe64. The main properties of an (x86-64) machine are given by a single byte, which I call the class. The values I propose are as follows, where each class implies all the previous ones.
‘a’ – All machines, including those where LAHF and CMPXCHG16B don’t work. I do not know whether even Linux or the output of gcc would run on these!
‘b’ – LAHF and CMPXCHG16B have been fixed and SSE3 is available.
‘c’ – SSSE3 (Supplemental SSE3, the rest of SSE3) is available.
‘d’ – SSE4.1 is available. This is critical for primes 17-61, which ideally work in 10-bit fields, but that needs PMULDQ. Otherwise these primes must work in 16-bit fields, making them about 50% slower.
‘e’ – SSE4.2 is available. This includes the byte-string processing instructions, which look useful, though I haven’t used them yet.
‘f’ – CLMUL is available. This allows the “large fields of characteristic 2” stuff, speeding it up by a factor of over 1,000.
‘g’ – AVX. This allows 256-bit working, but for floating point only, which is not much use in meataxe64. Nevertheless the VEX prefix clears the top half (allowing better mixing of SSE and AVX) and gives 3-operand instructions, which are of some minor use.
‘h’ – BMI1. A few useful bit-manipulation instructions.
‘i’ – BMI2. Some more bit-manipulation instructions, including PEXT/PDEP which look useful, though again I haven’t used them (yet).
‘j’ – AVX2. This allows the use of 256-bit registers for integral work.
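To make the "each class implies all the previous ones" rule concrete, here is a rough C sketch of how such a class byte could be derived. It is not the assembler routine proposed above: the names `class_from_features`, `detect_features` and the `FEAT_*` bits are invented for illustration, and the detection half assumes gcc or clang on x86-64 (using `<cpuid.h>` and `__builtin_cpu_supports`). On any other compiler or machine it simply reports class ‘a’.

```c
#include <stdint.h>

#if defined(__GNUC__) && defined(__x86_64__)
#include <cpuid.h>
#endif

/* Hypothetical feature bits, one per step up the class ladder. */
enum {
    FEAT_B_BASE = 1u << 0,  /* 'b': LAHF, CMPXCHG16B and SSE3 */
    FEAT_SSSE3  = 1u << 1,  /* 'c' */
    FEAT_SSE41  = 1u << 2,  /* 'd' */
    FEAT_SSE42  = 1u << 3,  /* 'e' */
    FEAT_CLMUL  = 1u << 4,  /* 'f' */
    FEAT_AVX    = 1u << 5,  /* 'g' */
    FEAT_BMI1   = 1u << 6,  /* 'h' */
    FEAT_BMI2   = 1u << 7,  /* 'i' */
    FEAT_AVX2   = 1u << 8   /* 'j' */
};
#define FEAT_COUNT 9

/* Climb from 'a' and stop at the first missing feature, so that a
   class is only reported if every earlier class is also satisfied. */
char class_from_features(uint32_t feats)
{
    char cls = 'a';
    for (int bit = 0; bit < FEAT_COUNT; bit++) {
        if (!(feats & (1u << bit)))
            break;
        cls++;
    }
    return cls;
}

/* Collect the feature bits for the machine we are running on.
   Returns 0 (class 'a') where the compiler support is missing. */
uint32_t detect_features(void)
{
    uint32_t feats = 0;
#if defined(__GNUC__) && defined(__x86_64__)
    unsigned eax, ebx, ecx, edx;
    int lahf = 0, cx16 = 0;

    __builtin_cpu_init();
    /* Leaf 0x80000001, ECX bit 0: LAHF/SAHF usable in 64-bit mode. */
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        lahf = (ecx & 1u) != 0;
    /* Leaf 1, ECX bit 13: CMPXCHG16B. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        cx16 = (ecx & (1u << 13)) != 0;

    if (lahf && cx16 && __builtin_cpu_supports("sse3")) feats |= FEAT_B_BASE;
    if (__builtin_cpu_supports("ssse3"))  feats |= FEAT_SSSE3;
    if (__builtin_cpu_supports("sse4.1")) feats |= FEAT_SSE41;
    if (__builtin_cpu_supports("sse4.2")) feats |= FEAT_SSE42;
    if (__builtin_cpu_supports("pclmul")) feats |= FEAT_CLMUL;
    if (__builtin_cpu_supports("avx"))    feats |= FEAT_AVX;
    if (__builtin_cpu_supports("bmi"))    feats |= FEAT_BMI1;
    if (__builtin_cpu_supports("bmi2"))   feats |= FEAT_BMI2;
    if (__builtin_cpu_supports("avx2"))   feats |= FEAT_AVX2;
#endif
    return feats;
}
```

A caller would obtain the class with `char cls = class_from_features(detect_features());`. Keeping the ladder logic separate from the detection means the ordering rule can be checked with made-up feature sets, without needing an old machine to hand.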
To the user, the main question is whether the machine is class ‘j’ (Haswell, Broadwell, Skylake, Ryzen) or not, affecting the speed by about a factor of two in most cases.
To the programmer, this allows fingertip control over each routine separately, so that the large fields in characteristic 2 test class >= ‘f’ and the 10-bit working tests class >= ‘d’, avoiding the previous ludicrous situation where 10-bit working was not allowed because CLMUL was not implemented.
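As an illustration of the per-routine tests just described, a minimal C sketch might look like the following. The function names and the `field_width` type are hypothetical, not meataxe64's own, and the class byte is assumed to have been computed elsewhere.

```c
/* Field width used for primes 17-61 (illustrative type). */
typedef enum { FIELD_16BIT, FIELD_10BIT } field_width;

/* 10-bit working needs SSE4.1, which is class 'd' or above. */
field_width width_for_primes_17_61(char cls)
{
    return cls >= 'd' ? FIELD_10BIT : FIELD_16BIT;
}

/* The large fields of characteristic 2 need CLMUL, class 'f' or above. */
int can_use_clmul(char cls)
{
    return cls >= 'f';
}
```

Because each routine makes its own comparison, a class ‘d’ or ‘e’ machine gets 10-bit working even though it has no CLMUL.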
No doubt further classes will be added later – AVX-512F is looming large on the horizon now.
After this change, an old machine can only use meataxe64 if it can at least assemble the new instructions. I suspect this will also need to be addressed in the fullness of time.
One effect of this is that it is now worthwhile to implement the AS-stuff in 16 bits, which will speed up 5-13 and 67-181 by a few percent and also allow 17-61 to work on classes ‘a’-‘c’. Neither is of great importance, but it has been irritating me for a while.
My first target is to get it all working on class ‘a’ – not that I have a class ‘a’ machine to test it on! A glance at the top 500 supercomputers suggests that it is classes ‘i’ and ‘j’ that are actually important at the moment, so these will get particular attention.
If anyone wants it for other purposes, the “mactype” routine will be available for all to use, as I suspect the issues discussed here crop up elsewhere. Not that I’ve actually written it yet – I’ll post again when all this works. Unless unexpected snags arise, this shouldn’t take more than a few days.
In return, if anyone has computers I can use for testing (probably in December) I’d appreciate a logon for a day or so. One urgent need is for an Intel i9 with AVX-512, so I can start work on those routines. The other need is at the other end. I have no access to a VIA 2000 or a VIA 3000, or to the older Intel and AMD machines from 2003-2010 (Core 2, Athlon, Merom, Nehalem, Westmere etc.), and I’d like to run the regression tests, and probably some performance tests as well, to check that I’ve got things right (and to fix them if I haven’t).