July 10, 2015 Derek Jones No comments

Optimizing compilers include a list of common code patterns and equivalent, faster executing, patterns that they can be mapped to (traditionally the role of peephole optimization, which in many early compilers was the only optimization).

The problem with this human written table approach is that it is fixed in advance and depends on the person adding the entries knowing lots of very obscure stuff.

An alternative approach is the use of a superoptimizer, these take a code sequence and find the functionally equivalent sequence that executes the fastest. However, superoptimizing interesting sequences often take a very long time and so use tends to be restricted to helping figure out, during compiler or library development, what to do for sequences that are known to occur a lot and be time critical.

A new paper, Optgen: A Generator for Local Optimizations, does not restrict itself to sequences deemed to be common or interesting, it finds the optimal sequence for all sequences by iterating through all permutations of operands and operators. Obviously this will take awhile and the results in the paper cover expressions containing two unknown operands + known constants, and expressions containing three and four operands that are not constants; the operators searched were: unary ~ and - and binary +, &, |, - and ~.

The following 63 optimization patterns involving two variables (x and y)+constants (c0, c1, c2) were found, with the right column listing whether LLVM, GCC or the Intel compiler supported the optimization or not.

  optional-precondition => before  ->  after		L G I
 -~x -> x + 1						+ + -
 -(x & 0x80000000) -> x & 0x80000000			- + -
 ~-x -> x - 1						+ + -
 x +~x -> 0xFFFFFFFF					+ + -
 x + (x & 0x80000000) -> x & 0x7FFFFFFF			- - -
 (x | 0x80000000) + 0x80000000 -> x & 0x7FFFFFFF	+ - -
 (x & 0x7FFFFFFF) + (x & 0x7FFFFFFF) -> x + x		+ + -
 (x & 0x80000000) + (x & 0x80000000) -> 0		+ + -
 (x | 0x7FFFFFFF) + (x | 0x7FFFFFFF) -> 0xFFFFFFFE	+ + -
 (x | 0x80000000) + (x | 0x80000000) -> x + x		+ + -
 x & (x + 0x80000000) -> x & 0x7FFFFFFF			+ - -
 x & (x | y) -> x					+ + -
 x & (0x7FFFFFFF - x) -> x & 0x80000000			- - -
 -x & 1 -> x & 1					- + -
 (x + x)& 1 -> 0					+ + -
 is_power_of_2(c1) && c0 & (2 * c1 - 1) == c1 - 1
=> (c0 - x) & c1 -> x & c1				- - -
 x | (x + 0x80000000) -> x | 0x80000000			+ - -
 x | (x & y) -> x					+ + -
 x | (0x7FFFFFFF - x) -> x | 0x7FFFFFFF			- - -
 x | (x^y) -> x | y					+ - -
 ((c0 | -c0) &~c1) == 0 ) (x + c0) | c1!x | c1		+ - +
 is_power_of_2(~c1) && c0 & (2 *~c1 - 1) ==~c1 - 1
=> (c0 - x) | c1 -> x | c1				- - -
 -x | 0xFFFFFFFE -> x | 0xFFFFFFFE			- - -
 (x + x) | 0xFFFFFFFE -> 0xFFFFFFFE			+ + -
 0 - (x & 0x80000000) -> x & 0x80000000			- + -
 0x7FFFFFFF - (x & 0x80000000) -> x | 0x7FFFFFFF	- - -
 0x7FFFFFFF - (x | 0x7FFFFFFF) -> x & 0x80000000	- - -
 0xFFFFFFFE - (x | 0x7FFFFFFF) -> x | 0x7FFFFFFF	- - -
 (x & 0x7FFFFFFF) - x -> x & 0x80000000			- - -
 x^(x + 0x80000000) -> 0x80000000			+ - -
 x^(0x7FFFFFFF - x) -> 0x7FFFFFFF			- - -
 (x + 0x7FFFFFFF)^0x7FFFFFFF -> -x			- - -
 (x + 0x80000000)^0x7FFFFFFF -> ~x			+ + -
 -x^0x80000000 -> 0x80000000 - x			- - -
 (0x7FFFFFFF - x)^0x80000000 -> ~x			- + -
 (0x80000000 - x)^0x80000000 -> -x			- + -
 (x + 0xFFFFFFFF)^0xFFFFFFFF -> -x			+ + -
 (x + 0x80000000)^0x80000000 -> x			+ + -
 (0x7FFFFFFF - x)^0x7FFFFFFF -> x			- - -
 x - (x & c) -> x &~c					+ + -
 x^(x & c) -> x &~c					+ + -
 ~x + c -> (c - 1) - x					+ + -
 ~(x + c) -> ~c - x					+ - -
 -(x + c) -> -c - x					+ + -
 c -~x -> x + (c + 1)					+ + -
 ~x^c -> x^~c						+ + -
 ~x - c -> ~c - x					+ + -
 -x^0x7FFFFFFF -> x + 0x7FFFFFFF			- - -
 -x^0xFFFFFFFF -> x - 1					+ + -
 x & (x^c) -> x &~c					+ + -
 -x - c -> -c - x					+ + -
 (x | c) - c -> x &~c					- - -
 (x | c)^c -> x &~c					+ + -
 ~(c - x) -> x +~c					+ - -
 ~(x^c) -> x^~c						+ + -
 ~c0 == c1 => (x & c0)^c1 -> x | c1			+ + -
 -c0 == c1 => (x | c0) + c1 -> x &~c1			- - -
 (x^c) + 0x80000000 -> x^(c + 0x80000000)		+ + -
 ((c0 | -c0) & c1) == 0 => (x^c0) & c1!x & c1		+ + -
 (c0 &~c1) == 0 => (x^c0) | c1!x | c1			+ - -
 (x^c) - 0x80000000 -> x^(c + 0x80000000)		+ + -
 0x7FFFFFFF - (x^c) -> x^(0x7FFFFFFF - c)		- - -
 0xFFFFFFFF - (x^c) -> x^(0xFFFFFFFF - c)		+ + -

optional-precondition => before -> after L G I -~x -> x + 1 + + - -(x & 0x80000000) -> x & 0x80000000 - + - ~-x -> x - 1 + + - x +~x -> 0xFFFFFFFF + + - x + (x & 0x80000000) -> x & 0x7FFFFFFF - - - (x | 0x80000000) + 0x80000000 -> x & 0x7FFFFFFF + - - (x & 0x7FFFFFFF) + (x & 0x7FFFFFFF) -> x + x + + - (x & 0x80000000) + (x & 0x80000000) -> 0 + + - (x | 0x7FFFFFFF) + (x | 0x7FFFFFFF) -> 0xFFFFFFFE + + - (x | 0x80000000) + (x | 0x80000000) -> x + x + + - x & (x + 0x80000000) -> x & 0x7FFFFFFF + - - x & (x | y) -> x + + - x & (0x7FFFFFFF - x) -> x & 0x80000000 - - - -x & 1 -> x & 1 - + - (x + x)& 1 -> 0 + + - is_power_of_2(c1) && c0 & (2 * c1 - 1) == c1 - 1 => (c0 - x) & c1 -> x & c1 - - - x | (x + 0x80000000) -> x | 0x80000000 + - - x | (x & y) -> x + + - x | (0x7FFFFFFF - x) -> x | 0x7FFFFFFF - - - x | (x^y) -> x | y + - - ((c0 | -c0) &~c1) == 0 ) (x + c0) | c1!x | c1 + - + is_power_of_2(~c1) && c0 & (2 *~c1 - 1) ==~c1 - 1 => (c0 - x) | c1 -> x | c1 - - - -x | 0xFFFFFFFE -> x | 0xFFFFFFFE - - - (x + x) | 0xFFFFFFFE -> 0xFFFFFFFE + + - 0 - (x & 0x80000000) -> x & 0x80000000 - + - 0x7FFFFFFF - (x & 0x80000000) -> x | 0x7FFFFFFF - - - 0x7FFFFFFF - (x | 0x7FFFFFFF) -> x & 0x80000000 - - - 0xFFFFFFFE - (x | 0x7FFFFFFF) -> x | 0x7FFFFFFF - - - (x & 0x7FFFFFFF) - x -> x & 0x80000000 - - - x^(x + 0x80000000) -> 0x80000000 + - - x^(0x7FFFFFFF - x) -> 0x7FFFFFFF - - - (x + 0x7FFFFFFF)^0x7FFFFFFF -> -x - - - (x + 0x80000000)^0x7FFFFFFF -> ~x + + - -x^0x80000000 -> 0x80000000 - x - - - (0x7FFFFFFF - x)^0x80000000 -> ~x - + - (0x80000000 - x)^0x80000000 -> -x - + - (x + 0xFFFFFFFF)^0xFFFFFFFF -> -x + + - (x + 0x80000000)^0x80000000 -> x + + - (0x7FFFFFFF - x)^0x7FFFFFFF -> x - - - x - (x & c) -> x &~c + + - x^(x & c) -> x &~c + + - ~x + c -> (c - 1) - x + + - ~(x + c) -> ~c - x + - - -(x + c) -> -c - x + + - c -~x -> x + (c + 1) + + - ~x^c -> x^~c + + - ~x - c -> ~c - x + + - -x^0x7FFFFFFF -> x + 0x7FFFFFFF - - - -x^0xFFFFFFFF -> x - 1 + + - x & (x^c) -> x &~c + + - -x - c -> -c - x + + - (x | c) - c -> x &~c - - - (x | c)^c -> x &~c + + - ~(c - x) -> x +~c + - - ~(x^c) -> x^~c + + - ~c0 == c1 => (x & c0)^c1 -> x | c1 + + - -c0 == c1 => (x | c0) + c1 -> x &~c1 - - - (x^c) + 0x80000000 -> x^(c + 0x80000000) + + - ((c0 | -c0) & c1) == 0 => (x^c0) & c1!x & c1 + + - (c0 &~c1) == 0 => (x^c0) | c1!x | c1 + - - (x^c) - 0x80000000 -> x^(c + 0x80000000) + + - 0x7FFFFFFF - (x^c) -> x^(0x7FFFFFFF - c) - - - 0xFFFFFFFF - (x^c) -> x^(0xFFFFFFFF - c) + + -

The optional-precondition are hand-written rules that constants must obey for a particular expression mapping to apply, e.g., (c0 &~c1) == 0.

Hand written rules using bit-vector (for each bit of a constant the value one, zero or don’t know) analysis of constants is used to prune the search space (otherwise the search of the two operand+constants case would still be running).

LLVM covers 40 of the 63 cases, GCC 36 and Intel just 1 (wow). I suspect that many of the unsupported patterns are rare, but now they are known it is worth implementing them just to be able to say that this particular set of possibilities is completely covered.

Categories: Uncategorized Tags: superoptimizer

Superoptimizers are back in vogue

November 6, 2012 Derek Jones No comments

There has always been the need for a few developers with in-depth knowledge of a particular cpu architecture to sit down and think very hard about how best to implement a snippet of code performing some operation in assembly language, e.g., library implementors wanting the tightest code for a critical inner loop or compiler writers who need to map from intermediate code to machine code.

In 1987 Massalin published his now famous paper that introduced the term Superoptimizer; a program that enumerates all possible combinations of instruction sequences until the shortest/fastest one producing the desired output from the given input is found (various heuristics were used to prune the search space e.g., only considering 15 or so opcodes, and the longest sequence it ever generated contained 12 instructions).

While the idea was widely talked about, it never caught on in practice (a special purpose branch eliminator was produced for GCC; Hacker’s Delight also includes a stand-alone system). Perhaps the guild of mindbogglingly-obtuse-but-fast-instruction-sequences black-balled it (apprentices have to spend several years doing nothing but writing assembly code for their chosen architecture, thinking about how to make it go faster and/or be shorter and only talk to other apprentices/members and communicate with non-converts exclusively about their latest neat sequence), or perhaps it was just a case of not invented here (writing machine code used to be something that even run-of-the-mill developers got to do every now and again), or perhaps it was not considered cost-effective to build a superoptimizer for a given project (I don’t know of anyone offering a generic tool that could be tailored for specific cases) or perhaps developers were happy to just ride the wave of continually faster processors.

It was not until 2008 with Bansal’s thesis that superoptimizer research started to take off (as in paper publication rate increased from once every five years to more than one a year). Bansal found a new market, binary translation i.e., translating the binary of a program built to run on one kind of cpu to run on a different kind of cpu, for instance the Mac 68K emulator.

Bansal and other researchers’ work was oriented towards relatively short instruction sequences. To be really useful, some way of handling longer sequences was needed.

A few days ago Stochastic Superoptimization arrived on the scene (or rather a paper describing it became available for download). Schkufza, Sharma and Aiken use Markov chain Monte Carlo methods to sample the possible instruction sequences rather than generating all of them. The paper gives a 116 instruction example from which the author’s tool removed 16 lines to produce code that went 1.6 times faster (only 30 ‘core’ instructions were given in paper); what is also very interesting is that the tool operates on compiler generated output (gcc/llvm), suggesting the usage build program, profile it and then stochastic superoptimize the hot spots.

Markov chains and Monte Carlo methods are trendy topics that researchers like to write about, so we will certainly see more papers in this area.

These days few developers have had hands-on experience with machine code, so the depth of expertise that was once easy to find is now rare, processors have many more weird and wonderful instructions often interacting with older instructions in obscure ways, and the cpu architecture landscape continues to change regularly. The time may have arrived for superoptimizers to be widely used by industry.

Of course, superoptimizers can work at any level of abstraction, including expression trees built directly from some complicated floating-point calculation that needs to be optimized for accuracy or speed.

Categories: Uncategorized Tags: assembly code, binary translation, branch, compiler writer, cpu, floating-point, Markov chain, monte carlo, optimizer, optimizing, stochastic, superoptimizer

The Shape of Code

Archive

Exhuastive superoptimization

Superoptimizers are back in vogue

Recent Posts

Recent Comments

Archives

Meta