gives 8x (!) speed up on Haswell laptop vs. standard CG for 8 RHS solves. 166 iterations vs. 537 iterations so algorithmic gain + 2x in flop rate gain. Better than a slap in the face with a wet kipper.