Dithering makes a good example of how cogen can make an abstract
implementation run fast. This section sketches a procedure that
dithers a 24-bit rgb image down to a color cube of any size. The rgb
image's channels may have any linear organization. A software cache
converts loop code with byte loads into code with word loads, shifts,
and masks. It works by keeping the low two bits of a loop index
static. The ditherer is built from the following components: a 1D
linear looper, the software cache, a hash-noise dither, and a byte
permuter.
Why RTCG: The size of the destination color cube depends on how many unallocated colors are available in the X server, and thus isn't known until runtime. Furthermore, the user may want to display images from different sources with different channel layouts.
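To make the runtime dependence concrete, here is a minimal sketch of quantizing one pixel into a color cube whose side length is only known at runtime. The names `hash_noise`, `dither_channel`, and `cube_index`, the particular integer hash, and the row-major cube indexing are all assumptions made for illustration; the source does not specify them.

```python
def hash_noise(x, y):
    """Hypothetical integer hash of a pixel position, mapped to [0, 1)."""
    h = (x * 374761393 + y * 668265263) & 0xFFFFFFFF
    h = ((h ^ (h >> 13)) * 1274126177) & 0xFFFFFFFF
    h ^= h >> 16
    return h / 2**32


def dither_channel(v, levels, x, y):
    """Quantize v in 0..255 to a level in 0..levels-1, rounding up at random."""
    scaled = v * (levels - 1) / 255.0
    base = int(scaled)
    frac = scaled - base
    # Round up with probability equal to the fractional part.
    bump = 1 if frac > hash_noise(x, y) else 0
    return min(base + bump, levels - 1)


def cube_index(r, g, b, levels, x, y):
    """Index into a levels x levels x levels color cube (assumed row-major)."""
    lr = dither_channel(r, levels, x, y)
    lg = dither_channel(g, levels, x, y)
    lb = dither_channel(b, levels, x, y)
    return (lr * levels + lg) * levels + lb
```

Here `levels` plays the role of the cube size that the X server determines; a code generator would treat it as static once it is known and fold it into the residual loop.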
In this section, the partially static integers and the operations on them are divided manually. This is a new [I've never seen it, have you?] form of binding-time improvement. It seems likely that applying the idea behind partially static structures to partially static bit-fields could automate this division. Eventually an analogue of variable splitting might be used to further optimize a number of bit-preserving operations.
If x is a dynamic variable, then x2 is the static low two bits.
; index and offset. one offset per channel.
loop((i, i2), ((o, o2) ...)) =
  if (i2 == max2 and i == max) return;
  loop b = byte-load-cached(heap-p, o, o2);
  f(b, ...);
  loop(plus((i,i2),(0,1)), (plus((o,o2),step2) ...))

plus((a, a2), (b, b2)) =
  e = a2 + b2;
  f = a + b;
  if (e > m) (f + 1, e - m) else (f, e)
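The split-integer addition can be sketched in Python to check that it agrees with ordinary addition. This assumes the low part holds two bits, so the modulus `M` is 4 and a carry occurs whenever the low sum reaches 4; the source leaves `m` undefined, so that reading and the helpers `split` and `join` are assumptions.

```python
M = 4  # assumed modulus: the low part holds two bits, values 0..3


def split(x):
    """Split an integer into (high part, low two bits)."""
    return (x >> 2, x & 3)


def join(pair):
    """Recombine a (high part, low two bits) pair into one integer."""
    hi, lo = pair
    return (hi << 2) | lo


def plus(x, y):
    """Add two split integers, carrying from the low part into the high part."""
    (a, a2), (b, b2) = x, y
    e = a2 + b2  # static in the specializer: both low parts are known
    f = a + b    # dynamic: becomes residual code
    if e >= M:
        return (f + 1, e - M)
    return (f, e)
```

For example, `join(plus(split(7), split(6)))` gives 13. Because the test on `e` depends only on the low parts, a specializer can decide the carry at generation time and emit either `f` or `f + 1`.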
; state: (p0, wo0, w0)
; records previous heap pointer, word offset, byte offset
; when is the state initialized???
byte-load-cached(p, wo, bo) =
  if (p = p0 and wo = wo0)
    w = w0
  else
    w0 = w = word-load(p, wo)
    wo0 = wo
    p0 = p
  b = byte(w, bo)
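A runnable sketch of the cached byte load follows. It assumes little-endian 32-bit words and answers the initialization question in one possible way, by starting with an invalid cache entry; `ByteLoadCache`, `word_load`, and the `loads` counter are hypothetical names added for illustration.

```python
def word_load(heap, wo):
    """Load the 32-bit little-endian word at word offset wo (assumed layout)."""
    return int.from_bytes(heap[4 * wo:4 * wo + 4], "little")


class ByteLoadCache:
    """One-entry cache holding the most recently loaded word."""

    def __init__(self, load=word_load):
        self.load = load
        self.p0 = None   # previous heap pointer; None marks the cache invalid
        self.wo0 = None  # previous word offset
        self.w0 = 0      # cached word
        self.loads = 0   # counts real word loads, for inspection only

    def byte_load(self, p, wo, bo):
        # Cache test: same heap and same word offset as the last load?
        if not (p is self.p0 and wo == self.wo0):
            self.w0 = self.load(p, wo)
            self.p0, self.wo0 = p, wo
            self.loads += 1
        # Extract byte bo with a shift and a mask.
        return (self.w0 >> (8 * bo)) & 0xFF
```

Reading twelve consecutive bytes through the cache, with the index split as `(i >> 2, i & 3)`, performs three word loads rather than twelve byte loads.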
Elimination of the cache test requires variable splitting and
improving the binding time of the eq? function so that
(eq? v v) --> #t.
When the compiler runs, it could produce, e.g.:
loop(i, or, og, ob) =
  w = word-load(heap-p, or);
  r = byte(w, 0); g = byte(w, 1); b = byte(w, 2); f(r, g, b);
  r = byte(w, 3);
  w = word-load(heap-p, or);
  g = byte(w, 0); b = byte(w, 1); f(r, g, b);
  r = byte(w, 2); g = byte(w, 3);
  w = word-load(heap-p, or);
  b = byte(w, 0); f(r, g, b);
  r = byte(w, 1); g = byte(w, 2); b = byte(w, 3); f(r, g, b);
  if (i == max) return;
  loop(i+1, o+1);
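The shape of that residual code can be reproduced by a small generator that keeps the byte position within a word static and decides the cache test at generation time. This is only a sketch under assumptions: the channels are taken to be interleaved, the successive word offsets are written as `o + 0`, `o + 1`, `o + 2`, and `unroll` is a hypothetical name, not the author's compiler.

```python
def unroll(channels=("r", "g", "b"), word_bytes=4):
    """Emit one unrolled loop body covering word_bytes pixels, as text lines."""
    lines = []
    cached_word = None  # static knowledge of which word is currently in w
    for pixel in range(word_bytes):
        for c, name in enumerate(channels):
            byte_index = pixel * len(channels) + c  # fully static
            word, lane = divmod(byte_index, word_bytes)
            if word != cached_word:
                # The cache test is resolved now, so no test is emitted.
                lines.append(f"w = word_load(heap, o + {word})")
                cached_word = word
            lines.append(f"{name} = byte(w, {lane})")
        lines.append(f"f({', '.join(channels)})")
    return lines
```

Calling `print("\n".join(unroll()))` lists three word loads, twelve byte extractions, and four calls to `f`. Passing `("r", "g", "b", "a")` yields the rgba variant from the same generator.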
Similar optimizations include allowing the user to write a one-channel filter function, and then applying it to grayscale, rgb, and rgba images without loss of performance. Applying such a function to an 8-bit indexed image is the next step.