<a name="dither"><h4> Dithering and Per-pixel Ops</h4></a>
<p>
Dithering makes a good example of how <code>cogen</code> can make an abstract
implementation run fast.  This section sketches a procedure that
dithers a 24-bit rgb image down to a color cube of any size.  The
image's channels may have any linear organization.  A software cache
converts loop code with byte loads into code with word loads, shifts,
and masks.  It works by keeping the low two bits of the loop index
static.  The ditherer is built from the following components: a 1D
linear looper, the software cache, a hash-noise dither, and a byte
permuter.
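<p>
As a rough illustration of the byte-load-to-word-load rewrite, here is
a minimal Python sketch.  It assumes 32-bit little-endian words; the
names <code>word_load</code>, <code>byte_of</code>, and
<code>byte_load</code> are hypothetical and not part of
<code>cogen</code>.

```python
import struct

def word_load(buf, wo):
    """Load the 32-bit little-endian word at word offset wo (assumed layout)."""
    return struct.unpack_from("<I", buf, 4 * wo)[0]

def byte_of(w, k):
    """Extract byte k (0..3) of word w with a shift and a mask."""
    return (w >> (8 * k)) & 0xFF

def byte_load(buf, offset):
    """A byte load expressed as a word load plus a shift and a mask.

    offset >> 2 is the word offset; offset & 3 is the low two bits,
    which select the byte within the word.
    """
    return byte_of(word_load(buf, offset >> 2), offset & 3)

buf = bytes(range(16))
assert all(byte_load(buf, i) == buf[i] for i in range(16))
```

When the low two bits are known at code-generation time, the shift
amount becomes a constant and consecutive loads from the same word can
share one <code>word_load</code>.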
<p>
Why RTCG: the size of the destination color cube depends on how many
unallocated colors are available in the X server, and thus isn't
known until runtime.  Furthermore, the user may want to display
images from different sources with different channel layouts.
<p>
In this section, the partially static integers and the operations on
them are divided into static and dynamic parts manually.  This appears
to be a new form of binding-time improvement.  It seems likely that
applying the idea behind partially static structures to partially
static bit-fields could automate this division.  Eventually an
analogue of variable splitting might be used to further optimize a
number of bit-preserving operations.
<p>
Each partially static integer is written as a pair <code>(x, x2)</code>: <code>x</code> is the dynamic part, and <code>x2</code> holds the static low two bits.
<p>
<ul><pre>; index and offset.  one offset per channel.
loop((i, i2), ((o, o2) ...)) =
  if (i2 == max2 and
      i == max)
    return;
  for each channel (o, o2)
    b = byte-load-cached(heap-p, o, o2);
  f(b, ...);
  loop(plus((i,i2),(0,1)), (plus((o,o2),step2) ...))
<p>
; m is the number of bytes per word (4), the range of the low two bits
plus((a, a2), (b, b2)) =
  e = a2 + b2;
  f = a + b;
  if (e &gt;= m)
    (f + 1, e - m)
  else
    (f, e)
<p>
; state: (p0, wo0, w0)
; records the previous heap pointer, word offset, and loaded word
; when is the state initialized???
byte-load-cached(p, wo, bo) =
  if (p == p0 and wo == wo0)
    w = w0
  else
    w0 = w = word-load(p, wo)
    wo0 = wo
    p0 = p
  byte(w, bo)
</pre></ul>
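<p>
The pseudocode above can be approximated by the following runnable
Python sketch.  It is an interpretation under stated assumptions, not
the original implementation: words are 32-bit little-endian, the heap
pointer is dropped because there is a single buffer, and the cache
starts empty (one possible answer to the initialization question).  The
names <code>ByteLoadCache</code> and <code>read_bytes</code> are
hypothetical.

```python
import struct

BYTES_PER_WORD = 4  # assumed 32-bit words, so the low two bits index a byte

def plus(a, b):
    """Add two split integers (high, low2), carrying out of the low bits."""
    high = a[0] + b[0]
    low = a[1] + b[1]
    if low >= BYTES_PER_WORD:
        return (high + 1, low - BYTES_PER_WORD)
    return (high, low)

class ByteLoadCache:
    """One-word software cache in front of word loads."""

    def __init__(self, buf):
        self.buf = buf
        self.wo0 = None       # word offset of the cached word (None: empty)
        self.w0 = 0           # the cached word itself
        self.word_loads = 0   # counts real word loads, for illustration

    def byte_load(self, wo, bo):
        if wo != self.wo0:
            self.w0 = struct.unpack_from("<I", self.buf, 4 * wo)[0]
            self.wo0 = wo
            self.word_loads += 1
        return (self.w0 >> (8 * bo)) & 0xFF

def read_bytes(buf, count):
    """Read `count` consecutive bytes using a split (word, byte) index."""
    cache = ByteLoadCache(buf)
    pos = (0, 0)
    out = []
    for _ in range(count):
        out.append(cache.byte_load(*pos))
        pos = plus(pos, (0, 1))
    return out, cache.word_loads

data = bytes(range(12))   # four packed rgb pixels
values, loads = read_bytes(data, 12)
assert values == list(data) and loads == 3
```

Here the cache test still runs on every access; the point of
specialization is to decide it at code-generation time instead.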
<p>
Eliminating the cache test requires variable splitting and
improving the binding time of the <code>eq?</code> function so that
<code>(eq? v v)</code> reduces to <code>#t</code> even when <code>v</code> is dynamic.
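<p>
One way to picture the improved <code>eq?</code> is a specialization-time
equality that answers statically whenever both operands are the same
dynamic variable.  The sketch below is a toy model in Python, not
<code>cogen</code>'s mechanism; <code>Dyn</code> and <code>pe_eq</code>
are hypothetical names.

```python
class Dyn:
    """A dynamic (run-time) value, known at specialization time only by name."""

    def __init__(self, name):
        self.name = name

def pe_eq(a, b):
    """Specialize an equality test.

    Returns a Python bool when the answer is known statically,
    otherwise a residual expression string for the generated code.
    """
    a_dyn, b_dyn = isinstance(a, Dyn), isinstance(b, Dyn)
    if not a_dyn and not b_dyn:
        return a == b        # both static: decide now
    if a is b:
        return True          # same dynamic variable: (eq? v v) --> #t
    left = a.name if a_dyn else repr(a)
    right = b.name if b_dyn else repr(b)
    return f"({left} == {right})"

wo = Dyn("wo")
assert pe_eq(wo, wo) is True
assert pe_eq(wo, Dyn("wo0")) == "(wo == wo0)"
assert pe_eq(3, 3) is True
```

With such a rule, a cache test that compares a word offset against
itself disappears from the residual code, leaving only the word loads
that are actually needed.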
<p>
When the compiler runs, it could produce, e.g.:
<ul><pre>; four packed rgb pixels (12 bytes) span three words
loop(i, o) =
  w = word-load(heap-p, o);
  r = byte(w, 0);
  g = byte(w, 1);
  b = byte(w, 2);
  f(r, g, b);
  r = byte(w, 3);
  w = word-load(heap-p, o+1);
  g = byte(w, 0);
  b = byte(w, 1);
  f(r, g, b);
  r = byte(w, 2);
  g = byte(w, 3);
  w = word-load(heap-p, o+2);
  b = byte(w, 0);
  f(r, g, b);
  r = byte(w, 1);
  g = byte(w, 2);
  b = byte(w, 3);
  f(r, g, b);

  if (i == max) return;
<p>
  loop(i+1, o+3);
</pre></ul>
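<p>
To make the code-generation step concrete, the following Python sketch
emits an unrolled loop body of this shape as source text and then runs
it.  It is only a stand-in for what a generating extension might do:
it assumes packed, interleaved channels and 32-bit little-endian words,
and <code>gen_unrolled</code> and <code>body</code> are hypothetical
names.

```python
import struct

def gen_unrolled(channels=3, bytes_per_word=4):
    """Emit Python source for one unrolled loop body.

    The byte position within a word is static, so each byte access
    becomes a fixed shift, and a word load is emitted only when the
    static position crosses a word boundary.
    """
    lines = ["def body(buf, o, f):"]
    names = ["c%d" % k for k in range(channels)]
    loaded = None                      # static: which word is held in `w`
    pos = 0                            # static byte position from word o
    for _ in range(bytes_per_word):    # pixels per unrolled body
        for name in names:
            word, byte = divmod(pos, bytes_per_word)
            if word != loaded:
                lines.append(
                    "    w = struct.unpack_from('<I', buf, 4 * (o + %d))[0]"
                    % word)
                loaded = word
            lines.append("    %s = (w >> %d) & 0xFF" % (name, 8 * byte))
            pos += 1
        lines.append("    f(%s)" % ", ".join(names))
    return "\n".join(lines)

source = gen_unrolled()
namespace = {"struct": struct}
exec(source, namespace)

pixels = []
namespace["body"](bytes(range(12)), 0,
                  lambda r, g, b: pixels.append((r, g, b)))
assert pixels == [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]
assert source.count("unpack_from") == 3
```

The cache test never appears in the emitted source because the
generator tracks the loaded word statically, which mirrors the effect
of the binding-time improvement described above.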
<p>
Similar optimizations include allowing the user to write a one-channel
filter function, and then applying it to grayscale, rgb, and rgba
images without loss of performance.  Applying such a function to an
8-bit indexed image is the next step.