Buffering and APL

APL is a programming language for numerical matrix work [APL]; it is the original collection oriented language [SiBle91]. This section examines the use of buffering (a term from operating systems) to implement such languages on uniprocessors with a memory hierarchy. Buffering (sometimes also called batching) is a very effective and widely used programming technique, but also has its limits: latency suffers, sometimes performance suffers, dynamic conditionals induce copying and increase overhead. There is no known way to effectively handle dynamic jumps, or higher order scans, reduces, and permutations.

If a program is applied to many data:

vector_eval(struct program, vector v) {
  for (i = 0; i < max(v); i++)
     scalar_eval(program, v, i);
}

scalar_eval(struct program, scalar v) {
  switch(program) {
    case p_prim:
     prim(v);
  }
}

then the control flow for the interpreter and the loop over the data may be interchanged. This is buffered:

vector_eval(struct program, vector v) {
  switch(program) {
    case p_prim:
     for (i = 0; i < max(v); i++)
        prim(v, i);
    ...
  }
}

Rather than loop over all the data, it can be advantageous to limit the loop length (block size) so that the total temporary memory required fits within the (or a) cache. Renderman and Nyquist [Nyquist] do this. This is called blocking or tiling [WoLa91]. Note that as the temporary space required increases, the vector length must go down to remain in the cache, though this probably only makes a difference in extreme cases.

There are, however, some drawbacks to buffering. Latency suffers because a complete block must be processed before the first result is available. Latency is particularly important in interactive music applications. Throughput may suffer from increased cache and memory traffic or from an inability to effectively use multiple functional units.

Dynamic control constructs are very difficult to handle while buffering. Conditionals can be handled either with mask vectors or by copying and collapsing. I don't know any reasonable way to handle dynamic jumps, or higher-order scans, reduces, and permutations.

With RTCG the static control flow of the inner interpreter is eliminated. The resulting loop may be better optimized. Memory references for temporaries in the program can be replaced with register references. Tiling is still useful, though not as often.

See [ThoDa95] for an analysis of buffering vs compilation in the context of audio signal synthesis.