I have shown how to apply specialization to problems in media processing. The ideas have been implemented, and the benchmarks show the potential to let programmers write and type-check very general programs, then create specialized versions comparable to hand-crafted C programs.
The fundamental idea is that semantics-based compiler generation is a portable, easy-to-use interface to run-time code generation. This improves on performing traditional macro-expansion at run time in three ways. First, a program can be written and debugged with specialization disabled, that is, without RTCG. Second, variable-capture and quoting errors are impossible to make. Third, the code generators can themselves be specialized, resulting in low-overhead RTCG.
Furthermore, I proposed introducing knowledge of the linear-algebraic properties of integers into the specializer, instead of treating them as atoms. The programmer can write high-level specifications of loops and generate efficient implementations with confidence that the system will preserve the semantics of the code. By making aliasing and alignment static, many of the operations normally performed by a hardware cache at run time can instead be done at code-generation time.
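To make the cyclic-value idea concrete, here is a minimal sketch (in Python for concreteness; the class and method names are my own invention, not an API of the specializer): a value known statically only up to a residue modulo some modulus still supports arithmetic with exact residue tracking, which is what lets alignment questions be settled at code-generation time.

```python
from math import gcd

class Cyclic:
    """A value known statically only as (residue, modulus): the concrete
    integer is dynamic, but value % modulus == residue is guaranteed.
    Arithmetic propagates residues, so alignment is decidable statically."""
    def __init__(self, residue, modulus):
        self.modulus = modulus
        self.residue = residue % modulus

    def add(self, other):
        # v1 + v2 is determined modulo gcd(m1, m2) by the two residues
        m = gcd(self.modulus, other.modulus)
        return Cyclic(self.residue + other.residue, m)

    def add_const(self, k):
        # adding a static constant preserves the modulus
        return Cyclic(self.residue + k, self.modulus)

    def mul_const(self, k):
        # v = q*m + r  implies  v*k = q*(m*k) + r*k
        return Cyclic(self.residue * k, self.modulus * k)

byte_ptr = Cyclic(0, 1)           # an arbitrary integer: 0 mod 1
bit_addr = byte_ptr.mul_const(8)  # the byte pointer as a bit address
# bit_addr is zero mod eight, without knowing the pointer itself
```

The last two lines restate the observation below about byte pointers: converting any byte pointer to a bit address yields a cyclic value zero mod eight.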
Of course, problems remain. The next section describes the difficulties I encountered as a programmer while building the benchmark examples. I then take a step back, and assess the current usability of the systems, and the prospects for improvement.
Sometimes specializing programs with cyclic values produces excessive special cases. Say I want to make a routine F that manipulates an arbitrary byte-vector in memory. As a bit-address, a byte-pointer is a cyclic value zero mod eight. So I write

  fun sumb (p, q) =
    let val s = vector_signal(set_base(p,32), set_base(q,32), 8, 8)
    in reduce(s, plus, 0) end

and indeed F does exactly what I want, and is as fast as I expect. But it unnecessarily contains four copies of the inner loop, one for each possible alignment of the terminating pointer (q). A set_base of just the initial pointer results in one copy of the inner loop. I should be able to get the smaller code by writing
  fun sumb (p, q) =
    let val evend = q - (q % 32)
        val s0 = vector_signal(set_base(p,32), evend, 8, 8)
        val s1 = vector_signal(evend, set_base(q,32), 8, 8)
    in reduce(s0, op_plus, reduce(s1, op_plus, 0)) end
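The effect of that rewrite can be mimicked in ordinary code (a Python sketch of the same decomposition; `mem` as a plain list is my stand-in for memory): split the range at the last multiple of 32 below q, so the main loop terminates at a known alignment and needs only one specialized variant, while a short tail covers the remainder.

```python
def sum_bytes(mem, p, q):
    """Sum mem[p:q] by splitting at evend = q - (q % 32), as in the
    rewritten sumb: the main loop ends at an aligned address (one code
    variant suffices), and the tail has at most 31 iterations."""
    evend = q - (q % 32)   # round q down to an aligned address
    total = 0
    i = p
    while i < evend:       # main loop: terminating bound is aligned
        total += mem[i]
        i += 1
    while i < q:           # unaligned tail
        total += mem[i]
        i += 1
    return total
```

If q is already below the first aligned address, the main loop simply does not run and the tail does all the work.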
Nitrous frequently fails to terminate because it tries to generate either infinite or exponential quantities of code. This is not surprising considering that it was designed to err on the side of propagating too much information. For example, because of the code duplication in dynamic loops (see Section ss), nesting loops results in code whose size is exponential in the levels of nesting.
A more interesting source of code explosion is the cache. The inclusion of partially-dirty cache lines in particular may produce a large number of static states. Irregular patterns of writes can create this (consider Figure ipw). Even if the size and stride of the signal are simple (say both are eight), the masks in the cache still suffer from exponential explosion, because each byte could be either clean or dirty. One way to reduce the expansion is to flush the cache, returning it to a known state. This is like defining a procedure-calling convention for the cache.
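The size of that state space is easy to quantify (a back-of-the-envelope Python sketch; the per-byte dirty-mask model follows the text, the code itself is mine): with per-byte dirtiness on an eight-byte line, every subset of bytes is a reachable mask, so the specializer can encounter up to 2^8 distinct static states.

```python
from itertools import combinations

def mask_of(offsets):
    """Dirty mask of an 8-byte cache line after writes at 'offsets'."""
    m = 0
    for off in offsets:
        m |= 1 << off
    return m

# Every subset of the 8 bytes is produced by some write pattern, so a
# specializer tracking per-byte dirtiness can see 2**8 = 256 distinct
# static states -- and potentially emit that many code variants.
reachable = {mask_of(c) for n in range(9) for c in combinations(range(8), n)}

# Flushing writes the line back and resets the mask to the single
# all-clean state, like a calling convention for the cache.
after_flush = 0
```

With n cache lines the state count multiplies, which is why flushing at known points matters.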
  fun make_write_head p stride size =
    fn s_put   => (fn v => store_sample p size v)
     | s_shift => make_write_head (p + d*stride) stride size
Section irregular presents an example of infinite specialization resulting from too much inlining. The need for the dcall annotation was not anticipated.
An unexpected problem results from the combination of Simple's restriction of dynamic control flow to loops and sharing. The code in Figure sls (in the style of Appendix signals) exhibits the problem. The two dynamic values in two_counters (two instances of (lift 0)) are equal and shared. If two_counters is passed to reduce, then after the first iteration the dynamic values diverge, so the sharing is lost. But in Simple, the static state at the top and bottom of a loop must match.
  fun make_counter from by =
    fn s_end  => true
     | s_next => make_counter (from+by) by
     | s_get  => from

  fun count_by by = make_counter (lift 0) by

  val two_counters = map2 op_plus (count_by 2) (count_by 3)
Hence evens will fail.
In summary, none of the implementations of bit-addressing is currently practical. Nitrous is too slow, and Simple is too restricted. The Similix implementation is promising because its specializers, like Simple's, run in fractions of a second, yet it avoids Simple's restrictions. None of these systems has a backend fast enough to generate code at run time. But I believe that applying standard engineering practice to the ideas of this thesis would yield systems useful to expert programmers. I now outline such systems and the difficulties they will face.
Modifying an existing second-generation polyvariant specializer like PGG (or its ML equivalent, if available) is a natural choice. Cyclic integers are easy to add, though more cases than are shown here may prove useful. Sharing and conservative equality appear to pose no difficulty, though the global-state traversal in late normalization might be a problem. The purpose of late normalization is to make the early equality smarter at very little code-generation-time expense, so it may be pointless if code generation is slow anyway. The output would be compiled with an ordinary Scheme (or ML) compiler. The advantage of this approach is leveraging existing frontends and backends; the disadvantage would be relatively slow code generation (high ) and slow residual code (not as fast as GCC).
Alternatively, a new specializer (frontend) could be developed. The system could work with an existing backend that has a strongly typed intermediate language and fast code generation (e.g. the JVM), or a new backend could be derived from an existing system (e.g. TIL2 modified to reduce ).
Both of these approaches would probably solve the basic problems of Nitrous (slow execution and excessive code duplication) and Simple (restricted language). But any system based on partial evaluation has a basic problem: debugging the binding times. One mechanism that might help is assertions (such as ``this variable should be static''), where the error message in case of violation includes the cascade of guilty variables. Manual placement of some lifts and other tweaks to adjust binding times and inlining seems inevitable. However, I suspect that using profile data from execution to guide specialization could significantly outperform the dynamic-conditional heuristic.
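Such an assertion mechanism might work as follows (purely my sketch of the idea; no existing specializer exposes this interface): track, for every value the binding-time analysis marks dynamic, which input forced it dynamic, so a violated staticness assertion can report the cascade of guilty variables.

```python
class BT:
    """A value in a toy binding-time analysis: static unless some input
    was dynamic, in which case the culprits are remembered."""
    def __init__(self, name, static=True, parents=()):
        self.name, self.static, self.parents = name, static, parents

def combine(name, *args):
    # the result is static only if every input is static
    dyn = tuple(a for a in args if not a.static)
    return BT(name, static=not dyn, parents=dyn)

def blame(v):
    """Chain of variables responsible for v being dynamic."""
    chain = []
    while v is not None and not v.static:
        chain.append(v.name)
        v = v.parents[0] if v.parents else None
    return chain

def assert_static(v):
    if not v.static:
        raise AssertionError("dynamic: " + " <- ".join(blame(v)))

n = BT("n", static=False)   # e.g. read from input at run time
k = BT("k")                 # a static constant
m = combine("m", k, n)      # m inherits n's dynamism
```

Here `assert_static(m)` would fail with the message "dynamic: m <- n", pointing the programmer at the variable that started the cascade.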
This leaves us with the intrinsic problems of implementing bit-addressing as described here (see rgbm1 and rgbm2 in Figure assumptions).
Perhaps the top-level procedure returned by the specializer should not have its redundant arguments removed, thus permitting a typable external interface.
Until more experience with better implementations of bit-addressing is collected, these plans and analyses will remain rather speculative.