I have shown how to apply specialization to problems in media processing. The ideas have been implemented, and the benchmarks show the potential to let programmers write and type-check very general programs, then create specialized versions comparable to hand-crafted C programs.
The fundamental idea is that semantics-based compiler generation is a portable, easy-to-use interface to run-time code generation. This improves on performing traditional macro-expansion at run time in three ways. First, a program can be written and debugged with specialization disabled, that is, without RTCG. Second, variable-capture and quoting errors are impossible to make. Third, the code generators can themselves be specialized, resulting in low-overhead RTCG.
Furthermore, I proposed introducing knowledge of the linear-algebraic properties of integers into the specializer, instead of treating them as atoms. The programmer can write high-level specifications of loops and generate efficient implementations with confidence that the system will preserve the semantics of the code. By making aliasing and alignment static, many of the operations normally performed by a hardware cache at run time can instead be done at code-generation time.
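To make the cyclic-value idea concrete, here is a minimal sketch (in Python for concreteness; the class and method names are my own invention, not an API of the specializer): a value known statically only up to a residue modulo some modulus still supports arithmetic with exact residue tracking, which is what lets alignment questions be settled at code-generation time.

```python
from math import gcd

class Cyclic:
    """A value known statically only as (residue, modulus): the concrete
    integer is dynamic, but value % modulus == residue is guaranteed.
    Arithmetic propagates residues, so alignment is decidable statically."""
    def __init__(self, residue, modulus):
        self.modulus = modulus
        self.residue = residue % modulus

    def add(self, other):
        # v1 + v2 is determined modulo gcd(m1, m2) by the two residues
        m = gcd(self.modulus, other.modulus)
        return Cyclic(self.residue + other.residue, m)

    def add_const(self, k):
        # adding a static constant preserves the modulus
        return Cyclic(self.residue + k, self.modulus)

    def mul_const(self, k):
        # v = q*m + r  implies  v*k = q*(m*k) + r*k
        return Cyclic(self.residue * k, self.modulus * k)

byte_ptr = Cyclic(0, 1)           # an arbitrary integer: 0 mod 1
bit_addr = byte_ptr.mul_const(8)  # the byte pointer as a bit address
# bit_addr is zero mod eight, without knowing the pointer itself
```

The last two lines restate the observation below about byte pointers: converting any byte pointer to a bit address yields a cyclic value zero mod eight.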
Of course, problems remain. The next section describes the difficulties I encountered as a programmer while building the benchmark examples. I then take a step back, and assess the current usability of the systems, and the prospects for improvement.
Sometimes specializing programs with cyclic values produces excessive special cases. Say I want to make a routine F that manipulates an arbitrary byte-vector in memory. As a bit-address, a byte-pointer is a cyclic value zero mod eight. So I write

  fun sumb (p, q) =
    let val s = vector_signal(set_base(p,32), set_base(q,32), 8, 8)
    in reduce(s, plus, 0) end

and indeed F does exactly what I want, and is as fast as I expect. But it unnecessarily contains four copies of the inner loop, one for each possible alignment of the terminating pointer (q). A set_base of just the initial pointer results in one copy of the inner loop. I should be able to get the smaller code by writing
  fun sumb (p, q) =
    let val evend = q - (q % 32)
        val s0 = vector_signal(set_base(p,32), evend, 8, 8)
        val s1 = vector_signal(evend, set_base(q,32), 8, 8)
    in reduce(s0, op_plus, reduce(s1, op_plus, 0)) end
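The effect of that rewrite can be mimicked in ordinary code (a Python sketch of the same decomposition; `mem` as a plain list is my stand-in for memory): split the range at the last multiple of 32 below q, so the main loop terminates at a known alignment and needs only one specialized variant, while a short tail covers the remainder.

```python
def sum_bytes(mem, p, q):
    """Sum mem[p:q] by splitting at evend = q - (q % 32), as in the
    rewritten sumb: the main loop ends at an aligned address (one code
    variant suffices), and the tail has at most 31 iterations."""
    evend = q - (q % 32)   # round q down to an aligned address
    total = 0
    i = p
    while i < evend:       # main loop: terminating bound is aligned
        total += mem[i]
        i += 1
    while i < q:           # unaligned tail
        total += mem[i]
        i += 1
    return total
```

If q is already below the first aligned address, the main loop simply does not run and the tail does all the work.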
Nitrous frequently fails to terminate because it tries to generate either infinite or exponential quantities of code. This is not surprising considering that it was designed to err on the side of propagating too much information. For example, because of the code duplication in dynamic loops (see Section ss), nesting loops results in code whose size is exponential in the levels of nesting.
A more interesting source of code explosion is the cache. The inclusion of partially-dirty cache lines in particular may produce a large number of static states. Irregular patterns of writes can create this (consider Figure ipw). Even if the size and stride of the signal are simple (say both are eight), the masks in the cache still suffer from exponential explosion, because each byte could be either clean or dirty. One way to reduce the expansion is to flush the cache, returning it to a known state. This is like defining a procedure-calling convention for the cache.
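The size of that state space is easy to quantify (a back-of-the-envelope Python sketch; the per-byte dirty-mask model follows the text, the code itself is mine): with per-byte dirtiness on an eight-byte line, every subset of bytes is a reachable mask, so the specializer can encounter up to 2^8 distinct static states.

```python
from itertools import combinations

def mask_of(offsets):
    """Dirty mask of an 8-byte cache line after writes at 'offsets'."""
    m = 0
    for off in offsets:
        m |= 1 << off
    return m

# Every subset of the 8 bytes is produced by some write pattern, so a
# specializer tracking per-byte dirtiness can see 2**8 = 256 distinct
# static states -- and potentially emit that many code variants.
reachable = {mask_of(c) for n in range(9) for c in combinations(range(8), n)}

# Flushing writes the line back and resets the mask to the single
# all-clean state, like a calling convention for the cache.
after_flush = 0
```

With n cache lines the state count multiplies, which is why flushing at known points matters.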
  fun make_write_head p stride size =
    fn s_put   => (fn v => store_sample p size v)
     | s_shift => make_write_head (p + d*stride) stride size
Section irregular presents an example of infinite specialization resulting from too much inlining. The need for the dcall annotation was not anticipated.
An unexpected problem results from the combination of Simple's restriction of dynamic control flow to loops and sharing. The code in Figure sls (in the style of Appendix signals) exhibits the problem. The two dynamic values in two_counters (two instances of (lift 0)) are equal and shared. If two_counters is passed to reduce, then after the first iteration the dynamic values diverge, so the sharing is lost. But in Simple, the static state at the top and bottom of a loop must match.
  fun make_counter from by =
    fn s_end  => true
     | s_next => make_counter (from+by) by
     | s_get  => from

  fun count_by by = make_counter (lift 0) by

  val two_counters = map2 op_plus (count_by 2) (count_by 3)
Hence evens will fail.
In summary, none of the implementations of bit-addressing is currently practical. Nitrous is too slow, and Simple is too restricted. The Similix implementation is promising because its specializers, like Simple's, run in fractions of a second, yet it avoids Simple's restrictions. None of these systems has a backend fast enough to generate code at run time. But I believe that applying standard engineering practice to the ideas of this thesis would yield systems useful to expert programmers. I now outline such systems and the difficulties they will face.
Modifying an existing second-generation polyvariant specializer like PGG (or its ML equivalent, if available) is a natural choice. Cyclic integers are easy to add, though more cases than are shown here may prove useful. Sharing and conservative equality appear to pose no difficulty, though the global-state traversal in late normalization might be a problem. The purpose of late normalization is to make the early equality smarter at very little code-generation-time expense, so it may be pointless if code generation is slow anyway. The output would be compiled with an ordinary Scheme (or ML) compiler. The advantage of this approach is leveraging existing frontends and backends; the disadvantage would be relatively slow code generation (high ) and slow residual code (not as fast as GCC).
Alternatively, a new specializer (frontend) could be developed. The system could work with an existing backend that has a strongly typed intermediate language and fast code generation (e.g. the JVM), or a new backend could be derived from an existing system (e.g. TIL2 modified to reduce ).
Both of these approaches would probably solve the basic problems of Nitrous (slow execution and excessive code duplication) and Simple (restricted language). But any system based on partial evaluation has a basic problem: debugging the binding times. One mechanism that might help is assertions (such as ``this variable should be static''), where the error message in case of violation includes the cascade of guilty variables. Manual placement of some lifts and other tweaks to adjust binding times and inlining seems inevitable. However, I suspect that using profile data from execution to guide specialization could significantly outperform the dynamic-conditional heuristic.
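Such an assertion mechanism might work as follows (purely my sketch of the idea; no existing specializer exposes this interface): track, for every value the binding-time analysis marks dynamic, which input forced it dynamic, so a violated staticness assertion can report the cascade of guilty variables.

```python
class BT:
    """A value in a toy binding-time analysis: static unless some input
    was dynamic, in which case the culprits are remembered."""
    def __init__(self, name, static=True, parents=()):
        self.name, self.static, self.parents = name, static, parents

def combine(name, *args):
    # the result is static only if every input is static
    dyn = tuple(a for a in args if not a.static)
    return BT(name, static=not dyn, parents=dyn)

def blame(v):
    """Chain of variables responsible for v being dynamic."""
    chain = []
    while v is not None and not v.static:
        chain.append(v.name)
        v = v.parents[0] if v.parents else None
    return chain

def assert_static(v):
    if not v.static:
        raise AssertionError("dynamic: " + " <- ".join(blame(v)))

n = BT("n", static=False)   # e.g. read from input at run time
k = BT("k")                 # a static constant
m = combine("m", k, n)      # m inherits n's dynamism
```

Here `assert_static(m)` would fail with the message "dynamic: m <- n", pointing the programmer at the variable that started the cascade.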
This leaves us with the intrinsic problems of implementing bit-addressing as described here (see rgbm1 and rgbm2 in Figure assumptions).
Perhaps the top-level procedure returned by the specializer should not have its redundant arguments removed, thus permitting a typable external interface.
Until more experience with better implementations of bit-addressing is collected, these plans and analyses will remain rather speculative.