Monday, June 25, 2007

The joy of unhygienic macros

When working on the Unicode Collation Algorithm (alphabetical order; I'm still nowhere near done) I needed an efficient way to store a bunch of data in one fixnum--a bitfield. I needed this to hold the weights of various Unicode code points without taking an obscene amount of memory. For an introduction to this concept of bitfields in C/C++, see Wikipedia's article on them.

I could have done this the simple way, by manually defining a constructor which does all the left shifts and bitors, and accessors with right shifts and bitands. But I chose to do it a different way: since similar situations come up in other places (UTF-8/16 encoding and decoding, the assemblers of various platforms, and in the FFI), why not make a general library for it? In its current, incomplete state, it's in my repos. As I later found out, one already existed in the core, in the vocabulary math.bitfields. It's very simple and elegant, but somewhat low-level and lacking features I needed, like accessors and bounds checking.

math.bitfields uses a simple array-based signature to specify bitfields. By contrast, what I wrote uses a new definition syntax inspired by tuples. To define a bitfield which has three fields, each with 5 bits of data in them (a number from 0 to 31) and three bits of 0-padding between them, you would use


BITFIELD: foo bar:5 000 baz:5 000 bing:5 ;


This defines several words. First, it defines <foo>, which is the constructor for the bitfield. Second, it defines foo-bar, foo-baz and foo-bing which access the various fields. Third, it defines with-foo-bar, with-foo-baz and with-foo-bing which take a bitfield and a field value and return a new bitfield with the new field value. This is meant to be analogous to set-foo-bar of tuples, though it is somewhat different, as bitfields are immutable.

(An analogous parsing word, SAFE-BITFIELD:, defines the same thing, but includes bounds checks in the constructor and setters. This is useful for debugging, and in less certain situations, at a minimal performance penalty.)

All of these words are simply operations on integers. Yet through the macro, they create an abstraction allowing regular old unboxed fixnums to be viewed as complex objects. In the future, I may extend the bitfield system to create another word, foo, which contains metadata about the bitfield itself. I might also define foo?, to test if an integer can be interpreted as a well-formed foo. But I'm not sure how useful that would be.

To Lispers (and Factorers), this kind of thing elicits a big yawn. Of course you can do that. In Lisp, the invocation might look a little more like this:

(defbitfield foo :bar 5 "000" :baz 5 "000" :bing 5)

but it could do basically the same thing. [Note: the reason the padding is in quotes in Lisp but not in Factor is that, in Factor, it is easy to set things up such that 000 isn't parsed as a number, whereas this is a bit more difficult in Lisp.] I don't even have to mention the fact that this would be impossible in the languages most people use. (Don't you love that construction?) But there are some macro systems, most notably Scheme's (but also Dylan's and many others), which would not allow this. Scheme has what's called a hygienic macro system. The "hygiene" there refers to the fact that the macro doesn't come in contact with any variables which were not given as arguments to it. Basically, it's like putting lexical scope in macros. More formally put, it prevents inadvertent name capture, or name "overlap" between different definitions of the same thing. A better introduction is here.

This prevents a lot of subtle bugs for beginner macro writers*, but it also adds limitations. Most importantly, it is impossible to make new symbols. That means that, to have the same thing in Scheme, an invocation would have to look more like

(define-bitfield foo foo? (foo-bar with-foo-bar 5) "000" (foo-baz with-foo-baz 5) "000" (foo-bing with-foo-bing 5))

Discounting elaborate hacks, runtime penalties or a reduction in functionality, this is as simple as it can get. The Schemer may argue (as I once did) that it is clearer when all variables and function names are explicitly mentioned. However, when the functions defined are known to the programmer though macro conventions, there is no loss of clarity. In this case, I think it's a bit more clear what's going on without the redundant function name listings.



* Factor, due to its lack of lexically scoped local variables and the way it handles symbols, isn't actually subject to most of these bugs.

2 comments:

Graham Fawcett said...

Scheme has what's called a hygienic macro system.

Dan, while it may not be part of the specification of the Scheme language, most practical Scheme implementations support unhygenic macros.

Unknown said...

Thanks for your comment. I was semi-aware that some Scheme implementations had this, but I didn't know if it was very widespread. Anyway, the main point wasn't to criticize Scheme. I have no real interest in doing that; it's a great language. I should express the advantages of hygienic macros in Scheme more clearly, just to set the record straight: aside from allowing all gensym's to be implied, hygienic macros are the only real way to prevent the local shadowing of functions or macros which your macro call expands to. No matter how cond is implemented, this code should work:

(let ((if 1))
(cond ((> if 3) 'wtf)
(else 'good)))

Err, that doesn't work in Chicken Scheme in the interpreter, actually. (But it does work in SISC.) Either I'm understanding something wrong, or Chicken 2.6 has a bug...