Sunday, March 08, 2009

More on the new syntax engine for bespin

And so here continues my journey of porting the CodeMirror parser / tokenizer to the bespin syntax engine. In my first iteration I basically just merged the codebases and managed some initial nicely-colorized rendering, but I bypassed the mechanism which actually makes the CodeMirror parser / tokenizer an interesting choice for bespin: interruptible / resumable parsing. Now this is halfway working, so let me explain it in detail:
The default bespin syntax highlighter works on a per-line basis and is extremely fast (orders of magnitude faster than any approach involving real parsing), and its performance is independent of the document size. But if we want real syntax analysis in real time, there is a price to pay, and that price is a slower syntax engine. With some real computer science (many dissertations have been written about parsers) and a lot of tricks, though, the user hopefully won't perceive any loss in responsiveness.
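
Just to make the contrast concrete, here is a minimal sketch of what such a stateless per-line highlighter can look like (purely illustrative; the function name and the regex rules are made up, this is not bespin's actual highlighter). Each line is colored on its own, so editing one line never touches any other line, but constructs that span lines, like block comments, can never be recognized correctly:

function highlightLine(line) {
  // A few made-up coloring rules, each anchored at the start of the remaining text.
  var rules = [
    { regex: /^\/\/.*/,                            style: "comment" },
    { regex: /^"(?:[^"\\]|\\.)*"/,                 style: "string"  },
    { regex: /^(?:var|function|return|if|else)\b/, style: "keyword" }
  ];
  var tokens = [];
  while (line.length > 0) {
    var token = { text: line.charAt(0), style: "plain" }; // fallback: consume one character
    for (var i = 0; i < rules.length; i++) {
      var m = rules[i].regex.exec(line);
      if (m) { token = { text: m[0], style: rules[i].style }; break; }
    }
    tokens.push(token);
    line = line.slice(token.text.length);
  }
  return tokens; // e.g. [{ text: "var", style: "keyword" }, { text: " ", style: "plain" }, ...]
}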

The streaming tokenizer / parser approach

While most parsers need to read the full document and parse it all at once, CodeMirror works more like a streaming pipeline engine: it can start at any line in the document, and after processing that line it immediately spits out the tokens containing the syntax coloring information, and in the near future callbacks for indentation and maybe even code completion. So how does this engine work? The main trick is that the parser stores its state at each line, so it can resume its operation at any line. Of course, with all the details it is a bit more complicated:

First we have our document; bespin internally represents it as an array of lines of text. Then there is a StringStream wrapper, which lets us treat the document like a stream. Next comes a stream traverser, which moves forward through the stream and passes the stream content to the tokenizer, which splits it into tokens. The last element in the pipeline, the parser, does the main job of further analyzing those tokens. For every line the parser attaches a copy of itself (or better: its internal state) to that line. So once the document is initially parsed up to a certain line Y and the user changes something at a previous line X, the engine retrieves the stream starting at line X, grabs the parser state which has been attached to line X and continues its job from there.
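
Here is a rough, self-contained sketch of that pipeline. All the names (StringStream, SyntaxEngine, CommentState, onLineChanged) are made up for illustration, and the real bespin / CodeMirror code is structured differently; the toy parser state only tracks whether we are inside a /* ... */ block comment, which is just enough to show why the per-line snapshots matter:

// Minimal stream over bespin's "array of text lines" document model.
function StringStream(lines, startLine) {
  this.lines = lines;
  this.line = startLine;
}
StringStream.prototype.nextLine = function () {
  return this.line < this.lines.length ? this.lines[this.line++] : null;
};

// Toy parser state; copy() gives us the per-line snapshot we attach to lines.
function CommentState(inComment) { this.inComment = !!inComment; }
CommentState.prototype.copy = function () { return new CommentState(this.inComment); };
CommentState.prototype.tokenizeLine = function (line) {
  var tokens = [], pos = 0;
  while (pos < line.length) {
    if (this.inComment) {
      var end = line.indexOf("*/", pos);
      var stop = (end === -1) ? line.length : end + 2;
      tokens.push({ text: line.slice(pos, stop), style: "comment" });
      if (end !== -1) this.inComment = false;
      pos = stop;
    } else {
      var start = line.indexOf("/*", pos);
      if (start === -1) {
        tokens.push({ text: line.slice(pos), style: "plain" });
        pos = line.length;
      } else {
        if (start > pos) tokens.push({ text: line.slice(pos, start), style: "plain" });
        tokens.push({ text: "/*", style: "comment" });
        this.inComment = true;
        pos = start + 2;
      }
    }
  }
  return tokens;
};

// The engine keeps one state snapshot per line boundary, so an edit at line x
// only forces re-tokenizing from x onward (a smarter version would stop as soon
// as the newly computed state matches the previously stored one).
function SyntaxEngine(lines) {
  this.lines = lines;
  this.states = [new CommentState(false)]; // state *before* line 0
  this.tokensPerLine = [];
  this.onLineChanged(0);                   // initial full parse
}
SyntaxEngine.prototype.onLineChanged = function (x) {
  var stream = new StringStream(this.lines, x);
  var state = this.states[x].copy();       // resume from the snapshot attached to line x
  var text;
  while ((text = stream.nextLine()) !== null) {
    this.tokensPerLine[stream.line - 1] = state.tokenizeLine(text);
    this.states[stream.line] = state.copy(); // attach a snapshot to the next line
  }
};

// Usage: editing a later line reuses the snapshot stored there instead of
// re-reading the whole document.
var engine = new SyntaxEngine(["var a = 1; /* start", "still comment", "done */ var b;"]);
engine.lines[1] = "edited line";
engine.onLineChanged(1); // only lines 1..end are re-tokenized

The important bit is the state.copy() inside the loop: every line boundary gets its own snapshot, which is exactly the "copy of the parser attached to the line" described above.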

Of course there are a lot of things which can go wrong, for example the engine losing sync between the stream representing the actual text lines, the document itself and the stored parser states. If this happens, the document needs to be fully reparsed, which is a performance disaster. And this does happen right now, because a few parts are not implemented yet, but I am working on it.
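
A crude safeguard, in terms of the hypothetical sketch above, could be as simple as checking that the bookkeeping still lines up with the document before resuming (again illustrative, not the actual bespin code):

// There should be exactly one state snapshot per line boundary: one before
// line 0 and one after each parsed line. If that invariant is broken, the
// only safe recovery is the full reparse described above.
function statesInSync(lines, states) {
  return states.length === lines.length + 1;
}

// e.g. inside the engine:
//   if (!statesInSync(this.lines, this.states)) { this.states.length = 1; this.onLineChanged(0); }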

I am very excited about integrating the engine with Malte Ubi's work: offloading the engine to web workers / Gears for asynchronous background syntax analysis, and merging it with the upcoming thunderhead editor component, which is optimized for calling the syntax engine only when really necessary. Right now I am messing with caching (otherwise the parser would block the UI, arrghh!): I try to detect whether calling the syntax engine is actually necessary, and if not, I serve cached results when they are available.
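
To sketch the caching part (hypothetical names again, reusing the SyntaxEngine interface from the sketch above rather than the real bespin or thunderhead APIs): the editor may ask for tokens on every repaint, but the engine is only re-run for lines that are actually dirty, and everything else is served from a per-line token cache so the UI thread stays responsive:

function TokenCache(engine) {
  this.engine = engine;   // assumed to expose tokensPerLine and onLineChanged(x)
  this.dirtyFrom = -1;    // lowest edited line since the last parse, or -1 if clean
}
TokenCache.prototype.markDirty = function (lineNo) {
  if (this.dirtyFrom === -1 || lineNo < this.dirtyFrom) this.dirtyFrom = lineNo;
};
TokenCache.prototype.getTokens = function (lineNo) {
  if (this.dirtyFrom !== -1 && lineNo >= this.dirtyFrom) {
    this.engine.onLineChanged(this.dirtyFrom);  // reparse only from the edit point
    this.dirtyFrom = -1;
  }
  return this.engine.tokensPerLine[lineNo];     // otherwise the cached tokens are enough
};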

So the next week will be very interesting; let's see how rapidly these things evolve.