Here's a quick summary of how 'Tavi's parse engine works. Feel free to flesh it out as you explore the code, or to ask particular questions to get particular clarifications.

Invokers

The parse engine is called in several locations. The primary caller is action_view, in action/view.php, which calls parseText to display a page.

Secondary callers that display page text are

Some other functions call the parse engine in slightly different ways. These are:

Basic Principles

The parse engine consists of two things. First, it is the routine parseText in parse/main.php. This function is passed a string of text to process. Second, it is the list of routines that are applied to this text (this is also passed as a parameter to parseText).

parseText splits the text it is given into lines. For each line, it loops through the list of parsing routines, passing the line into each routine in turn. The output of one routine is the input to the next. After the last routine finishes, the result is stored as the formatted output.
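
To make the loop concrete, here is a minimal sketch of the idea. It is not the real parseText: the function sketch_parse_text and the two stand-in routines are hypothetical names invented for illustration, and the real function does more bookkeeping than this.

```php
<?php
// Hypothetical stand-ins for parse routines: each takes a line and returns it.
function sketch_trim_line($line)
{
    return rtrim($line);
}

function sketch_escape_html($line)
{
    return htmlspecialchars($line, ENT_QUOTES);
}

// Sketch of the loop described above (NOT the real parseText).
// $engine is a list of routine names, in the order they should run.
function sketch_parse_text($text, array $engine)
{
    $output = '';
    foreach (explode("\n", $text) as $line) {
        foreach ($engine as $routine) {
            // The output of one routine is the input to the next.
            $line = $routine($line);
        }
        $output .= $line . "\n";
    }
    return $output;
}

echo sketch_parse_text(
    "a < b  \nsecond line",
    array('sketch_trim_line', 'sketch_escape_html')
);
```

Because the engine is just a list of names, swapping in a different list gives a different parser without touching the loop.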

The general purpose of the parsing routines is to take the text and do something to it. In the common case, this means transforming it from wiki markup (see FormalFormattingRules) into well-formed XHTML, with each routine generally handling one aspect of the markup.

Most of the parse routines are stateless; i.e., they act on a single line of text without concern for the lines that come before or after. However, some of the parse routines need to be stateful, such as the list parser or the table parser. These parsers use static variables to keep track of what they've seen on previous lines, because it influences how they process the current line.
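
As a rough illustration of a stateful routine, the sketch below uses a static variable to remember whether it is inside a list. The name sketch_parse_list and the "*" list syntax are assumptions made for the example, and it emits HTML directly for brevity, whereas the real routines emit $Entity references (described further down).

```php
<?php
// Hypothetical stateful routine: remembers between calls whether a list is open.
function sketch_parse_list($line)
{
    static $in_list = false;   // survives from one call (line) to the next

    $is_item = (strncmp($line, '*', 1) === 0);
    $prefix  = '';

    if ($is_item && !$in_list) {
        $prefix  = '<ul>';     // first item: open the list
        $in_list = true;
    } elseif (!$is_item && $in_list) {
        $prefix  = '</ul>';    // first non-item after a list: close it
        $in_list = false;
    }

    if ($is_item) {
        $line = '<li>' . ltrim(substr($line, 1)) . '</li>';
    }
    return $prefix . $line;
}

foreach (array('* one', '* two', 'done') as $line) {
    echo sketch_parse_list($line), "\n";
}
```

Note that a sketch like this only closes the list when it sees a following non-list line, so a real routine also needs some way to flush its state at the end of the text.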

$ParseEngine

The main parse engine is described by the $ParseEngine array; this is the parser used to transform wiki markup into XHTML. We'll concentrate on the mechanics of this parser.

As a reminder, $ParseEngine is a list of routines for parseText to call in turn on each line.

The parse routines can't simply spit out XHTML right into the text that they return. Imagine the confusion that would result from trying first to convert http://somewhere.com/BlahBlah/BlahBlah into a URL, and then into a WikiWord link. Oops. We'd try to do both and make a terrible mess of it. So we have to temporarily store some text to be pasted back in later.

This is done using the global $Entity array, where we store things to be remembered later. In the text, we stick a marker referencing an element of the $Entity array. Each entry in the $Entity array is itself an array. Its first element is a text label identifying it ('raw' or 'bold_start' or some such; see $DisplayEngine in config.php), followed by zero or more pieces of textual information that accompany it. 'bold_start' doesn't need any more information: it signifies a need for a <strong> XHTML tag. But 'url' needs more information: it has to tell us, in order, what URL to link to and what text to use for the link.

The marker that is stuck in the text is a number, with ASCII character 255 preceding it and following it. Later, we know when we see the marker, we need to look up the corresponding element in the $Entity array and figure out what to do with it.

Obviously, the first order of business is to get rid of any existing 255 characters in the text. These are stored in the $Entity array so we can restore them later. Unlikely as we may be to find them, if we didn't do this, they might wrongly signal a reference to $Entity.
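
The storage scheme can be sketched as follows. Everything here is illustrative: SketchEntity stands in for $Entity, the function names are made up, and storing stray 255 characters under the 'raw' label is an assumption rather than something taken from the code.

```php
<?php
$GLOBALS['SketchEntity'] = array();   // hypothetical stand-in for $Entity
define('SKETCH_FLAG', chr(255));      // the marker character

// Store an entry and return its in-text marker: chr(255) . index . chr(255).
function sketch_new_entity(array $entry)
{
    $GLOBALS['SketchEntity'][] = $entry;
    $index = count($GLOBALS['SketchEntity']) - 1;
    return SKETCH_FLAG . $index . SKETCH_FLAG;
}

// Replace any 255 characters already present in the text with a stored entry,
// so they cannot be mistaken for markers later on.
// ASSUMPTION: the 'raw' label is used for these; the real code may differ.
function sketch_protect_flags($line)
{
    return preg_replace_callback('/\xFF/', function ($m) {
        return sketch_new_entity(array('raw', $m[0]));
    }, $line);
}

// A 'url' entry carries two extra pieces: the target and the link text.
$marker = sketch_new_entity(array('url', 'http://somewhere.com/', 'somewhere'));
```

After this runs, $marker is a short string that can sit in the line where the URL used to be, and later routines will not see anything in it that looks like wiki markup.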

Each parse routine is called in turn. It takes the line of text, replaces appropriate wiki markup with references to $Entity, and returns the modified line.

The very last parse routine called on each line is parse_elements. It runs through the line looking for references to $Entity. It then replaces these with what they stand for. For instance, imagine that it sees an item of type 'url'. It looks up 'url' in $DisplayEngine and finds that it maps to the function html_url. It calls html_url (passing it the two other elements in the array, which happen to be the URL and link text), which returns the actual HTML reference. It then pastes this into the line of text.
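
A minimal sketch of that final pass might look like this. The sketch_ functions and the small $display map are hypothetical stand-ins for parse_elements, html_url and $DisplayEngine; the real versions take their data from globals rather than parameters.

```php
<?php
// Hypothetical display functions; the real ones are named in $DisplayEngine.
function sketch_html_url($url, $text)
{
    return '<a href="' . htmlspecialchars($url) . '">'
         . htmlspecialchars($text) . '</a>';
}
function sketch_html_bold_start() { return '<strong>'; }
function sketch_html_bold_end()   { return '</strong>'; }

// Swap each marker for the output of the display function its entry names.
function sketch_parse_elements($line, array $entities, array $display)
{
    return preg_replace_callback(
        '/\xFF(\d+)\xFF/',
        function ($m) use ($entities, $display) {
            $entry = $entities[(int) $m[1]];
            $type  = array_shift($entry);   // e.g. 'url'; the rest are arguments
            return call_user_func_array($display[$type], $entry);
        },
        $line
    );
}

$display = array(
    'url'        => 'sketch_html_url',
    'bold_start' => 'sketch_html_bold_start',
    'bold_end'   => 'sketch_html_bold_end',
);
$entities = array(
    array('bold_start'),
    array('bold_end'),
    array('url', 'http://somewhere.com/', 'somewhere'),
);
$f    = chr(255);
$line = "See {$f}0{$f}this{$f}1{$f}: {$f}2{$f}";
echo sketch_parse_elements($line, $entities, $display), "\n";
```

Keeping the label-to-function map separate from the routines is what lets the same stored entries be rendered differently by a different display engine.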

Presto! The wiki markup has been transformed into XHTML, piece by piece.

The Parse Routines

Most of the parse routines act on a single line. They call preg_replace to replace wiki markup with $Entity references. Let's look at a simple one as an example:

  function parse_bold($text)
  {
    return preg_replace("/'''(()|[^'].*)'''/Ue", "pair_tokens('bold', q1('\\1'))",
                        $text, -1);
  }

The first parameter to preg_replace is the pattern to match. In this case, it's simply the regular expression for the bold syntax. The "/Ue" instructs preg_replace to be Un-greedy (i.e., ".*" matches the shortest possible string, rather than the longest), and to treat the replacement string as executable code.

The replacement code is pair_tokens('bold', q1('\\1')). The function pair_tokens takes the first parameter and converts it into two $Entity entries, 'bold_start' and 'bold_end'. It then turns the $Entity indexes into entity references (numbers surrounded by 255 characters) and surrounds the given text (its second parameter) with them. It returns the whole lot.

So what is q1('\\1')? '\\1' serves as a back-reference to the first substring in parentheses in the pattern. In this case, it's everything between the triple quotes. So we want to pass this directly into pair_tokens, right? Wrong. When PHP substitutes a back-reference into replacement code under the "e" modifier, it escapes the quotes in the matched text with backslashes. Because '\\1' sits inside single quotes in the evaluated code, the escaped single quotes come out correctly, but each double quote is left with a stray backslash in front of it. q1 is a simple function to strip out those backslashes.

Summary: parse_bold turns all occurrences of bold markup into two $Entity references, 'bold_start' and 'bold_end', and leaves the text in-between alone (so that later parse routines can still mess with it). Later, parse_elements will turn the 'bold_start' and 'bold_end' references into <strong> and </strong>.
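
One caveat for anyone trying this on a current PHP: the "e" modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the routine above will not run there as written. The sketch below shows one way the same routine could be expressed with preg_replace_callback. It is not the project's code: sketch_pair_tokens is a simplified guess at what pair_tokens does, and BoldSketchEntity stands in for $Entity.

```php
<?php
$GLOBALS['BoldSketchEntity'] = array();   // hypothetical stand-in for $Entity

// Hypothetical pair_tokens: stores '<type>_start' and '<type>_end' entries
// and wraps $text in the two corresponding markers.
function sketch_pair_tokens($type, $text)
{
    $f = chr(255);
    $GLOBALS['BoldSketchEntity'][] = array($type . '_start');
    $start = count($GLOBALS['BoldSketchEntity']) - 1;
    $GLOBALS['BoldSketchEntity'][] = array($type . '_end');
    $end = $start + 1;
    return $f . $start . $f . $text . $f . $end . $f;
}

function sketch_parse_bold($text)
{
    // Same pattern as parse_bold, minus the "e" modifier. The callback gets
    // the matched text unescaped, so no q1-style cleanup is needed.
    return preg_replace_callback(
        "/'''(()|[^'].*)'''/U",
        function ($m) {
            return sketch_pair_tokens('bold', $m[1]);
        },
        $text
    );
}

$result = sketch_parse_bold("some '''bold''' text");
```

The text between the markers is left untouched, so a later routine in the engine can still find wiki names or other markup inside it.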

The Other Parse Engines

A couple of the callers of parseText are only interested in finding wiki page names in a string. So they pass it an engine that consists solely of 'parse_freelink' and 'parse_wikiname' to look for free links (((free links))) and wiki names (WordsSmashedTogether). They then "cheat" by looking through $Entity for items of type 'ref' to see what page references were found. This is used, for example, when you type something in the "Add document to category" text box when you save a page.
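
Here is a rough sketch of that "cheat". The routine name, the RefSketchEntity store, and especially the wiki-name pattern are simplified assumptions; the real parse_wikiname pattern is more careful about what counts as a wiki name.

```php
<?php
$GLOBALS['RefSketchEntity'] = array();   // hypothetical stand-in for $Entity

// Hypothetical wiki-name routine: records each WordsSmashedTogether as a
// 'ref' entry and leaves a marker in its place.
// ASSUMPTION: this pattern is a simplification, not the real one.
function sketch_parse_wikiname($line)
{
    return preg_replace_callback('/\b(?:[A-Z][a-z]+){2,}\b/', function ($m) {
        $GLOBALS['RefSketchEntity'][] = array('ref', $m[0]);
        $index = count($GLOBALS['RefSketchEntity']) - 1;
        return chr(255) . $index . chr(255);
    }, $line);
}

// Run the reduced engine, ignore the text it returns, and read the store.
function sketch_find_page_names($text)
{
    $GLOBALS['RefSketchEntity'] = array();
    foreach (explode("\n", $text) as $line) {
        sketch_parse_wikiname($line);
    }
    $pages = array();
    foreach ($GLOBALS['RefSketchEntity'] as $entry) {
        if ($entry[0] === 'ref') {
            $pages[] = $entry[1];
        }
    }
    return $pages;
}

print_r(sketch_find_page_names("CategoryParser and also FrontPage"));
```

The formatted text is thrown away here; the only thing the caller wants is the list of page names that the routine noticed along the way.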


Could anyone provide an example of adding a new routine to the parse engine? I want to develop a patch to add some new formatting (more table options, image alignment, images that link etc) but could do with an example to work from. Could just be me being thick, but I need a bit more spoonfeeding than the above.


How can the parse engine be used in other applications? It seems to me that the Tavi markup language is well designed, flexible and could therefore become a component of other projects.

-- LaurentJacques

Perhaps ask your question on the TaviMailingList.
Actually, I've solved this problem now and have a working, modular Tavi-style txtToXhtml filter that can be incorporated into any PHP project that needs it. It's currently working very nicely with the SmartyTemplateEngine.

-- SalimFadhley

Paul M Jones wrote a PEAR-compatible wiki parser that can parse Tavi-style syntax.

-- UrsGehrig