As stated in TaviBugs/InternationalCharacters, we have a problem recognising WikiNames when they are written in Unicode. I see two different solutions to this problem:

  1. Do a byte-by-byte search for WikiNames
  2. Change the text into equivalent 8-bit characters (losing some charsets), and use ordinary regexps

The rest of this document discusses the size of the conversion tables needed, followed by some ramblings about which ranges of Unicode characters we need to consider.

The two solutions

This section discusses the two solutions in more detail.

Byte-by-byte search

There exist tables [1] describing which characters are lowercase and which are uppercase. These tables need to be converted into PHP variables that are easy and fast to search.

The algorithm of this search is:

Set $prevState = 'o', $withinWikiName = false, $endSearch = false
For each character in string do:
  $char equals next character
  IF $char is a multibyte starter (that is code >= 128)
    $char .= next character   (handles two-byte sequences only)
  identify $char as Uppercase (U), lowercase (l), "/" or other (o) => $state = [Ulo/]
  IF ($state == U)
    IF (not $withinWikiName)
      start looking for a wikiName, $withinWikiName = true
      start collating the wikiName, $wikiName = $char
      start collating the casings, $wikiCase = "U"
    ELSE
      IF (not ($prevState == l or $prevState == "/"))
        $endSearch = true
      ELSE
        extend $wikiCase with "U"
        extend $wikiName with $char
  ELSE IF ($state == l and $withinWikiName)
    extend $wikiName with $char
    extend $wikiCase with "l"
  ELSE IF ($state == "/")
    IF (not $withinWikiName)
      start looking for a wikiName, $withinWikiName = true
      start collating $wikiCase = "/"
      start collating $wikiName = $char
    ELSE
      IF (not $prevState == "/")
        extend $wikiCase with "/"
        extend $wikiName with $char
      ELSE
        $endSearch = true
  ELSE
    IF ($withinWikiName)
      $endSearch = true

  IF ($endSearch)
    IF (valid wikiname $wikiCase)
      generate wiki link of $wikiName
    reset for the next name, $withinWikiName = false, $endSearch = false

  $prevState = $state

After the last character:
  IF ($withinWikiName and valid wikiname $wikiCase)
    generate wiki link of $wikiName
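
The scan above can be sketched in a few lines. This is only an illustration in Python rather than PHP, and it makes assumptions that are not in the algorithm itself: it works on already decoded characters, it uses Python's built-in isupper()/islower() in place of the lookup table, and the rule for a "valid wikiname" (two or more capitalised words, optionally prefixed or joined by single slashes) is a guess. The names classify, find_wiki_names and VALID_CASE are hypothetical.

```python
import re

# Assumed validity rule (the text above does not define one): two or more
# capitalised words, optionally prefixed or joined by single slashes.
VALID_CASE = re.compile(r"^/?Ul+(?:/?Ul+)+$")


def classify(ch):
    """Return 'U', 'l', '/' or 'o' for one decoded character."""
    if ch == "/":
        return "/"
    if ch.isupper():
        return "U"
    if ch.islower():
        return "l"
    return "o"


def find_wiki_names(text):
    """Scan text once and return the WikiName candidates that validate."""
    found = []
    name, case = "", ""   # an empty name means "not within a WikiName"
    prev_state = "o"

    def flush():
        nonlocal name, case
        if VALID_CASE.match(case):
            found.append(name)
        name, case = "", ""

    for ch in text:
        state = classify(ch)
        if state == "U":
            if name and prev_state == "U":
                # Two capitals in a row end the current candidate;
                # as in the pseudocode, this capital is dropped.
                flush()
            else:
                name += ch
                case += "U"
        elif state == "l":
            if name:
                name += ch
                case += "l"
        elif state == "/":
            if name and prev_state == "/":
                flush()
            else:
                name += ch
                case += "/"
        elif name:
            flush()
        prev_state = state
    if name:
        flush()   # a name may run to the very end of the text
    return found


print(find_wiki_names("Se SeilBåt"))
```

Collecting the casing string separately, as the pseudocode does, keeps the validity check independent of the alphabet: the regexp only ever sees the letters U, l and "/".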

The algorithm is easy to code in PHP, with the exception of the 'identify $char' line. That could, however, be accomplished with a table of 128 lines, where each line contains 256 characters representing the state of the corresponding character. This would solve the problem for most western languages.

Here is an example of how to use this table. Say we want the state of ö. This character has the two-byte encoding 195, 182 (in decimal). We subtract 128 from the first number and look up $convTable[195-128][182], which should return 'l' since ö is a lowercase character. The only problem with this approach is that 128x256 bytes results in a 32 KB structure...
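
The ö lookup can be tried out with a small sketch. It is Python rather than PHP, and the table here is an assumption: a sparse stand-in for $convTable that only has rows for the two-byte lead bytes (C2..DF) and is filled from Python's own case information instead of the Unicode tables. The names build_row, conv_table and state_of are hypothetical.

```python
def build_row(first_byte):
    """Build one 256-entry row of states for a two-byte UTF-8 lead byte."""
    row = ["o"] * 256
    for second in range(0x80, 0xC0):  # the valid continuation bytes
        ch = bytes([first_byte, second]).decode("utf-8")
        if ch.isupper():
            row[second] = "U"
        elif ch.islower():
            row[second] = "l"
    return "".join(row)


# Sparse stand-in for $convTable: a row exists only for lead bytes C2..DF,
# indexed by (first byte - 128) as in the text.
conv_table = {first - 128: build_row(first) for first in range(0xC2, 0xE0)}


def state_of(encoded):
    """Look up the state of one two-byte UTF-8 sequence."""
    first, second = encoded
    return conv_table[first - 128][second]


first, second = "ö".encode("utf-8")
print(first, second)                     # 195 182
print(conv_table[first - 128][second])   # l
```

Keeping each row as a plain 256-character string means a lookup is two index operations, which is the property the full 128x256 table is meant to provide.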

And this only covers two-byte encodings. What if some language requires three bytes? Or four bytes?

Changing into 8-bit equivalent

This approach also needs to detect multibyte characters, but we could still use regexps to identify the WikiNames. The problem is how to pick the correct sequence out of the multibyte string after converting it to an 8-bit equivalent. The equivalent itself can't be used as the link text, since it is not identical to the original but a version stripped of accents, special characters and so on.

An example to illustrate: consider the string "Se SeilBåt" (SeilBåt is Norwegian for SailBoat, using the character å). This string is encoded in 11 bytes, because å takes two. The equivalent string would be "Se SeilBat" (10 characters), with the multibyte-encoded å replaced by a single 'a'. This is easy to match using regexps, but it's not easy to tell where in the multibyte string the match originated, or how long the original was. This approach is therefore not doable.

I also thought of replacing the multibyte extension with some other character, so that in the previous example we would match against "Se SeilB?at", where "?" is an appropriate filler character and the equivalent string keeps the length of the multibyte string. This is doable, and it's rather easy to extend the regexps to allow this extraneous character to appear anywhere. The downside is that after finding the WikiNames in the equivalent string, we first need to extract their positions and then pull the corresponding substrings from the multibyte string. But this is an approach which could be done.
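
A minimal sketch of the padded variant, in Python rather than PHP and under several assumptions that are not in the text: the filler is a NUL character instead of "?" (so it cannot collide with real page text), accents are stripped with Unicode NFD decomposition, and multibyte characters without an ASCII base letter are blanked out with a second marker character. The names to_padded_ascii, find_wiki_names_padded and WIKI_NAME are hypothetical.

```python
import re
import unicodedata

FILLER = "\x00"  # assumed padding character, must not occur in page text
OTHER = "\x01"   # assumed marker for multibyte characters with no ASCII base

# A WikiName in the padded string: fillers may precede any letter.
WIKI_NAME = re.compile(r"(?:\x00*[A-Z](?:\x00*[a-z])+){2,}")


def to_padded_ascii(text):
    """Return an ASCII string with exactly one character per UTF-8 byte."""
    out = []
    for ch in text:
        n = len(ch.encode("utf-8"))
        if n == 1:
            out.append(ch)
            continue
        base = unicodedata.normalize("NFD", ch)[0]
        if base.isascii() and base.isalpha():
            # e.g. å (2 bytes) becomes one filler plus 'a'
            out.append(FILLER * (n - 1) + base)
        else:
            # No ASCII base letter (ø or € for instance): blank it out.
            out.append(OTHER * n)
    return "".join(out)


def find_wiki_names_padded(text):
    """Match on the padded string, then slice the original bytes."""
    raw = text.encode("utf-8")
    padded = to_padded_ascii(text)
    return [raw[m.start():m.end()].decode("utf-8")
            for m in WIKI_NAME.finditer(padded)]


print(find_wiki_names_padded("Se SeilBåt"))
```

Because the padded string has one character per byte of the original, the match offsets can be used directly as byte offsets, which is what makes the second step (pulling the substring out of the multibyte string) straightforward.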

It does, however, still depend on correctly identifying/converting the multibyte characters, which still has to be done character by character, so I believe the first approach is the simplest and possibly the way to go.

How many bytes are needed

The following table was snipped from the unicode site:

Table 3.1B. Legal UTF-8 Byte Sequences
Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+FFFF       E1..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF
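
The first-byte column of this table is enough to tell how many bytes a character occupies, which is what a byte-by-byte scanner needs in order to step through three- and four-byte sequences as well. A minimal sketch in Python (not PHP); the function names utf8_sequence_length and split_utf8 are hypothetical, and the continuation bytes are not validated here.

```python
def utf8_sequence_length(first_byte):
    """Length in bytes of a UTF-8 sequence, judged by its first byte."""
    if first_byte <= 0x7F:
        return 1
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    # 80..C1 and F5..FF never start a legal sequence.
    raise ValueError("not a legal first byte: 0x%02X" % first_byte)


def split_utf8(data):
    """Split a UTF-8 byte string into one byte string per character."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


print(split_utf8("Båt".encode("utf-8")))   # [b'B', b'\xc3\xa5', b't']
```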

This indicates that my first thought of 128x256 is wrong, hooray! Looking at the case conversion tables, I see the need to translate characters up to U+1FFF. What this really means for which tables need to be available is something I need to look further into.

But at least the conversion tables need to include the first three/four rows, and that is:

But this could be reduced by ignoring multibyte characters where we're sure no legal lowercase/uppercase characters exist. This is interesting. Maybe it's actually possible to do this!

Some range considerations

Is it possible to narrow down which ranges are used for cased characters? Of course it is. Here are two different approaches. The first is based on manually looking through the list for lowercase/uppercase characters, and the other is based on another Unicode document about European languages.

What ranges include lowercase/uppercase characters

A table compiled by looking through the list of core properties, focusing on lowercase/uppercase.

Range       Character set              # chars  # bytes/char  Total bytes
0041..02E4  Latin                          676             2         1352
0345..03FB  Greek                          183             2          366
0400..050F  Cyrillic                       110             2          220
0531..0587  Armenian                        57             2          114
10A0..10C5  Georgian, ignore?               38             3          114
1D00..1FFB  Latin/Cyrillic                 764             3         2292
2000...     lowercase to be ignored?

This totals 1828 characters whose case must be recognised. Fully encoded, they come to 4458 bytes, which indicates that the total size of the convTable should be in the vicinity of 5-6 KB. And that's promising...
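
The totals can be re-derived from the rows of the table. A small Python check, using only the per-row counts and bytes-per-character figures given there (the open-ended 2000... row carries no numbers and is left out):

```python
# (character set, number of characters, bytes per character), as in the table
rows = [
    ("Latin", 676, 2),
    ("Greek", 183, 2),
    ("Cyrillic", 110, 2),
    ("Armenian", 57, 2),
    ("Georgian", 38, 3),
    ("Latin/Cyrillic", 764, 3),
]

total_chars = sum(count for _, count, _ in rows)
total_bytes = sum(count * width for _, count, width in rows)

print(total_chars)   # 1828
print(total_bytes)   # 4458
```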

Based on unicode-document

A Unicode document on the European alphabetic scripts refers to the following ranges of characters:

Range       Name                       Comments       Useful?
0041..007A  Basic Latin                               Yes
00C0..00FF  Latin-1 Supplement                        Yes
0100..017F  Latin Extended-A                          Yes
0180..024F  Latin Extended-B                          Yes
0250..02AF  IPA Extensions             Phonetic       No
1E00..1EFF  Latin Extended Additional                 Yes?
FB00..FB06  Latin ligatures                           No
0370..03FF  Greek                                     Yes
1F00..1FFF  Greek Extended                            Yes
0400..04FF  Cyrillic                                  Yes
0530..058F  Armenian                                  Yes
10A0..10FF  Georgian                   Only one case  No?
16A0..16F0  Runic                      Ancient!       No
1680..169F  Ogham                      Ancient!       No

As this corresponds rather nicely with the previous table, I think it should be the basis for identifying which ranges to convert/ignore. Do note that, as far as wiki use of characters is concerned, all Unicode combining sequences are ignored. Only precomposed characters will be considered.

That is, one glyph should be one multibyte encoding. For example, it's possible to code ö as the combination of "o" and "combining diaeresis". This combination will not be recognised in 'Tavi.
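
The difference between the two ways of coding ö is easy to see in the encoded bytes. The following Python sketch only illustrates it; the helper has_combining_marks is hypothetical, and the NFC normalisation shown at the end is one possible way of turning the combination into the precomposed form, not something 'Tavi is described as doing.

```python
import unicodedata

precomposed = "\u00f6"   # ö as a single code point
combined = "o\u0308"     # "o" followed by combining diaeresis

print(len(precomposed.encode("utf-8")))   # 2
print(len(combined.encode("utf-8")))      # 3


def has_combining_marks(text):
    """True if text contains any combining character."""
    return any(unicodedata.combining(ch) for ch in text)


print(has_combining_marks(precomposed))   # False
print(has_combining_marks(combined))      # True

# Assumption: NFC normalisation composes the pair into the single code point.
print(unicodedata.normalize("NFC", combined) == precomposed)   # True
```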

The column Useful? indicates whether the range is to be considered/checked for upper- or lowercase characters.
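
As a sketch of how the Useful? column could be put to work, the ranges marked Yes (and, as an assumption, the one marked Yes?) can be kept in a short list, and any character outside them treated as "other" without a table lookup. This is Python rather than PHP, and the names USEFUL_RANGES and in_useful_range are hypothetical.

```python
# Ranges marked "Yes" or "Yes?" in the table above (code points, inclusive).
USEFUL_RANGES = [
    (0x0041, 0x007A),  # Basic Latin
    (0x00C0, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x0370, 0x03FF),  # Greek
    (0x0400, 0x04FF),  # Cyrillic
    (0x0530, 0x058F),  # Armenian
    (0x1E00, 0x1EFF),  # Latin Extended Additional (marked "Yes?")
    (0x1F00, 0x1FFF),  # Greek Extended
]


def in_useful_range(ch):
    """True if the character falls in a range worth checking for case."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in USEFUL_RANGES)


print(in_useful_range("ö"))        # True  (Latin-1 Supplement)
print(in_useful_range("\u0259"))   # False (IPA Extensions)
print(in_useful_range("\u10d0"))   # False (Georgian)
```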

What about non-European languages?

According to these tables, and the documentation found so far, it seems that non-European languages do not distinguish between uppercase and lowercase letters. They therefore don't need these extensions for recognising WikiNames, and must rely on free links only. If this is a wrong assumption, please correct me.