Original problem statement:

When UTF-8 is used, the following string Chì will be mis-parsed (but chì is fine). Free links with alternative text involving certain accented characters may also misbehave for example, (Chì)?. (Since my language must use UTF-8, I have to part with Tavi. But I hope to be able to use it in the near future.) Henry H. Tan-Tenn

Debugging this problem uncovered several further problems. First, there was an issue of whether to parse for WikiNames or free links first. This bug has been fixed. Second, there was the problem of accepting some foreign characters within free links; this has also been addressed in the upcoming version 0.25 of 'Tavi.

Third, and that's what this page is about, a problem arose regarding how 'Tavi handles character encoding internally. This problem is not solved yet. But let's first clarify what the problem actually is.

As FredrikJonsson stated:

When I write a word with a leading upper case and that includes a high ASCII character (e.g åäö) that character will be replaced with a question mark that links to a new page. Använd becomes Anv?nd and Frösön becomes Fr?sön etc. Thank you for looking in to this problem!

This error is easily reproduced and can be explained as follows. Internally, Tavi uses an 8-bit coding scheme that completely disregards the charset actually in use (that is, it presumes iso-8859-1). This is in accordance with how PHP is written and interpreted. It leads to the following scenario:

A user enters 'Ansön' (that is, Ans&ouml;n using entities). By the time Tavi's parser sees it under UTF-8, the ö has been recognised as Unicode 0x00F6 and encoded as the byte sequence 0xC3 0xB6, which iso-8859-1 reads as Ã followed by ¶ (that is, &Atilde;&para;). Written using entities, Ans&ouml;n has been changed into Ans&Atilde;&para;n.

And since Ã actually is an uppercase letter, and ¶ is not recognised as a letter, Tavi (correctly according to iso-8859-1) parses this sequence as the WikiWord AnsÃ followed by the extra characters ¶n.
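The byte-level mix-up described above can be reproduced in a few lines. This is only an illustration in Python, not Tavi's PHP code; the variable names are made up for the example.

```python
# Illustration only: Tavi is written in PHP; Python is used here
# just to make the bytes visible.
word = "Ans\u00f6n"                      # 'Ansön', where ö is U+00F6

utf8_bytes = word.encode("utf-8")        # the bytes a UTF-8 page submits

# Reading the same bytes as iso-8859-1 yields one character per byte.
as_latin1 = utf8_bytes.decode("iso-8859-1")

print(" ".join(f"{b:02x}" for b in utf8_bytes))  # 41 6e 73 c3 b6 6e
print(as_latin1)                                 # AnsÃ¶n
print(as_latin1[3].isupper())                    # True: 0xC3 is Ã, an uppercase letter
print(as_latin1[4].isalpha())                    # False: 0xB6 is ¶, not a letter
```

The single character ö has become two 8-bit characters, the first of which looks like the start of a new capitalised word part.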

That said, the problem can be reformulated as follows: Tavi breaks on words starting with an uppercase letter when they contain properly encoded Unicode characters whose encoding happens to begin with a byte that is an actual uppercase letter in the iso-8859-1 encoding scheme.
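To get a feel for which characters are affected, the reformulated condition can be written as a small predicate. This is a sketch in Python for illustration, not part of Tavi; the function name is hypothetical, and it assumes the page is encoded as UTF-8.

```python
def lead_byte_is_latin1_upper(ch: str) -> bool:
    """Return True if the first UTF-8 byte of `ch`, read as
    iso-8859-1, is an uppercase letter (hypothetical helper)."""
    first_byte = ch.encode("utf-8")[:1]
    return first_byte.decode("iso-8859-1").isupper()


# ö (U+00F6) and ì (U+00EC) both start with 0xC3, which is Ã.
print(lead_byte_is_latin1_upper("\u00f6"))  # True
print(lead_byte_is_latin1_upper("\u00ec"))  # True
# € (U+20AC) starts with 0xE2, which is the lowercase â.
print(lead_byte_is_latin1_upper("\u20ac"))  # False
```

Under this check, most characters that UTF-8 encodes in two bytes are affected, which covers the accented Latin letters reported on this page. Three-byte sequences start with a byte in the range 0xE0 to 0xEF, which iso-8859-1 reads as lowercase letters, so they do not trigger this particular break.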

There are two solutions to this problem: avoid the issue, or make Tavi unicode-readable.

Avoiding the issue

For the time being, the best solution is to add $EnableWikiLinks=0 to your config.php and use only free links. As of version 0.25, Tavi then no longer cares about the mystic characters (unicode-encoded characters, that is). Free links will work, and everything looks nice and unbroken...

Making Tavi understand Unicode

This is a lot more work, but should be done sometime. There are two major issues in making Tavi work well with unicode: PHP's handling of unicode strings, and the question of upper-/lowercase letters within unicode.

Php-handling of Unicode-strings
At present there exist some multibyte additions to PHP, but I, EvenHolen, haven't gotten around to looking at them. The issues involved concern counting string lengths, matching multibyte characters, and escaping them properly within strings.
I've now looked a little further, and I don't like the looks of it. It seems that, at present, the multibyte character support is experimental and may even be removed. In addition, not all PHP installations include the needed library. For now, I therefore think this is not the road to success for using unicode within 'Tavi --EvenHolen
Upper-/lowercase letters
In the unicode scheme it's not easy to detect whether a character is an uppercase or a lowercase letter. And since this is crucial to a well-behaved wiki, it poses a large problem. Tables do exist that state which characters belong to each category (or even to the category titlecase), but as far as I know there are no PHP functions to classify or match strings against these categories. This leads to writing rather long multibyte-encoded strings to identify upper-/lowercase in character classes, and I suspect this will make identifying WikiWords a very time-consuming business.
I'm afraid this might be our best bet, but it's solvable. We do, however, need to discuss the easiest and most efficient way to do this. See TaviBugs/InternationalCharacters/CaseConversion --EvenHolen
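To illustrate what classifying by unicode category would look like, here is a minimal sketch in Python, whose standard library ships the category tables mentioned above. It is not Tavi code and not a proposal for the PHP implementation. The function names and the WikiWord rule used here (two or more runs of one uppercase letter followed by lowercase letters) are assumptions made for the example.

```python
import re
import unicodedata

# Map unicode general categories to a one-letter code.
# Titlecase (Lt) is treated like uppercase; this is an assumption.
_CODES = {"Lu": "U", "Lt": "U", "Ll": "l"}


def case_signature(word: str) -> str:
    """Replace each character by U (upper), l (lower) or x (other)."""
    return "".join(_CODES.get(unicodedata.category(ch), "x") for ch in word)


def is_wiki_word(word: str) -> bool:
    """Hypothetical rule: two or more runs of one uppercase letter
    followed by one or more lowercase letters, and nothing else.
    Assumes precomposed (NFC) input; combining marks count as 'x'."""
    return re.fullmatch(r"(?:Ul+){2,}", case_signature(word)) is not None


print(case_signature("Anv\u00e4nd"))            # Ulllll
print(is_wiki_word("Anv\u00e4nd"))              # False: one capitalised run only
print(is_wiki_word("Fr\u00f6s\u00f6nSidan"))    # True
```

The design choice is to classify decoded characters first and match on the resulting signature, so that no long multibyte character classes are needed in the pattern itself. Whether something comparable can be done efficiently in PHP is exactly the open question raised above.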

If, however, I find solutions to these issues, and I do hope I will, I will swiftly change Tavi to be compliant with unicode-encoded characters. But for now, I deeply regret that we have to resort to avoiding the issue rather than fixing it... ;-(