Monkey-UTF8: Full unicode plane support

Nobuyuki(Posted 2013) [#1]
This is a library to add basic UTF-8 parsing and display support to your Monkey projects. It can't match the convenience of the built-in String type, but (with a little effort) it can store and display any character available in Unicode, including ones on extended planes.



This lib also includes a hacked version of the last stable build of AngelFont, which has been altered to support full unicode strings.

Q: What do I need this for?
A: Monkey has some native support for Unicode strings, but its characters are limited to 16 bits. This means that only 65,536 codepoints are available, locking out the supplementary planes needed for displaying certain glyphs. Emoji (Japanese graphical emoticons) are quickly gaining support in the West, and are becoming popular among both smartphone users and the online community at large (e.g. GitHub). With this library, you can handle strings which contain emoji, or any other character whose codepoint lies beyond $FFFF.
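To illustrate what "beyond $FFFF" means in practice, here is a minimal sketch of UTF-8 decoding into a list of integer codepoints. It is written in Python rather than Monkey and is not the library's actual code; the function name decode_utf8 is made up for this example. It does not reject overlong encodings or encoded surrogates, which a stricter decoder would.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of integer codepoints (illustrative only)."""
    codepoints = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte tells us how many bytes the sequence uses.
        if lead < 0x80:
            length, cp = 1, lead
        elif lead >> 5 == 0b110:
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + length > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        # Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for b in data[i + 1:i + length]:
            if b >> 6 != 0b10:
                raise ValueError("invalid continuation byte near offset %d" % i)
            cp = (cp << 6) | (b & 0x3F)
        codepoints.append(cp)
        i += length
    return codepoints


# A four-byte sequence yields a single codepoint above $FFFF.
grinning_face = decode_utf8(b"\xF0\x9F\x98\x80")
```

Storing the result as plain integers is what lets a codepoint such as U+1F600 survive where a 16-bit character cannot hold it.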

Q: Can I retrieve a pure UTF-8 string from a TCPStream?
A: Not yet! There is no technical reason it can't be supported; I just haven't implemented it yet. Hopefully soon, if someone doesn't fork the project and beat me to it :)

Q: Then, how can I retrieve a UTF-8 string?
A: Right now, use the LoadString() function and point it to a text file. This opens a FileStream relative to your data folder and returns a UTF8String object.

Q: Okay, I want it. So, how do I build a multi-lingual font?
A: Use one of the many tools available which support the XML .fnt format, like AngelCode's BMFont. Be aware though that they don't support Unicode 6.1 yet, and hacking emoji and multiplane stuff in manually was a rather crash-laden affair for me! Bug reports, bug reports, how do I file them? That being said, all the common languages and Basic Multilingual Plane (BMP) codepoints should export nicely.

Community Challenges:

In the process of writing this code, I came across a number of issues and considerations that have the potential to be addressed through the community in a collective fashion. Therefore, I have some challenges for the community which we can try to meet if there is enough demand for it:

* Getting UTF8.LoadString to work on HTML5 without DataBuffers. There's no HTML5 support for this lib yet, and there's no reason why there shouldn't be! It's just that right now, the only method for loading a UTF-8 string is through brl.FileStream. This can be worked around with a native method by someone more skilled in JavaScript than me -- http://jsfromhell.com/classes/binary-parser looks promising, for example!

* Improving AngelFont's performance. The last stable version of AngelFont used a fixed-sized Int array to support ASCII only. My hacked version uses a hybrid array/IntMap system now, but can it be improved? Please leave your comments below!
* Improving AngelFont's source. This code's a mess - it mixes static and instance variables in the lib (AngelFont.current), probably for backwards-compatibility. The ctor is also broken and would be confusing to a newcomer to AngelFont since it uses deprecated methods, probably for the same reason! I've written a wrapper to work around these issues in the past, but perhaps it's time to give AngelFont an overhaul? Let's make SimpleTextBox and AngelFont both fully instance-oriented and compatible with each other!
* Improving SimpleTextBox's source. IntStacks are used extensively in the new UTF8 methods, and there is a lot of duplicated code in there. Can it be improved? Comment below!
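
For anyone wondering what a hybrid array/IntMap lookup like the one mentioned in the AngelFont performance challenge might look like, here is a rough sketch in Python. It is not the hacked AngelFont's code: the class name GlyphTable, the 256-entry dense range, and the use of a dict in place of an IntMap are all assumptions made for illustration.

```python
class GlyphTable:
    """Dense array for low codepoints, sparse map for everything else."""

    FAST_RANGE = 256  # assumed size of the array-backed range

    def __init__(self):
        self._dense = [None] * self.FAST_RANGE  # direct index, no hashing
        self._sparse = {}                       # stands in for an IntMap

    def set(self, codepoint, glyph):
        if 0 <= codepoint < self.FAST_RANGE:
            self._dense[codepoint] = glyph
        else:
            self._sparse[codepoint] = glyph

    def get(self, codepoint):
        # Common (ASCII/Latin-1) characters take the cheap array path.
        if 0 <= codepoint < self.FAST_RANGE:
            return self._dense[codepoint]
        # Rarer characters fall back to the map; None means "no glyph".
        return self._sparse.get(codepoint)


table = GlyphTable()
table.set(0x41, "glyph for A")
table.set(0x1F600, "glyph for grinning face")
```

The trade-off is memory versus lookup cost: widening the dense range speeds up more characters but wastes slots for fonts that only cover a few scripts.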

I've also set out a few challenges to myself, which I hope I can address in future iterations of this code:

* Improving UTF8String to support common string operations, such as Concat, Substr, Find/FindLast, Split, Format. None of this is in there yet, hopefully coming soon!
* Attempting to implement the above functions without relying too heavily on IntStacks. Finding the fastest way to allocate temp units ahead of time could be tricky and/or take a bit of effort.
* Separating out the LoadString() code from the actual parsing, so that brl.FileStream isn't a requirement, and TCPStreams can be used too.


Download:
https://github.com/nobuyukinyuu/monkey-utf8 (click the ZIP button, or clone it!)


c.k.(Posted 2013) [#2]
Wow, there's a lot of brain power in this forum. Well done!


Samah(Posted 2013) [#3]
I should probably look at making the Diddy I18N module support this, somehow.


Nobuyuki(Posted 2013) [#4]
Update: HTML5 is now supported. Separated the classes into separate files, separated the file loading/streaming methods from the parsing methods.

The class now also supports UTF-16 surrogate pair encoding/decoding -- Making me wonder if I should've called this class UTF8! But anyway, this means you can specify whether to pass these surrogate pairs directly through a monkey.String. On HTML5 specifically, this means that your browser handles the encoding automatically, which means you get full unicode in the debug output ;D
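
As a quick illustration of the surrogate pair arithmetic (a Python sketch of the standard UTF-16 rule, not this class's actual methods; the function names are invented here):

```python
def to_surrogate_pair(codepoint):
    """Split a supplementary-plane codepoint into (high, low) UTF-16 units."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints above $FFFF need a surrogate pair")
    offset = codepoint - 0x10000           # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


pair = to_surrogate_pair(0x1F600)  # (0xD83D, 0xDE00)
```

Each unit of the pair fits in 16 bits, which is why the pair can ride through a 16-bit string type unharmed.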


Midimaster(Posted 2013) [#5]
This is excellent work and really important. The world doesn't only use US-American characters, and we need UTF-8 in games! Only 14% of sold apps go to the US. That means that we ignore 86% of possible customers. I would be very happy if I could support Eastern languages in my apps.

The main problem is that we need a completely closed system. File-based and internet-based streams have to support this, so that we do not lose language information along the "transfer paths". If a customer sends me a file with translations, I have to be able to integrate it. If a customer sends characters through the app, my server has to be able to store them... and so on.

Thanks a lot for this work. I would love to use it!!! What about the licence? Can we use it freely? What about the modified AngelFont code? What about support in the future? Do you plan to use this class for a long time in your own apps?

I like AngelFont too! It is easy and small, but it is slow. Yesterday I started working on a faster workaround. Suddenly I noticed that AngelFont becomes 10 times faster if you render with SETCOLOR(255,255,255).

I created a new class, DrawPhotoText(T$,0,0,font), and now I send phrases to it. The first time, the class renders the phrase with the canvas properties SETMATRIX(0,0,0,0,0,0) and SETCOLOR(255,255,255), then "screen-grabs" the phrase into an array with READPIXELS(), corrects the alpha channel of the image, and stores everything in a big image with WRITEPIXELS(). From the second call of DrawPhotoText(T$,0,0,font) with the same phrase onwards, it paints the complete image instead of the single characters. This is 5 times faster.
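
The caching idea behind this can be sketched roughly as follows. This is Python, not the DrawPhotoText class itself; PhraseCache and the render_phrase callback are hypothetical names, and the expensive render/ReadPixels/WritePixels step is replaced by whatever the callback returns.

```python
class PhraseCache:
    """Render each (phrase, font) pair once, then reuse the stored result."""

    def __init__(self, render_phrase):
        # render_phrase is a hypothetical callback standing in for the
        # slow path: draw glyph by glyph, grab the pixels, build an image.
        self._render = render_phrase
        self._images = {}

    def get(self, phrase, font):
        key = (phrase, font)
        if key not in self._images:
            # First request: pay the full rendering cost once.
            self._images[key] = self._render(phrase, font)
        # Later requests: return the pre-rendered image directly.
        return self._images[key]


render_calls = []

def fake_render(phrase, font):
    render_calls.append(phrase)
    return "image of %r in %s" % (phrase, font)

cache = PhraseCache(fake_render)
first = cache.get("Game Over", "arial")
second = cache.get("Game Over", "arial")
```

Note that this sketch never evicts anything, so it only suits a small, fixed set of phrases; text that changes every frame (scores, timers) would keep filling the cache.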


Nobuyuki(Posted 2013) [#6]
Yes, I couldn't decide on the license, but I'd say feel free to use it for what you like. Right now, it's on the honor system, but hopefully if you make good additions, contributions are always welcome! Consider the license BSD/MIT/ZLib-like. I can't decide between them, so I didn't include anything in the source tree. Probably MIT or BSD 3-Clause, which are essentially the same thing.

I don't use read/write pixels in any projects yet, because there is no separate off-screen surface support in Monkey afaik. Many of the speed issues I can imagine being associated with AngelFont might come from the creation of local variables which could be cached. Of course, best practice is always to avoid using SetColor() on sprites on non-HWA targets, because that is usually slow!

The best improvements I can imagine being made to AngelFont would be to streamline and simplify the API itself: saner ctor defaults, less use of Globals, eliminating the alignment consts and replacing them with a float, etc. Since I don't want to create a rift in usage, I kinda don't want to fork AngelFont to implement this stuff! It depends on what people want, and what Beaker plans on doing with the project.


Nobuyuki(Posted 2013) [#7]
Now that BMFont 1.14 supports extended plane characters, I expect this lib might get a little bit more use. The included version of AngelFont is still a big hack, but I might start doing some tests with it and incorporate the changes to angelfont-tryouts ( http://github.com/nobuyukinyuu/angelfont-tryouts ) and not keep a split codebase.

Accordingly, I'm starting to clean up the lib a little bit. I'm considering replacing the FileStream stuff with DataBuffer stuff, to make it fully cross-compatible, and keep to the spirit of it being purely for UTF-8 -- The surrogate pair encoding is a nice trick to make extended planes work on HTML5 (and in debugger outputs!) but it's definitely a UCS2/UTF16 thing, and has little (if anything) to do with UTF-8.

The last update made has added Find/FindLast/Replace methods to the UTF8String class, and more string manip stuff will be forthcoming, including (hopefully) a normalization/folding function for those of you missing doing equivalence checks and sort collation with ToLower().


Aman(Posted 2014) [#8]
This is great. I was about to start porting this BlitzMax module to Monkey http://www.wieringsoftware.nl/blitz/

Thank god I did a thorough search first. This should be listed in the module registry.


Nobuyuki(Posted 2015) [#9]
Big update posted! I know it's been a while since I've visited this module, but for a while, I haven't needed it....
I decided to revisit it after starting a new project which might be a bit more text-heavy. This was done as part of the process of making a new (modern) fork of AngelFont (coming soon?) :)

Anyway, here's some of the new things you can see here:
utf8.monkey
* Full multiplatform support! No more workarounds on html5 which break full SMP compatibility.
* No more having to screw with UTF-16/UCS2 surrogate pairs! (Although you can still decode/encode these pairs for backwards-compatibility and interop with OSes that absolutely must give/take a UCS2 string....)

utf8string.monkey
* Case folding! You can now convert UTF8Strings to/from lowercase, UPPERCASE, and Title Case. Supports most languages with simple decomposition.
* Split operation -- Both Monkey style, and a new style which completely ignores nulls.
* ToString now supports autoboxing, for easier debugging. Be careful using this! If you accidentally convert from UTF8String to String and back, you might not be able to figure out why your emoji are broken...
* Trim operation -- Works nearly identically to Monkey's. Does not trim Unicode control characters beyond U+0020. Comments on how to best handle this are welcome.
* Join operation -- Join multiple UTF8Strings together
* Contains() method
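
To make the Split and Trim notes above a bit more concrete, here is a rough Python sketch operating on lists of integer codepoints. It is not the UTF8String source: the function names are invented, and reading "completely ignores nulls" as "drops empty pieces" is my assumption.

```python
def split_codepoints(text, sep, skip_empty=False):
    """Split a codepoint list on a separator sequence.

    skip_empty=False keeps empty pieces; skip_empty=True drops them
    (assumed to correspond to the style that ignores nulls).
    """
    if not sep:
        raise ValueError("separator must not be empty")
    parts, current, i = [], [], 0
    while i < len(text):
        if text[i:i + len(sep)] == sep:
            parts.append(current)
            current = []
            i += len(sep)
        else:
            current.append(text[i])
            i += 1
    parts.append(current)
    if skip_empty:
        parts = [p for p in parts if p]
    return parts


def trim_codepoints(text):
    """Strip leading/trailing codepoints at or below U+0020 only."""
    start, end = 0, len(text)
    while start < end and text[start] <= 0x20:
        start += 1
    while end > start and text[end - 1] <= 0x20:
        end -= 1
    return text[start:end]


csv = [ord(c) for c in "a,,b"]
comma = [ord(",")]
```

With this trim rule, something like U+3000 (ideographic space) is left alone, which is the open question raised in the Trim note.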

builder.monkey
* Brand new C++_Tool CaseFolding builder! To update monkey-utf8's case folding tables, simply drop in a new CaseFolding.txt and run builder.monkey. The output file may be used to update case folding mappings.
* In the future, this tool will be used to provide extended support for Turkic, decomposition of complex forms (German ß for example), and full normalization options.
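
For reference, the general shape of reading Unicode's CaseFolding.txt can be sketched like this. It is a Python illustration, not builder.monkey itself, and the output (a plain dict) is an assumption; the real tool's output format isn't described here. The input format is the one the Unicode data file uses: "code; status; mapping; # name", where statuses C and S together give simple (one-to-one) folding, F gives full multi-character folding, and T is the Turkic special case.

```python
def parse_simple_case_folding(lines):
    """Build {codepoint: folded codepoint} from CaseFolding.txt lines.

    Only statuses C (common) and S (simple) are kept, so every mapping
    is one codepoint to one codepoint. F and T entries are skipped.
    """
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 3:
            continue
        code, status, mapping = fields[0], fields[1], fields[2]
        if status in ("C", "S"):
            table[int(code, 16)] = int(mapping, 16)
    return table


sample = [
    "# CaseFolding excerpt",
    "0041; C; 0061; # LATIN CAPITAL LETTER A",
    "0049; T; 0131; # LATIN CAPITAL LETTER I",
    "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S",
    "1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S",
]
folding = parse_simple_case_folding(sample)
```

The skipped F and T lines are exactly the cases named in the bullet above: ß folding to "ss" needs a one-to-many mapping, and Turkic dotless i needs language context.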


Future features, maybe:
* UTF8StringMap<T>
* Collating and normalization for UTF8StringMap
* Full decomposition forms for case folding
* Normalization rules based on culture context, similar to .NET's CultureInfo
* Equality checking based on cultural context (insert joke here....)
* UTF8String.Format()

Full commit details: https://github.com/nobuyukinyuu/monkey-utf8/commit/bbe1a3a06f52baaf80f80698ccf58d1dae861ba5
Download in the usual place.