charCodes over 65536 not supported / wrap around


Nobuyuki(Posted 2013) [#1]
This has to do with Unicode support -- I'm not sure if this is a bug or if the functionality is simply missing, but it appears that both String.FromChar and the charCode retrieving convention StringName[index] only support the Basic Multilingual Plane ($0000-$FFFF). All codepoints $10000 and above wrap around, tested on both HTML5 and GLFW.

For example, U+1F44C, 👌 -- the OK Sign, cannot be expressed in Monkey using String.FromChar($1F44C) -- it will instead return $F44C, which is in the Private Use Area. Unicode inside the BMP appears to be supported, but other planes don't appear to be.
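
The reported result is consistent with the code point simply being truncated to its low 16 bits. A minimal Python sketch of that arithmetic follows; the helper name `truncate_to_16_bits` is made up for illustration, and this is an assumption about the mechanism, not Monkey's actual implementation.

```python
def truncate_to_16_bits(code_point):
    """Keep only the low 16 bits, as a 16-bit char type would."""
    return code_point & 0xFFFF


ok_sign = 0x1F44C                       # U+1F44C, the OK Sign
wrapped = truncate_to_16_bits(ok_sign)

print(hex(wrapped))                     # 0xf44c
# U+E000..U+F8FF is the Private Use Area inside the BMP.
print(0xE000 <= wrapped <= 0xF8FF)      # True
```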

Practical applications of extended planes would be for things like modern extensions to CJK unified ideographs, and particularly, Emoji support (which is quickly becoming a standard on both Android and iOS, even in the Western world). Wrapping the codepoints around means that with user/external data, the use of the characters in a Monkey application will produce mojibake/garbled output, no visible output, or even potentially crash the application, depending on how the implementation parses the string.

I appreciate all the strides made so far to make Monkey i18n-friendly! Hopefully, a fix (or rationale) is on the horizon :)


marksibly(Posted 2013) [#2]
This is really down to the capabilities of the underlying string class and, as far as I know, most targets only support 16 bit chars, eg:

http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
http://msdn.microsoft.com/en-us/library/vstudio/x9h8tsay.aspx

Not sure about JavaScript/ActionScript, but I'd be surprised if they were different, e.g. 32-bit.

The C++ target could theoretically use 32-bit chars, as the char type is just a typedef in Monkey, but that's not much use on its own...?

A custom string class that wrapped an array of ints (or preserved UTF8 encoding) would be one way to deal with this, but how useful it is would depend on what you wanted to use it for as you'd never be able to convert to a 'real' string.


Nobuyuki(Posted 2013) [#3]
oi, sounds like it would be painful to "fix" / implement in a backwards-compatible way without introducing overhead or new issues, unfortunately. I don't have much experience with "wide" chars from the coding-side of things, but it appears the different targets handle it in quite different ways. JS and Java both appear to support surrogate pairs, which supporting would probably require new implementations of things like String.Length, etc. I have no idea about C++ or the other targets. (this is what I could find on the JS side, though. An implementation including length, split, etc. is further down: http://stackoverflow.com/questions/3744721/javascript-strings-outside-of-the-bmp )
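
For reference, this is how a surrogate pair is formed in UTF-16, and why length counts go wrong on 16-bit string targets. The sketch is in Python purely for illustration; `to_surrogate_pair` is a hypothetical helper, not part of Monkey or of any target's string class.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 (high, low) units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F44C)
print(hex(high), hex(low))                   # 0xd83d 0xdc4c

# One visible character, but two 16-bit units: this is the count a
# 16-bit string class reports as its length.
utf16_units = len("\U0001F44C".encode("utf-16-le")) // 2
print(utf16_units)                           # 2
```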

This issue came up for me when I was testing the AngelFont module, actually. There's nothing in there that would necessarily prevent Plane >0 characters from being displayed, if the code point is specified in the font's metadata, but it comes down to being able to retrieve the code point from the string itself. Of course, a custom string class could resolve it, but would introduce new/different problems in the process...


marksibly(Posted 2013) [#4]
One option here would be:

* Add support for surrogates to the UTF decoders/loaders (where possible).

* Make String.ToChars:Int[]() 'surrogate aware'. This would probably slow it down marginally, as it'd have to do a 'calculate length' pass first. A faster ToChars( int[] ) version could also be added though.

* Ignore the language length/slice/split/index issues.

* Add a support module for slower length/slice/split/index.

This way, a 3rd party module can at least choose to handle surrogates, either by using String.ToChars, or by using the support module.
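
A rough sketch of what a 'surrogate aware' ToChars could look like, including the 'calculate length' pass. It is written in Python for illustration only; the function names are hypothetical, the input is assumed to be a list of 16-bit units, and passing unpaired surrogates through unchanged is a design choice made here, not something specified above.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def starts_pair(units, i):
    """True if units[i] and units[i + 1] form a valid surrogate pair."""
    return (is_high_surrogate(units[i])
            and i + 1 < len(units)
            and is_low_surrogate(units[i + 1]))


def count_code_points(units):
    """First pass: work out how many code points the output needs."""
    count = 0
    i = 0
    while i < len(units):
        i += 2 if starts_pair(units, i) else 1
        count += 1
    return count


def to_chars(units):
    """Second pass: fill a preallocated array with decoded code points."""
    out = [0] * count_code_points(units)
    i = 0
    n = 0
    while i < len(units):
        if starts_pair(units, i):
            out[n] = (0x10000
                      + ((units[i] - 0xD800) << 10)
                      + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out[n] = units[i]   # BMP char, or an unpaired surrogate as-is
            i += 1
        n += 1
    return out


units = [0x48, 0x69, 0xD83D, 0xDC4C]         # "Hi" followed by U+1F44C
print([hex(c) for c in to_chars(units)])     # ['0x48', '0x69', '0x1f44c']
```

The version that takes a caller-supplied int array would skip the first pass, which is where the speed difference mentioned above would come from.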


Nobuyuki(Posted 2013) [#5]
Hello!

I spent most of last night attempting to come up with an interim solution to this. Having an intrinsic type for UTF Strings would be a "good enough" solution for most purposes, especially if the predominant text rendering libs for Monkey supported it. It could probably be done in a way superior to the way I've implemented it -- as a pure Monkey class containing an array of Ints which stores the codepoints; not very efficient, but it does work. If supported by the language, support for concat and splitting via the normal operators and token parsing would be possible, which isn't the case for my code. I'm presuming they could also be stored/loaded into memory a bit more efficiently, since Monkey only has signed Ints.
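
To give an idea of the shape of such a class, here is a minimal sketch in Python. It is not the module's actual code: the class name `CodePointString` and all of its methods are hypothetical, and concat and split are plain methods because operators are not available for this in Monkey.

```python
class CodePointString:
    """A string stored as a list of integer code points."""

    def __init__(self, code_points):
        self.code_points = list(code_points)

    @classmethod
    def from_text(cls, text):
        # Python strings are indexed by code point, so ord() is enough here.
        return cls(ord(ch) for ch in text)

    def __len__(self):
        return len(self.code_points)

    def __getitem__(self, index):
        return self.code_points[index]

    def concat(self, other):
        return CodePointString(self.code_points + other.code_points)

    def split(self, separator):
        """Split on a single separator code point."""
        parts = []
        current = []
        for cp in self.code_points:
            if cp == separator:
                parts.append(CodePointString(current))
                current = []
            else:
                current.append(cp)
        parts.append(CodePointString(current))
        return parts


s = CodePointString.from_text("ok \U0001F44C")
print(len(s))          # 4
print(hex(s[3]))       # 0x1f44c
```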

Anyway, I've got a thread forthcoming on this in User Modules soon enough; I used brl.FileStream to load UTF strings in, but there is no HTML5 support for that module, so the solution isn't ideal. There are binary parsers for JS online, so I know it can probably be done, but I'm hoping I can hand this off by popping the work up on GitHub and having the community address some of it maybe! Still, surrogate types built into Monkey sound like a better solution, at least until we get something crazy like operator overload support ;)

Edit: I've posted my module, thoughts, and progress on it here: http://www.monkeycoder.co.nz/Community/posts.php?topic=4878