UTF-8 Seems Wrong
| ||
I'm using ToUTF8String here, but I get similar results with text streams:

    SuperStrict
    Framework BRL.StandardIO
    Import BRL.Retro

    ' Convert a string literal to a null-terminated UTF-8 buffer
    Local b:Byte Ptr = "äöå".ToUTF8String()
    Local i:Int
    ' Print every byte up to the terminating zero
    While b[i]
        ' Hex() returns 8 digits; keep only the last two
        Print Hex(b[i])[6..]
        i:+1
    Wend

Output:

    C3 83 C2 A4 C3 83 C2 B6 C3 83 C2 A5

Expected:

    C3 A4 C3 B6 C3 A5

So there are two incorrect bytes in the middle of each of these characters, AFAICT. I'm unsure how that's even possible, as ToUTF8String seems to output at most 3 bytes per character. Unless it is messed up in the compilation step... (I'm using 1.38 on Linux.)
| ||
I can't remember if 1.38 had the "final" fix of the UTF-8 code, or if it was 1.37... or if 1.38 still needs to be patched...
| ||
Maybe the problem is on my side, since I can't even seem to post the correct characters here?

Edit: Also, I get a String.length of 6 for 3 characters.
| ||
Related to this I'd guess. I get the "expected" result on Mac.
| ||
> I can't even seem to post the correct characters here?

That's a forum "feature" :-)

    SuperStrict
    Framework BRL.StandardIO
    Import BRL.Retro

    ' Build the string from character codes so the source file encoding does not matter
    Local s:String = Chr(228) + Chr(246) + Chr(229)
    Local b:Byte Ptr = s.ToUTF8String()
    Local i:Int
    ' Print every byte up to the terminating zero
    While b[i]
        Print Hex(b[i])[6..]
        i:+1
    Wend
| ||
...but it should (hopefully) work now, since I use it all the time.
| ||
Output on Linux :-)

    Building untitled1
    Compiling:untitled1.bmx
    flat assembler version 1.68 (32768 kilobytes memory)
    3 passes, 1662 bytes.
    Linking:untitled1.debug
    Executing:untitled1.debug
    C3 A4 C3 B6 C3 A5
    Process complete

:-) Are you sure you have the correct... everything?
| ||
I just tried your Chr'ed version and that works. So it seems to be something in the compilation phase, but I have no idea where along the line... Here is what the compiled .s file says:

    _12:
        dd  bbStringClass
        dd  2147483647
        dd  6
        dw  195,164,195,182,195,165
        align 4

And the Chr version:

    _12:
        dd  bbStringClass
        dd  2147483647
        dd  3
        dw  228,246,229
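The first dump accounts for the original output exactly: if each of the six values 195,164,195,182,195,165 is treated as a character of its own and then UTF-8 encoded, the result is the 12-byte sequence from the first post. A minimal sketch of that arithmetic, written in Python only so it can be checked outside BlitzMax; the helper name `utf8_hex` is made up for illustration and is not a BlitzMax function:

```python
def utf8_hex(char_codes):
    """UTF-8 encode a list of character codes and return the bytes as hex."""
    text = "".join(chr(code) for code in char_codes)
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


# Character codes taken from the two .s dumps in the thread.
misread_literal = [195, 164, 195, 182, 195, 165]  # length 6
chr_version = [228, 246, 229]                     # length 3

print(utf8_hex(misread_literal))  # C3 83 C2 A4 C3 83 C2 B6 C3 83 C2 A5
print(utf8_hex(chr_version))      # C3 A4 C3 B6 C3 A5
```

This suggests ToUTF8String itself behaves correctly and that the literal is already wrong by the time it reaches the .s file.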
| ||
Interesting... I Printed s from the Chr version, then copied those characters into a new String for s... and ran it again. It works both ways for me. This is the .s with the proper String s:

    _3:
        dd  bbStringClass
        dd  2147483647
        dd  3
        dw  228,246,229
        align 4

Which looks right. I'm using my GTK-built MaxIDE though, which is UTF-8 friendly. I can't vouch for the FLTK one.
| ||
What should the encoding of the source files be?

Edit: Mine seems to be proper UTF-8 according to Gedit.

Edit 2: Yes, I verified it by printing out the bytes.
| ||
I resaved the file with Gedit using the Latin-1 encoding (ISO/IEC 8859-1) and that works correctly, i.e. I get the output I was originally expecting. I think one of the following is true:

1. The encoding for .bmx files should be Latin-1, and there is a bug in MaxIDE on Linux.
2. The encoding for .bmx files should be UTF-8, and there is a bug in the compiler on Linux.

I'm not sure which, though.
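The same Latin-1 resave can be done outside the editor. A minimal Python sketch, assuming the source only contains characters that exist in Latin-1; the function name `reencode` is hypothetical and not part of any BlitzMax tool:

```python
def reencode(data: bytes, src: str = "utf-8", dst: str = "latin-1") -> bytes:
    """Decode bytes from one encoding and encode them in another.

    Raises UnicodeEncodeError if a character cannot be represented in dst.
    """
    return data.decode(src).encode(dst)


utf8_source = 'Local s:String = "äöå"'.encode("utf-8")
latin1_source = reencode(utf8_source)

# The three letters just before the closing quote are now one byte each.
letters = latin1_source[-4:-1]
print(" ".join(f"{byte:02X}" for byte in letters))  # E4 F6 E5
```

For a real file, the bytes would be read in binary mode, passed through `reencode`, and written back in binary mode. Text that mixes in characters outside Latin-1 raises an error instead of being converted silently.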
| ||
Just a follow-up (even though this thread is probably more about the UTF-8 conversion functions): UTF-16 encoding seems to work best for string literals. That makes the most sense, seeing as Max strings use 16-bit characters internally.
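To try this without relying on an editor's save dialog, the source text can be encoded as UTF-16 from a script. A minimal Python sketch; the little-endian byte order and the leading byte-order mark are assumptions, since the thread does not say which UTF-16 variant was used, and `to_utf16_le_with_bom` is a made-up helper name:

```python
import codecs


def to_utf16_le_with_bom(text: str) -> bytes:
    """Encode text as little-endian UTF-16, prefixed with a byte-order mark."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


data = to_utf16_le_with_bom("äöå")
print(" ".join(f"{byte:02X}" for byte in data))  # FF FE E4 00 F6 00 E5 00
```

After the two BOM bytes, each letter occupies a single 16-bit unit holding 228, 246, or 229, the same values as in the working string dump.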
| ||
Hi, I have a problem with the UTF/Latin/whatever encoding of a BMX source file that contains text written in various European languages. If I open the file in Gedit, I can read every string correctly. When I open it in BMX, the source code is 'truncated'. For example:

    "Questa applicazione è freeware!"   <--- OK in Gedit
    "Questa applicazione                <--- there is no closing " at the end, and some characters are hidden, only visible when you highlight the line

When I compile the application there are no errors, but the text is not displayed correctly in MaxGUI labels/gadgets.

PS: I tried to convert the .bmx file by forcing a save with a specific encoding (ISO 8859-1 / 8859-2...), but every time Gedit reports that it cannot save the file because it contains text in different encodings... I will test a different solution, loading the text directly from an external .txt file. MaxIDE (I downloaded the source code from the account section and compiled it myself) surely still has some problems with Unicode encoding under Linux.

Cheers