UTF-8 Seems Wrong

Otus(Posted 2010) [#1]
I'm using ToUTF8String here, but get similar results with text streams:
SuperStrict

Framework BRL.StandardIO
Import BRL.Retro

Local b:Byte Ptr = "äöå".ToUTF8String()

Local i:Int
While b[i]
	Print Hex(b[i])[6..]
	i:+1
Wend


Output:
C3
83
C2
A4
C3
83
C2
B6
C3
83
C2
A5


Expected:
C3
A4
C3
B6
C3
A5


So each of these characters gets two incorrect bytes in the middle, AFAICT. I'm unsure how that's even possible, as ToUTF8String seems to output at most 3 bytes per character. Unless the string is already messed up in the compilation step...

(I'm using 1.38 on Linux.)
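
The extra bytes are consistent with the text being UTF-8 encoded twice: the UTF-8 bytes of the literal are read as if each byte were one Latin-1 character, and the result is then encoded to UTF-8 again. A minimal sketch of that arithmetic, in Python rather than BlitzMax and only as an illustration. The helper names double_encode and hex_bytes are hypothetical, and that this is what actually happens inside the toolchain is an assumption:

```python
def double_encode(text: str) -> bytes:
    """Encode to UTF-8, misread those bytes as Latin-1, then encode to UTF-8 again."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("latin-1")  # every byte becomes one character
    return misread.encode("utf-8")


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


if __name__ == "__main__":
    text = "\u00e4\u00f6\u00e5"  # a-umlaut, o-umlaut, a-ring
    print(hex_bytes(text.encode("utf-8")))   # C3 A4 C3 B6 C3 A5
    print(hex_bytes(double_encode(text)))    # C3 83 C2 A4 C3 83 C2 B6 C3 83 C2 A5
```

The second line matches the "Output" listing above byte for byte, and the first matches "Expected".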


Brucey(Posted 2010) [#2]
I can't remember if 1.38 had the "final" fix of the utf8 code, or if it was 1.37... or if 1.38 still needs to be patched...


Otus(Posted 2010) [#3]
Maybe the problem is on my side, since I can't even seem to post the correct characters here?

Edit: Also, I get a String.length of 6 for 3 characters.


Brucey(Posted 2010) [#4]
Related to this I'd guess.

I get the "expected" result on Mac.


Brucey(Posted 2010) [#5]
"I can't even seem to post the correct characters here?"

That's a forum "feature" :-)

SuperStrict

Framework BRL.StandardIO
Import BRL.Retro

Local s:String = Chr(228) + Chr(246) + Chr(229)
Local b:Byte Ptr = s.ToUTF8String()

Local i:Int
While b[i]
	Print Hex(b[i])[6..]
	i:+1
Wend



Brucey(Posted 2010) [#6]
...but it should (hopefully) work now, since I use it all the time.


Brucey(Posted 2010) [#7]
Output on Linux :-)
Building untitled1
Compiling:untitled1.bmx
flat assembler  version 1.68  (32768 kilobytes memory)
3 passes, 1662 bytes.
Linking:untitled1.debug
Executing:untitled1.debug
C3
A4
C3
B6
C3
A5

Process complete

:-)

Are you sure you have the correct... everything?


Otus(Posted 2010) [#8]
I just tried your Chr'ed version and that works. So it seems to be something with the compilation phase, but no idea where in the line...

Here is what the compiled .s file says:
_12:
	dd	bbStringClass
	dd	2147483647
	dd	6
	dw	195,164,195,182,195,165
	align	4


And the Chr version:
_12:
	dd	bbStringClass
	dd	2147483647
	dd	3
	dw	228,246,229
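
The two dw lines can be compared against the same three characters. A small Python sketch, for illustration only, with hypothetical helper names: the first listing holds the UTF-8 byte values of the literal, one per 16-bit character, while the Chr version holds the actual code points. That the compiler stored each source byte as its own character is an inference from the numbers, not something confirmed in the thread.

```python
def code_points(text: str) -> list:
    """Unicode code points, one per character."""
    return [ord(ch) for ch in text]


def utf8_byte_values(text: str) -> list:
    """The individual byte values of the UTF-8 encoding."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    text = "\u00e4\u00f6\u00e5"  # a-umlaut, o-umlaut, a-ring
    print(code_points(text))        # [228, 246, 229]
    print(utf8_byte_values(text))   # [195, 164, 195, 182, 195, 165]
```

This would also explain the String.length of 6 reported earlier.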



Brucey(Posted 2010) [#9]
Interesting...

I Printed s from the Chr version, and then copied those characters into a new String for s... and ran it again. It works both ways for me.
This is the .s with the proper String s :
_3:
	dd	bbStringClass
	dd	2147483647
	dd	3
	dw	228,246,229
	align	4

Which looks right.

I'm using my GTK-built MaxIDE though, which is UTF8 friendly. I can't vouch for the FLTK one.


Otus(Posted 2010) [#10]
What should the encoding of the source files be?

Edit: Mine seems to be proper UTF-8 according to Gedit.
Edit 2: Yes, I verified it by printing out the bytes.
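
One way to double-check what an editor reports is to test whether the raw bytes decode as strict UTF-8. A rough Python sketch (the function name looks_like_utf8 is hypothetical). Note the limits of the check: pure ASCII passes under both encodings, and a pass only means the bytes are valid UTF-8, not that UTF-8 was intended.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(looks_like_utf8(b"\xc3\xa4\xc3\xb6\xc3\xa5"))  # the three letters saved as UTF-8
    print(looks_like_utf8(b"\xe4\xf6\xe5"))              # the same letters saved as Latin-1
```

For a file on disk, the bytes could be read with open(path, "rb").read(), where path stands in for the .bmx file in question.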


Otus(Posted 2010) [#11]
I resaved the file with Gedit using the Latin-1 encoding (ISO/IEC 8859-1) and that works correctly - i.e. I get the output I was originally expecting.

I think one of the following is true:

1. Encoding for .bmx files should be Latin-1. There is a bug with MaxIDE on Linux.
2. Encoding for .bmx files should be UTF-8. There is a bug with the compiler on Linux.

I'm not sure which, though.


plash(Posted 2010) [#12]
Just a follow-up (even though this thread is probably more about the UTF-8 conversion functions): UTF-16 encoding seems to work the best for string literals. It makes the most sense, seeing as Max strings use 16-bit characters internally.
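
To illustrate the 16-bit point: in UTF-16, each of the three test characters occupies exactly one 16-bit unit, so the file contents map one-to-one onto 16-bit characters. A Python sketch, with a hypothetical helper name. The little-endian byte order and the leading byte order mark are choices made for this example, and how the compiler detects a UTF-16 source file is not covered in the thread.

```python
import codecs


def utf16_le_with_bom(text: str) -> bytes:
    """Encode text as little-endian UTF-16, prefixed with a byte order mark."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


if __name__ == "__main__":
    data = utf16_le_with_bom("\u00e4\u00f6\u00e5")
    print(" ".join(f"{b:02X}" for b in data))  # FF FE E4 00 F6 00 E5 00
```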


degac(Posted 2010) [#13]
Hi
I have a problem with the UTF/Latin/whatever encoding of a BMX source file that contains text written in various European languages.
If I open the file in GEdit, I can read every string correctly.
When I open it in MaxIDE, the source code is 'truncated'.

For example:

"Questa applicazione è freeware!" <--- OK in GEdit
"Questa applicazione <--- there is no closing " at the end, and some characters are hidden, only visible when you highlight the line.

When I compile the application there are no errors, but the text is not displayed correctly in MaxGUI labels/gadgets.

PS:

I tried to 'convert' the .bmx file by forcing GEdit to save it with a specific encoding (ISO 8859-1 / 8859-2...), but every time GEdit reports that it cannot save the file because it contains text in different encodings...

I will test a different solution: loading the text directly from an external .txt file.

MaxIDE (I downloaded the source code from the account section and compiled it myself) surely still has some Unicode encoding problems under Linux.

Cheers
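
One possible reason for the save error is that the file contains characters that the chosen single-byte encoding cannot represent. A Python sketch that lists such characters (the function name unencodable_chars is hypothetical, and whether this is what GEdit was actually complaining about is an assumption):

```python
def unencodable_chars(text: str, encoding: str) -> list:
    """Return the distinct characters in text that the encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


if __name__ == "__main__":
    italian = "Questa applicazione \u00e8 freeware!"
    mixed = "\u00e8 \u0142 \u20ac"  # e-grave, l-stroke, euro sign
    print(unencodable_chars(italian, "iso-8859-1"))  # []
    print(unencodable_chars(mixed, "iso-8859-1"))    # the l-stroke and the euro sign
```

A file mixing several European languages can easily contain characters that fit no single ISO 8859 part, in which case only a Unicode encoding can hold all of it.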