Debugging unicode issue

Brucey(Posted 2014) [#1]
Hello Bug Reports section,

While debugging unicode support for my Linux GTK-based MaxIDE I came across an issue which led me to spend far too much time debugging my own code, only to discover that it was BlitzMax itself that wasn't handling the unicode data correctly.

So, the problem shows itself when you look at some unicode text in the debug tree view of the IDE. It appears as garbled rubbish.


I eventually traced this down to the Writestderr function that is called by brl.appstub when it sends the debug data to the IDE. In blitz_app.c (brl.blitz) it is simply converting the String to a CString, which is never going to work very well when the character values are bigger than a byte.

Perhaps it should be sending a UTF8 bytestream instead? At least then you'd know for sure what to do with the data at the other end - when you are reading the stream back in the IDE, for example.
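
To make the difference concrete, here is a minimal sketch in C that contrasts a one-byte-per-character conversion with a UTF-8 encoder. It is not the code from blitz_app.c: the names `narrow_to_bytes` and `utf16_to_utf8` are hypothetical, and the sketch assumes the string is held as an array of 16-bit characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: keeps only the low byte of each 16-bit character,
   so any value above 0xFF is mangled. out must have room for len bytes.
   Returns the number of bytes written (always len). */
size_t narrow_to_bytes(const uint16_t *src, size_t len, unsigned char *out)
{
    assert(src != NULL && out != NULL);
    for (size_t i = 0; i < len; i++)
        out[i] = (unsigned char)(src[i] & 0xFF);
    return len;
}

/* Hypothetical helper: encodes UTF-16 code units as UTF-8.
   out must have room for 3 * len bytes. Unpaired surrogates are
   written as U+FFFD. Returns the number of bytes written. */
size_t utf16_to_utf8(const uint16_t *src, size_t len, unsigned char *out)
{
    assert(src != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: combine with a following low surrogate. */
            if (i + 1 < len && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
                i++;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* low surrogate without a preceding high one */
        }

        if (cp < 0x80) {
            out[n++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[n++] = (unsigned char)(0xC0 | (cp >> 6));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[n++] = (unsigned char)(0xE0 | (cp >> 12));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (unsigned char)(0xF0 | (cp >> 18));
            out[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return n;
}
```

For the Cyrillic characters U+041E and U+0448, the narrowing version produces the bytes 0x1E 0x48, which no reader can turn back into the original text, while the UTF-8 version produces 0xD0 0x9E 0xD1 0x88.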

Of course, MaxIDE would need to be patched to support the incoming stream as UTF8.
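
On the receiving side, the reader would have to decode the bytes back into 16-bit characters. The following is only a sketch of that step in C, not MaxIDE code: `utf8_to_utf16` is a hypothetical name, malformed input is replaced with U+FFFD, and overlong encodings are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes UTF-8 bytes into UTF-16 code units.
   out must have room for len units. Malformed or truncated sequences
   become U+FFFD. Overlong forms are accepted (a simplification).
   Returns the number of 16-bit units written. */
size_t utf8_to_utf16(const unsigned char *src, size_t len, uint16_t *out)
{
    assert(src != NULL && out != NULL);
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned char b = src[i++];
        uint32_t cp;
        int extra; /* number of continuation bytes expected */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out[n++] = 0xFFFD; continue; } /* stray continuation byte */

        int ok = 1;
        for (int k = 0; k < extra; k++) {
            if (i >= len || (src[i] & 0xC0) != 0x80) { ok = 0; break; }
            cp = (cp << 6) | (uint32_t)(src[i++] & 0x3F);
        }

        if (!ok || cp > 0x10FFFF) {
            out[n++] = 0xFFFD;
        } else if (cp >= 0x10000) {
            /* Outside the BMP: split into a surrogate pair. */
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 + (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        } else {
            out[n++] = (uint16_t)cp;
        }
    }
    return n;
}
```

Feeding it the bytes 0xD0 0x9E 0xD1 0x88 gives back the two characters U+041E and U+0448.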

Before I posted the report, I noticed there was already a report of the same kind of problem, and indeed, if you DebugLog unicode data, it comes out the same as above.


Brucey(Posted 2014) [#2]
I should add this also affects OS X - displaying the same text in the debug tree.

Haven't tested it on Windows yet, but I'd be surprised if the result was different.


Derron(Posted 2014) [#3]
Running the code on the left (provided by Brucey by email) in Debug mode crashed on my Linux box. I had fun, as the console output was 2.3 million lines (it crashed my Thunderbird when I tried to copy and paste it, unseen, into a mail to Brucey).


I do not know whether it is directly connected BUT:

Saved a .bmx file as UTF8 without BOM:

-> MaxIDE seems to have problems handling it (it uses BlitzMax functions to load a file)


Saved that .bmx file as UTF8 with BOM:

-> Seems boom-boom-MaxIDE wants BOM.


And if you write your special chars within MaxIDE on Windows and press "save", you end up with an ISO-8859-1 file (this is another bug, I think). This behaviour was also present in UNZ's indevIDE, and he has to take care of it manually.
As German umlauts (ä ö ü) are available in ISO-8859-1, this might be related to some "automatic" use-the-least-common approach...
I added this line from Brucey:

'the russian text... forum does not allow these chars and encodes them in a bugged way (encoding the encoded characters again, showing the html-code) ...

Saved that file on Windows... the result was a UTF16-LE-encoded file.
Of course, all of this only worked if the file was properly written "with-BOM" when creating it in Linux.
If you create a NEW file in MaxIDE, paste that Cyrillic test string and save the file, it creates a with-BOM, UTF16-LE file.

If you now convert this UTF16-LE file to UTF8 (in Linux), save it and reopen it in MaxIDE on Windows, it displays properly. Resaving in Windows converts it back to UTF16-LE. Removing the BOM bombs the display in Windows MaxIDE again.
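
These observations are consistent with a loader that picks the encoding from the byte order mark alone. A minimal sketch of such a check in C is shown here; it illustrates the general technique and is not taken from MaxIDE or BlitzMax. The names `detect_bom` and `bom_encoding` are hypothetical, and UTF-32 marks are ignored.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ENC_NONE, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE } bom_encoding;

/* Hypothetical helper: inspects the first bytes of a file buffer.
   Returns the encoding announced by the BOM, or ENC_NONE if there is
   no BOM. If bom_len is not NULL it receives the number of BOM bytes
   to skip before the text starts. */
bom_encoding detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    assert(buf != NULL || len == 0);
    bom_encoding enc = ENC_NONE;
    size_t skip = 0;

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        enc = ENC_UTF8;    skip = 3;
    } else if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        enc = ENC_UTF16LE; skip = 2;
    } else if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        enc = ENC_UTF16BE; skip = 2;
    }

    if (bom_len != NULL)
        *bom_len = skip;
    return enc;
}
```

With a check like this, a UTF8 file saved without a BOM comes back as ENC_NONE, so the caller has to guess. If it guesses a single-byte encoding, the multi-byte characters end up garbled.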



bye
Ron


Grisu(Posted 2014) [#4]
Thanks for looking into this!

The issue mentioned isn't only debugmode-related and affects Windows systems as well.


Can I tell "openfile" to use UTF8, so that the simple example below works on all platforms?

Download: http://www.mediafire.com/download/aaibo3hmttuctpt/openfile_broken.zip
SuperStrict

Global sometext:String="Ошибка: Инициализация звуковой карты не удалась!" '<- Russian ;o), but you can use Serbian, Hungarian, Greek... as well.
Global file:TStream=OpenFile("garbage.txt",False,True)

If file Then
    Print "File created."
    WriteLine(file, "The next line is pure russian garbage: ")
    WriteLine(file, sometext)
    CloseFile(file)
    Print "File closed!"
EndIf



Brucey(Posted 2014) [#5]
You should be able to do something like:
SuperStrict

Global sometext:String="some unicode text"
Global file:TStream=WriteStream("utf8::garbage.txt")

If file Then
    Print "File created."
    WriteLine(file, "The next line is pure russian garbage: ")
    WriteLine(file, sometext)
    CloseFile(file)
    Print "File closed!"
EndIf

This will use the "utf8:" proto that is defined in brl.textstream to create a utf8 file.
Load it back with the same proto and the ReadStream() function.

But this isn't related to the bug report problem, just a misunderstanding of how best to use the available streams.


Grisu(Posted 2014) [#6]
Ok, thanks for the clarification.

Your example isn't working for me. Bmx creates a file named "utf8" with no bytes in it.


Brucey(Posted 2014) [#7]
Sorry, it's "utf8::filename"

two colons.


Grisu(Posted 2014) [#8]
That did it, thanks again.