UTF-8
| ||
Hiya, I just tried loading a Russian localised file saved as UTF-8 and my word-wrapping code crashes. Probably this is because Space and Carriage Return (CR) are not 32 and 13. Does anyone know what they are in UTF-8, please? Oh, and what about Line Feed (LF), which is normally 10? Thanks! |
| ||
I'm fairly sure only characters above 127 are encoded differently; codes 0 to 127 are the same in UTF-8 as in ASCII. It's been ages since I did anything like this, but when I did the Polish version of Double Top I had all sorts of bother. I recall having to find CE (Central European) versions of the fonts I used. Rather you than me. |
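The byte values are a property of UTF-8 itself rather than of BlitzMax, so they can be checked in any language. A minimal sketch in Python (used here purely for illustration; the variable names are made up for this example):

```python
# Characters in the ASCII range (0-127) encode to the same single byte in UTF-8.
ascii_bytes = {
    name: list(ch.encode("utf-8"))
    for name, ch in [("space", " "), ("CR", "\r"), ("LF", "\n")]
}
print(ascii_bytes)

# A Cyrillic letter, by contrast, becomes two bytes, both above 127.
cyrillic_bytes = list("\u0416".encode("utf-8"))  # U+0416, capital ZHE
print(cyrillic_bytes)
```

So Space, CR and LF are still 32, 13 and 10 in a UTF-8 file; only the non-ASCII letters turn into multi-byte sequences.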
| ||
I think line endings are 13 + 10 (CR LF) on Windows or just 10 (LF) on Linux and Mac OS X, depending on the platform. |
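Since the line-ending convention depends on where the file was saved, word-wrapping code is more robust if it normalises line endings before splitting. A small sketch in Python (for illustration only; `split_lines` is a hypothetical helper, not part of any library):

```python
def split_lines(text):
    """Split text on CR LF, a lone LF, or a lone CR."""
    # Collapse CR LF first so it is not counted as two separate breaks.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalised.split("\n")


print(split_lines("one\r\ntwo\nthree\rfour"))
```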
| ||
Hmm, if they are the same then there must be a problem with my code, wonder what it is? ... It's weird because with a file saved as txt it word wraps, but with UTF-8 the word wrapped text array is empty because it's failing to wrap it. So what's so badly different between the file formats that causes the error I wonder? |
| ||
Could it be the byte order mark that is confusing your program? |
| ||
What's that? Better wiki it... |
| ||
The byte order mark is a short signature at the start of the file. For UTF-16 it is 2 bytes and indicates whether the file is stored big endian (FE FF) or little endian (FF FE). UTF-8 has no byte order, but UTF-8 files often start with a 3-byte signature (EF BB BF). If you take a look at the LoadText function of BlitzMax, you'll see the exact byte order signatures. |
| ||
Ah, and that's at the start, is it? Maybe I should view the file with a hex editor... Hmm, it seems the start of the file is EF BB BF. That may be the problem, I'll check it out. Thanks. |
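EF BB BF is the UTF-8 signature, so a loader that does not skip it will see three stray bytes before the first real character. A minimal sketch of the check in Python (for illustration only; `detect_bom` is a hypothetical helper, and UTF-32 signatures are deliberately left out):

```python
import codecs


def detect_bom(data):
    """Return (encoding_name, bom_length), or (None, 0) if no known BOM is found."""
    signatures = [
        (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
        (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
        (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    ]
    for bom, name in signatures:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


raw = b"\xef\xbb\xbf" + "Привет".encode("utf-8")
encoding, skip = detect_bom(raw)
print(encoding, raw[skip:].decode(encoding))
```

The caller skips `bom_length` bytes and decodes the remainder with the reported encoding; when no signature is found it has to fall back on a default of its own choosing.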
| ||
Why can't you load the file with TextStream? It checks the BOM and interprets the loading based on that... |
| ||
TextStream reads the byte order mark as characters. I would recommend using LoadText, or a function derived from LoadText. There's no encoding-detecting text stream in BlitzMax (as far as I know). |
| ||
LoadText is the one! Thanks ziggy and thanks Brucy for the advice. So I should read the manual a bit more! |
| ||
OK, so LoadText reads the whole file into a giant string. How practical is that for big files? Also, is there a quick way of splitting it into lines (stored in an array) based on CR LF? Or do I have to write some code to chop it up? I'll write some code anyway... |
| ||
Hey, LoadText uses a stream internally, but it also does the byte order check. Open the LoadText source code and in a few minutes you'll have a working text stream with encoding recognition. I've done mine, but I can't send it to you because I'm not at the office now. |
| ||
This seems to work:

    Strict

    Graphics 800,600,0

    Global Font:TImageFont = LoadImageFont("arial.ttf", 20)

    Local File:TStream = OpenFile("Russian.txt") 'in UTF-8 format
    Global TestText $ = ""
    If File Then
        TestText = LoadText(File)
        CloseFile(File)
    EndIf

    Global Text$[]

    'Split the TestLine into an array of Lines
    Local line = 0
    While TestText<>""
        'Get the CRLF position
        Local Pos = Instr(TestText,Chr(13)+Chr(10))
        Text = Text[..Len(Text)+1] 'expand the array
        'Was the end of the text reached?
        If Pos = 0 Then
            Text[line] = Mid(TestText,1,Len(TestText)) 'Copy the rest of the text into the new array slot
            TestText = ""
        Else
            Text[line] = Mid(TestText,1,Pos-1) 'Copy the first line into the new array slot
            TestText= Mid(TestText,Pos+2,Len(TestText)) 'reduce the text we are reading from.
        EndIf
        line:+1
    Wend

    While Not KeyHit(KEY_ESCAPE)
        Cls
        SetImageFont Font
        Local y = 50
        'Print the lines from the array
        For Local i = 0 To Len(Text)-1
            DrawText Text[i],10,y
            y:+20
        Next
        DrawText Len(Text)+" lines",10,10
        Flip
    Wend
| ||
Yes, but you're loading the whole file into a string. It works, but if you want to use text streams with encoding recognition, I would recommend you just take a look at the LoadText function source code. It will be less RAM demanding when working with very large files (I think). Anyway, glad to see it works. :D |
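The streaming idea can be sketched as a reader that skips the signature once and then hands back one decoded line at a time, so the whole file never sits in memory as a single string. This is a Python illustration, not the BlitzMax LoadText code; `iter_lines` is a hypothetical helper and it assumes the file is UTF-8, with or without the EF BB BF signature:

```python
def iter_lines(path):
    """Yield the lines of a UTF-8 text file one by one, without line endings."""
    # "utf-8-sig" drops a leading EF BB BF if present; newline="" keeps the
    # original CR LF / LF / CR so they can be stripped explicitly below.
    with open(path, "r", encoding="utf-8-sig", newline="") as handle:
        for line in handle:
            yield line.rstrip("\r\n")
```

A caller would loop with `for line in iter_lines("Russian.txt"):` and wrap or draw each line as it arrives.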
| ||
Yes good advice thanks. For now I've added it to my framework as a simple function. |