UTF-8

Grey Alien(Posted 2007) [#1]
Hiya, I just tried loading a Russian localised file saved as UTF-8 and my word wrapping code crashes. Probably this is because Space and Carriage Return (CR) are not 32 and 13. Does anyone know what they are in UTF-8, please?

Oh, and what about Line Feed (LF), which is normally 10? Thanks!


GfK(Posted 2007) [#2]
I'm sure that only characters above 127 (outside the ASCII range) are encoded differently.

It's been ages since I did anything like this, but when I did the Polish version of Double Top I had all sorts of bother. I recall having to find CE (Central European) versions of the fonts I used.
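
A quick way to check the point above is to encode a sample string and look at the bytes. The sketch below uses Python rather than BlitzMax, purely for illustration, and the sample text is made up. It shows that Space, CR and LF keep the values 32, 13 and 10 in UTF-8, while each Cyrillic letter becomes two bytes that are all 128 or above.

```python
# Illustrative only: Python's built-in UTF-8 codec, not BlitzMax code.
text = "Привет мир\r\n"            # made-up sample: Russian text plus CR LF
data = text.encode("utf-8")

# Space, CR and LF keep their single-byte ASCII values in UTF-8.
assert " ".encode("utf-8") == bytes([32])
assert "\r\n".encode("utf-8") == bytes([13, 10])

# Each Cyrillic letter becomes two bytes, all of them 128 or above,
# so none of them can be mistaken for a space or a line break.
cyrillic_bytes = "Привет".encode("utf-8")
print(len("Привет"), len(cyrillic_bytes))       # characters vs bytes
print(all(b >= 128 for b in cyrillic_bytes))
print(data.endswith(b"\r\n"))
```

So code that searches for 32, 13 or 10 should still find them in a UTF-8 file; what changes is that one character is no longer always one byte.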

Rather you than me.


ziggy(Posted 2007) [#3]
I think a line break is 13 + 10 (CR LF) or just 10 (LF), depending on the platform.


Grey Alien(Posted 2007) [#4]
Hmm, if they are the same then there must be a problem with my code. I wonder what it is...

It's weird because with a file saved as plain txt it word wraps, but with UTF-8 the word-wrapped text array comes back empty because the wrap is failing. So what is so different between the file formats that it causes the error, I wonder?


ziggy(Posted 2007) [#5]
Could it be the byte order mark that is confusing your program?


Grey Alien(Posted 2007) [#6]
What's that? Better wiki it...


ziggy(Posted 2007) [#7]
The byte order mark is a short signature at the start of the file. In UTF-16 it is 2 bytes (FE FF or FF FE) and indicates whether the file is stored big endian or little endian. UTF-8 has no byte order, so there the mark is just a fixed signature. If you take a look at the LoadText function of BlitzMax, you'll see the exact byte order signatures.


Grey Alien(Posted 2007) [#8]
Ah, and that's at the start is it? Maybe I should view the file with a hex editor...Hmm, seems the start of the file is EF BB BF. That may be the problem, I'll check it out. Thanks.
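
EF BB BF is the UTF-8 byte order mark: the character U+FEFF encoded as UTF-8. The sketch below, in Python rather than BlitzMax and for illustration only, shows one way to recognise the common signatures and drop them before decoding. The names `BOMS` and `decode_with_bom` and the Latin-1 fallback are assumptions made for this example, not anything taken from the BlitzMax source.

```python
import codecs

# Signature bytes mapped to a Python codec name.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def decode_with_bom(data, default="latin-1"):
    """Decode raw file bytes, dropping a leading BOM if one is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # No BOM found: fall back to an assumed default encoding.
    return data.decode(default)

# Made-up sample: the three signature bytes followed by UTF-8 text.
raw = b"\xef\xbb\xbf" + "Привет".encode("utf-8")
print(decode_with_bom(raw))
```

If the signature is left in place, it ends up as extra characters at the start of the first line, which is worth ruling out when text handling behaves differently for the UTF-8 copy of a file.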


Brucey(Posted 2007) [#9]
Why can't you load the file with TextStream? It checks the BOM and interprets the loading based on that...


ziggy(Posted 2007) [#10]
Text Stream reads the byte order mark as characters. I would recommend using LoadText, or a function derived from LoadText. There's no encoding-detecting text stream in BlitzMax (as far as I know).


Grey Alien(Posted 2007) [#11]
LoadText is the one! Thanks ziggy, and thanks Brucey for the advice. So I should read the manual a bit more!


Grey Alien(Posted 2007) [#12]
OK so LoadText reads the whole file into a giant string. How practical is that for big files?

Also, is there a quick way of splitting it into lines (stored in an array) based on CR LF? Or do I have to write some code to chop it up? I'll write some code anyway...


ziggy(Posted 2007) [#13]
Hey, LoadText is internally using a Stream, but it does the byte order check. Open the LoadText source code and in a few minutes you'll have a working text stream with encoding recognition. I've done mine, but I can't send it to you because I'm not at the office now.


Grey Alien(Posted 2007) [#14]
This seems to work:

Strict

Graphics 800,600,0

Global Font:TImageFont = LoadImageFont("arial.ttf", 20)

Local File:TStream = OpenFile("Russian.txt") 'in UTF-8 format

Global TestText $ = ""

If File Then
	TestText = LoadText(File)
	CloseFile(File)
EndIf

Global Text$[]

'Split TestText into an array of lines
Local line = 0
While TestText<>""
	'Get the CRLF position
	Local Pos = Instr(TestText,Chr(13)+Chr(10))
	Text = Text[..Len(Text)+1] 'expand the array
	'Was the end of the text reached?
	If Pos = 0 Then
		Text[line] = Mid(TestText,1,Len(TestText)) 'Copy the rest of the text into the new array slot
		TestText = ""
	Else		
		Text[line] = Mid(TestText,1,Pos-1) 'Copy the first line into the new array slot
		TestText= Mid(TestText,Pos+2,Len(TestText)) 'reduce the text we are reading from.
	EndIf
	line:+1 	
Wend

While Not KeyHit(KEY_ESCAPE)
	Cls
	SetImageFont Font
	Local y = 50
	'Print the lines from the array
	For Local i = 0 To Len(Text)-1
		DrawText Text[i],10,y
		y:+20
	Next
	DrawText Len(Text)+" lines",10,10
	Flip
Wend
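
The loop above re-copies the remaining text after every line, which gets slower as the file grows. As a rough alternative, the same split can be done in a single pass. The sketch below is in Python rather than BlitzMax, for illustration only, and `split_lines` is a hypothetical helper name. It produces the same lines as the loop above for CR LF text and also accepts a bare LF or CR as a line break.

```python
def split_lines(text):
    """Split on CR LF, also accepting a bare LF or CR as a line break."""
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = normalised.split("\n")
    # A trailing line break would otherwise leave one empty extra entry.
    if lines and lines[-1] == "":
        lines.pop()
    return lines

print(split_lines("first\r\nsecond\r\n\r\nfourth"))
```

Empty lines in the middle of the text are kept, so blank paragraph separators in a localisation file survive the split. Python's own `str.splitlines()` was avoided here because it also breaks on several other characters, such as form feed and U+2028.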



ziggy(Posted 2007) [#15]
Yes, but you're loading the whole file into a string. It works, but if you want to use text streams with encoding recognition, I would recommend you just take a look at the LoadText function source code. It will be less RAM demanding when working with very large files (I think).
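
To illustrate the streaming idea, here is a minimal sketch in Python rather than BlitzMax. It is not the LoadText-derived stream described above, only the same general approach: skip the signature once, then decode and hand back one line at a time instead of holding the whole file in a string. The helper name `iter_lines` and the demo file contents are assumptions for this example.

```python
import os
import tempfile

def iter_lines(path):
    """Yield decoded lines one at a time; "utf-8-sig" skips a leading BOM."""
    # newline="" turns off newline translation, so CR LF is stripped by hand.
    with open(path, encoding="utf-8-sig", newline="") as handle:
        for line in handle:
            yield line.rstrip("\r\n")

# Demo: write a small BOM-prefixed file (made-up content), then stream it back.
demo_path = os.path.join(tempfile.mkdtemp(), "Russian.txt")
with open(demo_path, "wb") as out:
    out.write(b"\xef\xbb\xbf" + "Привет\r\nмир\r\n".encode("utf-8"))

# list() is used only to show the result; real code would loop over iter_lines.
streamed = list(iter_lines(demo_path))
print(streamed)
```

Only one line needs to be in memory at a time, which is the saving being suggested for very large files.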

Anyway, glad to see it works. :D


Grey Alien(Posted 2007) [#16]
Yes good advice thanks.

For now I've added it to my framework as a simple function.