UTF8 support

GfK(Posted 2011) [#1]
I'm in the process of preparing my game for localisation (it's not being done yet - just futureproofing).

I've figured out that I can't load UTF8 text via streams and have instead used LoadText(). I've loaded in some garbage Russian text to test things out and it works (I think) but I'm confused on the whole Unicode thing and not sure if what I'm getting is correct. Some of the ASCII codes for each letter in the string are 'expected' (i.e. below 127), while the Russian ones are 1xxx etc - four digits?!

Is this correct? I need to update my bitmap font class to cater for this so the bottom line is - how high can that number go?? 2000? 10000? A million????


ziggy(Posted 2011) [#2]
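UTF-8 uses two or more bytes for non-Latin characters. So yes, four-digit character codes are expected for Russian text: once the string is decoded, each Cyrillic letter has a code point in the 1024-1279 range.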

From Wikipedia:

UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed “octets” in the Unicode Standard). Code points with lower numerical values (i.e., earlier code positions in the Unicode character set, which tend to occur more frequently in practice) are encoded using fewer bytes, making the encoding scheme reasonably efficient. In particular, the first 128 characters of the Unicode character set, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as the corresponding ASCII character, making valid ASCII text valid UTF-8-encoded Unicode text as well.
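
To make the byte counts above concrete, here is a rough sketch that prints the character code of each letter in a string and how many bytes UTF-8 would need for it. UTF8Length is a hypothetical helper written for this example, not part of BlitzMax, and the test string is made up.

```blitzmax
SuperStrict

' Hypothetical helper: how many bytes UTF-8 needs for one character code.
Function UTF8Length:Int(code:Int)
	If code < $80 Then Return 1      ' ASCII range
	If code < $800 Then Return 2     ' includes Cyrillic ($400-$4FF)
	If code < $10000 Then Return 3   ' rest of the 16-bit range
	Return 4
End Function

' "Hi " followed by two Cyrillic letters (codes 1055 and 1088)
Local text:String = "Hi " + Chr($41F) + Chr($440)

For Local i:Int = 0 Until text.length
	Local code:Int = text[i]         ' indexing a String gives the character code
	Print code + " needs " + UTF8Length(code) + " byte(s) in UTF-8"
Next
```

The codes printed are what a bitmap font class would have to index on; the byte counts only matter while the text is sitting in a file.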



ima747(Posted 2011) [#3]
My technique for handling UTF8 is to treat it as a data block. I've got some methods if you'd like. You have to make matching handlers for any other programming languages etc. you want to access your data from, but it's pretty easy. Not at my dev system right now, but if there's interest I'll post later.


Brucey(Posted 2011) [#4]
BlitzMax only supports up to 3-byte UTF-8... so if you need to support Chinese, you may run into issues with characters that need a 4-byte sequence. So the theoretical max character code is 65535 ($FFFF), the highest value a 3-byte UTF-8 sequence can encode.

No reason why you can't use streams.
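
As a rough sketch of the stream route: read the raw bytes, then let the string class decode them. This assumes your BlitzMax version provides String.FromUTF8String; LoadUTF8 is a made-up name for this example, and a byte order mark at the start of the file is not handled specially here.

```blitzmax
SuperStrict

' Hypothetical helper: read a whole file as UTF-8 through a stream.
Function LoadUTF8:String(path:String)
	Local stream:TStream = ReadStream(path)
	If Not stream Then Return ""

	Local size:Int = Int(stream.Size())
	' One extra byte, left at zero, acts as the terminator
	' that FromUTF8String expects.
	Local buf:Byte[] = New Byte[size + 1]
	stream.ReadBytes(buf, size)
	stream.Close()

	Return String.FromUTF8String(buf)
End Function
```

Reading with ReadLine or ReadString instead gives you one character per byte, which is why multi-byte text looks broken when loaded that way.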


ima747(Posted 2011) [#5]
Here's what I use for UTF-8, hope it helps someone.

Not sure why it's called WString but this is what evolved from my hacking up of other things to get it working so it was probably a holdover from something it was loosely based on...

' WString.bmx

SuperStrict

' Writes text as a 32-bit character count followed by one 16-bit value per character.
Function WriteWString(stream:TStream, text:String)
	Local buff:Short Ptr = text.ToWString()	' null-terminated copy, freed below
	WriteInt(stream, Len(text))
	For Local onPos:Int = 0 Until Len(text)
		WriteShort(stream, buff[onPos])
	Next
	MemFree(buff)
End Function

' Reads a string stored by WriteWString.
Function ReadWString:String(stream:TStream)
	Local length:Int = ReadInt(stream)
	Local buff:Short[length + 1]	' extra element stays zero as the terminator
	For Local onPos:Int = 0 Until length
		buff[onPos] = ReadShort(stream)
	Next
	Local text:String = String.FromWString(buff)
	Return text
End Function



Brucey(Posted 2011) [#6]
ima747, that's not UTF-8 :-)

That's UTF-16/UCS-2, a 2-byte character sequence, which is BlitzMax's native string format.
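
For comparison, a UTF-8 version of the same length-prefixed idea might look roughly like this. It is only a sketch: it assumes ToUTF8String and String.FromUTF8String are available in your BlitzMax version, the function names are made up, and a string containing a zero character would be cut short at that point.

```blitzmax
SuperStrict

' Hypothetical UTF-8 counterpart: 32-bit byte count, then the UTF-8 bytes.
Function WriteUTF8String(stream:TStream, text:String)
	Local buf:Byte Ptr = text.ToUTF8String()   ' null-terminated copy, freed below
	Local count:Int = 0
	While buf[count] <> 0                      ' count bytes up to the terminator
		count :+ 1
	Wend
	WriteInt(stream, count)
	stream.WriteBytes(buf, count)
	MemFree(buf)
End Function

Function ReadUTF8String:String(stream:TStream)
	Local count:Int = ReadInt(stream)
	Local buf:Byte[] = New Byte[count + 1]     ' extra zero byte as terminator
	stream.ReadBytes(buf, count)
	Return String.FromUTF8String(buf)
End Function
```

Note the prefix here is a byte count, not a character count, because a single character can take up to three bytes.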


GfK(Posted 2011) [#7]
Well if I wasn't confused before, I sure as hell am confused now.

Maybe someone could give me an idiot's guide to supporting languages such as Russian and Japanese? From the top...


therevills(Posted 2011) [#8]
I guess this is for Magicville? :)

Russian and Japanese


I didn't do these languages for my games, I just stuck with the (mainly) euro ones :P

I'm so glad that I loaded "most" of my text from text files, it makes the process so much better - although still a pain... graphics files etc (also I missed "Options" and a few words here and there I hard coded :()


GfK(Posted 2011) [#9]
Nope - not for Magicville. It's for a generic class for future stuff and I'd like to support the 'weirdy' charsets as well as the Western European ones!


ima747(Posted 2011) [#10]
Ah, that's right, thanks Brucey! Either way it does what I need, which is to be character safe in strings that are stored (so I can store file paths and user generated strings without it exploding if someone gets fancy, or isn't running English).

Might still be applicable to someone. It's a matter of whether you need UTF-8 specifically (I can't help there) or just need real strings (UTF-16 will cover more anyway), such as perhaps with Chinese, Japanese and other spooky character sets.


Brucey(Posted 2011) [#11]
I use XML.
With half-decent tools (for example libxml), it takes care of conversion to/from UTF-8/BlitzMax for you, and also handles encoding of the data in the file correctly - so that you get out what you expect.

There's even a BaH.Locale module which provides an example of how one might implement transparent localisation, and happens to support 100+ locales.

Each to their own of course.


therevills(Posted 2011) [#12]
Nope - not for Magicville.


Sorry - I didn't read your first post fully ;)


GfK(Posted 2011) [#13]
Still digging into this - is the Microsoft Multilingual User Interface pack useful for testing language versions, rather than trying to source and fathom a 'native' version of Windows for a given language?


ima747(Posted 2011) [#14]
I use the strings in data files that include other things like image data, maps, etc., all crunched together and intended to deter people from hacking them up.

I really should get around to implementing XML in some ways though... Would make life easier when I don't want/need a proprietary file format :0)


GfK(Posted 2011) [#15]
I just tried Brucey's libXML module. Despite me being an XML virgin I got it up and running in less than five minutes by following the libXML docs that came with it.