databuffer utf8 issue
Hello, today I ran into a problem with DataBuffers and special characters like "ä":

    Strict

    Import brl.databuffer

    Function Main:Int()
        Local s:String = "ä"
        Local d:DataBuffer = New DataBuffer(s.Length())

        'UTF8
        d.PokeString(0, s)
        Print "Result UTF8:"
        Print "original: "+s
        Print "read: "+d.PeekString(0)
        Print "--------"

        'ASCII
        d.PokeString(0, s, "ascii")
        Print "Result ASCII:"
        Print "original: "+s
        Print "read: "+d.PeekString(0,"ascii")
        Return 0
    End Function

The results are:

    Result UTF8:
    original: ä
    read: ᅢ
    --------
    Result ASCII:
    original: ä
    read: ä

Any idea what the problem is?
The DataBuffer length looks wrong: you set it to 1 (the length of s), but UTF-8 needs multiple bytes to store Unicode values > 127. Try using s.Length * 3 for a 'worst case' buffer size.
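The size mismatch can be reproduced outside Monkey. Below is a minimal sketch in Python, used here only for illustration (it is not the brl.databuffer API). It compares the character count with the UTF-8 byte count and shows what is left when only as many bytes as there are characters fit into the buffer. The helper names `utf8_size` and `worst_case_size` are hypothetical, and the 3-bytes-per-character bound assumes strings are stored as 16-bit units.

```python
def utf8_size(s: str) -> int:
    """Number of bytes needed to store s as UTF-8."""
    return len(s.encode("utf-8"))


def worst_case_size(s: str) -> int:
    """Upper bound, assuming each 16-bit string unit needs at most 3 bytes."""
    utf16_units = len(s.encode("utf-16-le")) // 2
    return utf16_units * 3


s = "ä"
encoded = s.encode("utf-8")       # two bytes (0xC3 0xA4) for one character
truncated = encoded[:len(s)]      # what fits into a buffer of len(s) bytes

print(len(s), utf8_size(s), worst_case_size(s))
print(truncated)                  # only the lead byte 0xC3 is left
```

The truncated buffer no longer holds a complete UTF-8 sequence, so it cannot be decoded back to "ä", while a buffer sized by `worst_case_size` always has enough room.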
Would it be possible to extend the Length method of a string so that it accepts an encoding, e.g. Length("utf8") or Length("ascii")?
marksibly wrote: Try using s.Length * 3 for a 'worst case' buffer size.

UTF-8 encoding takes 1 to 4 bytes per code point, please see http://en.wikipedia.org/wiki/UTF-8

UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code space (1,114,112 code points minus 2,048 surrogate code points) using one to four 8-bit bytes (a group of 8 bits is known as an octet in the Unicode Standard).

k.o.g. wrote: Is it possible, that you extend the length method of a string to Length("utf8") or Length("ascii").

You could write a function for it:

    Strict

    Import brl.databuffer

    Function Utf8Length:Int(s:String)
        ' returns the byte size of Utf8 encoded string
        Local sLen:Int = s.Length()
        If sLen = 0 Then Return 0
        ' scratch buffer, large enough for one encoded character
        Local buf:DataBuffer = New DataBuffer(4)
        Local byteLen:Int = 0
        For Local i:Int = 0 To sLen-1
            ' encode a single character and inspect its lead byte
            buf.PokeString(0, s[i..i+1] )
            Local firstByte:Int = buf.PeekByte(0)
            ' the high bits of the lead byte give the sequence length
            If firstByte & %10000000 = 0
                byteLen += 1
            ElseIf firstByte & %11100000 = %11000000
                byteLen += 2
            ElseIf firstByte & %11110000 = %11100000
                byteLen += 3
            ElseIf firstByte & %11111000 = %11110000
                byteLen += 4
            Endif
        Next
        buf.Discard()
        Return byteLen
    End

    Function Main:Int()
        Local strings:String[] = [ "ä", "€", "Hallo €uro", "Thai: สวัสดี", "Burmese: မင်္ဂလာပါ", "Arabic: مرحبا", "Russian: здравствуйте", "Slovak: haló", "Chinese: 您好", "Hebrew: שלום", "Korean: 안녕하세요." ]
        For Local s:String = Eachin strings
            Print "string: " + s
            Print "character length: " + s.Length()
            Print "Utf8 byte length: " + Utf8Length( s )
            ' size the buffer by the encoded length, not the character count
            Local d:DataBuffer = New DataBuffer( Utf8Length(s) )
            d.PokeString(0, s)
            Local s2:String = d.PeekString(0)
            Print "Utf8 PeekString: " + s2
            If s = s2
                Print "comparison: correct"
            Else
                Print "comparison: >>> WRONG! <<<"
            Endif
            Print "---------------------------"
            d.Discard()
        Next
        Return 0
    End

AsciiLength = string.Length(), but you can't map most Unicode characters to ASCII anyway, so it does not make much sense to convert Unicode strings into ASCII strings.
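The function above reads the sequence length off the lead byte of each encoded character. The same count can also be derived from the code point value alone, without a scratch buffer. Here is a minimal sketch in Python, for illustration only; `utf8_length` is a hypothetical name and not part of any library. It follows the standard UTF-8 range boundaries and checks the result against Python's own encoder.

```python
def utf8_length(s: str) -> int:
    """Count UTF-8 bytes from code point ranges, without encoding."""
    total = 0
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:         # U+0000..U+007F: 1 byte
            total += 1
        elif cp < 0x800:      # U+0080..U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:    # U+0800..U+FFFF: 3 bytes
            total += 3
        else:                 # U+10000..U+10FFFF: 4 bytes
            total += 4
    return total


for sample in ["ä", "€", "Hallo €uro", "Slovak: haló", "Korean: 안녕하세요."]:
    # the range-based count should agree with the real encoder
    print(sample, len(sample), utf8_length(sample),
          utf8_length(sample) == len(sample.encode("utf-8")))
```

Python iterates over whole code points, so the 4-byte branch is reachable here. On a target that stores strings as 16-bit units, a character outside the Basic Multilingual Plane would show up as two units instead.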
I'm not sure this is what you need: using a DataBuffer with peek/poke suggests you actually want to modify things at the byte level. If you want to work at the character level without a lot of hassle, though, I wrote a utf8 library to handle this... I believe I either wrote it before brl.DataBuffer existed or was unaware of it (it uses FileStreams instead), but it shouldn't be too difficult to adapt it to use a DataBuffer. Many string operations are supported, but as Danilo said, mapping Unicode to ASCII isn't one-to-one, and character folding is a bit more involved.
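To illustrate why the Unicode-to-ASCII mapping is lossy, here is a minimal folding sketch in Python using the standard `unicodedata` module. It is not taken from the utf8 library mentioned above; `fold_to_ascii` is a hypothetical helper, and decomposing then dropping non-ASCII characters is only one naive folding strategy.

```python
import unicodedata


def fold_to_ascii(s: str) -> str:
    """Naive folding: decompose, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", s)   # "ä" becomes "a" + combining diaeresis
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(fold_to_ascii("haló"))    # accent is stripped, base letter survives
print(fold_to_ascii("€uro"))    # "€" has no ASCII decomposition and is dropped
print(repr(fold_to_ascii("您好")))  # nothing survives
```

Accented Latin letters fold to their base letter, but characters without a decomposition simply disappear, so the folded string cannot be converted back to the original.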