DataBuffer UTF-8 issue

k.o.g.(Posted 2015) [#1]
Hello,

Today I ran into a problem with DataBuffers and special characters like "ä":


Strict


Import brl.databuffer

Function Main:Int()
	Local s:String = "ä"
	
	Local d:DataBuffer = New DataBuffer(s.Length())
	'UTF8
	d.PokeString(0, s)

	Print "Result UTF8:"
	Print "original: "+s
	Print "read: "+d.PeekString(0)
	
	Print "--------"
	
	'ASCII
	d.PokeString(0, s, "ascii")
	Print "Result ASCII:"
	Print "original: "+s
	Print "read: "+d.PeekString(0,"ascii")
	Return 0
End Function


The results are:
Result UTF8:
original: ä
read: ᅢ
--------
Result ASCII:
original: ä
read: ä


Any idea what the problem is?


marksibly(Posted 2015) [#2]
The DataBuffer length looks wrong: you set it to 1 (the length of s), but UTF-8 requires multiple bytes to store Unicode values > 127. Try using s.Length * 3 for a worst-case buffer size.
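
To illustrate the sizing advice above: "ä" is U+00E4, which UTF-8 encodes as the two bytes $C3 $A4, so a 1-byte buffer cannot hold it. A minimal sketch of the worst-case allocation follows. It assumes that PokeString returns the number of bytes written and that PeekString has an overload taking a byte count; both are recollections of brl.databuffer and should be checked against your Monkey version.

```monkey
Strict

Import brl.databuffer

Function Main:Int()
	Local s:String = "ä"

	' Worst case: allow up to 3 bytes per character, as suggested above.
	Local d:DataBuffer = New DataBuffer(s.Length() * 3)

	' Assumption: PokeString returns the number of bytes it wrote.
	Local written:Int = d.PokeString(0, s)

	' Read back only the bytes that were written, not the whole buffer.
	' Assumption: a PeekString(address, count) overload exists.
	Local s2:String = d.PeekString(0, written)

	Print "bytes written: " + written
	Print "read: " + s2

	d.Discard()
	Return 0
End
```

Keeping the written byte count matters with an oversized buffer, because PeekString(0) without a count appears to read through to the end of the buffer, including any unused trailing bytes.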


k.o.g.(Posted 2015) [#3]
Would it be possible to extend the string Length method to accept an encoding, e.g. Length("utf8") or Length("ascii")?


Danilo(Posted 2015) [#4]
marksibly wrote:
Try using s.Length * 3 for a worst-case buffer size.

UTF-8 encoding uses 1 to 4 bytes per code point; please see http://en.wikipedia.org/wiki/UTF-8

UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code space (1,114,112 code points minus 2,048 surrogate code points)
using one to four 8-bit bytes (a group of 8 bits is known as an octet in the Unicode Standard).



k.o.g. wrote:
Would it be possible to extend the string Length method to accept an encoding, e.g. Length("utf8") or Length("ascii")?

You could write a function for it:
Strict

Import brl.databuffer

Function Utf8Length:Int(s:String) ' returns the byte size of Utf8 encoded string
    Local sLen:Int = s.Length()
    If sLen = 0 Then Return 0

    Local buf:DataBuffer = New DataBuffer(4)
    Local byteLen:Int    = 0
    
    For Local i:Int = 0 To sLen-1
        buf.PokeString(0, s[i..i+1] )
        Local firstByte:Int = buf.PeekByte(0)
        If     firstByte & %10000000 = 0
            byteLen += 1
        ElseIf firstByte & %11100000 = %11000000
            byteLen += 2
        ElseIf firstByte & %11110000 = %11100000
            byteLen += 3
        ElseIf firstByte & %11111000 = %11110000
            byteLen += 4
        Endif
    Next
    
    buf.Discard()
    
    Return byteLen
End


Function Main:Int()
    Local strings:String[] = [ "ä",
                               "€",
                               "Hallo €uro",
                               "Thai: สวัสดี",
                               "Burmese: မင်္ဂလာပါ",
                               "Arabic: مرحبا",
                               "Russian: здравствуйте",
                               "Slovak: haló",
                               "Chinese: 您好",
                               "Hebrew: שלום",
                               "Korean: 안녕하세요." ]

    For Local s:String = Eachin strings
        Print "string:           " + s
        Print "character length: " + s.Length()
        Print "Utf8 byte length: " + Utf8Length( s )

        Local d:DataBuffer = New DataBuffer( Utf8Length(s) )
        d.PokeString(0, s)
        
        Local s2:String = d.PeekString(0)

        Print "Utf8 PeekString:  " + s2
        
        If s = s2
            Print "comparison:       correct"
        Else
            Print "comparison:       >>> WRONG! <<<"
        Endif
    
        Print "---------------------------"
        
        d.Discard()
    Next

    Return 0
End

AsciiLength = string.Length(), but you can't map most Unicode characters to ASCII anyway,
so it does not make much sense to convert Unicode strings into ASCII strings.
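
As a possible alternative to the buffer round trip above, the byte count can be derived from each character code directly. This is only a sketch: Utf8ByteCount is a hypothetical helper name, and it assumes each string element is a 16-bit character code, so it never produces the 4-byte case (characters outside the Basic Multilingual Plane would need separate surrogate handling).

```monkey
Strict

' Hypothetical helper: counts UTF-8 bytes from character codes,
' without allocating a DataBuffer.
Function Utf8ByteCount:Int(s:String)
	Local n:Int = 0
	For Local i:Int = 0 Until s.Length()
		' Assumption: s[i] yields a 16-bit character code.
		Local c:Int = s[i] & $FFFF
		If c < $80
			n += 1	' U+0000..U+007F: one byte
		ElseIf c < $800
			n += 2	' U+0080..U+07FF: two bytes
		Else
			n += 3	' U+0800..U+FFFF: three bytes
		Endif
	Next
	Return n
End

Function Main:Int()
	Print "ä: " + Utf8ByteCount("ä")
	Print "€: " + Utf8ByteCount("€")
	Print "Hallo €uro: " + Utf8ByteCount("Hallo €uro")
	Return 0
End
```

By the UTF-8 encoding rules, "ä" (U+00E4) needs 2 bytes and "€" (U+20AC) needs 3, which gives a quick sanity check on the thresholds.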


Nobuyuki(Posted 2015) [#5]
I'm not sure this is what you need, since using a DataBuffer with peek/poke suggests you want to modify things at the byte level. But if you want to work at the character level without much hassle, I wrote a UTF-8 library to handle this. I believe I either wrote it before brl.DataBuffer existed or was unaware of it (it uses FileStreams instead), but it shouldn't be too difficult to adapt it to use a DataBuffer.

Many string operations are supported, but as Danilo said, mapping Unicode to ASCII isn't one-to-one, and character folding is a bit more involved.