Unicode big-endian stream reading bug


ziggy(Posted 2006) [#1]
If you save this file as Unicode big endian, then compile and run it, every ReadLine call reads a single character, so Unicode big endian is not handled well by BlitzMax.

Sample code:
Local theFilename:String="test.bmx"
Local mystream:TStream=OpenStream(theFilename)
While Not Eof(mystream)
  Local theString:String=ReadLine(mystream)
  Print theString
Wend 



This alternative code reproduces exactly the same bug:
Local theFilename:String="test.bmx"
Local myBaseStream:TStream=OpenStream(theFilename)
Local MyStream:TStream = BigEndianStream(myBaseStream)
While not Eof(mystream)
  Local theString:String=ReadLine(mystream)
  Print theString
Wend 


NOTE: The BlitzMax IDE saves as ANSI by default, so you will have to edit this file in Notepad or BLIde to see the bug in action. The bug applies to any Unicode big-endian text file.
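
A minimal sketch of why each call might return a single character. It is written in Python rather than BlitzMax so it can be run anywhere, and it rests on an assumption: that ReadLine works on raw bytes and stops at a newline (10) or null (0) byte. The helper `read_line_bytes` is hypothetical and only imitates that assumed behaviour. In UTF-16 big endian every ASCII character is preceded by a zero byte, so a byte-oriented reader would stop after each one.

```python
import io

def read_line_bytes(stream):
    """Imitate a byte-oriented line read (ASSUMED ReadLine behaviour):
    stop at LF or NUL, silently drop CR, map each byte to one character."""
    out = bytearray()
    while True:
        b = stream.read(1)
        if not b or b in (b"\n", b"\x00"):
            break
        if b != b"\r":
            out += b
    return out.decode("latin-1")

# "Hi\n" saved as UTF-16 big endian, with its byte order mark (FE FF):
# FE FF 00 48 00 69 00 0A
data = b"\xfe\xff" + "Hi\n".encode("utf-16-be")
stream = io.BytesIO(data)

lines = []
while stream.tell() < len(data):
    lines.append(read_line_bytes(stream))

# The first "line" is the byte order mark, then one character per call.
print(lines)
```

Under the same assumption, wrapping the stream in BigEndianStream would not help, because the line is still split on the zero bytes before any decoding happens.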


ImaginaryHuman(Posted 2006) [#2]
After editing in Notepad, wouldn't Notepad have corrupted the format?


Yan(Posted 2006) [#3]
Isn't this what TTextStreams are for?


ziggy(Posted 2006) [#4]
Yes, but there's no 'elegant' way to get the appropriate encoding of a text file (I thought it was recognized when creating the stream, but it isn't...)

The LoadText function recognizes the encoding, so why not add this feature to TTextStreams?


marksibly(Posted 2006) [#5]
Hi,


The LoadText function recognizes the encoding, so why not add this feature to TTextStreams?



The problem here is the text 'type' marker at the start of the file can be anything from 0 to 3 bytes long, making it a bit tricky to use with an arbitrary stream without either buffering the input or using a 'putback' mechanism. This is less of an issue with LoadText as it can do it messily/internally.

It's not impossible, just something we haven't got around to yet.
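
The buffering idea can be sketched as follows. This is a language-neutral illustration in Python, not the LoadText implementation or any BlitzMax API: `sniff_encoding` and `read_text` are hypothetical names, and the fallback to Latin-1 when no marker is found is an assumption. The point is that up to three bytes must be read before the marker is known, and whatever was read beyond the marker has to be handed back as the start of the text.

```python
import io

# Byte order marks and the encoding each one announces.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),   # 3-byte marker
    (b"\xfe\xff", "utf-16-be"),   # 2-byte marker
    (b"\xff\xfe", "utf-16-le"),   # 2-byte marker
]

def sniff_encoding(stream, default="latin-1"):
    """Read at most 3 bytes and return (encoding, leftover).

    `leftover` holds bytes that were consumed but are not part of the
    marker; the caller must treat them as the beginning of the text.
    This plays the role of the 'putback', so no Seek is required.
    """
    head = stream.read(3)
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding, head[len(bom):]
    return default, head  # no marker: every byte read belongs to the text

def read_text(stream):
    """Decode the whole stream using the sniffed encoding."""
    encoding, leftover = sniff_encoding(stream)
    return encoding, (leftover + stream.read()).decode(encoding)

sample = io.BytesIO(b"\xfe\xff" + "Hi".encode("utf-16-be"))
print(read_text(sample))
```

Because the leftover bytes are carried along instead of seeking back, the same approach would also work on a stream that cannot seek, provided the first read actually delivers the bytes that are available.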


ziggy(Posted 2006) [#6]
Why not add this functionality to TTextStream? I think it would be very useful for those who are not familiar with text encodings:

Type TextFileReader
	Function CreateStream:TTextStream(url:Object,DetectEncoding:Int = True,ForceEncoding:Int = 0 )
		Local Stream:TStream=ReadStream(url) 'This is the base stream
		If Stream = Null Then Return Null 'Unable to open URL
		If detectencoding = False Then
			Local TStream:TTextStream=TTextStream.Create( Stream,forceencoding)
			Return TStream
		Else
			Local TStream:TTextStream=TTextStream.Create( Stream,EncodingDetector(url))
			Return TStream
		EndIf
		
	End Function
	Function EncodingDetector:Int(url:Object)
		Local format:Int = 0  'This is to store the encoding
		Local Stream:TStream=ReadStream(url) 'This is the base stream
		If not Stream.Eof() 'check if there is data to read
			Local c:Byte=Stream.ReadByte()
			If not Stream.Eof()
				Local d:Byte=Stream.ReadByte()  'Get the second byte of the byte order mark
				If c=$fe and d=$ff  'FE FF: big endian byte order
					format=TTextStream.UTF16BE
				Else If c=$ff and d=$fe  'Little endian byte order
					format=TTextStream.UTF16LE
				Else If c=$ef and d=$bb
					If not Stream.Eof()
						Local e:Byte=Stream.ReadByte()
						If e=$bf format=TTextStream.UTF8
					EndIf
				EndIf
			EndIf
		EndIf
		Return format
	End Function
End Type


That's what I use to read from text files; it's a small adaptation of the official LoadText command.

Hope you'll find it useful.


Yan(Posted 2006) [#7]
Wouldn't you need something like...
Function EncodingDetector:Int(url:Object)
  
  ...ETC...
  
  If format = Null
    format = TTextStream.LATIN1
    Stream.Seek(0) 'this won't work on an unbuffered serial stream
  EndIf

  Return format
End Function
...too?

I presume this is what Mark was talking about when he said...
making it a bit tricky to use with an arbitrary stream without either buffering the input or using a 'putback' mechanism



ziggy(Posted 2006) [#8]
@Yan: format never gets Null; it's an Integer, and it starts with the value 0 (LATIN1).
Ah, and yes, this will only work on seekable streams, but non-seekable streams don't usually have a byte order mark at their beginning.

What about this?

Function EncodingDetector:Int(url:Object)
  
  ...ETC...
  
  If format = 0
    Try
        Stream.Seek(0) 'this will only work on a seekable stream
    Catch err$
    End Try
  EndIf

  Return format
End Function



Yan(Posted 2006) [#9]
1) In BMax Null = 0 for integers.
2) TTextStream.LATIN1 = 1
C) I think you're missing the point!


ziggy(Posted 2006) [#10]
@Yan: You're right, this code needs to start with format = 1 instead of 0. Otherwise, I think it is quite a usable routine. I will test it thoroughly and let everybody know about any problems using it.