Unicode Big Endian stream reading bug
| ||
If you save this file as Unicode big endian, compile it, and run it, every ReadLine statement reads a single character, so Unicode big-endian text is not handled well by BlitzMax. Sample code:

    Local theFilename:String="test.bmx"
    Local mystream:TStream=OpenStream(theFilename)
    While Not Eof(mystream)
        Local theString:String=ReadLine(mystream)
        Print theString
    Wend

This alternative code reproduces exactly the same bug:

    Local theFilename:String="test.bmx"
    Local myBaseStream:TStream=OpenStream(theFilename)
    Local MyStream:TStream = BigEndianStream(myBaseStream)
    While Not Eof(mystream)
        Local theString:String=ReadLine(mystream)
        Print theString
    Wend

NOTE: The BlitzMax IDE saves as ANSI by default, so you will have to edit this file in Notepad or in BLIde to see this bug in action. This bug applies to any Unicode big-endian text file.
| ||
After editing in Notepad, wouldn't Notepad have corrupted the format?
| ||
Isn't this what TTextStreams are for?
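
A minimal sketch of that suggestion, assuming test.bmx really was saved as UTF-16 big endian: the base stream is wrapped in a TTextStream with the encoding given explicitly. TTextStream.Create and TTextStream.UTF16BE are the same calls used later in this thread. The sketch does not check whether a leading byte order mark is skipped or comes through as the first character of the first line.

```blitzmax
SuperStrict

Import BRL.TextStream   ' TTextStream and its encoding constants
Import BRL.StandardIO   ' Print

' Assumption: test.bmx is stored as UTF-16 big endian.
Local base:TStream = ReadStream("test.bmx")
If Not base
	Print "Unable to open test.bmx"
	End
EndIf

' Decode the raw bytes as UTF-16BE instead of reading them one byte at a time.
Local reader:TTextStream = TTextStream.Create(base, TTextStream.UTF16BE)

While Not reader.Eof()
	Print reader.ReadLine()
Wend

reader.Close()
```

The encoding still has to be known in advance here, which is the limitation the following posts discuss.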
| ||
Yes, but there's no 'elegant' way to get the appropriate encoding of a text file (I thought it was recognized when creating the stream, but it isn't). The LoadText function recognizes the encoding, so why not add this feature to TTextStreams?
| ||
Hi,

"The LoadText function recognizes the encoding, why not adding this feature to TTextStreams?"

The problem here is that the text 'type' marker at the start of the file can be anything from 0 to 3 bytes long, which makes it a bit tricky to use with an arbitrary stream without either buffering the input or using a 'putback' mechanism. This is less of an issue with LoadText, as it can do it messily/internally.

It's not impossible, just something we haven't got around to yet.
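
To illustrate what a 'putback' mechanism could look like, here is a rough sketch, not anything from the BRL modules. TPutbackStream, its Wrap function and OpenTextWithBOM are hypothetical names made up for this example. The idea is to read up to three bytes, decide on the encoding, and then replay any bytes that turned out not to be part of a marker before handing the stream to TTextStream. It assumes TStreamWrapper forwards Read and Eof to the wrapped stream, and it falls back to LATIN1 when no marker is found.

```blitzmax
SuperStrict

Import BRL.Stream
Import BRL.TextStream
Import BRL.StandardIO

' Hypothetical helper: replays sniffed bytes, then reads from the wrapped stream.
Type TPutbackStream Extends TStreamWrapper
	Field pending:Byte[]    ' sniffed bytes that were not part of a marker
	Field pendingPos:Int    ' index of the next pending byte to replay

	Function Wrap:TPutbackStream(stream:TStream, bytes:Byte[])
		Local p:TPutbackStream = New TPutbackStream
		p.SetStream stream
		p.pending = bytes
		Return p
	End Function

	Method Eof:Int()
		If pendingPos < pending.length Then Return False
		Return Super.Eof()
	End Method

	Method Read:Int(buf:Byte Ptr, count:Int)
		Local n:Int = 0
		' Serve the put-back bytes first.
		While n < count And pendingPos < pending.length
			buf[n] = pending[pendingPos]
			n :+ 1
			pendingPos :+ 1
		Wend
		' Only touch the wrapped stream when nothing was replayed.
		If n = 0 And count > 0 Then n = Super.Read(buf, count)
		Return n
	End Method
End Type

' Hypothetical helper: opens url, sniffs for a marker, returns a text stream.
Function OpenTextWithBOM:TTextStream(url:Object)
	Local stream:TStream = ReadStream(url)
	If Not stream Then Return Null

	Local head:Byte[3]
	Local count:Int = 0
	While count < 2 And Not stream.Eof()
		head[count] = stream.ReadByte()
		count :+ 1
	Wend

	Local encoding:Int = TTextStream.LATIN1   ' fallback when no marker is found
	Local skip:Int = 0                        ' number of marker bytes to drop
	If count = 2
		If head[0] = $fe And head[1] = $ff
			encoding = TTextStream.UTF16BE
			skip = 2
		Else If head[0] = $ff And head[1] = $fe
			encoding = TTextStream.UTF16LE
			skip = 2
		Else If head[0] = $ef And head[1] = $bb And Not stream.Eof()
			head[2] = stream.ReadByte()
			count = 3
			If head[2] = $bf
				encoding = TTextStream.UTF8
				skip = 3
			EndIf
		EndIf
	EndIf

	' Bytes after the marker (or all sniffed bytes, if there was none) go back.
	Return TTextStream.Create(TPutbackStream.Wrap(stream, head[skip..count]), encoding)
End Function

Local reader:TTextStream = OpenTextWithBOM("test.bmx")
If reader
	While Not reader.Eof()
		Print reader.ReadLine()
	Wend
	reader.Close()
EndIf
```

Because nothing is ever seeked, this approach should also work on streams that cannot seek. Pos, Size and Seek on the wrapper are left as inherited and will not account for the replayed bytes.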
| ||
Why not add this functionality to TTextStream? I think it would be very useful for those who are not familiar with text encoding:

    Type TextFileReader
        Function CreateStream:TTextStream(url:Object,DetectEncoding:Int = True,ForceEncoding:Int = 0 )
            Local Stream:TStream=ReadStream(url) 'This is the base stream
            If Stream = Null Then Return Null 'Unable to open URL
            If detectencoding = False Then
                Local TStream:TTextStream=TTextStream.Create( Stream,forceencoding)
                Return TStream
            Else
                Local TStream:TTextStream=TTextStream.Create( Stream,EncodingDetector(url))
                Return TStream
            EndIf
        End Function

        Function EncodingDetector:Int(url:Object)
            Local format:Int = 0 'This is to store the encoding
            Local Stream:TStream=ReadStream(url) 'Second stream, used only for detection
            If Not Stream.Eof() 'Check if there is data to read
                Local c:Byte=Stream.ReadByte() 'First byte of the marker
                If Not Stream.Eof()
                    Local d:Byte=Stream.ReadByte() 'Second byte of the marker
                    If c=$fe And d=$ff 'FE FF: big endian byte order
                        format=TTextStream.UTF16BE
                    Else If c=$ff And d=$fe 'FF FE: little endian byte order
                        format=TTextStream.UTF16LE
                    Else If c=$ef And d=$bb
                        If Not Stream.Eof()
                            Local e:Byte=Stream.ReadByte()
                            If e=$bf format=TTextStream.UTF8 'EF BB BF: UTF-8 marker
                        EndIf
                    EndIf
                EndIf
            EndIf
            Stream.Close() 'Release the detection stream
            Return format
        End Function
    End Type

That's what I use to read from text files; it's a small adaptation of the official LoadText command. I hope you'll find it useful.
| ||
Wouldn't you need something like...

    Function EncodingDetector:Int(url:Object)
        ...ETC...
        If format = Null
            format = TTextStream.LATIN1
            Stream.Seek(0) 'this won't work on an unbuffered serial stream
        EndIf
        Return format
    End Function

...too? I presume this is what Mark was talking about when he said: "making it a bit tricky to use with an arbitrary stream without either buffering the input or using a 'putback' mechanism".
| ||
@Yan: Format never gets null; it's an Integer, and it starts with the value 0 (LATIN1). Ah, and yes, this will only work on seekable streams, but non-seekable streams don't usually have a byte order mark at their beginning. What about this?

    Function EncodingDetector:Int(url:Object)
        ...ETC...
        If format = 0
            Try
                Stream.Seek(0) 'this will only work on a seekable stream
            Catch err$
            End Try
        EndIf
        Return format
    End Function
| ||
1) In BMax, Null = 0 for integers.
2) TTextStream.LATIN1 = 1
C) I think you're missing the point!
| ||
@Yan: You're right, this code needs to start with format = 1 instead of 0. Apart from that, I think it's quite a usable routine. I will test it thoroughly and let everybody know about any problems with it.