FileType Unicode Bug on Linux?

Artemis(Posted 2009) [#1]
It seems that FileType does not work for unicode-containing files and dirs:

Tested on Ubuntu 9.10 64bit
SuperStrict

Local d:Int = ReadDir("f")
If Not d RuntimeError("cannot read f")

Repeat
	Local f:String = NextFile(d)
	If "" = f Then Exit
	If "." = f Or ".." = f Then Continue
	Print FileType("f/" + f) + " - " + f
Forever

CloseDir(d)


Original files:
txt ' just to check there is nothing wrong in the code
a ' the same as above
2×2
01. xä.mp3

Output:
1 - txt
2 - a
0 - 01. xÿ/¤.mp3
0 - 2ÿ/—2



Artemis(Posted 2009) [#2]
Could someone please confirm this, so I can be sure it is not my problem?


Brucey(Posted 2009) [#3]
It could be a Console/Print issue - concerning the display of unicode characters.


Artemis(Posted 2009) [#4]
Yeah, but that is something I don't care about.

What really matters are the ZEROS at the beginning of each line. They say that the files and folders with unicode characters in their names (which NextFile returned, so they do exist) are neither files nor folders.


Artemis(Posted 2009) [#5]
Furthermore a FileType with hardcoded file name fails, too:
SuperStrict

Local d:Int = ReadDir("f")
If Not d RuntimeError("cannot read f")

Repeat
	Local f:String = NextFile(d)
	If "" = f Then Exit
	If "." = f Or ".." = f Then Continue
	Print FileType("f/" + f) + " - " + f
Forever

CloseDir(d)

Print ""
Print ""

Print FileType("f/2×2") + " - 2×2"
Print FileType("f/01. xä.mp3") + " - 01. xä.mp3"

Output (WATCH THE FILETYPE NUMBER AT THE BEGINNING)
1 - txt
2 - a
0 - 01. xÿ/¤.mp3
0 - 2ÿ/—2


0 - 2×2
0 - 01. xä.mp3



Brucey(Posted 2009) [#6]
It turns out that bbStringFromUTF8String() (in blitz_string.c) on Linux is not working as expected - in several places.

For values greater than 127,
c=*p++

is returning negative values. Which is kind of odd... One might expect a char to be 0-255. So, on Linux, a char is signed.. or?

Anyhoo, changing it to this, fixes it :
c=*p++ &0xff

...and the same for other similar parts.
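To make the sign problem concrete, here is a minimal sketch. The two helper names are made up for illustration and are not part of blitz_string.c; the first one uses signed char explicitly so that it reproduces the Linux behaviour on any platform.

```c
#include <stdio.h>

/* Hypothetical helper: reads one byte the way the original loop did on
   Linux, where plain char is signed. The explicit signed char cast forces
   that behaviour so the sketch does the same thing on every platform. */
static int read_byte_signed(const char *p) {
    return (signed char)*p;
}

/* Hypothetical helper: reads one byte with the mask suggested above,
   which always yields a value in the range 0..255. */
static int read_byte_masked(const char *p) {
    return *p & 0xff;
}

/* Prints both readings of the first byte of s. */
static void show_first_byte(const char *s) {
    printf("signed: %d  masked: %d\n", read_byte_signed(s), read_byte_masked(s));
}
```

Calling show_first_byte("\xc3\x97") (the UTF-8 bytes of "×") shows a negative number for the signed reading. A negative value passes the c<128 test, so the old loop treated every byte of a multi-byte sequence as a plain ASCII character.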

Also, the strlen() value is being used as the character count to set the BB string size... but strlen() counts bytes, so it is too large whenever the string contains characters > 127. So instead of
str=bbStringFromShorts( d, n );

it should be
str=bbStringFromShorts( d, q-d );

, which is the number of shorts actually written: the difference between the end pointer and the start pointer of the new data.
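The gap between the byte count and the decoded length can be sketched like this. Both function names are hypothetical, and the count assumes well-formed UTF-8 with no characters outside the 16-bit range (which the real routine rejects anyway).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of 16-bit units the decoder will produce.
   Every byte that is not a continuation byte (10xxxxxx) starts a new
   character, so counting those gives the decoded length. */
static size_t count_utf16_units(const char *p) {
    size_t units = 0;
    for (; *p; ++p) {
        if ((*p & 0xc0) != 0x80) ++units;
    }
    return units;
}

/* Hypothetical helper: how many surplus shorts a string sized with
   strlen() would claim beyond what was actually decoded. */
static size_t overcount(const char *p) {
    return strlen(p) - count_utf16_units(p);
}
```

For "2×2" the bytes are 32 C3 97 32, so strlen() gives 4 while only 3 shorts are written. Sizing the string with n would include one short of uninitialised malloc() memory at the end.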

So, the function wants to be something a bit like this now :
BBString *bbStringFromUTF8String( const char *p ){
	int c,n;
	short *d,*q;
	BBString *str;

	if( !p ) return &bbEmptyString;
	
	n=strlen(p);
	d=(short*)malloc( n*2 );
	q=d;
	
	while( c=*p++ &0xff){
		if( c<128 ){
			*q++=c;
		}else{
			int d=*p++ & 0xff;
			if( c<224 ){
				*q++=(c-192)*64+(d-128);
			}else{
				int e=*p++ & 0x3f;
				if( c<0xf0 ){
					*q++=((c&15)<<12) | (d<<6) | e;
				}else{
					int f=*p++ & 0x3f;
					int v=((c&7)<<18) | (d<<12) | (e<<6) | f;
					if( v & 0xffff0000 ) bbExThrowCString( "Unicode character out of UCS-2 range" );
					*q++=v;
				}
			}
		}
	}
	str=bbStringFromShorts( d, q-d );
	free( d );
	return str;
}


But note, that for the first two parts (which I've fixed here), I've not used the same convention as Mark did (where he &0x3f the value, then | it on the end of the calculation). Some of the math, I just pulled straight from my equivalent Max UTF8 conversion function (in bah.libxml).
Mark will probably be able to fix this properly... but in my test, it seems to work - on linux.

It *should* also be compatible with the other platforms, but I haven't had time to test it there - it took long enough to work out where things were going wrong as it was :-p


HTH

:o)


marksibly(Posted 2009) [#7]
Hi,

Nice find(s) Brucey!

And yes, chars are 'signed' on Linux, 'unsigned' on Win32/MacOS.

The C and C++ languages leave the 'signedness' of chars up to the compiler implementation, probably for the sake of the good old days when there may have been more overhead involved in sign extending versus zero extending (or vice versa) chars to ints. These days, it just feels like yet another 'hole' in the languages.

Changing the arg to 'const unsigned char *p' would fix it, but I'd rather just leave the prototype as is and bung in an '&'.
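For comparison, the unsigned-pointer route might look like the sketch below. It keeps a const char * prototype and reads through an unsigned alias inside the function. The names char_is_signed and byte_at are invented for this example; CHAR_MIN comes from the standard limits.h header.

```c
#include <limits.h>

/* Returns 1 if plain char is signed with this compiler, 0 otherwise.
   CHAR_MIN is 0 when char is unsigned and negative when it is signed. */
static int char_is_signed(void) {
    return CHAR_MIN < 0;
}

/* Hypothetical helper: the prototype still takes const char *, but the
   byte is read through an unsigned char pointer, so the result is always
   0..255 regardless of how the compiler treats plain char. */
static int byte_at(const char *p, int i) {
    const unsigned char *u = (const unsigned char *)p;
    return u[i];
}
```

Either approach gives the same values. The mask is a smaller change to an existing loop, while the cast documents the intent once at the top of the function.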

Just one comment: your routine looks a little wrong in the 3 and 4 byte sequence cases, as you're using 'd' without masking out the top 2 bits, ie: the "& 0xff" for 'd' should be "& 0x3f" (and then you won't need the -128 from d later).
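A small sketch of the difference the mask makes in the 3 byte case. The two function names are hypothetical; each takes the three bytes of one sequence and applies the same arithmetic as the routine above, differing only in the mask on the second byte.

```c
/* Hypothetical helper: 3-byte decode with the second byte masked by 0xff,
   as in the routine posted above. The 10 marker bits of the continuation
   byte leak into the result. */
static int decode3_ff(const unsigned char *s) {
    int c = s[0], d = s[1] & 0xff, e = s[2] & 0x3f;
    return ((c & 15) << 12) | (d << 6) | e;
}

/* Hypothetical helper: the same decode with the second byte masked by
   0x3f, which keeps only its 6 payload bits. */
static int decode3_3f(const unsigned char *s) {
    int c = s[0], d = s[1] & 0x3f, e = s[2] & 0x3f;
    return ((c & 15) << 12) | (d << 6) | e;
}
```

For the bytes E4 B8 AD (U+4E2D) the 0xff version yields 0x6E2D and the 0x3f version yields 0x4E2D. Some inputs hide the problem: for E2 82 AC (U+20AC) the stray bit happens to land on a bit that is already set, so both versions agree.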

So I think all the BlitzMax version needs is an '& 0xff' added to the assignment in the while(), and the length fix.

And I can't believe length has been wrong all this time - ouch! Thought we tested this stuff in DevTeam...


Brucey(Posted 2009) [#8]
"your routine looks a little wrong in the 3 and 4 byte sequence cases, as you're using 'd' without masking out the top 2 bits"

I know.. my brain is slush today - too much of my day spent pretending to be an Oracle DBA... so I just grabbed some code from my 'max version, which meant I didn't have to think too much.

"And I can't believe length has been wrong all this time - ouch! Thought we tested this stuff in DevTeam"

Maybe it's only Linux that shows up the problem - sometimes.
It was only apparent on one particular string I was testing, while all the others looked fine. And if I added some extra chars to that string, it (apparently) came out okay.


Artemis(Posted 2009) [#9]
Thanks for this!

When can I expect the next release with this issue solved?

Can I use Brucey's code? I didn't get whether the issues Mark mentioned might break it?


Artemis(Posted 2010) [#10]
Sadly the 1.37 release does not seem to have fixed this…


Brucey(Posted 2010) [#11]
It doesn't? I suppose I should update to 1.37 and have a look...


Artemis(Posted 2010) [#12]
I did a diff of the old and new blitz_string.c file and noticed two changes
	while( c=*p++ &0xff){
and
	str=bbStringFromShorts( d, q-d );
as far as I remember, but it still does not work.

Link dead - Download this and try out the test.bmx file. Watch the number at the beginning of the output lines.


marksibly(Posted 2010) [#13]
Hi,

Crap, sorry, my fault still.

This should work (please confirm!):
BBString *bbStringFromUTF8String( const char *p ){
	int c,n;
	short *d,*q;
	BBString *str;

	if( !p ) return &bbEmptyString;
	
	n=strlen(p);
	d=(short*)malloc( n*2 );
	q=d;
	
	while( c=*p++ & 0xff ){
		if( c<0x80 ){
			*q++=c;
		}else{
			int d=*p++ & 0x3f;
			if( c<0xe0 ){
				*q++=((c&31)<<6) | d;
			}else{
				int e=*p++ & 0x3f;
				if( c<0xf0 ){
					*q++=((c&15)<<12) | (d<<6) | e;
				}else{
					int f=*p++ & 0x3f;
					int v=((c&7)<<18) | (d<<12) | (e<<6) | f;
					if( v & 0xffff0000 ) bbExThrowCString( "Unicode character out of UCS-2 range" );
					*q++=v;
				}
			}
		}
	}
	str=bbStringFromShorts( d,q-d );
	free( d );
	return str;
}
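For anyone who wants to try the decoding logic outside the BlitzMax runtime, here is a standalone sketch of the same arithmetic. It is an adaptation, not the code from blitz_string.c: the name utf8_to_ucs2, the caller-supplied buffer and the (size_t)-1 error return are all assumptions made for this example, replacing malloc(), bbStringFromShorts() and bbExThrowCString(). Like the routine above, it assumes well-formed input and does not validate continuation bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes NUL-terminated UTF-8 into 16-bit units. Returns the number of
   units written, or (size_t)-1 if a character falls outside the 16-bit
   range or the output buffer (cap units) is too small. */
static size_t utf8_to_ucs2(const char *p, uint16_t *out, size_t cap) {
    size_t n = 0;
    int c;
    while ((c = *p++ & 0xff) != 0) {
        int v;
        if (c < 0x80) {                      /* 1 byte: plain ASCII */
            v = c;
        } else {
            int d = *p++ & 0x3f;             /* payload of 2nd byte */
            if (c < 0xe0) {                  /* 2 byte sequence */
                v = ((c & 31) << 6) | d;
            } else {
                int e = *p++ & 0x3f;         /* payload of 3rd byte */
                if (c < 0xf0) {              /* 3 byte sequence */
                    v = ((c & 15) << 12) | (d << 6) | e;
                } else {                     /* 4 byte sequence */
                    int f = *p++ & 0x3f;
                    v = ((c & 7) << 18) | (d << 12) | (e << 6) | f;
                    if (v & 0xffff0000) return (size_t)-1;
                }
            }
        }
        if (n == cap) return (size_t)-1;     /* no room left */
        out[n++] = (uint16_t)v;
    }
    return n;
}
```

With a 16 element buffer, the file name "2×2" from the first post (bytes 32 C3 97 32) decodes to the three units 0x32, 0xD7, 0x32, and the return value is the count that corresponds to q-d in the routine above.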

Between Brucey and myself, surely we'll get this right one day!


Artemis(Posted 2010) [#14]
Yeah thanks.
That fixes it.