FileType Unicode Bug on Linux?
| ||
It seems that FileType does not work for unicode-containing files and dirs. Tested on Ubuntu 9.10 64bit:

    SuperStrict
    Local d:Int = ReadDir("f")
    If Not d RuntimeError("cannot read .")
    Repeat
        Local f:String = NextFile(d)
        If "" = f Then Exit
        If "." = f Or ".." = f Then Continue
        Print FileType("f/" + f) + " - " + f
    Forever
    CloseDir(d)

Original files:

    txt           ' just to check there is nothing wrong in the code
    a             ' the same as above
    2×2
    01. xä.mp3

Output:

    1 - txt
    2 - a
    0 - 01. xÿ/¤.mp3
    0 - 2ÿ/—2
| ||
Could someone please confirm this, so I can be sure it is not my problem? |
| ||
It could be a Console/Print issue - concerning the display of unicode characters. |
| ||
Yeah, but that is something I don't care about. What is really important are the ZEROS at the beginning: they mean that FileType reports the files and folders with unicode characters as neither files nor folders, even though NextFile returned them and they therefore exist.
| ||
Furthermore, a FileType with a hardcoded file name fails, too:

    SuperStrict
    Local d:Int = ReadDir("f")
    If Not d RuntimeError("cannot read .")
    Repeat
        Local f:String = NextFile(d)
        If "" = f Then Exit
        If "." = f Or ".." = f Then Continue
        Print FileType("f/" + f) + " - " + f
    Forever
    CloseDir(d)
    Print ""
    Print ""
    Print FileType("f/2×2") + " - 2×2"
    Print FileType("f/01. xä.mp3") + " - 01. xä.mp3"

Output (WATCH THE FILETYPE NUMBER AT THE BEGINNING):

    1 - txt
    2 - a
    0 - 01. xÿ/¤.mp3
    0 - 2ÿ/—2


    0 - 2×2
    0 - 01. xä.mp3
| ||
It turns out that bbStringFromUTF8String() (in blitz_string.c) on Linux is not working as expected - in several places.

For values greater than 127, c=*p++ is returning negative values. Which is kind of odd... One might expect a char to be 0-255. So, on Linux, a char is signed.. or? Anyhoo, changing it to this fixes it:

    c=*p++ &0xff

...and the same for other similar parts.

Also, the strlen() value is being used as the character count to set the BB string size... which is not true for characters > 127, so instead of

    str=bbStringFromShorts( d, n );

it should be

    str=bbStringFromShorts( d, q-d );

where q-d is the difference between the starting pointer and the last pointer of the new data.

So, the function wants to be something a bit like this now:

    BBString *bbStringFromUTF8String( const char *p ){
        int c,n;
        short *d,*q;
        BBString *str;
        if( !p ) return &bbEmptyString;
        n=strlen(p);
        d=(short*)malloc( n*2 );
        q=d;
        while( c=*p++ &0xff){
            if( c<128 ){
                *q++=c;
            }else{
                int d=*p++ & 0xff;
                if( c<224 ){
                    *q++=(c-192)*64+(d-128);
                }else{
                    int e=*p++ & 0x3f;
                    if( c<0xf0 ){
                        *q++=((c&15)<<12) | (d<<6) | e;
                    }else{
                        int f=*p++ & 0x3f;
                        int v=((c&7)<<18) | (d<<12) | (e<<6) | f;
                        if( v & 0xffff0000 ) bbExThrowCString( "Unicode character out of UCS-2 range" );
                        *q++=v;
                    }
                }
            }
        }
        str=bbStringFromShorts( d, q-d );
        free( d );
        return str;
    }

But note that for the first two parts (which I've fixed here), I've not used the same convention as Mark did (where he &0x3f the value, then | it on the end of the calculation). Some of the math I just pulled straight from my equivalent Max UTF8 conversion function (in bah.libxml).

Mark will probably be able to fix this properly... but in my test, it seems to work - on Linux. It *should* also be compatible with the other platforms, but I haven't had time to test it there - it took long enough to work out where things were going wrong as it was :-p

HTH :o)
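To illustrate the length point: strlen() counts bytes, while the BB string needs the number of decoded characters. The sketch below is only an illustration, not code from blitz_string.c; utf8_code_points is a hypothetical helper, and it assumes well-formed UTF-8 input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of BlitzMax): counts the characters in a
   well-formed UTF-8 string by counting every byte that is NOT a
   continuation byte (bit pattern 10xxxxxx). */
size_t utf8_code_points(const char *p) {
    size_t n = 0;
    for (; *p; ++p) {
        /* The mask keeps only the top two bits, so this works whether
           plain char is signed or unsigned. */
        if ((*p & 0xc0) != 0x80) ++n;
    }
    return n;
}
```

For the directory name "2×2" (bytes 32 C3 97 32), strlen() gives 4 but utf8_code_points() gives 3, which is why passing n instead of q-d produces a string that is too long.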
| ||
Hi,

Nice find(s) Brucey!

And yes, chars are 'signed' on Linux, 'unsigned' on Win32/MacOS. The 'C/C++' languages leave the 'signedness' of chars up to the compiler implementation, probably for the sake of the good old days where there may have been more overhead involved in sign extending versus zero extending (or vice versa) chars to ints. These days, it just feels like yet another 'hole' in the languages.

Changing the arg to 'const unsigned char *p' would fix it, but I'd rather just leave the prototype as is and bung in an '&'.

Just one comment: your routine looks a little wrong in the 3 and 4 byte sequence cases, as you're using 'd' without masking out the top 2 bits, i.e. the "& 0xff" for 'd' should be "& 0x3f" (and then you won't need the -128 from d later).

So I think all the BlitzMax version needs is an '& 0xff' added to the assignment in the while(), and the length fix.

And I can't believe length has been wrong all this time - ouch! Thought we tested this stuff in DevTeam...
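The signedness issue described above can be seen in isolation with a small sketch. The two helpers are hypothetical names made up for this illustration; whether the unmasked read comes back negative depends on the compiler and target, as noted in the comments.

```c
/* Hypothetical helpers, not part of blitz_string.c. */

/* Reads one byte through a plain char pointer with no mask. Where plain
   char is signed, bytes above 127 are sign-extended and come back
   negative. */
int byte_raw(const char *p) {
    return *p;
}

/* Same read with the '& 0xff' mask applied: the result is always in the
   range 0..255, whatever the signedness of plain char. */
int byte_masked(const char *p) {
    return *p & 0xff;
}
```

For the lead byte 0xC3 of "ä", byte_masked() returns 195 everywhere, while byte_raw() returns either 195 or -61 depending on the platform's choice for char.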
| ||
"your routine looks a little wrong in the 3 and 4 byte sequence cases, as you're using 'd' without masking out the top 2 bits"

I know.. my brain is slush today - too much of my day spent pretending to be an Oracle DBA... so I just grabbed some code from my 'max version, which meant I didn't have to think too much.

"And I can't believe length has been wrong all this time - ouch! Thought we tested this stuff in DevTeam"

Maybe it's only Linux that shows up the problem - sometimes. It was only apparent on one particular string I was testing, while all the others looked fine. And if I added some extra chars to that string, it (apparently) came out okay.
| ||
Thanks for this! When can I expect the next release with this issue solved? Can I use Brucey's code? I didn't get whether the issues Mark mentioned might break it? |
| ||
Sadly the 1.37 release does not seem to have fixed this… |
| ||
It doesn't? I suppose I should update to 1.37 and have a look... |
| ||
I did a diff of the old and new blitz_string.c file and noticed two changes:

    while( c=*p++ &0xff){

and

    str=bbStringFromShorts( d, q-d );

as far as I remember, but it does not work either.

Link dead - Download this and try out the test.bmx file. Watch the number at the beginning of the output lines.
| ||
Hi,

Crap, sorry, my fault still. This should work (please confirm!):

    BBString *bbStringFromUTF8String( const char *p ){
        int c,n;
        short *d,*q;
        BBString *str;
        if( !p ) return &bbEmptyString;
        n=strlen(p);
        d=(short*)malloc( n*2 );
        q=d;
        while( c=*p++ & 0xff ){
            if( c<0x80 ){
                *q++=c;
            }else{
                int d=*p++ & 0x3f;
                if( c<0xe0 ){
                    *q++=((c&31)<<6) | d;
                }else{
                    int e=*p++ & 0x3f;
                    if( c<0xf0 ){
                        *q++=((c&15)<<12) | (d<<6) | e;
                    }else{
                        int f=*p++ & 0x3f;
                        int v=((c&7)<<18) | (d<<12) | (e<<6) | f;
                        if( v & 0xffff0000 ) bbExThrowCString( "Unicode character out of UCS-2 range" );
                        *q++=v;
                    }
                }
            }
        }
        str=bbStringFromShorts( d,q-d );
        free( d );
        return str;
    }

Between Brucey and myself, surely we'll get this right one day!
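For anyone who wants to check the decoding logic outside of BlitzMax, here is a standalone sketch of the same idea. It is an assumption-laden variant, not the code shipped in blitz_string.c: utf8_to_ucs2 is a hypothetical name, it writes into a caller-supplied buffer instead of building a BBString, it returns (size_t)-1 instead of throwing, and it does no validation of malformed input.

```c
#include <stddef.h>

/* Hypothetical standalone variant of the decoder in this thread.
   'out' must have room for at least strlen(p) elements.
   Returns the number of 16-bit units written, or (size_t)-1 if a
   character does not fit in 16 bits. Assumes well-formed UTF-8. */
size_t utf8_to_ucs2(const char *p, unsigned short *out) {
    unsigned short *q = out;
    int c;
    while ((c = *p++ & 0xff) != 0) {            /* mask: char may be signed */
        if (c < 0x80) {                          /* 1 byte: 0xxxxxxx */
            *q++ = (unsigned short)c;
        } else {
            int d = *p++ & 0x3f;                 /* continuation: low 6 bits */
            if (c < 0xe0) {                      /* 2 bytes: 110xxxxx */
                *q++ = (unsigned short)(((c & 31) << 6) | d);
            } else {
                int e = *p++ & 0x3f;
                if (c < 0xf0) {                  /* 3 bytes: 1110xxxx */
                    *q++ = (unsigned short)(((c & 15) << 12) | (d << 6) | e);
                } else {                         /* 4 bytes: 11110xxx */
                    int f = *p++ & 0x3f;
                    int v = ((c & 7) << 18) | (d << 12) | (e << 6) | f;
                    if (v & 0xffff0000) return (size_t)-1;
                    *q++ = (unsigned short)v;
                }
            }
        }
    }
    return (size_t)(q - out);                    /* units written, not bytes read */
}
```

Decoding "2×2" (bytes 32 C3 97 32) with this sketch yields 3 units with the middle one equal to 0x00D7, and "ä" (C3 A4) decodes to 0x00E4.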
| ||
Yeah thanks. That fixes it. |