Processing File Data Speed
Blitz3D Forums/Blitz3D Beginners Area/Processing File Data Speed
| ||
I've often heard it mentioned that working with banks directly is much faster than typical variables, arrays, types etc. when processing large amounts of data. Naturally, any form of memory read/write should be faster than HDD reading, but I am not sure about a specific scenario: I have a file of various data types of various lengths that cannot be easily randomly accessed to locate a specific field. (This data is NOT my creation, so re-designing the format is not an option!) I therefore need to iterate through the data, and can only do so byte by byte, searching for a specific "tag" which identifies the start of the data I need. This "tag" may not be unique, so a secondary "tag" is required which identifies the END of the data I want and thereby also confirms it is the correct instance of the data.

There's no guarantee of length, so on locating the primary "tag", I must still iterate until I reach the secondary "tag" - however, if another instance of the primary "tag" occurs first, the search for the secondary tag must recommence from this point. If the EOF is reached and there is no secondary "tag", or perhaps even no primary "tag" (which is possible), then a null result or fail can be returned.

Now, if the data as a whole is contained in a STRING (data read with ReadLine), then Instr() can quickly(?) jump to the required instances of the tags; however, string functions are notoriously slow by default. The data must also be less than 65536 bytes in length, and cannot contain 10,13, as these represent the ASCII control codes demarking End Of Line.

Alternatively, I could use ReadBytes and transfer the ENTIRE data from the file into a bank, but doing so would then require the bytes to be read individually, one by one.

So which method ought to be faster, given that the typical size of the data is around 30Kb or so? |
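[Editor's sketch] The tag search described above, including the "restart from a later primary tag" rule, could look something like this once the data is in a string. All names here are hypothetical and not from the thread; this is a sketch of the stated algorithm, not anyone's actual code:

```blitzbasic
; Return the text between primary$ and secondary$ in data$, restarting
; whenever another primary tag appears before the secondary one.
; Returns "" (a null result) if either tag is missing.
Function ExtractTagged$(data$, primary$, secondary$)
	Local start, cursor, finish, nextStart
	start = Instr(data, primary)
	If start = 0 Then Return ""          ; no primary tag at all: fail
	Repeat
		cursor = start + Len(primary)
		finish = Instr(data, secondary, cursor)
		If finish = 0 Then Return ""     ; EOF with no secondary tag: fail
		; if another primary tag occurs before the secondary one,
		; recommence the search from that later instance
		nextStart = Instr(data, primary, cursor)
		If nextStart = 0 Or nextStart > finish
			Return Mid(data, cursor, finish - cursor)
		End If
		start = nextStart
	Forever
End Function
```

Note the use of Instr's optional third parameter to continue scanning from a given offset, which avoids rebuilding substrings on each restart.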
| ||
I think you have those two reversed: ReadLine needs to examine the data byte by byte until it finds the 13,10 (CR/LF) sequence, but since ReadBytes takes a block size it can use bulk-loading techniques behind the scenes to load whole pages into memory at a time. ReadString should be fundamentally identical to ReadBytes, as a saved Blitz string has no terminator, can contain any data, and is preceded in the file by an integer size.

I think you'd do better to decide based on how you want to use the data after it's loaded: if you want to extract flexible-length substrings from the main data block, you would do better to treat the data as a string (and therefore use ReadString or ReadLine), as Mid() will make your life very easy, quickly extracting substrings. If you want to treat it as a flat array of binary objects, perhaps a bank would be better. A bank might also be better if you plan to modify the data in-place rather than reconstruct the whole thing when you need to save.

So on the whole it sounds like your best option would be ReadString (not ReadLine), assuming you have control over the way the data is written; if not, ReadBytes. (Converting a bank to a string and back again can also be fast enough for realtime use: http://www.blitzbasic.com/codearcs/codearcs.php?code=2972 )

Although to be honest I'd be surprised if loading 30Kb takes any noticeable time at all no matter the method, unless you're repeating this for hundreds of files. It sounds a bit like you're reinventing XML... perhaps also consider using an existing XML framework? |
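[Editor's sketch] The ReadBytes route mentioned above is a single bulk call. A minimal load might look like this (the file name is a placeholder):

```blitzbasic
; Load an entire file into a bank in one ReadBytes call.
Local file = ReadFile("data.bin")
If file
	Local size = FileSize("data.bin")
	Local bank = CreateBank(size)
	ReadBytes bank, file, 0, size    ; bank, stream, offset, count
	CloseFile file
End If
```

The whole file then sits in memory and can be examined with PeekByte without any further disk access.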
| ||
Thanks for the reply Yasha, I had pretty much looked at a few of your points already.

"I think you'd do better to decide based on how you want to use the data after it's loaded: if you want to extract flexible-length substrings from the main data block"

Yup, that's the main intention here.

"ReadString should be fundamentally identical to ReadBytes, as a saved Blitz string has no terminator, can contain any data, and is preceded in the file by an integer size."

ReadString will fail on this data, since it requires the data to have been written with WriteString, and takes the first two bytes it reads as a Short Integer value for the length of the string. Hence the 65535-byte limit. Therefore, ReadLine is necessary, as this will read the entire data up to the 13,10 sequence. Yes, this will be reading byte by byte 'internally', but I imagine this is faster than:

	Repeat
		n = ReadByte(filestream)
	Until ((n=10) + (n=13))

"It sounds a bit like you're reinventing XML... perhaps also consider using an existing XML framework?"

I didn't actually know what XML does or how it's used, but as mentioned, unfortunately I am not responsible for (nor have any hand in) creating these files, so I have no control over the format they are in.

"ReadBytes. (Converting a bank to a string and back again can also be fast enough for realtime use)"

This concept also crossed my mind: reading the entire block, then converting it to a string in order to process it with Instr and Mid etc. However, isn't this effectively reading the entire data twice???

If, as you generally seem to be alluding to, the time for whichever procedure is fast enough to not really be an issue (i.e. a second or two tops?), then I think I'll be good just reading the entirety as a string via ReadBytes! Thank you! |
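[Editor's sketch] The bank-to-string conversion discussed here can be done naively as below; it is a hypothetical helper, and the linked code-archive entry presumably does the same job more efficiently. Note that the second pass happens entirely in memory, so it is much cheaper than re-reading the file:

```blitzbasic
; Copy every byte of a bank into a Blitz string so that Instr()/Mid()
; can be used on it afterwards.
Function BankToString$(bank)
	Local s$ = ""
	Local i
	For i = 0 To BankSize(bank) - 1
		s = s + Chr(PeekByte(bank, i))
	Next
	Return s
End Function
```

Be aware that repeated string concatenation like this is quadratic in the data size, which is exactly why the archive entry linked above exists; for a one-off 30Kb load it is still likely to be acceptable.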
| ||
Instr and Mid are very slow. Instead, you'd be better off making your own functions that peek into a bank directly. Not a big deal - just make a substitute function for each command you'd use, that will access a bank instead of a file. |
| ||
JFK, are you suggesting that I should just use banks and create some form of function to mimic "Instr" or "Mid" that operates on the raw bytes, rather than use any strings at all? |
| ||
I would use whatever is most natural for the problem at hand - probably strings. Get it working and think about speed later, if it's a problem.

Is the file small enough to fit comfortably into memory? And also to keep an extra copy on disk? And is one character equal to one byte? If the answers are yes, then it's probably easiest to start experimenting by loading the entire file into a string, then working on it with Instr etc.

You will need to create a slightly modified version of the original file: create a new file and WriteInt the length of the original file, then read the bytes of the original file and write them to the new file. The new file can now be thought of as one big string, preceded by a four-byte value holding its length. Now ReadString on the new file will produce a string holding the original file. |
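[Editor's sketch] The conversion just described might look like this; the file names are placeholders. The length goes first as a four-byte Int, then the raw bytes, so ReadString can later return the whole file in one call:

```blitzbasic
; Build a length-prefixed copy of the original file.
Local size = FileSize("original.dat")
Local src = ReadFile("original.dat")
Local dst = WriteFile("original.str")
Local bank = CreateBank(size)
ReadBytes bank, src, 0, size     ; pull the whole original file in
WriteInt dst, size               ; four-byte length prefix
WriteBytes bank, dst, 0, size    ; then the raw bytes
CloseFile src
CloseFile dst

; Later, the entire file comes back as one string:
Local f = OpenFile("original.str")
Local data$ = ReadString(f)
CloseFile f
```

The bank is used purely as a scratch buffer here so the copy happens in two bulk calls rather than a byte-by-byte loop.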
| ||
Yes. Get the filesize, then create a bank of that size and ReadBytes the file into the bank. You may now seek for linebreaks, or parse the whole bank as one.

So, instead of If Instr(a$, "hello") you would say If myInstr(abank, hellobank), where abank is the line and hellobank the comparison string. You'd peek through abank until a byte is the same as the first byte of hellobank; if so, you'd compare the next bytes...

You could then even outsource the time-critical parts to a DLL that supports inline asm, if speed is crucial. |
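[Editor's sketch] A hypothetical myInstr along the lines described - a straightforward byte-by-byte comparison, returning a 1-based position like Instr, or 0 when the needle is absent:

```blitzbasic
; Find the bank 'needle' inside the bank 'haystack', starting from the
; (1-based) position 'start'. Returns the 1-based match position, or 0.
Function myInstr(haystack, needle, start = 1)
	Local hLen = BankSize(haystack)
	Local nLen = BankSize(needle)
	Local i, j, match
	For i = start - 1 To hLen - nLen
		match = 1
		For j = 0 To nLen - 1
			If PeekByte(haystack, i + j) <> PeekByte(needle, j)
				match = 0
				Exit             ; mismatch: slide the window along
			End If
		Next
		If match Then Return i + 1   ; full match; report 1-based like Instr
	Next
	Return 0
End Function
```

This is the naive O(n*m) search; for tags of a few bytes in a 30Kb bank that is entirely adequate, and it avoids any string allocation at all.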
| ||
One more thing about it: what's really slow is reading and writing files in many little steps. If you don't write to the same file, then the OS may optimize the speed by using its cache. Anyhow, the implementation only makes sense when you've got huge amounts of data to parse. |
| ||
I already have this concept working, I just wasn't sure if it was 'badly done'. From my worklog, you can see the principle at work here, where it actually begins scanning while the data is being read from the file byte by byte:

	Function s_WEB_FindMidStringInFile$(FilePath$, Prefix$, Suffix$)
		Local PosStart
		Local PosEnd
		Local Seek$ = cs_STRING_NULL_STRING
		Local Byte
		If (Not(FileType(FilePath)))
			Return cs_STRING_NULL_STRING
		End If
		Local File = OpenFile(FilePath)
		If (Not(File))
			Return cs_STRING_NULL_STRING
		End If
		SeekFile File, 0
		While (Not(Eof(File)))
			Byte = ReadByte(File)
			If (Byte)
				Seek = Seek + Chr(Byte)
			End If
			PosStart = Instr(Seek, Prefix)
			If (PosStart)
				PosStart = PosStart + Len(Prefix) - 1
				Seek = cs_STRING_NULL_STRING
				SeekFile(File, PosStart)
				While (Not(Eof(File)))
					Byte = ReadByte(File)
					If (Byte)
						Seek = Seek + Chr(Byte)
					End If
					PosEnd = Instr(Seek, Suffix)
					If (PosEnd)
						Exit
					End If
				Wend
				Exit
			End If
		Wend
		CloseFile File
		If (Not(PosStart * PosEnd))
			Return cs_STRING_NULL_STRING
		End If
		Seek = Trim(Left(Seek, PosEnd - 1))
		Return Seek
	End Function
|
| ||
In your position I would load the data, scan through it, and convert to a string, in three separate operations. There's no real need to interleave the operations in that way, and separation of tasks is always clearer.

Most of the time used by this function (aside from that spent reading bytes) is probably going to be spent building up the initial "seek" string that then immediately gets discarded when the opening tag is found; you could likely make this faster if you found the opening tag using byte operations.

But if it works, and it's fast enough, it's probably fine as-is. Speed doesn't matter if it's not in a loop - which file operations never should be, so there's no issue here. Don't waste your mental energy "optimising" things that aren't bottlenecks. Consider that we have already likely spent more time and energy thinking about this than would ever be saved by changing it. |
| ||
Thanks, Yasha! Great advice as always! :) |