Processing File Data Speed

Blitz3D Forums/Blitz3D Beginners Area/Processing File Data Speed

_PJ_(Posted 2013) [#1]
I've often heard it mentioned that working with banks directly is much faster than working with variables, arrays, types etc. when processing large amounts of data.
Naturally, any form of memory read/write should be faster than HDD reading, but I am not sure about a specific scenario:

I have a file containing data of various types and various lengths that cannot easily be randomly accessed to locate a specific field. (This data is NOT my creation, so redesigning the format is not an option!)

I therefore need to iterate through the data, and can only do so byte-by-byte, searching for a specific "tag" which identifies the start of the data I need.

This "tag" may not be unique, so a secondary "tag" must also be searched for, which identifies the END of the data I want and thereby confirms it is the correct instance of the data.

There's no guarantee of length, so on locating the primary "tag" I must still iterate until I reach the secondary "tag". However, if another instance of the primary "tag" occurs first, the search for the secondary tag must recommence from that point. If the EOF is reached with no secondary "tag" (or perhaps even no primary "tag", which is possible), then a null result or fail can be returned.

___

Now, if the data as a whole is contained in a STRING (data read with "ReadLine"), then Instr() can quickly(?) jump to the required instances of the tags; however, string functions are notoriously slow by default. The data must also be less than 65536 bytes in length and cannot contain bytes 10 or 13, as these are the ASCII control codes marking end-of-line.
___

Alternatively, I could use ReadBytes to transfer the ENTIRE data from the file into a bank, but doing so would then require the bytes to be read individually, one by one.

___

So which method ought to be faster, given that the typical size of the data is around 30KB or so?


Yasha(Posted 2013) [#2]
I think you have those two reversed: ReadLine needs to examine the data byte by byte until it finds the CR/LF end-of-line sequence, but since ReadBytes takes a block size, it can use bulk-loading techniques behind the scenes to load whole pages into memory at a time.

ReadString should be fundamentally identical to ReadBytes, as a saved Blitz string has no terminator, can contain any data, and is preceded in the file by an integer size.

I think you'd do better to decide based on how you want to use the data after it's loaded: if you want to extract flexible-length substrings from the main data block, you would do better to treat the data as a string (and therefore use ReadString or ReadLine), as Mid() will make your life very easy, quickly extracting substrings. If you want to treat it as a flat array of binary objects, perhaps a bank would be better. A bank might also be better if you plan to modify the data in-place rather than reconstruct the whole thing when you need to save.

So on the whole it sounds like your best option would be ReadString (not ReadLine), assuming you have control over the way the data is written; if not, ReadBytes. (Converting a bank to a string and back again can also be fast enough for realtime use: http://www.blitzbasic.com/codearcs/codearcs.php?code=2972 ) Although to be honest I'd be surprised if loading 30Kb takes any noticeable time at all no matter the method, unless you're repeating this for hundreds of files.
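For the bank route, a minimal sketch (untested; function names are my own placeholders) of loading a whole file into a bank with a single ReadBytes call, plus the naive bank-to-string conversion, might look like:

```blitzbasic
; Load an entire file into a bank in one ReadBytes call.
; Returns the bank handle, or 0 on failure.
Function LoadFileToBank(path$)
	Local size = FileSize(path)
	If size <= 0 Then Return 0
	Local file = ReadFile(path)
	If Not file Then Return 0
	Local bank = CreateBank(size)
	ReadBytes bank, file, 0, size
	CloseFile file
	Return bank
End Function

; Naive character-by-character bank-to-string conversion; the code
; archive entry linked above shows faster approaches.
Function BankToString$(bank)
	Local s$ = ""
	Local i
	For i = 0 To BankSize(bank) - 1
		s = s + Chr(PeekByte(bank, i))
	Next
	Return s
End Function
```

The single ReadBytes call is where the bulk-loading advantage comes from; the string conversion is the part worth optimising if it turns out to matter.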

It sounds a bit like you're reinventing XML... perhaps also consider using an existing XML framework?


_PJ_(Posted 2013) [#3]
Thanks for the reply Yasha, I had pretty much looked at a few of your points already.

I think you'd do better to decide based on how you want to use the data after it's loaded: if you want to extract flexible-length substrings from the main data block,

Yup, that's the main intention here.

ReadString should be fundamentally identical to ReadBytes, as a saved Blitz string has no terminator, can contain any data, and is preceded in the file by an integer size.


ReadString will fail on the data, since it requires the data to have been written with "WriteString", and takes the first two bytes it reads as a Short integer value giving the length of the string. Hence the 65535-byte limit.
Therefore, ReadLine is necessary, as this will read the entire data up to the end-of-line. Yes, this will be reading byte-by-byte 'internally', but I imagine this is faster than:
Repeat n=ReadByte(filestream) Until ((n=10)+(n=13))




It sounds a bit like you're reinventing XML... perhaps also consider using an existing XML framework?

I didn't actually know what XML does or how it's used, but as mentioned, unfortunately I am not responsible for (nor have any hand in) creating these files, so I have no control over the format they are in.

ReadBytes. (Converting a bank to a string and back again can also be fast enough for realtime use

This concept also crossed my mind: reading the entire block, then converting it to a string in order to process it with "Instr" and "Mid" etc. However, isn't this effectively processing the entire data twice???

If, as you generally seem to be alluding, the time taken by whichever procedure is fast enough not to really be an issue (i.e. a second or two tops?), then I think I'll be good just reading the entirety with ReadBytes and converting to a string!
Thank you!


jfk EO-11110(Posted 2013) [#4]
Instr and Mid are very slow. Instead, you'd do better to make your own functions that peek into a bank directly. Not a big deal: just make a substitute function for each command you'd use, one that accesses a bank instead of a file.


_PJ_(Posted 2013) [#5]
JFK, are you suggesting that I should just use banks and create some form of function to mimic "Instr" or "Mid" that operates on the raw bytes, rather than use any strings at all?


Floyd(Posted 2013) [#6]
I would use whatever is most natural for the problem at hand, probably strings. Get it working and think about speed later, if it's a problem.

Is this file small enough to fit comfortably into memory? And also to keep an extra copy on disk? And is one character equal to one byte?

If the answers are yes then it's probably easiest to start experimenting by loading the entire file into a string. Then work on it with Instr etc.

You will need to create a slightly modified version of the original file. Create a new file and WriteInt the length of the original file. Then read the bytes of the original file and write them to the new file. The new file can now be thought of as a big string, preceded by a four-byte value holding the length.

Now ReadString on the new file will produce a string holding the original file.
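A sketch of that wrapper-file idea (untested; the file names are placeholders), using CopyStream to do the bulk copy rather than a byte loop:

```blitzbasic
; Build "wrapped.dat": a four-byte length followed by the original
; bytes, so ReadString can later pull the whole file in as one string.
Local size = FileSize("original.dat")
Local inF = ReadFile("original.dat")
Local outF = WriteFile("wrapped.dat")
WriteInt outF, size
CopyStream inF, outF	; copies the remainder of inF into outF
CloseFile inF
CloseFile outF

; Reading it back: one ReadString yields the entire original file.
Local f = ReadFile("wrapped.dat")
Local data$ = ReadString(f)
CloseFile f
```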


jfk EO-11110(Posted 2013) [#7]
Yes. Get the file size, then create a bank of that size and ReadBytes the file into the bank.
You may now seek for linebreaks, or parse the whole bank as one.

So, instead of IF INSTR(a$, "hello") you would say IF myINSTR(ABANK, HELLOBANK), where abank holds the line and hellobank the comparison string. You'd peek through abank until a byte matches the first byte of hellobank; if so, you'd compare the next bytes...
You could then even outsource the time-critical parts to a DLL that supports inline asm, if speed is crucial.
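A sketch of such a myINSTR substitute (untested; the names are placeholders): it returns the 0-based offset of the needle bank inside the haystack bank, or -1 if it's absent.

```blitzbasic
; Bank-based Instr: find needleBank inside hayBank, starting the
; search at 0-based offset 'start'. Returns the offset, or -1.
Function BankInstr(hayBank, needleBank, start = 0)
	Local hayLen = BankSize(hayBank)
	Local needleLen = BankSize(needleBank)
	Local i, j, match
	For i = start To hayLen - needleLen
		match = True
		For j = 0 To needleLen - 1
			; Stop comparing at the first mismatched byte
			If PeekByte(hayBank, i + j) <> PeekByte(needleBank, j)
				match = False
				Exit
			End If
		Next
		If match Then Return i
	Next
	Return -1
End Function
```

Note this returns 0-based bank offsets rather than the 1-based positions Instr uses, so any surrounding code needs to agree on one convention.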


jfk EO-11110(Posted 2013) [#8]
One more thing about it: what's really slow is reading and writing files in many little steps. If you don't write to the same file, the OS may optimize the speed by using its cache. Anyhow, the implementation only makes sense when you've got huge amounts of data to parse.


_PJ_(Posted 2013) [#9]
I already have this concept working, I just wasn't sure if it was 'badly done'.

From my worklog, you can see the principle at work here, where it actually begins scanning whilst the data is being read from the file byte-by-byte:
; Scans FilePath byte-by-byte for Prefix, then returns the text found
; between that Prefix and the following Suffix, or a null string on failure.
Function s_WEB_FindMidStringInFile$(FilePath$,Prefix$,Suffix$)
	Local PosStart
	Local PosEnd
	Local Seek$=cs_STRING_NULL_STRING
	Local Byte
	
	If (Not(FileType(FilePath)))
		Return cs_STRING_NULL_STRING
	End If
	
	Local File=OpenFile(FilePath)
	If (Not(File))
		Return cs_STRING_NULL_STRING
	End If	
	SeekFile File,0
	
	; Accumulate bytes into Seek until the Prefix appears
	While (Not(Eof(File)))
		Byte=ReadByte(File)
		If (Byte)	; null bytes are skipped
			Seek=Seek+Chr(Byte)
		End If	
		PosStart=Instr(Seek,Prefix)
		If (PosStart)
			; Prefix found: re-seek to just past it and start collecting afresh
			PosStart=PosStart+Len(Prefix)-1
			Seek=cs_STRING_NULL_STRING
			SeekFile(File,PosStart)
			; Accumulate bytes until the Suffix appears (or EOF)
			While (Not(Eof(File)))
				Byte=ReadByte(File)
				If (Byte)
					Seek=Seek+Chr(Byte)
				End If
				PosEnd=Instr(Seek,Suffix)
				If (PosEnd)
					Exit
				End If	
			Wend
			
			Exit
		End If	
	Wend	
	
	CloseFile File
	
	; Fail unless both tags were located
	If (Not(PosStart*PosEnd))
		Return cs_STRING_NULL_STRING
	End If	
	
	; Strip the Suffix (and surrounding whitespace) from the result
	Seek=Trim(Left(Seek,PosEnd-1))
	Return Seek
End Function



Yasha(Posted 2013) [#10]
In your position I would load the data, scan through it, and convert it to a string, as three separate operations. There's no real need to interleave them like that, and separation of tasks is always clearer.

Most of the time used by this function (aside from that spent reading bytes) is probably going to be spent building up the initial "seek" string that then immediately gets discarded when the opening tag is found; you could likely make this faster if you found the opening tag using byte operations.

But if it works, and it's fast enough, it's probably fine as-is. Speed doesn't matter if it's not in a loop - which file operations never should be, so there's no issue here. Don't waste your mental energy "optimising" things that aren't bottlenecks. Consider that we have likely already spent more time and energy thinking about this than changing it would ever save.
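For reference, a sketch of the separated version (untested; the restart-on-repeated-prefix rule from the original spec is omitted for brevity): load the file once, convert to a string once, then let Instr and Mid do the searching.

```blitzbasic
; Load the whole file, convert it to one string, then extract the
; text between prefix$ and the first suffix$ after it ("" on failure).
Function ExtractTagged$(path$, prefix$, suffix$)
	Local size = FileSize(path)
	If size <= 0 Then Return ""
	Local file = ReadFile(path)
	If Not file Then Return ""
	Local bank = CreateBank(size)
	ReadBytes bank, file, 0, size
	CloseFile file
	
	; Convert the whole bank to a string in one pass
	Local data$ = ""
	Local i
	For i = 0 To size - 1
		data = data + Chr(PeekByte(bank, i))
	Next
	FreeBank bank
	
	; Scan: find the prefix, then the first suffix after it
	Local p1 = Instr(data, prefix)
	If p1 = 0 Then Return ""
	p1 = p1 + Len(prefix)
	Local p2 = Instr(data, suffix, p1)
	If p2 = 0 Then Return ""
	Return Mid(data, p1, p2 - p1)
End Function
```

Each stage can then be timed and optimised independently if any of them ever shows up as a bottleneck.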


_PJ_(Posted 2013) [#11]
Thanks, Yasha! Great advice as always! :)