md5 of big files.

BlitzMax Forums/BlitzMax Programming/md5 of big files.

Moogles(Posted 2006) [#1]
I was reading this a medley of digests
and i was using the md5 string function. I've made some functions that format the md5 hash. ( pretty simple, but its helping me learn. )
Then I saw the md5 bank function. Now all I wanna know is. if i have a filesize bigger than the users memory, how would i use banks to use this function? would it load sections of the file? and if so. how would i get the final md5 of the file. :/


Yan(Posted 2006) [#2]
I had this to hand...
Strict

Local start:Int = MilliSecs()

Print

Print Hex$(CRCFile("C:\Downloads\other\far_cry_v1.3.exe"))

Print (MilliSecs() - start)

'FlushMem

End

'Adaptation of MrCredo's CRC code
Function CRCFile(FileName:String, bufferSize:Int=$400000) 'default bufferSize = 4MB
  Local crc:Int = $FFFFFFFF , fileStream:TStream
  Global CRCTable:Int[]
  
  If CRCTable = Null
    Local i:Int, j:Int, value:Int
    
    CRCTable = New Int[256]
  
    For i=0 To 255
      value = i
      For j=0 To 7
        If value & $1 
          value = (value Shr 1) ~ $EDB88320
        Else
          value = value Shr 1
        EndIf
      Next
      CRCTable[i] = value
    Next
  EndIf
  
  fileStream = ReadStream(fileName$)
  If fileStream = Null Then Return
  
  Local buffer:TBank = CreateBank(bufferSize)
  Local bankPtr:Byte Ptr = buffer.Buf()
  
  Repeat
    Local bytesRead:Int = ReadBank(buffer, fileStream, 0, bufferSize)
    
    For Local b=0 Until bytesRead
      crc = (crc Shr 8) ~ CRCTable[bankPtr[b] ~ (crc & $FF)]
    Next
  Until Eof(FileStream)
  
  CloseStream FileStream
  
  Return ~crc
End Function
Should give you a reasonable idea of how you could go about it.


Moogles(Posted 2006) [#3]
Thanks Kind Sir! Im trying to do the md5 way by reading the file through a stream too :P. thats the right way right?


Moogles(Posted 2006) [#4]
well im no genius, and im stuck. heres what code ive altered
Strict

Function md5_file:String(FileName:String, bufferSize:Int=$400000)
	Local fileStream:TStream
	fileStream = ReadStream(fileName$)
	If fileStream = Null Then Return
 	Local buffer:TBank = CreateBank(bufferSize)
  	Local bankPtr:Byte Ptr = buffer.Buf()
  
 	Local h0 = $67452301, h1 = $EFCDAB89, h2 = $98BADCFE, h3 = $10325476

	Local r[] = [7, 12, 17, 22,  7, 12, 17, 22,  7, 12, 17, 22,  7, 12, 17, 22,..
                5,  9, 14, 20,  5,  9, 14, 20,  5,  9, 14, 20,  5,  9, 14, 20,..
                4, 11, 16, 23,  4, 11, 16, 23,  4, 11, 16, 23,  4, 11, 16, 23,..
                6, 10, 15, 21,  6, 10, 15, 21,  6, 10, 15, 21,  6, 10, 15, 21]

	Local k[] = [$D76AA478, $E8C7B756, $242070DB, $C1BDCEEE, $F57C0FAF, $4787C62A,..
                $A8304613, $FD469501, $698098D8, $8B44F7AF, $FFFF5BB1, $895CD7BE,..
                $6B901122, $FD987193, $A679438E, $49B40821, $F61E2562, $C040B340,..
                $265E5A51, $E9B6C7AA, $D62F105D, $02441453, $D8A1E681, $E7D3FBC8,..
                $21E1CDE6, $C33707D6, $F4D50D87, $455A14ED, $A9E3E905, $FCEFA3F8,..
                $676F02D9, $8D2A4C8A, $FFFA3942, $8771F681, $6D9D6122, $FDE5380C,..
                $A4BEEA44, $4BDECFA9, $F6BB4B60, $BEBFBC70, $289B7EC6, $EAA127FA,..
                $D4EF3085, $04881D05, $D9D4D039, $E6DB99E5, $1FA27CF8, $C4AC5665,..
                $F4292244, $432AFF97, $AB9423A7, $FC93A039, $655B59C3, $8F0CCC92,..
                $FFEFF47D, $85845DD1, $6FA87E4F, $FE2CE6E0, $A3014314, $4E0811A1,..
                $F7537E82, $BD3AF235, $2AD7D2BB, $EB86D391]
                
	Local intCount = (((buffer.Size() + 8) Shr 6) + 1) Shl 4
	Local data[intCount]
	Local inPtr:Byte Ptr = buffer.Buf()
  
	For Local c=0 Until buffer.Size()
		data[c Shr 2] = data[c Shr 2] | ((inPtr[c] & $FF) Shl ((c & 3) Shl 3))
	Next

	data[buffer.Size() Shr 2] = data[buffer.Size() Shr 2] | ($80 Shl ((buffer.Size() & 3) Shl 3)) 
	data[data.length - 2] = (Long(buffer.Size()) * 8) & $FFFFFFFF
	data[data.length - 1] = (Long(buffer.Size()) * 8) Shr 32
  
 	For Local chunkStart=0 Until intCount Step 16
		Local a = h0, b = h1, c = h2, d = h3
        
		For Local i=0 To 15
			Local f = d ~ (b & (c ~ d))
			Local t = d
      
			d = c ; c = b
			b = Rol((a + f + k[i] + data[chunkStart + i]), r[i]) + b
			a = t
		Next
    
		For Local i=16 To 31
			Local f = c ~ (d & (b ~ c))
			Local t = d

			d = c ; c = b
			b = Rol((a + f + k[i] + data[chunkStart + (((5 * i) + 1) & 15)]), r[i]) + b
			a = t
		Next
    
		For Local i=32 To 47
			Local f = b ~ c ~ d
			Local t = d
      
			d = c ; c = b
			b = Rol((a + f + k[i] + data[chunkStart + (((3 * i) + 5) & 15)]), r[i]) + b
			a = t
		Next
    
		For Local i=48 To 63
			Local f = c ~ (b | ~d)
			Local t = d
      
			d = c ; c = b
			b = Rol((a + f + k[i] + data[chunkStart + ((7 * i) & 15)]), r[i]) + b
			a = t
		Next
    
		h0 :+ a ; h1 :+ b
		h2 :+ c ; h3 :+ d
	Next
  
	Return (LEHex(h0) + LEHex(h1) + LEHex(h2) + LEHex(h3)).ToLower()  
End Function

Function Rol(val, shift)
  Return (val Shl shift) | (val Shr (32 - shift))
End Function

Function Ror(val, shift)
  Return (val Shr shift) | (val Shl (32 - shift))
End Function

Function LEHex$(val)
  Local out$ = Hex(val)
  
  Return out$[6..8] + out$[4..6] + out$[2..4] + out$[0..2]
End Function

Print md5_file("C:\NVIDIA\Win2KXP\81.98\Setup.exe")


and heres the original
Strict
Function md5_file:String(inbank:TBank)
 	Local h0 = $67452301, h1 = $EFCDAB89, h2 = $98BADCFE, h3 = $10325476

	Local r[] = [7, 12, 17, 22,  7, 12, 17, 22,  7, 12, 17, 22,  7, 12, 17, 22,..
                5,  9, 14, 20,  5,  9, 14, 20,  5,  9, 14, 20,  5,  9, 14, 20,..
                4, 11, 16, 23,  4, 11, 16, 23,  4, 11, 16, 23,  4, 11, 16, 23,..
                6, 10, 15, 21,  6, 10, 15, 21,  6, 10, 15, 21,  6, 10, 15, 21]

	Local k[] = [$D76AA478, $E8C7B756, $242070DB, $C1BDCEEE, $F57C0FAF, $4787C62A,..
                $A8304613, $FD469501, $698098D8, $8B44F7AF, $FFFF5BB1, $895CD7BE,..
                $6B901122, $FD987193, $A679438E, $49B40821, $F61E2562, $C040B340,..
                $265E5A51, $E9B6C7AA, $D62F105D, $02441453, $D8A1E681, $E7D3FBC8,..
                $21E1CDE6, $C33707D6, $F4D50D87, $455A14ED, $A9E3E905, $FCEFA3F8,..
                $676F02D9, $8D2A4C8A, $FFFA3942, $8771F681, $6D9D6122, $FDE5380C,..
                $A4BEEA44, $4BDECFA9, $F6BB4B60, $BEBFBC70, $289B7EC6, $EAA127FA,..
                $D4EF3085, $04881D05, $D9D4D039, $E6DB99E5, $1FA27CF8, $C4AC5665,..
                $F4292244, $432AFF97, $AB9423A7, $FC93A039, $655B59C3, $8F0CCC92,..
                $FFEFF47D, $85845DD1, $6FA87E4F, $FE2CE6E0, $A3014314, $4E0811A1,..
                $F7537E82, $BD3AF235, $2AD7D2BB, $EB86D391]
                
	Local intCount = (((inbank.Size() + 8) Shr 6) + 1) Shl 4
	Local data[intCount]
	Local inPtr:Byte Ptr = inbank.Buf()
  
	For Local c=0 Until inbank.Size()
		data[c Shr 2] = data[c Shr 2] | ((inPtr[c] & $FF) Shl ((c & 3) Shl 3))
	Next

	data[inbank.Size() Shr 2] = data[inbank.Size() Shr 2] | ($80 Shl ((inbank.Size() & 3) Shl 3)) 
	data[data.length - 2] = (Long(inbank.Size()) * 8) & $FFFFFFFF
	data[data.length - 1] = (Long(inbank.Size()) * 8) Shr 32
  
 	For Local chunkStart=0 Until intCount Step 16
		Local a = h0, b = h1, c = h2, d = h3
        
		For Local i=0 To 15
			Local f = d ~ (b & (c ~ d))
			Local t = d
      
			d = c ; c = b
			b = Rol((a + f + k[i] + data[chunkStart + i]), r[i]) + b
			a = t
		Next
    
		For Local i=16 To 31
			Local f = c ~ (d & (b ~ c))
			Local t = d

			d = c ; c = b
			b = Rol((a + f + k[i] + data[chunkStart + (((5 * i) + 1) & 15)]), r[i]) + b
			a = t
		Next
    
		For Local i=32 To 47
			Local f = b ~ c ~ d
			Local t = d
      
			d = c ; c = b
			b = Rol((a + f + k[i] + data[chunkStart + (((3 * i) + 5) & 15)]), r[i]) + b
			a = t
		Next
    
		For Local i=48 To 63
			Local f = c ~ (b | ~d)
			Local t = d
      
			d = c ; c = b
			b = Rol((a + f + k[i] + data[chunkStart + ((7 * i) & 15)]), r[i]) + b
			a = t
		Next
    
		h0 :+ a ; h1 :+ b
		h2 :+ c ; h3 :+ d
	Next
  
	Return (LEHex(h0) + LEHex(h1) + LEHex(h2) + LEHex(h3)).ToLower()  
End Function

Function Rol(val, shift)
  Return (val Shl shift) | (val Shr (32 - shift))
End Function

Function Ror(val, shift)
  Return (val Shr shift) | (val Shl (32 - shift))
End Function

Function LEHex$(val)
  Local out$ = Hex(val)
  
  Return out$[6..8] + out$[4..6] + out$[2..4] + out$[0..2]
End Function

i know your looping and finding the crc32 of the buffersize. but then how do you take that info and find the crc32 of the total file? im sorry for being a pain. :/ Its hard because I dont know what the md5 is doing. Im reading up on shr and the md5 function


VP(Posted 2006) [#5]
Have a look for ADLER32 as well ;) It's faster than CRC32 but only provides a good unique key after 2K (for binary) or 4K (for text).

B3D Code below is untested (and doesn't handle big files).

Function pngExporterAdler32%(databank%)
	;==========================================================================
	; DESCRIPTION
	;
	; Computes the Adler-32 value for data stored in a bank.
	;--------------------------------------------------------------------------
	; PARAMETERS
	;
	; databank%	: Valid Blitz bank.
	;--------------------------------------------------------------------------
	; RETURNS
	;
	; 32 bit Adler checksum for the data held in the bank.
	;==========================================================================

	; !! This function has not been tested or verified.

	adler%=1
	
	length%=BankSize(databank%)-1

	s1% = adler% And $FFFF
	s2% = (adler% Shr 16) And $FFFF

	For i%=0 To length%
		s1% = s1% + PeekByte(databank%,i%) % 65521
		s2% = (s2% + s1%) % 65521
	Next

	Return (s2% Shl 16) + s1%
	
End Function



Moogles(Posted 2006) [#6]
no no.. i just want to know how would i calculate the md5 of a massive file, using chunks of it. like ernie showed with the crc one above. thats all.


Yan(Posted 2006) [#7]
You should be doing something like this (psuedo code)...
Function md5_file:String(FileName:String, bufferSize:Int=$400000)
  ...Set up MD5 constants, create bank and open file stream, etc...
   
  repeat
    fill buffer with data from file
    
    if EOF
      set upper data boundry to (number of bytes read this loop) mod (64 - 9) bytes
      write $80 after the last byte
      pad to (upper data boundry - 8) with null bytes
      write little endian long of total length of data in bits to (upper data boundry - 8)
    end if
      
    For Local chunkStart=0 Until (end of data in ints) Step 16
      ...MD5 passes getting data from the bank as little endian ints...	
    Next
  
  until EOF
	Return (LEHex(h0) + LEHex(h1) + LEHex(h2) + LEHex(h3)).ToLower()  
End Function