Code archives/Algorithms/Burrows-Wheeler-Transformation

This code has been declared by its author to be Public Domain code.

Download source code

Burrows-Wheeler-Transformation by Vertex2005
Hi!
The BWT-Algorithm optimize a huffmancompression extremely. This algorithm is using for example in bzip2 compression.

The Huffmancompression you can find by:
http://blitzbasic.com/codearcs/codearcs.php?code=195
http://blitzbasic.com/codearcs/codearcs.php?code=195
(I was to putridly to write a huffmancode self :))

Huffman create a binary tree. This tree has any branches that represent the chars. Chars that occurs frequently become a shorter bit-path, chars that occurs rarely become a long bit-path.

The BWT optimize datablocks for huffman(and/or for Run Length Encoding).

For example: "blitzbasicrules!" = 2:2:2:1:1:1:1:1:1:1:1:1:1
create a rotated list:
blitzbasicrules!
litzbasicrules!b
itzbasicrules!bl
tzbasicrules!bli
zbasicrules!blit
basicrules!blitz
asicrules!blitzb
sicrules!blitzba
icrules!blitzbas
crules!blitzbasi
rules!blitzbasic
ules!blitzbasicr
les!blitzbasicru
es!blitzbasicrul
s!blitzbasicrule
!blitzbasicrules


now you must sort this table:
!blitzbasicrule[s]
asicrules!blitz[b]
basicrules!blit[z]
blitzbasicrules[!]
crules!blitzbas[i]
es!blitzbasicru[l]
icrules!blitzba[s]
itzbasicrules!b[l]
les!blitzbasicr[u]
litzbasicrules![b]
rules!blitzbasi[c]
s!blitzbasicrul[e]
sicrules!blitzb[a]
tzbasicrules!bl[i]
ules!blitzbasic[r]
zbasicrules!bli[t]


rotate te original 1x:
"blitzbasicrules!" -> "litzbasicrules!b"
and search it, in the sorted table. You can find it on index 9.

Save the cloum and the index as following:
"sbz!ilslubceairt, 9"

Now make a alphabet that contain all chars:
"!abceilrstuz"

searching this alphabet for the char, replace this char at the first in this alphabet:
                 !abceilrstuz
s -> index  8 -> s!abceilrtuz
b -> index  3 -> bs!aceilrtuz
z -> index 11 -> zbs!aceilrtu
! -> index  3 -> !zbsaceilrtu
i -> index  7 -> i!zbsacelrtu
l -> index  8 -> l!zbsaceirtu
r -> index  9 -> rl!zbsaceitu
s -> index  5 -> srl!zbaceitu
t -> index 10 -> tsrl!zbaceiu
u -> index 11 -> usrl!zbaceit
z -> index  5 -> zusrl!baceit


now you must write "8 3 11 3 7 8 9 5 10 11", 9 = 2:2:2:1:1:1:1
(ok, ok it was a bad example :))

to decode it, you must do the same:
                 !abceilrstuz
index  8 -> s -> s!abceilrtuz
index  3 -> b -> bs!aceilrtuz
index 11 -> z -> zbs!aceilrtu
index  3 -> ! -> !zbsaceilrtu
index  7 -> i -> i!zbsacelrtu
index  8 -> l -> l!zbsaceirtu
index  9 -> r -> rl!zbsaceitu
index  5 -> s -> srl!zbaceitu
index 10 -> t -> tsrl!zbaceiu
index 11 -> u -> usrl!zbaceit
index  5 -> z -> zusrl!baceit


to decode "sbz!ilslubceairt, 9" you must do this:
Last = "sbz!ilslubceairt" sorting this to:
First = "!abbceiillrsstuz"

Finding to all first entrys the last entry, set the last to " "(for no doublefinding!), and save the lastindex in to a transform vektor:
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------
 s  b  z [!] i  l  s  l  u  b  c  e  a  i  r  t
[!] a  b  b  c  e  i  i  l  l  r  s  s  t  u  z
------------------------------------------------
 3

->

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------
 s  b  z     i  l  s  l  u  b  c  e [a] i  r  t
 ! [a] b  b  c  e  i  i  l  l  r  s  s  t  u  z
------------------------------------------------
 3 12

...

->

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------
                                               
 !  a  b  b  c  e  i  i  l  l  r  s  s  t  u  z
------------------------------------------------
 3 12  1  9 10 11  4 13  5  7 14  0  6 15  8  2


Trans = " 3 12 1 9 10 11 4 13 5 7 14 0 6 15 8 2", 7

With this vector you can rebuild the data:
start at position 9:
Index = 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Trans = 3 12  1  9 10 11  4 13  5  7 14  0  6 15  8  2
Last  = s  b  z  !  i  l  s  l  u  b  c  e  a  i  r  t


at Last[9] is the char "b", at Trans[9] is the index 7. So the next index is at 7 Last[7] = l and so on:
                Index = 9
Last[9]  = b -> Index = Trans[9]
Last[7]  = l -> Index = Trans[7]
Last[13] = i -> Index = Trans[13]
Last[15] = t -> Index = Trans[15]
Last[2]  = z -> Index = Trans[2]
Last[1]  = b -> Index = Trans[1]
Last[12] = a -> Index = Trans[12]
Last[6]  = s -> Index = Trans[6]
Last[4]  = i -> Index = Trans[4]
Last[10] = c -> Index = Trans[10]
Last[14] = r -> Index = Trans[14]
Last[8]  = u -> Index = Trans[8]
Last[5]  = l -> Index = Trans[5]
Last[11] = e -> Index = Trans[11]
Last[0]  = s


Puhhh, sorry for my bad english :)

I have testet this with a 171 KB Bitmap file. Without BWT, Huffman compress it to 119 KB. With BWT to 52,2 KB!

When I ready, i will write my own huffman code, to optimizie it, for 512 byte datablocks.

You can set the datablock size bigger, for better results, but need more time.

cu olli
Strict

Print "Start"
BWT_Encode("test.bmp", "test.bwt")
Print "End"
Print "Start"
BWT_Decode("test.bwt", "test2.bmp")
Print "End"

End

Function BWT_Encode(sFileIn:String, sFileOut:String)
	Local iIndex:Int
	Local iFileSize:Int, iEncodeEnd:Int, tStreamIn:TStream, tStreamOut:TStream
	Local tDataBlock:TBank, sFirst:String, sSortedTable:String[512]
	Local iFirstIndex:Int
	Local sAlphabet:String, bFind:Byte

	iFileSize = FileSize(sFileIn)
	iEncodeEnd = iFileSize-(iFileSize Mod 512)
	
	tStreamIn = ReadFile(sFileIn)
	If tStreamIn = Null Then
		Return False
	EndIf
    
	tStreamOut = WriteFile(sFileOut)
	If tStreamOut = Null Then
		CloseFile tStreamIn
		Return False
	EndIf
	
	WriteInt tStreamOut, iEncodeEnd
	
	tDataBlock = CreateBank(512)
	While StreamPos(tStreamIn) < iEncodeEnd
		ReadBank(tDataBlock, tStreamIn, 0, 512)
		
		sFirst = ""
		For iIndex = 0 To 511
			sFirst = sFirst+Chr(PeekByte(tDataBlock, iIndex))
		Next
		sSortedTable[0] = sFirst
		
		For iIndex = 1 To 511
			sSortedTable[iIndex] = sSortedTable[iIndex-1][1..512]+Chr(sSortedTable[iIndex-1][0])
		Next
		sSortedTable.Sort()
		
		sFirst = sFirst[1..512]+Chr(sFirst[0])
		For iIndex = 0 To 511
			If sSortedTable[iIndex] = sFirst Then
				iFirstIndex = iIndex
				Exit
			EndIf
		Next
		
		sAlphabet = ""
		For iIndex = 0 To 255
			sAlphabet = sAlphabet+Chr(iIndex)
		Next
		
		For iIndex = 0 To 511
			bFind = sSortedTable[iIndex][511]
			WriteByte tStreamOut, sAlphabet.Find(Chr(bFind))
			sAlphabet = Chr(bFind)+sAlphabet.Replace(Chr(bFind), "")
		Next
		WriteInt tStreamOut, iFirstIndex
		
	Wend
	
	While Not Eof(tStreamIn)
		WriteByte tStreamOut, ReadByte(tStreamIn)
	Wend
	
	CloseFile tStreamOut
	CloseFile tStreamIn
	
	Return True
End Function

Function BWT_Decode(sFileIn:String, sFileOut:String)
	Local iIndex:Int, iIndex2:Int, bChar:Byte
	Local iEncodeEnd:Int, tStreamIn:TStream, tStreamOut:TStream
	Local tDataBlock:TBank, iFirst:Int[512], iLast:Int[512], iTrans:Int[512]
	Local iFirstIndex:Int
	Local sAlphabet:String, bFind:Byte
	
	tStreamIn = ReadFile(sFileIn)
	If tStreamIn = Null Then
		Return False
	EndIf
    
	tStreamOut = WriteFile(sFileOut)
	If tStreamOut = Null Then
		CloseFile tStreamIn
		Return False
	EndIf
	
	iEncodeEnd = (ReadInt(tStreamIn)/512)*516
	
	tDataBlock = CreateBank(512)
	While StreamPos(tStreamIn) < iEncodeEnd
		ReadBank(tDataBlock, tStreamIn, 0, 512)
		iFirstIndex = ReadInt(tStreamIn)

		sAlphabet = ""
		For iIndex = 0 To 255
			sAlphabet = sAlphabet+Chr(iIndex)
		Next
		
		For iIndex = 0 To 511
			bFind = PeekByte(tDataBlock, iIndex)
			PokeByte tDataBlock, iIndex, sAlphabet[bFind]
			sAlphabet = Chr(sAlphabet[bFind])+sAlphabet.Replace(Chr(sAlphabet[bFind]), "")
		Next
		
		For iIndex = 0 To 511
			iFirst[iIndex] = PeekByte(tDataBlock, iIndex) 
			iLast[iIndex] = PeekByte(tDataBlock, iIndex) 
		Next
		
		For iIndex = 0 To 511
			iFirst[iIndex] = PeekByte(tDataBlock, iIndex) 
			iLast[iIndex] = PeekByte(tDataBlock, iIndex) 
		Next
		iFirst.Sort
		
		For iIndex = 0 To 511
			bChar = iFirst[iIndex]
			For iIndex2 = 0 To 511
				If iLast[iIndex2] = bChar Then
					iTrans[iIndex] = iIndex2
					iLast[iIndex2] = iLast[iIndex2]+256
					Exit
				EndIf
			Next
		Next

		iIndex2 = iFirstIndex
		For iIndex = 0 To 511
			WriteByte tStreamOut, iLast[iIndex2]-256
			iIndex2 = iTrans[iIndex2]
		Next
		
		FlushMem
	Wend

	While Not Eof(tStreamIn)
		WriteByte tStreamOut, ReadByte(tStreamIn)
	Wend
	
	CloseFile tStreamOut
	CloseFile tStreamIn
	
	Return True
End Function

Comments

Jeroen2005
gheh nice work man


GW2005
Very nice!
How would you modify this to work with bmax strings?


Filax2005
I don't understand very well :) but i have try with a BMP file

Original : 921 656 byte
BWT file : 928 860 octets

Original file with winrar : 404 603 byte
BWT file with winrar : 692 631 byte ????

Strange :)

May be winrar use different than huffman compress method ?


Murilo2005
Very nice - It greatly improves the output of my huffman routines. I'm considering releasing my revised and improved algorithm soon...


skidracer2006
Removed FlushMems, can someone confirm this code still works as expected.


Chroma2008
Yep you're right Filax this is no good. It doesn't make the .bwt file smaller at all.


ImaginaryHuman2010
This is neat but I'm not sure about the bit near the end of the encoding where you form a separate alphabet of characters and map them to numbers ... this seems like a way of `compacting` the range of symbols used to represent the string, as an extra step, which might not be in the original BWT? And maybe that is what's throwing it off? I'll have to try this for myself.


Code Archives Forum