Code archives/Algorithms/Burrows-Wheeler-Transformation
This code has been declared by its author to be Public Domain code.
Download source code
| |||||
Hi! The BWT-Algorithm optimize a huffmancompression extremely. This algorithm is using for example in bzip2 compression. The Huffmancompression you can find by: http://blitzbasic.com/codearcs/codearcs.php?code=195 http://blitzbasic.com/codearcs/codearcs.php?code=195 (I was to putridly to write a huffmancode self :)) Huffman create a binary tree. This tree has any branches that represent the chars. Chars that occurs frequently become a shorter bit-path, chars that occurs rarely become a long bit-path. The BWT optimize datablocks for huffman(and/or for Run Length Encoding). For example: "blitzbasicrules!" = 2:2:2:1:1:1:1:1:1:1:1:1:1 create a rotated list: blitzbasicrules! litzbasicrules!b itzbasicrules!bl tzbasicrules!bli zbasicrules!blit basicrules!blitz asicrules!blitzb sicrules!blitzba icrules!blitzbas crules!blitzbasi rules!blitzbasic ules!blitzbasicr les!blitzbasicru es!blitzbasicrul s!blitzbasicrule !blitzbasicrules now you must sort this table: !blitzbasicrule[s] asicrules!blitz[b] basicrules!blit[z] blitzbasicrules[!] crules!blitzbas[i] es!blitzbasicru[l] icrules!blitzba[s] itzbasicrules!b[l] les!blitzbasicr[u] litzbasicrules![b] rules!blitzbasi[c] s!blitzbasicrul[e] sicrules!blitzb[a] tzbasicrules!bl[i] ules!blitzbasic[r] zbasicrules!bli[t] rotate te original 1x: "blitzbasicrules!" -> "litzbasicrules!b" and search it, in the sorted table. You can find it on index 9. Save the cloum and the index as following: "sbz!ilslubceairt, 9" Now make a alphabet that contain all chars: "!abceilrstuz" searching this alphabet for the char, replace this char at the first in this alphabet: !abceilrstuz s -> index 8 -> s!abceilrtuz b -> index 3 -> bs!aceilrtuz z -> index 11 -> zbs!aceilrtu ! -> index 3 -> !zbsaceilrtu i -> index 7 -> i!zbsacelrtu l -> index 8 -> l!zbsaceirtu r -> index 9 -> rl!zbsaceitu s -> index 5 -> srl!zbaceitu t -> index 10 -> tsrl!zbaceiu u -> index 11 -> usrl!zbaceit z -> index 5 -> zusrl!baceit now you must write "8 3 11 3 7 8 9 5 10 11", 9 = 2:2:2:1:1:1:1 (ok, ok it was a bad example :)) to decode it, you must do the same: !abceilrstuz index 8 -> s -> s!abceilrtuz index 3 -> b -> bs!aceilrtuz index 11 -> z -> zbs!aceilrtu index 3 -> ! -> !zbsaceilrtu index 7 -> i -> i!zbsacelrtu index 8 -> l -> l!zbsaceirtu index 9 -> r -> rl!zbsaceitu index 5 -> s -> srl!zbaceitu index 10 -> t -> tsrl!zbaceiu index 11 -> u -> usrl!zbaceit index 5 -> z -> zusrl!baceit to decode "sbz!ilslubceairt, 9" you must do this: Last = "sbz!ilslubceairt" sorting this to: First = "!abbceiillrsstuz" Finding to all first entrys the last entry, set the last to " "(for no doublefinding!), and save the lastindex in to a transform vektor: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ------------------------------------------------ s b z [!] i l s l u b c e a i r t [!] a b b c e i i l l r s s t u z ------------------------------------------------ 3 -> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ------------------------------------------------ s b z i l s l u b c e [a] i r t ! [a] b b c e i i l l r s s t u z ------------------------------------------------ 3 12 ... -> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ------------------------------------------------ ! a b b c e i i l l r s s t u z ------------------------------------------------ 3 12 1 9 10 11 4 13 5 7 14 0 6 15 8 2 Trans = " 3 12 1 9 10 11 4 13 5 7 14 0 6 15 8 2", 7 With this vector you can rebuild the data: start at position 9: Index = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Trans = 3 12 1 9 10 11 4 13 5 7 14 0 6 15 8 2 Last = s b z ! i l s l u b c e a i r t at Last[9] is the char "b", at Trans[9] is the index 7. So the next index is at 7 Last[7] = l and so on: Index = 9 Last[9] = b -> Index = Trans[9] Last[7] = l -> Index = Trans[7] Last[13] = i -> Index = Trans[13] Last[15] = t -> Index = Trans[15] Last[2] = z -> Index = Trans[2] Last[1] = b -> Index = Trans[1] Last[12] = a -> Index = Trans[12] Last[6] = s -> Index = Trans[6] Last[4] = i -> Index = Trans[4] Last[10] = c -> Index = Trans[10] Last[14] = r -> Index = Trans[14] Last[8] = u -> Index = Trans[8] Last[5] = l -> Index = Trans[5] Last[11] = e -> Index = Trans[11] Last[0] = s Puhhh, sorry for my bad english :) I have testet this with a 171 KB Bitmap file. Without BWT, Huffman compress it to 119 KB. With BWT to 52,2 KB! When I ready, i will write my own huffman code, to optimizie it, for 512 byte datablocks. You can set the datablock size bigger, for better results, but need more time. cu olli | |||||
Strict Print "Start" BWT_Encode("test.bmp", "test.bwt") Print "End" Print "Start" BWT_Decode("test.bwt", "test2.bmp") Print "End" End Function BWT_Encode(sFileIn:String, sFileOut:String) Local iIndex:Int Local iFileSize:Int, iEncodeEnd:Int, tStreamIn:TStream, tStreamOut:TStream Local tDataBlock:TBank, sFirst:String, sSortedTable:String[512] Local iFirstIndex:Int Local sAlphabet:String, bFind:Byte iFileSize = FileSize(sFileIn) iEncodeEnd = iFileSize-(iFileSize Mod 512) tStreamIn = ReadFile(sFileIn) If tStreamIn = Null Then Return False EndIf tStreamOut = WriteFile(sFileOut) If tStreamOut = Null Then CloseFile tStreamIn Return False EndIf WriteInt tStreamOut, iEncodeEnd tDataBlock = CreateBank(512) While StreamPos(tStreamIn) < iEncodeEnd ReadBank(tDataBlock, tStreamIn, 0, 512) sFirst = "" For iIndex = 0 To 511 sFirst = sFirst+Chr(PeekByte(tDataBlock, iIndex)) Next sSortedTable[0] = sFirst For iIndex = 1 To 511 sSortedTable[iIndex] = sSortedTable[iIndex-1][1..512]+Chr(sSortedTable[iIndex-1][0]) Next sSortedTable.Sort() sFirst = sFirst[1..512]+Chr(sFirst[0]) For iIndex = 0 To 511 If sSortedTable[iIndex] = sFirst Then iFirstIndex = iIndex Exit EndIf Next sAlphabet = "" For iIndex = 0 To 255 sAlphabet = sAlphabet+Chr(iIndex) Next For iIndex = 0 To 511 bFind = sSortedTable[iIndex][511] WriteByte tStreamOut, sAlphabet.Find(Chr(bFind)) sAlphabet = Chr(bFind)+sAlphabet.Replace(Chr(bFind), "") Next WriteInt tStreamOut, iFirstIndex Wend While Not Eof(tStreamIn) WriteByte tStreamOut, ReadByte(tStreamIn) Wend CloseFile tStreamOut CloseFile tStreamIn Return True End Function Function BWT_Decode(sFileIn:String, sFileOut:String) Local iIndex:Int, iIndex2:Int, bChar:Byte Local iEncodeEnd:Int, tStreamIn:TStream, tStreamOut:TStream Local tDataBlock:TBank, iFirst:Int[512], iLast:Int[512], iTrans:Int[512] Local iFirstIndex:Int Local sAlphabet:String, bFind:Byte tStreamIn = ReadFile(sFileIn) If tStreamIn = Null Then Return False EndIf tStreamOut = WriteFile(sFileOut) If tStreamOut = Null Then CloseFile tStreamIn Return False EndIf iEncodeEnd = (ReadInt(tStreamIn)/512)*516 tDataBlock = CreateBank(512) While StreamPos(tStreamIn) < iEncodeEnd ReadBank(tDataBlock, tStreamIn, 0, 512) iFirstIndex = ReadInt(tStreamIn) sAlphabet = "" For iIndex = 0 To 255 sAlphabet = sAlphabet+Chr(iIndex) Next For iIndex = 0 To 511 bFind = PeekByte(tDataBlock, iIndex) PokeByte tDataBlock, iIndex, sAlphabet[bFind] sAlphabet = Chr(sAlphabet[bFind])+sAlphabet.Replace(Chr(sAlphabet[bFind]), "") Next For iIndex = 0 To 511 iFirst[iIndex] = PeekByte(tDataBlock, iIndex) iLast[iIndex] = PeekByte(tDataBlock, iIndex) Next For iIndex = 0 To 511 iFirst[iIndex] = PeekByte(tDataBlock, iIndex) iLast[iIndex] = PeekByte(tDataBlock, iIndex) Next iFirst.Sort For iIndex = 0 To 511 bChar = iFirst[iIndex] For iIndex2 = 0 To 511 If iLast[iIndex2] = bChar Then iTrans[iIndex] = iIndex2 iLast[iIndex2] = iLast[iIndex2]+256 Exit EndIf Next Next iIndex2 = iFirstIndex For iIndex = 0 To 511 WriteByte tStreamOut, iLast[iIndex2]-256 iIndex2 = iTrans[iIndex2] Next FlushMem Wend While Not Eof(tStreamIn) WriteByte tStreamOut, ReadByte(tStreamIn) Wend CloseFile tStreamOut CloseFile tStreamIn Return True End Function |
Comments
| ||
gheh nice work man |
| ||
Very nice! How would you modify this to work with bmax strings? |
| ||
I don't understand very well :) but i have try with a BMP file Original : 921 656 byte BWT file : 928 860 octets Original file with winrar : 404 603 byte BWT file with winrar : 692 631 byte ???? Strange :) May be winrar use different than huffman compress method ? |
| ||
Very nice - It greatly improves the output of my huffman routines. I'm considering releasing my revised and improved algorithm soon... |
| ||
Removed FlushMems, can someone confirm this code still works as expected. |
| ||
Yep you're right Filax this is no good. It doesn't make the .bwt file smaller at all. |
| ||
This is neat but I'm not sure about the bit near the end of the encoding where you form a separate alphabet of characters and map them to numbers ... this seems like a way of `compacting` the range of symbols used to represent the string, as an extra step, which might not be in the original BWT? And maybe that is what's throwing it off? I'll have to try this for myself. |
Code Archives Forum