Tuesday, July 7, 2009

Farey - Huffman Compress

The algorithm starts by computing a vector with the number of appearances of each character. At each step, the two nodes with the lowest appearance counts are combined into a binary tree node: at the first step, the two characters with the lowest counts are combined, and these two nodes form a new node whose count equals the sum of its children. Then the vector of counts is re-sorted and the combining process starts again (it is possible to combine a single character with a "sum" node, or two "sum" nodes with each other).
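The merging loop above can be sketched in Python (a minimal sketch, not the post's implementation; the function name `huffman_codes` and the tie-breaking counter are my own choices):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build Huffman codes by repeatedly merging the two lowest-count nodes."""
    freq = Counter(text)
    # heap entries: (count, tiebreak, tree) where tree is a char or a (left, right) pair
    heap = [(c, i, ch) for i, (ch, c) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    if len(heap) == 1:                       # degenerate single-symbol input
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)      # the two nodes with lowest counts...
        c2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, i, (t1, t2)))  # ...form a "sum" node
        i += 1
    codes = {}
    def walk(tree, prefix):                  # left edge = "0", right edge = "1"
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes
```

For example, for the input "aaabbc" the most frequent character 'a' gets a 1-bit code while 'b' and 'c' get 2-bit codes.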


http://en.wikipedia.org/wiki/Huffman_coding
This is intended as a method to optimize Huffman tree compression, and all the other algorithms based on it. For this, Farey addition is used.

The main idea uses a correspondence between the Huffman tree and binary search: the Huffman tree is viewed as the same tree, but one that stores in its internal nodes the median values of an array over which a virtual search is supposed to take place. All these values are then searched with a new method: instead of computing the midpoint of an interval, we compute the mediant of that interval. This is also known as Farey addition. Suppose the left and right limits of the interval, "a" and "c", are viewed as the fractions a/1 and c/1; at each step, starting from the fractions a/b and c/d, we compute (a+c)/(b+d). Similarly to Huffman codes, where descending into the left subinterval records a "0" and into the right a "1", if we want to merge the two methods and compare against both the binary-search midpoint and the Farey mediant value, there are six cases. But we can still encode everything with 0 and 1, by keeping, alongside the 0/1 codes, an auxiliary array that stores something like an imprint (an "impress"): in this vector we store a 0 if the code starting at a position is not optimized, and a 1 otherwise.
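The mediant search described above can be sketched as follows: we locate a target fraction between the interval limits a/1 and c/1, emitting "0" for each step into the left subinterval and "1" for the right, mirroring the Huffman convention. The function name and the exact stopping rule (stop when the mediant equals the target) are my assumptions:

```python
def farey_code(p, q, a=0, c=1):
    """Locate the fraction p/q in (a/1, c/1) by repeated mediant (Farey) steps."""
    ln, ld = a, 1                    # left limit, viewed as a/1
    rn, rd = c, 1                    # right limit, viewed as c/1
    bits = []
    while True:
        mn, md = ln + rn, ld + rd    # the mediant (a+c)/(b+d)
        if p * md == mn * q:         # mediant equals the target: done
            return "".join(bits)
        if p * md < mn * q:          # target lies in the left subinterval
            bits.append("0")
            rn, rd = mn, md
        else:                        # target lies in the right subinterval
            bits.append("1")
            ln, ld = mn, md
```

For instance, between 0/1 and 1/1 the first mediant is 1/2, so 1/2 gets the empty code, 1/3 gets "0", and 3/5 gets "10".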

In approximately 30% of all cases this search produces shorter codes than binary search.
Why use this method instead of another division of the interval? Possible answers:
- If we used a log_x(n) division, the complexity of the divide-and-conquer algorithm grows; the steps are fewer, but the length of the codes increases if we represent them with 0 and 1, and if we use an impress array the impress grows, because there can be cases in which the same binary code represents multiple searched values.
- The Farey bisection of the interval produces, in the limit, a proportion between the subinterval lengths equal to the golden ratio phi.
- The number of "questions" needed to find a fraction is log2(1 + sum(phi(i))) ≈ log2(3·n²/π²) < 2·log2(n).
- On subintervals, the envelope of the distribution looks like the Gaussian bell. The generation of the Farey numbers can use matrices which are submatrices of the Hadamard matrix/transform, which is used for generating the Gaussian distribution due to the central limit theorem.
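The count of Farey fractions used in the third point can be checked numerically: |F_n| = 1 + Σ phi(k) for k = 1..n, which approaches 3·n²/π². A small sketch, assuming a naive gcd-based totient:

```python
from math import gcd, pi

def farey_len(n):
    """|F_n| = 1 + sum of Euler's totient phi(k) for k = 1..n (naive gcd count)."""
    phi = lambda k: sum(1 for j in range(1, k + 1) if gcd(j, k) == 1)
    return 1 + sum(phi(k) for k in range(1, n + 1))
```

Already for moderate n the asymptotic estimate is close: farey_len(100) lies within a fraction of a percent of 3·100²/π².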

I observed that if the "impress" is compressed again with the UHBC algorithm (which I found on the internet), the result is smaller; let's denote this by "UhbcA".
The size of the Farey codes + "UhbcA" << the size of the Huffman codes.

PS. The impress can be stored using something like sparse matrix storage.
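A minimal way to store the impress sparsely, assuming it is mostly zeros, is to keep only the positions of the 1 bits (the function names are mine, for illustration only):

```python
def impress_to_sparse(bits):
    """Keep only the positions of the 1 bits (the optimized codes)."""
    return [i for i, b in enumerate(bits) if b == 1]

def sparse_to_impress(positions, length):
    """Rebuild the full 0/1 impress vector from the stored positions."""
    ones = set(positions)
    return [int(i in ones) for i in range(length)]
```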