
Doodles for Lossless Compression of CV Data

I’ve been developing a few electronics ideas that require storing CV data. For a first stab I thought ‘keep it small’, so I’ll try it on an 8k or 16k AVR chip, and I’m curious how many seconds of data, sampled at say 100Hz, I could store in that space. Obviously it will get saturated at some point, but that’s a subject for a later doodle where I’ll need to get into lossy compression (for example, making space by dropping small deltas).
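
As a ballpark, assuming one byte per 8-bit sample: 8k of storage at 100Hz is roughly 80 seconds of data uncompressed, and 16k roughly 160 seconds, so compression is about buying headroom rather than changing the order of magnitude.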

A friend recommended run-length compression on deltas. I did some experiments with that in a Jupyter session using smoothed random noise as test data. Here are those early experiments where, speaking very loosely, things compress to around 30%:

%matplotlib inline
from matplotlib import pyplot as plt
import random

N = 300
t = [i for i in range(N)]

# Data starts as noise and is then smoothed repeatedly to make a random curve
y = [int(.5 + random.uniform(0, 127)) for _ in range(N)]
SMOOTH_N = 20
for _ in range(SMOOTH_N):
    for i in range(N - 2):
        y[i + 1] = int((y[i] + y[i + 1] + y[i + 2]) / 3)
        
def delta_values(y):
    # First entry is the starting value; the rest are successive differences
    yd = [y[0]]
    for i in range(len(y) - 1):
        yd.append(y[i + 1] - y[i])
    return yd

def rle_pack(y):
    """Convert y array into array of (run count, value) tuples"""
    yd = delta_values(y)
    y_rle = [(1, yd[0])]
    for i in range(len(y) - 1):
        if yd[i + 1] == y_rle[-1][1]:
            y_rle[-1] = (y_rle[-1][0] + 1, y_rle[-1][1])
        else:
            y_rle.append((1, yd[i + 1]))
    return y_rle

def rle_unpack(y_rle):
    """Rebuild the sample list by replaying each (count, delta) run"""
    y = []
    v = 0
    for tup in y_rle:
        count = tup[0]
        for _ in range(count):
            v += tup[1]
            y.append(v)
    return y

a = rle_pack(y)
b = rle_unpack(a)

# Plot the original and the round-tripped data (they should overlay exactly)
plt.plot(t, y, t, b)

# Loose measure: each RLE entry is a (count, delta) pair, not a single byte
print(f'Compressed to {100 * len(a) / N:.1f}% of original')

This morning it struck me there may be a way to enhance this: keep the delta compression but use Huffman coding to prioritise the deltas; that is, for small changes, rather than storing the delta in a byte, use a short code that maps to a frequently-used value in a lookup tree. However, as data is going to be recorded on the fly, I won’t be able to build a prefix tree upfront that is optimal for the data set; file compression algorithms can check the frequencies of what they are storing retrospectively to maximise compression. Instead I think I’ll have to explore for a best fit to the typical frequencies of the deltas. For example it could be something like this:

[
  11 = +1,
  10 = -1,
  011 = +2,
  010 = -2,
  0011 = +3,
  0010 = -3
]
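
A rough sketch of what that might look like in Python (the function names are mine, and I’ve generalised the table’s pattern so a delta of magnitude m is m - 1 zeros, a ‘1’, then a sign bit; zero deltas aren’t representable yet):

def encode_delta_unary(d):
    """Encode a non-zero delta as (|d| - 1) zeros, a '1', then a sign bit.
    Matches the table above: +1 -> '11', -1 -> '10', +2 -> '011', +3 -> '0011'."""
    assert d != 0
    return '0' * (abs(d) - 1) + '1' + ('1' if d > 0 else '0')

def decode_delta_unary(bits, pos):
    """Read one delta starting at bit index pos; return (delta, new_pos)."""
    zeros = 0
    while bits[pos] == '0':
        zeros += 1
        pos += 1
    sign = 1 if bits[pos + 1] == '1' else -1
    return sign * (zeros + 1), pos + 2

# Quick round trip on a few example deltas
deltas = [1, -2, 3, -1, 5]
stream = ''.join(encode_delta_unary(d) for d in deltas)
pos, decoded = 0, []
while pos < len(stream):
    d, pos = decode_delta_unary(stream, pos)
    decoded.append(d)
assert decoded == deltas
print(stream, f'({len(stream)} bits for {len(deltas)} deltas)')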

Depending on the bits-per-sample size though, there is a potentially very large set of possible deltas; I’m using 8 bits (other projects may use 10 or 12 bits), so the worst case is +/- 255. I need to explore a few data sets to confirm (TBC ‘1’ below), but it looks like the worst cases for big deltas would be horribly long streams of bits. Or, putting it in Huffman code terms, it would mean looking up values deep in the tree.

So along with exploring this generalised case, I’d like to explore what optimisations are possible too. For example, rather than having +/- 1, 2, 3, 4, 5, …, 255, perhaps I could encode powers of two: +/- 1, 2, 4, 8, 16, …, 128. That is, read bits until you encounter a 1, counting the N zeros along the way, then read the next bit B. The value to add to the delta is +/- 2^N (plus for B == 1 and minus for B == 0).

[
  11 = +1,
  10 = -1,
  011 = +2,
  010 = -2,
  0011 = +4,
  0010 = -4
]
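
A similarly hedged sketch of the power-of-two variant (again, names are mine; only exact powers of two get a single code, so anything else would need a chain of them):

def encode_delta_pow2(d):
    """Encode +/- 2^n as n zeros, a '1', then a sign bit.
    Matches the table above: +1 -> '11', +2 -> '011', +4 -> '0011'."""
    m = abs(d)
    assert d != 0 and m & (m - 1) == 0, 'only exact powers of two'
    return '0' * (m.bit_length() - 1) + '1' + ('1' if d > 0 else '0')

def decode_delta_pow2(bits, pos):
    """Read one power-of-two delta; return (delta, new_pos)."""
    zeros = 0
    while bits[pos] == '0':
        zeros += 1
        pos += 1
    sign = 1 if bits[pos + 1] == '1' else -1
    return sign * (1 << zeros), pos + 2

assert encode_delta_pow2(4) == '0011'
assert decode_delta_pow2('0010', 0) == (-4, 4)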

There are two glaring omissions with the above: firstly, there is no +0, so if we’re not doing run-length compression there’s no way to repeat a value; secondly, we’re always working with powers of two. The latter could be remedied by having a means to choose N or 2^N when most applicable; that is, along with codes for values, have certain codes that change how the encoder is working.

Say the decoder accumulates delta values, but doesn’t actually move on to the next sample until a dedicated ‘step’ code has been read:

[
  10 = +0,
  11 = <tally up deltas so far and step to next sample>,
  011 = +1,
  010 = -1,
  0011 = +2,
  0010 = -2,
  00011 = +4,
  00010 = -4
]
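
To make the accumulation idea concrete, here’s a rough decoder sketch for exactly that table (the function name and start value are my own placeholders):

def decode_accumulated(bits, start_value=0):
    """Decode a stream where several delta codes can apply to one sample.
    '10' adds 0, '11' commits the accumulated delta and steps to the next
    sample, and 'N zeros + 1 + sign bit' adds +/- 2^(N-1)."""
    samples, value, acc, pos = [], start_value, 0, 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == '0':
            zeros += 1
            pos += 1
        if zeros == 0:                    # leading '1': either '10' or '11'
            if bits[pos + 1] == '1':      # '11': step to the next sample
                value += acc
                samples.append(value)
                acc = 0
            pos += 2                      # '10' adds nothing
        else:                             # zeros, then '1', then sign bit
            sign = 1 if bits[pos + 1] == '1' else -1
            acc += sign * (1 << (zeros - 1))
            pos += 2
    return samples

# A delta of +5 for one sample could be sent as +4, then +1, then the step code
assert decode_accumulated('00011' + '011' + '11') == [5]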

This is already looking ugly, but I’m keen to measure it (TBC ‘3’ below).

However, I suspect that this powers-of-two approach goes to the other extreme: pathological cases for large non-power-of-two deltas, say 0xFF, will require summing so many chained codes that it will be suboptimal. I need to confirm that though (TBC ‘2’ below). Regardless, maybe sometimes we’d like to use a delta of N rather than 2^N, or some other function; or perhaps sometimes we’d prefer to run-length encode some parts (or all) of the stream, or just read in a binary value verbatim.

Extending that idea, what this ends up becoming is a Huffman tree with functions as well as values. For example, ‘11’ could lead into multiple functions that change the decoder’s state (TBC ‘4’ below):

[
  10 = +0,
  1100 = <store last sample>,
  11100 = <set function D() = +/- 2^N>,
  11101 = <set function D() = +/- N>,
  11110 = <set value verbatim and store>,
  1111... = other, e.g. store last value N times?
  011 = +D(1),
  010 = -D(1),
  0011 = +D(2),
  0010 = -D(2)
]
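
To convince myself the state-changing part hangs together, here’s a toy sketch of a decoder whose delta function D() can be swapped by control codes (all names are hypothetical and the bitstream parsing is left out; it’s just the state machinery):

class StatefulDecoder:
    """Decoder state for the 'functions as well as values' idea."""
    def __init__(self):
        self.delta_fn = lambda n: 1 << (n - 1)   # default: D(n) = 2^(n-1)
        self.value = 0
        self.samples = []

    def apply_delta(self, n, sign):
        # A value code +/- D(n) would land here
        self.value += sign * self.delta_fn(n)
        self.samples.append(self.value)

    def set_linear(self):
        # Triggered by a <set function D() = +/- N> control code
        self.delta_fn = lambda n: n

    def set_pow2(self):
        # Triggered by a <set function D() = +/- 2^N> control code
        self.delta_fn = lambda n: 1 << (n - 1)

dec = StatefulDecoder()
dec.apply_delta(3, +1)    # +4 under the power-of-two function
dec.set_linear()          # a control code would switch the function
dec.apply_delta(3, +1)    # +3 under the linear function
print(dec.samples)        # [4, 7]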

I’ll be poking around with those various TBCs. In retrospect, the latter case may consolidate more neatly into another experiment: a ‘0’ prefix => use the suffix bits for a value, interpreted according to the current state (be it a Huffman code for a delta or a direct binary value); a ‘1’ prefix => use the suffix bits to change the internal state (TBC ‘5’).

Also, after re-reading this, it may be best to add an even simpler case (sketched after the list below): use a +/- delta lookup for small changes (say up to +/- 5), and fall back to reading a whole value from the binary stream if the delta is any bigger. That leaves my TBC tally as:

  • TBC 1: Measure +/- N prefix tree
  • TBC 2: Measure +/- 2^N prefix tree
  • TBC 3: Measure +/- 2^N prefix tree with +0 and a step code
  • TBC 4: Explore multifunction encoding with delta tree
  • TBC 5: Try generalised encoding with 0 for value, 1 for function.
  • TBC 6: Stream with fixed +/- up to, say, 5 and fall back to raw data otherwise.
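
A minimal sketch of that simplest fall-back case (TBC 6), reusing y and N from the session above; the exact bit layout is my own guess, purely to get a measurable baseline:

def encode_small_or_raw(y, max_delta=5):
    """Small deltas get a short code, anything bigger escapes to a raw byte:
      '1' + sign bit + 3-bit magnitude   (small delta, magnitude 0 = repeat)
      '0' + 8 raw bits                   (verbatim sample)"""
    bits, prev = '', 0
    for v in y:
        d = v - prev
        if d == 0:
            bits += '11000'                          # repeat previous value
        elif -max_delta <= d <= max_delta:
            sign = '1' if d > 0 else '0'
            bits += '1' + sign + format(abs(d), '03b')
        else:
            bits += '0' + format(v, '08b')           # raw 8-bit sample
        prev = v
    return bits

packed = encode_small_or_raw(y)
print(f'{len(packed)} bits vs {8 * N} bits raw')

Five bits per small delta against eight per raw sample, with a nine-bit escape, so whether this wins depends entirely on how often the deltas stay small.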

I’m not a compression bod and I’m not academically versed, so there’s probably a blazing oversight here. But this is a play project to get into Huffman trees, and I hope to discover interesting things along the way. TBCs will be coming in later posts.
