Repeated fingerprinting after array updates

Question

I have a microprocessor and want to quickly identify the settings of my application (stored in some EEPROM regions) via a fingerprint, instead of having to dump the entire memory every time.

So I was going to compute a 32-bit hash from the few kbytes, but I am not sure which algorithm to use. It doesn't need to be cryptographically secure, but it should be fast and not need some large lookup tables. And most importantly, whenever one of the settings (bytes) changes I need to recompute the hash, which I want to do efficiently.

I guessed I could use any regular hashing algorithm and modify it so that it stores intermediate results in a tree-like structure, which trades some memory for a logarithmic recomputation cost. (As I found in my research, this concept seems to be called a Merkle tree.) But surely there's an even better way to do it.

I then found out about rolling hashes, which employ a recursive scheme that generalises to

$H(arr) = f_r(\ldots f_r(f_r(f_r(f_0 \oplus arr_0) \oplus arr_1) \oplus arr_2) \ldots \oplus arr_n)$

that can be expanded to

$H(arr) = f_0 \oplus f^0(arr_0) \oplus f^1(arr_1) \oplus f^2(arr_2) \oplus \ldots \oplus f^n(arr_n)$

This appears perfect for my use case if $\oplus$ is associative, commutative and reversible, which allows recomputing the hash after an update of the array at index $i$ via

$H(arr_{new}) = H(arr_{old}) \ominus f^i(arr_{old,i}) \oplus f^i(arr_{new,i})$
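
As a minimal sketch of this update rule, assuming $\oplus$ is bitwise XOR (so $\ominus$ is XOR as well): the function `f` below is a hypothetical stand-in for $f^i$, an arbitrary index-and-byte mixer that is not taken from any of the algorithms named here, and Python is used only for readability. The point is that `update` touches a single term yet agrees with a full recomputation.

```python
MASK32 = 0xFFFFFFFF

def f(i, x):
    """Hypothetical placeholder for f^i(x): mixes index i and byte x into 32 bits."""
    v = ((i << 8) | x) & MASK32
    v = (v * 0x9E3779B1) & MASK32   # arbitrary odd multiplier (an assumption)
    v ^= v >> 15
    return v

def full_hash(arr, f0=0):
    """H(arr) = f0 XOR f(0, arr[0]) XOR f(1, arr[1]) XOR ..."""
    h = f0
    for i, x in enumerate(arr):
        h ^= f(i, x)
    return h

def update(h, i, old, new):
    """Remove the old term and add the new one; XOR is its own inverse."""
    return h ^ f(i, old) ^ f(i, new)
```

With `h = full_hash(arr)`, calling `update(h, i, arr[i], new_byte)` should give the same value as rehashing the array after writing `new_byte` at index `i`, at constant cost per changed byte.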

The natural choices for $\oplus$ are bitwise XOR and modular addition, but which hash algorithms employ them, use an $f^i$ that is both simple and fast to implement, and still have good hashing characteristics?

I found some but am not totally satisfied.

CRC32 is known to be fast, and although all descriptions of the algorithm are iterative, I'm confident I can derive a non-recursive $f$ function (knowing that CRCs have the property $\operatorname{crc}(x \oplus y \oplus z) = \operatorname{crc}(x) \oplus \operatorname{crc}(y) \oplus \operatorname{crc}(z)$). Still, CRC is used and designed for error detection, not fingerprinting, so maybe it's not suitable at all. Am I missing something there?
Rabin fingerprint appears to be better suited, but I only found iterative implementations with explicit polynomial division. I'm not sure how to derive $f^i$ for this (is it even possible?). Also, I'm not certain how it differs from CRC computation.
Rabin-Karp rolling hash appears to be a good choice, but I have concerns. Does "modulo $n$" imply I won't get all 32-bit values? Isn't the exponentiation $f^i(x) = x \cdot a^i$ relatively slow? And what constants $a$/$n$ should I use?
Cyclic polynomial / Buzhash looks easy and efficient to implement at first. But what about the $h$ in $f^i(x) = rotate(h(x), i)$, wouldn't this need to be a function that generates a 32-bit value from my 8-bit bytes? (I read that a lookup table with random data works fine, but 256*4 bytes is too large for me.) And isn't this rotation susceptible to swapping two settings that happen to be 32 bytes apart? Could the upper bits of $i$, which is effectively used modulo 32, be utilised for something?
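
The CRC property quoted in the first item above can be checked numerically. The sketch below uses Python's standard `zlib.crc32`; because that CRC includes an initial value and a final XOR, it is affine rather than strictly linear, so the identity holds for an odd number of equal-length messages (three here). The helper names `xor_bytes` and `crc_after_update` and the sample data are illustrative assumptions.

```python
import zlib

def xor_bytes(*msgs):
    """XOR several equal-length byte strings together."""
    out = bytearray(len(msgs[0]))
    for m in msgs:
        for k, b in enumerate(m):
            out[k] ^= b
    return bytes(out)

def crc_after_update(crc_old, n, i, old, new):
    """CRC of an n-byte array after byte i changes from old to new.

    Uses new_array = old_array XOR delta XOR zeros, i.e. three
    equal-length messages, so the XOR identity applies.
    """
    delta = bytearray(n)
    delta[i] = old ^ new
    return crc_old ^ zlib.crc32(bytes(delta)) ^ zlib.crc32(bytes(n))
```

Note that `crc_after_update` still costs time proportional to `n`, since it runs the CRC over two full-length buffers. It only demonstrates the property that makes a closed-form $f^i$ possible.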

Can you dispel my doubts about any of these? Or is there some other algorithm that I missed entirely?

This question seems related, but couldn't solve my problem. – Bergi, Commented Jan 23, 2017 at 20:33

D.W. · Accepted Answer · 2017-01-23 20:57:20Z

Any of the first three options you list are likely to be fine for what you want.

The CRC provides a good fingerprint (indeed, there's some theory to back that up, based on the theory of Carter-Wegman hashing, 2-universal hashes, etc.).

The Rabin fingerprint is also fine, and is essentially equivalent to the CRC (modulo small technical details that are unimportant). $f^i(m_i) = m_i x^i \bmod p(x)$, which can be readily computed even for very large values of $i$ using fast modular exponentiation.
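
A minimal sketch of that formula, assuming the CRC-32 generator polynomial as $p(x)$ and a plain remainder with no initial value or final XOR (both are assumptions, not requirements of the scheme). Polynomials over GF(2) are stored as integer bitmasks. Bytes are fed most significant first, so in this sketch the byte at index `i` of an `n`-byte array carries the weight $x^{8(n-1-i)}$ rather than $x^i$. All function names are hypothetical.

```python
POLY = 0x104C11DB7  # CRC-32 generator including the x^32 term (assumed choice)

def polymod(v):
    """Reduce the GF(2) polynomial v modulo POLY."""
    while v.bit_length() > 32:
        v ^= POLY << (v.bit_length() - 33)
    return v

def polymul(a, b):
    """Carry-less product a*b mod POLY; a and b must already be reduced."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = polymod(a << 1)
        b >>= 1
    return r

def x_pow(k):
    """x^k mod POLY by square-and-multiply, O(log k) multiplications."""
    result, base = 1, 2  # the integer 2 encodes the polynomial x
    while k:
        if k & 1:
            result = polymul(result, base)
        base = polymul(base, base)
        k >>= 1
    return result

def fingerprint(data):
    """Remainder of the message polynomial modulo POLY."""
    h = 0
    for b in data:
        h = polymod((h << 8) ^ b)
    return h

def update(h, n, i, old, new):
    """Fingerprint after byte i of an n-byte array changes from old to new."""
    return h ^ polymul(old ^ new, x_pow(8 * (n - 1 - i)))
```

Since the array length is fixed, the values `x_pow(8 * (n - 1 - i))` depend only on the position and could also be computed on demand, which avoids any lookup table at the cost of a few dozen shift-and-XOR steps per update.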

The Rabin-Karp hash is also fine, as long as you choose a good modulus $m$. Don't pick a power of two; use a prime number. Exponentiation is again fast if you use standard algorithms for fast modular exponentiation. If possible, use a constant $a$ that is larger than the largest possible value of any single array element (single byte), i.e., larger than 255. (Alternatively, it might be possible to pick $m$ to be a power of two if you also choose the constant $a$ to be a large random number modulo $m$; I haven't thought about that too carefully, so if that sounds attractive, please post a separate question asking about that specific scenario.)
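
A minimal sketch of the Rabin-Karp variant, assuming the prime modulus $2^{31}-1$ and a base of 263. Both constants are illustrative choices, not recommendations from a specific source, and the function names are hypothetical. Python's built-in three-argument `pow` performs the fast modular exponentiation.

```python
M = (1 << 31) - 1   # prime modulus 2^31 - 1 (assumed choice)
A = 263             # hypothetical base, larger than any byte value

def rk_hash(arr):
    """H(arr) = sum(arr[i] * A^i) mod M."""
    h = 0
    for i, x in enumerate(arr):
        h = (h + x * pow(A, i, M)) % M
    return h

def rk_update(h, i, old, new):
    """Hash after arr[i] changes from old to new, without rescanning the array."""
    # Python's % always returns a non-negative result, even when new < old.
    return (h + (new - old) * pow(A, i, M)) % M
```

The result always lies in the range 0 to `M - 1`, so with this modulus the fingerprint uses slightly fewer than 31 bits rather than the full 32. On a 32-bit target the products need a 64-bit intermediate before reduction.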

Buzhash is a little weaker than the other hash functions, though I suppose it's possible it might be good enough in practice. Yes, Buzhash is susceptible to swapping of two values that are a multiple of 32 positions apart. There might be ways to fix this, but given the large lookup tables required, that seems moot. As you say, Buzhash requires a lookup table that maps from 8-bit values to 32-bit values (i.e., 256 entries, where each entry is a 32-bit word), so it sounds like Buzhash is not going to be feasible in your situation.
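
The swap weakness can be reproduced with a short sketch. It assumes the positional form $f^i(x) = rotate(h(x), i)$ from the question, with $h$ realised as a 256-entry table of pseudo-random 32-bit words; the seed and the names `TABLE`, `rotl32` and `buzhash` are hypothetical.

```python
import random

_rng = random.Random(0)  # fixed seed so the table is reproducible (assumption)
TABLE = [_rng.getrandbits(32) for _ in range(256)]

def rotl32(v, r):
    """Rotate the 32-bit value v left by r bits; r is taken modulo 32."""
    r %= 32
    return ((v << r) | (v >> (32 - r))) & 0xFFFFFFFF

def buzhash(arr):
    """XOR of table entries, each rotated by its position in the array."""
    h = 0
    for i, x in enumerate(arr):
        h ^= rotl32(TABLE[x], i)
    return h
```

Positions 3 and 35 use the same rotation amount, so exchanging the bytes stored there leaves `buzhash` unchanged even though the array differs.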

Thanks for the insights! I thought about fast exponentiation, I meant "relatively" slow when compared to the rotate of Buzhash. Regarding the modulus in Rabin-Karp, $2^{31}-1$ seems popular. — Bergi, Commented Jan 23, 2017 at 21:23
