What is a good Hash function? I saw a lot of hash function and applications in my data structures courses in college, but I mostly got that it's pretty hard to make a good hash function. As a rule of thumb to avoid collisions my professor said that:
function Hash(key)
return key mod PrimeNumber
end
(mod is the % operator in C and similar languages)
with the prime number to be the size of the hash table. I get that is a somewhat good function to avoid collisions and a fast one, but how can I make a better one? Is there better hash functions for string keys against numeric keys?
There's no such thing as a “good hash function” for universal hashes (ed. yes, I know there's such a thing as “universal hashing” but that's not what I meant). Depending on the context different criteria determine the quality of a hash. Two people already mentioned SHA. This is a cryptographic hash and it isn't at all good for hash tables which you probably mean.
Hash tables have very different requirements. But still, finding a good hash function universally is hard because different data types expose different information that can be hashed. As a rule of thumb it is good to consider all information a type holds equally. This is not always easy or even possible. For reasons of statistics (and hence collision), it is also important to generate a good spread over the problem space, i.e. all possible objects. This means that when hashing numbers between 100 and 1050 it's no good to let the most significant digit play a big part in the hash because for ~ 90% of the objects, this digit will be 0. It's far more important to let the last three digits determine the hash.
Similarly, when hashing strings it's important to consider all characters – except when it's known in advance that the first three characters of all strings will be the same; considering these then is a waste.
This is actually one of the cases where I advise to read what Knuth has to say in The Art of Computer Programming, vol. 3. Another good read is Julienne Walker's The Art of Hashing.
For doing "normal" hash table lookups on basically any kind of data - this one by Paul Hsieh is the best I've ever used.
http://www.azillionmonkeys.com/qed/hash.html
If you care about cryptographically secure or anything else more advanced, then YMMV. If you just want a kick ass general purpose hash function for a hash table lookup, then this is what you're looking for.
There are two major purposes of hashing functions:
to disperse data points uniformly into n bits.
to securely identify the input data.
It's impossible to recommend a hash without knowing what you're using it for.
If you're just making a hash table in a program, then you don't need to worry about how reversible or hackable the algorithm is... SHA-1 or AES is completely unnecessary for this, you'd be better off using a variation of FNV. FNV achieves better dispersion (and thus fewer collisions) than a simple prime mod like you mentioned, and it's more adaptable to varying input sizes.
If you're using the hashes to hide and authenticate public information (such as hashing a password, or a document), then you should use one of the major hashing algorithms vetted by public scrutiny. The Hash Function Lounge is a good place to start.
This is an example of a good one and also an example of why you would never want to write one. It is a Fowler / Noll / Vo (FNV) Hash which is equal parts computer science genius and pure voodoo:
unsigned fnv_hash_1a_32 ( void *key, int len ) {
unsigned char *p = key;
unsigned h = 0x811c9dc5;
int i;
for ( i = 0; i < len; i++ )
h = ( h ^ p[i] ) * 0x01000193;
return h;
}
unsigned long long fnv_hash_1a_64 ( void *key, int len ) {
unsigned char *p = key;
unsigned long long h = 0xcbf29ce484222325ULL;
int i;
for ( i = 0; i < len; i++ )
h = ( h ^ p[i] ) * 0x100000001b3ULL;
return h;
}
Edit:
Landon Curt Noll recommends on his site the FVN-1A algorithm over the original FVN-1 algorithm: The improved algorithm better disperses the last byte in the hash. I adjusted the algorithm accordingly.
I'd say that the main rule of thumb is not to roll your own. Try to use something that has been thoroughly tested, e.g., SHA-1 or something along those lines.
A good hash function has the following properties:
Given a hash of a message it is computationally infeasible for an attacker to find another message such that their hashes are identical. Given a pair of message, m' and m, it is computationally infeasible to find two such that that h(m) = h(m')
The two cases are not the same. In the first case, there is a pre-existing hash that you're trying to find a collision for. In the second case, you're trying to find any two messages that collide. The second task is significantly easier due to the birthday "paradox."
Where performance is not that great an issue, you should always use a secure hash function. There are very clever attacks that can be performed by forcing collisions in a hash. If you use something strong from the outset, you'll secure yourself against these.
Don't use MD5 or SHA-1 in new designs. Most cryptographers, me included, would consider them broken. The principle source of weakness in both of these designs is that the second property, which I outlined above, does not hold for these constructions. If an attacker can generate two messages, m and m', that both hash to the same value they can use these messages against you. SHA-1 and MD5 also suffer from message extension attacks, which can fatally weaken your application if you're not careful.
A more modern hash such as Whirpool is a better choice. It does not suffer from these message extension attacks and uses the same mathematics as AES uses to prove security against a variety of attacks.
Hope that helps!
What you're saying here is you want to have one that uses has collision resistance. Try using SHA-2. Or try using a (good) block cipher in a one way compression function (never tried that before), like AES in Miyaguchi-Preenel mode. The problem with that is that you need to: 1) have an IV. Try using the first 256 bits of the fractional parts of Khinchin's constant or something like that. 2) have a padding scheme. Easy. Barrow it from a hash like MD5 or SHA-3 (Keccak [pronounced 'ket-chak']). If you don't care about the security (a few others said this), look at FNV or lookup2 by Bob Jenkins (actually I'm the first one who reccomends lookup2) Also try MurmurHash, it's fast (check this: .16 cpb).
A good hash function should
be bijective to not loose information, where possible, and have the least collisions cascade as much and as evenly as possible, i.e. each input bit should flip every output bit with probability 0.5 and without obvious patterns. if used in a cryptographic context there should not exist an efficient way to invert it.
A prime number modulus does not satisfy any of these points. It is simply insufficient. It is often better than nothing, but it's not even fast. Multiplying with an unsigned integer and taking a power-of-two modulus distributes the values just as well, that is not well at all, but with only about 2 cpu cycles it is much faster than the 15 to 40 a prime modulus will take (yes integer division really is that slow).
To create a hash function that is fast and distributes the values well the best option is to compose it from fast permutations with lesser qualities like they did with PCG for random number generation.
Useful permutations, among others, are:
multiplication with an uneven integer
binary rotations
xorshift
Following this recipe we can create our own hash function or we take splitmix which is tested and well accepted.
If cryptographic qualities are needed I would highly recommend to use a function of the sha family, which is well tested and standardised, but for educational purposes this is how you would make one:
First you take a good non-cryptographic hash function, then you apply a one-way function like exponentiation on a prime field or k
many applications of (n*(n+1)/2) mod 2^k
interspersed with an xorshift when k
is the number of bits in the resulting hash.
I highly recommend the SMhasher GitHub project https://github.com/rurban/smhasher which is a test suite for hash functions. The fastest state-of-the-art non-cryptographic hash functions without known quality problems are listed here: https://github.com/rurban/smhasher#summary.
Success story sharing
There's no such thing as a “good hash function” for universal hashes (ed. yes, I know there's such a thing as “universal hashing” but that's not what I meant).
- What's the difference in meaning between "universal hashes" and "universal hashing?"