Mostly Polynomial: Horrible Hashes

Saturday, November 28, 2009

Horrible Hashes

Let's talk about the djb2 hash function (which was the subject of a TopCoder contest, where its choice rendered the contest far too trivial).

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++) != 0)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

and another version which has
hash = ((hash << 5) + hash) ^ c;
(reference: http://www.cse.yorku.ca/~oz/hash.html)
The function itself is not bad for its original use, where only a few of the lowest bits are used for bucketing in a hash table; but as a 32-bit hash, it stinks.

What's stupid is that if you search for djb2 on Google, you see all sorts of people 'personally recommending' it as the best and fastest simple hash, people trying to explain why it is good (when the answer is: it is not particularly good), people wondering why 5381 is better (it's not), people tracking the history of this "excellent" function, etc. All in all, people presume that 5381 and 33 have some special significance and are much better than, e.g., 0 and 31.

What is so bad about it? For starters, even though the output of this function is 32 bits, you have no guarantee against collisions even for 2-character alphanumeric ASCII strings. In fact "bA" collides with "ab" in the version with addition (33*98 + 65 = 33*97 + 98 = 3299), and "cb" collides with "bC" when using xor, just to name two examples out of hundreds. Each character except the first provides only about 5 bits, because that's how much you get out of *33.
That's not good. From a 32-bit hash, you would normally expect no collisions at all between 2-character strings, especially when restricted to alphanumeric characters.
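A minimal sketch for checking such collisions yourself. It assumes the hash is truncated to 32 bits (uint32_t instead of unsigned long, which is 64 bits on many platforms); the names djb2_add and djb2_xor are made up for this example.

```c
#include <stdint.h>

/* djb2, addition variant: hash = hash * 33 + c, kept to 32 bits. */
uint32_t djb2_add(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33u + (unsigned char)*s++;
    return h;
}

/* djb2, xor variant: hash = (hash * 33) ^ c, kept to 32 bits. */
uint32_t djb2_xor(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = (h * 33u) ^ (unsigned char)*s++;
    return h;
}
```

With these, djb2_add("bA") equals djb2_add("ab"), and djb2_xor("cb") equals djb2_xor("bC").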
Most primes work no worse; you can use 257, and then your function at least will not collide on 2-character strings. It will still be crap though, especially if you use only parts of the hash. The multiplier doesn't need to be a prime, it only needs to be odd, and if you want a good number you ought to run code to select the best one for hashing some real data, like a list of all file names; I think 293 should be pretty good here. Furthermore, there are a lot of collisions between strings that differ by 2 characters, because 2 consecutive characters can be altered to keep the same hash.
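One rough way to compare multipliers is to brute-force all 62*62 two-character alphanumeric strings and count repeated hash values. The sketch below does that for the addition variant; the function name and its seed/mult parameters are hypothetical, chosen for this illustration only.

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Counts how many two-character alphanumeric strings hash to a value
   already produced by another such string, using
   hash = hash * mult + c, starting from seed, kept to 32 bits. */
int count_two_char_collisions(uint32_t seed, uint32_t mult)
{
    static const char alnum[] =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    enum { N = 62 };
    uint32_t hashes[N * N];
    int i, j, n = 0, collisions = 0;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            uint32_t h = seed;
            h = h * mult + (unsigned char)alnum[i];
            h = h * mult + (unsigned char)alnum[j];
            hashes[n++] = h;
        }

    /* After sorting, equal hashes sit next to each other. */
    qsort(hashes, (size_t)n, sizeof hashes[0], cmp_u32);
    for (i = 1; i < n; i++)
        if (hashes[i] == hashes[i - 1])
            collisions++;
    return collisions;
}
```

Calling it with (5381, 33) or (0, 31) reports a nonzero count, while (5381, 257) reports zero. To pick a multiplier for real use, the same loop would have to run over real keys rather than this toy key set.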

Got to give some credit though. In some very limited original usage (hash table of specific size, with specific key statistics, e.g. English words), which I do not know, and which you are highly unlikely to replicate, it may have been excellent. Or not too bad.

What is the significance of 5381? Apart from the low 8 bits of 5381*33 (in the variation which has xor instead of add), it is pretty much totally irrelevant to collision resistance; it is just multiplied by 33ⁿ and added in. This function is pretty much as crap with a start value of 5381 as with a start value of 42 or 100 or 12345. The only difference is that an unexplained 5381 hints at some deep wisdom whereas 12345 does not.
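For the addition variant this can be checked directly: for two strings of the same length n, the start value contributes the same seed * 33ⁿ term to both hashes, so whether they collide does not depend on it. A small sketch, with hypothetical helper names and an arbitrary list of start values:

```c
#include <stddef.h>
#include <stdint.h>

/* Addition variant of djb2 with a caller-chosen start value. */
uint32_t djb2_seeded(uint32_t seed, const char *s)
{
    uint32_t h = seed;
    while (*s)
        h = h * 33u + (unsigned char)*s++;
    return h;
}

/* Returns 1 if a and b collide for every start value in a small,
   arbitrarily chosen list, 0 otherwise. */
int collide_for_all_seeds(const char *a, const char *b)
{
    static const uint32_t seeds[] = { 5381u, 42u, 100u, 12345u, 0u };
    size_t i;

    for (i = 0; i < sizeof seeds / sizeof seeds[0]; i++)
        if (djb2_seeded(seeds[i], a) != djb2_seeded(seeds[i], b))
            return 0;
    return 1;
}
```

collide_for_all_seeds("bA", "ab") returns 1: the pair collides no matter which of these start values is used.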

All in all, you should not trust magical-looking code. Even the best magical constants were selected for some very particular case which you know nothing about, by a method which you know nothing about, and are still more likely than not bad for whatever you want to do.
Do not trust internet advice or consensus either. Keep in mind that the majority of acclaimed programming experts are experts at posting a lot of stuff online, out to be noticed.
Keep in mind that the majority of people in a 'consensus' are simply repeating each other, and haven't devoted much brain time to thinking about the question (or the question they thought about may be a different question).

This is why science does not and cannot function by reference to authority, but only by reference to argument, to actual reasons, and why if no reasons are given you shouldn't assume that any exist.

edit: also, don't even get me started on "fast". If you want fast, you'd better do 4 chars at once, on a 32-bit machine.
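To make "4 chars at once" concrete, here is a rough sketch of a loop that consumes the input one 32-bit word at a time. Everything in it is an assumption made for illustration: the function name, the length-based interface, the rotate-xor-multiply mixing step and the odd constant 0x9E3779B1. It has not been tuned or tested for hash quality, and its result depends on the machine's byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative mixing step: rotate left by 5, xor in the input,
   multiply by an arbitrary odd constant. */
static uint32_t mix(uint32_t h, uint32_t w)
{
    h = (h << 5) | (h >> 27);
    return (h ^ w) * 0x9E3779B1u;
}

/* Hashes len bytes, consuming 4 bytes per step where possible. */
uint32_t hash_by_words(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 0;

    while (len >= 4) {
        uint32_t w;
        memcpy(&w, p, 4);   /* portable unaligned 32-bit load */
        h = mix(h, w);
        p += 4;
        len -= 4;
    }
    while (len--)           /* remaining 0 to 3 bytes, one at a time */
        h = mix(h, *p++);
    return h;
}
```

For a NUL-terminated string s it would be called as hash_by_words(s, strlen(s)). The point is only that one multiply now covers four input bytes instead of one.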

edit: clarified on the version with + and version with ^, even though those have very similar properties.
edit: god damn that article sucked (I wrote it something like 8 years ago if not longer), rewrote a few bits.

9 comments:

Kayru, November 28, 2009 at 2:22 PM
In many cases, though, fast and imperfect hashes are adequate. For example, for hash tables with string keys.
I think that the best tool must be chosen for a particular job. Properties of hash functions must be known before use. If someone uses DJB or FNV as the only way of uniquely identifying data, then the problem is with the programmer, not the hash function.
Dmytry, November 28, 2009 at 3:42 PM
Well, the point of this post is that this function is quite bad for use in hash tables with string keys. Collisions are important for hash tables. For example, a simple sum of characters is a truly horrible choice for hash tables.
Kayru, November 28, 2009 at 7:04 PM
The cost of computing a strong hash may be much greater than comparing keys in case of collisions. Of course, length of the key and size of the hash table are important factors here. So again -- it's about the best tool for a specific task. DJB, FNV and similar have their place. But programmers must make no assumptions about properties of algorithms they employ.
Dmytry, November 28, 2009 at 9:21 PM
Well, yea. Or of algorithms they personally recommend as "best" for that matter, because the usages vary, there's no "best" here, and it matters what you do with hash (e.g. you could use bottom 8 or 6 bits specially).

On the topic of speed, this function is actually not quick-and-dirty: it's dirty, but it's not quick. A quick and dirty function ought to do 4 chars at once; this way it can either work faster or have better collision properties at the same performance.

This looks like a good article about hash functions:
http://www.codeproject.com/KB/recipes/hash_functions.aspx
Kevin Dill, May 20, 2016 at 3:39 PM
So... what hash function would you use? I've been using djb2 pretty heavily for years, and once I switched to a 64-bit hash have never had a collision (I did once see a collision with a 32-bit hash, but that's not surprising). For an "I spent 5 minutes on the Internet and this is what I found" algorithm, it seems to work just fine. Is there a better, equally simple algorithm out there that I just don't know about?
Dmytry, May 21, 2016 at 4:35 AM
This comment has been removed by the author.
Anonymous, May 10, 2017 at 10:56 AM
Can you show how it is implemented? In my code it's not colliding.