Saturday, November 28, 2009

Horrible Hashes

Let's talk about djb2 hash function (which was a subject of topcoder contest, where it's choice rendered the contest far too trivial).

unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;

while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

return hash;
}
and another version which has
hash = ((hash << 5) + hash) ^ c;
(reference: http://www.cse.yorku.ca/~oz/hash.html)
The function itself is not bad for it's original use where very few lowest bits are used for bucketing in a hash table; but as a 32-bit hash, it stinks.

What's stupid is that if you search for djb2 on google, you see all sorts of people 'personally recommending' it as best and fastest simple hash, people trying to explain why it is good (when the answer is: it is not particularly good), people wondering why 5381 is better (it's not), people tracking the history of this "excellent" function, etc. All in all people presuming that 5381 and 33 got some special significance and are much better than e.g. 0 and 31.

What is so bad about it? For starters, even though the output of this function is 32 bits, not even for the 2 char alphanumeric ASCII strings do you have a guarantee for lack of collisions. In fact "cb" collides with "bC", in the version with addition, and "bA" collides with "ab" when using xor, just to name two examples out of hundreds. Each character except first provides only about 5 bits because that's how much you get out with *33.
That's not good. From a 32-bit hash, you would normally expect to get no collisions at all between 2 character strings, especially restricted to alphanumeric.
Most primes work no worse; you can use 257 and then your function at least will not collide on 2-character strings (it will still be crap though, especially if you use parts of hash; this doesn't need to be a prime, only needs to be odd and you ought to run code to select best for hashing some real data like list of all file names if you want a good number. I think 293 should be pretty good here). Furthermore, there are a lot of collisions between strings that differ by 2 characters, because 2 consecutive characters can be altered to keep same hash.

Got to give some credit though. In some very limited original usage (hash table of specific size, with specific key statistics, e.g. English words), which I do not know, and which you are highly unlikely to replicate, it may have been excellent. Or not too bad.

What is the significance of 5381 ? Apart from low 8 bits of 5381*33 (in the variation which has xor instead of add), it is pretty much totally irrelevant to collision resistance, it is just multiplied by 33n and added in. This function is pretty much as crap with start value of 5381 as with start value of 42 or 100 or 12345 - the only difference is that unexplained 5381 hints at some deep wisdom whereas 12345 does not.

All in all, you should not trust magical looking code. The best magical constants were selected for some very particular case which you know nothing about, by a method which you know nothing about, and are still most likely than not bad for whatever you want to do.
Do not trust internet advice or consensus either. Keep in mind that majority of acclaimed programming experts are experts at posting a lot of stuff online, being out to be noticed.
Keep in mind that majority of people in 'consensus' are simply repeating each other, and haven't devoted much brain time to thinking about the question (or the question they thought about may be a different question).

This is why science does not and cannot function by reference to authority, but only by reference to argument, to actual reasons, and why if no reasons are given you shouldn't assume that any exist.

edit: also, don't even get me started on "fast". If you want fast, you'd better do 4 chars at once, on a 32-bit machine.

edit: clarified on the version with + and version with ^, even though those have very similar properties.
edit: god damn that article sucked (I wrote it something like 8 years ago if not longer), rewrote a few bits.

8 comments:

  1. In many cases, though, fast and imperfect hashes are adequate. For example, for hash tables with string keys.
    I think that the best tool must be chosen for a particular job. Properties of hash functions must be known before use. If someone uses DJB or FNV as the only way of uniquely identifying data, then the problem is with the programmer, not the hash function.

    ReplyDelete
  2. well, the point of this post is that this function is quite bad for the use for hash tables with string keys. The collisions are important for hash tables. For example, simple sum of characters is truly horrible choice for hash tables.

    ReplyDelete
  3. The cost of computing a strong hash may be much greater than comparing keys in case of collisions. Of course, length of the key and size of the hash table are important factors here. So again -- its about best tool for a specific task. DJB, FNV and simillar have their place. But programmers must make no assumptions about properties of algorithms they employ.

    ReplyDelete
  4. Well, yea. Or of algorithms they personally recommend as "best" for that matter, because the usages vary, there's no "best" here, and it matters what you do with hash (e.g. you could use bottom 8 or 6 bits specially).

    On topic of speed, this function is actually not quick-and-dirty, it's dirty, but its not quick. A quick and dirty function ought to do 4 chars at once; this way it can either work faster or have better collision properties at same performance.

    This looks like a good article about hash functions:
    http://www.codeproject.com/KB/recipes/hash_functions.aspx

    ReplyDelete
  5. So... what hash function would you use? I've been using djb2 pretty heavily for years, and once I switched to a 64-bit hash have never had a collision (I did once see a collision with a 32-but hash, but that's not surprising). For an "I spent 5 minutes on the Internet and this is what I found" algorithm, it seems to work just fine. Is there a better, equally simple algorithm out there that I just don't know about?

    ReplyDelete
    Replies
    1. Dunno about the 64 bit version, how is it different, just the datatype is 64 bit or what? Are you using xor or + ?

      Here's some test code: http://pastebin.com/6UQ9sSxD

      Prints a ton of collisions even for 2-character alphanumeric strings.

      It wouldn't do too bad with a larger multiplier like say 293, I think. Didn't test.

      Come to think about it, the original use - long time ago - may have been an 6..8 bit hash (where only the lowest 6..8 bits are used for bucketing), maybe also on a machine without a hardware multiplier (so you had to have very few bits set in the multiplier).

      For that specific use the constant choice is pretty good.

      Then it got copied around and lost it context, and now you see people asking "what makes those constants so good" and so on, when nothing does or at best someone had some kind of use and a dataset long time ago. E.g. maybe a very long time ago someone smart used the bottom 6 bits of that hash for 64 buckets, and they did testing against all-lowercase or all-uppercase strings, possibly in a non-ASCII encoding).

      Really, what irritates me about those things is the magical properties people end up attributing to any magic numbers.

      Delete
    2. Ohh and as for your question, what are you using it for? How much performance actually matters? (note that the CPU speeds had increased much faster than memory speed, often making complicated computations effectively "free")

      Since you say "no collisions" I presume you aren't using it for buckets in a hash table, which is what Bernstein's hash is made for.

      Delete
  6. This comment has been removed by the author.

    ReplyDelete