Tuesday, January 12, 2016

Is String.hashcode() unique enough?

Given a set of unique Strings is their set of String.hashcode() values unique enough?
Well... it depends on what you define as enough.
In my case below it was enough. Read how I assessed it.

It's clear that different inputs might map to the same hashcode value (2^32 different options), but what are the chances for it to happen?
I have one million users, each user owns 50 private items. An item is identified by a UUID.
I had these two conflicting goals:
(I) Represent each item as an integer instead of a UUID
(II) Avoid collisions. Any pair of items owned by the same user should resolve to a different hashcode.

What is enough: I could live with up to 10 users, out of a million, experiencing a collision. Most of these 10 users will never notice the collision. I assume the system will have other bugs with higher probably than that.
We're all unique

Assessing uniqueness

One way to asses is computing the statistical probability for such an event . But I preferred a "proof" that any programmer could appreciate even those without good statistics skills. Therefore I coded a simulation that simply ties it in practice:

Download from Gist
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
package collisions.test;

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class UUIDToHashcodeUniquenessTestMain {

 private final static int num_of_users = 1000 * 1000;
 private final static int num_of_stacks = 50;

 public static void main(String[] args) {
  int collisions = 0;
  for (int i = 0; i < num_of_users; i++) {
   collisions += calcCollisionsForUser();
  }
  System.out.println("Had " + collisions + " collisions for " + num_of_users + " users");
 }

 private static int calcCollisionsForUser() {
  int collisions = 0;
  Set<Integer> uuidSet = new HashSet<Integer>(num_of_stacks * 2);
  for (int i = 0; i < num_of_stacks; i++) {
   String uuid = UUID.randomUUID().toString();
   Integer uuidHashcode = uuid.hashCode();
   if (uuidSet.contains(uuidHashcode)) {
    collisions++;
   }
   uuidSet.add(uuidHashcode);
  }
  return collisions;
 }
}

The program comes back saying that a collisions aren't really something to worry about:
Iteration 0: Had 0 collisions for 1000000 users
Iteration 1: Had 0 collisions for 1000000 users
Iteration 2: Had 0 collisions for 1000000 users
Iteration 3: Had 0 collisions for 1000000 users
Iteration 4: Had 0 collisions for 1000000 users