Given a set of unique Strings is their set of String.hashcode() values
unique enough?
Well... it depends on what you define as
enough.
In my case below it was enough. Read how I assessed it.
It's clear that different inputs might map to the same hashcode value (2^32 different options), but what are the chances for it to happen?
I have one million users, each user owns 50 private items. An item is identified by a UUID.
I had these two conflicting goals:
(I) Represent each item as an integer instead of a UUID
(II) Avoid collisions. Any pair of items owned by the same user should resolve to a different hashcode.
What is
enough: I could live with up to 10 users, out of a million, experiencing a collision. Most of these 10 users will never notice the collision. I assume the system will have other bugs with higher probably than that.
|
We're all unique |
Assessing uniqueness
One way to asses is computing the statistical probability for such an event . But I preferred a "proof" that any programmer could appreciate even those without good statistics skills. Therefore I coded a simulation that simply ties it in practice:
Download from Gist
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
| package collisions.test;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
public class UUIDToHashcodeUniquenessTestMain {
private final static int num_of_users = 1000 * 1000;
private final static int num_of_stacks = 50;
public static void main(String[] args) {
int collisions = 0;
for (int i = 0; i < num_of_users; i++) {
collisions += calcCollisionsForUser();
}
System.out.println("Had " + collisions + " collisions for " + num_of_users + " users");
}
private static int calcCollisionsForUser() {
int collisions = 0;
Set<Integer> uuidSet = new HashSet<Integer>(num_of_stacks * 2);
for (int i = 0; i < num_of_stacks; i++) {
String uuid = UUID.randomUUID().toString();
Integer uuidHashcode = uuid.hashCode();
if (uuidSet.contains(uuidHashcode)) {
collisions++;
}
uuidSet.add(uuidHashcode);
}
return collisions;
}
}
|
The program comes back saying that a collisions aren't really something to worry about:
Iteration 0: Had 0 collisions for 1000000 users
Iteration 1: Had 0 collisions for 1000000 users
Iteration 2: Had 0 collisions for 1000000 users
Iteration 3: Had 0 collisions for 1000000 users
Iteration 4: Had 0 collisions for 1000000 users