Well... it depends on what you define as enough.
In my case below it was enough. Read how I assessed it.
It's clear that different inputs might map to the same hashcode value (2^32 different options), but what are the chances for it to happen?
I have one million users, each user owns 50 private items. An item is identified by a UUID.
I had these two conflicting goals:
(I) Represent each item as an integer instead of a UUID
(II) Avoid collisions. Any pair of items owned by the same user should resolve to a different hashcode.
What is enough: I could live with up to 10 users, out of a million, experiencing a collision. Most of these 10 users will never notice the collision. I assume the system will have other bugs with higher probably than that.
We're all unique |
Assessing uniqueness
One way to asses is computing the statistical probability for such an event . But I preferred a "proof" that any programmer could appreciate even those without good statistics skills. Therefore I coded a simulation that simply ties it in practice:Download from Gist
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | package collisions.test; import java.util.HashSet; import java.util.Set; import java.util.UUID; public class UUIDToHashcodeUniquenessTestMain { private final static int num_of_users = 1000 * 1000; private final static int num_of_stacks = 50; public static void main(String[] args) { int collisions = 0; for (int i = 0; i < num_of_users; i++) { collisions += calcCollisionsForUser(); } System.out.println("Had " + collisions + " collisions for " + num_of_users + " users"); } private static int calcCollisionsForUser() { int collisions = 0; Set<Integer> uuidSet = new HashSet<Integer>(num_of_stacks * 2); for (int i = 0; i < num_of_stacks; i++) { String uuid = UUID.randomUUID().toString(); Integer uuidHashcode = uuid.hashCode(); if (uuidSet.contains(uuidHashcode)) { collisions++; } uuidSet.add(uuidHashcode); } return collisions; } } |
The program comes back saying that a collisions aren't really something to worry about:
Iteration 0: Had 0 collisions for 1000000 users Iteration 1: Had 0 collisions for 1000000 users Iteration 2: Had 0 collisions for 1000000 users Iteration 3: Had 0 collisions for 1000000 users Iteration 4: Had 0 collisions for 1000000 users