Monday, October 17, 2016

Unreliable affected-rows with conditional upserts in MySQL

This post is about my attempts with affected-rows for MySQL conditional Upserts to be unreliable.

Image result for conditionalUPSERTs are tempting, with perks like single round trip to DB,  atomic properties, and simpler SQL client code.
Then, if your update depends on the existing row data, there's even Conditional UPSERT.
A conditional UPSERT can have one of three outcomes:
  1. Row is inserted as a new row
  2. An existing row/column is updated (condition evaluated True).
  3. Existing row/column wasn't updated (condition evaluated False).
The SQL client can tell the UPSERT outcome with the Affected-rows value.

For non-conditional UPSERTs the affected-rows value per row is 1 if the row is inserted as a new row, 2 if an existing row is updated, and 0 if an existing row is set to its current values.

For conditional UPSERTs I saw non consistent results. affected rows was 2 OR 3, when an existing row was updated (condition true), 2 when an existing row wasn't updated. 1 for new row is inserted.
Such inconsistent behavior could easily cause bugs. I didn't find a clear pattern of when it happens. I also didn't find proper documentation of what should is the expected affected-row value. 
Since the ON DUPLICATED KEY UPDATE uses an IF statement I suspect it loses track of the results of the IF statement. Also each column's condition can evaluated differently than others columns conditions.

Conclusion

without knowing what to expect of the # of affected rows, and UPSERTs being easily breakable in nature when switching columns update order). I decided to ditch them completely, and ended up implementing the conditional logic in Java client side, using more than one SQL command wrapped in a transaction.
 

 My conditional UPSERT (expect inconsistent affected rows value):

INSERT INTO account_last_touch
            (service_id, account_name, user_name, touch_time)
            VALUES('123', '456', 'u1', '2016-07-19 12:11:15')
ON DUPLICATE KEY UPDATE 
user_name
 =IF('2016-07-23 12:11:15'>touch_time, 'u1', user_name), 

touch_time = IF('2016-07-24 12:11:15'>touch_time, '2016-07-24 12:11:15', touch_time)

Tuesday, January 12, 2016

Is String.hashcode() unique enough?

Given a set of unique Strings is their set of String.hashcode() values unique enough?
Well... it depends on what you define as enough.
In my case below it was enough. Read how I assessed it.

It's clear that different inputs might map to the same hashcode value (2^32 different options), but what are the chances for it to happen?
I have one million users, each user owns 50 private items. An item is identified by a UUID.
I had these two conflicting goals:
(I) Represent each item as an integer instead of a UUID
(II) Avoid collisions. Any pair of items owned by the same user should resolve to a different hashcode.

What is enough: I could live with up to 10 users, out of a million, experiencing a collision. Most of these 10 users will never notice the collision. I assume the system will have other bugs with higher probably than that.
We're all unique

Assessing uniqueness

One way to asses is computing the statistical probability for such an event . But I preferred a "proof" that any programmer could appreciate even those without good statistics skills. Therefore I coded a simulation that simply ties it in practice:

Download from Gist
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
package collisions.test;

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class UUIDToHashcodeUniquenessTestMain {

 private final static int num_of_users = 1000 * 1000;
 private final static int num_of_stacks = 50;

 public static void main(String[] args) {
  int collisions = 0;
  for (int i = 0; i < num_of_users; i++) {
   collisions += calcCollisionsForUser();
  }
  System.out.println("Had " + collisions + " collisions for " + num_of_users + " users");
 }

 private static int calcCollisionsForUser() {
  int collisions = 0;
  Set<Integer> uuidSet = new HashSet<Integer>(num_of_stacks * 2);
  for (int i = 0; i < num_of_stacks; i++) {
   String uuid = UUID.randomUUID().toString();
   Integer uuidHashcode = uuid.hashCode();
   if (uuidSet.contains(uuidHashcode)) {
    collisions++;
   }
   uuidSet.add(uuidHashcode);
  }
  return collisions;
 }
}

The program comes back saying that a collisions aren't really something to worry about:
Iteration 0: Had 0 collisions for 1000000 users
Iteration 1: Had 0 collisions for 1000000 users
Iteration 2: Had 0 collisions for 1000000 users
Iteration 3: Had 0 collisions for 1000000 users
Iteration 4: Had 0 collisions for 1000000 users