‘Morning folks,
Today’s blog will be a bit out-of-order (there’s a subsequent blog that I haven’t finished that structurally precedes this one), but I hope it won’t throw too much wire into the hay-baler.
Persistence IDs
The problem we are trying to solve is how to identify objects within a system. Most of the time, we think of object identity as having two components:
- Object type and
- Some unique name or tag for a given object instance
Selecting the correct persistence ID scheme for your application can be quite a task, and people generally don’t give it much thought at the outset. Later on, when the initial ID scheme isn’t ideal for the application, developers have to perform an ID type migration, which for live systems is frequently one of the nastier modifications to make.
About 80% of the applications that I’ve seen and worked with just use sequential, database-generated integral IDs, and that mostly works for them. There are well documented problems with them, so we’ll avoid them altogether (except under very specific circumstances, which will be discussed on a case-by-case basis). They do have one desirable property that we’ll discuss below, viz. that they’re sorted.
UUIDs
Now that database-generated sequential IDs are out, what about UUIDs? It’s a good question, and I’ve used UUIDs (and seen them used) quite successfully in quite a few applications. But we’re not using them in Sunshower for the following reasons:
- They’re pretty ugly. I mean, URLs like
sunshower.io/orchestrations/123e4567-e89b-12d3-a456-426655440000/deployments/123e4567-e89b-12d3-a456-426655440000
aren’t great. We previously base-58 encoded our UUIDs to produce prettier URLs along the lines of sunshower.io/orchestrations/11W7CuKyzdu7FGXEVQvK/deployments/11W7CuKz27Y9ePpV2ju9. One of the problems that we encountered was that having different string representations of IDs inside and outside our application made debugging less straightforward than it needed to be.
- They’re pretty inconsistent across different databases and workloads.
— For write-intensive workloads, UUIDs as primary keys are a poor choice if you don’t have an auxiliary clustering index (which requires that you maintain 2 indexes per table, at least). Insertions into database pages will happen at random locations, and you’ll incur jillions of unnecessary page-splits in high-volume scenarios. On the other hand, adding the additional clustering index will incur additional overhead to writes.
— Index scans that don’t include the clustering index can perform poorly because the data are spread out all over the disk.
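For reference, the base-58 encoding mentioned above takes only a few lines. This is a minimal sketch and not the encoder we actually used: the class name UuidBase58, its method names, and the Bitcoin-style alphabet are all assumptions made for illustration.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper (not the Sunshower encoder): base-58 encodes raw ID bytes.
final class UuidBase58 {
    // Bitcoin-style alphabet without 0, O, I and l; the alphabet choice is an assumption.
    private static final String ALPHABET =
            "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";
    private static final BigInteger BASE = BigInteger.valueOf(58);

    // Lays a UUID out as 16 big-endian bytes.
    static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    static String encode(byte[] bytes) {
        BigInteger n = new BigInteger(1, bytes); // interpret the bytes as an unsigned number
        StringBuilder sb = new StringBuilder();
        while (n.signum() > 0) {
            BigInteger[] quotientAndRemainder = n.divideAndRemainder(BASE);
            sb.append(ALPHABET.charAt(quotientAndRemainder[1].intValue()));
            n = quotientAndRemainder[0];
        }
        // Each leading zero byte is written as the alphabet's zero digit ('1').
        for (int i = 0; i < bytes.length && bytes[i] == 0; i++) {
            sb.append(ALPHABET.charAt(0));
        }
        return sb.reverse().toString(); // digits were produced least-significant first
    }

    // Decodes into a fixed-width array; input that is too large is silently truncated.
    static byte[] decode(String text, int length) {
        BigInteger n = BigInteger.ZERO;
        for (char c : text.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("Not a base-58 character: " + c);
            }
            n = n.multiply(BASE).add(BigInteger.valueOf(digit));
        }
        byte[] raw = n.toByteArray(); // may carry a sign byte or be shorter than length
        byte[] out = new byte[length];
        int copy = Math.min(raw.length, length);
        System.arraycopy(raw, raw.length - copy, out, length - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        UUID uuid = UUID.fromString("123e4567-e89b-12d3-a456-426655440000");
        System.out.println(encode(toBytes(uuid)));
    }
}
```

The round trip is lossless, but the string in the URL no longer looks anything like the one in the database or the logs, which is exactly the debugging friction described above.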
So, is there a better way?
How about Flake?
Twitter back in the day encountered similar issues with ID selection, so they designed the Snowflake ID scheme. There is a 128-bit extension that minimizes the need for node-coordination, which is desirable in our case (especially since we were willing to tolerate 128-bit IDs for UUIDs). The layout of the ID is as follows:
- The first 64 bits are a timestamp (for a single-node without modifications to the system clock, monotonically increasing).
- 48 bits that identify the node (usually the MAC address of a network interface, though other schemes could be used)
- A 16-bit monotonically-increasing sequence that is reset every time the system-clock ticks forward. This is important because it places an upper limit on the number of IDs that can safely be generated in a given time-period (65,535/second). My implementation provides back-pressure, but this can cause undesirable behavior (contention) in very high-volume scenarios. To put this in perspective, Twitter accommodates an average of 6,000 Tweets/second, but even this would only consume about 10% of the bandwidth of our ID generation for a single node.
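To make the layout above concrete, here is a small sketch that packs the three fields into 16 bytes and reads two of them back. The class FlakeLayout, its method names, and the sample MAC address are made up for illustration; this is not our production code.

```java
import java.nio.ByteBuffer;

// Illustrative only: packs and unpacks the 64/48/16-bit layout described above.
final class FlakeLayout {
    static final int ID_SIZE = 16; // 8 timestamp bytes + 6 node bytes + 2 sequence bytes

    static byte[] pack(long timestamp, byte[] node, int sequence) {
        if (node.length != 6) {
            throw new IllegalArgumentException("node must be 6 bytes (48 bits)");
        }
        if (sequence < 0 || sequence > 0xFFFF) {
            throw new IllegalArgumentException("sequence must fit in 16 bits");
        }
        // ByteBuffer is big-endian by default, so the timestamp occupies the leading bytes.
        return ByteBuffer.allocate(ID_SIZE)
                .putLong(timestamp)
                .put(node)
                .putShort((short) sequence)
                .array();
    }

    static long timestampOf(byte[] id) {
        return ByteBuffer.wrap(id).getLong(0);
    }

    static int sequenceOf(byte[] id) {
        return ByteBuffer.wrap(id).getShort(14) & 0xFFFF; // mask to read the short as unsigned
    }

    public static void main(String[] args) {
        byte[] node = {0x00, 0x1B, 0x44, 0x11, 0x3A, (byte) 0xB7}; // made-up MAC address
        byte[] id = pack(System.currentTimeMillis(), node, 42);
        System.out.println(timestampOf(id) + " / " + sequenceOf(id));
        // Share of a 16-bit sequence space consumed by 6,000 IDs
        System.out.printf("%.1f%%%n", 100.0 * 6_000 / 65_535);
    }
}
```

Because the timestamp leads, IDs generated later on the same node compare greater byte-for-byte, which is how we get the sortedness of sequential IDs back. The last line of main checks the Twitter figure: 6,000 out of 65,535 is a bit over 9%.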
Our implementation
full source
I’m sorry. I’m pretty old-school. I like Spring and JPA (and even EJB!). Things like JPQL and transparent mapping between objects and their storage representations (e.g. tables) are important to me. I also super-like transactions, and I really really like declarative transaction management. Why? Because not having these things places a very high burden on development teams, and in my experience, reduces testability, frequently dramatically. Another requirement is that we be able to easily serialize IDs to a variety of formats, so we’ll make our ID JAXB-enabled. Here are the important parts of the Identifier class:
// not using @Embeddable because we will create a custom Hibernate type for this--that way we can use the same annotations for everything
@XmlRootElement(name = "id")
@XmlAccessorType(XmlAccessType.NONE)
public class Identifier implements Comparable<Identifier>, Serializable {

    static final transient Encoding base58 = Base58.getInstance(Default);

    @XmlAttribute
    @XmlJavaTypeAdapter(Base58ByteArrayConverter.class)
    private byte[] value;

    // no-argument constructor for JAXB and JPA
    protected Identifier() {
    }

    // package-private: identifiers are only created from within this package
    Identifier(byte[] value) {
        if (value == null || value.length != 16) {
            throw new IllegalArgumentException(
                "Argument cannot possibly be a valid identifier"
            );
        }
        this.value = value;
    }

    // other stuff
}
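One piece of the elided “other stuff” deserves a note: Identifier implements Comparable, and Java bytes are signed, so a naive byte-by-byte comparison would sort a timestamp byte of 0x80 before 0x7F. A sketch of an ordering that keeps IDs in generation order might look like this. IdOrdering is a hypothetical stand-in, not the actual compareTo, and it assumes Java 9 or later for Arrays.compareUnsigned.

```java
import java.util.Arrays;

// Hypothetical stand-in for the elided compareTo; not the actual Sunshower code.
final class IdOrdering {
    // Compares two IDs as unsigned big-endian byte strings (Java 9+).
    static int compare(byte[] left, byte[] right) {
        return Arrays.compareUnsigned(left, right);
    }

    public static void main(String[] args) {
        byte[] earlier = new byte[16];
        byte[] later = new byte[16];
        earlier[7] = 0x7F;      // low timestamp byte = 127
        later[7] = (byte) 0x80; // low timestamp byte = 128, which is -128 as a signed byte
        System.out.println(compare(earlier, later) < 0);        // unsigned ordering
        System.out.println(Arrays.compare(earlier, later) < 0); // signed ordering, for contrast
    }
}
```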
Now, ideally, we would be able to make value final. If an ID is created from thread A and somehow immediately accessed from thread B, final would guarantee that thread A and thread B would always agree on the value of value. Since neither JAXB nor JPA really works with final fields, we can’t do that. We could partially fix value’s publication by marking it volatile, but there are downsides to that as well. The solution that I’m opting for is protecting the creation of Identifiers by forcing the creation of IDs to occur within a sequence (note the protected and package-private constructors of Identifier):
public interface Sequence<ID extends Serializable> {
    ID next();
}
with a Flake ID sequence (full source: Flake ID):
@Override
public Identifier next() {
    synchronized (sequenceLock) {
        increment();
        ByteBuffer sequenceBytes = ByteBuffer.allocate(ID_SIZE);
        return new Identifier(
            sequenceBytes
                .putLong(currentTime)
                .put(seed)
                .putShort((short) sequence)
                .array()
        );
    }
}
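The increment() call is where the clock handling and back-pressure live, and it isn’t shown here. The following is a rough, self-contained sketch of how such a step could work, not the real implementation: FlakeSketch, the injected LongSupplier clock, the spin-wait, and the decision to throw when the clock runs backwards are all assumptions.

```java
import java.nio.ByteBuffer;
import java.util.function.LongSupplier;

// Hypothetical flake-style generator; names and clock handling are assumptions.
final class FlakeSketch {
    private static final int MAX_SEQUENCE = 0xFFFF; // largest 16-bit sequence value

    private final Object sequenceLock = new Object();
    private final LongSupplier clock; // injected so that tests can control time
    private final byte[] seed;        // 48-bit node identifier
    private long currentTime = Long.MIN_VALUE;
    private int sequence;

    FlakeSketch(LongSupplier clock, byte[] seed) {
        if (seed.length != 6) {
            throw new IllegalArgumentException("seed must be 6 bytes");
        }
        this.clock = clock;
        this.seed = seed.clone();
    }

    byte[] next() {
        synchronized (sequenceLock) {
            increment();
            return ByteBuffer.allocate(16)
                    .putLong(currentTime)
                    .put(seed)
                    .putShort((short) sequence)
                    .array();
        }
    }

    // Must be called while holding sequenceLock.
    private void increment() {
        long now = clock.getAsLong();
        if (now < currentTime) {
            throw new IllegalStateException("System clock moved backwards");
        }
        if (now == currentTime && sequence == MAX_SEQUENCE) {
            // Back-pressure: the sequence is exhausted, so wait for the next tick.
            while ((now = clock.getAsLong()) <= currentTime) {
                Thread.onSpinWait();
            }
        }
        if (now > currentTime) {
            currentTime = now; // the clock ticked forward: reset the sequence
            sequence = 0;
        } else {
            sequence++;
        }
    }

    public static void main(String[] args) {
        FlakeSketch flake = new FlakeSketch(System::currentTimeMillis, new byte[6]);
        System.out.println(flake.next().length); // prints 16
    }
}
```

The spin-wait is the contention mentioned earlier: every caller queues on the lock until the clock moves forward.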
Now, we’re guaranteed that sequences can be shared across threads, and we have several options:
- Each entity-type could be assigned its own sequence
- Entities can share sequences
We don’t really care too much about ID collisions across tables, and we can generate a ton of IDs quickly for a given sequence, so we’ll just default to sharing a sequence for entities:
@MappedSuperclass
@XmlDiscriminatorNode("@type")
public class AbstractEntity extends SequenceIdentityAssignedEntity<Identifier> {

    static final transient Sequence<Identifier> DEFAULT_SEQUENCE;

    static {
        DEFAULT_SEQUENCE = Identifiers.newSequence(true);
    }

    @Id
    @XmlID
    @XmlJavaTypeAdapter(IdentifierAdapter.class)
    private Identifier id;

    protected AbstractEntity() {
        super(DEFAULT_SEQUENCE);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof AbstractEntity)) return false;
        if (!super.equals(o)) return false;
        AbstractEntity that = (AbstractEntity) o;
        return id != null ? id.equals(that.id) : that.id == null;
    }

    @Override
    public int hashCode() {
        int result = super.hashCode();
        result = 31 * result + (id != null ? id.hashCode() : 0);
        return result;
    }

    @Override
    public String toString() {
        return String.format("%s{id=%s}", getClass(), id);
    }
}
MOXy only allows you to use String @XmlID values, so we need to transform our IDs to strings (hence the @XmlJavaTypeAdapter).
In the next blog post, we’ll demonstrate how to make Identifiers JPA native types!