Let’s start with a thought exercise: imagine your application issues a write() call followed by a read(). What would you expect to get back in the read operation? Here are your options:
- Always get back the latest update
- Get back either the latest update or the previous value, but never partial updates
- Always get back the latest update if the read is issued from within the same session
- Once the latest value is returned by the read, the subsequent reads will never return the old stale value i.e., monotonic guarantee
- Bounded staleness i.e., reads issued after a given interval will always see the latest updates
- And so on…
All the options listed above are valid. The correct answer depends on the internal implementation of the storage system: namely, how it implements write buffering, atomicity granularity, data durability, the read-write protocol for accessing replicas, the ordering of updates, and so on. The point of this exercise is to illustrate the need for a “semantic contract” between the application developer and the underlying storage system. The contract defines what the application programmer should expect as the outcome of an operation, without being exposed to the complex nuts and bolts of the storage system.
POSIX is arguably the best-known de facto contract between application developers and the platform. The standard has been around since the late 1980s and was originally created to allow interoperability between *nix systems. While POSIX is a fairly broad standard defining both the syntax and the semantics of the contract, the focus of this blog is on the IO-related semantics of POSIX.
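As a concrete reference point for the opening thought exercise, here is a minimal sketch of what the POSIX contract promises for a regular file: once write() has returned successfully, a subsequent read() of the same bytes returns the data that was written. The helper name `write_then_read` is hypothetical, and the sketch assumes a local file system and a payload small enough to be written in one call.

```python
import os
import tempfile


def write_then_read(payload: bytes) -> bytes:
    """Write `payload` to a temporary file, then read it back through the same descriptor."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)              # assumed to complete in one call for small payloads
        os.lseek(fd, 0, os.SEEK_SET)       # rewind to the start of the file
        return os.read(fd, len(payload))   # expected to see the bytes just written
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(write_then_read(b"latest update"))
```

A scale-out store that buffers writes or serves reads from replicas may legitimately return something else here, which is exactly why the contract has to be stated explicitly.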
At the time when POSIX was defined, the hardware building blocks, enterprise applications, and data-center deployment models were all quite different. For instance, POSIX applications were mostly designed for scale-up (rather than scale-out) and relied on the following storage system semantics:
- Strong consistency in tracking file metadata attributes such as access time, size, etc.
- Traversal of the entire namespace (i.e., directory) hierarchy
- Mutual exclusion of read-write operations
- Strict serializability of concurrent updates
Fast forward to the Big Data era: applications are being designed from the ground up on scale-out, microservice principles. Traversing a namespace with billions of objects is the exception rather than the norm. The associated metadata attributes are expected to be eventually consistent, to avoid the performance overhead of synchronization. Likewise, there are different degrees of read-write and write-write serializability, chosen based on availability, performance, scaling, and application requirements.
To sum up, the POSIX world was envisioned as one-size-fits-all. The Big Data era, in contrast, deals with significant diversity in data volume, velocity, and variety. As such, applications need to trade off their storage contract requirements in exchange for better performance, scaling, and availability. Numerous blogs have highlighted the deficiencies of POSIX in addressing the evolving landscape of applications and storage architectures. There have also been unsuccessful attempts in the past (especially in the HPC community) to revisit the basic tenets of POSIX IO and propose new extensions to POSIX.
The objective of this post is not solely to highlight that POSIX is not a silver bullet (hopefully that point is clear by now!). The unanswered question remains: how can we create an interoperable semantic contract that lets Big Data applications run seamlessly across platforms as well as private and public cloud environments? Today, most Big Data applications (such as Hadoop, Cassandra, MongoDB, etc.) accomplish interoperability by bundling a scale-out storage layer that runs on local disks and file systems. The sprawl of one-off bundles is not sustainable in the long run with respect to maintenance and deployment; we need a single platform that can instead support a wide variety of semantic contracts.
To address this question, I would like to propose a different perspective on POSIX: instead of treating POSIX as a one-size-fits-all contract, we treat it as a blueprint for defining contracts. In other words, we extract the different dimensions that POSIX standardizes and, instead of a single hard-coded semantic behavior, allow a range of semantic models to be defined for each dimension. In the proposed model, the storage system advertises the semantics it supports for each dimension; an application is interoperable with the storage platform if the guarantees advertised by the storage system meet or exceed the application’s minimum required guarantees across all dimensions.
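The matching rule can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the dimension keys, the enum levels, and the `is_interoperable` helper are hypothetical, and the sketch assumes the levels within each dimension can be ranked from weakest to strongest.

```python
from enum import IntEnum


class MetadataConsistency(IntEnum):
    """Hypothetical levels, ordered weakest to strongest."""
    EVENTUAL = 0
    STRONG = 1


class WriteWrite(IntEnum):
    """Hypothetical levels, ordered weakest to strongest."""
    LAST_WRITER_WINS = 0
    MUTUAL_EXCLUSION = 1


def is_interoperable(required: dict, advertised: dict) -> bool:
    """True if every required dimension is advertised at an equal or stronger level."""
    for dimension, minimum in required.items():
        offered = advertised.get(dimension)
        if offered is None or offered < minimum:
            return False
    return True


# A storage system advertising one level per dimension.
storage = {
    "metadata_consistency": MetadataConsistency.EVENTUAL,
    "write_write": WriteWrite.MUTUAL_EXCLUSION,
}

# Two applications stating their minimum requirements.
app_a = {
    "metadata_consistency": MetadataConsistency.EVENTUAL,
    "write_write": WriteWrite.LAST_WRITER_WINS,
}
app_b = {"metadata_consistency": MetadataConsistency.STRONG}

if __name__ == "__main__":
    print(is_interoperable(app_a, storage))  # storage meets or exceeds every requirement
    print(is_interoperable(app_b, storage))  # storage is too weak on metadata consistency
```

A total order per dimension keeps the check trivial; a real scheme might need a partial order where some models are simply incomparable.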
To illustrate the concept of range semantics, consider the read-write serialization example introduced at the beginning of this blog. Leslie Lamport defined a classic taxonomy for wait-free coherence models. The taxonomy defines three semantic models:
- Safe registers: A read overlapping a write can return an arbitrary value i.e., non-atomic
- Regular registers: A read overlapping a write can either return the old value or the new value
- Atomic registers: The monotonic guarantee that if a read returns a new value, the subsequent read cannot return an older value
So, in a world where POSIX is a blueprint, if the application was implemented assuming a Regular register model, any storage system that advertises Regular or Atomic for read-write serialization would be considered interoperable for this dimension.
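To make the three models concrete, the sketch below classifies a history of reads that overlap a single write. It is an illustration under simplifying assumptions that are not in the taxonomy itself: one writer changes the value from `old` to `new` exactly once, the two values differ, and the reads are issued one after another. The function name `strongest_model` is hypothetical.

```python
def strongest_model(old, new, observed):
    """Return the strongest register model that a read history does not violate.

    Assumptions: a single write changes the value from `old` to `new`,
    `old != new`, and `observed` lists the values returned by sequential
    reads that overlap that write.
    """
    # Any value other than old or new is only permitted by a safe register.
    if any(value not in (old, new) for value in observed):
        return "safe"

    seen_new = False
    for value in observed:
        if value == new:
            seen_new = True
        elif seen_new:
            # The old value came back after the new one was already returned,
            # which a regular register allows but an atomic register does not.
            return "regular"
    return "atomic"


if __name__ == "__main__":
    print(strongest_model(1, 2, [1, 2, 2]))
    print(strongest_model(1, 2, [1, 2, 1]))
    print(strongest_model(1, 2, [1, 7, 2]))
```

An application written against the Regular model tolerates the second history, so it runs unchanged on a store that only ever produces histories like the first.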
The following is a list of key dimensions (not exhaustive) that POSIX IO defines, along with a few that are missing from it. The description below uses the term object in a generic fashion for the entities exposed by the storage system. Also, these semantics can be defined differently for sync versus async IO operations.
- Namespace schema: Defines the rules for naming objects as well as the associated hierarchy
- Object Addressability: Defines the semantics for addressing the update within the object. In the current POSIX model, the addressability is a single flat “stream of bytes,” where the addressing is a tuple of object and offset address. In contrast, a “vector of bytes” or a record-based model is more intuitive for Big Data applications
- Update Atomicity: Guarantees that the result of an object update is either visible in its entirety or not at all. This dimension also defines the atomicity granularity, which could be sector-, block-, or object-level, etc.
- Granularity of Ordering: Defines the granularity at which the storage system will serialize the read and write operations. POSIX actually does not define ordering semantics. There are interesting taxonomy proposals for ordering IO operations on a per-object, per-replica, or namespace-wide basis
- Read-Write Serialization: Defines the behavior when concurrent read and write operations are issued for the same record. This was covered earlier in the blog.
- Write-Write Serialization: Defines how concurrent write-write operations are handled. POSIX today defines mutual exclusion semantics. Relaxed alternatives are Last Writer Wins semantics or Versioned updates
- Separation of Ordering and Durability: This was a proposed extension to POSIX where the application is notified when the update is buffered, and then separately when the data is actually made persistent on durable media
- Metadata consistency: Allows explicitly calling out the consistency of system metadata (such as size, access times) that is associated with the data objects. POSIX enforces strong consistency for metadata
- Transactions: Defines whether the storage system supports ACID-like semantics across multiple storage objects. The transactions can be further specialized into read-only transactions, etc.
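As an illustration of how one of these dimensions can take a range of values, the sketch below contrasts two relaxed write-write models. Both classes are hypothetical and in-memory only. Writer-supplied timestamps are assumed for Last Writer Wins, and "versioned updates" is interpreted here as a conditional update that succeeds only if the caller saw the current version, which is one possible reading rather than a definition.

```python
class LastWriterWinsStore:
    """Keeps the value carrying the highest timestamp; older writes are dropped silently."""

    def __init__(self):
        self._data = {}  # key -> (timestamp, value)

    def put(self, key, value, timestamp):
        current = self._data.get(key)
        if current is None or timestamp >= current[0]:
            self._data[key] = (timestamp, value)

    def get(self, key):
        entry = self._data.get(key)
        return None if entry is None else entry[1]


class VersionedStore:
    """Accepts a write only if the caller names the version it last read."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        # Version 0 with no value means the key has never been written.
        return self._data.get(key, (0, None))

    def put(self, key, value, expected_version):
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            return False  # a concurrent writer got there first
        self._data[key] = (version + 1, value)
        return True


if __name__ == "__main__":
    lww = LastWriterWinsStore()
    lww.put("k", "a", timestamp=2)
    lww.put("k", "b", timestamp=1)  # arrives later but carries an older timestamp
    print(lww.get("k"))

    vs = VersionedStore()
    version, _ = vs.get("k")
    print(vs.put("k", "a", version))  # first writer succeeds
    print(vs.put("k", "b", version))  # second writer used a stale version
```

With Last Writer Wins the losing write disappears without notice, whereas the versioned store reports the conflict and leaves the retry decision to the application.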
To summarize, POSIX IO has been extremely valuable over the last several decades as a contract between the application and storage. POSIX is also relevant for the Big Data era, but a one-size-fits-all contract is not flexible enough given the diversity of applications and infrastructure models. Instead of trying to standardize on a single semantic model, the community should aim to use POSIX (with extensions) as a blueprint to interoperate in the wild west of non-POSIX systems.