Traditionally, persistence media has been a synonym for rotating disks — devices that are relatively naïve, providing basic read and write functionality with block-level atomicity. Rotating disks expose Logical Block Addresses (LBAs) that are statically mapped by the disk firmware to Physical Block Addresses (PBAs). The role of traditional disks has been limited primarily to providing power-safe durability, while the OS/hypervisor stack does the heavy lifting with regard to performance, crash consistency, redundancy, data services, recovery, etc.
Flash technology has been a key disruption for traditional enterprise storage tiers — it exhibits a two to three orders of magnitude improvement in latency and throughput. Besides the performance disruption of current NAND-based flash and future NVM incarnations (a topic for a future blog post), there is another important dimension with respect to the intelligence of these devices. As a brief background, NAND-based flash devices implement a Flash Translation Layer (FTL), which maps each LBA to a PBA. In contrast to traditional disks, this mapping is not static, but changes each time the block is written, i.e., out-of-place updates. This log-based FTL design arises from a physical necessity: a flash block must be erased (a slow operation) before any of its pages can be re-written.
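The out-of-place update behavior described above can be made concrete with a toy sketch. This is not any vendor's FTL — just a minimal, hypothetical illustration of a log-structured mapping table where each write of an LBA lands on a fresh physical page and the superseded page is marked stale for later garbage collection and erase:

```python
# Toy sketch of a log-based FTL: dynamic LBA -> PBA mapping with
# out-of-place updates (illustrative only, not a real device design).

class TinyFTL:
    def __init__(self):
        self.mapping = {}    # LBA -> PBA; dynamic, unlike disk firmware
        self.pages = {}      # PBA -> data (the append-only flash log)
        self.stale = set()   # pages awaiting garbage collection + erase
        self.next_free = 0   # head of the log

    def write(self, lba, data):
        # Out-of-place update: never overwrite; append to the log instead.
        if lba in self.mapping:
            self.stale.add(self.mapping[lba])  # old copy becomes stale
        pba = self.next_free
        self.next_free += 1
        self.pages[pba] = data
        self.mapping[lba] = pba
        return pba

ftl = TinyFTL()
first = ftl.write(7, b"v1")
second = ftl.write(7, b"v2")   # same LBA, new PBA
assert first != second         # the mapping changed on rewrite
assert first in ftl.stale      # old page is now eligible for erase
```

The key point for what follows: the device is already maintaining an ordered log of updates and a remapping table, which is exactly the machinery that log-structured file systems build in software.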
Given that flash devices are already internally running the equivalent of a log-based file system, can they be programmed to play a bigger role in the end-to-end IO stack? The answer (obviously) is yes: this is currently an active area of research in academia, with version 1.0 products (such as key-value Ethernet drives) already hitting the market. In retrospect, this technology wave of new interfaces and functionality offload reminds me of the Active Disk/NASD effort from CMU in the late ’90s!
Moving forward, I envision a phased evolution of the end-to-end IO stack to take advantage of these devices beyond their basic block capabilities. The critical catalysts for this evolution are standardization of functionality, as well as the rapid commoditization of features. My view of the three phases of the IO stack evolution is as follows:
Extensions to existing block IO standards: This is the lowest hanging fruit! IO stacks today implement some form of journaling or Copy-on-Write semantics to bridge the gap between the commit atomicity required at the application level and the block-level guarantee of the device. Growing support for T10/NVMe standardization of interfaces such as atomic vectored writes, sparse addressing, scattered writes, and gathered reads will certainly help simplify existing IO stacks. Academia (correctly so) is looking even further, into implementing transaction semantics with new techniques such as editable atomic writes.
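To see why atomic vectored writes simplify the stack, consider why journaling exists in the first place. The sketch below is a deliberately simplified, hypothetical model: the device guarantees atomicity only per block, so a multi-block update must be staged in a journal and made visible via a single commit record; a device-level atomic vectored write would let the stack skip this double-write entirely:

```python
# Minimal model of write-ahead journaling over a block device that
# only guarantees per-block atomicity (illustrative, not a real FS).

class Journal:
    def __init__(self):
        self.log = []    # journal area; each append models one atomic block write
        self.home = {}   # "home" locations of the data

    def commit(self, updates):
        for lba, data in updates.items():
            self.log.append(("data", lba, data))  # stage each block
        self.log.append(("commit", None, None))   # single atomic commit record

    def recover(self):
        # After a crash, replay only fully committed transactions.
        txn = {}
        for kind, lba, data in self.log:
            if kind == "data":
                txn[lba] = data
            elif kind == "commit":
                self.home.update(txn)  # transaction becomes visible as a unit
                txn = {}
        # An uncommitted tail (no commit record) is simply discarded.

j = Journal()
j.commit({10: b"a", 11: b"b"})
j.log.append(("data", 12, b"torn"))  # simulate a crash mid-transaction
j.recover()
assert j.home == {10: b"a", 11: b"b"}  # the partial write never surfaces
```

Every committed block is written twice (journal, then home location); an atomic vectored write pushed down to the device would collapse this into one write, which is precisely the appeal of standardizing such interfaces.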
Integration with the IO management plane: Besides data access operations, enterprise-grade solutions implement multiple data services that are “table stakes” for any solution today — for example, snapshots, cloning, deduplication, compression, encryption, and integrity checksums. These services are good candidates for offloading to the persistence media. For instance, the update lineage and ordering maintained by the FTL log can easily be used to implement fine-grained snapshots.
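A toy sketch of the snapshot idea: because the FTL already does out-of-place updates, taking a snapshot can amount to pinning the current LBA-to-PBA table — later writes go to new pages without disturbing the pinned ones. This is a hypothetical illustration, not a real device interface:

```python
# Toy FTL with snapshots: a snapshot pins the mapping table; out-of-place
# writes mean pinned pages are never overwritten (illustrative only).

class SnapshottingFTL:
    def __init__(self):
        self.mapping = {}     # live LBA -> PBA
        self.pages = {}       # PBA -> data (the flash log)
        self.snapshots = []   # each snapshot = a frozen copy of the mapping
        self.next_pba = 0

    def write(self, lba, data):
        self.pages[self.next_pba] = data  # append-only, out-of-place
        self.mapping[lba] = self.next_pba
        self.next_pba += 1

    def snapshot(self):
        # Metadata-only operation: copy the mapping, copy no data.
        self.snapshots.append(dict(self.mapping))
        return len(self.snapshots) - 1

    def read(self, lba, snap=None):
        table = self.mapping if snap is None else self.snapshots[snap]
        return self.pages[table[lba]]

ftl = SnapshottingFTL()
ftl.write(0, b"old")
s = ftl.snapshot()
ftl.write(0, b"new")
assert ftl.read(0) == b"new"
assert ftl.read(0, snap=s) == b"old"  # the snapshot sees pre-update data
```

Because the device's log already preserves old versions until garbage collection, the incremental cost of such a snapshot is essentially a copy of the mapping metadata — which is what makes this service attractive to offload.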
Emergence of next-generation data management interfaces: In this phase, we look beyond the traditional block model and instead expect application-centric interfaces — such as sparse file systems, KV stores, and data-intensive query offloading — as first-class interfaces provided by the persistence media. Essentially, a majority of the IO stack functionality runs within the device; this will be more applicable to specific use-cases than to general-purpose workloads.
In summary, with standardization and commoditization, the baseline for persistence media functionality is evolving. The software IO stack could certainly leverage these extended capabilities, based on their maturity and widespread adoption by IHVs. The evolution phases are not sequential; rather, they can proceed in parallel, depending largely on application requirements and on custom co-design of the IO stack with the hardware capabilities.