Introduction
Large-scale scientific applications, including simulations, experiments, and observations, generate tens of petabytes of data objects, and these volumes are forecasted to grow even further [1]–[3]. The critical attributes required by such applications include parallel I/O for high performance and minimal I/O latency when accessing data objects from storage systems [4]. In addition, scientific applications, whether running on a single server, small clusters, or HPC systems, all create, modify, and process data objects in memory [5]. A bottleneck arises between storage and memory because data must be loaded into memory from slow storage.
Memory centric computing (MCC) has recently emerged to overcome such memory and storage bottlenecks [6]. The HPC community has attempted to adopt MCC by enabling a shared memory storage abstraction across hundreds of compute nodes [6]–[9]. Thus, upcoming larger MCC infrastructures are expected to be equipped with an array of persistent memory devices co-located with DRAM on each node or shared among all the nodes via high-speed interconnects such as Gen-Z [6] and InfiniBand [6]–[8]. In MCC, the nodes are equipped with non-volatile memories (NVMs), such as Intel Optane DC Persistent Memory (PM), which offer high capacity at low cost, byte-addressability, low idle power, persistence, and performance closer to DRAM than SSDs or disks [10]–[12]. A single machine can be equipped with up to 6 TB of PM, providing an opportunity to build rack-scale shared memory pools for scientific computing applications [9], [11], [13]. Recently, several studies have shown PM as a full or partial substitute for DRAM [14], [15]. For instance, pVM [14] employs NVRAM to seamlessly expand virtual memory for memory-intensive applications. Similarly, [16] proposed a data-centric OS based on PM. Due to these properties, PM is considered a major contender for the future main memory fabric and MCC [9], [10].
Therefore, given its properties, PM offers an opportunity to store and manage millions or even billions of objects beyond the application lifetime in a shared memory pool [17]–[19]. Managing application memory objects on a shared PM pool enables multiple benefits, i.e., i) low access latency, ii) low serialization and deserialization overhead, and iii) efficient computation via direct byte-addressability. We refer to application objects on PM as Persistent Memory Objects (PMOs). Such a PM application model also brings the opportunity to enable PM-level object sharing across different users, scientists, and applications to facilitate effective scientific collaboration.
Unfortunately, the PM application model stated above creates new data and metadata management challenges. First, there is a need to ensure data and metadata consistency, i.e., data must be modified atomically when moving from one consistent state to another, and applications should be able to access PMOs after a crash or ungraceful shutdown [20]–[23]. Second, scientific application data objects are self-described and packed in versatile scientific data formats, i.e., metadata is embedded inside the data object [24]–[26]. Without additional descriptive metadata, a PMO may become unidentifiable, siloed, and in general not useful to either the scientists who own the data or the broader scientific community. Third, where and how to manage, store, and associate object metadata along with user-defined custom metadata is challenging. It is common practice in the scientific community to tag or annotate data objects with additional descriptive metadata so collaborators can better understand the data [2], [25], [27]. Fourth, with ever-changing data analysis scenarios, selecting a subset of PMOs from millions or billions of PMOs in a shared PM pool based on metadata or user-defined tags becomes highly challenging without additional indexing [27]–[30]. These challenges drive the need for effective scientific metadata search and query services on top of a persistent memory object storage abstraction.
An approach to building a high-performance scientific metadata search service on top of the PM pool is to store, index, and query all of its metadata, and at least the significant data, in main memory [25], [26], [31], [32]. Although in-memory scientific data management solutions exist [25], [30], [31], they suffer from a lack of failure tolerance and recurring recovery costs in case of failures. MIQS [25] is the state-of-the-art research, offering effective in-memory metadata indexing and querying for self-described scientific data formats such as HDF5 [33] and netCDF [34]. It extracts metadata in the form of key-value attributes and maintains DRAM-resident indexes, which cannot survive power failures or crashes and must be rebuilt after every restart.
In this paper, we propose MOSIQS, an application framework that enables applications, scientists, and researchers to create, modify, search, and delete memory objects on a large shared PM pool. A PMO is a self-described object, i.e., an object can contain a single value, a multi-dimensional array, or a composite value, similar to data objects in scientific formats such as HDF5 and netCDF. We design MOSIQS based on the key idea that memory objects on the PM pool can live beyond the application lifetime and can become the sharing currency for applications and scientists. Moreover, providing controls and annotations for memory objects brings a more friendly storage model to scientific computing environments.
Our key contributions in this paper are:
We propose an application framework for PM to store and access memory objects via persistent pointers beyond the application lifetime and to share objects across applications, scientists, and collaborators with flexible data sharing controls (Section III). We design a self-described metadata object and use PMEMKV [35], a lightweight persistent memory key-value store, to manage PMO metadata.
To enable efficient multi-attribute metadata search and querying for scientific data objects resident in the PM pool, we introduce the Group Split-and-Merge (GSM) index data structure (Section IV). GSM dynamically splits and merges groups based on sharing frequency, application/user-defined thresholds, and query patterns to minimize the query search space. Further, GSM reduces the additional per-object index metadata storage overhead.
For effective storage and easier data sharing, we provide a namespace abstraction. Such an abstraction enables a process to share its PMOs with other processes accessing the namespace. We also provide post-storage attribute tagging and annotation of PMOs and enable indexing on such application- or user-defined metadata annotations (Section III).
We develop a prototype implementation of the proposed PM application framework using Intel's PMDK [20]. We conduct preliminary evaluations on an Intel many-core server equipped with 1.5 TB of real Intel Optane DC 3D-XPoint PM (Section V). Experimental results show that MOSIQS gains a 100% performance improvement compared to the PM-aware file system approach and executes multi-attribute queries efficiently with $2.7\times$ less index storage overhead.
Background and Motivation
In this section, we briefly describe self-describing scientific data formats. Then, we present background on PM and elaborate on the need for an object storage abstraction and search services.
A. Scientific Data Formats
Many applications from various scientific domains store experimental, simulation, and analytical data in domain-specific scientific data formats such as HDF5 [33], netCDF [34], FITS [36], Plot3D [37], and GRIB [38] in the underlying parallel file system. These scientific data formats are often referred to as self-described and self-contained, i.e., the metadata is stored alongside the data objects [26]. This metadata enables the use and reuse of scientific data [30]. HDF5 is one of the most widely used data formats in the scientific community. It stores data in a binary file organized hierarchically for high-performance access [25]. We show the HDF5 internal layout in Table 1. Several optimizations and improvements have been made on top of HDF5 in recent years, such as the HDF5 virtual object layer (HDF5-VOL), which enables the HDF5 data model on different storage backends [39]. The Baryon Oscillation Spectroscopic Survey (BOSS) from the Sloan Digital Sky Survey (SDSS) [40] typically produces a single data file per object and stores it in the FITS format [36]. The FITS API originally lacks parallel I/O and data query support and relies externally on the HDF5 library [28]. The H5Boss [28] tool converts multiple FITS files to the HDF5 format to achieve parallel I/O and data querying.
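To make the self-described layout concrete, the following minimal HDF5 C-API sketch (illustrative only; the file name, group, dataset, and attribute are hypothetical) writes a 2-D dataset and attaches a descriptive <key,value> attribute to it, so the metadata travels with the data object:

```c
#include <hdf5.h>

int main(void) {
    hsize_t dims[2] = {1024, 3};   /* e.g., 1024 particles x 3 coordinates */
    double  temp    = 273.15;      /* descriptive metadata value           */

    /* Create a self-described HDF5 file: data and metadata live together. */
    hid_t file  = H5Fcreate("particles.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t grp   = H5Gcreate2(file, "timestep0", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(grp, "position", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Attach a <key,value> attribute ("temperature") directly to the dataset. */
    hid_t aspace = H5Screate(H5S_SCALAR);
    hid_t attr   = H5Acreate2(dset, "temperature", H5T_NATIVE_DOUBLE, aspace,
                              H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_DOUBLE, &temp);

    H5Aclose(attr); H5Sclose(aspace);
    H5Dclose(dset); H5Sclose(space); H5Gclose(grp); H5Fclose(file);
    return 0;
}
```

A consumer that knows nothing about the producing application can still discover the group, the dataset shape, and the temperature annotation directly from the file, which is the property MOSIQS preserves for PMOs.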
B. Memory Centric Computing
Memory centric computing (MCC) has emerged recently to satisfy the requirements of memory-intensive scientific computing applications [9]. The MCC architecture benefits scientific applications in many ways. First, MCC provides high storage capacity and can hold large scientific datasets that traditionally could not fit in memory. Second, MCC mitigates the performance gap between storage and memory, i.e., fast computation is provided on large in-memory datasets. Third, MCC enables in-memory data sharing across applications and processes. In particular, MCC operates on the memory-first principle, i.e., the data resides in memory to deliver in-memory speeds and thus tremendous performance. In MCC, each node is equipped with a storage-class non-volatile memory such as Intel Optane DC PM. PM technology can potentially reduce the latency and increase the bandwidth of I/O operations by orders of magnitude, but fully harnessing the device capability requires overcoming the legacy I/O stack of disk-based storage systems [11]. A few studies have enabled the use of PM in scientific applications, e.g., NV-Process [41] proposed a fault-tolerant process model based on PM that provides an elegant way for applications to tolerate system crashes. Similarly, [17] evaluates different fault-tolerance approaches for porting scientific applications to PM. DAOS-M [18] employs PM to store metadata and small writes, whereas larger writes are redirected to NVMe SSDs. Similarly, [16] proposed a data-centric OS based on PM.
C. Serialization/Deserialization on PM
In the conventional scientific computing model, the application relies on the CPU to deserialize file contents into memory objects. Such an approach requires the application to first load raw data into system main memory from storage. The CPU then parses and transforms the file data into objects at other main memory locations for the rest of the computation [5], [9]. Such deserialization takes up almost 64% of the application's total execution time [5], [42]. Our previous work [9] provides a conceptual overview of a scientific computing model based on PM, where application objects persist in the PM address space and computation is performed on them directly, avoiding additional serialization and deserialization operations. Such a PM-based storage and computing model also minimizes the decades-old file system I/O stack overhead (paging, context switching, kernel code execution).
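As an illustration of the latter model, the sketch below uses PMDK's libpmemobj (which MOSIQS builds on) to map a persisted particle array and update it in place, with no read()-and-parse step between storage and computation; the pool path, layout name, and data layout are hypothetical:

```c
#include <libpmemobj.h>

/* Illustrative persistent layout: a counter plus an energy array. */
struct particles {
    size_t count;
    double energy[1024];
};

int main(void) {
    /* Open (or create on first run) an object pool on a DAX-mounted PM device. */
    PMEMobjpool *pop = pmemobj_open("/mnt/pmem0/particles.pool", "particles");
    if (pop == NULL)
        pop = pmemobj_create("/mnt/pmem0/particles.pool", "particles",
                             PMEMOBJ_MIN_POOL, 0666);
    if (pop == NULL)
        return 1;

    /* The root object persists across runs; no deserialization is needed. */
    PMEMoid root = pmemobj_root(pop, sizeof(struct particles));
    struct particles *p = pmemobj_direct(root);

    /* Compute directly on the byte-addressable, persistent data. */
    for (size_t i = 0; i < p->count; i++)
        p->energy[i] *= 1.01;
    pmemobj_persist(pop, p->energy, sizeof(p->energy));

    pmemobj_close(pop);
    return 0;
}
```

The same objects can be re-attached by a later run or another tool, which is the behavior MOSIQS generalizes with named, shareable PMOs.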
D. Object Management on PM
Employing PM directly for legacy scientific applications is challenging, as existing applications are built on the notion of a block-based file system interface, which is a clear mismatch with byte-addressable PM hardware. A simple solution is to deploy a PM-aware file system and let applications use PM through it, but as reported in [13], ext4-DAX [43], specially designed for PM, incurs up to
E. Motivation
Arguably, storing application objects directly on persistent memory without a file system interface provides multiple benefits, such as faster storage without file system overhead and direct computation. But it poses several challenges at the same time. First, data sharing across applications and other collaborators is an essential requirement of the scientific community [27], [44]. It is challenging to access, select, and share a PMO without additional descriptive metadata, as object access and sharing require object semantics such as the object name, size, and owner provided by the application, user, or scientist, whereas PMOs are simple memory-allocated objects and can only be accessed and shared via persistent pointers. For instance, with Intel's PMDK
The
To this end, we build an application framework with an object storage abstraction on top of the shared PM pool. The proposed application model employs PMDK-provided transactions to ensure atomicity and consistency. The metadata is managed in a lightweight persistent key-value store. For multi-attribute indexing, search, and querying, we introduce the Group Split and Merge (GSM) index data structure. GSM minimizes the query search space and storage overhead by grouping memory objects based on a specific set of shared attributes.
MOSIQS: Design and Implementation
In this section, we present our key design goals, target architecture and system overview.
A. Design Goals
Our key design goals include:
Simple and Generic Storage Model: MOSIQS should have a simple, generic, and schema-less storage model to ensure compliance with diverse scientific formats and applications, i.e., persistent memory objects should be orthogonal to any domain-specific datatype or format.
High-Performance and Scalability: One critical goal of MOSIQS is to meet the performance and scalability requirements of scientific applications by fully exploiting the underlying hardware architecture, i.e., Shared PM Pool. Furthermore, MOSIQS should be capable of handling concurrent workloads in a scalable manner while ensuring the correctness of individual transactions.
Multi-attribute Indexing and Query Support: Self-described scientific data formats such as HDF5 and NetCDF contain additional descriptive metadata. Oftentimes data is retrieved based on additionally stored metadata. Thus, MOSIQS should provide a capability to search based on object metadata.
Flexible Data Sharing and Controls: Another important goal of MOSIQS is to facilitate scientists and researchers with easier data sharing controls, i.e., ability to export or publish a particular PMO or a collection of PMOs based on certain criteria with other scientists and collaborators. Such PMO sharing also minimizes data movement overhead.
Application and User-provided Hints: The scientific-friendly storage model provided by MOSIQS intends to empower scientists and applications to pass hints to control and set application specific configurations such as creating a shareable and read-only PMO, post-storage tagging of PMO, and excluding metadata extraction or index construction for specific PMOs.
B. Target Architecture
MOSIQS is a PM object storage framework providing scalable data management and a metadata search service for scientific applications. MOSIQS's target architecture is an array of PM devices distributed across hundreds of compute nodes. PM on each compute node is shared with other compute nodes through a shared PM pool abstraction over a high-speed fabric-attached memory (FAM) interconnect such as Gen-Z [6], [45]. Figure 2 depicts a high-level architectural overview of MOSIQS. Multiple compute nodes can create a shared namespace abstraction atop the shared PM pool via the MOSIQS library and directly store and manage memory objects in these namespaces. Multiple processes running on these compute nodes can access and share PMOs via the namespace abstraction. Figure 2 shows a process running at compute node 2 accessing two PMOs from namespaces 1 and 2. To enable a memory-level object storage abstraction for applications, we employ Intel's PMDK provided
MOSIQS target architecture and memory object storage abstraction integrated with scientific discovery service.
C. System Overview
MOSIQS primarily consists of the following key abstractions: Shared memory pool, Namespace manager, Metadata extractor, Sharing manager, Group manager, Index manager, and Query manager. Figure 3 presents the multi-layered architecture of MOSIQS. The bottom layer is the shared PM pool, which aggregates all PM devices and exposes them as a single PM pool. Next, the PMDK [20] layer provides low-level primitives, e.g., transactional and reliable object manipulation, via
The metadata extraction and storage management layer is stacked on top of the bottom layer. The metadata extractor is responsible for extracting and populating the object name and
D. Data Model
MOSIQS data model consists of three major building blocks.
Persistent Memory Object (PMO): A PMO is a self-described entity and represents a single value, an array, or a compound datatype. It can be created by an application or a user. A PMO is placed in a group, and additional annotations and hints can be specified. In MOSIQS, a PMO is the minimum sharing currency between applications and users. A PMO requires several properties to be supported: crash consistency to ensure a consistent state, system naming, and permission controls to enable a PMO to be discovered and shared with other processes and collaborators.
Group: A group represents a collection of PMOs that share common properties and attributes. MOSIQS supports inclusive relationships between groups, i.e., a group can have nested groups, similar to nested directories in file systems. Specifically, the group provides a logical organization and a way to store and share a collection of PMOs. A key property of a group is its Sharing Frequency (SF), which we discuss in Section IV.
Attribute: An attribute is a <key,value> pair which enables annotations, user-defined tags, and properties of groups and objects. Our attribute concept is the same as attributes in scientific data formats, i.e., HDF5 and netCDF. Each attribute also maintains an SF, which we explain in Section IV.
The prior version of this work [9] shows an example application for group and PMO creation with attribute annotations in Listing 1.
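A possible layout of these three building blocks is sketched below in C; the field names and sizes are our own illustration under the data model described above, not MOSIQS's exact definitions:

```c
#include <stdint.h>
#include <stddef.h>
#include <libpmemobj.h>   /* PMEMoid: persistent pointer to data on the PM pool */

/* Attribute: a <key,value> annotation; attr_sf counts how many PMOs in the
 * enclosing group are associated with this attribute (Section IV). */
struct attribute {
    char     key[64];
    char     value[64];
    uint32_t attr_sf;
};

/* PMO: a self-described object; the payload (single value, array, or
 * compound datatype) is reached through a persistent pointer. */
struct pmo {
    char     name[64];    /* system naming for discovery and sharing */
    uint32_t owner;       /* permission controls                      */
    uint32_t perm;
    PMEMoid  data;        /* persistent pointer to the object payload */
};

/* Group: a collection of PMOs sharing attributes; group_sf is the number
 * of attributes the group holds. Nested groups mirror nested directories. */
struct group {
    char     name[64];
    uint32_t group_sf;
    size_t   nattrs, nobjects;
    PMEMoid  attrs;       /* array of struct attribute  */
    PMEMoid  objects;     /* array of member PMO OIDs   */
    PMEMoid  subgroups;   /* nested child groups        */
};
```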
E. Shared Persistent Memory Pool
The shared persistent memory pool empowers MOSIQS to provide applications with a collective view and the aggregate capacity of an array of PM devices. This satisfies the intense capacity demands of scientific applications [15]. Internally, MOSIQS creates the shared PM pool via
1) Namespace Management
MOSIQS provides a namespace abstraction atop its data model to enable easier storage for applications using a shared PM pool. A namespace in our design is analogous to the memory address space of a process, except that it is persistent and stays beyond the application lifetime. Each namespace has its own metadata KV storage engine to store and locate the PMOs inside the namespace. Applications or scientists using a shared PM pool can access PMOs in another namespace, provided they are aware of the namespace metadata such as its name, owner, and access permissions. Such namespace management offers an easier and simpler storage model per application or scientist.
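Although the exact pooling mechanism is not shown here, one plausible way to aggregate several PM devices into a single shared pool with PMDK is a poolset file, which libpmemobj accepts in place of a single pool path; the device paths and sizes below are hypothetical:

```
PMEMPOOLSET
OPTION SINGLEHDR
400G /mnt/pmem0/mosiqs.part0
400G /mnt/pmem1/mosiqs.part1
400G /mnt/pmem2/mosiqs.part2
```

Under this assumption, each namespace could map to its own pool (or layout) created over such a set, keeping its metadata KV engine alongside the PMOs it names.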
F. Metadata Extraction and Storage
1) Metadata Extraction
A general design technique that proved crucial for MOSIQS is simplifying and minimizing the number of operations in the critical I/O path. The key idea behind extracting and storing PMO metadata and user/application-annotated tags is to enable sharing and to build indexes for quick access, efficient retrieval of PMOs, and future analysis. The metadata extractor is implemented as a service by which application- or user-annotated tags can be extracted from a group or PMO. It creates a single metadata KV object for each PMO or group and inserts it into the pool KV store. MOSIQS defines its own layout of the metadata object for PMOs and groups.
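For instance, a minimal sketch of inserting one extracted PMO metadata record into PMEMKV through its C bindings could look as follows; the pool path, the serialized value layout, and the choice of the "stree" engine (pmemkv's persistent B+-tree engine, which we assume matches the B+-tree engine mentioned in Section V) are assumptions:

```c
#include <libpmemkv.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    pmemkv_config *cfg = pmemkv_config_new();
    pmemkv_config_put_string(cfg, "path", "/mnt/pmem0/pool_kv");
    pmemkv_config_put_uint64(cfg, "size", 1024ULL * 1024 * 1024);
    pmemkv_config_put_uint64(cfg, "force_create", 1);   /* create on first use */

    pmemkv_db *db;
    if (pmemkv_open("stree", cfg, &db) != PMEMKV_STATUS_OK) {
        fprintf(stderr, "pmemkv_open failed: %s\n", pmemkv_errormsg());
        return 1;
    }

    /* Key: the PMO's OID; value: extracted metadata serialized as a string. */
    const char *oid  = "OID:000042";
    const char *meta = "name=/run1/position;owner=1001;attrs=temperature,pressure";
    pmemkv_put(db, oid, strlen(oid), meta, strlen(meta));

    pmemkv_close(db);
    return 0;
}
```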
Figure 4 shows an overview of the extracted and stored metadata KV objects of both types, the PMO metadata object and the group metadata object, in the pool KV store. The OID denotes the PMO metadata object, whereas the GID refers to the group metadata object. The value in
To ensure the consistency of metadata extraction, we encapsulate each operation as a transaction backed by a logging approach. To minimize the performance degradation, we perform metadata extraction in the background and
2) Object Sharing Controls
We design MOSIQS aiming to make it as simple as possible for scientists and applications to enable fast memory-level object sharing. PMDK [20] provides persistent pointers, i.e.,
To further ease sharing controls and bring them closer to POSIX-like permission controls, a group can be marked as a shared group, which minimizes the data sharing overhead, i.e., akin to sharing a directory in a file system rather than individual files. An application or a collaborator initiates a sharing request for a group. In such a case, the sharing manager validates the group scope and properties from the pool KV store. If the group is annotated with a global and shared scope, the sharing manager returns the list of OIDs enclosed in the group data part to the requesting application or collaborator. Note that the group-level abstraction provides file-system-like semantics, e.g.,
G. Metadata Search and Query
MOSIQS enables scientists and applications to search and query PMOs using the metadata stored in the pool key-value store. However, despite the simpler API for metadata storage, an inherent limitation of key-value stores is the lack of multi-attribute or multi-dimensional search queries, which are a common need in scientific applications and communities [2], [27], [30], [46]. Scientific data is largely unstructured and contains a lot of descriptive metadata in the form of key-value pair attributes [26], and retrieving the desired dataset usually depends on multiple such metadata attributes [30], [47]. Besides, such retrieval often includes additional tags and annotations in the search query provided by scientists and applications [2], [26]. Hence, persistent memory objects also require such multi-attribute metadata search services to accelerate the performance of scientific applications. However, simply utilizing existing index data structures for billions of memory objects on an aggregate memory pool entails a large query search space and index metadata storage overhead.
To overcome the large query search space and index storage overhead challenges, we introduce and embed a persistent yet simple index data structure on top of PMEMKV [35] to improve scientific queries spanning multiple metadata attributes. We present the proposed persistent index data structure in Section IV.
H. PMO Portability and Mobility
MOSIQS offers high portability and mobility of PMOs, i.e., PMOs can be packed into any scientific data format and flushed to underlying storage for later use. Therefore, we consider the portability and mobility of PMOs in MOSIQS to be favorable for scientific applications when compared to domain-specific solutions. However, how objects are drained to the underlying storage system is beyond the scope of this study.
GSM: Multi-Attribute Indexing and Querying
Group Split and Merge (GSM) is a dynamic index data structure designed for PMO metadata indexing and querying. It leverages the property that memory objects share many common attributes and that queries can benefit from such shared associations. Realistically, a scientific object can be associated with more than a hundred attributes. For example, MIQS [25] reported that an HDF5 file object contains more than 200 unique attributes, which overlap with other objects in the same file. We elaborate with an example of scientific objects associated with multiple attributes; for simplicity, we use only 3 attributes per object. Figure 5(a) presents multiple objects with multiple attributes. The white-box attributes are unique to each object, whereas the colored boxes are duplicate attributes common among objects, as shown in Figure 5(a). Note that, as mentioned earlier (Section III-D), an attribute is a key-value pair, and the same attribute across many objects does not necessarily have the same value, i.e., the attribute value can vary from object to object in the pool key-value store.
Representation of PMOs and overview of group split and merge index tree. SF denotes the sharing frequency.
To transform such associations into the GSM index, we extract metadata and associated properties/tags from each object when it is created and pass them to the index manager for further processing. The index manager determines the location of the object in the GSM tree based on the object's associated attributes. We strive to keep objects with shared or overlapping attributes in the same group. Recall that a group object is defined as a set-associative structure accommodating multiple objects in a single key-value entry, as shown in Figure 4. A single object may overlap multiple groups, as there is a high probability that future queries will be performed on the attributes with a higher number of associations. Figure 5(b) shows the GSM index tree built from the objects with unique and overlapping attributes. We bind an important element, the Sharing Frequency (SF), to each group and attribute. The group SF is defined as the number of attributes contained in a group, whereas the attribute SF denotes the number of objects associated with a particular attribute contained in a group. For example, as shown in Figure 5(b), Group <A, B>'s SF is 2 because it maintains two attributes. In contrast, the SF of attributes A and B is 4 because there are 4 PMOs associated with these two attributes, as shown in Figure 5(a).
A. GSM: Insert Operation
When the application or user creates a group, the group manager simply adds it to the first level of the index tree unless an existing group is annotated. The insert operation starts upon receiving an object along with its tags and attributes. There are two possible ways to associate an object with a group in the GSM index tree. First, the application or user defines a group for the object, i.e., a group annotation is provided at object creation, as shown in Listing 1 of our prior work [9]. We refer to such an object as a Tagged Object. Second, if no group is annotated to the object by the application or user, we search for an appropriate group for object placement based on the associated attributes. We refer to such an object as a Free Object.
Therefore, the insert algorithm treats each object differently, i.e., ❶ Tagged Object and ❷ Free Object. For ❶, the insert operation is simple: we select the group, update the attribute's SF, and check whether the group or attribute SF meets the split/merge threshold value. If the split/merge threshold is not violated, we update the group value accordingly in the pool key-value store.
For ❷, a Free Object, we first scan the groups based on matching attributes; if a group is found, the rest of the procedure is carried out in the same fashion as for a tagged object. Note that the insertion of a free object is carried out per attribute. There is a high possibility that free objects with multiple attributes/tags will overlap multiple groups. Figure 6 presents the insertion of a free object into the GSM index tree. If no group is found, we create and initialize a new group, update its header, associate the object with the group, and write the group metadata object to the pool key-value store.
A free object insertion in the GSM index tree. The changes are shown in bold and red boxes. Note that, for simplicity, we only show the groups with changes.
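The free-object path can be summarized with the simplified, DRAM-only C sketch below; reducing each group to a single indexed attribute and all structure and function names are our own simplification for illustration, not MOSIQS's implementation:

```c
#include <stdio.h>
#include <string.h>

#define MAX_GROUPS 16

/* Simplified group: indexes one attribute and tracks its sharing frequency. */
struct group {
    char attr[32];    /* attribute name this group indexes            */
    int  attr_sf;     /* #objects associated with the attribute (SF)  */
    int  oids[64];
    int  nobj;
};

static struct group groups[MAX_GROUPS];
static int ngroups;

/* Scan the groups for a matching attribute; create and initialize one if
 * no group is found (the "no group found" branch of the insert algorithm). */
static struct group *find_or_create(const char *attr) {
    for (int i = 0; i < ngroups; i++)
        if (strcmp(groups[i].attr, attr) == 0)
            return &groups[i];
    struct group *g = &groups[ngroups++];
    strncpy(g->attr, attr, sizeof(g->attr) - 1);
    return g;
}

/* Free-object insert: placement is carried out per associated attribute,
 * so one object may land in (overlap) several groups. */
static void insert_free(int oid, const char *attrs[], int nattrs) {
    for (int i = 0; i < nattrs; i++) {
        struct group *g = find_or_create(attrs[i]);
        g->oids[g->nobj++] = oid;
        g->attr_sf++;   /* a split-threshold check would follow here */
    }
}

int main(void) {
    const char *a1[] = {"temperature", "pressure"};
    const char *a2[] = {"temperature", "timestamp"};
    insert_free(1, a1, 2);
    insert_free(2, a2, 2);
    for (int i = 0; i < ngroups; i++)
        printf("group<%s>: SF=%d, %d object(s)\n",
               groups[i].attr, groups[i].attr_sf, groups[i].nobj);
    return 0;
}
```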
B. GSM: Split Operation
When a thread has exhausted all options to insert an entry into the GSM tree without violating the split threshold on the SF value, it triggers a group split operation, which expands the GSM index tree. The split threshold is a tunable parameter configured by the application or user to minimize the query search space. Figure 7 shows the split operation in the GSM index tree. Such a group split not only offers high parallelism for querying but also enables a broad range of queries, such as splitting a group based on a provided query hint. For example,
C. GSM: Merge Operation
In contrast to the split operation, when the sharing frequency drops below a threshold, groups can be merged to minimize index storage overhead and to accelerate queries. The merge operation is required when objects overlap in multiple groups, i.e., share attributes from multiple groups. The merge operation shrinks the GSM tree. A merge operation is useful for minimizing the number of additional I/Os when a query criterion contains multiple attributes and, irrespective of the query search space, multiple groups must be iterated to find the desired datasets. Another use case of the merge operation is when an application provides a query hint in advance; groups can then be merged to execute the query efficiently.
Both split and merge operations can be carried out depending on the application, the sharing frequency, or user-defined query patterns.
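The decision itself reduces to a threshold test on the sharing frequencies. The sketch below is a hedged illustration; the threshold values, the use of an average attribute SF, and the function name are assumptions rather than MOSIQS's actual policy:

```c
#include <stdio.h>

/* Hypothetical tunables; in MOSIQS these come from the application, user,
 * or query hints. */
#define SPLIT_THRESHOLD 32   /* max attributes per group before splitting */
#define MERGE_THRESHOLD 4    /* min attribute SF before merging groups    */

enum gsm_action { GSM_NONE, GSM_SPLIT, GSM_MERGE };

/* Decide whether a group should split, merge, or stay as-is, given its
 * group SF (#attributes) and the average attribute SF (#objects/attribute). */
static enum gsm_action gsm_rebalance(int group_sf, int avg_attr_sf) {
    if (group_sf > SPLIT_THRESHOLD)
        return GSM_SPLIT;    /* shrink the query search space      */
    if (avg_attr_sf < MERGE_THRESHOLD)
        return GSM_MERGE;    /* shrink the index storage overhead  */
    return GSM_NONE;
}

int main(void) {
    printf("%d\n", gsm_rebalance(40, 10)); /* split: too many attributes      */
    printf("%d\n", gsm_rebalance(8, 2));   /* merge: attributes rarely shared */
    printf("%d\n", gsm_rebalance(8, 10));  /* none                            */
    return 0;
}
```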
D. Crash Consistency
There is a need to ensure that the GSM index tree is modified atomically when an insert, split, or merge operation is performed, in order to provide consistency after a crash due to application error, power loss, or hardware failure. For this reason, we employ a unified logging (ulog) approach, i.e., an undo log for metadata operations and a redo log for data operations. The undo log prevents inconsistency of group metadata such as the attribute and group sharing frequencies, whereas the redo log preserves the group, attribute, and object associations. The size of our log records is 8 bytes, and we mostly write two log records for each operation in the GSM tree. Note that the redo log hits the group data part and can tolerate duplicate entries, whereas we avoid duplication in the group header because queries rely on the header portion, and such duplication would populate false-positive results.
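Since MOSIQS relies on PMDK transactions for GSM operations (Section V-A2), a group-header update can be made failure-atomic with libpmemobj's transaction API, as in the sketch below; the pool path and the two-field header layout are illustrative, and this sketch shows only the transactional path, not the exact undo/redo split described above:

```c
#include <libpmemobj.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative GSM group header stored on PM. */
struct gsm_header {
    uint64_t group_sf;   /* #attributes in the group               */
    uint64_t attr_sf;    /* #objects bound to a tracked attribute  */
};

int main(void) {
    PMEMobjpool *pop = pmemobj_open("/mnt/pmem0/gsm.pool", "gsm");
    if (pop == NULL)
        pop = pmemobj_create("/mnt/pmem0/gsm.pool", "gsm", PMEMOBJ_MIN_POOL, 0666);
    if (pop == NULL) { perror("pmemobj"); return 1; }

    PMEMoid root = pmemobj_root(pop, sizeof(struct gsm_header));
    struct gsm_header *hdr = pmemobj_direct(root);

    /* Both SF counters are updated atomically: the snapshotted (undo-logged)
     * range is rolled back if a crash interrupts the transaction. */
    TX_BEGIN(pop) {
        pmemobj_tx_add_range(root, 0, sizeof(struct gsm_header));
        hdr->group_sf += 1;
        hdr->attr_sf  += 1;
    } TX_ONABORT {
        fprintf(stderr, "GSM header update aborted\n");
    } TX_END

    pmemobj_close(pop);
    return 0;
}
```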
E. Multi-Attribute Search and Querying
As stated in previous studies [26], [30], queries bound on multiple attributes are among the most interesting and complex types of queries encountered in real scientific applications. It is usually the case that specific properties are largely the unknown parts of queries; indeed, one may query for relationships between objects spanning multiple files (consider, for example, retrieving all the objects having temperature and pressure attributes generated at timestamp t1 from simulation s1). Therefore, we provide such multi-attribute querying on persistent memory objects in MOSIQS.
Here, we describe how MOSIQS efficiently responds to queries spanning a large multi-attribute search space. A scientist or an application provides a query in string form, comprising a specific field or spanning multiple attributes, as shown in Figure 8; for instance, retrieve all the PMOs associated with the set of attributes defined in Figure 8(c). The query manager scans the groups for the query attributes in the GSM index tree. If all the attributes are in a single group header, we read the group metadata object from the pool key-value store and return the list of OIDs. Otherwise, parallel queries (PQ_i) are issued to the relevant groups and their results are combined via a union operation.
An example of versatile querying provided by MOSIQS. In particular, (a) and (b) shows a point query.
An example of union operation in parallel querying. The gray-color boxes represent duplicate OIDs in result. PQ_i denotes parallel queries issued to multiple groups.
The duplication may arise due to objects overlapping multiple groups; the union operation removes such duplicate OIDs from the final result.
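A hedged sketch of that union step is shown below: each parallel per-group query returns a list of OIDs, and duplicates caused by objects overlapping multiple groups are dropped before the result is returned (the OID values are made up for illustration):

```c
#include <stdio.h>
#include <stdbool.h>

/* Union the OID lists returned by parallel per-group queries, dropping the
 * duplicates that stem from objects overlapping multiple groups. */
static int union_oids(const int *lists[], const int lens[], int nlists,
                      int *out, int max_out) {
    int n = 0;
    for (int l = 0; l < nlists; l++) {
        for (int i = 0; i < lens[l]; i++) {
            bool seen = false;
            for (int j = 0; j < n; j++)
                if (out[j] == lists[l][i]) { seen = true; break; }
            if (!seen && n < max_out)
                out[n++] = lists[l][i];
        }
    }
    return n;
}

int main(void) {
    /* Results of two parallel group queries (PQ_1, PQ_2); OID 3 overlaps. */
    const int pq1[] = {1, 3, 5};
    const int pq2[] = {2, 3, 7};
    const int *lists[] = {pq1, pq2};
    const int  lens[]  = {3, 3};

    int result[16];
    int n = union_oids(lists, lens, 2, result, 16);
    for (int i = 0; i < n; i++)
        printf("OID %d\n", result[i]);
    return 0;
}
```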
Note that, we consider extending and improving the GSM index tree for attribute-value based querying on persistent memory objects as our future work.
Evaluation
This section presents MOSIQS performance evaluation.
A. Experimental Setup
1) TESTBED
We perform our experiments on a machine equipped with a dual-socket, 56-core second-generation Intel Xeon Scalable processor (hyper-threading enabled), 1.5 TB of Intel Optane DC 3D-XPoint PM, and 768 GB of DRAM. Each socket has two memory controllers, six memory channels, and twelve DIMM slots installed with
2) Implementation
We modified Intel's PMDK library and provide a wrapper API on top of its persistent memory object API. For the metadata storage backend, we used PMEMKV [35] with a fully persistent B+-tree storage engine. Though it is not yet production-ready, we still used it because the goal of our study is to show the feasibility of the proposed PM application framework. We use R-trees to implement our GSM index data structure and internally rely on PMDK transactions for GSM split and merge operations to ensure atomicity. This frees us from handling low-level memory management while guaranteeing safe and atomic allocations.
3) Benchmark and Workloads
We use the PIOK [24] benchmark to evaluate the performance of MOSIQS. In particular, we use two PIOK-provided kernels, i.e., VPIC-IO and BDCATS-IO, to show the read and write performance. VPIC-IO is an extracted kernel that simulates the particle data write behavior of the real VPIC scientific application [50]. Similarly, BDCATS-IO demonstrates the data read patterns of a parallel program that analyzes the particle data generated by VPIC [50]. We modified the two kernels to use the MOSIQS object storage abstraction API, where each particle denotes a PMO and is associated with fixed metadata. We changed 120 and 270 lines in VPIC-IO and BDCATS-IO, respectively, to adopt MOSIQS. We show the efficiency of MOSIQS's GSM indexing and multi-attribute querying by using the most commonly used HDF5 scientific data format. We downloaded NASA's publicly available GLAS/ICESat L2 Sea Ice Altimetry HDF5 dataset [49]. The dataset details are provided in Table 2. We run each experiment multiple times and report the mean. In all cases, the standard deviation was less than five percent of the mean.
We compare our approach with the following systems:
MIQS+: We implement and emulate MIQS [25] on top of an ext4-DAX file system mounted on PM and refer to it as MIQS+. MIQS [25] implements various DRAM-based index data structures, such as ART and SBST trees, to maintain HDF5 file indexes for querying self-describing scientific datasets stored in parallel and distributed file systems. The metadata indexing is conducted after the data is written successfully.
MOSIQS-NoIndex: MOSIQS with no metadata indexing and search service, but including the software implementation overhead of MOSIQS on top of PMDK [20].
MOSIQS-Sync: MOSIQS with metadata extraction enabled in inline synchronous mode, i.e., the write operation finishes only after the metadata is populated in the pool key-value store.
MOSIQS-Async: MOSIQS with metadata extraction enabled in inline asynchronous mode, i.e., the metadata is populated in the pool key-value store after the write I/O. We use separate dedicated threads executing concurrently, one for processing application I/O and another for metadata extraction.
B. I/O Performance Analysis
1) Throughput Analysis
Figure 10 shows the peak throughput of read and write operations with a varied number of processes using a fixed PMO size (i.e., 256 bytes). As observed from Figure 10(a) and (c), MIQS+ performs poorly compared to the proposed MOSIQS variants. MIQS+ accesses data at the block-size granularity exposed to the OS, which is typically 4 KB. Further, MIQS+ always needs to go through the I/O stack to fetch data, adding extra system call overheads. We observed that in MIQS+, the I/O stack overhead has a much higher impact than the write amplification due to block size mismatch, i.e., MIQS+ wastes I/O bandwidth if the required I/O size is smaller than the block size. With bigger I/O sizes, MIQS+ achieves better bandwidth, but the I/O stack overhead remains the same. On the other hand, this overhead is easily amortized in the MOSIQS variants, as no file system or kernel is involved. The throughput difference among the MOSIQS variants is mainly attributed to the additional metadata extraction and management in the critical I/O path. MOSIQS-NoIndex shows a consistent performance trend with varied process counts. It reaches the peak write bandwidth, including our software implementation overhead, and incurs a single metadata insertion operation per I/O to populate a mapping entry in the pool key-value store, compared to the MOSIQS-Sync and -Async approaches. Therefore, with varying process counts, we see a performance drop in MOSIQS-Sync and -Async. For read throughput, we observe a scalable performance trend for all the approaches, as shown in Figure 10(b) and (d).
MOSIQS I/O performance analysis by varying number of processes using 256B and 512KB PMO size.
2) Bandwidth Analysis
We have already discussed the read/write bandwidth of MOSIQS with varying PMO sizes in our previous work [9]. The peak write bandwidth of MOSIQS-NoIndex is 63% of 6.6 GB/s (≈ 4.2 GB/s) due to PMDK's internal transaction management, atomic memory allocations, pointer assignments, and MOSIQS's object-to-persistent-pointer mappings. Its peak read bandwidth is 40% of 23 GB/s (≈ 9.2 GB/s) with a 512 KB PMO size, mainly due to iMC cache misses, accessing the object's persistent pointer, and PMDK's internal persistent-pointer-to-memory-address translation. As expected, we observe a scalable read performance trend in all MOSIQS variants. The difference between the read and write throughput of MOSIQS and its variants is mainly derived from PM properties, i.e., the read and write performance of PM is highly asymmetric.
C. Multi-Attribute Indexing and Querying
Here, we present the evaluation of metadata search and query service provided by MOSIQS.
1) Index Construction and Consistency Overhead
Figure 11(a) and (b) depict the average index construction runtime for a single PMO with and without ensuring consistency. We varied the number of attributes per PMO to measure the time spent on each operation in constructing the index. We observed that all operations, such as metadata extraction, group scanning, and GSM metadata object persistence, increase linearly with the number of attributes, except for the index update sub-operation. The index update incurs unified logging overhead, i.e., a redo log and an undo log for data and metadata updates. Excluding the logging overhead improves runtime, but at the cost of tolerating inconsistency.
2) Index Storage Overhead
Here, we show the index storage overhead of MIQS+ and the MOSIQS variants with variable group sharing frequency values. Recall that the group sharing frequency defines the number of attributes contained in a group, which directly impacts the index storage space overhead. Figure 12 presents the index storage space consumed by MIQS+ and the MOSIQS variants. #A/G in Figure 12 denotes the group sharing frequency, i.e., the number of attributes per group. The distribution of attributes in a group and the attribute sharing frequency affect the index storage overhead. For instance, if the attribute sharing frequency per group is low, there is a high probability that attributes will overlap across groups, which in turn increases the index storage overhead. Similarly, if a group has many attributes and attribute sharing is so high that all associated PMOs fit in a single group, the index storage overhead will be minimal. We use real HDF5 dataset files and observe that MOSIQS-4A/G (4 attributes per group) consumes almost half the index storage space compared to MIQS+.
MOSIQS’s index storage space overhead. #A/G denotes number of attributes per group.
Surprisingly, with an increasing count of HDF5 files, i.e., 4K files, MIQS+ takes up almost 1.35 GB. On the contrary, MOSIQS-4A/G and 32A/G consume only 920 MB and 500 MB of index storage space (for 32A/G, roughly 1.35 GB / 500 MB ≈ 2.7× less than MIQS+) for a total of (
3) GSM Load Factor
Here, we explain the correlation between attribute and group sharing frequency while taking into account multi-attribute queries, index storage overhead, and, most importantly, query search space. To analyze the trade-off between query search space and index storage overhead, we insert 1 million zero-size PMOs, each tagged with 16 attributes. We consider two multi-attribute queries: i) Q1 includes a single metadata attribute, and ii) Q2 uses only 4 attributes with sharing frequency
Load factor between query search space and storage space. The storage and query search space % is normalized with respect to MIQS+ (100%).
Note that MIQS+ has an imbalanced load factor among its index trees, i.e., when a particular attribute shows a higher association with PMOs, the resulting tree can have greater depth, thus overloading that particular index tree.
4) Multi-Attribute Query Performance
To analyze the query performance using realistic scientific HDF5 datasets, we use the previously populated PMO mappings and GSM index tree from the prior experiment. The group sharing frequency is set to 4, as explained earlier. We define five realistic PMO metadata and associated attribute queries, as shown in Table 4. In particular, Q4 and Q5 represent queries spanning multiple groups. Therefore, the query search space profoundly impacts Q4 and Q5 throughput compared to Q1-Q3. For this experiment, we compare MIQS+ with two variants of MOSIQS. MOSIQS-Fixed manages all attributes in a single group and does not perform any split or merge operations, whereas MOSIQS-Dynamic adjusts the group and attribute sharing frequency dynamically. For example, if two groups have many overlapping objects or many groups contain a low attribute sharing frequency, we merge the two groups dynamically to minimize index storage overhead and the number of additional I/Os. Similarly, if a group has a higher attribute sharing frequency than the pre-configured value (4 in this case), MOSIQS-Dynamic splits the group into two sub-groups, i.e., child groups, to benefit from query parallelism. Table 4 shows the average query throughput and end-to-end time (query time + time to read the data object). In real scientific use cases, such queries are used to find and retrieve data items. Therefore, we measure end-to-end time because MIQS+ caches the indexes in DRAM, whereas data item retrieval requires accessing the disk-based storage system. We read a varied number of PMOs for each query for MIQS+ and the MOSIQS variants. On average, MOSIQS-Dynamic shows only 7% query throughput degradation and MOSIQS-Fixed shows 27% compared to MIQS+. However, for end-to-end time, MOSIQS outperforms MIQS+ due to the dataset storage location, i.e., the PM pool versus the file system.
Related Work
In this section, we review related work on in-memory indexes and PM in scientific computing, along with the significance of metadata indexing and querying.
A. In-Memory Volatile Indexes
An approach to building a high-performance scientific data management system is to store and index all of its metadata, and at least the significant data, in main memory [31]. Several scientific and database communities have adopted such in-memory ordered and unordered index data structures, including B/B+-trees, hashes, tries, and radix trees, to mitigate data access problems [30], [31]. MIQS [25] provides a classical example of building such in-memory indexes to provide efficient querying support on scientific datasets. MIQS employs multiple trees, such as ART and SBST, to ensure quick retrieval of datasets. However, in-memory volatile data structures or DRAM-resident indexes have an inherent limitation, i.e., they cannot survive power failures and unexpected crashes [32]. A simple power failure makes the index unreachable and requires rebuilding or recovering the whole index.
B. PM in Scientific Computing
A few recent studies employ PM to benefit scientific applications [4], [14], [17], [41]. NV-Process [41] proposed a fault-tolerant process model based on PM that provides an elegant way for applications to tolerate system crashes. Similarly, [17] evaluates different fault-tolerance approaches for scientific applications using PM. pVM [14] provides dynamic expansion of virtual memory with an object storage abstraction on PM. DAOS-M [18] proposed an object storage model on PM and NVMe SSDs, using PM for small writes and metadata, whereas large objects are written directly to SSDs. However, pVM and DAOS-M lack scientific metadata indexing and querying for memory-level objects. Further, these studies also lack data sharing controls, an important requirement of scientific computing [27].
C. Scientific Data Indexing and Querying
Several studies have proposed solutions and optimizations to improve scientific indexing and querying, either tightly integrated into the file system or using external databases [2], [9], [25]–[27], [30]–[32], [46], [51]–[56]. SoMeta [51] provides scientific metadata querying via scalable self-described objects and internally maps objects to files at the file system level. GUFI [55] embeds SQLite in every directory of the hierarchy to improve indexing and metadata querying in deeply nested file system directories. Brindexer [46] employs a relational database to track and manage metadata changes on Lustre. TagIt [2] offers file-system-integrated indexing and querying for a large collection of scientific files stored on a shared-nothing distributed file system. Spyglass [56] proposed a hierarchical partitioning approach to exploit locality for effective querying. EMPRESS [54] stores and manages metadata in PostgreSQL. DataSpaces [57] proposed a general programming model for data exchange using a virtual shared space abstraction with online data indexing and querying support for coupled scientific workflows. Similarly, ExaHDF5 [24] provides metadata extraction and querying for scientific data formats such as HDF5 and NetCDF. Furthermore, scientific data queries vary from simple point queries to prefix, suffix, and range queries based on complex selection criteria involving multiple attributes, as described in [26]. A few common challenges solved by previous studies include effective metadata extraction and indexing, efficient metadata management, and querying over domain-specific datasets.
A common goal among previous studies and our work is to enable metadata indexing and querying over a large collection of diverse and complex-layout scientific datasets. However, it is very challenging to build a general framework due to the variety of applications, diverse scientific data formats, and APIs. Moreover, most of the aforementioned solutions are built on the notion of an underlying parallel file system, which is block-addressable; thus, applying them to persistent memory requires considerable engineering and complex implementation effort. Many of the challenges we face are not radically new, but they take on new characteristics when coupled with an object storage abstraction on a shared PM pool with a high-speed interconnect such as Gen-Z [6], e.g., ensuring consistency and failure-atomicity in case of application crashes and ungraceful power failures.
To this end, MOSIQS provides a memory-level persistent object storage abstraction via a shared memory pool on top of a distributed array of PM devices. Scientists and collaborators can create and remove persistent objects from the shared memory pool via MOSIQS. Moreover, MOSIQS is equipped with metadata indexing and an efficient multi-attribute querying mechanism to accelerate scientific computing applications.
Conclusion
In this paper, we present MOSIQS, a persistent memory object management system to accelerate scientific computing. MOSIQS enables applications to efficiently attach and detach memory objects in their address space and enables effective sharing of persistent-memory-resident objects across different applications and collaborators. The proposed PM-based application model not only allows effective metadata extraction and tagging of memory objects but is also equipped with an indexing and querying service aimed at scientific datasets to further accelerate scientific experiments, simulations, and analysis. The preliminary evaluation confirms a 100% improvement in write and a 30% improvement in read performance against a PM-aware file system approach. Further, the GSM index data structure incurs $2.7\times$ less index storage overhead while executing multi-attribute queries efficiently.