Introduction
With the maturity and widespread use of perception technology, a large number of records containing temporal and spatial markers are generated; these are called spatio-temporal data [1]. Spatio-temporal data are multisource, heterogeneous, and multiscale. Space and time are the basic attributes of spatio-temporal data and the basic concerns of spatio-temporal data processing. With the rapid development of the mobile Internet, location-based services, and related technologies, and with the popularization of mobile devices, data containing spatio-temporal information, e.g., traffic trajectories [2], social media [3], remote sensing images [4], [5], and climate observations [6], accumulate rapidly, forming a special kind of spatio-temporal big data [7]. In a general sense, spatio-temporal big data refers to massive spatio-temporal data collections whose management, scientific computation, and value discovery cannot be carried out within an acceptable time frame using existing theories, methods, technologies, and tools.
Spatio-temporal big data contains a wealth of knowledge value. However, given the large volume of spatio-temporal data, its high computational complexity, and the large amount of time consumed by complex spatio-temporal queries, supporting high-performance queries on spatio-temporal data becomes important. Distributed spatio-temporal databases provide a variety of spatio-temporal indexes and queries for fast data retrieval, but the "weak" scalability of their index schemas and the "strong" limitations of strict database specifications make them unsuitable for processing spatio-temporal big data. Distributed computing technology achieves high scalability and excellent performance through its computing power, and it has spawned modern parallel processing systems based on MapReduce, e.g., Hadoop, HBase, and Cassandra. These offer excellent scalability and high-performance computing compared with traditional centralized systems when processing large amounts of data. Recently, distributed computing systems and NoSQL databases have been widely used to index structured, semistructured, and unstructured spatio-temporal data.
Although distributed systems solve the storage management problem of spatio-temporal big data to a certain extent, the input data are not organized by spatio-temporal association: spatio-temporal properties are not considered when deciding where each record is physically stored. This results in poor retrieval performance for applications that are sensitive to spatio-temporal correlation. Establishing a spatio-temporal index is therefore an effective way to improve access efficiency for spatio-temporal big data. In recent years, some surveys have been published on the research progress of spatio-temporal data indexing [8], [9], but they mainly analyze spatio-temporal indexes in centralized systems, e.g., based on main memory [10]–[15], flash-extended caches [16], disk [17]–[20], multicore multithreading [21], [22], and GPUs [23], [24]. The spatio-temporal index structures mainly include the B-tree, grid [25], quadtree [26], R-tree [27], and some R-tree variants [20], [25]. These index structures are simple and easy to maintain and can meet the retrieval requirements of conventional spatio-temporal applications. However, owing to the limitations of single-machine resources, they cannot support the low-latency, highly concurrent access that spatio-temporal big data applications require. In a distributed system, because the data storage architecture, data management model, and data processing methods differ greatly from those of centralized systems, such indexing technology cannot be easily transplanted. Therefore, distributed and elastic indexing methods for spatio-temporal big data have become the main research hotspot at present.
The research scope of this article consists of papers on spatio-temporal big data indexing published between 2010 and 2020. The main literature comes from VLDB, TKDE, ICDE, SIGMOD, CIKM, MDM, and other journals and conferences (including but not limited to SCI and EI venues). According to the research content, the evolution of different index modes is classified and summarized. As shown in Fig. 1, the lines represent the relationship between an original index structure and the spatio-temporal index structure it evolved into.
Fig. 1. Spatio-temporal indexing methods (2010–2020). The cross lines connect each new spatio-temporal index structure to the original index structure it evolved from.
Batch processing systems suit store-first, compute-later application scenarios in which real-time requirements are low and the accuracy and comprehensiveness of the data matter more; their spatio-temporal index structures often adopt the R-tree, 3DR-tree, and quadtree in a global–local arrangement. Streaming data processing systems no longer store data but compute directly in memory as streaming data arrive, suiting high-real-time, lower-precision application scenarios. The spatio-temporal index structure often chooses
This article centers on spatio-temporal big data indexing methods, systematically reviewing their current research and development status. According to the platforms on which existing spatio-temporal big data indexes are built, spatio-temporal indexes for spatio-temporal big data can be divided into three categories:
spatio-temporal index based on distributed computing system;
spatio-temporal index based on NoSQL database;
spatio-temporal index based on standalone distributed system.
Spatio-Temporal Index Based on Distributed Computing System
This type of spatio-temporal index is built on the existing distributed computing system, and the main advantage of this method is to inherit the scalability and fault tolerance of the underlying big data processing system. Distributed computing systems mainly include batch processing systems, stream data processing systems, and hybrid processing systems [31]. Batch processing systems (e.g., Hadoop [32]) are mainly used to process static spatio-temporal data and return the results after the calculation is completed. The calculation process takes several minutes or even hours, so it is not suitable for processing tasks that require high real-time performance. Streaming data processing systems (e.g., Storm [33] and S4 [34]) perform calculations on spatio-temporal stream data and process the data with minimal delay. Hybrid processing systems (e.g., Spark [35] and Flink [36]) can handle both batch and stream processing workloads. This section will focus on the related research results of extending these three types of systems to build spatio-temporal indexes. Table I provides an overview of spatio-temporal indexes based on distributed computing systems.
A. Spatio-Temporal Indexing for Extended Batch System
The big data batch processing system is suitable for application scenarios with low real-time requirements and high requirements for data accuracy and comprehensiveness; the typical representative system is Hadoop. As a basic big data platform, Hadoop is widely used for the storage and processing of spatio-temporal big data. Spatio-temporal indexes that extend Hadoop mainly index static spatio-temporal objects (e.g., STQuery [6], ST-Hadoop [38], QaDR-tree [39], and CloST [40]) and historical trajectories of moving objects (e.g., HadoopTrajectory (HT) [41]). They can sustain high throughput while keeping query response latency low. Indexes are created in batch mode: data import and index creation are performed by two separate MapReduce jobs. The MapReduce programs are simple and easy to implement, but their running time is expensive. The spatio-temporal index structure under the Hadoop framework is shown in Fig. 2. There are two types of global indexes: the multilevel (ML) index and the multidimensional (MD) index. ML indexes usually take the time dimension as the first level, slicing the dataset at an appropriate time granularity, and the spatial dimension as the second level. The spatial indexing technology mainly includes the grid, quadtree, Kd-tree, R-tree, and some R-tree variants. Local indexes are mainly based on the 3DR-tree [46] and BR-tree. In this section, we focus on the main spatio-temporal indexes that scale Hadoop. The general index flowchart over a batch system is shown in Fig. 3.
ST-Hadoop [38] is a spatio-temporal big data management framework that extends SpatialHadoop [47], [48]. It is used to store, index, and query spatio-temporal big data. ST-Hadoop extends the Hadoop system by injecting spatio-temporal awareness into the spatio-temporal query language layer and the spatio-temporal indexing layer. ST-Hadoop is designed as a static hierarchical index structure that slices by time first and then indexes by space. Creating the spatio-temporal index executes the following stages (a minimal sketch follows the list):
sampling;
time slicing;
spatial indexing;
physical writing.
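The following is a minimal sketch of this four-stage pipeline, assuming a fixed number of time slices and a uniform spatial grid; ST-Hadoop itself offers several slicing policies and spatial index types, so all names and parameters here are illustrative.

```python
import random
from collections import defaultdict

def build_st_index(records, n_slices=4, grid=16, sample_ratio=0.01):
    """Sample -> time slice -> spatial index -> physical write, simplified.
    Each record is (t, x, y, payload) with x and y normalized to [0, 1)."""
    # 1) Sampling: estimate the time distribution from a small random sample.
    sample = [r for r in records if random.random() < sample_ratio] or records
    # 2) Time slicing: derive slice boundaries from the sampled time range.
    t_min = min(r[0] for r in sample)
    t_max = max(r[0] for r in sample)
    width = (t_max - t_min) / n_slices or 1.0
    # 3) Spatial indexing: within each slice, bucket records by grid cell.
    partitions = defaultdict(list)
    for t, x, y, payload in records:
        ts = max(0, min(int((t - t_min) / width), n_slices - 1))
        cell = (int(x * grid), int(y * grid))
        partitions[(ts, cell)].append((t, x, y, payload))
    # 4) Physical write: in ST-Hadoop, each partition becomes one HDFS file;
    #    here we simply return the in-memory partition map.
    return partitions
```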
HT [41] is a spatio-temporal big data processing system with built-in spatio-temporal data types and spatio-temporal operations on Hadoop; it supports storage, indexing, and querying of moving objects and their trajectory data. HT supports two index structures: space-driven indexing and data-driven indexing. In space-driven indexing, a grid divides the 3-D extent of the input data into cubes, each of which maps to one or more files; the grid structure divides a trajectory across multiple consecutive grid cells. In data-driven indexing, a 3DR-tree partitions the large moving-object data file into multiple blocks, with the block size set to 4 MB, which preserves moving-object semantics and structure (for example, a moving object is never split across partitions) and avoids traversing multiple objects within a block file. The spatio-temporal bounding box of each trajectory is stored in the 3DR-tree, and the index information is stored on the master node. When creating an index, the index type must be specified as either trajectory or moving object. For the trajectory type, the 3-D minimum bounding rectangle (3-D MBR) of each trajectory is stored in the index, whereas for the moving-object type, the 3-D MBRs of all trajectories belonging to the same moving object are merged into one 3-D MBR and stored in the index. HT supports queries over a single query window, multiple query windows, specific moving objects, or specific trajectories.
QaDR-Tree&TGrid [39] are distributed indexing methods for querying historical spatio-temporal data, designed to perform spatio-temporal range queries over differently distributed spatio-temporal data under HDFS. QaDR-Tree is an ML index structure for nonuniformly distributed spatio-temporal data, consisting of a global index and local indexes. The global index is built on octrees [49]. It splits the spatio-temporal data into blocks, with the block size set to 60 MB, smaller than Hadoop's default 64-MB block, so that extra storage space remains for the local index to be stored in the Hadoop data block alongside the spatio-temporal data. Each leaf node of the global index points to a local index tree, a 3DR-tree (an R-tree with an added time dimension). When performing a spatio-temporal range query, the octree quickly finds the data blocks intersecting the spatio-temporal window; the 3DR-tree in each data block then filters the final query results. TGrid is an MD index structure for uniformly distributed spatio-temporal data. It adds the time dimension to the traditional grid index (cells represented as MBRs) to construct TGrid (cells represented as minimum bounding cubes, MBCs). TGrid first fixes a time granularity in the time dimension and then divides the spatial dimensions uniformly. In the local index stage, a 1-D time index is created for the spatio-temporal data of each MBC, and the local index results are aggregated on the master node to build the global index.
STQuery [6] is a spatio-temporal indexing method that uses MapReduce to efficiently retrieve and process array-based climate data stored in HDFS as raw files. STQuery uses a grid to map a file's logical multidimensional array data model onto the MapReduce key–value pair structure. The spatio-temporal index information is stored in a relational database, with the index designed as a relational table containing gridId, startByte, endByte, nodeList, and fileId. startByte and endByte record the starting and ending positions on HDFS of the grid identified by gridId, nodeList records the node locations where the grid is stored, and fileId records the file to which the grid belongs and its compression type. On top of this index, grid allocation and grid combination strategies are proposed to reduce cross-node data transfer and balance node workloads, optimizing MapReduce performance. However, 1) when an array file is uploaded to HDFS as a byte stream, a logical grid may be split across two blocks by the default block-size limit, causing cross-node data transfer during queries and degrading MapReduce performance; and 2) at high spatial and temporal resolution, each grid contains a very large amount of data, so range queries still scan large volumes of spatio-temporal data and query efficiency remains limited.
CloST [40] is a spatio-temporal big data system based on Hadoop. CloST supports single-object
ATLAS [50] is a distributed file system for storing, indexing, and querying spatio-temporal data that extends Hadoop HDFS. ATLAS uses data file partitioning to improve data access efficiency. When a block of data is written to the file system, the GeoHash values of the spatio-temporal objects within the block are calculated, and the objects are rearranged to match spatio-temporal boundaries. A spatial index based on a radix tree [51] is constructed over the GeoHash values of the data blocks. Each node in the tree maintains a list of (block ID, offset, length) tuples describing the data belonging to that node's GeoHash. Queries on the spatial index return a superset of the requested data to ensure full coverage; for instance, a query for data with GeoHash value "8bce" will return data indexed under "8b." ATLAS maintains a temporal index using a
To sum up, as shown in Fig. 5, in an MD index, space and time are treated as dimensions of one multidimensional space, and space and time are processed in no particular order. In an ML index, the spatial and temporal dimensions are processed differently; the idea behind this type of index is to build a separate spatial index for each time instance. In addition, to serve different types of spatio-temporal queries, the data are often stored in multiple copies, which speeds up queries but introduces multicopy storage redundancy.
B. Spatio-Temporal Indexing for Scaling Streaming Data Processing System
With intelligent positioning devices continuously generating spatio-temporal data, the demand for ingesting and retrieving streaming data is growing rapidly. Traditional big data stream computing no longer stores the data; instead, when streaming data arrive, real-time processing is performed directly in memory. Spatio-temporal stream data are real-time, volatile, bursty, out of order, unbounded, and spatio-temporally correlated, which places many new and higher requirements on the system [52]. In 2010, Yahoo launched the S4 streaming system with a symmetric architecture; in 2011, Twitter launched the Storm streaming system with a master–slave architecture. These systems advanced big data stream processing technology and made processing spatio-temporal data streams possible. The general index flowchart over a stream processing system is shown in Fig. 5.
DITIR [53] is a distributed index structure for high-throughput trajectory insertion and real-time querying of trajectory data, built on the distributed data flow system Apache Storm. DITIR assumes that all data tuples arrive in timestamp order and uses transceivers to transmit data. The spatial information
Zhang et al. [56] target real-time spatial queries under highly dynamic data updates of single and multiple moving objects, reducing the time complexity of updates by modifying traditional spatial indexes. The single-moving-object query first groups spatial data into Storm bolts with the FieldsGrouping method according to the moving-object ID; second, it builds distributed indexes (e.g., R-tree and grid) in each bolt instance and obtains local results through a main-memory spatial query algorithm; finally, it aggregates the local results into the global result. Update messages carrying spatial information are distributed to executors by point ID, relieving the pressure of frequent updates while enabling real-time computation and analysis over large numbers of moving objects. The multiple-moving-object index is a distributed hierarchical index structure: the spatial range is divided into uniform grid cells that serve as the primary index, two sets of secondary indexes (e.g., R-trees, hash tables, and quadtrees) are maintained per cell to organize the spatial data of the two datasets, and a result table stores the spatial join results.
There is also a special real-time moving-object query, the k-NN query over real-time moving objects. This query only needs to maintain a list of moving objects, so there is no throughput constraint from index structure and data insertion, but it does incur internode overhead: massive communication and transfer of information. Since the moving objects update constantly, the spatio-temporal stream is unbounded and the number of moving objects is theoretically infinite; in practice, however, this number stays roughly constant, so only a bounded list of moving objects and index queries needs to be maintained.
When implementing k-NN queries over real-time moving objects in a master–slave architecture, the master node is responsible for storing and maintaining index information and for distributing queries, but it faces the following problems.
In a master–slave architecture, nodes may be distributed across different servers, causing frequent interaction between master and slave servers and increasing the communication burden.
The master node is responsible for distributing queries to the slave nodes; however, the frequent movement of objects can easily turn the master node into a performance bottleneck.
DSI [57] is a distributed index structure based on the Apache S4 system that supports k-NN queries over real-time moving objects. DSI divides the 2-D space into two sets of nonoverlapping strips (vertical strips and horizontal strips). The index structure is designed as
Most existing grid-based k-NN search methods iteratively expand the search area to identify the k-NN. When adopted in a master–slave architecture, the master node maintains partition information (e.g., cell boundaries) and the slave nodes maintain the index of each cell; each iteration then requires a round of communication between the master and the slave nodes holding the relevant cells, which is expensive relative to the other costs. When adopted in a symmetric architecture, this problem is avoided, but the required number of iterations cannot be bounded, making query performance unpredictable. Tree-based indexes (e.g., R-trees, Kd-trees,
“Strong” real-time requirements for k-NN queries for real-time moving objects are as follows:
indexes can be easily partitioned and distributed to different servers in the cluster;
the index can be efficiently updated with the continuous movement of the object;
the index supports an efficient k-NN search algorithm with fewer iterations.
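To make the iterative grid expansion discussed above concrete, here is a minimal single-node sketch; in the distributed setting, each loop iteration would correspond to one round of master–slave communication. All names are hypothetical.

```python
import heapq, math

def grid_knn(grid, qx, qy, k, cell_size, max_ring=64):
    """grid: dict mapping (cx, cy) -> list of (x, y, obj_id) points.
    Expands square rings of cells around the query cell until the k-th
    best distance is provably smaller than anything left unexplored."""
    qcx, qcy = int(qx // cell_size), int(qy // cell_size)
    best = []  # max-heap via negated distances: (-dist, obj_id)
    for ring in range(max_ring + 1):
        # Visit the cells at Chebyshev distance `ring` from the query cell.
        for cx in range(qcx - ring, qcx + ring + 1):
            for cy in range(qcy - ring, qcy + ring + 1):
                if max(abs(cx - qcx), abs(cy - qcy)) != ring:
                    continue
                for x, y, oid in grid.get((cx, cy), []):
                    d = math.hypot(x - qx, y - qy)
                    if len(best) < k:
                        heapq.heappush(best, (-d, oid))
                    elif d < -best[0][0]:
                        heapq.heapreplace(best, (-d, oid))
        # Any unexplored cell lies at least ring * cell_size from the query,
        # so once the k-th candidate is closer than that, we can stop.
        if len(best) == k and -best[0][0] <= ring * cell_size:
            break
    return sorted((-nd, oid) for nd, oid in best)
```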
For the ingestion of spatio-temporal stream data, real-world applications usually require extremely high tuple insertion throughput and real-time response to range queries on a specified domain. Therefore, real-time indexing of spatio-temporal stream data is performed in the traditional streaming computing mode, and range query processing faces the following two main challenges.
How to efficiently maintain real-time data range queries while also supporting efficient insertion of spatio-temporal stream data?
How to efficiently index data tuples on both the temporal and spatial domains while keeping new incoming tuples immediately visible to queries as they arrive?
The usual solution is a global data partitioning model, assuming that the streaming data arrive approximately in order. Specifically, the spatio-temporal space is divided into cubes, called data regions, and incoming data tuples are stored in their corresponding data regions as they arrive. For intuition, Fig. 4 shows a 2-D schematic of the spatio-temporal space. Given any query with a specific query region, this partitioning effectively speeds up query execution by skipping data regions that do not overlap the query region. More importantly, since data tuples arrive roughly in timestamp order, newly arrived tuples are always inserted into the data region with the latest timestamp, i.e., the rightmost (shaded) data region in Fig. 4, and never into a historical data region. The physical separation of recent and historical data in the global data partitioning mode thus avoids costly global data merging, enabling high-throughput insertion compared with data structures that mix historical and new data (e.g., the LSM-tree [60]).
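A minimal sketch of this global data partitioning model, with hypothetical names and a fixed region width of 10 time units: tuples are appended to the newest region, and a range query skips regions with no temporal overlap.

```python
class DataRegion:
    """One partition of the stream covering a half-open time span."""
    def __init__(self, t_start, t_end):
        self.t_start, self.t_end = t_start, t_end
        self.tuples = []  # (timestamp, key, value)

regions = [DataRegion(0, 10)]  # the rightmost region is always the newest

def insert(ts, key, value):
    # Roughly time-ordered arrival: tuples land in the newest region only,
    # so historical regions are never merged or rewritten.
    while ts >= regions[-1].t_end:
        last = regions[-1].t_end
        regions.append(DataRegion(last, last + 10))
    regions[-1].tuples.append((ts, key, value))

def range_query(t_lo, t_hi, key):
    hits = []
    for r in regions:
        if r.t_end <= t_lo or r.t_start > t_hi:
            continue  # skip regions with no temporal overlap with the query
        hits += [v for ts, k, v in r.tuples if t_lo <= ts <= t_hi and k == key]
    return hits
```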
An index server is maintained for each unique key in the spatial domain, so that newly arriving tuples with different keys can be ingested in parallel. To implement time partitioning, each index server accumulates received data tuples in memory and flushes them to the file system as immutable data blocks when the size of the in-memory tuples reaches a predetermined threshold (e.g., 16 MB). Flush operations on different index servers are asynchronous, so the time boundaries of different key intervals are not aligned, as shown in Fig. 4. To further improve query efficiency, the data tuples in each data region must also be indexed on the key and time domains. Based on the following two considerations, a B+-tree is usually built within each data region:
Spatio-temporal application queries usually involve low selectivity in the key domain and high selectivity in the time domain.
The cost of inserting data into other 1-D index structures is significantly higher than that of B+-trees.
If the query is selective in both domains, a technique, e.g., Z-order, can be applied to convert the two domains into 1-D integers before building a B+-tree.
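As an illustration of the Z-order conversion just mentioned, the following sketch interleaves the bits of two fields into one 1-D integer, so that pairs close in both domains map to nearby values that a B+-tree can index:

```python
def z_order(a: int, b: int, bits: int = 32) -> int:
    """Interleave the low `bits` bits of a and b: a supplies the even bit
    positions of the result and b supplies the odd ones."""
    z = 0
    for i in range(bits):
        z |= ((a >> i) & 1) << (2 * i)
        z |= ((b >> i) & 1) << (2 * i + 1)
    return z

# a = 0b11 fills even positions, b = 0b00 leaves odd positions clear.
assert z_order(0b11, 0b00, bits=2) == 0b0101
```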
To sum up, from the perspective of maintaining the spatial locations of moving objects, the maintenance cost of tree-based indexes is high under frequent updates, especially when deployed in streaming data processing systems. Grid indexing usually involves iteratively expanding the search area to identify the set of covered cells, and in general the number of such iterations is indeterminate. More importantly, in streaming data processing systems, grid indexing can lead to excessive communication between cluster nodes, reducing query processing performance. Considering communication cost, maintenance cost, and index performance, neither tree-based indexes nor grid indexes, the two main existing index types for real-time spatio-temporal stream range queries and k-NN queries, can be used directly in stream processing systems.
C. Spatio-Temporal Indexing for Extended Hybrid Processing System
The hybrid processing system is another mode of real-time spatio-temporal stream processing; the typical representative systems are Apache Spark and Apache Flink. Apache Spark uses distributed memory for data computation, quickly responding to queries and returning analysis results in real time. Spark provides a higher level API than Hadoop, and the same algorithm runs 10 to 100 times faster on Spark than on Hadoop [61]. Apache Flink is a scalable platform for batch and streaming data processing. Flink itself does not support efficient processing of spatial data streams; accordingly, Flink-based spatio-temporal index research falls roughly into two categories: building spatio-temporal indexes on top of Flink (e.g., QBS-Tree [62]) and extending Flink itself with spatio-temporal indexing (e.g., GeoFlink [63]), both of which support continuous queries on real-time spatio-temporal streams. The general index flowchart over a hybrid processing system is shown in Fig. 6.
1) Apache Spark-Based Spatio-Temporal Index
In application scenarios with high real-time requirements (e.g., real-time location recommendation and moving-object behavior analysis), the system must provide second-level real-time stream processing. The disk-based Hadoop clearly cannot satisfy such requirements, whereas the Spark-based streaming framework Spark Streaming can. The principle is as follows: the continuous stream is represented as a high-level abstraction, the discretized stream, packaged into RDDs, and the RDDs are then operated on in a batch-like manner. Spark Streaming is thus still batch processing in essence, but thanks to Spark's in-memory computing it processes data fast enough to achieve quasi-real-time performance. Spark runs on a Hadoop cluster and accesses the distributed file system HDFS, and it can also process structured data in Hive and streaming data from Flume and Kafka. Spark RDDs make it possible to explicitly cache data between iterations, improving computational fault tolerance.
The distributed trajectory R-tree (DTR-tree) [64] is an R-tree index implemented on Apache Spark. It is widely used for trajectory and activity trajectory queries; e.g., DMTR-tree [65] supports skyline queries over activity trajectories with the help of the DTR-tree and an inverted list. DTR-tree indexes trajectories and activity trajectories using a distributed R-tree built on the spatial properties of the trajectories, combining a global index with local index structures. The global index adopts an R-tree that partitions the indexed trajectory data; the leaf nodes of this R-tree represent sub-R-trees stored on the distributed nodes. The dataset D is divided into N partitions using a balanced partitioning strategy, and an R-tree local index is built on each worker node. During a range query, the 2-D R-tree indexes trajectories in the DTR-tree according to their spatial positions. First, the partition id that overlaps with the query point
ST-Joins [66] extends spatio-temporal joins on Apache Spark; two join methods, Broadcast join and Bin join, are designed for join queries over datasets of different sizes. The index construction order is space first, then time. Broadcast join suits cases where at least one dataset fits entirely in Spark executor memory: the remaining data are distributed to each executor, a local quadtree is built in each executor's memory, and all local quadtrees are assembled into a global quadtree. Bin join performs regular-grid or adaptive-grid (quadtree-based) division according to the spatial distribution of the dataset to generate grid-cell bins, groups the data in the same bin together, and constructs a quadtree index on the grouped data to further filter spatial information. Since data skew leads to unbalanced bin loads, the time dimension is introduced to spread spatially dense regions uniformly over time. Adaptive grid division keeps the data distribution across cells balanced but increases computational time complexity.
DITA [67] is a research prototype that extends Spark for in-memory trajectory analytics. DITA uses global/local indexes and proposes a pivot-point-based approximate representation technique for trajectories. The STR algorithm is used to partition the data, operating on selected trajectory points, namely the starting and ending point of each trajectory. Trajectories are grouped by their first point, and subgroups are then created by grouping on the last point. The global index thus consists of two R-trees, one built on the MBRs of the first points and the other on the MBRs of the last points. The local index is a trie-based variant built on the pivot points of the trajectories. At the algorithm/processing level, DITA adopts a filter-and-refine paradigm to efficiently handle similarity search and similarity joins.
STARK [68] is a Spark-based framework supporting join, k-NN, and range queries on spatio-temporal data. STARK supports fixed grid partitioning and binary space partitioning. Fixed grid partitioning divides each dimension of the data space into intervals, forming equal-sized grid cells; it then computes the partition boundaries, traverses the entire dataset, and assigns each record to a grid cell according to its spatio-temporal information. STARK supports live indexing: when a partitioned query executes, an R-tree index is built over the partitioned data and then used to find the query objects. The live indexing method takes a preorder traversal of the tree and an optional partition type as parameters. STARK supports three spatio-temporal operators: contains, is contained, and intersects. DBSCAN clustering based on spatial partitioning is also supported, with the points within a given distance of the partition boundary backed up
OceanST [72] is a Spark-based framework for exact and approximate spatio-temporal aggregation queries over mobile broadband (MBB) data. OceanST adopts a three-level hierarchical partitioning strategy: first, the MBB data are divided into multiple first-level partition buckets according to the hash of the user id and a coarse-grained time range; second, each first-level partition is divided into multiple second-level partitions; finally, each second-level partition is divided into multiple third-level partition blocks according to a fine-grained time range, and each bucket can optimize data recording and storage. OceanST supports intrablock indexes and inverted indexes. The block file data are grouped by user id, the data in each group are sorted by time, and the
Simba [73] is an extension of Spark designed to provide an efficient query and analysis system for spatial big data. Simba adopts global and local indexing strategies. Global indexes range from simple structures supporting 1-D indexing, e.g., sorted arrays, to complex structures supporting multidimensional indexing, e.g., R-trees or Kd-trees. The global index is stored in the master node's memory by default and can also be persisted to the file system. Simba builds a custom index (e.g., an R-tree) on each partition's data as a local index and introduces IndexedRDD to support fast random access to spatial data within a partition. Simba supports range queries, distance queries, and k-NN queries, as well as distance join and k-NN join queries. Simba optimizes spatial queries but does not support spatio-temporal queries.
Hu et al. [74] proposed a Spark-based framework for processing geospatial raster big data that supports indexed queries on raw geospatial data, designing a three-level hierarchical index (a schematic sketch follows the list):
a global index, which builds a Kd-tree at the master node to track the physical storage locations of all blocks in the cluster;
a local index, which builds a hash table for each variable on each worker node, indexing all blocks stored on that node and providing block layout information at the byte, block, and file levels;
an RDD index, which builds a Kd-tree to persist and index all in-memory blocks in each RDD partition, accelerating spatial queries on in-memory data and avoiding linear scans.
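A schematic sketch of how a block lookup could traverse the three levels; the global and local structures are shown as plain dictionaries rather than Kd-trees and hash tables for brevity, and all identifiers are hypothetical.

```python
# Level 1: global index at the master node -- block id -> replica location.
global_index = {"blk-7": ("worker-2", "sst_2014.dat")}
# Level 2: local index on each worker -- (variable, block) -> byte layout.
local_index = {"worker-2": {("sst", "blk-7"): (4096, 65536)}}  # offset, length
# Level 3: RDD index -- blocks already cached in an RDD partition's memory.
cached_blocks = {"blk-3"}

def locate(block_id, variable):
    """Resolve a block through the three levels, memory first."""
    if block_id in cached_blocks:
        return ("memory", block_id)            # served by the RDD index
    worker, path = global_index[block_id]      # master-side lookup
    offset, length = local_index[worker][(variable, block_id)]
    return (worker, path, offset, length)      # worker-side byte range

print(locate("blk-7", "sst"))  # ('worker-2', 'sst_2014.dat', 4096, 65536)
```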
KDB-tree [75] is a spatio-temporal Ripley's K-function computation framework based on Apache Spark that aims to rapidly analyze spatio-temporal point patterns. For spatio-temporal data partitioning, a spatio-temporal index based on the KDB-tree [76] is designed to accelerate the computation of the spatio-temporal Ripley's K function in a distributed environment. KDB-tree construction consists of three steps (a minimal sketch follows the list):
sampling, randomly selecting spatio-temporal points and sending them to the cluster master node;
the master node builds a KDB-tree-based spatio-temporal index over the sample points, with the number of partitions determined by the tree's construction parameters;
the index is sent to the worker nodes, and each worker looks up the leaf node to which each spatio-temporal point belongs and constructs key–value pairs; pairs with the same key are placed in the same partition.
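A minimal sketch of these three steps under simplifying assumptions (2-D points instead of 3-D spatio-temporal ones, median splits, and a single process standing in for master and workers):

```python
import itertools, random

_pid = itertools.count()  # partition id generator

def build_kdb(sample, leaf_size, depth=0):
    # Step 2: the master builds the tree on the sample; each leaf becomes
    # one partition, so tree parameters determine the partition count.
    if len(sample) <= leaf_size:
        return {"pid": next(_pid)}
    axis = depth % 2  # alternate split axis by depth
    sample = sorted(sample, key=lambda p: p[axis])
    mid = len(sample) // 2
    return {"axis": axis, "split": sample[mid][axis],
            "lo": build_kdb(sample[:mid], leaf_size, depth + 1),
            "hi": build_kdb(sample[mid:], leaf_size, depth + 1)}

def assign(tree, p):
    # Step 3: each worker routes a point to its leaf and emits a key-value
    # pair; equal keys end up in the same partition.
    while "pid" not in tree:
        tree = tree["lo"] if p[tree["axis"]] < tree["split"] else tree["hi"]
    return tree["pid"], p

# Step 1: sample points and ship them to the master.
points = [(random.random(), random.random()) for _ in range(10_000)]
tree = build_kdb(random.sample(points, 500), leaf_size=50)
pairs = [assign(tree, p) for p in points]
```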
2) Apache Flink-Based Spatio-Temporal Index
Apache Flink is an open-source system for scalable processing of batch and streaming data. In this processing model, each input tuple acts as both a retrieval and an update: each input tuple issues a query, obtains the result from the data stored in the index, and is then inserted into the index to serve subsequent queries. After a period of time, expired tuples are removed from the index to keep queries efficient. In Flink, all tuples with the same key are processed by a single operator instance; the keyBy operator logically partitions stream tuples by their keys.
Streams can transfer data between two operators in one-to-one (forwarding) mode or in redistribution mode. One-to-one streams preserve the partitioning and order of elements, while redistribution changes the partitioning of the stream. Depending on the selected transformation operator, each operator subtask sends data to different target subtasks. By default, each operator preserves the partitioning and ordering of its predecessor, thereby preserving source parallelism. However, keyed operations cause data reshuffling and distribution overhead, while plain forwarding may lead to load imbalance or even idle CPU cores, leaving the cluster's computing power underused. To ensure efficient query execution, the right balance between data redistribution and data forwarding must therefore be found. Furthermore, since parallel instances of an operator cannot communicate with each other, the user must ensure data locality for each instance.
Under frequent updates, tree indexes either need to adjust the tree structure frequently or need to redistribute large amounts of data. More specifically, in Flink processing systems, regular tree indexes (e.g., R-tree, quadtree, and Kd-tree) can meet query-efficiency requirements but perform poorly on updates, especially under large numbers of insertions and deletions. Therefore, a tree-based spatio-temporal index built on the Flink processing system needs to satisfy the following characteristics.
High query efficiency: It is required to build a well-structured and low-depth balanced tree to achieve maximum query speed.
High update efficiency: The tree structure is required to avoid structure adjustment or data redistribution as much as possible.
In addition, grid-based indexes can achieve fast updates; however, their query efficiency is much lower than that of tree-based indexes.
QBS-Tree [62] is an efficient spatial index built on the Flink processing system; it inherits and extends the B-tree and R-tree and supports node insertion and deletion. QBS-Tree is a self-balancing tree that remains equally efficient for spatially unbalanced index objects. It borrows the R-tree idea of aggregating data points at similar distances and represents the index objects of these points as the center point of a 2-D rectangle, reducing the overlap between the MBRs of different index objects. When a new index object is inserted and the number of index items in a leaf node exceeds the threshold M, the node executes an improved STR splitting algorithm, the leaf node is divided into four child nodes, and the QBS-Tree is rebalanced. When index objects are deleted and the number of index items in a leaf node falls below the threshold M, the child nodes of that layer are merged into a new leaf node and the QBS-Tree is rebalanced again. QBS-Tree also proposes a novel delayed-update mechanism that optimizes update operations and avoids unnecessary structural adjustment. Owing to the rapid arrival and expiry of streaming data, however, the cost of maintaining the QBS-Tree for query and update tasks is high.
GeoFlink [63] is a streaming computing framework that extends Flink to support continuous queries on spatio-temporal data streams. GeoFlink introduces a grid-based spatial index. The grid G is constructed by dividing a 2-D rectangular space into grid cells of length
D. Summary
This section has discussed spatio-temporal indexes that extend existing distributed computing systems. These indexing methods not only inherit the fault tolerance and high scalability of big data processing systems but also make full use of the systems' computing power to serve spatio-temporal queries in different scenarios. However, as spatio-temporal application scenarios become increasingly real-time, traditional single index structures struggle to support actual demands effectively. For spatio-temporal big data, three development trends in indexing technology based on distributed computing systems can be summarized.
Specialization of index technology: To reduce query cost and improve response efficiency, spatio-temporal indexes need to move away from traditional centralized index systems toward specialized indexing techniques.
Diversification of index structures: Mature centralized indexing technology has gradually been integrated into distributed computing systems. However, the particularities of distributed architectures and the rich query operations of spatio-temporal big data place unique demands on index structure design; ML indexes, MD indexes, and combined index structures have gradually become the mainstream technologies for spatio-temporal big data indexing.
Real-time index queries: Driven by spatio-temporal big data and practical application requirements, real-time indexes that can perform continuous queries, complementing the spatio-temporal indexes of extended batch systems and aiming to shorten query response times over PB-scale data to seconds, are receiving increasing attention.
Spatio-Temporal Index Based on NoSQL Database
To solve the problems traditional relational databases face in spatio-temporal big data management, highly concurrent reading and writing, and scaling, NoSQL database technology has become the technical foundation for spatio-temporal big data storage management. At present, NoSQL databases usually store data with key-based data models, divided into three types: key–column, key–value, and key–document [77]. A NoSQL database accesses data through the key–value model; after a data table is sharded, it is stored distributed across the server cluster.
Existing spatio-temporal indexes based on NoSQL databases take two forms. 1) The spatio-temporal multidimensional index, shown in Fig. 7, extends spatial multidimensional indexes by treating time as an additional dimension and jointly encoding the spatio-temporal information into 1-D data; a mapping is established between the RowKey and the node where the data block is located. This form suits all three data models. 2) The joint index, used mainly for the key–column data model, is in essence also an extension of the spatial multidimensional index. First, spatial information is encoded into 1-D data using GeoHash, Z-order, H-Curve, and similar techniques, and the 1-D data are organized in a tree structure to speed up the location of spatial data. Second, a hash algorithm establishes the mapping between the 1-D data and the data nodes. Finally, the 1-D data are jointly encoded with time information and other attributes into the RowKey, the sole access entry for data rows in the NoSQL database. The joint index is designed to quickly locate spatial data blocks and does no special processing of the time information. To improve the query efficiency of spatio-temporal big data, some works build a secondary index [78]. Table II provides an overview of different spatio-temporal indexes under NoSQL databases. The general index flowchart over a NoSQL database is shown in Fig. 8.
A. Key–Column Database
The key–column data model derives mainly from Google's BigTable; typical representatives are Cassandra [79], HBase [80], and Accumulo [81]. The key–column data model can be understood as a multidimensional map, mainly involving concepts such as RowKey, ColumnFamily, and Column. In the key–column data model, a column is the smallest stored element in the database: a triplet of key, value, and timestamp. Every row is organized by its primary RowKey. In short, the key–column data model simulates the storage format of traditional tables through multilayer mapping; in fact, it resembles the key–value data model in that lookups go through keys.
1) Spatio-Temporal Index Based on Cassandra Database
Cassandra is a distributed nonrelational key–value columnar store that uses the gossip protocol to store massive amounts of data on large numbers of servers without a single point of failure. Cassandra is a partitioned data store in which rows are organized by a primary key with two parts: the partition key, which identifies a record uniquely and determines where the data are stored, and the clustering (or sort) key, by which rows within a partition are organized. This partition-key/clustering-key structure can realize multigranularity, multilevel spatio-temporal indexing [82], effectively supporting the search mechanism.
PSTQ [78] is a Cassandra-based spatio-temporal trajectory query framework. Each row in the trajectory table consists of the partition key Rid, the clustering key
a random number from 0 to 9, which aims to eliminate index hotspots by randomizing the partition fragmentation of each entry;
a Z-order value;
long time partitions with fixed time step.
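Assuming the three parts listed above are assembled into one composite key, the construction could look as follows; field widths and the time step are illustrative, not PSTQ's actual layout.

```python
def rowkey(x_cell: int, y_cell: int, ts: int, step: int = 3600) -> tuple:
    """Compose (salt, z_value, time_partition) for a trajectory entry."""
    z = 0
    for i in range(16):                  # interleave 16-bit cell coordinates
        z |= ((x_cell >> i) & 1) << (2 * i)
        z |= ((y_cell >> i) & 1) << (2 * i + 1)
    # PSTQ uses a digit 0-9 to spread hot partitions; deriving it from z
    # (rather than drawing it randomly) keeps the key reproducible at read time.
    salt = z % 10
    return (salt, z, ts // step)         # fixed-step time partition

print(rowkey(3, 5, 1_600_000_000))       # (9, 39, 444444)
```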
MLS3 [82] is an adaptive spatio-temporal hierarchical indexing method that supports querying of different spatio-temporal feature data. The main ideas are as follows.
Spatio-temporal partitioning, which divides data by time granularity and spatial hierarchy.
Information encoding, which jointly encodes spatio-temporal information and stores it in the primary key. The partition key is identified by a coarser time granularity and a parent spatial grid, ensuring that data with spatio-temporal proximity are placed in the same or adjacent logical partitions; the clustering key is identified by a finer time granularity and a subspace cell.
Index establishment, in which an ML index tree with adaptive hierarchical division, MLS3, is proposed.
2) Spatio-Temporal Index Based on HBase Database
The HBase database organizes and manages data as key–value pairs. Lacking an effective secondary indexing mechanism, it cannot natively support multidimensional queries; for spatio-temporal queries, HBase usually performs full table scans, which is inefficient. In recent years, much research on supporting multidimensional queries with HBase has emerged, mostly focusing on RowKey design and, to a smaller extent, on index queries within a region. The spatio-temporal index structures under HBase can be summarized as follows.
Linear index: The spatio-temporal data are converted into one dimension, the Z-order value is used as a rowkey to organize the data, and each region represents an independent spatio-temporal cube region.
ML index: The time dimension is used as the first layer of the index, which reduces the length and complexity of the rowkey and improves the retrieval efficiency. The spatial dimension is the second level, and the spatial unit information is represented by linearization technology or tree structure.
UQE-Index [83] is an index framework based on the HBase database that supports high insertion throughput and efficient multidimensional queries, and it builds a local R-tree index supporting efficient retrieval of data within a region. First, the data are divided into nonoverlapping time intervals and organized and managed using
MD-HBase [84] is a multidimensional data management system that supports multidimensional range queries and nearest-neighbor queries. The basic idea is to divide the multidimensional space into grids, use the space-filling Z curve to encode the 2-D grid into 1-D data, and take the longest common prefix of the 1-D data as the primary key. MD-HBase only provides spatial indexes in the META table layer; the time attribute is not considered, so time-related queries and the filtering of data within blocks still require full scans, and query efficiency is low.
STCode [85] adopts an MD index mode, encoding longitude, latitude, and time into a Base64 string used as HBase's RowKey. For each dimension, STCode recursively performs binary partitioning, dividing the dimension into two equal parts, with 0 for the left and 1 for the right. Each character of the string represents a 6-bit sequence interleaved in longitude, latitude, time order; e.g., with longitude bits 1001, latitude bits 0111, and time bits 1001, the interleaved bit sequence is 101010010111 and the Base64 string is fM. Spatio-temporally adjacent points share an STCode prefix, so a query only needs to scan RowKeys sharing the area's prefix, reducing search and computation delay as well as the I/O load between client and server.
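The interleaving step can be reproduced directly from the paper's example; the mapping of each 6-bit group to a character depends on STCode's particular Base64 alphabet, which is omitted here.

```python
def st_interleave(lon_bits: str, lat_bits: str, t_bits: str) -> str:
    """Interleave equal-length bit strings in longitude, latitude, time order."""
    return "".join(l + a + t for l, a, t in zip(lon_bits, lat_bits, t_bits))

bits = st_interleave("1001", "0111", "1001")
assert bits == "101010010111"                  # matches the paper's example
# Each 6-bit group maps to one character of STCode's Base64 alphabet:
chunks = [bits[i:i + 6] for i in range(0, len(bits), 6)]
print(chunks)  # ['101010', '010111']
```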
HBase-STR [86] is an AIS big data index structure built on HBase. RowKey consists of five parts:
Geohash: divide the geographic space into equal subspaces and perform GeoHash encoding, and use R-tree to construct spatial indexes on the subspaces, aiming to accelerate the matching of RowKey and GeoHash prefixes;
coarse-grained time information;
other AIS attributes;
fine-grained time information;
GeoHash suffix.
HSTI [87] is a multilevel spatio-temporal indexing method based on the HBase database, mainly used to support spatio-temporal k-NN queries. It first divides the entire space into nonoverlapping grid cells of equal size and uses Z-order technology to map the 2-D values into 1-D values
SGR-tree [88] is an indexing method that supports queries on moving objects over fixed road networks. It first slices along the temporal dimension and divides each slice's space into a fine-grained grid. The amount of data in a specific time period
JUST [89] is a data management engine for dynamic trajectory big data. It uses the NoSQL database HBase for storage, the open-source project GeoMesa as the spatio-temporal data indexing tool, and Apache Spark as the spatio-temporal data processing tool, supporting spatial range queries, spatio-temporal range queries, and k-NN queries. JUST proposes the Z2T and XZ2T indexing techniques based on GeoMesa's Z2, XZ2, Z3, and XZ3 spatio-temporal indexes. The Z2T index first divides the time dimension into multiple disjoint time segments and then builds a separate Z2 index over the data records in each segment; the Z2T index key is the combination Num(t)::Z2(lng, lat). The XZ2T index strategy likewise first divides the time dimension into multiple disjoint time periods and then constructs a separate XZ2 index in each period. JUST supports inserting new data and updating historical data without rebuilding the index.
3) Spatio-Temporal Index Based on Accumulo Database
Accumulo supports storing and managing large datasets across clusters. It uses HDFS to store data and ZooKeeper for coordination. Accumulo tables are sparse, ordered, multidimensional maps with dynamic scalability. As the data volume grows, Accumulo splits a table into smaller parts, called tablets, which can be distributed across multiple tablet servers. By default, a table is split into tablets on row boundaries, guaranteeing that an entire row resides on a single tablet server; split points can be set with the addSplits method to control where a table is cut. A Scanner retrieves a single range of data on a single thread and returns keys in order; a BatchScanner uses multiple threads to retrieve multiple ranges with higher performance but does not guarantee that keys are returned in order. Iterators let users implement custom retrieval logic on the tablet server, so they can be used to evaluate spatio-temporal predicate queries.
DBZ [90] is a globally distributed spatio-temporal index that extends BZ-trees [91], aiming to optimize search performance and space utilization. DBZ supports static and dynamic indexes, and the index mode must be specified at index creation. Static indexes suit the initial batch loading of spatio-temporal data: MapReduce jobs scan the spatio-temporal data to obtain the spatio-temporal dimension information of each row of ST objects. The spatio-temporal information
Fox et al. [92] propose a spatio-temporal index based on a sorted distributed storage structure, targeting geo-temporal data whose spatial geometry consists of single points. The spatial index information is represented as a 35-bit GeoHash, and the time information is encoded as "yyyyMMdd." Part of the GeoHash string is interleaved with part of the datetime string to build the index key. The column family of the index key contains more fine-grained spatial ranges, e.g.,
Park et al. [93] build a dynamic multilevel grid index in Accumulo that improves parallelism and adapts to the skew of spatio-temporal data. The main idea is as follows: first, H-Curve technology maps the spatial component of each stream record ⟨timestamp, id, location, record⟩ to a 1-D cellID, converting the record into ⟨timestamp, id, cellID, location, record⟩. The ingest manager then stores the transformed data in a data buffer consisting of a hash table. The index manager creates a Kd-tree index over the hash table records and stores it in an index buffer. When the two buffers reach their respective size thresholds, the data table and index table in Accumulo are refreshed, respectively. Data table
GeoMesa [94] is a commercial spatio-temporal database based on distributed systems, used for the storage, querying, and analysis of large-scale spatio-temporal data and built on the Apache Accumulo database. The main idea of GeoMesa is to use space-filling curves to reduce multidimensional data to 1-D linear values, so that records close in the spatio-temporal dimensions are transformed into lexicographically close keys that can be stored and batch-loaded efficiently. GeoMesa organizes data by geohashes and timestamps: a key is a combination of a geohash and a timestamp. This scheme is primarily used for point data; line and area data must be broken up into multiple disjoint geohashes. For each indexing strategy, GeoMesa adds a random prefix to the generated key, distributing records across region servers and achieving load balancing. When querying, only data items whose geohash values intersect the query area are considered. GeoMesa uses 3-D Z-order curves to index spatio-temporal point data and XZ-ordering [95] to index spatio-temporal line/surface data.
So far, research on spatio-temporal indexes for key–column databases has focused mainly on RowKey design. These indexes follow the traditional strategy of generating jointly encoded spatio-temporal values to achieve the fastest retrieval speed, but they only locate data blocks quickly; physical nodes cannot rapidly filter the data within a block.
B. Key–Value Database
The key–value pair data model is essentially a mapping: the key is the unique handle used to find each data item's address, and the value is the actual stored content. Typically, a hash function realizes the mapping from key to value; at query time, the data row is located directly from the hash of the key, enabling fast queries.
The key–value database that currently supports spatio-temporal data indexing is Redis [96]. Redis is a memory-based database with high data throughput and efficient read/write capability, and it supports horizontal scaling and highly concurrent queries. However, it mainly targets efficient single-key queries and has limited storage performance; as the data volume increases, its storage speed gradually decreases. Therefore, a "Field" field is added as auxiliary filter information for the key. Redis usually provides millisecond-level real-time queries, while data persistence is typically delegated to other databases, e.g., HBase.
Redis-Geo [86] is an index structure supporting efficient spatio-temporal queries over AIS big data. It uses Redis's "Field" feature to build a spatial index. When AIS data are written to Redis, the time and MMSI of the AIS record are extracted as the table key, the latitude and longitude are extracted and added to the specified key with the geoadd method, and the whole record is used as the value. The Redis database provides millisecond-level real-time response, but its storage performance is severely limited, so it struggles with spatio-temporal big data.
C. Key–Document Database
MongoDB is a high-performance NoSQL database with built-in support for spatio-temporal indexes [97]. MongoDB uses sharding for horizontal scaling: the user chooses a shard key, which determines how the data in a collection are distributed; data are divided into ranges (based on the shard key) and distributed across multiple shards. MongoDB can run on multiple servers, balancing load and/or replicating data to keep the system running in the event of hardware failure.
STIG [23] is an MD index structure extending the Kd-tree; it aims to support complex interactive spatio-temporal range queries over static data requiring point-in-polygon (PIP) tests. A single index filters spatio-temporal data over multiple dimensions simultaneously to reduce the number of PIP operations, and STIG uses the GPU's parallel processors to execute many independent PIP tests in parallel. For the STIG tree, k = 2 × s + m, where s is the number of spatial dimensions and m the number of other attributes, such as the time dimension; the index is designed for data with multiple spatio-temporal attributes, such as taxi log data with pickup and drop-off locations and times. Tree nodes are divided into internal nodes and leaf nodes, each leaf node pointing to a leaf disk block. An internal node at depth d stores the median of the (d mod k)th coordinate of the covered points and pointers to its left and right children; a leaf node stores a pointer to its leaf disk block and the k-dimensional borders bounding all records in the block. Leaf disk blocks store the attribute values over which the index was created and pointers to the actual record locations. STIG aggregates points along the k dimensions to speed up query processing and maximize utilization of the underlying GPU. STIG does not support dynamic updates; indexes must be rebuilt periodically as new records are added to the database.
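A minimal sketch of the STIG tree construction described above, cycling through the k dimensions by depth (here s = 2 and m = 1 for brevity; STIG additionally lays leaves out as GPU-friendly disk blocks, which this sketch does not model):

```python
def build_stig(recs, k, depth=0, leaf_size=4):
    """recs: list of k-dimensional tuples. An internal node at depth d splits
    on coordinate d % k and stores the median, as in the STIG tree."""
    if len(recs) <= leaf_size:
        lo = tuple(min(r[i] for r in recs) for i in range(k))
        hi = tuple(max(r[i] for r in recs) for i in range(k))
        return {"block": recs, "bounds": (lo, hi)}  # leaf block + k-d border
    axis = depth % k
    recs = sorted(recs, key=lambda r: r[axis])
    mid = len(recs) // 2
    return {"axis": axis, "median": recs[mid][axis],
            "left": build_stig(recs[:mid], k, depth + 1, leaf_size),
            "right": build_stig(recs[mid:], k, depth + 1, leaf_size)}

# Illustrative taxi-like records: pickup (x, y), drop-off (x, y), pickup time.
trips = [(0.1, 0.2, 0.4, 0.5, 1000), (0.7, 0.8, 0.2, 0.1, 2000),
         (0.3, 0.3, 0.9, 0.9, 1500), (0.5, 0.6, 0.1, 0.8, 1800),
         (0.2, 0.9, 0.5, 0.4, 1100), (0.9, 0.1, 0.6, 0.7, 1700)]
tree = build_stig(trips, k=5)
```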
Guan et al. [98] propose ST-Hash for query processing over frequently updated spatio-temporal trajectory data. It first recursively bisects each spatio-temporal dimension: the range of a child node is half of its parent's interval, the left child is labeled 0 and the right child 1, until the desired spatio-temporal granularity is reached. The latitude, longitude, and time of each trajectory point are then interleaved and encoded into a 1-D string. For example, (-140, 20, 2015-6-1 00:00:00) is converted to 2015-Re+BP, which shortens the string, reduces the memory footprint of the index, and improves index efficiency. The object ID and time information are used as the primary key, and the ST-Hash code is stored as a new attribute. In addition, B-tree indexes are created on the ST-Hash fields to speed up spatio-temporal queries. ST-Hash supports spatio-temporal point queries, spatio-temporal range queries, and spatio-temporal circle queries.
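A simplified sketch of the bisect-and-interleave idea; ST-Hash's actual code alphabet and layout differ, and the packing into printable characters here is purely illustrative:

```python
def bisect_bits(value, lo, hi, depth):
    """Repeatedly halve [lo, hi): the left half is labeled 0, the right 1."""
    out = []
    for _ in range(depth):
        mid = (lo + hi) / 2
        if value < mid:
            out.append(0); hi = mid
        else:
            out.append(1); lo = mid
    return out

def st_hash(lat, lon, t_epoch, depth=12):
    """Interleave latitude, longitude, and time bits into one short string."""
    bits = [b for triple in zip(bisect_bits(lat, -90, 90, depth),
                                bisect_bits(lon, -180, 180, depth),
                                bisect_bits(t_epoch, 0, 2**32, depth))
            for b in triple]
    # Pack every 6 bits into one printable character to shorten the code.
    return "".join(chr(48 + int("".join(map(str, bits[i:i + 6])), 2))
                   for i in range(0, len(bits), 6))

code = st_hash(39.91, 116.40, 1433116800)  # a 6-character 1-D code
```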
SPTEJ [99] is an MD index structure designed to accelerate joins and queries over different Internet of Things subjects within the same spatio-temporal window. The main idea is as follows (a sketch appears after the list):
1) divide the spatio-temporal cube into equal-sized cells;
2) find the cells, and their adjacent cells, that contain a subject $o_{i}$;
3) retrieve the tuples contained in the cell set;
4) perform the join query on the candidate tuples.
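A minimal sketch of the cell-based join in these steps; the cell size and the 26-neighbor spatio-temporal adjacency are illustrative assumptions, and the final exact-window refinement is only marked:

```python
from collections import defaultdict
from itertools import product

def cell(p, size=1.0):
    """Equal-sized spatio-temporal cells: (x, y, t) -> integer cell coords."""
    return tuple(int(v // size) for v in p)

def st_join(subjects_a, subjects_b, size=1.0):
    """Join two subject sets that fall in the same or adjacent cells."""
    grid = defaultdict(list)
    for p in subjects_b:
        grid[cell(p, size)].append(p)
    for p in subjects_a:
        cx, cy, ct = cell(p, size)
        # Candidate tuples: the cell of p plus its 26 spatio-temporal neighbors.
        for dx, dy, dt in product((-1, 0, 1), repeat=3):
            for q in grid.get((cx + dx, cy + dy, ct + dt), ()):
                yield p, q  # exact spatio-temporal window test would go here
```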
S4STRD [101] is a scalable in-memory storage system for accessing real-time trajectory data. RAM serves as temporary storage for field data with a high access frequency, and MongoDB is used as auxiliary storage to persist the original data. Index construction proceeds as follows. 1) Spatial index: divide the geographic space into an n×n uniform grid whose coordinates give the spatial location; merge adjacent low-density grids of trajectory points into regions to reduce the density gap between grids, distribute data evenly in space, and establish a one-to-one mapping between regions and servers. 2) Time index: sort the data by write time and build a B-tree index over it. When a new query is executed, RAM is searched first according to the index information; if no result is returned, the original MongoDB data are loaded into memory according to the index information and the query is executed again.
ST-Index [102] is a spatio-temporal index based on the MongoDB database that supports spatio-temporal range queries over massive trajectory data. Its core idea is to recursively divide the spatio-temporal cube into finer subcubes.
The "weak" schema requirements of MongoDB documents allow it to flexibly handle data of different structures; its automatic sharding mechanism achieves load balancing; and it supports single and composite indexes on specified attributes, which can be combined with external data structures to build spatio-temporal indexes. However, its storage model mainly suits data with embedded or reference relationships, so it is difficult to process spatio-temporal data efficiently.
D. Summary
This article mainly discusses spatio-temporal indexes based on NoSQL databases. As with all databases, fast access to raw data in NoSQL databases depends on the structural organization of the stored information and the availability of suitable indexing methods. Well-designed data structures make it possible for simple techniques to quickly extract the required information from a collection of data, and sophisticated indexing methods can quickly locate single or multiple objects in a database. However, many of these indexes tend to support only specific applications.
The flexible storage mode of the NoSQL database provides simplicity for the storage and retrieval of spatio-temporal big data, but the strong encapsulation of the system and the storage structure of the fixed mode make the spatio-temporal index research only focus on the design of the RowKey. Therefore, the spatio-temporal index based on the NoSQL database has three development trends, which are summarized as follows.
Multi-level index mode: Mature linear indexing technology has gradually become the main means of managing spatio-temporal data in NoSQL databases. However, richer spatio-temporal query operations require delinearizing the index, which pushes development toward an ML index mode.
Single index platform: Redis's limited scalability and MongoDB's "loose" data structures make the key–column database the mainstream NoSQL platform for spatio-temporal big data management.
Diversified index structure: Existing RowKey designs mainly support global indexes; research on local indexes for region data is urgently needed.
Spatio-Temporal Indexing for Standalone Parallel and Distributed Systems
Some research is also devoted to building spatio-temporal indexes on an independent, scalable distributed system to efficiently query target data from spatio-temporal big data. Such spatio-temporal indexing methods do not depend on existing distributed computing systems or NoSQL databases, but are built on a distributed system with a P2P node architecture. The advantages are as follows.
Consistent hashing is used to distribute data, which avoids the data-skew problem caused by spatio-temporal indexes (see the sketch after this list).
Node decentralization is a good way to avoid the single point of failure problem in the master–slave architecture.
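A minimal consistent-hashing sketch illustrating the first advantage; the virtual-node count and node names are illustrative assumptions:

```python
import bisect
import hashlib

class Ring:
    """Keys and nodes share one hash space, so adding or removing a node
    only remaps the keys on that node's arc of the ring."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((self._h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # The key belongs to the first virtual node clockwise from its hash.
        i = bisect.bisect(self.keys, self._h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("traj:42:2015-06-01T00:00:00"))
```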
A. Index Mechanism
At present, the spatio-temporal indexes in independent parallel and distributed systems mainly adopt two mechanisms: ML index and MD index. The general index flowchart over standalone parallel and distributed systems is shown in Fig. 11.
1) Multidimensional Spatio-Temporal Indexing Schema Under Standalone Distribution
As shown in Fig. 9, first, the spatio-temporal dimension information is jointly encoded, and the sharding mechanism stores codes with the same features on the same data server. Then, to speed up code matching, a local index based on a B-tree or a similar structure is built over the codes on each data server.
Query execution process: The user's query request is sent to a random node in the network. The node checks whether it holds data satisfying the query; if so, it performs the retrieval and returns a local result set, then forwards the request to the next node through routing, where the same steps are repeated; if not, the node is skipped and the request is forwarded directly to the next node.
2) Multilevel Spatio-Temporal Index Mode Under Standalone Distribution
As shown in Fig. 10, the time information is first divided into small-granularity time ranges, and indexes are established to speed up the matching of time information. For each time range, a distributed cluster with a P2P architecture is constructed, and the data are distributed to the node servers through a hash function. The overlay network nodes store each other's routing and data information for adjacent nodes to achieve cross-network communication and data retrieval. Then, a local index is built over the local data on each node server. Each node server in the overlay network plays two roles: it is both a data storage node and a local index node.
Query execution process: Extract the time range of the user's query request, find the corresponding overlay network through the 1-D time index, and send the query to any node in the network for execution. 1) Determine whether the node holds data satisfying the query; if so, perform the retrieval, return the local result set, and forward the query request to the next node through routing, where the same steps are repeated; if not, skip the node and forward the request directly to the next node. 2) Determine whether the adjacent nodes of this node hold data satisfying the query; if so, perform a cross-network retrieval, return the adjacent node's result set, and forward the request to the next node, repeating the same steps; if not, skip the adjacent node and forward the request directly to the next node.
B. Individual Prototype and Techniques
Elite [2] is a method that supports parallel updates and query processing over spatio-temporal big data, including spatio-temporal range queries (STRQ) and spatio-temporal nearest neighbor queries (STNNQ). The Elite index consists of three layers: 1) a skip list layer, 2) a ring layer, and 3) an OCT tree layer. The skip list layer and ring layer constitute the global index; the OCT tree layer is a local index consisting of an OCT tree and a hash table, where the OCT tree stores the trajectory observation positions and the hash table maps the unique identifier (ID) of each trajectory to its earliest and latest observation positions. The skip list layer contains a doubly linked skip list in which each node corresponds to a ring cluster. A node's key consists of the time interval of the ring cluster and a preallocated segment of consecutive IP addresses. The ring layer consists of chained rings, in which each ring body is a cluster of nodes. Each node in a ring maintains a routing table containing adjacent node IP addresses and data ranges; this information is used for communication between nodes within the ring. For communication between two rings, a ring node randomly selects an IP address from its IP address segment connecting the rings, and the randomly selected node then communicates within its ring to find the target node. Trajectory queries proceed in three stages.
Filter: Identify the ring nodes that overlap the query area using the distributed index. All candidate ring nodes run their subqueries in parallel, accessing their local indexes to retrieve candidate trajectories.
Node assignment: Assign free ring nodes to each candidate node to further refine the result set.
Refinement: Use a sampling method to generate a set of possible instances that simulate the uncertainty of the trajectory, compute the corresponding qualification probabilities, and combine the query results of all nodes into the final result.
GeoTrie builds its distributed index in two steps.
Mapping: Convert the latitude, longitude, and time into 32-bit binary strings, forming a tuple $T_{k}=(T_{\text{lat}}, T_{\text{lon}}, T_{t})$.
Index: Map all GeoTrie nodes onto the DHT structure through the $K=\text{Hash}(l)$ method.
A-tree [106] addresses the distributed indexing of multidimensional data in cloud computing environments. The basic idea is that the master node builds a global index as a 1-D array, and each slot of the array is assigned to a node. Each slave node builds an R-tree and also creates a Bloom filter, and data are allocated to the nodes according to the DHT principle. For a point query, the Bloom filter is checked first: if the point is not in the filter, the R-tree query is skipped; otherwise, the R-tree query proceeds. For a range query, the Bloom filter cannot be used and the R-tree query must be executed.
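A minimal sketch of this Bloom-filter pre-check; the filter parameters are illustrative, and the R-tree itself is only indicated:

```python
import hashlib

class Bloom:
    """A tiny Bloom filter used as an A-tree-style pre-check: if the filter
    says 'absent', the more expensive R-tree point query is skipped."""
    def __init__(self, m=8192, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _hashes(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits |= 1 << pos

    def may_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits >> pos & 1 for pos in self._hashes(item))

bloom = Bloom()
bloom.add((116.40, 39.91))
if bloom.may_contain((116.40, 39.91)):
    pass  # only now descend into the slave node's R-tree
```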
Galileo [107] is a high-throughput storage system using a zero-hop DHT peer-to-peer architecture, supporting read and write operations at the hundred-millisecond level. The storage nodes are divided into groups, and each group is assigned a unique UUID stream. Galileo uses a two-layer hashing scheme: first, a GeoHash computed from the spatial information of the data determines the target group; then, SHA-1 hashes computed over the temporal and feature metadata of the data determine the storage node within the group. Together, the group and storage node hashes determine the specific node UUIDs to communicate with. Galileo stores data on physical media via the host file system; the unit of storage is called a block, and each block carries a set of metadata containing both temporal and spatial information. When queried, Galileo incrementally returns the metadata of matching blocks to the requester and organizes them into a traversable in-memory metadata subgraph. Galileo can freely switch the root node for traversing the subgraph through redirection to meet the needs of different spatio-temporal queries.
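A simplified sketch of the two-layer hashing scheme: Galileo uses a GeoHash for the group step, which is approximated here by a coarse latitude/longitude grid prefix; the group layout and UUID values are illustrative assumptions:

```python
import hashlib

def group_for(lat, lon, groups):
    """Level 1: a coarse spatial prefix (standing in for the GeoHash)
    selects the target group for the data."""
    prefix = f"{int(lat // 10)}:{int(lon // 10)}"
    return groups[int(hashlib.sha1(prefix.encode()).hexdigest(), 16) % len(groups)]

def node_for(group, temporal, features):
    """Level 2: SHA-1 over temporal and feature metadata picks the node UUID."""
    meta = f"{temporal}|{features}"
    return group[int(hashlib.sha1(meta.encode()).hexdigest(), 16) % len(group)]

groups = [["uuid-a1", "uuid-a2"], ["uuid-b1", "uuid-b2"]]
g = group_for(39.91, 116.40, groups)
print(node_for(g, "2015-06-01T00", "speed,heading"))
```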
DISTIL+ [108] is a distributed spatio-temporal data processing system for highly updated location data that extends DISTIL [109]. DISTIL+ builds a three-tier distributed in-memory index by leveraging the asynchronous partitioned global address space paradigm, aiming to support highly concurrent spatio-temporal range queries and spatio-temporal k-NN queries. The location, velocity, orientation, and timestamp records of moving objects are stored in a location table. The main idea of the index design is to discretize the spatial and temporal dimensions. The global index (level 1) adopts a quadtree partition of the spatial domain. The local index (levels 2 and 3) of each node consists of a spatial index and a partial temporal index (PTIndex). The spatial index maintains information about each grid cell of the spatial domain, and each grid cell has a PTIndex consisting of an interval table. Each entry in the interval table contains a bit vector and a hash table (RIDListMap): the bit vector identifies the moving objects present in the grid cell during a given time interval, and RIDListMap associates each moving object with a list of record identifiers that locate its movement records in the location table. When a new location update is received from a moving object, the coordinator component inserts the location record object (LRec) into a concurrent queue, and a producer–consumer paradigm is used for parallel processing.
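The interval-table entry just described can be sketched minimally as follows, assuming object IDs are small integers so that a Python integer can serve as the bit vector; the class name and method names are illustrative, not DISTIL+'s API:

```python
from collections import defaultdict

class IntervalEntry:
    """One PTIndex interval entry: a bit vector flags the moving objects seen
    in this grid cell during the interval, and RIDListMap maps each object
    to the record IDs of its rows in the location table."""
    def __init__(self):
        self.bitvector = 0                   # bit i set => object i present
        self.ridlistmap = defaultdict(list)  # object id -> record ids

    def insert(self, obj_id, rid):
        self.bitvector |= 1 << obj_id
        self.ridlistmap[obj_id].append(rid)

    def objects_present(self, obj_ids):
        return [o for o in obj_ids if self.bitvector >> o & 1]

entry = IntervalEntry()
entry.insert(obj_id=7, rid=1042)
print(entry.objects_present([3, 7]))  # -> [7]
```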
ToSS-it [110] supports indexing the current positions of moving objects. The main idea is to build a new index every time a location changes instead of updating the old one, so there is no need for a centralized update buffer to maintain not-yet-updated indexes, improving system scalability. ToSS-it uses Voronoi diagrams distributed over multiple nodes, first distributing all objects across cloud servers while maintaining their spatial locality, and constructing the Voronoi diagrams in a first-distribute-then-build fashion. The initial distribution of data is executed by a centralized server. Then, a local Voronoi diagram (LVD) is built on each server, decomposing the space into disjoint polygons; LVD generation utilizes all available CPU cores on each node to further partition objects. A hierarchical Voronoi index structure is built on each server to speed up query processing, and a submitted query is routed to the server whose Voronoi cells cover the query region.
The spatio-temporal indexes discussed previously use a variety of indexing techniques to improve spatio-temporal data processing capabilities without relying on existing distributed computing systems and NoSQL databases. These indexes achieve scalability and fault tolerance, but have the following disadvantages. 1) Consistent hashing distributes data and achieves load balancing, but it weakens the spatio-temporal relationships between data and destroys data locality; as a result, such indexes can satisfy only simple spatio-temporal point queries and cannot meet the needs of more complex spatio-temporal applications. 2) Node decentralization makes it difficult to directly address the index server, whether for an ML index or an MD index; when data are inserted frequently, index updates take a long time.
Comparison of Indexing Mechanisms
In this section, we summarize the three main indexing mechanisms and the indexing methods from the three distributed environments, and we compare and analyze their advantages and disadvantages from four aspects: adaptivity, data skew, index size, and query efficiency. Finally, typical application scenarios of each indexing mechanism are given in Table III.
In terms of adaptivity, this metric describes how well a method finds the best match for a given query workload and decides when the system should adjust its storage at runtime. Both the ML index and the joint index are able to identify changes in the query workload.
In dealing with data skew, this metric aims to determine the most appropriate partitioning technique for a given dataset in the presence of skew. An ML index can select the most suitable partitioning technique for each level, and the techniques may differ between levels; therefore, ML indexes handle this problem well. MD indexes and joint indexes typically split the spatio-temporal dimensions into uniform cells and are, therefore, not suitable for skewed data.
In terms of query efficiency, the ML index has higher efficiency for range queries, k-NN queries, and similarity queries. When processing join queries, the data of each layer must be processed, so its efficiency is moderate. The MD index processes single-dimensional and multidimensional queries quickly and, by ignoring the effect of partition order on partition properties, handles range queries efficiently. The joint index can locate data accurately and quickly through the key value, so it is highly efficient for conditional queries such as range, k-NN, and similarity queries.
To sum up, ML indexes suit queries focused on the time or space dimension, are built on distributed computing systems and standalone parallel and distributed systems, and fit custom index granularity and high-throughput requirements. MD indexes suit evenly distributed data and general range queries, are built on the same two kinds of systems, and fit high-throughput, real-time, and low-latency requirements. Joint indexes suit NoSQL databases and fit data update, data insertion, real-time, and low-latency requirements.
Experimental Results
In this section, we compare and analyze the performance of spatio-temporal indexing over different distributed systems.
A. Settings
To evaluate the performance of the indexing methods, we use a real spatio-temporal dataset, T-Drive,1 which contains one week of trajectories of 10 357 taxis, about 15 million points in total. Table IV lists the software packages and versions used in our experiments.
B. Spatio-Temporal Index Performance Over Batch System
In the performance comparison of spatio-temporal indexing over the batch system, we choose ST-Hadoop,2 HT, STQuery, and CloST,3 which have open-source code or a clear algorithm flow, as comparison methods, and we use query time as a function of data size, query time across query windows, index time, and index size as evaluation indicators.
We compare the performance of the indexing methods for spatio-temporal range queries at different data sizes, with the spatio-temporal window held fixed. As shown in Fig. 12(a), ST-Hadoop has the fastest query speed because 1) ST-Hadoop adopts an ML indexing mechanism, which makes better use of machine hardware resources, and 2) its global/local index mode can quickly locate the physical node where a data block resides.
Performance comparison of different indexing methods over the Hadoop framework. (a) Query time over data size. (b) Query time over index method. (c) Index size. (d) Index time.
Fig. 12(b) shows how the performance of the spatio-temporal range query changes under different spatio-temporal query windows. We randomly set 100 different query windows, and each query is executed only once; since the cost of different queries can vary widely, we report the 5%–95% interval and the median of these 100 query times. As shown in Fig. 12(b), ST-Hadoop achieves the best query time among all index methods. Although CloST also adopts an ML indexing mechanism, its query performance is lower than that of ST-Hadoop, mainly because CloST's indexing method cannot handle skewed data, which easily leads to load imbalance.
Fig. 12(c) compares the storage size under different spatio-temporal indexes as the data volume changes. The horizontal axis represents the percentage of the dataset used. Note that the storage size here includes the data itself. The figure shows that 1) as the amount of data increases, the size of all spatio-temporal indexes grows linearly, and 2) the ST-Hadoop ML index is larger than the HT index because an ML index may replicate data once at each level.
Fig. 12(d) shows how the indexing time varies with the size of the data. As the data volume increases, the indexing time of all methods increases exponentially. ST-Hadoop takes more time than HT, mainly because ST-Hadoop processes data hierarchically; under the Hadoop architecture, this generates more intermediate results and, therefore, more I/O operations.
C. Spatio-Temporal Index Performance Over Stream Processing System
In the performance comparison of spatio-temporal indexing over the stream processing system, we choose DITIR and DSI, which have a clear algorithm flow, as comparison methods, and we use index size and throughput as evaluation indicators.
Fig. 13(a) shows the storage cost of the two methods at different raw data sizes. Note that the storage cost includes both the index structure and the data. As the raw data grow, more storage space is required. DSI takes up more space than DITIR, mainly because DSI builds separate indexes in the x-axis and y-axis directions.
Performance comparison of different indexing methods over the stream processing system. (a) Index size. (b) Throughput.
Fig. 13(b) shows the throughput of the two methods at different raw data sizes. As the amount of data increases, the throughput of both methods decreases, mainly because neither indexing method considers load balancing.
D. Spatio-Temporal Index Performance Over Hybrid Processing System
In the performance comparison of spatio-temporal indexing over the hybrid processing system, we choose DITA,4 Simba,5 STARK,6 and GeoFlink,7 which have open-source code or a clear algorithm flow, as comparison methods, and we use index size and throughput as evaluation indicators.
Fig. 14(a) shows the storage cost of the four methods at different raw data sizes. Note that the storage cost includes both the index structure and the data. As the raw data grow, more storage space is required. DITA and Simba take up more space than the other methods, mainly because they use an ML indexing mechanism that handles the spatial and temporal dimensions separately.
Performance comparison of different indexing methods over the hybrid processing system. (a) Index size. (b) Throughput.
Fig. 14(b) shows the throughput of the four methods at different raw data sizes. Although DITA and Simba take up more space than the other methods, their throughput is higher: the global/local index mode makes full use of machine hardware resources and improves query performance.
E. Spatio-Temporal Index Performance Over HBase Database
In the performance comparison of spatio-temporal indexing over HBase database, we choose GeoMesa8 and JUST9 with open-source code as comparison methods, and choose query time and index size as the evaluation indicators for comparison.
As shown in Fig. 15(a), as the dataset grows, both methods require more time to answer spatio-temporal range queries because more data are scanned and returned. JUST is much faster than GeoMesa because JUST encodes both temporal and spatial information into the keys of the NoSQL data store, placing the temporal information first and the spatial information second, which allows eligible records to be located quickly and directly.
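A minimal sketch of such a time-first key, assuming an hourly time bucket and a precomputed spatial code (e.g., a Z-order value); JUST's exact key layout differs, so this only illustrates why a temporal range becomes one contiguous key scan:

```python
import struct

def row_key(ts_epoch: int, z_value: int) -> bytes:
    """Time-first key: a coarse time bucket prefixes the spatial code, so
    keys sort by time bucket first and by spatial code within each bucket."""
    time_bucket = ts_epoch // 3600  # illustrative: one bucket per hour
    return struct.pack(">QQ", time_bucket, z_value)

# A scan between two bucket boundaries touches only temporally eligible rows.
start = row_key(1433116800, 0)
end = row_key(1433203200, 2**64 - 1)
```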
Performance comparison of different indexing methods over the HBase database. (a) Query time. (b) Index size.
Fig. 15(b) shows the storage cost of the two methods at different raw data sizes. Note that the storage cost includes both the index structure (i.e., keys) and the data (i.e., values). As the raw data grow, more storage space is required. JUST takes up more space than GeoMesa, mainly because JUST uses more space to store index information.
Challenges and Outlook
Most existing prototypes and systems adopt distributed indexing, and the basic idea is a two-level indexing scheme. At the local index layer, index structures, e.g., R-tree, quadtree, and grid, are usually used. Another way is to use space-filling curves to map data into 1-D values and use a traditional B-tree for local indexing; NoSQL databases widely adopt the latter approach. For spatio-temporal data, some indexing methods consider the temporal dimension first and then the spatial dimension. At the global index layer, the most common method is to collect summary information from the local indexes of the nodes in order to build a global index and quickly route queries to data nodes. These indexes also bring new challenges and future research directions.
A. Multimodal Adaptive Indexing Mechanism for Spatio-Temporal Big Data in Multicopy Mode
In the conventional big data storage environment, keeping multiple copies is the basic strategy for ensuring data security and reliability. A unified index mode cannot support the diverse application requirements of spatio-temporal big data; establishing different index modes for different copies of the same data can improve storage and access efficiency according to the application requirements. Therefore, how to design a multimodal adaptive spatio-temporal big data indexing mechanism is a hot research direction.
B. Spatio-Temporal Index Mode for Low-Latency Queries in High Real-Time Applications
With the deployment of 5G networks, real-time applications, e.g., autonomous driving, real-time navigation, and short-term prediction, place ever higher requirements on the retrieval efficiency of spatio-temporal big data. Considering communication cost, maintenance cost, and index performance, neither tree-based indexes nor grid indexes can be applied directly to streaming data processing systems. Therefore, how to design a spatio-temporal index mode with low communication cost, low maintenance cost, and high index performance on streaming systems, shortening the query response time over petabyte-level data to the second level, is receiving more and more attention.
C. Spatio-Temporal Indexing Technology and Theoretical Specification Based on NoSQL Database
At present, existing spatio-temporal indexes based on NoSQL databases are oriented only to specific applications; the supported queries and the key technologies used differ widely, and no systematic set of normative criteria has been formed. Therefore, it is urgent to standardize NoSQL spatio-temporal indexes and their supporting theory, which mainly includes the following.
Specification technology and supported query operations for NoSQL spatio-temporal indexes.
Guidelines for the application-oriented design of NoSQL spatio-temporal indexes: index structures that minimize spatio-temporal granularity and heuristic criteria that minimize query response time, together pointing toward optimal spatio-temporal index structures.
Study of the spatio-temporal semantics expressed by a spatio-temporal index and of its correctness verification rules, providing a basis for the rationality and correctness of spatio-temporal data organization.
Conclusion
Spatio-temporal indexing is the core of spatio-temporal big data management, especially with the introduction of concepts, e.g., knowledge computing, social computing, and urban computing [112]. Various parallel and distributed spatio-temporal indexing methods have been proposed to improve access efficiency over spatio-temporal big data. This review covers spatio-temporal indexing methods proposed from 2010 to 2020, namely:
spatio-temporal indexing based on distributed computing system;
spatio-temporal indexing based on NoSQL database;
spatio-temporal indexing for standalone parallel and distributed systems.
ACKNOWLEDGMENT
The authors would like to thank X. Du, C. Li, G. Peng, and C. Li, with special thanks to A. Eldawy, Assistant Professor with the University of California, Riverside.