Introduction
In relational and NoSQL databases, the ability to obtain aggregate information from a (possibly large) set of “records” (tuples, documents, etc.) has always been an important feature. Usually, the set on which to apply an aggregation function (e.g., COUNT or SUM in SQL [20]) is identified by some form of record selection. In the simplest form, the records that have one of their fields in a given range are selected and the aggregation function is applied to them. Queries of this kind are usually referred to as aggregate range queries. Applications of aggregate range queries can easily be found in many data analytics activities related to business intelligence, market analysis, user profiling, IoT, etc. Further, this feature is one of the fundamental elements of a good system for big-data analytics. In many applications, the speed at which queries are fulfilled is critical, possibly marking the distinction between a system that meets user needs and one that does not. Typical examples are interactive visual systems, where users expect the system to respond in less than a second.
The vast majority of DataBase Management Systems (DBMSes) supports aggregate range queries to some extent. When data are large, aggregation performed by a complete scan of the selected data can be too costly to be viable. In theory, by adopting proper data structures for indexing [16], it is possible to answer aggregate range queries in $O(\log n)$ time, where $n$ is the number of records.
However, in practice, the indexes realized by many DBMSes are designed to support regular (non-aggregated) queries. This is reflected in the limited advantage that regular indexes can provide to aggregate range queries. In particular, most DBMSes can exploit regular indexes only for aggregation based on the MIN and MAX functions. Other typically DBMS-supported aggregation functions, like SUM, AVG, etc., usually require sequential scans. To speed up these sorts of queries, a typical trick is to keep data in main memory, which is costly and usually not possible for big-data applications.
In this paper, we introduce the concept of overlay index, which is a data structure that plays the role of an index but is explicitly stored in a database along with the regular data. The logic to use an overlay index is not frozen in the DBMS but can be programmed and customized at the same level as the application logic, obtaining great flexibility. However, designing an overlay index raises specific challenges. In principle, any memory-oriented data structure could be easily represented in a database. Nonetheless, operations on these data structures typically require traversing a logarithmic number of elements, each of them pointing to the next one, which is an inherently sequential task. While this is fast and acceptable in main memory, sequentially performing a logarithmic number of queries to a DBMS is extremely slow. In fact, each query encompasses client/server communication, usually involving the network and the operating system, which introduces a large cost that cannot be mitigated by the DBMS query planner. Further, performing queries sequentially prevents the exploitation of the capability of RAID arrays and DBMS clusters to fulfill many requests at the same time, and of disk schedulers to optimize the order of disk accesses [56].
Our main contribution is a new data structure, called DB-tree, for the realization of an overlay index. DB-trees are meant to be stored and used in standard DBMSes to support custom aggregate range queries and possibly other needs, like, for example, data authentication by Authenticated Data Structures [55] (ADS). DB-trees are a sort of search tree whose balancing is obtained through randomization, as for skip lists [47]. In a DB-tree, query operations require only a constant number of range selections, which can be executed in parallel in a single round and which return a logarithmic amount of data. Updates, insertions, and deletions also involve a logarithmic amount of data and can be executed in at most two rounds. We formally describe all algorithms and prove their correctness and efficiency.
Additionally, we present experimental evidence of the efficiency of our approach, on two off-the-shelf DBMSes, by comparing the query running times obtained using DB-trees against those of the same operations performed with only the support of the DBMS. Experimental results show that the adoption of DB-trees brings a large gain for range queries, but may introduce a non-negligible overhead when data change. We also discuss several applicative and architectural aspects as well as some variations of DB-trees.
This paper is structured as follows. In Section II, we review the state of the art. In Section III, we introduce the concept of overlay index and discuss several architectural aspects. DB-trees are introduced in Section IV along with some formal theoretical results. The algorithms to query a DB-tree and to perform insertions, deletions, and updates are shown in Section V. In Section VI, we show an experimental comparison between DB-trees and a plain DBMS. In Section VII, we show how it is possible to use DB-trees to efficiently perform aggregate range queries grouped by the values of a certain column. We also show an experimental comparison with other approaches. In Section VIII, we show how DB-trees can be adapted to realize persistent ADSes. In Section IX, we discuss some architecture generalizations. In Section X, we draw our conclusions.
State of the Art
In this section, we review the state of the art of technology and research about the optimization of aggregate range queries. We also review the state of the art about the persistent representation of ADSes, which is a relevant application of the results of this paper.
Optimization of query processing in DBMSes is a classic subject in database research, which historically also dealt with the optimization of aggregate queries (see, for example, [59]). Indexes are primary tools for the optimization of query execution. They are data structures internally used by DBMSes to speed up searches. They are persisted on disk in a DBMS-proprietary format. Typical indexes realize some form of balanced search tree or hash table [16], [49]. Specific indexing techniques for uncertain data are also known, see for example [3], [11]–[13]. Some research effort was also dedicated to the creation of a general architecture for indexing, see, for example, [29], [33]. However, these results have been adopted only by specific DBMSes (see, for example, [4]). The query optimizer of a DBMS can take into account the presence of indexes in the planning phase if this is deemed favorable. In a typical DBMS, the creation of an index is requested by the database administrator, or by the application developer, using proper constructs (available, for example, in SQL). Decisions about index creation are usually based on the foreseen frequencies of the queries, the involved data size, and execution time constraints. Several works deal with the possibility for a DBMS to self-tune and to choose the right indexes (see, for example, [9], [34]). Usually, there is no way for the user of the DBMS to access an index directly or to create indexes that are different from the ones the DBMS is designed to support. A notable exception is the PostgreSQL DBMS, which provides some flexibility [1].
Concerning aggregate range queries, in common DBMSes, regular indexing is usually effective only for some aggregate functions, like MIN and MAX. Sometimes, certain index usages can provide a great speed-up for aggregate range queries, because keeping certain data in the index may avoid random disk accesses and/or allow most processing to occur in memory [48]. The current DBMS technology and the SQL standard do not allow the user to specify custom indexes to obtain logarithmic-time execution of aggregate range queries, even for SQL-supported aggregation functions for which this would be theoretically possible.
A wide variety of techniques has been proposed to optimize the execution of aggregate range queries in DBMSes. A whole class of proposals deals with materialized views (see, for example, [23], [27], [28], [40], [53]). These techniques require the DBMS to keep the results of certain queries stored and up-to-date. They can be used to simplify the execution of certain aggregate queries and are especially effective when data are not frequently updated.
A largely investigated approach is called Approximate Query Processing (AQP), which aims at gaining efficiency while reducing the precision of query results. A survey of the achievements in this area is provided in [37]. This approach is relevant especially when data are large. An approximate approach targeted to big-data environments is proposed in [62]. A method to perform approximate queries on granulated data summaries of large datasets is shown in [52]. Approximated techniques are now available on some widely used systems [8], [54]. The VerdictDB [45] system provides a handy way to support AQP for SQL-based systems by adding a query/result-rewriting middle layer between the client and the DBMS.
A context in which aggregation speed is very relevant is that of On-Line Analytical Processing (OLAP) systems (see, for example, [26], [30]). Several works deal with fast methods to obtain approximate results [2], [18], [51] for OLAP systems. In this field, specific indexing techniques may be adopted [50]. Specific data structures were proposed to support aggregate range queries, such as aR-trees [44] for spatial OLAP systems. The work in [38] surveys several aggregation techniques targeted to the storage of spatiotemporal data.
To support strict time bounds, specific techniques for in-memory databases exist [10].
As will be clear in the following, one of the applications of our results is to support Authenticated Data Structures (ADS) (see Section VIII). ADSes are useful when we need a cryptographic proof that the result of a query is coherent with a certain version of the dataset, where that version is succinctly known to the client by a cryptographic hash that the client trusts. The first ADS was proposed by Merkle [39]. Merkle trees are balanced search trees in which each node contains a cryptographic hash of its children and, recursively, of the whole subtree. The work in [46] shows how to arbitrarily scale the throughput of an ADS-based system in a cloud setting with an arbitrarily large number of clients. ADSes are especially desirable in the context of outsourced databases. The work in [35] studies data structures to realize authenticated versions of B-trees to authenticate queries. This proposal is meant to be used in DBMSes as an internal indexing structure. Other works tried to represent ADSes in the database itself. Several general-purpose techniques to represent a tree are presented in [7]. The problem of representing an authenticated skip list [55] in a relational table was investigated in [19], which proposes the use of nested sets to perform queries in a single query round. An efficient use of this approach to compute authenticated replies for a wide class of SQL queries is provided in [42]. Nested sets represent nodes of trees as intervals bounded by integers, and a parent-child relation is represented as a containment relation between two intervals. Unfortunately, nested sets cannot be updated efficiently. In [58], several variations of nested sets are described, varying the way in which interval bounds are represented. These methods are based on numerical representations of rational numbers and are limited by precision problems.
Overlay Indexes
In this section, we discuss the rationale for introducing overlay indexes, the applicative contexts where overlay indexes can be fruitfully applied, and some architectural aspects. This discussion is largely independent of the DB-tree data structure, which we propose as one form of realization of overlay indexes and which is described in Section IV.
In the following, when we refer to a query, we may mean either a proper data-reading query or a generic statement, which can also change the data. The distinction should be clear from the context.
In Figure 1, we show a schematic architecture for the adoption of overlay indexes. We assume a client and a DBMS connected by a communication link. In our scheme, the client contains the application logic and performs queries to the DBMS through that link. Note that we use this scheme in the following discussion, but we do not intend to restrict the use of overlay indexes to this simple architecture. For example, the client may be a middle-tier server that performs DBMS queries on behalf of user clients, the storage of the DBMS may be located in a cloud, the client may be on the same machine as the DBMS with the connection being just inter-process communication, and so on. The protocol used to perform queries to the DBMS is a standard one (e.g., SQL or one of the many NoSQL query languages). Additional requirements on the features supported by the DBMS depend on the way overlay indexes are realized. We provide these details for DB-trees in Section IV-B.
A schematic architecture for the adoption of overlay indexes. Thick arrows show the data path when the overlay logic is realized client-side. Dotted arrows show the data path when the overlay logic is realized by stored procedures in the DBMS.
A. Applicative Contexts
The applicative contexts where we believe overlay indexes may turn out to be of great help are those where (1) the application performs both updates (possibly comprising insertions and deletions) and reads on the database, (2) some read operations have strong efficiency requirements, as in interactive applications, (3) the amount of data is so large that in-memory solutions cannot be applied and, hence, the efficiency of those read operations can only be obtained by adopting indexes, and (4) the DBMS does not support the right indexing for those read operations. For example, an application might need to provide the sum of a column for the tuples whose key is contained in a certain range, which depends on (unpredictable) user requests, or might need to provide a cryptographic hash of the whole table to the client for security purposes. The first case is often inefficiently supported, and efficiency is frequently obtained not by proper indexing but by keeping data in main memory. The second case is usually not supported by DBMSes at all. Clearly, under the conditions described above, approximate query processing could be adopted. However, in certain situations approximation is not desirable or is ruled out by the nature of the aggregation function, as in the cryptographic hash case.
B. Rationale
Clearly, the introduction of an overlay index has some performance overhead when data are updated, as any indexing approach has, but it can be the only viable approach for certain non-supported aggregation functions, or it can dramatically reduce the time taken to reply to certain queries for which specific indexing is not available from the DBMS. For example, in the cases where the DBMS cannot use indexes, an aggregate range query runs in time linear in the size of the selected range, while an overlay index can reduce this to logarithmic time.
C. Architectural Aspects
We refer to Figure 1. Realizing an overlay index requires introducing (i) one or more specific table(s) into the database, which we call overlay table(s), alongside the regular data, to store the overlay index, and (ii) a (possibly only conceptual) middle layer, which we call overlay logic, that is in charge of keeping the overlay tables up-to-date with the data and of fulfilling the specific queries the overlay index was introduced for.
We observe that the overlay logic can be naturally designed as a real middle layer, which should (1) take an original query performed by the client, for example expressed in plain or augmented SQL language, (2) produce appropriate actual queries that act upon regular tables and/or overlay tables and submit them to the DBMS, (3) get their results from the DBMS, and (4) compose and interpret the results to reply to the original query of the client. This is the approach taken by VerdictDB [45] for approximate processing of aggregated range queries. Since we aim at proving the soundness and the practicality of the overlay index approach, the construction of such a middle layer for overlay indexes is out of the scope of this paper.
A simpler approach is to have a library that is only in charge of changing or querying overlay tables. In this case, it is the responsibility of the application to call this library to change an overlay table every time the corresponding regular data table is changed. Special care should be taken to keep the two consistent. For example, the whole update (of data and overlay tables) should be performed within a transaction, as in the sketch below. In certain situations, it may be convenient to keep all the data within the index itself, without having a distinct regular data table. This is the approach that we adopted for the experiments described in Section VI.
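As an illustration of the library approach, the following sketch, written in Kotlin with JDBC as in our prototype, wraps the update of the data table and the update of the overlay table in a single transaction; table and column names and the overlayLogicUpdate routine are illustrative placeholders, not part of an actual API.

```kotlin
import java.sql.Connection

// Hypothetical overlay-logic routine that rewrites the overlay-table rows affected by
// the change (e.g., the DB-tree update of Section V); shown here only as a placeholder.
fun overlayLogicUpdate(conn: Connection, key: Long, value: Long) { /* ... */ }

// Minimal sketch: data table and overlay table are changed within one transaction,
// so that the overlay index never diverges from the data it summarizes.
fun updateWithOverlay(conn: Connection, key: Long, value: Long) {
    conn.autoCommit = false
    try {
        conn.prepareStatement("UPDATE data_table SET v = ? WHERE k = ?").use { st ->
            st.setLong(1, value)
            st.setLong(2, key)
            st.executeUpdate()
        }
        overlayLogicUpdate(conn, key, value)
        conn.commit()   // both changes become visible atomically
    } catch (e: Exception) {
        conn.rollback() // keep data table and overlay table consistent on failure
        throw e
    } finally {
        conn.autoCommit = true
    }
}
```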
When designing a data structure for an overlay index, the time complexity of its operations is clearly a major concern. However, we should also take into account its efficiency in terms of data transferred for each operation, and also how this transfer is performed. In fact, to perform a query on an overlay index, several actual queries to the DBMS are likely to be needed. It is important that these queries can be performed in parallel. A round is a set of actual queries that can be performed in parallel on the DBMS. We assume the actual queries of a round to be submitted to the DBMS at the same instant. The duration of a round is the time elapsed from query submission to the end of the longest query. Round duration is composed of the execution time (on the DBMS) of the longest query and the communication latencies in both directions. A poor implementation may require the original query to take several rounds to be accomplished, possibly consisting of only one actual query each. For example, a plain porting of any balanced data structure (e.g., AVL-trees, skip lists, or B-trees) to use a DBMS as storage would take a number of rounds that is logarithmic in the data size, each consisting of a single query.
To improve performance, the overlay logic may realize some form of caching by keeping part of the overlay table in the memory of the client. While this may speed up some queries, it introduces a cache-consistency problem if multiple clients are present. For data-changing queries, the overlay logic is likely to need to know the part of the overlay table that is going to be modified before submitting the changes to the DBMS (this is the case for DB-trees, described in Section IV). The obvious approach is to perform changes in (at least) two rounds: the first (read round) retrieves the part of the overlay table involved in the change, and the second (update round) actually performs the change. The introduction of caching may help in reducing the amount of data transferred from the database in the read round. However, since the data needed for the change depend on the request performed by the user, caching is unlikely to make the read round unnecessary, unless the application always updates the same set of data. A particular case is the insertion of a large quantity of data (also called batch insert in the technical DBMS literature). In this case, many insertions may be accumulated in the cache and written in one round, possibly using the batch-insert support of the DBMS itself. We consider all aspects introduced by caching to be outside the scope of this paper. In the realization adopted for the experimentation of Section VI, we do not adopt any form of caching and we have only one client.
It is worth mentioning that the overlay logic may also be realized by exploiting the programmability facilities of certain DBMSes, usually called stored procedures. In this case, the impact of communication between the overlay logic and the DBMS would be negligible. This has two notable effects: (1) the inefficiency of data-changing operations due to the need for two query rounds is mitigated and (2) the adoption of a realization that performs many rounds (e.g., logarithmic in the data size) becomes more affordable. However, having many sequential query rounds still makes poor use of RAID arrays, DBMS clusters, and disk schedulers. Hence, also in this setting, it is advisable to adopt special data structures that limit the number of query rounds, like DB-trees. We note that stored procedures are proprietary features of DBMSes; hence, exploiting them ties the overlay logic to a specific DBMS technology. Further, not all DBMSes support them, especially in the NoSQL world.
The DB-Tree Data Structure
In this section, we describe the DB-tree data structure, which we propose to realize an overlay index. We first describe it intuitively. Then, we provide a formal description of the data structure with its invariants. Finally, we describe its fundamental properties.
A. Intuitive Description
A DB-tree stores key-value pairs ordered by key. To simplify the description, we assume that keys are unique and that both keys and values have bounded length. A DB-tree is a randomized data structure that shares certain features with skip lists [47] and B-trees [15], which are widely used data structures to store key-value pairs in key order. We start by recalling skip lists, which are conceptually simpler than DB-trees, and then we show how DB-trees derive from them. A formal description of DB-trees is provided in Section IV-B.
A skip list is made of a number of linked lists. An example is shown in Figure 2, where linked lists are drawn horizontally. Each element of those linked lists stores a key and each list is ordered. Each list is a level that is denoted by a number. Level zero contains all the keys and it is the only level that also stores values. Higher levels are progressively decimated, that is, each level above zero contains only a fraction of the keys of the level below. In this way, each key is associated with the highest level at which it appears, which we call the level of that key.
An example of a skip list. For each element, only the key is shown. Level 0 also stores values, not shown.
In traditional skip lists, when a key is inserted, the level at which it appears is chosen at random, as shown in Algorithm 1: starting from level zero, the level is incremented as long as independent random choices yield GO-UP.
Algorithm 1 Extraction of a Random Level
Output: a random level for a skip list or a DB-tree.
$l \gets 0$
while RandomChoice is GO-UP do
$l \gets l+1$
end while
return $l$
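For concreteness, this is a minimal Kotlin sketch of the random level extraction; the GO-UP probability is a parameter (our experiments in Section VI use 0.5).

```kotlin
import kotlin.random.Random

// Algorithm 1: starting from level zero, increment the level while an independent
// biased coin flip yields GO-UP.
fun randomLevel(goUpProbability: Double = 0.5): Int {
    var level = 0
    while (Random.nextDouble() < goUpProbability) {
        level++
    }
    return level
}
```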
Skip lists are regarded as efficient and easy-to-implement data structures, since they do not need any complex re-balancing. However, they are meant to be stored in memory, where traversing pointers is fast. An attempt to represent a skip list in databases was made in [19]; however, as discussed in Section II, the resulting representation supports queries in one round but cannot be updated efficiently.
DB-trees can be interpreted, and intuitively explained, as a clever way of grouping elements of a corresponding skip list so that update and query operations can be performed in only one or two query rounds. The DB-tree corresponding to the skip list shown in Figure 2 is shown in Figure 3. To simplify the picture, in this and in the following examples, we show only keys and omit values, implicitly intending (only for the purpose of the examples) that for each key its associated value coincides with the key itself.
A DB-tree corresponding to the skip list shown in Figure 2 for the SUM aggregation function. For simplicity, we do not show values. In this and in the following examples, for each key, we assume its value to coincide with the key itself.
Since we intend to use DB-trees to support aggregate range queries, we store in each node the aggregate values of the key-value pairs that lie between two adjacent keys of the node, which are indeed explicitly contained in its descendants. We have one aggregate value for each child, and the sequence of keys of a node is interleaved with the aggregate values related to its children. An aggregate value is omitted if the corresponding child is not present. In the example of Figure 3, aggregate values are shown in the nodes, interleaved with keys. We recall that, for the sole purposes of this example, values coincide with keys.
To realize a DB-tree, the overlay logic (see Section III) stores its data in a single overlay table, in which each row represents a node of the DB-tree.
B. Formal Description
A DB-tree contains key-value pairs, where keys are non-null, distinct, and inherently ordered, and values are from a set $V$. We say that an aggregation function is decomposable if it can be expressed by means of three functions $f$, $g$, and $h$ with the following properties.
$f: A \times A \rightarrow A$ is associative, that is $f(a,b,c)=f(f(a,b),c)=f(a,f(b,c))$, and there exists an identity element denoted by $1_{A}$, that is, for any $x \in A$: $f(1_{A}, x)= f(x,1_{A})= x$. The associativity of $f$ allows us to write $f(a_{1}, \dots, a_{n})$, since the grouping according to which $f(a,b)$ is applied is irrelevant for the final result.
$g: V \rightarrow A$ translates each value into the aggregation domain $A$.
$h: A \rightarrow C$ maps an aggregate value into the final result domain $C$.
We define the aggregation of values $v_{1}, \dots, v_{n}$ as $h(f(g(v_{1}), \dots, g(v_{n})))$. For a node $n$ of a DB-tree, we write $f(n)$ to denote $f(g(v_{1}), \dots, g(v_{r}))$, where $v_{1}, \dots, v_{r}$ are the values of all the key-value pairs stored in the subtree rooted at $n$.
This model can support many practical aggregation functions, like count, sum, average, variance, top-$k$, etc. For example, the aggregation function returning the two largest values can be realized as follows:
$A = \{V,\bot\} \times \{V,\bot\}$, where we intend that the first element of each pair is the maximum, the second is the second maximum, $\bot$ means undefined, and $\bot$ is less than any element in $V$; $g(v)=(v,\bot)$; $h(x)=x$; $f((v_{1},v_{2}),(v_{3},v_{4}))=(m_{1},m_{2})$, where $m_{1}=\max\{v_{1},v_{3}\}$ and $m_{2}$ is the second maximum in $U=\{v_{1},v_{2},v_{3},v_{4}\}$, that is $m_{2}=\max(U-\{m_{1}\})$; the identity element is $(\bot,\bot)$.
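To make the model concrete, the following Kotlin sketch (names are ours, not an API of the paper) captures the three functions of a decomposable aggregation function and instantiates them for SUM and for the two-largest-values example above.

```kotlin
// g translates a value into the aggregation domain A, f combines elements of A
// associatively (with an identity element), and h produces the final result in C.
interface DecomposableAggregation<V, A, C> {
    val identity: A
    fun g(v: V): A           // translation
    fun f(a: A, b: A): A     // associative combination
    fun h(a: A): C           // finalization
}

// SUM over long integers: A = C = Long, g and h are identities.
object Sum : DecomposableAggregation<Long, Long, Long> {
    override val identity = 0L
    override fun g(v: Long) = v
    override fun f(a: Long, b: Long) = a + b
    override fun h(a: Long) = a
}

// The two largest values: A is a pair (max, second max), with null playing the role of ⊥.
object TopTwo : DecomposableAggregation<Long, Pair<Long?, Long?>, Pair<Long?, Long?>> {
    override val identity: Pair<Long?, Long?> = null to null
    override fun g(v: Long): Pair<Long?, Long?> = v to null
    override fun f(a: Pair<Long?, Long?>, b: Pair<Long?, Long?>): Pair<Long?, Long?> {
        val sorted = listOfNotNull(a.first, a.second, b.first, b.second).sortedDescending()
        return sorted.getOrNull(0) to sorted.getOrNull(1)
    }
    override fun h(a: Pair<Long?, Long?>) = a
}
```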
We are now ready to formally describe the DB-tree data structure. DB-trees support any aggregation function that fits the definition of decomposable aggregation function stated above. We assume that the aggregation function to be supported is known before the creation of the DB-tree, or at least before any data are inserted into it.
A DB-tree keeps certain aggregate values ready to be used to quickly compute aggregate range queries on any range. Its distinguishing feature with respect to other results known in the literature is that it is intended to be efficiently stored and managed in databases. While typical research works in this area show data structures whose elements are accessed by memory pointers (for in-memory data structures) or by block addressing (for disk/filesystem-based data structures), the primary means of accessing the elements of a DB-tree is the range selection queries provided by the DBMS itself.
We do not restrict the kind of underlying database that can be used to store a DB-tree. However, for simplicity, we describe our model using terminology taken from relational databases. We only require the underlying database to have (i) the ability to perform range selection on several columns, efficiently, which is usually the case when proper indexes are defined on the relevant columns, and (ii) the ability to get, efficiently, the tuple containing the maximum (or minimum) value for a certain column among the selected tuples.
A DB-tree $T$ is a rooted tree whose nodes store keys, values, and aggregate values, organized as follows.
A node $n$ of $T$ contains a (possibly empty) ordered sequence of keys, each associated with its value; we denote by $n.m$ the number of keys of $n$ and by $k_{1} < k_{2} < \dots < k_{m}$ the keys themselves.
A node $n$ is also associated with a level, denoted $n.\mathrm{level}$, and with a range, delimited by $n.\mathrm{min}$ and $n.\mathrm{max}$.
For each node $n$, we denote by $n_{0}, \dots, n_{m}$ its (possibly absent) children, where, intuitively, $n_{i}$ roots the subtree whose keys lie strictly between $k_{i}$ and $k_{i+1}$ (intending $k_{0}=n.\mathrm{min}$ and $k_{m+1}=n.\mathrm{max}$).
Each node $n$ stores an aggregate sequence $n.\mathrm{aseq} = a_{0}\,k_{1}\,a_{1} \dots k_{m}\,a_{m}$, in which the keys of $n$ are interleaved with the aggregate values related to its children; the aggregate value $a_{i}$ is omitted if the corresponding child $n_{i}$ is not present.
When a key is inserted into a DB-tree, the level associated with it is chosen at random according to Algorithm 1, exactly as for skip lists.
C. Summary of Invariants
We now formally summarize the invariants that must hold for each node $n$ of a DB-tree $T$; a schematic in-memory representation of a node, reflecting these invariants, is sketched right after the list.
Invariant 1: if $n$ is the root of $T$, then $n.\mathrm{min}=-\infty$, $n.\mathrm{max}=+\infty$, and $n.\mathrm{level}=+\infty$. If $T$ is empty, $n.m=\bot$, $n.\mathrm{aseq}$ is an empty sequence, and $n$ has no children. If $T$ is not empty, $n.m=0$, $n.\mathrm{aseq}=a_{0}$, and $n$ has one child.
Invariant 2: if $n$ is not the root, then $n.m>0$ and $n.\mathrm{min} < k_{1} < k_{2} < \dots < k_{m} < n.\mathrm{max}$.
Invariant 3: if $n_{i}$ (with $0\leq i \leq m$) is present, then $n_{i}.\mathrm{level} < n.\mathrm{level}$, $n_{i}.\mathrm{min}= \begin{cases} n.\mathrm{min} & \text{if } i=0\\ k_{i} & \text{otherwise} \end{cases}$ and $n_{i}.\mathrm{max}= \begin{cases} n.\mathrm{max} & \text{if } i=m\\ k_{i+1} & \text{otherwise} \end{cases}$.
Invariant 4: for all $i=0, \dots, m$, $a_{i}$ is present iff $n_{i}$ is present, and $a_{i}= f(n_{i})$.
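For concreteness, the following Kotlin sketch shows how the overlay logic might represent a node in main memory; field names mirror the invariants above, while the encoding of the aggregate sequence in the overlay table is left open (all names are ours).

```kotlin
// In-memory view of a DB-tree node. For the root, min and max are -infinity and
// +infinity; here null encodes the two unbounded extremes.
data class DBTreeNode<K : Comparable<K>, V, A>(
    val level: Int,                 // n.level
    val min: K?,                    // n.min (null = -infinity)
    val max: K?,                    // n.max (null = +infinity)
    val keys: List<K>,              // k_1 < k_2 < ... < k_m
    val values: List<V>,            // value associated with each key
    val childAggregates: List<A?>   // a_0 ... a_m; null where the child n_i is absent
) {
    init {
        require(values.size == keys.size)
        require(childAggregates.size == keys.size + 1) {
            "a node with m keys has m+1 (possibly absent) child aggregates"
        }
    }
}
```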
D. Fundamental Properties
In this section, we introduce some fundamental properties that are important for proving the efficiency of the algorithms described in Section V.
The following property is a direct consequence of Invariants 2 and 3.
Property 1 (Range Monotonicity):
Given a node $n$ and any descendant $n'$ of $n$, the range of $n'$ is contained in the range of $n$; hence, all keys stored in the subtree rooted at $n$ fall within the range of $n$.
Now we analyze the relationship among nodes and between nodes and keys. Let $T$ be a DB-tree and let $k$ be a key stored in $T$.
Property 2 (Unique Containment):
A key $k$ stored in a DB-tree is contained in exactly one node.
Property 3 (Lowest Level):
The node that contains a key $k$ is the one with the lowest level among the nodes whose range contains $k$.
Concerning space occupancy, we note that in a DB-tree, a node contains one or more key-value pairs and each of them is contained in at most one node. This means that nodes are at most as many as the key-value pairs stored in the DB-tree. Hence, the space occupancy for a DB-tree is $O(n)$, where $n$ is the number of stored key-value pairs.
To reason about the size of the data that are transferred between the overlay logic and the DBMS, it is useful to state the following property.
Property 4 (Expected Node Size):
The expected number of keys that are contained in each node of a DB-tree is the same for all nodes.
This property can be derived from a consideration on the homogeneity of the levels. The number of keys contained in a node at level $l$ is the number of consecutive keys of level exactly $l$ that lie between two keys of higher level; since the level of each key is chosen independently at random, the distribution of this number does not depend on $l$.
The following lemma states a fundamental result on which the efficiency of querying a DB-tree is based.
Lemma 1 (Expected maximum level):
Given a set of $n$ keys whose levels are chosen according to Algorithm 1, the expected maximum level is $O(\log n)$.
In other words, considering a DB-tree that contains $n$ keys, the expected number of its levels, and hence its expected depth, is $O(\log n)$.
The probability that a single key reaches a given level decreases geometrically with the level; a union bound over the $n$ keys then yields the stated logarithmic bound, as sketched below.
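More precisely, the bound follows from a standard skip-list-style argument; a sketch, assuming GO-UP probability $p$ and $n$ keys, is the following:
\[
\Pr[\mathrm{level}(k) \ge l] = p^{\,l}
\;\Longrightarrow\;
\Pr\Bigl[\max_{k} \mathrm{level}(k) \ge c \log_{1/p} n\Bigr] \le n\, p^{\,c \log_{1/p} n} = n^{1-c},
\]
which vanishes for every constant $c > 1$; summing these tail probabilities shows that the expected maximum level is $O(\log n)$.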
The following lemma is relevant to understand the correctness of the query procedure that is algorithmically introduced in Section V.
Lemma 2 (Aggregate Range Query Construction):
Given a sequence of keys $k_{1} < k_{2} < \dots < k_{r}$ stored in a DB-tree $T$, the aggregation of their associated values can be computed using only (parts of) the aggregate sequences of the following nodes:
the node that contains $k_{1}$, denoted $n_{L}$,
the node that contains $k_{r}$, denoted $n_{R}$,
the common ancestor of $n_{L}$ and $n_{R}$ with minimum level, denoted $\bar n$,
the ancestors of $n_{L}$ and $n_{R}$ up to $\bar n$.
Proof:
We build a sequence
We start with
An example of construction of an aggregate range query according to the proof of Lemma 2. In this example, the query regards the sum of the values of the keys in the range from 10 to 25. The aggregation function is SUM and, for each key, its value coincides with the key itself.
Lemma 2 states that it is possible to answer an aggregate range query considering only certain parts of the aggregate sequences of a small set of nodes.
Lemma 3:
Given a sequence of keys
The above lemma is easily proven considering the construction used for proving Lemma 2. Each of the keys
Lemma 4 (Aggregate Range Query Size):
Given a DB-tree $T$ that contains $n$ keys, the expected total size of the nodes listed in Lemma 2 for an aggregate range query is $O(\log n)$.
Proof:
The proof of Lemma 2 constructs an aggregated sequence
DB-Tree Algorithms
In this section, we describe the algorithms to perform queries on a DB-tree and to modify its content. We also state their correctness and efficiency on the basis of the fundamental properties provided in Section IV-D. At the end of the section, we provide a comparison with similar data structures.
In the following, to simplify the description, we always assume
When writing the algorithms, we denote by
A. Aggregate Range Query
The procedure to compute an aggregate range query is shown in Algorithm 2. Two keys $k_{1} \leq k_{2}$ are given as input, and the result is the aggregation of the values associated with all the keys of the DB-tree that fall in the range $[k_{1}, k_{2}]$.
Algorithm 2 Algorithm for Performing an Aggregate Range Query on a DB-Tree
Two keys
for all nodes
end for
for all nodes
end for
Let
return
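As a rough illustration of the read round, the following Kotlin sketch issues the range selections of Algorithm 2 in parallel on separate connections, under the simplifying assumption that the nodes relevant to a key $k$ are those whose range contains $k$; the actual algorithm issues three selections and handles boundary cases more carefully (see the discussion preceding Theorem 2), and the in-memory combination of Lemma 2 is delegated to a hypothetical combineAggregates. Table and column names are illustrative.

```kotlin
import java.sql.Connection
import java.util.concurrent.Executors

data class RawNode(val level: Int, val min: Long, val max: Long, val aseq: String)

// Hypothetical: stitches together the relevant parts of the retrieved aggregate
// sequences (Lemma 2) and applies the finalization function h.
fun combineAggregates(left: List<RawNode>, right: List<RawNode>, k1: Long, k2: Long): Long = TODO()

fun aggregateRangeQuery(connect: () -> Connection, k1: Long, k2: Long): Long {
    val sql = "SELECT level, min, max, aseq FROM dbtree WHERE min < ? AND max > ?"
    fun pathTo(k: Long): List<RawNode> = connect().use { conn ->
        conn.prepareStatement(sql).use { st ->
            st.setLong(1, k)
            st.setLong(2, k)
            val rows = mutableListOf<RawNode>()
            st.executeQuery().use { rs ->
                while (rs.next()) {
                    rows += RawNode(rs.getInt(1), rs.getLong(2), rs.getLong(3), rs.getString(4))
                }
            }
            rows
        }
    }
    // One round: the selections are submitted at the same time.
    val pool = Executors.newFixedThreadPool(2)
    try {
        val left = pool.submit<List<RawNode>> { pathTo(k1) }
        val right = pool.submit<List<RawNode>> { pathTo(k2) }
        return combineAggregates(left.get(), right.get(), k1, k2)
    } finally {
        pool.shutdown()
    }
}
```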
Theorem 1 (Query Efficiency):
On a DB-tree containing $n$ keys, Algorithm 2 performs an aggregate range query in a single query round, and the expected amount of data retrieved from the database is $O(\log n)$.
Since Algorithm 2 processes each retrieved node a constant number of times, the execution of the part of the algorithm after data retrieval has expected time complexity $O(\log n)$.
We now focus on the correctness of Algorithm 2. Referring to the symbols introduced by the statement of Lemma 2, we consider
We show that the set of nodes
We now analyze the differences between
An example in which Algorithm 2 selects more nodes than those considered in Lemma 2. The additionally selected nodes are evidenced with fat borders. See text for details.
This means that
We have just proven that
The above considerations, together with Lemma 2, prove the following theorem about the correctness of Algorithm 2.
Theorem 2 (Query Correctness):
An aggregate range query executed by Algorithm 2 on a DB-tree $T$ returns the aggregation of the values associated with all the keys of $T$ contained in the queried range.
B. Update, Insertion and Deletion
All the algorithms to change a DB-tree $T$ follow the same three-phase structure.
P1) The nodes of $T$ that are relevant for the change are retrieved from the database, in one single round, and put into a pool of nodes stored in main memory (read round).
P2) The pool is changed to reflect the change required for $T$. Old nodes may be updated or deleted and new ones may be added.
P3) The nodes of the pool are stored into the database, in one round (update round).
In our description, we always assume the pool to be empty when the algorithm starts. A rough form of caching could be obtained by not cleaning up the pool between two operations, but we do not consider that, to simplify the description of the algorithms. It is worth mentioning that the expected size of the pool is always $O(\log n)$.
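Schematically, the three phases can be captured by a skeleton like the following Kotlin sketch (names are illustrative; the concrete pool representation and the actual statements depend on the specific operation).

```kotlin
import java.sql.Connection

// P1: read round, P2: in-memory change of the pool, P3: update round.
fun <Pool, R> performChange(
    connect: () -> Connection,
    readRound: (Connection) -> Pool,          // fetch all affected nodes in one round
    applyChange: (Pool) -> R,                 // update/delete/add nodes in main memory
    updateRound: (Connection, Pool) -> Unit   // write the pool back in one round
): R = connect().use { conn ->
    conn.autoCommit = false
    try {
        val pool = readRound(conn)
        val result = applyChange(pool)
        updateRound(conn, pool)
        conn.commit()                         // the whole change becomes visible atomically
        result
    } catch (e: Exception) {
        conn.rollback()
        throw e
    }
}
```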
A particular operation, performed in Phase P2, recurs in all algorithms. When the value of a key is changed or a key is inserted or deleted, the corresponding aggregate values of the nodes in the path to the root should be coherently updated (see Invariant 4 in Section IV-C). This is always performed as the last step of Phase P2. This procedure is shown in Algorithm 3. It simply traverses all nodes in a path to the root, computing and updating aggregate values from bottom to top. If previously missing children were added, it also restores the corresponding aggregate values.
Algorithm 3 This Algorithm Propagates the Change of Values of a Certain Node Into the Aggregate Values in All Its Ancestors. It Also Inserts Aggregate Values That Should Be Present in the Aggregate Sequence Since the Corresponding Child is Present. This Algorithm is Intended to Be Performed on a Pool of Nodes Stored in Main Memory
A sequence
Aggregate values of nodes in
Remove
for each node
Let
switch do
case
case
case
case
end for
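A compact sketch of this bottom-up propagation, reusing the node and aggregation-function sketches given earlier (Section IV), is the following; all names are ours.

```kotlin
// Aggregate of a node: combine the translated values of its keys with the aggregate
// values of its children; missing children contribute the identity element.
fun <K : Comparable<K>, V, A, C> aggregateOf(
    n: DBTreeNode<K, V, A>,
    agg: DecomposableAggregation<V, A, C>
): A {
    var acc = n.childAggregates[0] ?: agg.identity
    for (i in n.keys.indices) {
        acc = agg.f(acc, agg.g(n.values[i]))
        acc = agg.f(acc, n.childAggregates[i + 1] ?: agg.identity)
    }
    return acc
}

// Propagate along a root-ward path: path[0] is the modified node, path.last() the root;
// parentChildIndex[i] is the child slot of path[i+1] that refers to path[i].
fun <K : Comparable<K>, V, A, C> propagate(
    path: List<DBTreeNode<K, V, A>>,
    parentChildIndex: List<Int>,
    agg: DecomposableAggregation<V, A, C>
): List<DBTreeNode<K, V, A>> {
    val updated = path.toMutableList()
    for (i in 0 until updated.size - 1) {
        val childAgg = aggregateOf(updated[i], agg)
        val parent = updated[i + 1]
        val newAggs = parent.childAggregates.toMutableList().also { it[parentChildIndex[i]] = childAgg }
        updated[i + 1] = parent.copy(childAggregates = newAggs)
    }
    return updated
}
```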
1) Update
The procedure to update the value of a key already present in a DB-tree $T$ is shown in Algorithm 4.
Algorithm 4 This Algorithm Updates a Key $k$ in a DB-Tree $T$ With a New Value $v$, Assuming That $k$ is Already Present in $T$
Input: a key $k$ already present in $T$ and a new value $v$.
Output: the value associated with $k$ in $T$ is updated to $v$.
Update the value for $k$ in the node of the pool (retrieved in the read round) that contains it.
Update affected aggregate values of nodes in the pool by calling Algorithm 3.
Store all nodes in the pool into the database (update round).
2) Insertion
Algorithm 5 shows the procedure to insert a key-value pair into a DB-tree $T$, assuming that the key is not already contained in $T$.
Algorithm 5 This Algorithm Inserts a New Key-Value Pair Into a DB-Tree $T$, Supposing the Key is Not Contained in $T$
A key-value pair
if
end if
Insert
Let
for all nodes
Create nodes $n_{L}$ and $n_{R}$ with:
$n_{L}.\mathrm{level} \gets n.\mathrm{level}$, $n_{L}.\mathrm{min} \gets n.\mathrm{min}$, $n_{L}.\mathrm{max} \gets k$, $n_{L}.\mathrm{aseq} \gets$ the order-preserving subsequence of $n.\mathrm{aseq}$ containing all keys less than $k$ with all aggregate values at their left (if present);
$n_{R}.\mathrm{level} \gets n.\mathrm{level}$, $n_{R}.\mathrm{min} \gets k$, $n_{R}.\mathrm{max} \gets n.\mathrm{max}$, $n_{R}.\mathrm{aseq} \gets$ the order-preserving subsequence of $n.\mathrm{aseq}$ containing all keys greater than $k$ with all aggregate values at their right (if present).
Add
Add
end for
Update affected aggregate values on
Update affected aggregate values in
Write nodes in
Actual insertion of
Example of execution of Algorithm 5: insertion of a new key-value pair
Algorithm 6 This Algorithm Deletes a Key $k$ From a DB-Tree $T$
A key
The key
Let
Delete from
Let
for
switch do
case
Create a new node
Let
Let
case
Let
case
Let
Add
end for
Update affected aggregate values by calling Algorithm 3 on
Write nodes in
3) Deletion
Algorithm 6 shows the procedure to delete a key $k$ from a DB-tree $T$.
Example of execution of Algorithm 6: deletion of key
4) Bulk/Batch Insertion
The insertion of a large number of elements that are known in advance is more efficiently performed as a single operation. In particular, if the insertion starts from an empty DB-tree, the whole DB-tree can be built from scratch in main memory (supposing that this is large enough). This can be easily performed by applying, for each element, a procedure similar to that shown in Algorithm 5, where the sets of involved nodes are kept in main memory instead of being read from and written to the database at each insertion.
In Section VI, we show that single insertions into DB-trees are quite slow with respect to insertions into regular tables. We can adopt a bulk-like approach to speed up many insertions (a batch) into a non-empty DB-tree. In this case, we have to be careful to first read into main memory all the nodes that are affected by all the insertions. This can be done by executing Lines 1, 4, and 6 of Algorithm 5 for each element. Note that this can be done in one query round. We also have to keep track of deleted and changed nodes to correctly apply these changes at the end of the bulk insertion. We do not further detail these procedures, which are simple variations of the algorithms shown in this section.
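For the update round of a bulk or batch insertion, the batch facility of JDBC can be exploited; a sketch with illustrative table and column names follows.

```kotlin
import java.sql.Connection

data class NodeRow(val level: Int, val min: Long, val max: Long, val aseq: String)

// Write all nodes of a DB-tree built (or rebuilt) in main memory, using one batched
// statement inside a single transaction.
fun writeBulk(conn: Connection, nodes: List<NodeRow>) {
    conn.autoCommit = false
    conn.prepareStatement("INSERT INTO dbtree (level, min, max, aseq) VALUES (?, ?, ?, ?)").use { st ->
        for (n in nodes) {
            st.setInt(1, n.level)
            st.setLong(2, n.min)
            st.setLong(3, n.max)
            st.setString(4, n.aseq)
            st.addBatch()
        }
        st.executeBatch()
    }
    conn.commit()
}
```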
C. Comparison of DB-Trees With Skip Lists and B-Trees
Now, we present a detailed comparison of DB-trees with skip lists [47] and B-trees [15].
The most important distinguishing feature of DB-trees is related to the representation of the relationships between nodes. In DB-trees, we do not store any pointer in the nodes: a parent-child relationship between two nodes is implicitly encoded by their ranges and levels.
Another fundamental difference with respect to common data structures is that DB-trees are targeted to support aggregate range queries and authenticated data structures, and are not targeted to speed up search. In fact, DBMSes already perform searches very efficiently. On the contrary, DB-trees rely on DBMS search efficiency to speed up aggregate range queries. DB-trees reach this objective without relying on traversals.
In Section IV-A, we described DB-trees starting from skip lists. In fact, each skip list instance is in one-to-one correspondence with a DB-tree instance. Both data structures associate levels with keys, and the level of each key is selected in the same random way. However, while a skip list redundantly stores a tower of elements for each key, the corresponding DB-tree stores each key only once: essentially, only the top element of the tower is represented. Further, in a DB-tree, sequential keys in a level may be grouped into a single node equipped with some metadata (level, min, and max) that allows them to be efficiently retrieved from the database. In DB-trees, a node is the smallest unit of data that is selected, retrieved, or updated when the DBMS is contacted. Due to all these differences, the algorithms performing operations on DB-trees turn out to be very different from the corresponding algorithms for skip lists. Since the level associated with each key is the same in DB-trees and skip lists, statistical properties of DB-trees (see Section IV-D) also hold for skip lists (e.g., Lemma 1), or can be recast to be applicable in the skip list context (e.g., in the skip list context, Property 4 can be interpreted as related to the expected number of consecutive tower-top elements in a level).
While the statistical aspects of DB-trees are very similar to those of skip lists, their storage aspects are somewhat similar to those of B-trees. In B-trees, each node groups several keys and has a size chosen to fit the blocks of the underlying storage, with balancing obtained by splitting and merging nodes; in DB-trees, nodes also group several keys, but their composition is determined by the random choice of the levels of the keys, and their expected size is constant (Property 4).
Finally, we note that the depth of a B-tree and the number of levels of a skip list are closely related to the efficiency of the operations on the respective data structure. For DB-trees, the depth of the tree is related to the size of the data that should be handled by any operation. As described in Section V, DB-tree operations, which are locally performed by the overlay logic, take a time that is proportional to the handled data and hence turns out to be $O(\log n)$ in expectation.
Experiments
We developed a software prototype with the intent to provide experimental evidence of the performance of our approach.
The objectives of our experiments are the following.
O1) We intend to show the response times of aggregate range queries that we obtain by using DB-trees. We expect them to be ideally logarithmic in the size of the selected range. We compare them against the response time obtained by using a plain DBMS with the best available indexing.
O2) We intend to measure the response time for insert and delete operations using DB-trees and compare them against the response time of the same operations performed using a plain DBMS.
Tests were performed on a high-performance notebook with 16GB of main memory (SDRAM DDR3), an M.2 solid-state drive (SSD), and an Intel Core i7-8550U processor (1.80GHz, up to 4.0GHz, 8MB cache). To understand the behavior of our approach on a less performant hard-disk-based system, we also performed the tests on an old machine equipped with an Intel Core 2 Quad Q6600 2.4GHz, 8GB of main memory and a regular hard disk drive (HDD). In the following, the two platforms are shortly referred to as SSD and HDD, respectively. Both platforms are Linux-based and, during the tests, no other CPU-intensive or I/O-intensive tasks were active.
We repeated all experiments with two widely used DBMSes: PostgreSQL (Version 11.3) and MySQL (Version 8.0.16). The DBMSes were run within Docker containers; however, no limitation of any kind was configured on those containers.
Our software runs on the Java Virtual Machine (1.8). Since the JVM incrementally compiles the code according to the “HotSpot” approach [43], for each test we took care to let the system run long enough before performing the measurements, to be sure that compilation activity was over: we ran the whole tests more than once and considered only the times measured in the last run. The DB-tree code is written in the Kotlin programming language and adopts the standard JDBC drivers for the connection with the DBMSes. Time measurements are performed using the Kotlin standard API measureNanoTime, which executes a given block of code and returns the elapsed time in nanoseconds. In our experiments, we set the GO-UP probability for DB-trees to 0.5 (see Section IV-A).
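Schematically, a single measurement is taken as in the following sketch.

```kotlin
import kotlin.system.measureNanoTime

// Wrap one query execution and convert the elapsed time to milliseconds for the charts.
fun timeOneQueryMs(runQuery: () -> Unit): Double =
    measureNanoTime { runQuery() } / 1_000_000.0
```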
Regarding Objective O1, we prepared a dataset of 1 million key-value pairs, whose keys are the integers from 1 to 1 million and whose values are random integers. We randomly shuffled the pairs and inserted them into the DBMS. For testing the plain DBMS, we created a table with two columns, with the key of the pair as primary key of the table, and the two possible indexes on (key, value) and on (value, key). For the table that represents the DB-tree, we have columns for the node metadata (level, min, and max), on which an index is defined, and for the node content.
The insertion of the DB-trees was performed with a prototypical implementation of the bulk insertion approach described in Section V-B. The times taken for bulk insertions are reported in Table 1.
We show tests using the SUM aggregation function, which is very widely used and whose optimization is likely to be an objective of DBMS designers. In the tests based on a plain DBMS, we performed SQL queries like the following, where we used self-explanatory names.
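A representative query of this kind, with assumed table and column names (a table testtable with columns k and v), is shown below as the SQL string submitted through JDBC.

```kotlin
// Plain-DBMS aggregate range query used as a baseline; the two parameters are the
// range extremes.
val plainSumQuery = "SELECT SUM(v) FROM testtable WHERE k >= ? AND k <= ?"
```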
For the DB-trees tests, we used our realization of Algorithm 2.
Actually, we also performed the same tests with the aggregation functions MIN, MAX, COUNT, and AVG. The MIN and MAX functions are highly optimized in both DBMSes, and queries take a very small time to execute, independently of the size of the range. Hence, there is no point in using DB-trees for MIN and MAX, at least in the considered DBMSes. On the contrary, the results for the tests with COUNT and AVG are very much like those that we show for the SUM function, hence we decided not to show additional charts for them.
We generated the queries to be performed for the tests in the following way. We considered range sizes from 50,000 to 975,000, with steps of 25,000. For each range size, we executed 200 queries with random ranges of that size (i.e., with a random shift) and took the average of the execution times. The dataset and the ranges are the same for the tests using plain DBMSes and for the tests using DB-trees.
Queries are executed without any delay in between. Results are shown in Figures 8 and 9, for the SSD platform, and in Figures 10 and 11, for the HDD platform. In each figure, the x-axis shows the size of the range and the y-axis shows the query duration in milliseconds. The figures show the performance of the tests on plain DBMSes and of the tests adopting DB-trees. Using plain DBMS queries, the response time for the aggregate range query is linear, for both PostgreSQL and MySQL. The tests show that, using DB-trees, the response time is limited and, starting from a certain range size, well below the one obtained using plain DBMS queries. Concerning the shape of the curves, we note that aggregate range queries are theoretically easy in two extreme cases: (i) when the range is just a single tuple, in which case an aggregate range query is equivalent to a plain selection, and (ii) when the range covers all the data, in which case a good strategy is to keep an accumulator for the whole dataset. In all charts related to aggregate range queries, we note that the curve for the DB-tree response time is somewhat bell-shaped, reflecting that hard instances lie in the middle between the two extreme cases just mentioned.
Performances of aggregate range queries with a PostgreSQL DBMS, on the SSD platform.
Performances of aggregate range queries with a PostgreSQL DBMS, on the HDD platform.
We point out that the DBMS internally performs several non-obvious query optimizations that can profoundly change the way the same kind of query is performed on the basis of estimated amount of data to retrieve. A clear effect of this can be seen in the roughly piecewise linear trend, shown in Figures 8 and 10, for the performances of aggregate range queries in PostgreSQL for plain DBMS tests. Further, comparing charts for SSD and HDD platforms, we notice the expected degradation due to the slower HDD technology and a less regular behavior for the HDD case. However, trends are quite similar for both platforms for all cases.
Concerning space occupancy on disk, this clearly depends on the actual data types, indexes, and DBMS used. As an example, for the above-described experiments, the occupancy is summarized in Table 1. We report the occupancy of data and indexes for the plain table, i.e., the one used for the plain DBMS tests, and for the table used for the DB-tree tests. We note that the number of rows in the DB-tree table is about half the number of rows of the plain table. In fact, in our representation, each row represents a node and each node contains one or more key-value pairs. For PostgreSQL, the size of the data for the DB-tree is roughly the same as the size of the data for the plain table, while indexes are larger for the latter. For MySQL, the size of the data for the DB-tree is about twice the size of the data for the plain table, while indexes are comparable.
Concerning Objective O2, we created 10 datasets containing from 100,000 to 1 million key-value pairs, with steps of 100,000. For each dataset, we created the corresponding databases both for the plain DBMS tests and for the DB-tree tests. Tables and indexes are as for Objective O1. To test the insert response time, we performed 200 insertions of random keys that were not in the dataset. For each of them, we measured the response time and then deleted the just-inserted key to keep the dataset size constant. Operations are executed without any delay in between. Results are shown in Figures 12 and 13, for the SSD platform, and in Figures 14 and 15, for the HDD platform. In these figures, the x-axis shows the number of key-value pairs in the database and the y-axis shows the response time in milliseconds.
Performance of the insert operation of a key with a PostgreSQL DBMS, on the SSD platform.
Performance of the insert operation of a key with a MySQL DBMS, on the SSD platform.
Performance of the insert operation of a key with a PostgreSQL DBMS, on the HDD platform.
Analogously, we measured the response time for deletion. We performed 200 deletions of random keys that were present in the dataset. For each of them, we measured the response time and then re-inserted the just-deleted key-value pair to keep the dataset size constant. Results are shown in Figures 16 and 17, for the SSD platform, and in Figures 18 and 19, for the HDD platform. Axes are as in the figures for the insert tests. Operations are executed without any delay in between in all cases, except for the experiment shown in Figure 19. In fact, in that case, with no delay, we observed an anomalous ramp-up trend, which we think was due to exceeding the maximum throughput of the system. In that case, we performed the test waiting 400ms after each operation.
Performance of the delete operation of a key with a PostgreSQL DBMS, on the SSD platform.
Performance of the delete operation of a key with a MySQL DBMS, on the SSD platform.
Performance of the delete operation of a key with a PostgreSQL, on the HDD platform.
As expected, the charts show that, adopting DB-trees, insert and delete operations are more costly. For the plain DBMS tests, the response time looks essentially constant. This is likely due to the fact that changes are just written to a log before acknowledging the operation to the client, and operations are all equal in size. On the contrary, for DB-trees, we can see a slow but clearly increasing trend, which conforms to the logarithmic behavior predicted by the theory. In fact, also in this case operations are just stored in a log, but the number of actual operations requested to the DBMS is logarithmic. For our experiments, when using DB-trees, the slow-down factor for both insert and delete operations is within 2-4, for PostgreSQL, and within 10-20, for MySQL. These factors are quite independent of the kind of platform adopted (i.e., SSD vs. HDD).
About the comparison of the charts for the SSD and HDD platforms, the same remarks we made for Objective O1 apply.
Supporting Group-by Range Queries
In this section, we address a common need that is slightly more complex than an aggregate range query. We deal with the aggregation of the values of a certain column performed on the basis of the distinct values of a second column, limited to a certain range of a third one. This need is usually fulfilled using the SQL construct GROUP BY, and we refer to this kind of queries as group-by range queries. For example, suppose we have a table that represents the sales of goods performed by each department of a company for each day, and that we want to know the total sales in a specific period for each department. Suppose the table is called Sale and has columns Department, Date, and Amount. Figure 20 shows an example of a group-by range query to obtain this result, for an arbitrary range, expressed in plain SQL.
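For reference, a query of the kind shown in Figure 20 can be written as follows (the range bounds are placeholders).

```kotlin
// Group-by range query over the Sale(Department, Date, Amount) table of the example.
val groupBySalesQuery = """
    SELECT Department, SUM(Amount)
    FROM Sale
    WHERE Date >= ? AND Date <= ?
    GROUP BY Department
""".trimIndent()
```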
To exploit DB-trees for this purpose, we define the key of the DB-tree to be the pair (Department, Date), where Department is the most significant part. A possible approach to perform the query is to execute a plain query to obtain the distinct departments and then to perform one aggregate range query, as in Section V, for each of them; however, this requires a number of queries that is proportional to the number of departments.
We now show that, adopting DB-trees, we can perform the above query in one query round using only three distinct queries, as in Algorithm 2. The procedure is shown in Algorithm 7, which is a variation of Algorithm 2. We assume a setting with a regular data table and its associated DB-tree, built on the key defined above.
Algorithm 7 This Algorithm Performs a Group-by Range Query on a DB-Tree and its Associated Regular Data Table
A regular data table
A set of pairs in the form
Let
For all
let
let
let
Let
for all triples
Execute Algorithm 2 starting from Line 4 with
Let
Add
end for
return
The first objective of Algorithm 7 is to obtain from
Figure 21 shows an example of DB-Tree on which a group-by range query is performed. Squares of distinct colors correspond to keys with distinct values of
An example of DB-tree evidencing sets
It is useful to make some remarks on Lines 1-4. Line 1 introduces a further condition with respect to what we find in Algorithm 2. In fact, by requiring only
A case handled by Algorithm 7 in a special way. Node
About Line 4, performing this operation with one SQL query is not obvious. In Figure 22, we show an example of this query using the PostgreSQL dialect. The
The correctness of this algorithm derives from the fact that, by construction, ranges for each
A. Experiments With DB-Trees for Group-by Range Queries
We performed some experiments to assess the performance of Algorithm 7 on realistic data. In Section VI, we noted that DB-trees are more advantageous for aggregate range queries with large ranges. As we show in the following, this is also true for group-by range queries. The dataset we used for our experiments is derived from the TPC-H benchmark [17]. From that dataset, we considered the
Optimized version of the query shown in Figure 22 that was adopted for testing group-by range queries. See text for further details.
Performances of the execution of group-by range queries on PostgreSQL (SSD platform), on the modified TPC-H dataset.
B. Comparison With Materialized Views
We ran the same tests for group-by range queries using materialized views, to understand how this common technique compares with DB-trees. We performed our experiments using PostgreSQL. For the same dataset adopted in the above experiments, we created materialized views at different granularities: month, day, hour, and minute. We executed the same group-by range queries described above on these materialized views and we measured the execution times. We picked ranges giving a number of elements per group between 50,000 and 300,000, with steps of 50,000, and 200 random queries, as above, for each step. Figure 26 shows the execution times taken by the queries performed on the materialized views vs. the execution times of the same queries performed using DB-trees. For materialized views, the execution time increases linearly with the number of elements contained in each group, while it decreases for DB-trees. However, DB-trees perform better only for the minute granularity, in our tests. Since the query is performed on aggregated data, the precision of the result of the queries performed on materialized views depends on the granularity of the view. It is possible to obtain precise results by independently querying the extremes of the range, but we did not perform any experiment regarding this aspect. In fact, we think that the above results already show the great potential of materialized views. Table 2 shows the time taken to refresh the whole materialized views: it is between 5 and 12 seconds for our dataset. The number of rows of the views is also reported. We note that materializing with minute granularity provides very little benefit, since the original table contains 6,001,215 rows.
Performances of group-by range queries. Materialized views vs. DB-Trees on PostgreSQL (SSD platform), on the modified TPC-H dataset.
From the trend shown by our experiments, DB-trees may turn out to be a better approach than materialized views when the number of elements in the groups is very large and this is not compensated by the coarseness of the granularity. This may occur in practice, since the granularities of materialized views are statically decided in advance, while queries might be on ranges whose size can vary across several orders of magnitude. For example, this may occur in graphical systems if the user is allowed to zoom in and out on a timeline and a corresponding histogram for the current zoom level should be shown. DB-trees have a substantial overhead, but they are adaptive, in the sense that no decision in advance is needed about the order of magnitude of the ranges to be queried. Further, we recall that DB-trees may be the only viable solution in situations in which the DBMS does not natively support the needed aggregation function.
C. Comparison With VerdictDB
We now compare our DB-tree approach for the execution of group-by range queries with the solution provided by VerdictDB [45]. VerdictDB is a very interesting tool that practically realizes an approximate query processing technique. It allows the user to perform group-by range queries choosing a trade-off between speed and precision of the result. It relies on a regular relational DBMS to store data. The user can request the creation of a “sampled” version of a table, called a scramble, with a certain size ratio. Higher size ratios provide better precision at the expense of longer execution times. We performed our experiments using PostgreSQL (SSD platform) as the underlying DBMS. For the same dataset adopted in the above experiments (modified TPC-H), we created several scrambles with different size ratios: 10%, 50%, 90%, and 100%. We executed the same group-by range queries described above on all scrambles. We again picked ranges giving a number of elements per group between 50,000 and 300,000, with steps of 50,000, running 200 different random queries in each step, as in Sections VII-A and VII-B.
VerdictDB is very efficient when the size ratio is low. In any case, execution times increase linearly with the number of elements contained in each group, as for the execution using the plain DBMS approach. In Figure 27, we show the time taken by VerdictDB to process the group-by range query for increasing range sizes using the four size ratios mentioned above. For a clear comparison, we also report the time taken by our approach based on DB-trees. The comparison is favorable to our approach when the number of elements in the groups is large and when the size ratio is not too low. Since a low size ratio negatively impacts the precision of the results produced by VerdictDB, we also measured errors. In Table 3, we report, for each point of the chart in Figure 27 that is related to VerdictDB results, the maximum error we obtained, relative to the correct answer. Each shown percentage is the maximum of the relative errors across all the queries (200) that we ran for each step. These results are specific to our case and are only useful for the purpose of comparison with the DB-tree approach in this specific test. A broad description of VerdictDB with respect to precision can be found in [45].
Performance of VerdictDB vs. group-by range queries using DB-trees on PostgreSQL (SSD platform), on the modified TPC-H dataset.
In our experiments, the time taken by VerdictDB to generate the scramble from scratch is always about 6 seconds.
The most evident difference between the DB-tree approach and VerdictDB is that DB-trees always produce an exact result, while results produced by VerdictDB may have non-negligible errors. How much precision can be traded for speed depends on the application. In any case, when the number of elements in the groups is large, the DB-tree approach outperforms VerdictDB even for moderately small size ratios. For example, in our case, when the number of elements in each group is 250,000 or more, DB-trees provide exact results and still perform better than VerdictDB used with a size ratio of 50%, which yields a maximum error of about 1.5% in our tests.
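The maximum relative errors of Table 3 are computed per step over the 200 random queries, as in the following sketch; exact_answer and verdict_answer are hypothetical helpers that stand for the exact query on the base table and the approximate query on a scramble, respectively.

```python
# Sketch of how the per-step maximum relative error of Table 3 can be computed:
# for each of the 200 random ranges in a step, compare the approximate per-group
# sums with the exact ones. exact_answer() and verdict_answer() are hypothetical
# helpers returning {group: aggregate_value} dictionaries.
def max_relative_error(random_ranges, exact_answer, verdict_answer):
    worst = 0.0
    for (lo, hi) in random_ranges:
        exact = exact_answer(lo, hi)       # {group: exact SUM}
        approx = verdict_answer(lo, hi)    # {group: approximate SUM}
        for group, true_value in exact.items():
            if true_value == 0:
                continue  # relative error undefined for empty groups
            err = abs(approx.get(group, 0.0) - true_value) / abs(true_value)
            worst = max(worst, err)
    return worst
```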
Supporting Authenticated Data Structures
In this section, we show how it is possible to use DB-trees to support Authenticated Data Structures (ADS). We briefly introduce ADSes with their properties and show a typical use case. Then, we show how we can use DB-trees as a persistent and efficient ADS.
An Authenticated Data Structure (ADS) is an ordered container of elements that deterministically provides a constant-size digest of its content with the same properties as a cryptographic hash. We call this digest the root-hash of the ADS.
A simple ADS is the Merkle Hash Tree [39] (MHT). An example of MHT is shown in Figure 28. In our example, the MHT is a binary search tree in which every leaf is associated with a key-value pair and stores its hash, while each internal node stores the hash of the concatenation of the hashes of its children; the hash stored in the root is the root-hash of the MHT.
An example of Merkle Hash Tree with four leaves and a binary structure. The elements involved in the proof of one of the leaves are highlighted.
A typical application of MHTs is to allow a client with a limited amount of resources to outsource the storage of a large amount of data to an untrusted server. The client keeps only a trusted copy of the root-hash, while the server keeps the MHT. The server provides a proof for each query performed by the client. A proof for a leaf is obtained by considering the path from the leaf to the root and, for each node of the path, putting the hash of its sibling into the proof (see Figure 28 for an example). This allows the receiver of the proof to compute the root-hash from the content of the leaf, i.e., from the result of the query. The client can compare the resulting root-hash against its trusted copy to verify the authenticity of the reply. The length of a proof is logarithmic in the number of leaves.
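The following minimal sketch illustrates, for a binary MHT like that of Figure 28, how proofs are produced by the server and checked by the client; it illustrates the standard mechanism and is not the code used in our experiments.

```python
# Minimal sketch of Merkle Hash Tree proofs for a binary MHT (Figure 28).
# Leaves store key-value pairs; each internal node hashes its children's hashes.
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf_hash(key: str, value: str) -> bytes:
    return H(b"leaf|" + key.encode() + b"|" + value.encode())

def node_hash(left: bytes, right: bytes) -> bytes:
    return H(b"node|" + left + right)

def root_hash(leaves):
    """Root-hash of a list of (key, value) pairs (length a power of two)."""
    level = [leaf_hash(k, v) for k, v in leaves]
    while len(level) > 1:
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def prove(leaves, index):
    """Proof for leaves[index]: at each level, whether the current node is a
    right child and the hash of its sibling, bottom-up."""
    level = [leaf_hash(k, v) for k, v in leaves]
    proof = []
    while len(level) > 1:
        proof.append((index % 2, level[index ^ 1]))   # (is_right_child, sibling hash)
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(key, value, proof, trusted_root):
    """Recompute the root-hash from the leaf content and the proof, and compare."""
    h = leaf_hash(key, value)
    for is_right_child, sibling in proof:
        h = node_hash(sibling, h) if is_right_child else node_hash(h, sibling)
    return h == trusted_root

# Example: the client keeps only the root-hash; the server returns (key, value, proof).
leaves = [("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")]
trusted = root_hash(leaves)
assert verify("k3", "v3", prove(leaves, 2), trusted)
```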
Suppose that the client intends to change the value associated with a certain key. It first asks the server for the proof of the corresponding leaf and verifies it against its trusted root-hash. Then, it applies the change to the leaf and recomputes the hashes along the path from that leaf to the root, obtaining the new root-hash, which it stores as the new trusted copy; the server applies the same change to its MHT.
A DB-tree can be adapted to serve as a persistent and efficient ADS. In this case, we call it an authenticated DB-tree. From the point of view of the algorithms, we use a cryptographic hash function in the role of the aggregation function, so that each node stores a hash that covers the data contained in its subtree.
In an authenticated DB-tree, the hash of a node is computed from the key-value pairs stored in the node and from the hashes of its children, analogously to what is done for the internal nodes of an MHT. The hash of the root plays the role of the root-hash.
The client checks that the response to an authenticated query operation is genuine by the following procedure. It computes, bottom-up, the hashes of the nodes returned by the server, up to the hash of the root, and compares the result with its trusted root-hash; if they match, the response is accepted.
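The following sketch illustrates one possible way to implement the hashing and verification just described for multi-way nodes; the node layout (a list of key-value pairs plus one child hash per subtree) is a simplification of the actual DB-tree representation.

```python
# Sketch of the hashing scheme of an authenticated DB-tree node (simplified
# layout): a node stores key-value pairs interleaved with the hashes of its
# children, and its hash covers all of them in order.
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

EMPTY = H(b"empty")  # conventional hash used for missing children

def db_tree_node_hash(pairs, child_hashes):
    """pairs: [(k_1, v_1), ..., (k_m, v_m)]; child_hashes: m+1 hashes (EMPTY if absent)."""
    h = hashlib.sha256()
    for (k, v), child in zip(pairs, child_hashes):
        h.update(child)
        h.update(str(k).encode() + b"|" + str(v).encode())
    h.update(child_hashes[-1])           # rightmost child
    return h.digest()

def recompute_root_hash(path):
    """Recompute the root-hash from a leaf-to-root path returned by the server.

    path: list of (pairs, child_hashes, child_index) entries, from the deepest
    node up to the root; child_index is the position, in the node, of the child
    that lies on the path. The client compares the result with its trusted
    root-hash to accept or reject the response.
    """
    current = None
    for pairs, child_hashes, child_index in path:
        child_hashes = list(child_hashes)
        if current is not None:
            child_hashes[child_index] = current   # plug in the hash computed so far
        current = db_tree_node_hash(pairs, child_hashes)
    return current
```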
In an authenticated DB-tree, the algorithms that perform update, insertion, and deletion of a key-value pair are very similar to those of the corresponding operations for a regular DB-tree. We should take care of the following aspects. When the server provides a set of nodes, they should always be treated as proofs and checked against the trusted root-hash of the client before proceeding. This is always possible, since all DB-tree-changing algorithms deal with paths up to the root of the DB-tree. After the change, hashes, including the root-hash, are updated by the execution of Algorithm 3. After executing any changing algorithm, the client should locally store the new root-hash as the trusted root-hash to be used in the following operations.
Concerning security, we assume a threat model in which the DBMS can perform any tampering on the DB-tree with the purpose of changing the key-value pairs it contains. We do not consider other kinds of attacks. The security of an authenticated DB-tree is its ability to always detect any misbehavior of the DBMS, i.e., we would like to rule out any false negative. As for Merkle Hash Trees, security is a direct consequence of the inability of the untrusted DBMS to find a collision on the adopted cryptographic hash function. In the context of ADSes, correctness means that any misbehavior detected by proof checks is a real DBMS misbehavior, i.e., we would like to rule out any false positive. Correctness of authenticated DB-trees derives directly from the fact that all DB-tree-changing operations preserve the invariants, which includes correctly computing the hashes of all nodes, a task that is the responsibility of Algorithm 3.
Concerning the performance of authenticated DB-trees in practice, we note that all their operations interact with the DBMS using exactly the same queries as plain DB-trees. Only the local processing performed by the overlay-logic is slightly different. In all experiments of Section VI, the time spent for the overlay-logic processing is negligible with respect to the time spent to execute the queries. For authenticated DB-trees, the overlay-logic may additionally need to check proofs, performing all the needed cryptographic hashes. This turns out to be negligible too, as we show in the following. For these reasons, in practice, authenticated DB-trees perform like regular DB-trees. In particular, for insertion and deletion, experimental results are essentially the same as those shown in Section VI, hence they are not reported here. An authenticated query for a key only requires, in addition, the computation of the hashes of the logarithmic number of nodes contained in the proof, whose cost is negligible with respect to the time taken by the queries to the DBMS.
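As a rough illustration of why proof checking is negligible (the figures produced by this snippet are indicative only and are not among the measurements of Section VI), recomputing the hashes of a proof of a few tens of nodes takes on the order of microseconds even in an interpreted language, far below the cost of a client/server round trip; the proof depth and node size below are assumptions for the sake of the example.

```python
# Rough illustration: hashing the ~log n nodes of a proof costs microseconds,
# far below the cost of a client/server round trip to the DBMS.
import hashlib, os, time

DEPTH = 30          # assumed proof length (e.g., ~10^9 elements, binary structure)
NODE_SIZE = 256     # assumed serialized size of a node, in bytes

nodes = [os.urandom(NODE_SIZE) for _ in range(DEPTH)]

start = time.perf_counter()
h = hashlib.sha256(b"leaf").digest()
for n in nodes:
    h = hashlib.sha256(h + n).digest()   # chain the hashes bottom-up
elapsed = time.perf_counter() - start
print(f"checked a {DEPTH}-node proof in {elapsed * 1e6:.1f} microseconds")
```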
Discussion of the Simplifying Assumptions
In Section III-C, we presented general architectural problems related to overlay indexes and introduced some simplifying assumptions. The objective of this section is to discuss these assumptions, also considering the specific DB-tree approach, and to show whether generalizations are simple or require further scientific investigation.
A. Middle Layer
In Section III-C, we mentioned the possibility of developing a query-rewriting middle layer to support overlay indexes. Query rewriting is a classical topic in database research (see, for example, [2], [23], [26], [28], [45]), but practical, widely used realizations are rare. Concerning the development of a middle layer targeted at DB-trees, we identify some challenges. Firstly, query rewriting means dealing with a complex SQL syntax and its many proprietary variations. We think that the scientific interest of this aspect is marginal, but the development effort is large. Further, a middle layer may address only the exact kinds of queries shown in this paper (passing all other queries to the DBMS unchanged) or may try to support the optimization of more complex aggregate (group-by) range queries, for example comprising equality selections, two-dimensional range selections, joins, etc. Some of these objectives are direct extensions of what is described in this paper, while others may require further scientific investigation. For example, an aggregate range query that combines a range selection with additional selection predicates or joins is not directly answered by a single DB-tree as described in this paper, and rewriting it requires deciding which DB-trees to maintain and how to combine their results with the rest of the query.
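To make the rewriting idea concrete, the following is a minimal sketch of a middle layer that intercepts only the simplest query shape considered in this paper (a SUM over a single key range) and dispatches it to a DB-tree, passing everything else to the DBMS unchanged; db_tree.range_sum and dbms.execute are assumed interfaces, and a real implementation would need a full SQL parser rather than a regular expression.

```python
# Minimal sketch of a query-rewriting middle layer for the simplest query shape
# considered in this paper. The range_sum() and execute() interfaces are
# hypothetical; real SQL rewriting needs a proper parser, not a regex.
import re

RANGE_SUM = re.compile(
    r"^\s*SELECT\s+SUM\((?P<val>\w+)\)\s+FROM\s+(?P<table>\w+)\s+"
    r"WHERE\s+(?P<key>\w+)\s+BETWEEN\s+(?P<lo>\S+)\s+AND\s+(?P<hi>\S+?)\s*;?\s*$",
    re.IGNORECASE,
)

def execute(sql, dbms, db_trees):
    """Route a query: use a DB-tree if one indexes (table, key, val), else pass through."""
    m = RANGE_SUM.match(sql)
    if m:
        tree = db_trees.get((m["table"], m["key"], m["val"]))
        if tree is not None:
            return tree.range_sum(m["lo"], m["hi"])   # overlay-index query
    return dbms.execute(sql)                          # all other queries: unchanged
```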
B. Transactions, Multiple Clients, Fault Tolerance
In this paper, for the sake of simplicity, we deliberately avoided considering the use of overlay indexes and DB-trees in contexts where transactions are needed. Typical situations in which this occurs are when multiple clients access the same DB-tree and when faults of the DBMS may occur. In principle, nothing prevents executing the algorithms presented in this paper within transactions (provided the adopted DBMS supports them). Algorithms 4, 5, and 6 follow the scheme shown at the beginning of Section V-B: they wait for the reply of the read round, compute the changes, and perform the update round. However, the best practice is to avoid executing schemes like this within a transaction. In fact, a communication problem with the DBMS may keep the transaction open (and the involved tables locked) until a timeout expires. A possible workaround is to move the overlay-logic within the DBMS by using stored procedures (if supported), or to move the overlay-logic so close to the DBMS that network faults cannot occur independently (for example, running it on the same machine).
The above considerations also apply to the case in which a DB-tree is associated with a regular table. In this case, Algorithms 4, 5, and 6 should be executed in a transaction together with the change of the regular table.
When changing operations are rare and the DB-tree is not associated with a regular table, optimistic approaches may be adopted. For example, we could perform the read round of a changing operation within a transaction and perform the update round of that operation in a distinct transaction. In the second transaction, before applying the changes, it should be checked, preferably directly within the DBMS, that no change was applied to the DB-tree in the meantime (e.g., by another client). If the check fails, the transaction should be aborted and the whole changing operation should be re-run, starting from the read round. An interesting way to perform this check is to use the same DB-tree to realize an authenticated data structure (see Section VIII): the root-hash can be checked to verify whether any change to the DB-tree occurred since the last read. It is possible to support an aggregation function and a cryptographic hash within the same DB-tree, as explained in the following.
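The following sketch illustrates this optimistic scheme with psycopg2; the dbtree table, its is_root and hash columns, and the read_round/compute_changes helpers are hypothetical placeholders for the actual DB-tree representation and overlay-logic.

```python
# Sketch of the optimistic scheme for rare changing operations: the read round
# runs in one transaction, the update round in a second one that is aborted if
# the DB-tree changed in the meantime (detected here via the root-hash).
import psycopg2   # conn below is a psycopg2 connection, e.g., psycopg2.connect(...)

def current_root_hash(cur):
    cur.execute("SELECT hash FROM dbtree WHERE is_root")
    return cur.fetchone()[0]

def optimistic_update(conn, key, value, read_round, compute_changes):
    with conn:                                    # transaction 1: read round
        with conn.cursor() as cur:
            root_before = current_root_hash(cur)
            nodes = read_round(cur, key)          # the selections of Algorithms 4-6

    updates = compute_changes(nodes, key, value)  # overlay-logic, no open transaction

    with conn:                                    # transaction 2: update round
        with conn.cursor() as cur:
            if current_root_hash(cur) != root_before:
                # another client changed the DB-tree: abort, retry from the read round
                raise RuntimeError("concurrent change detected, retry")
            for sql, params in updates:
                cur.execute(sql, params)          # committed atomically on exit
```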
C. Multiple Aggregation Functions
There are cases where more than one aggregation function is needed on the same data, for example an aggregation function for group-by range queries together with a cryptographic hash to realize an authenticated DB-tree (Section VIII). A single DB-tree can support this by storing, in each node, one aggregate value for each function and by updating all of them whenever the DB-tree changes. Queries and changing operations work as described in the previous sections; the only cost is the larger amount of data stored in each node and transferred for each operation.
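The following sketch illustrates the component-wise combination of aggregate values, here for SUM, MAX, and a cryptographic hash; representing each aggregation function as an (identity, combination) pair is an illustrative choice, not a prescribed interface.

```python
# Sketch of supporting several aggregation functions in one DB-tree: each node
# stores a tuple with one aggregate value per function, combined component-wise.
import hashlib

AGGREGATIONS = [
    # (identity element, binary combination)
    (0,    lambda a, b: a + b),                               # SUM
    (None, lambda a, b: b if a is None else max(a, b)),       # MAX
    (b"",  lambda a, b: hashlib.sha256(a + b).digest()),      # hash chaining
]

def identity():
    return tuple(e for e, _ in AGGREGATIONS)

def combine(left, right):
    """Component-wise combination of two aggregate tuples."""
    return tuple(f(a, b) for (_, f), a, b in zip(AGGREGATIONS, left, right))

def value_tuple(v):
    """Aggregate tuple of a single value v (an int here, for simplicity)."""
    return (v, v, hashlib.sha256(str(v).encode()).digest())

# Example: the aggregate tuple that a parent would store for three values.
acc = identity()
for v in (5, 2, 9):
    acc = combine(acc, value_tuple(v))
# acc[0] == 16 (SUM), acc[1] == 9 (MAX), acc[2] is a digest binding the sequence
```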
D. Caching, Consistency, Batch Insertion
In Section III-C, we mentioned the possibility of implementing caching in the overlay-logic. We also mentioned that consistency problems may arise in the case of multiple clients. This occurs when the DB-tree is changed by Algorithms 4, 5, and 6. The most obvious workaround is to broadcast all changes to all clients, so that they can update (or invalidate) their caches. This approach is very demanding in terms of networking and CPU resources, especially if changes and clients are numerous. We may obtain most of the benefits of caching by moving the overlay-logic close to the DBMS or by using stored procedures: in this case, the overlay-logic can quickly access the DBMS, which we suppose to have its own cache. Finally, we note that the notable case of batch insertion performed by one client, described in Section V-B, is essentially a special case of caching.
Conclusion and Future Work
We showed how it is possible to support aggregate range queries efficiently on conventional databases that do not implement specific optimizations. To do that, we introduced the DB-tree: a new data structure that realizes a specific kind of index represented at the database level (an overlay index) and accessed by performing regular queries to the DBMS. It can also be customized in many ways to support a wide class of aggregation functions, authenticated data structures, and group-by range queries. DB-trees can be queried by downloading only a logarithmic amount of data with a constant number of range selections, and they can be changed by involving a logarithmic amount of data in at most two rounds.
Regarding future research directions, from the theoretical point of view, it would be interesting to investigate a bi/multi-dimensional generalization of DB-trees to speed up aggregate range queries in GIS systems. This generalization may be inspired by the work of Eppstein et al. [21] on skip quadtrees.
From the practical point of view, it would be desirable to have a framework that streamlines the use of DB-trees in practical contexts, as VerdictDB [45] does for approximate query processing. Further, experiments in a big-data context may be useful to better assess the spectrum of applicability of DB-trees.