In this paper we introduce an abstraction called a synchronizable database, which is best viewed as an extension of traditional database functionality. Databases with this new functionality can then become building blocks of larger distributed web applications.
We give an architecture for a synchronizable database that divides the problem into two levels. The lower level provides traditional database functionality in addition to a summarization function that operates on a specified range of keys. The summary of a range of keys is defined as a digest(hash) of all records in the specified range. The top level implements a synchronization protocol, calling on the lower level to perform traditional database functions as well as to provide range summaries.
Distributed databases are already users of synchronization services [PUMA], and synchronization of a complete modern relational database system can be quite complex (see [ORA]). The motivation for synchronization is clear: commercial distributed databases are often interrelated and require regular content-aware updates [AWE] to ensure their consistency with each other; database updates can trigger important actions. For instance, in a chain of retailers backed by a central warehouse, a purchase in a branch causes update of the local database, which must be propagated to the central one, potentially causing restocking of the branch or even the warehouse itself. Large volumes of transactions create large databases, which generate excessive traffic when synchronized on a daily basis. Another web activity in need of better synchronization methods is web-site replication and mirroring [WWW8]. Publishing, distribution, and accessibility of web sites is often achieved by extensive replication of the sites, when files and whole subdirectories are transferred across the network recursively until an up-to-date image of the local site is copied to the remote node. The problem is further complicated as sites increasingly rely on databases to drive them. Simply mirroring these databases, which can be quite large but change slowly, is clearly not the best solution. Our work aims to provide a new building block to deal with these problems and others.
Traditional synchronization tools, e.g. Unix diff and patch, are file-based, requiring both files to be present locally. A network application built using these must transfer an entire file before updating another, e.g. via rcp. An example of file-based synchronization system is CVS [CVS], often used for updating a web site from a local development tree. [FILE] describes a formal semantics for file synchronizers. File-based synchronization proves too coarse when a small change requires resending a whole database file. A better synchronization architecture allows for minimal amount of data to be identified and sent over the network when synchronizing databases. Rsync [RSYNC] is a replacement for rcp which breaks files into pieces, computes their hashes, and transfers only the hashes first. Then only the pieces of data which are absent on the other side (as determined by comparing the hashes), are sent in full. One can even replace rcp transport in CVS by rsync. Since the summary of the changes is obtained by partitioning and hashing in rsync, it has to be recomputed every time. Moreover, the hashes thus computed are not reusable for other applications. We came to see data summary as a general service, provided to any applications using the database. We introduce the concept of a summarizable database, which can summarize subsets of its data for general use in applications managing change in a local or distributed setting (see Section 3, Applications below).
For our purposes a database is defined to be a collection of (key, value) pairs, along with certain traditional functionality and some form of transaction support. Our two-layer architecture starts by enhancing (key, value) databases by adding summary capabilities to foster synchronization, and further includes a protocol that uses these enhanced databases to solve the complete synchronization problem. In the following sections we describe the summarizable database (the lower layer) and synchronization facility (the upper layer) in more detail, briefly report on a prototype implementation, and overview the promising applications of synchronizable databases.
The major operations supported by a synchronizable database as an abstract data type are insertion, deletion, retrieval, and range synchronization. The first three are standard operations on databases with records. The last one, unique to synchronizable databases, is specified below.
The input for range synchronization is an interval I of K (remember that K is the totally-ordered universe of keys) and two databases D_1 and D_2. The range synchronization operation basically tries to make the restrictions of D_1 and D_2 to I identical. In particular, it identifies three sets of keys, which we call the discrepancy sets, K_1, K_2 and K_12. These three sets are all subsets of the key interval I. Discrepancy set K_1 is the set of keys in D_1 which are not in D_2, K_2 is the set of keys in D_2 which are not in D_1, and K_12 is the set of keys present in both D_1 and D_2 but whose corresponding records in the two databases differ in the value field. The three discrepancy sets can be processed by different handler functions. Typically, the handler functions for K_1 and K_2 would copy the missing records from one database to the other. A typical handler function for K_12 would, for each key in the discrepancy set, compare the records in D_1 and D_2 that have the key and replace one of them with the other, e.g. the older one by the more recent one.
The top layer consists of a synchronization facility, which makes use of the summarizing capabilities of the layer below it to implement the range synchronization operation. The synchronization facility is optimized for the case when the databases being synchronized are located on different processors connected via a limited bandwidth link. Thus, one of the goals here is to minimize the network traffic generated by the range synchronization operation. A naive implementation of the facility would just bring one of the databases to the other side and do a comparison to identify the discrepancies. But this is found to be both inefficient and unnecessary. In our design, only summaries of portions of one database are transferred across the network in order to identify the discrepancies. Once the discrepancies are identified, the handler functions can choose to transfer only those records which are need to be transferred to make the databases synchronized. Thus, unnecessary transfer and processing of large amounts of data is prevented.
Get_All_Summaries: The input is an interval I of K. The output is a list of pairs of the form (key, summary). The list has one pair for each record in the database whose key field belongs to I. The first element in the pair is the key field of the record, and the second element is a fixed size summary of the record. If the database has no record with key field belonging to I, an empty list is returned. (The summary is supposed to include a cryptographically strong hash of the record so that the probability of two different records yielding the same summary is extremely small. This is important because the synchronization facility, as described below, would conclude that two records are equal if their summaries are equal.)
Get_Interval_Summaries: The input is an interval I of K and a positive integer H. The function partitions I into at most H disjoint sub-intervals and returns a list of triplets of the form (key_interval, num_records, summary). The list has one triplet for each sub-interval. The first element of the triplet is the sub-interval itself; the second and third elements are, respectively, the number and a fixed size summary of the set of records in the database whose key fields belong to the sub-interval. Whether the database has any records with key field belonging to I or not, in either case the list returned is non-empty and the sub-intervals in the list form a disjoint partition of I. (Again, the summary should be computed in such a way that the event of two different sets of records yielding the same summary is extremely unlikely.) It turns out that it helps the synchronization facility to have the sub-intervals such that the database has about equal amount of data in each of the sub-intervals. In particular, it improves the efficiency of the synchronization algorithm by reducing the number of communication rounds. This point will become clearer after the description of the synchronization algorithm in the next subsection.
The synchronization algorithm starts by asking both databases to compute a single summary of all records lying in the given key interval. The Get_Interval_Summaries function is invoked for this. The remote summary is transferred to the local site and compared with the local summary. If the summaries match, it is concluded that the databases are already synchronized in the given key interval. Otherwise the size of the remote database restricted to the given key interval is checked (remember that the triplets returned by the Get_Interval_Summaries functions also has a num_records field). If the remote database only has a small number of records lying in the key interval, then summaries for all those individual records are transferred from the remote to the local site (the Get_All_Summaries function is invoked for this), and a record-by-record comparison is made to identify discrepancies. Otherwise the remote database is asked (by calling the Get_Interval_Summaries function) to partition the key interval into smaller sub-intervals and send summaries for each of the sub-intervals. These remote summaries are then compared against corresponding local summaries and the synchronization operation is invoked recursively for those sub-intervals whose summaries do not match.
Given the assumption that the Get_Interval_Summaries function partitions the input key range into sub-intervals containing almost equal amount of data, the synchronization operation takes at most log n communication rounds to identify each discrepancy, where n is the total size of the two databases restricted to the given key range. Thus, the total number is communication rounds is O(t log n), where t is the combined size of the three discrepancy sets. As we describe below, all operations for summarizable databases can be supported with O(log n) worst case time complexity. Thus, the overall computational complexity for the synchronization operation is O(t log2 n).
In practice however, the number of rounds taken and the network traffic generated depend on several factors, including how the discrepancies are distributed across the entire key range and how the parameters in the algorithm are chosen. There are two main parameters in the synchronization algorithm which need to be carefully chosen. The first one determines when to invoke the Get_All_Summaries operation instead of further partitioning with the Get_Interval_Summaries operation. The second parameter determines the number of partitions obtained through calls to Get_Interval_Summaries. The choice of these parameters to a very large extent determines the actual network traffic generated and the number of communication rounds taken. For example, if the algorithm decides to invoke the Get_All_Summaries function on key intervals for which the remote database has huge number of records, then the number of communication rounds would be small but every round would generate heavy network traffic. Similarly, if the Get_All_Summaries function is invoked only on very small key intervals, then the number of rounds will be large.
The OSYNCH module (for Object Synchronization) implements the synchronization algorithm described in section 2.4. The BXTREE (for Extended B+-Tree) and BEDROCK modules constitute a summarizable database. When invoked for range synchronization, OSYNCH communicates with two instances of the BXTREE-BEDROCK package - a local instance and a remote instance. (Also see Figure 3, which shows the various modules interact during a synchronization of two databases located on two different machines.)
The BEDROCK module is a simple transactional memory management system for the disk. It abstracts the disk as an array of blocks that can be allocated, written to, read from and deallocated. It allows the user to perform long sequences (i.e. transactions) of these operations with all-or-none semantics. In other words, every transaction can be ended either with a commit operation or an abort operation. A commit operation upon finish guarantees that all the operations in the transaction had their effect. A finishing abort operation reverts the state of the bedrock back to what it was immediately after the last commit. In case of a system crash, the transaction going on at the time of the crash is aborted. Thus, upon restart, the bedrock is found to be in the state it was in immediately after the last finishing commit. Often applications need to maintain on the disk large and complex data-structures (like balanced trees) which make sense only when certain invariants are preserved in the data. Using the BEDROCK for storing these data-structures can guarantee an always-consistent state of the data-structure.
For storing the records, we use the well-known B+-Tree data-structure, and augment it so that the summarizing operations can be performed efficiently. In a B+-tree, the real records are stored in the leaf nodes and the internal nodes only store a vector of keys and a vector of pointers to guide the search for records. In our implementation, with each record in a leaf node, an MD5 hash of the record is also stored. Also with every child pointer in every internal node, a summary of the subtree rooted at the child is also stored. The summary in our case contains the number of records in the subtree and the XOR of hashes of all the records in the subtree. It is easy to see that the summary information in the internal nodes can be maintained efficiently during tree-rebalancing operations. We store the blocks of the BXTREE in a bedrock, thus ensuring that the tree is always consistent.
With each internal node storing the summaries of each of its children's subtrees, the Get_Interval_Summaries operation can be implemented easily in time proportional to the height of the tree and the size of the output. In our implementation, the summary of a set of records is the XOR of the MD5 hashes of the records in the set. Note that our summary computation function (the XOR function) is associative, and the summary of the set of records in a subtree does not depend on how the records are stored in the subtree. Hence it makes sense for the synchronization facility to try to compare summaries from two different databases. Also, the Get_All_Summaries operation can be implemented with a similar worst-case time bound.
An alternative to using the BXTREE-BEDROCK package as the summarizable database is to use a general-purpose transactional database like Berkeley DB [BDB] and augment it with the summary computation functions. We preferred to implement the BXTREE-BEDROCK package due to the following design considerations. Firstly, we wanted to separate the balanced-tree data structure and the transactional support it needs as two different software components. As a consequence, we get a transactional support mechanism which can be used separately to implement other data-structures. Similarly, the BXTREE module can also be reused with a different transactional support mechanism. Secondly, we wanted an implementation which has a small memory foot-print and which uses a fixed-size pre-formatted file for storing the records. Thus, all disk activity happens in this bedrock file which is verified at the time of creation.
The three modules were written in ANSI-C. They together make about 25000 lines of code. The source code will soon be distributed under GNU Public License.
With the master process running the synchronization protocol, the slave process just passively acts as a summarizable database answering queries posed by the master. There are two light-weight wrappers involved (one each on the master and the slave side) which are responsible for the network communication between the two sides.
As described above, the synchronization facility identifies the discrepancy sets, which are sets of missing/mismatching keys and makes calls to user-defined handler functions for each of the discrepancy sets. The number of communication rounds taken to identify all the discrepancies is O(t log n), where t is the combined size of the discrepancy sets, and n is the combined size of the two databases restricted to the key range being synchronized. Each communication round generates a fixed amount of network traffic (fixed by the choice of parameters in the synchronization algorithm). Thus, we expect the total network traffic generated during the synchronization operation to be linear in the number of discrepancies, and the size of the actual records transferred between the two sides. We performed the following experiment to verify this linearity.
At the beginning of the experiment, the two databases are identical. The starting database was obtained by pseudo-randomly constructing 30000 records (each with a a 100 byte key and 900 byte value) and inserting them in the database one by one. Starting with identical databases, the two processes then insert m additional records (of the same size as before) each into their respective databases. The additional records are also generated pseudo-randomly, but the two processes use different seeds for generating their additional records. Thus, (with very high probability) the additional records inserted by the two processes are different. Now, the synchronization facility is run by the master process for synchronizing its entire database with the slave's database (the key interval specified for this is (-INFINITY, +INFINITY)). The handler functions used by the master simply transfer missing records from one side to the other, and the discrepancy set K_12 (see the formal definition of synchronizable database ADT above) is simply printed to the standard output. We run the experiment for different values of m and measure the number of bytes transferred between the master and the slave processes in each case. The graph in Figure 4 below plots the total number of bytes transferred versus m, the number of additional insertions on each side. Clearly, except for small values of m, the relationship is close to linear.
Intermemory [IM1, IM2] is a project of the authors which originated the synchronizable database abstraction we discuss here. Intermemory presents users with a globally addressable homogeneous memory which is highly survivable and available. A block written into Intermemory is encoded using erasure-resilient codes and its fragments are dispersed to hundreds of nodes. Each node stores its blocks and fragments in a synchronizable database. Fragment dispersal occurs when neighboring nodes poll each other and synchronize specific ranges of their databases. Here, the discrepancy handler functions correlate the transporting of missing and mismatching records with extra processing required by the intermemory protocol. In this way database synchronization contributes to self-maintenance and self-repair of the system.
is a Computer and Information Science Ph.D. Candidate at the University of Pennsylvania in Philadelphia, Pennsylvania, and a Research Assistant at the NEC Research Institute, Princeton, New Jersey, USA. He received his CIS Master's of Science, Engineering, degree from the University of Pennsylvania in 1998. A part of the degree was completed at Thayer School of Engineering at Dartmouth College, Hanover, New Hampshire, USA. Alexy received his Bachelor's degree in Physics and Mathematics from Moscow Institute of Physics and Technology, Moscow, Russia, in 1993. Alexy's work experience includes a senior software engineering position at the Computer Command and Control Company in Philadelphia, PA, led by Dr. Noah Prywes, focusing on highly reliable software. Alexy's current research interests are focused on the Intermemory project led by Dr. Peter Yianilos. Among various aspects of this multidisciplinary project those of particular interest are synchronizable databases, archival semantics, and system visualization. Alexy's previous research includes work on Intelligent Agent architectures with Dr. George Cybenko, Medical Informatics and AI applications with Dr. Bonnie Webber, and Data Mining works with Dr. Lyle Ungar.
Sumeet Sobti is a graduate student in computer science at the University of Washington. He received his B.Tech. degree from Indian Institute of Technology, Kanpur in 1997. His current research interests include databases, distributed data storage systems, and combinatorial optimization.
Peter N. Yianilos received his Ph.D. in computer science from Princeton University, where he is spending this year as a visitor. He founded Proximity Technology Inc., which became Franklin Electronic Publishers through a merger -- where he served as Chief Scientist and later as President. At Franklin his work on advanced data storage and retrieval technology formed the basis for the first hand-held electronic books. He later spent eight years as a senior research scientist at the NEC Research Institute in Princeton, New Jersey. He is currently Chairman of Netrics.com, an interim consultant to and acting CTO of Franklin Electronic Publishers, and leader of the Intermemory project.