Synchronizable Databases for the Web

Alexy V. Khrabrov, University of Pennsylvania and NEC Research Institute, <braver@suffix.com>
Sumeet Sobti, University of Washington and NEC Research Institute, <sobti@cs.washington.edu>
Peter N. Yianilos, Intermemory.org, <pny@cs.princeton.edu>

Abstract

In this paper we introduce a simple abstraction called a synchronizable database, and discuss its potential for managing a diverse range of web and internet applications. We illustrate its utility in improving performance of several existing web applications, and also describe several novel ways of using this abstraction. We give a two-layer architecture for implementing a synchronizable database. The abstraction forming the bottom layer is referred to as a summarizable database. In addition to traditional database functionality, it provides a facility for efficiently computing a digest(hash) of the records within any specified key range. The top layer is the synchronization facility. It implements an elegant protocol for synchronizing two summarizable databases over a limited bandwidth link such as the internet. We describe the synchronization facility in full, and briefly discuss our prototype implementation of a summarizable database.

1 Introduction

An emerging theme in the area of advanced distributed web-site design is that of synchronization. That is, the act of forcing whole or partial agreement between two databases, or directory hierarchies; where the two are connected by a limited bandwidth link such as the internet or a local area network. Figure 1 gives an example of database synchronization.

[db sync example]

Figure 1. Synchronizable Databases. On the left, the figure depicts a car dealer's database that maintains a schedule of factory repairs. Every time a customer needs a factory repair, the dealer schedules it in his database. The factory, on the right, maintains a master schedule of repairs. Periodically the two must be synchronized -- in this case the new and changed orders must be copied from the dealer's site to the factory site and inserted into the factory's database. In order to rapidly identify the records to be transferred, the databases each support a special facility for efficient computation of a digest(hash) of any specified range of keys, which amounts to a summary of the range. This special facility corresponds to our summarizable database abstraction. An efficient synchronization facility then uses this smart summarization mechanism to minimize the amount of data that must be transferred.

In this paper we introduce an abstraction called a synchronizable database, which is best viewed as an extension of traditional database functionality. Databases with this new functionality can then become building blocks of larger distributed web applications.

We give an architecture for a synchronizable database that divides the problem into two levels. The lower level provides traditional database functionality in addition to a summarization function that operates on a specified range of keys. The summary of a range of keys is defined as a digest(hash) of all records in the specified range. The top level implements a synchronization protocol, calling on the lower level to perform traditional database functions as well as to provide range summaries.

Distributed databases are already users of synchronization services [PUMA], and synchronization of a complete modern relational database system can be quite complex (see [ORA]). The motivation for synchronization is clear: commercial distributed databases are often interrelated and require regular content-aware updates [AWE] to ensure their consistency with each other; database updates can trigger important actions. For instance, in a chain of retailers backed by a central warehouse, a purchase in a branch causes update of the local database, which must be propagated to the central one, potentially causing restocking of the branch or even the warehouse itself. Large volumes of transactions create large databases, which generate excessive traffic when synchronized on a daily basis. Another web activity in need of better synchronization methods is web-site replication and mirroring [WWW8]. Publishing, distribution, and accessibility of web sites is often achieved by extensive replication of the sites, when files and whole subdirectories are transferred across the network recursively until an up-to-date image of the local site is copied to the remote node. The problem is further complicated as sites increasingly rely on databases to drive them. Simply mirroring these databases, which can be quite large but change slowly, is clearly not the best solution. Our work aims to provide a new building block to deal with these problems and others.

Traditional synchronization tools, e.g. Unix diff and patch, are file-based, requiring both files to be present locally. A network application built using these must transfer an entire file before updating another, e.g. via rcp. An example of file-based synchronization system is CVS [CVS], often used for updating a web site from a local development tree. [FILE] describes a formal semantics for file synchronizers. File-based synchronization proves too coarse when a small change requires resending a whole database file. A better synchronization architecture allows for minimal amount of data to be identified and sent over the network when synchronizing databases. Rsync [RSYNC] is a replacement for rcp which breaks files into pieces, computes their hashes, and transfers only the hashes first. Then only the pieces of data which are absent on the other side (as determined by comparing the hashes), are sent in full. One can even replace rcp transport in CVS by rsync. Since the summary of the changes is obtained by partitioning and hashing in rsync, it has to be recomputed every time. Moreover, the hashes thus computed are not reusable for other applications. We came to see data summary as a general service, provided to any applications using the database. We introduce the concept of a summarizable database, which can summarize subsets of its data for general use in applications managing change in a local or distributed setting (see Section 3, Applications below).

For our purposes a database is defined to be a collection of (key, value) pairs, along with certain traditional functionality and some form of transaction support. Our two-layer architecture starts by enhancing (key, value) databases by adding summary capabilities to foster synchronization, and further includes a protocol that uses these enhanced databases to solve the complete synchronization problem. In the following sections we describe the summarizable database (the lower layer) and synchronization facility (the upper layer) in more detail, briefly report on a prototype implementation, and overview the promising applications of synchronizable databases.

2 Synchronizable Database - An Abstract Data Type

2.1 Formal Description

A synchronizable database D is a set containing records of the form (key, value). The key field takes values from a totally-ordered universe K of keys. Any key in K occurs in D at most once.

The major operations supported by a synchronizable database as an abstract data type are insertion, deletion, retrieval, and range synchronization. The first three are standard operations on databases with records. The last one, unique to synchronizable databases, is specified below.

The input for range synchronization is an interval I of K (remember that K is the totally-ordered universe of keys) and two databases D_1 and D_2. The range synchronization operation basically tries to make the restrictions of D_1 and D_2 to I identical. In particular, it identifies three sets of keys, which we call the discrepancy sets, K_1, K_2 and K_12. These three sets are all subsets of the key interval I. Discrepancy set K_1 is the set of keys in D_1 which are not in D_2, K_2 is the set of keys in D_2 which are not in D_1, and K_12 is the set of keys present in both D_1 and D_2 but whose corresponding records in the two databases differ in the value field. The three discrepancy sets can be processed by different handler functions. Typically, the handler functions for K_1 and K_2 would copy the missing records from one database to the other. A typical handler function for K_12 would, for each key in the discrepancy set, compare the records in D_1 and D_2 that have the key and replace one of them with the other, e.g. the older one by the more recent one.

2.2 A Two-layer Architecture

We propose a two-layer architecture for implementing a synchronizable database. The bottom layer in the architecture consists of a general purpose database engine, enhanced with capabilities for computing certain kind of summaries. We call it a summarizable database. A typical summary operation on the database would input a key range (i.e. an interval of K, the universe of keys), and return a short summary of the set of records in the database whose key fields lie in the given key range.

The top layer consists of a synchronization facility, which makes use of the summarizing capabilities of the layer below it to implement the range synchronization operation. The synchronization facility is optimized for the case when the databases being synchronized are located on different processors connected via a limited bandwidth link. Thus, one of the goals here is to minimize the network traffic generated by the range synchronization operation. A naive implementation of the facility would just bring one of the databases to the other side and do a comparison to identify the discrepancies. But this is found to be both inefficient and unnecessary. In our design, only summaries of portions of one database are transferred across the network in order to identify the discrepancies. Once the discrepancies are identified, the handler functions can choose to transfer only those records which are need to be transferred to make the databases synchronized. Thus, unnecessary transfer and processing of large amounts of data is prevented.

2.3 The Bottom Layer - A Summarizable Database

On an abstract level, this layer implements a database engine for storing and managing records of the form (key, value), enhanced with some support for range synchronization. It provides the standard functions for insertion, deletion and retrieval of records. In addition to these, it provides the following two operations which are used by the synchronization facility for implementing the range synchronization operation.

Get_All_Summaries: The input is an interval I of K. The output is a list of pairs of the form (key, summary). The list has one pair for each record in the database whose key field belongs to I. The first element in the pair is the key field of the record, and the second element is a fixed size summary of the record. If the database has no record with key field belonging to I, an empty list is returned. (The summary is supposed to include a cryptographically strong hash of the record so that the probability of two different records yielding the same summary is extremely small. This is important because the synchronization facility, as described below, would conclude that two records are equal if their summaries are equal.)

Get_Interval_Summaries: The input is an interval I of K and a positive integer H. The function partitions I into at most H disjoint sub-intervals and returns a list of triplets of the form (key_interval, num_records, summary). The list has one triplet for each sub-interval. The first element of the triplet is the sub-interval itself; the second and third elements are, respectively, the number and a fixed size summary of the set of records in the database whose key fields belong to the sub-interval. Whether the database has any records with key field belonging to I or not, in either case the list returned is non-empty and the sub-intervals in the list form a disjoint partition of I. (Again, the summary should be computed in such a way that the event of two different sets of records yielding the same summary is extremely unlikely.) It turns out that it helps the synchronization facility to have the sub-intervals such that the database has about equal amount of data in each of the sub-intervals. In particular, it improves the efficiency of the synchronization algorithm by reducing the number of communication rounds. This point will become clearer after the description of the synchronization algorithm in the next subsection.

2.4 The Top Layer - A Synchronization Facility

The synchronization facility implements the range synchronization operation on summarizable databases. As mentioned above, our design is tuned for the case when the databases being synchronized are located on different processors connected via a limited bandwidth link. Our design of the facility is asymmetric in the sense that it sees one database as a local database and the other one as a remote database. It is assumed that local database can be accessed with no (or minimal) network traffic, whereas remote database accesses generate network traffic. This asymmetric view allows the facility to make optimizations that minimize network traffic. Typically, the synchronization operation will be used by a processor to synchronize its local database with some remote database.

The synchronization algorithm starts by asking both databases to compute a single summary of all records lying in the given key interval. The Get_Interval_Summaries function is invoked for this. The remote summary is transferred to the local site and compared with the local summary. If the summaries match, it is concluded that the databases are already synchronized in the given key interval. Otherwise the size of the remote database restricted to the given key interval is checked (remember that the triplets returned by the Get_Interval_Summaries functions also has a num_records field). If the remote database only has a small number of records lying in the key interval, then summaries for all those individual records are transferred from the remote to the local site (the Get_All_Summaries function is invoked for this), and a record-by-record comparison is made to identify discrepancies. Otherwise the remote database is asked (by calling the Get_Interval_Summaries function) to partition the key interval into smaller sub-intervals and send summaries for each of the sub-intervals. These remote summaries are then compared against corresponding local summaries and the synchronization operation is invoked recursively for those sub-intervals whose summaries do not match.

Given the assumption that the Get_Interval_Summaries function partitions the input key range into sub-intervals containing almost equal amount of data, the synchronization operation takes at most log n communication rounds to identify each discrepancy, where n is the total size of the two databases restricted to the given key range. Thus, the total number is communication rounds is O(t log n), where t is the combined size of the three discrepancy sets. As we describe below, all operations for summarizable databases can be supported with O(log n) worst case time complexity. Thus, the overall computational complexity for the synchronization operation is O(t log2 n).

In practice however, the number of rounds taken and the network traffic generated depend on several factors, including how the discrepancies are distributed across the entire key range and how the parameters in the algorithm are chosen. There are two main parameters in the synchronization algorithm which need to be carefully chosen. The first one determines when to invoke the Get_All_Summaries operation instead of further partitioning with the Get_Interval_Summaries operation. The second parameter determines the number of partitions obtained through calls to Get_Interval_Summaries. The choice of these parameters to a very large extent determines the actual network traffic generated and the number of communication rounds taken. For example, if the algorithm decides to invoke the Get_All_Summaries function on key intervals for which the remote database has huge number of records, then the number of communication rounds would be small but every round would generate heavy network traffic. Similarly, if the Get_All_Summaries function is invoked only on very small key intervals, then the number of rounds will be large.

3 A B+-Tree based Implementation of Synchronizable Databases

We have a working B+-Tree based implementation of the synchronizable database abstraction. Since in this paper our focus is on presenting the abstraction of synchronizable databases and discussing its potential for the web, here we only briefly describe our implementation. Further implementation details will appear elsewhere. The software consists of three separate modules - OSYNCH, BXTREE and BEDROCK. Figure 2 presents the software architecture of our implementation.
[db architecture]

Figure 2. We implement a synchronizable database as three modules: OSYNCH provides the synchronization facility, and the BXTREE and BEDROCK modules together comprise the summarizable database. The BXTREE module implements an industrial-strength B+TREE, enhanced with summaries. The summaries are produced by hashing the leaf data and storing the hashes in the leaves, and storing XORs of all the children nodes along with each parent node. The BEDROCK module provides a transactional non-volatile storage layer, which allows for consistent, safe updates of the BXTREE.

The OSYNCH module (for Object Synchronization) implements the synchronization algorithm described in section 2.4. The BXTREE (for Extended B+-Tree) and BEDROCK modules constitute a summarizable database. When invoked for range synchronization, OSYNCH communicates with two instances of the BXTREE-BEDROCK package - a local instance and a remote instance. (Also see Figure 3, which shows the various modules interact during a synchronization of two databases located on two different machines.)

The BEDROCK module is a simple transactional memory management system for the disk. It abstracts the disk as an array of blocks that can be allocated, written to, read from and deallocated. It allows the user to perform long sequences (i.e. transactions) of these operations with all-or-none semantics. In other words, every transaction can be ended either with a commit operation or an abort operation. A commit operation upon finish guarantees that all the operations in the transaction had their effect. A finishing abort operation reverts the state of the bedrock back to what it was immediately after the last commit. In case of a system crash, the transaction going on at the time of the crash is aborted. Thus, upon restart, the bedrock is found to be in the state it was in immediately after the last finishing commit. Often applications need to maintain on the disk large and complex data-structures (like balanced trees) which make sense only when certain invariants are preserved in the data. Using the BEDROCK for storing these data-structures can guarantee an always-consistent state of the data-structure.

For storing the records, we use the well-known B+-Tree data-structure, and augment it so that the summarizing operations can be performed efficiently. In a B+-tree, the real records are stored in the leaf nodes and the internal nodes only store a vector of keys and a vector of pointers to guide the search for records. In our implementation, with each record in a leaf node, an MD5 hash of the record is also stored. Also with every child pointer in every internal node, a summary of the subtree rooted at the child is also stored. The summary in our case contains the number of records in the subtree and the XOR of hashes of all the records in the subtree. It is easy to see that the summary information in the internal nodes can be maintained efficiently during tree-rebalancing operations. We store the blocks of the BXTREE in a bedrock, thus ensuring that the tree is always consistent.

With each internal node storing the summaries of each of its children's subtrees, the Get_Interval_Summaries operation can be implemented easily in time proportional to the height of the tree and the size of the output. In our implementation, the summary of a set of records is the XOR of the MD5 hashes of the records in the set. Note that our summary computation function (the XOR function) is associative, and the summary of the set of records in a subtree does not depend on how the records are stored in the subtree. Hence it makes sense for the synchronization facility to try to compare summaries from two different databases. Also, the Get_All_Summaries operation can be implemented with a similar worst-case time bound.

An alternative to using the BXTREE-BEDROCK package as the summarizable database is to use a general-purpose transactional database like Berkeley DB [BDB] and augment it with the summary computation functions. We preferred to implement the BXTREE-BEDROCK package due to the following design considerations. Firstly, we wanted to separate the balanced-tree data structure and the transactional support it needs as two different software components.  As a consequence, we get a transactional support mechanism which can be used separately to implement other data-structures. Similarly, the BXTREE module can also be reused with a different transactional support mechanism. Secondly, we wanted an implementation which has a small memory foot-print and which uses a fixed-size pre-formatted file for storing the records. Thus, all disk activity happens in this bedrock file which is verified at the time of creation.

The three modules were written in ANSI-C. They together make about 25000 lines of code. The source code will soon be distributed under GNU Public License.

3.1 A Simple Experiment

 In this section, we describe a simple experiment which illustrates how the various software modules interact during a synchronization operation. The experiment runs two processes on two different machines. We call them the master and the slave processes. Both processes locally maintain a synchronizable database. The master process invokes the synchronization facility to synchronize its local database with the slave process's database. The major software components involved in the synchronization operation and their interaction are shown in Figure 3.
[db communication]

Figure 3. In a simple experiment, two synchronizable databases bring their data up to date with respect to each other. The master node runs the OSYNCH synchronization facility, and the other passively answers the master's queries. LW (local wrapper) and RW (remote wrapper) are light-weight software modules that manage the actual network communication that goes on during the synchronization operation. The LW-RW channel receives summary requests from the OSYNCH module and forwards them to the remote database. Similarly, it forwards answers from the remote database to the OSYNCH module. It is also used by the discrepancy handlers (DH) for transferring records between the two databases. Here it is important to note that even though one party plays the role of the master in the protocol, records may be transferred in either direction.

With the master process running the synchronization protocol, the slave process just passively acts as a summarizable database answering queries posed by the master. There are two light-weight wrappers involved (one each on the master and the slave side) which are responsible for the network communication between the two sides.

As described above, the synchronization facility identifies the discrepancy sets, which are sets of missing/mismatching keys and makes calls to user-defined handler functions for each of the discrepancy sets. The number of communication rounds taken to identify all the discrepancies is O(t log n), where t is the combined size of the discrepancy sets, and n is the combined size of the two databases restricted to the key range being synchronized. Each communication round generates a fixed amount of network traffic (fixed by the choice of parameters in the synchronization algorithm). Thus, we expect the total network traffic generated during the synchronization operation to be linear in the number of discrepancies, and the size of the actual records transferred between the two sides. We performed the following experiment to verify this linearity.

At the beginning of the experiment, the two databases are identical. The starting database was obtained by pseudo-randomly constructing 30000 records (each with a a 100 byte key and 900 byte value) and inserting them in the database one by one. Starting with identical databases, the two processes then insert m additional records (of the same size as before) each into their respective databases. The additional records are also generated pseudo-randomly, but the two processes use different seeds for generating their additional records. Thus, (with very high probability) the additional records inserted by the two processes are different. Now, the synchronization facility is run by the master process for synchronizing its entire database with the slave's database (the key interval specified for this is (-INFINITY, +INFINITY)). The handler functions used by the master simply transfer missing records from one side to the other, and the discrepancy set K_12 (see the formal definition of synchronizable database ADT above) is simply printed to the standard output. We run the experiment for different values of m and measure the number of bytes transferred between the master and the slave processes in each case. The graph in Figure 4 below plots the total number of bytes transferred versus m, the number of additional insertions on each side. Clearly, except for small values of m, the relationship is close to linear.

[db experiment]

Figure 4. Synchronization efficiency in our experiment. The horizontal axis shows the number of additional records inserted in each of the two databases, and the total number of bytes transferred between the two sides is shown on the vertical axis. The total number of bytes consists of two things: (1) the data bytes, i.e. the actual records that are missing on either side, and (2) the traffic generated by the synchronization facility to identify the missing records. Each record consists of 2000 bytes. Therefore, in a run with m insertions on each side, the number of data bytes is 2000 times 2m. For example, of the 2500K bytes transferred during the run with m = 200, about 800K bytes are the actual data bytes and the remaining traffic is synchronization overhead. The graph is asymptotically linear, showing that the overhead is upper-bounded by a constant factor of the actual data.

4 Applications

Synchronizable databases can be put to use in vast number of web-database applications of today. Performance of the existing web applications can be much improved once they take advantage of the synchronizable databases. Moreover, novel protocols and applications of tomorrow become possible with synchronizable databases at heart. We give some examples of the types of applications below.

4.1 Personal Data Synchronization

Mobile computing is the area where data synchronization is an everyday issue, and the advent of Personal Data Assistants (PDAs) literally drove it home to the general public. It is critical that key personal data, such as schedules, contacts, balances, is up to date, accurate, survivable, and available [PUMA]. Sharing it among PDAs and PCs requires content-aware synchronization [AWE], necessary to bring schedules, e-mail lists, etc., stored in different formats in different programs, in sync with each other. A special protocol for storing personal data in a standardized synchronizable database will be of great use in the era of mobile devices.

4.2 Synchronizable Databases for E-Commerce

Computer databases supporting commerce, both traditional and electronic, keep records of transactions which have to be reconciled, audited, accounted, and reported for further systemic actions. For instance, an online business-to-business order, once fulfilled, will cause changes in the databases of the producer, wholesaler, distributor, carrier, consumer, and their banks. Restocking of a product in a supermarket will cause an ensemble of changes in the database of a warehouse, delivery, and branches. A manufacturing concern's database is intimately linked with that of its sub-contractors. Striving for high degree of integration, modern businesses increasingly interlink their database systems. Changes enacted in one of such databases need to be propagated through the whole organization. Synchronizing traditional databases at the record level, while feasible, requires extensive expertise [ORA]. Synchronization is a natural model for effecting such propagation, ensuring the whole system is up to date transparently.

4.3 Synchronizable File Systems

Typical web site synchronization or distributed development, based on a CVS repository, requires the whole file to be transferred every time a change is made in it, however small. Although programs exist, notably rsync to perform fragment-based synchronization of a file, they chose their fragments ad hoc, not guaranteeing optimal size, boundaries, and performance; the fragments do not share a summary framework, and their digests have to be recomputed every time a synchronization is ordered. In line with the view of a file system as a database, we propose summarizable file systems, where each file's blocks will carry a key. Thus coherent, fasts digests of clusters of file blocks will become possible. For example, all of the blocks for the files in a subdirectory can be viewed as a summarizable database. Consequently, two subdirectories can be optimally synchronized at the block level (below file level). When the subdirectory just happened to be called public_html, an optimal web site synchronization will be transparently achieved, within the file system. Regular synchronization of important files with a mirror server will transparently ensure automatic backups. Possibilities for applications of synchronizable file systems are infinite.

4.4 Distributed Protocols

The synchronizable database abstraction is useful in implementing new types of distributed protocols. One example is the Intermemory protocol.

Intermemory [IM1, IM2] is a project of the authors which originated the synchronizable database abstraction we discuss here. Intermemory presents users with a globally addressable homogeneous memory which is highly survivable and available. A block written into Intermemory is encoded using erasure-resilient codes and its fragments are dispersed to hundreds of nodes. Each node stores its blocks and fragments in a synchronizable database. Fragment dispersal occurs when neighboring nodes poll each other and synchronize specific ranges of their databases. Here, the discrepancy handler functions correlate the transporting of missing and mismatching records with extra processing required by the intermemory protocol. In this way database synchronization contributes to self-maintenance and self-repair of the system.

4.5 Synchronizable DNS

Domain Name Service (DNS, [DNS]) is an example of a distributed protocol of paramount importance to the Internet as a whole. DNS is at the heart of the current global internet naming hierarchies, performing node name lookups, polls, and record synchronization all the time. DNS is the largest special-purpose synchronizable database in the world. Because of the size of the Internet, it can take up to three days for a new domain record to propagate around the world. Specialized DNS coordination may be replaced by a general synchronizable node architecture, where each node will run a synchronizable database capable of supporting several synchronizing services simultaneously. Synchronizable DNS is one example of such synchronizing services, while Intermemory is another, and both can share the same synchronizable database capabilities.

4.6 Geographic Resource Databases

Mobile computing puts synchronization to massive use when "syncing with home base." However, much greater synchronization needs are anticipated by the emerging mobile searches. These searches occur when a mobile system, e.g. a car navigation computer, enters a new area and inquires about a local resource, e.g. the nearest Italian trattoria. Several frameworks for cooperative mobile devices are currently discussed, e.g. .geo [GEO] domains, and all of them will require large hierarchical databases of resources, which we may call geographical resource databases, or simply geo-dbs. In the decentralized spirit of the Internet, geo-dbs can grow bottom-up, with the parent areas presenting a coherent searchable summary, obtained by synchronizing with all the enclosed children.

4.7 System Chronology

In many distributed applications, each node has a daemon which keeps a log of events along with their time stamps. General audit of a subset of such nodes will need all the logs from the surveyed nodes to be synchronized together and interwoven in a coherent line of events. The problem becomes a simple one once the logs are maintained as synchronizable databases with the timestamps as the keys. Then the complete event history for any particular time interval can simply be obtained by synchronizing all the logs restricted to that interval. This approach will be used to study the evolution of the Intermemory prototype now under development.

5 Conclusion

The abstraction of synchronizable databases is a simple and powerful one. As a new building block it encapsulates an important emerging feature of leading-edge distributed web-database applications. It promises to improve performance of some existing applications and provides a solution for certain new ones. The result is less time and network bandwidth spent, and the ability to more easily construct complex distributed designs that include synchronization.

REFERENCES

[IM1]
Andrew Goldberg and Peter Yianilos, Towards an Archival Intermemory, in IEEE Conference on Advances in Digital Libraries, IEEE Press, Piscataway, NJ, 1998. Also at www.intermemory.org.

[IM2]
Yuan Chen, Jan Edler, Andrew Goldberg, Allan Gottlieb, Sumeet Sobti and Peter Yianilos, A Prototype Implementation of Archival Intermemory, In Proceedings of the 4th ACM Conference on Digital Libraries, Berkeley, 1999. Also at www.intermemory.org.

[CVS]
Per Cederqvist et al.. CVS Reference Manual, http://www.cyclic.com/cvs/doc-cederqvist.html.

[RSYNC]
Rsync home page, http://rsync.samba.org/.

[BDB]
Berkeley DB, a product of Sleepycat Software, http://www.sleepycat.com/.

[ORA]
Dean Daniels, Lip Boon Doo, Alan Downing, Curtis Elsbernd, Gary Hallmark, Sandeep Jain, Bob Jenkins, Peter Lim, Gordon Smith, Benny Souder, Jim Stamos, Oracle's Symmetric Replication Technology and Implications for Application Design, In Proceedings of the ACM SIGMOD Conference, Minneapolis, MN, 1994.

[DNS]
Paul Albitz and Cricket Liu, DNS and BIND, O'Reilly, 1998.

[GEO]
Tomasz Imielinski and Julio C. Navas, GPS-Based Geographic Addressing, Routing, and Resource Discovery, Communications of the ACM, 1999, vol. 42, no. 4, pp. 86-92.

[PUMA]
Designing effective synchronization solutions, A White Paper from Puma Technology, http://www.pumatech.com/.

[AWE]
Arturo Crespo and Hector Garcia-Molina, Awareness Services for Digital Libraries, In Proceedings of First European Conference on Research and Advanced Technology for Digital Libraries, Pisa, Italy, 1997.

[FILE]
S. Balasubramaniam and Benjamin C. Pierce, What is a File Synchronizer?, In Proceedings of ACM MOBICOM 98, Dallas, TX, 1998.

[WWW8]
Michael Rabinovich and Amit Aggarwal, RaDaR: A Scalable Architecture for a Global Web Hosting Service, In Proceedings of the 8th International World Wide Web Conference, Toronto, Canada, 1999.

About the Authors

Alexy V. Khrabrov is a Computer and Information Science Ph.D. Candidate at the University of Pennsylvania in Philadelphia, Pennsylvania, and a Research Assistant at the NEC Research Institute, Princeton, New Jersey, USA. He received his CIS Master's of Science, Engineering, degree from the University of Pennsylvania in 1998. A part of the degree was completed at Thayer School of Engineering at Dartmouth College, Hanover, New Hampshire, USA. Alexy received his Bachelor's degree in Physics and Mathematics from Moscow Institute of Physics and Technology, Moscow, Russia, in 1993. Alexy's work experience includes a senior software engineering position at the Computer Command and Control Company in Philadelphia, PA, led by Dr. Noah Prywes, focusing on highly reliable software. Alexy's current research interests are focused on the Intermemory project led by Dr. Peter Yianilos. Among various aspects of this multidisciplinary project those of particular interest are synchronizable databases, archival semantics, and system visualization. Alexy's previous research includes work on Intelligent Agent architectures with Dr. George Cybenko, Medical Informatics and AI applications with Dr. Bonnie Webber, and Data Mining works with Dr. Lyle Ungar.

Sumeet Sobti is a graduate student in computer science at the University of Washington. He received his B.Tech. degree from Indian Institute of Technology, Kanpur in 1997. His current research interests include databases, distributed data storage systems, and combinatorial optimization.

Peter N. Yianilos received his Ph.D. in computer science from Princeton University, where he is spending this year as a visitor. He founded Proximity Technology Inc., which became Franklin Electronic Publishers through a merger -- where he served as Chief Scientist and later as President. At Franklin his work on advanced data storage and retrieval technology formed the basis for the first hand-held electronic books. He later spent eight years as a senior research scientist at the NEC Research Institute in Princeton, New Jersey. He is currently Chairman of Netrics.com, an interim consultant to and acting CTO of Franklin Electronic Publishers, and leader of the Intermemory project.