Data Migration

From Blazegraph
Revision as of 15:00, 19 December 2011 by Thompsonbry (Talk | contribs) (version 1.1.0)


Change Log for Backwards Compatibility Issues

This page provides information on changes which break binary compatibility, data migration procedures, and links to utilities which you can use to migrate your data from one bigdata version to another. We try to minimize the need for data migration as much as possible by building versioning information into the root blocks and persistent data structures. However, sometimes implementing a new feature or performance optimization requires us to make a change to bigdata which breaks binary compatibility with older data files. Typically this is because there is a change in the physical schema of the RDF database.

version 1.0.0 => 1.0.1

The following changes in 1.0.1 cause problems with backwards compatibility.

  1. (Unicode clean schema names in the sparse row store).
  2. (TermIdEncoder should use more bits for scale-out).
  3. (TermIdEncoder limits Journal to 2B distinct RDF Values per triple/quad store instance).

These changes were also applied to the 1.0.0 release branch.

Please note: if you are already using the 1.0.0 release branch at r4863 or later, then you do NOT need to migrate your data, as these changes are already present in that branch.

version 1.1.0

The following changes in 1.1.0 cause problems with backwards compatibility. Of these, the main change was the introduction of the BLOBS index for large literals and URIs. This change in the physical schema of the RDF database made it impossible to maintain backwards compatibility with the 1.0.x branch.

  1. (Store large literals as "blobs")
  2. (inline xsd:unsigned datatypes)
  3. (Inline predeclared URIs and namespaces in 2-3 bytes)

Data migration

The most straightforward way to migrate data between bigdata versions is an export/import pattern. The ExportKB utility described below may be used to facilitate this, but it is also easy to do in program code.
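As an illustration, an export/import round trip can be written against the openrdf (Sesame) repository API. The sketch below is an assumption-laden example, not the canonical procedure: the property file names and output path are placeholders, and it assumes the bigdata and openrdf jars are on the classpath.

```java
import java.io.*;
import java.util.Properties;

import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.rdfxml.RDFXMLWriter;

import com.bigdata.rdf.sail.BigdataSail;
import com.bigdata.rdf.sail.BigdataSailRepository;

public class MigrateKB {
    public static void main(String[] args) throws Exception {
        // Export: open the old journal and write the KB as a single RDF/XML document.
        Properties oldProps = new Properties();
        oldProps.load(new FileInputStream("old-kb.properties")); // placeholder path
        Repository oldRepo = new BigdataSailRepository(new BigdataSail(oldProps));
        oldRepo.initialize();
        RepositoryConnection cxn = oldRepo.getConnection();
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("export.rdf.xml"), "UTF-8")) {
            cxn.export(new RDFXMLWriter(w));
        } finally {
            cxn.close();
            oldRepo.shutDown();
        }

        // Import: open a new journal (created with the desired configuration)
        // and load the exported document.
        Properties newProps = new Properties();
        newProps.load(new FileInputStream("new-kb.properties")); // placeholder path
        Repository newRepo = new BigdataSailRepository(new BigdataSail(newProps));
        newRepo.initialize();
        RepositoryConnection cxn2 = newRepo.getConnection();
        try {
            cxn2.add(new File("export.rdf.xml"), null /* baseURI */, RDFFormat.RDFXML);
            cxn2.commit();
        } finally {
            cxn2.close();
            newRepo.shutDown();
        }
    }
}
```

Exporting through a single connection gives a consistent snapshot of the KB, and writing one document keeps blank node identity intact (see the discussion of blank nodes below).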


Each bigdata instance may contain multiple RDF triple stores or quad stores (aka Knowledge Base or KB). Each KB has its own configuration options. If you have only one KB or if all of your KBs have the same configuration, then things are simpler. If you have KBs with different configurations then you will need to pay attention to the export/import procedure for each one.

Blank nodes

Standard RDF semantics for blank nodes require that an export/import process maintain a mapping from each blank node ID to the internal BNode object used to model that blank node. This works fine as long as you export/import a KB as a single RDF document. However, if references to the same blank node ID appear in different RDF documents, they will be construed as distinct blank nodes!
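This effect can be illustrated without any RDF library: a parser keeps a per-document map from blank node ID to bnode object, so the same ID encountered while parsing two different documents resolves to two distinct bnodes. The sketch below is purely illustrative and models the mapping directly.

```java
import java.util.HashMap;
import java.util.Map;

public class BNodeScope {
    /** Stands in for the internal BNode object a parser would create. */
    static final class BNode { }

    /** Resolve an ID within one document's scope, as a parser would. */
    static BNode resolve(Map<String, BNode> docScope, String id) {
        return docScope.computeIfAbsent(id, k -> new BNode());
    }

    public static void main(String[] args) {
        Map<String, BNode> doc1 = new HashMap<>();
        Map<String, BNode> doc2 = new HashMap<>();

        // Within one document, the same ID always resolves to the same bnode.
        System.out.println(resolve(doc1, "_:b1") == resolve(doc1, "_:b1")); // true

        // Across documents, "_:b1" resolves to two distinct bnodes.
        System.out.println(resolve(doc1, "_:b1") == resolve(doc2, "_:b1")); // false
    }
}
```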

Bigdata also supports a "told bnodes" option. When using this option, the blank node IDs are treated in much the same manner as URIs. They are stable identifiers which may be used to refer to the blank node. In this case, the interchange of RDF data may be broken down into multiple documents and blank node identity will be preserved.
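Told bnodes is enabled through the KB configuration properties. The fragment below assumes the AbstractTripleStore option name used in the 1.x line; check the option name against the javadoc for your release.

```
# Treat blank node IDs as stable, told identifiers (option name is an
# assumption -- verify against the AbstractTripleStore.Options javadoc).
com.bigdata.rdf.store.AbstractTripleStore.storeBlankNodes=true
```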

Statement identifiers

Bigdata supports three main modes for a KB: triples, triples with statement identifiers (SIDs), and quads. Statement identifiers provide for statements about statements. See [1].

Statement identifiers behave in many ways like blank nodes. However, the identity of the statement is grounded in the blank node identifier associated with the context position of a "ground" statement. Interchange of SIDs mode data MAY be broken into multiple documents as long as each document contains all ground statements for any metadata statement also found in that document.

SIDs mode data interchange MUST use the bigdata extension of RDF/XML. The simplest approach is to export all data in a KB as a single RDF/XML document and then import that document into a new KB instance.

Axioms and Inferences

When a KB contains materialized inferences you will typically want to export only the "told" triples (those explicitly written onto the KB by the application). After you import the data you can then recompute the materialized inferences. If you export the inferences and/or axioms as well then they will become "told" triples when you import the data into a new bigdata instance.
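Whether a KB records axioms and maintains inferences is controlled by its configuration properties. The fragment below shows the options as assumptions to be checked against the javadoc for your release; it configures a triples-mode KB that stores only told triples.

```
# No axioms and no truth maintenance: the KB holds only told triples
# (option names are assumptions -- verify against the javadoc).
com.bigdata.rdf.store.AbstractTripleStore.axiomsClass=com.bigdata.rdf.axioms.NoAxioms
com.bigdata.rdf.sail.truthMaintenance=false
```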


The com.bigdata.rdf.sail.ExportKB class may be used to facilitate data migration. The ExportKB utility will write each KB onto a separate subdirectory. Both the configuration properties for the KB and the data will be written out. By default, only told triples/quads will be exported.

java -cp ... -server -Dlog4j.configuration=file:bigdata/src/resources/logging/ com.bigdata.rdf.sail.ExportKB [options] propertyFile (namespace*)

Note: This class was introduced after the 1.0.0 release, but the code is backwards compatible with that release. People seeking to migrate from 1.0.0 to 1.0.1 should check out both the 1.0.0 and 1.0.1 releases, then copy the ExportKB class into the same package in the 1.0.0 release and compile a new jar (ant jar). You can then use ExportKB to export data from the 1.0.0 release. You can also download the ExportKB class directly from [2].


Before you import your data, make sure that the new KB is created with the appropriate configuration properties. If you want to change any configuration options for the KB, now is the time to do it. Simply edit the exported configuration properties file (or copy it to a new location and edit the copy).

If you have only a single KB instance on a Journal, then you just need to copy the exported configuration properties for your KB into the configuration properties for your new Journal.

If there are multiple KBs to be imported, then you need to first create the Journal and then create and import each KB in turn using its exported configuration file.
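The multi-KB case can be sketched as follows: each KB lives under its own namespace on the shared Journal, and each is created from its own exported configuration. This is an illustrative sketch only; the namespace property name and the placeholder file paths are assumptions to check against the BigdataSail.Options javadoc for your release.

```java
import java.io.FileInputStream;
import java.util.Properties;

import com.bigdata.rdf.sail.BigdataSail;
import com.bigdata.rdf.sail.BigdataSailRepository;

public class CreateKBs {
    public static void main(String[] args) throws Exception {
        String[] namespaces = { "kb1", "kb2" }; // placeholder namespaces
        for (String ns : namespaces) {
            // Load the configuration exported for this KB...
            Properties p = new Properties();
            p.load(new FileInputStream(ns + ".properties")); // placeholder path
            // ...and bind it to its namespace on the shared journal
            // (property name is an assumption -- check BigdataSail.Options).
            p.setProperty("com.bigdata.rdf.sail.namespace", ns);
            BigdataSailRepository repo =
                new BigdataSailRepository(new BigdataSail(p));
            repo.initialize(); // creates the KB if it does not yet exist
            repo.shutDown();
        }
    }
}
```

Once each KB exists under its namespace, import its data using one of the mechanisms listed below.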

Once you have created the Journal and are ready to import your data, there are a number of ways to import data.

  1. The DataLoader class. See the javadoc for more detailed information.
  2. The NanoSparqlServer.
  3. The openrdf API.
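As one example of the first option, the DataLoader class has a command-line entry point. The invocation below is a sketch in the same form as the ExportKB command above; the classpath, options, and file arguments are placeholders, so check the DataLoader javadoc for the exact usage in your release.

```
java -cp ... com.bigdata.rdf.store.DataLoader [options] propertyFile (fileOrDir)+
```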

Data Migration in Scale-Out

Data migration for a bigdata federation is more complex due to the data scales involved. If the KBs in the federation are using the told bnodes mode, then export/import can be achieved using the same patterns described above. However, if the cluster is using standard RDF blank node semantics, then export/import is more complex: the data can only be reliably interchanged as a single massive RDF "document", or via special purpose code designed to handle the isomorphism of a very large number of blank nodes between two graphs.