Data Migration


We try to minimize the need for data migration by building versioning information into the root blocks and persistent data structures. However, implementing a new feature or performance optimization sometimes requires a change to bigdata which breaks binary compatibility with older data files, typically because the physical schema of the RDF database has changed.

This page documents changes which break binary compatibility, describes data migration procedures, and links to utilities which you can use to migrate your data from one bigdata version to another.

= Change Log for Backwards Compatibility Issues =

== version 1.0.0 => 1.0.1 ==

The following changes in 1.0.1 break backwards compatibility with data files created by 1.0.0:

# https://sourceforge.net/apps/trac/bigdata/ticket/107 (Unicode clean schema names in the sparse row store).
# https://sourceforge.net/apps/trac/bigdata/ticket/124 (TermIdEncoder should use more bits for scale-out).
# https://sourceforge.net/apps/trac/bigdata/ticket/349 (TermIdEncoder limits Journal to 2B distinct RDF Values per triple/quad store instance).

These changes were applied to the 1.0.0 release branch:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/branches/BIGDATA_RELEASE_1_0_0

Please note: if you are already using the 1.0.0 release branch after r4863, then you do NOT need to migrate your data, as these changes are already present in the branch.

= Data migration =

The most straightforward way to migrate data between bigdata versions is an export/import pattern. The [[#MigrationUtility|MigrationUtility]] may be used to facilitate this, but it is also easy to do in application code.
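
For example, the sketch below uses the Sesame (openrdf) Repository API exposed through the BigdataSail to dump a KB into a single RDF/XML document and reload it into a journal created with the newer bigdata version. Treat it as an outline rather than a finished tool: the journal file property key and the connection handling are assumptions based on the 1.0.x API and should be checked against your release.

<pre>
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Properties;

import org.openrdf.repository.RepositoryConnection;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.rdfxml.RDFXMLWriter;

import com.bigdata.rdf.sail.BigdataSail;
import com.bigdata.rdf.sail.BigdataSailRepository;

/**
 * Sketch of an export/import migration: dump the KB on the old journal as a
 * single RDF/XML document, then load that document into a new journal.
 * Class and property names are assumptions based on the 1.0.x code base.
 */
public class ExportImportSketch {

    // Assumed property key for the backing journal file (com.bigdata.journal.Options.FILE).
    private static final String FILE = "com.bigdata.journal.AbstractJournal.file";

    public static void main(final String[] args) throws Exception {

        final File dump = new File("kb-dump.rdf.xml");

        // 1. Export: write all explicit statements as one RDF/XML document.
        final BigdataSailRepository oldRepo = openRepository("old.jnl");
        RepositoryConnection cxn = oldRepo.getConnection();
        try {
            final Writer w = new OutputStreamWriter(new FileOutputStream(dump), "UTF-8");
            try {
                cxn.export(new RDFXMLWriter(w));
            } finally {
                w.close();
            }
        } finally {
            cxn.close();
            oldRepo.shutDown();
        }

        // 2. Import: load the dump as a single document into a journal that was
        //    created with the new bigdata version.
        final BigdataSailRepository newRepo = openRepository("new.jnl");
        cxn = newRepo.getConnection();
        try {
            cxn.setAutoCommit(false);
            cxn.add(dump, null/* baseURI */, RDFFormat.RDFXML);
            cxn.commit();
        } finally {
            cxn.close();
            newRepo.shutDown();
        }
    }

    private static BigdataSailRepository openRepository(final String journalFile)
            throws Exception {
        final Properties p = new Properties();
        p.setProperty(FILE, journalFile);
        final BigdataSailRepository repo = new BigdataSailRepository(new BigdataSail(p));
        repo.initialize();
        return repo;
    }
}
</pre>

Because both the export and the import treat the KB as one document, standard blank node semantics are preserved; see the notes on blank nodes below.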

== Background ==

=== Blank nodes ===

Standard RDF semantics for blank nodes requires that an export/import process maintain a mapping from each blank node ID to the internal BNode object used to model that blank node. This works fine as long as you export/import a KB as a '''single''' RDF document. However, if references to the same blank node ID appear in different RDF documents, they will be construed as '''distinct''' blank nodes!

Bigdata also supports a "told bnodes" option. When using this option, the blank node IDs are treated in much the same manner as URIs. They are stable identifiers which may be used to refer to the blank node. In this case, the interchange of RDF data may be broken down into multiple documents and blank node identity will be preserved.
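
To illustrate, told bnodes mode is selected by a KB property when the KB is created. A minimal sketch follows; the storeBlankNodes property key is an assumption recalled from AbstractTripleStore.Options and should be verified against your release.

<pre>
import java.util.Properties;

import com.bigdata.rdf.sail.BigdataSail;

/**
 * Sketch: configure a KB in "told bnodes" mode so that blank node IDs are
 * treated as stable identifiers across export/import. The property keys are
 * assumptions based on the 1.0.x Options classes.
 */
public class ToldBnodesConfig {

    public static Properties toldBnodesProperties(final String journalFile) {
        final Properties p = new Properties();
        // Assumed key for the backing journal file (com.bigdata.journal.Options.FILE).
        p.setProperty("com.bigdata.journal.AbstractJournal.file", journalFile);
        // Assumed key for told bnodes (AbstractTripleStore.Options.STORE_BLANK_NODES):
        // blank node IDs are stored and reported as given rather than being renamed.
        p.setProperty("com.bigdata.rdf.store.AbstractTripleStore.storeBlankNodes", "true");
        return p;
    }

    public static void main(final String[] args) throws Exception {
        // A KB opened with these properties may interchange its data as several
        // documents without losing blank node identity.
        final BigdataSail sail = new BigdataSail(toldBnodesProperties("told-bnodes.jnl"));
        sail.initialize();
        sail.shutDown();
    }
}
</pre>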

=== Statement identifiers ===

Bigdata supports three main modes for a KB: triples, triples with statement identifiers (SIDs), and quads. Statement identifiers provide for statements about statements. See [https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=GettingStarted#You_claim_that_you.27ve_.22solved.22_the_provenance_problem_for_RDF_with_statement_identifiers._Can_you_show_me_how_that_works.3F the GettingStarted FAQ entry on statement identifiers].

Statement identifiers behave in many ways like blank nodes. However, the identity of the statement is grounded in the blank node identifier associated with the context position of a "ground" statement. Interchange of SIDs mode data MAY be broken into multiple documents as long as each document contains all ground statements for any metadata statement also found in that document.

SIDs mode data interchange MUST use the bigdata extension of RDF/XML. The simplest approach is to export all data in the KB as a single RDF/XML document and then import that document into a new KB instance.

=== Axioms and Inferences ===

When a KB contains materialized inferences you will typically want to export only the "told" triples (those explicitly written onto the KB by the application). After you import the data you can then recompute the materialized inferences. If you export the inferences and/or axioms as well then they will become "told" triples when you import the data into a new bigdata instance.
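
One way to restrict the export to told triples, sketched below, is to request statements through the Sesame API with includeInferred=false so that axioms and materialized inferences are not handed to the RDF writer; whether that filter matches your inference configuration is an assumption worth verifying against your bigdata version.

<pre>
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Properties;

import org.openrdf.repository.RepositoryConnection;
import org.openrdf.rio.rdfxml.RDFXMLWriter;

import com.bigdata.rdf.sail.BigdataSail;
import com.bigdata.rdf.sail.BigdataSailRepository;

/**
 * Sketch: export only the told triples from a KB that has materialized
 * inferences. Property keys are assumptions based on the 1.0.x code base.
 * After importing into the new KB, recompute the closure to restore the
 * inferences.
 */
public class ExportToldTriples {

    public static void main(final String[] args) throws Exception {

        final Properties p = new Properties();
        // Assumed property key for the backing journal file (com.bigdata.journal.Options.FILE).
        p.setProperty("com.bigdata.journal.AbstractJournal.file", "old.jnl");

        final BigdataSailRepository repo = new BigdataSailRepository(new BigdataSail(p));
        repo.initialize();

        final RepositoryConnection cxn = repo.getConnection();
        try {
            final Writer w = new OutputStreamWriter(
                    new FileOutputStream("told-triples.rdf.xml"), "UTF-8");
            try {
                // includeInferred=false: only statements explicitly written by the
                // application are exported; axioms and inferences are skipped.
                cxn.exportStatements(null/* subj */, null/* pred */, null/* obj */,
                        false/* includeInferred */, new RDFXMLWriter(w));
            } finally {
                w.close();
            }
        } finally {
            cxn.close();
            repo.shutDown();
        }
    }
}
</pre>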

== MigrationUtility ==

The MigrationUtility class exists to facilitate data migration. This class was introduced after the 1.0.0 release, but the code is backwards compatible with that release. People seeking to migrate from 1.0.0 to 1.0.1 should check out both the 1.0.0 release and the 1.0.1 release, then copy the MigrationUtility class into the same package in the 1.0.0 release and compile a new jar (ant jar). You can then use the MigrationUtility to export data from the 1.0.0 release.

== Data Migration in Scale-Out ==

Data migration for a bigdata federation is more complex because of the data scales involved. If the KBs in the federation are using the told blank nodes mode, then export/import can follow the same patterns described above. However, if the cluster is using standard RDF blank node semantics, then export/import is harder: the data can only be reliably interchanged as a single massive RDF "document" or via special-purpose code designed to handle the isomorphism of a very large number of blank nodes between the two graphs.