Data Migration

Support Subscriptions

Customers with support subscriptions should contact their support provider.

Change Log for Backwards Compatibility Issues

This page provides information on changes which break binary compatibility, data migration procedures, and links to utilities which you can use to migrate your data from one bigdata version to another. We try to minimize the need for data migration as much as possible by building versioning information into the root blocks and persistent data structures. However, sometimes implementing a new feature or performance optimization requires us to make a change to bigdata which breaks binary compatibility with older data files. Typically this is because there is a change in the physical schema of the RDF database.

version 1.0.4 => 1.0.6

If a 1.0.4 journal was created with MIN_RELEASE_AGE greater than zero (0), then an exception will be reported when writing on the journal using 1.0.6. The default for MIN_RELEASE_AGE is zero (0), so this will only affect people who have explicitly configured deferred deletes (the recycler) instead of session protection. The exception is caused by an attempt to re-process deferred deletes associated with older commit records. There is a migration utility which fixes this by pruning the older commit records. This issue is also fixed in the 1.0.x maintenance branch after r6008. Releases after 1.0.6 do not encounter this problem when opening such a journal.

See Error releasing deferred frees using 1.0.6 against a 1.0.4 journal for the migration utility.

version 1.0.0 => 1.0.1

The following changes in 1.0.1 cause problems with backwards compatibility.

  1. https://sourceforge.net/apps/trac/bigdata/ticket/107 (Unicode clean schema names in the sparse row store).
  2. https://sourceforge.net/apps/trac/bigdata/ticket/124 (TermIdEncoder should use more bits for scale-out).
  3. https://sourceforge.net/apps/trac/bigdata/ticket/349 (TermIdEncoder limits Journal to 2B distinct RDF Values per triple/quad store instance).

These changes were applied to the 1.0.0 release branch:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/branches/BIGDATA_RELEASE_1_0_0

Please note: if you are already using the 1.0.0 release branch after r4863 then you do NOT need to migrate your data as these changes were already in the branch.

version 1.0.x => 1.1.0

The following changes in 1.1.0 cause problems with backwards compatibility. Of these, the main change was the introduction of the BLOBS index for large literals and URIs. This change in the physical schema of the RDF database made it impossible to maintain backward compatibility with the 1.0.x branch.

  1. http://sourceforge.net/apps/trac/bigdata/ticket/109 (Store large literals as "blobs")
  2. http://sourceforge.net/apps/trac/bigdata/ticket/401 (inline xsd:unsigned datatypes)
  3. http://sourceforge.net/apps/trac/bigdata/ticket/324 (Inline predeclared URIs and namespaces in 2-3 bytes)

version 1.1.0 => 1.2.0

session protection mode

If a 1.1.0 journal was created with MIN_RELEASE_AGE greater than zero (0), then an exception may be reported when writing on the journal using 1.2.0. The default for MIN_RELEASE_AGE is zero (0), so this will only affect people who have explicitly configured deferred deletes (the recycler) instead of session protection. The exception is caused by an attempt to re-process deferred deletes associated with older commit records. There is a migration utility which fixes this by pruning the older commit records. This issue is also fixed in the 1.1.x maintenance branch after r6008.

See Error releasing deferred frees using 1.0.6 against a 1.0.4 journal for the migration utility.

full text index

As part of the refactor to support a subject-centric full text index, the schema for the integrated bigdata value-centric full text index has been changed. The changes are (a) the term weight is now stored in the key within the B+Tree tuples; and (b) the term weight is now modeled by a single byte (rather than 4 or 8 bytes). These changes reduce the size on disk of the full text index and allow search results for a single keyword to be delivered in relevance order directly from the index without sorting.

These changes ONLY affect stores using the bigdata full text index. While this property is on by default, it is explicitly disabled in many of the sample property files. If you are uncertain, check your property file to see if this change will affect you:

com.bigdata.rdf.store.AbstractTripleStore.OPTIONS.FULL_TEXT_INDEX=true
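
The check described above can be scripted. The following is a minimal sketch, not part of the bigdata distribution: it loads a property file with the standard java.util.Properties class and reports whether the full text index is left enabled. The key string is copied verbatim from the line above, which is an assumption, since the literal key used by your release may differ (consult the AbstractTripleStore options javadoc). Because the index is on by default, a missing key is treated as enabled.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

/** Sketch: report whether a KB property file leaves the full text index enabled. */
final class FullTextIndexCheck {

    // Assumption: key copied verbatim from this page; the literal key in
    // your release may differ.
    static final String KEY =
            "com.bigdata.rdf.store.AbstractTripleStore.OPTIONS.FULL_TEXT_INDEX";

    /** The index is on by default, so a missing key counts as enabled. */
    static boolean isFullTextIndexEnabled(Properties p) {
        return Boolean.parseBoolean(p.getProperty(KEY, "true").trim());
    }

    public static void main(String[] args) throws IOException {
        Properties p = new Properties();
        // args[0] is the path to the journal / KB property file.
        try (InputStream in = new FileInputStream(args[0])) {
            p.load(in);
        }
        System.out.println("full text index enabled: " + isFullTextIndexEnabled(p));
    }
}
```

If this reports that the index is enabled, the schema change described in this section applies to your store.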

Data migration can be achieved through an export / import.

version 1.3.1 => 1.3.2 (Metabits Demi Spaces)

As of 1.3.2, new and old RWStore instances will be automatically converted to use a demi-space for the metabits if and only if the maximum size of the metabits region is exceeded. The maximum size of the metabits region is determined by the maximum allocator slot size, which defaults to the recommended value of 8k (8192 bytes). Before conversion, the metabits (which identify the addresses of the allocators) were stored in a single allocation slot on the store. After conversion the metabits are stored in two alternating demi-spaces near the head of the RWStore file structure. This conversion permits the addressing of more allocators than can be stored in an allocator slot. Older code is NOT able to read the RWStore after conversion. However, older code was unable to address more metabits than would fit into a single allocator and so could not have read or written on a store which addressed more than 8k metabits.

A utility class (MetabitsUtil.java) exists to convert between these two operational modes for the metabits. However, it is not possible to convert an RWStore to the older (non-demi-space) mode once the number of allocators is greater than the maximum slot size for the RWStore since the allocators can no longer be stored in an allocation slot.

If the maximum size of the allocators has been overridden from the default / recommended 8k, then the conversion point is also changed to the overridden maximum slot size.

  1. RWStore version before conversion: 0x0400
  2. RWStore version after conversion: 0x0500

See Support larger metabit allocations

version 2.0 => 2.1

In the Blazegraph 2.1 release, Lucene was updated to version 5.5.0, which uses different tokenization algorithms. As a result, text indexes created with previous versions of Blazegraph are incompatible and need to be rebuilt (see the Rebuild Text Index Procedure page). The following method is available:

  • Use the rebuildTextIndex.sh script. The script runs the RebuildTextIndex utility, which rebuilds all existing text indexes in a journal. Pass the path to the journal properties file as a parameter.

Usage example:

sh rebuildTextIndex.sh /opt/journal.properties

Data migration

The most straightforward way to migrate data between bigdata versions is an export/import pattern. The ExportKB utility described below may be used to facilitate this, but this is also easy to do within program code.

Background

Each bigdata instance may contain multiple RDF triple stores or quad stores (aka Knowledge Base or KB). Each KB has its own configuration options. If you have only one KB or if all of your KBs have the same configuration, then things are simpler. If you have KBs with different configurations then you will need to pay attention to the export/import procedure for each one.

Blank nodes

Standard RDF semantics for blank nodes requires that an export/import process maintains a mapping from the blank node ID to the internal BNode object used to model that blank node. This works fine as long as you export / import a KB as a single RDF document. However, if references to the same blank node ID appear in different RDF documents then they will be construed as distinct blank nodes!

Bigdata also supports a "told bnodes" option. When using this option, the blank node IDs are treated in much the same manner as URIs. They are stable identifiers which may be used to refer to the blank node. In this case, the interchange of RDF data may be broken down into multiple documents and blank node identity will be preserved.
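
The difference between the two modes can be illustrated with a small, self-contained sketch. It does not use any bigdata classes: DocumentScope and resolveTold are hypothetical names introduced only for illustration. Under standard semantics each parsed document owns its own mapping from blank node ID to internal node, so the same ID in two documents yields two nodes. Under told bnodes the ID itself is the identity.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: blank node identity under document scope vs. "told bnodes". */
final class BNodeScopeSketch {

    // Source of fresh internal node names (illustration only).
    private static long counter = 0;

    /** Standard semantics: each parsed document has its own ID -> node mapping. */
    static final class DocumentScope {
        private final Map<String, String> nodes = new HashMap<>();

        /** The same ID within one document always resolves to the same node. */
        String resolve(String bnodeId) {
            return nodes.computeIfAbsent(bnodeId, id -> "node" + (counter++));
        }
    }

    /** Told bnodes: the ID is a stable identifier, much like a URI. */
    static String resolveTold(String bnodeId) {
        return "_:" + bnodeId;
    }

    public static void main(String[] args) {
        DocumentScope doc1 = new DocumentScope();
        DocumentScope doc2 = new DocumentScope();
        // Same ID, same document: one node.
        System.out.println(doc1.resolve("b1").equals(doc1.resolve("b1")));
        // Same ID, different documents: distinct nodes.
        System.out.println(doc1.resolve("b1").equals(doc2.resolve("b1")));
        // Told bnodes: identity survives across documents.
        System.out.println(resolveTold("b1").equals(resolveTold("b1")));
    }
}
```

This is why a KB using standard blank node semantics should be exported and imported as a single document, while a told bnodes KB may be split across several.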

Statement identifiers

Bigdata supports three main modes for a KB: triples, triples with statement identifiers (SIDs), and quads. Statement identifiers provide for statements about statements. See Reification Done Right for how to interchange data containing statements about statements. (Support for the older RDF/XML interchange syntax for SIDs was removed in 1.3.1 when we introduced the RDR data interchange and query syntax. However, binary compatibility has been maintained so you can simply upgrade to a more recent version of bigdata and use the new data interchange and query mechanisms for RDR.)

Axioms and Inferences

When a KB contains materialized inferences you will typically want to export only the "told" triples (those explicitly written onto the KB by the application). After you import the data you can then recompute the materialized inferences. If you export the inferences and/or axioms as well then they will become "told" triples when you import the data into a new bigdata instance.
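
As a conceptual sketch of exporting only told triples, the example below filters a list of statements by how they entered the KB. The Kind enum and Stmt class are hypothetical stand-ins written for this illustration, not bigdata types. In practice the ExportKB utility described below exports only told triples/quads by default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: keep only "told" statements when exporting a KB with inferences. */
final class ToldOnlyExportSketch {

    /** Hypothetical labels for how a statement entered the KB. */
    enum Kind { TOLD, INFERRED, AXIOM }

    /** Hypothetical minimal statement: a triple plus its kind. */
    static final class Stmt {
        final String s, p, o;
        final Kind kind;

        Stmt(String s, String p, String o, Kind kind) {
            this.s = s;
            this.p = p;
            this.o = o;
            this.kind = kind;
        }
    }

    /** Returns the statements the application explicitly wrote, in order. */
    static List<Stmt> toldOnly(List<Stmt> all) {
        List<Stmt> out = new ArrayList<>();
        for (Stmt st : all) {
            if (st.kind == Kind.TOLD) {
                out.add(st);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Stmt> kb = Arrays.asList(
                new Stmt(":alice", "rdf:type", ":Student", Kind.TOLD),
                new Stmt(":Student", "rdfs:subClassOf", ":Person", Kind.TOLD),
                new Stmt(":alice", "rdf:type", ":Person", Kind.INFERRED),
                new Stmt("rdf:type", "rdf:type", "rdf:Property", Kind.AXIOM));
        // Only the two told statements are written out; the inferred
        // statement can be recomputed after import.
        for (Stmt st : toldOnly(kb)) {
            System.out.println(st.s + " " + st.p + " " + st.o);
        }
    }
}
```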

Export

The com.bigdata.rdf.sail.ExportKB class may be used to facilitate data migration. The ExportKB utility will write each KB onto a separate subdirectory. Both the configuration properties for the KB and the data will be written out. By default, only told triples/quads will be exported.

java -cp ... -server -Dlog4j.configuration=file:bigdata/src/resources/logging/log4j.properties com.bigdata.rdf.sail.ExportKB [options] propertyFile (namespace*)

Note: This class was introduced after the 1.0.0 release, but the code is backwards compatible with that release. People seeking to migrate from 1.0.0 to 1.0.1 should check out both the 1.0.0 release and the 1.0.1 release, then copy the ExportKB class into the same package in the 1.0.0 release and compile a new jar (ant jar). You can then use the ExportKB to export data from the 1.0.0 release. You can also download the ExportKB class directly from [1].

Import

Before you import your data make sure that the new KB is created with the appropriate configuration properties. If you want to change any configuration options for the KB, now is the time to do it. Simply edit the exported configuration properties file (or copy it to a new location and edit the copy).

If you have only a single KB instance on a Journal, then you just need to copy the exported configuration properties for your KB into the configuration properties for your new Journal.

If there are multiple KBs to be imported, then you need to first create the Journal and then create and import each KB in turn using its exported configuration file.
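
When several KBs were exported, a first step is to collect the configuration of each one. The sketch below walks an export directory and loads the properties file found in each KB subdirectory, using only JDK classes. The file name kb.properties is an assumption made for illustration; check the files actually written by your export. Creating each KB and loading its data is left to whichever import method you choose.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Sketch: collect the exported configuration of every KB under an export root. */
final class ExportedKbScan {

    // Assumption: name of the exported configuration file in each KB directory.
    static final String CONFIG_FILE = "kb.properties";

    /** Maps each KB subdirectory name to its configuration, sorted by name. */
    static Map<String, Properties> loadConfigs(Path exportRoot) {
        Map<String, Properties> configs = new TreeMap<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(exportRoot)) {
            for (Path dir : dirs) {
                Path cfg = dir.resolve(CONFIG_FILE);
                if (!Files.isRegularFile(cfg)) {
                    continue; // not an exported KB directory
                }
                Properties p = new Properties();
                try (InputStream in = Files.newInputStream(cfg)) {
                    p.load(in);
                }
                configs.put(dir.getFileName().toString(), p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return configs;
    }

    public static void main(String[] args) {
        // args[0] is the directory that the export was written into.
        for (Map.Entry<String, Properties> e : loadConfigs(Paths.get(args[0])).entrySet()) {
            // Here you would create the KB from e.getValue() and then import its data.
            System.out.println(e.getKey() + ": " + e.getValue().size() + " properties");
        }
    }
}
```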

Once you have created the Journal, there are a number of ways to import your data.

  1. The DataLoader class. See the javadoc for more detailed information.
  2. The NanoSparqlServer.
  3. The openrdf API.
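
As an illustration of the NanoSparqlServer option, the sketch below posts an exported RDF file to a SPARQL endpoint using the JDK's java.net.http client (Java 11 or later). It assumes the server accepts an RDF document as the body of a POST to its SPARQL endpoint. The endpoint URL and the content type are assumptions that depend on your deployment and export format, so check the NanoSparqlServer documentation for your release before relying on it.

```java
import java.io.FileNotFoundException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: POST an exported RDF file to a NanoSparqlServer endpoint. */
final class NssImportSketch {

    /** Builds a POST whose body is the RDF file; nothing is sent here. */
    static HttpRequest buildInsert(URI endpoint, Path rdfFile, String contentType) {
        try {
            return HttpRequest.newBuilder(endpoint)
                    .header("Content-Type", contentType)
                    .POST(HttpRequest.BodyPublishers.ofFile(rdfFile))
                    .build();
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; substitute the SPARQL endpoint of your server.
        URI endpoint = URI.create("http://localhost:9999/blazegraph/sparql");
        // args[0] is the RDF file to load; the content type assumes RDF/XML.
        HttpRequest request =
                buildInsert(endpoint, Paths.get(args[0]), "application/rdf+xml");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

For large data sets the DataLoader class is likely the better fit, since it loads files directly into the Journal without going through HTTP.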

Data Migration in Scale-Out

Data migration for a bigdata federation is more complex due to the data scales involved. If the KBs in the federation are using told blank nodes mode then export/import can be achieved using the same patterns described above. However, if the cluster is using standard RDF blank node semantics then export/import is more complex as the data can only be reliably interchanged as a single massive RDF "document" or via special purpose code designed to handle the isomorphism of a very large number of blank nodes between two graphs.