Bulk Data Load


DataLoader Utility

The DataLoader utility may be used to create and/or load RDF data into a local database instance. Directories are processed recursively. Data files may be compressed using zip or gzip, but the loader does not support multiple data files within a single archive.
The DataLoader provides some options that are not available through the standard NSS interfaces, including the ability to handle restart-safe queues when processing files in the file system (-durableQueues), the ability to avoid flushing the StatementBuffer between files (important when loading a large number of small files, such as for LUBM), and the ability to conveniently apply a very different performance configuration for the Journal without modifying RWStore.properties (a large number of relevant configuration properties can be overridden using -D on the command line).

Command line

java -cp *:*.jar com.bigdata.rdf.store.DataLoader [-quiet][-closure][-verbose][-format][-baseURI][-defaultGraph][-durableQueues][-namespace namespace] propertyFile (fileOrDir)*

If you're using the executable jar:

java -cp bigdata-bundled.jar com.bigdata.rdf.store.DataLoader [-quiet][-closure][-verbose][-format][-baseURI][-defaultGraph][-durableQueues][-namespace namespace] propertyFile (fileOrDir)*
Parameter definitions:

-quiet          Suppress all stdout messages.
-verbose        Show additional messages detailing the load performance.
-defaultGraph   Specify the default graph. This is required for quads mode.
-format         The format of the files (optional; when not specified, the format is deduced for each file in turn using the RDFFormat static methods).
-baseURI        The baseURI (optional; when not specified, the name of each file loaded is converted to a URL and used as the baseURI for that file).
-closure        Compute the RDF(S)+ closure. See also the Inference And Truth Maintenance page.
-durableQueues  Supports restart patterns by renaming files as .good or .fail. All files loaded into a given commit are renamed to .good; any file that cannot be loaded successfully is renamed to .fail. The files remain in their original directories.
-namespace      The namespace of the KB instance.
propertyFile    The configuration file for the database instance.
fileOrDir       Zero or more files or directories containing the data to be loaded.
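For example, a quads-mode load that sets a namespace and default graph and overrides a couple of DataLoader properties via -D might look like the following (the paths, namespace, and graph URI are only illustrative):

java -Dcom.bigdata.rdf.store.DataLoader.bufferCapacity=50000 -Dcom.bigdata.rdf.store.DataLoader.queueCapacity=10 -cp *:*.jar com.bigdata.rdf.store.DataLoader -namespace kb -defaultGraph http://example.org/graph -durableQueues /opt/data/upload/journal.properties /opt/data/upload/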

REST API

As of version 2.0.0, the DataLoader is also available via the REST API. This allows bulk loading into a running Blazegraph Nano Sparql Server (NSS). A guide to configuring the REST API load is here.

Version 2.0.0 deployments include a loadRestAPI.sh script. It takes one parameter: the file or directory (or a comma-separated list of files and directories) to load.

Usage example of loading data from several sources (file1, dir1, file2, dir2). Note that the list is passed as a single argument, so there must be no spaces after the commas:

sh loadRestAPI.sh file1,dir1,file2,dir2

The loadRestAPI.sh script:

#!/bin/bash

FILE_OR_DIR=$1

if [ -f "/etc/default/blazegraph" ] ; then
    . "/etc/default/blazegraph" 
else
    JETTY_PORT=9999
fi

LOAD_PROP_FILE=/tmp/$$.properties

export NSS_DATALOAD_PROPERTIES=/usr/local/blazegraph/conf/RWStore.properties

#Probably some unused properties below, but copied all to be safe.

cat <<EOT >> $LOAD_PROP_FILE
quiet=false
verbose=0
closure=false
durableQueues=true
#Needed for quads
#defaultGraph=
com.bigdata.rdf.store.DataLoader.flush=false
com.bigdata.rdf.store.DataLoader.bufferCapacity=100000
com.bigdata.rdf.store.DataLoader.queueCapacity=10
#Namespace to load
namespace=kb
#Files to load
fileOrDirs=$FILE_OR_DIR
#Property file (if creating a new namespace)
propertyFile=$NSS_DATALOAD_PROPERTIES
EOT

echo "Loading with properties..."

cat $LOAD_PROP_FILE

curl -X POST --data-binary @${LOAD_PROP_FILE} --header 'Content-Type:text/plain' http://localhost:${JETTY_PORT}/blazegraph/dataloader

#Let the output go to STDOUT/ERR to allow script redirection

rm -f $LOAD_PROP_FILE

Configuring

Parsing and insert/removal operations on the database are now decoupled from the index writes using the StatementBuffer. The StatementBuffer decouples the producer writing onto it from the read/resolve/write pattern onto the triple store. This is done with a blocking queue. The caller's process drives inserts into a StatementBuffer and can be adding or removing statements via the BigdataSailConnection, incremental truth maintenance, or the DataLoader. The StatementBuffer absorbs these inserts in batches of up to the configured bufferCapacity. Once the batch threshold is reached, the StatementBuffer evicts a batch onto a blocking queue.

The writer drains the blocking queue. If there are multiple batches in the queue, they are combined into a single larger batch. The writer then performs the necessary add/resolve for the RDF Values and adds or removes the statements from the triple store. As a special case, if the producer finishes before the first batch has been filled, the batch is written from the producer's thread to avoid the overhead of cloning the backing arrays. The main parameters for the StatementBuffer are the buffer capacity and the queue capacity.
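The following is a minimal Java sketch of the producer/consumer pattern described above. It is not Blazegraph's actual StatementBuffer implementation; the class, field, and method names are purely illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: a parser (producer) batches statements and races ahead of
// the index writer (consumer) through a bounded blocking queue.
public class BatchingSketch {

    static final int BUFFER_CAPACITY = 100_000; // statements per batch (bufferCapacity)
    static final int QUEUE_CAPACITY  = 10;      // batches queued ahead of the writer (queueCapacity)

    private final BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(QUEUE_CAPACITY);
    private List<String> batch = new ArrayList<>(BUFFER_CAPACITY);

    // Producer side: called once per parsed statement.
    public void add(String statement) throws InterruptedException {
        batch.add(statement);
        if (batch.size() >= BUFFER_CAPACITY) {
            queue.put(batch); // blocks if the writer falls QUEUE_CAPACITY batches behind
            batch = new ArrayList<>(BUFFER_CAPACITY);
        }
    }

    // Writer side: drains the queue, merging any waiting batches into one larger write.
    public void writerLoop() throws InterruptedException {
        while (true) {
            List<String> merged = new ArrayList<>(queue.take());
            List<String> next;
            while ((next = queue.poll()) != null) {
                merged.addAll(next);
            }
            writeToIndices(merged); // resolve RDF Values, then add/remove statements
        }
    }

    private void writeToIndices(List<String> statements) {
        // Placeholder for the actual index writes.
        System.out.println("wrote " + statements.size() + " statements");
    }
}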

Buffer Capacity

The capacity of the Statement[] buffer determines how many RDF Statements can be buffered before a batch is evicted to the queue. It is an optional property. DataLoader.Options.BUFFER_CAPACITY has a default of 100k statements.
A non-zero DataLoader.Options.QUEUE_CAPACITY can increase the effective amount of data being buffered quite significantly. Caution is recommended when overriding DataLoader.Options.BUFFER_CAPACITY in combination with a non-zero DataLoader.Options.QUEUE_CAPACITY. The best performance will probably come from small buffer capacity values (20k - 50k) combined with a queueCapacity of 5 - 20. Larger values will increase the GC burden and could require a larger heap, but the net throughput might also increase.
Use the following parameter in the properties file to change the default value:

com.bigdata.rdf.store.DataLoader.bufferCapacity=100000

or -D on the command line:

-Dcom.bigdata.rdf.store.DataLoader.bufferCapacity=100000

Queue Capacity

DataLoader.Options.QUEUE_CAPACITY is an optional property specifying the capacity of the blocking queue used by the StatementBuffer, or ZERO (0) to disable the blocking queue and perform synchronous writes. The blocking queue holds parsed data pending writes onto the backing store and makes it possible for the parser to race ahead while the writer is blocked writing onto the database indices. The default is 10 batches. Since the writer will merge any batches waiting in the queue, the actual size of a write batch can be up to bufferCapacity x 10 at the default settings.
Parameter in the properties file:

com.bigdata.rdf.store.DataLoader.queueCapacity=10

In command line:

-Dcom.bigdata.rdf.store.DataLoader.queueCapacity=10
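As an illustration of how the two options interact, a configuration along the lines recommended above (these specific values are only an example) might combine a smaller buffer with a deeper queue:

com.bigdata.rdf.store.DataLoader.bufferCapacity=50000
com.bigdata.rdf.store.DataLoader.queueCapacity=10

With these settings, the writer may merge the queued batches into a single write of up to roughly 50,000 x 10 = 500,000 statements.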

Ignore Fatal Parser Errors

When true, the loader will not abort on unrecoverable parse errors, but will instead skip the file containing the error. This option is useful when loading large inputs that may contain invalid RDF, so that the loading process does not fail outright when malformed files are encountered. Note that an error is still logged when a file cannot be loaded, so the files that failed can be tracked. The default value is false.
Parameter in the properties file:

com.bigdata.rdf.store.DataLoader.ignoreInvalidFiles=true

In command line:

-Dcom.bigdata.rdf.store.DataLoader.ignoreInvalidFiles=true

Examples

1. Load all files from the /opt/data/upload/ directory using the /opt/data/upload/journal.properties property file:

java -cp *:*.jar com.bigdata.rdf.store.DataLoader /opt/data/upload/journal.properties /opt/data/upload/


2. Load an archive /opt/data/data.nt.gz into a specified namespace using the /opt/data/upload/journal.properties property file:

java -cp *:*.jar com.bigdata.rdf.store.DataLoader -namespace someNameSpace /opt/data/upload/journal.properties /opt/data/data.nt.gz

If you are loading data with inferencing enabled, a temporary file will be created to compute the delta in entailments. This temporary file can grow extremely large when loading a big data set, which may cause a "no space left on device" error and, as a consequence, interrupt the data loading process. To avoid this situation, it is strongly recommended to set the DataLoader.Options.CLOSURE property to ClosureEnum.None in the properties file:

com.bigdata.rdf.store.DataLoader.closure=None

You may need to adjust the Java heap size to match the data size. In most cases 6 GB will be enough (add the Java parameter -Xmx6g). Also beware of setting the heap larger than 8 GB due to garbage collection pressure.

Then load the data using the DataLoader, passing it the -closure option:

java -Xmx6g -cp *:*.jar com.bigdata.rdf.store.DataLoader -closure /opt/data/upload/journal.properties /opt/data/upload/

The DataLoader will not do incremental truth maintenance during the load. Once the load is complete, it will compute all entailments. This is the "database-at-once" closure and does not use a temporary store to compute the delta in entailments, so the temporary store will not "eat your disk".

context not bound Exception

If you see an error like

ERROR: SPORelation.java:2303: java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: context not bound: < TermId(4U), TermId(18484U), TermId(18667L) : Explicit >
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: context not bound: < TermId(4U), TermId(18484U), TermId(18667L) : Explicit >
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at com.bigdata.rdf.spo.SPORelation.logFuture(SPORelation.java:2298)

then you need to specify the default graph. See REST_API#Context_Not_Bound_Error_.28Quads_mode_without_defaultGraph.29.
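For example, the default graph can be supplied on the command line when running the DataLoader directly (the graph URI below is only an example):

java -cp *:*.jar com.bigdata.rdf.store.DataLoader -defaultGraph http://example.org/defaultGraph /opt/data/upload/journal.properties /opt/data/data.nt.gz

When loading through the REST API script shown above, the equivalent is to uncomment and set the defaultGraph= line in the generated load properties.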