Bulk Data Load
DataLoader Utility
The DataLoader utility may be used to create and/or load RDF data into a local database instance. Directories will be processed recursively. The data files may be compressed using zip or gzip, but the loader does not support multiple data files within a single archive.
The DataLoader provides some options that are not available through the standard NSS interfaces: restart-safe queues when processing files in the file system (-durableQueues), the ability to not flush the StatementBuffer between files (important when loading many small files, such as for LUBM), and a convenient way to give the Journal a very different performance configuration without modifying RWStore.properties (a large number of the relevant configuration properties can be overridden using -D on the command line).
Command line
java -cp *:*.jar com.bigdata.rdf.store.DataLoader [-quiet][-closure][-verbose][-durableQueues][-namespace namespace] propertyFile (fileOrDir)*
If you're using the executable jar:
java -cp bigdata-bundled.jar com.bigdata.rdf.store.DataLoader [-quiet][-closure][-verbose][-durableQueues][-namespace namespace] propertyFile (fileOrDir)*
| parameter | definition |
|---|---|
| -quiet | Suppress all stdout messages. |
| -verbose | Show additional messages detailing the load performance. |
| -defaultGraph | Specify the default graph. This is required for quads mode. |
| -closure | Compute the RDF(S)+ closure. See also the Inference And Truth Maintenance page. |
| -durableQueues | Supports restart patterns by renaming files as .good or .fail. All files loaded into a given commit are renamed to .good. Any file that cannot be loaded successfully is renamed to .fail. The files remain in their original directories. |
| -namespace | The namespace of the KB instance. |
| propertyFile | The configuration file for the database instance. |
| fileOrDir | Zero or more files or directories containing the data to be loaded. |
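As an illustration of combining these flags with a -D property override (the jar name, namespace, and paths are placeholders taken from the examples elsewhere on this page):

java -Dcom.bigdata.rdf.store.DataLoader.bufferCapacity=100000 -cp bigdata-bundled.jar com.bigdata.rdf.store.DataLoader -verbose -durableQueues -namespace kb /opt/data/upload/journal.properties /opt/data/upload/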
REST API
As of version 2.0.0, the DataLoader is also available via the REST API, which allows bulk loading into a running Blazegraph Nano Sparql Server (NSS). A guide to configuring the REST API load is here.
The 2.0.0 deployers include a loadRestAPI.sh script. It takes a single parameter: the file or directory to load.
Usage example loading data from several sources (file1, dir1, file2, dir2), passed as a single comma-separated argument:
sh loadRestAPI.sh file1,dir1,file2,dir2
The loadRestAPI.sh script:
#!/bin/bash
FILE_OR_DIR=$1
if [ -f "/etc/default/blazegraph" ] ; then
    . "/etc/default/blazegraph"
else
    JETTY_PORT=9999
fi
LOAD_PROP_FILE=/tmp/$$.properties
export NSS_DATALOAD_PROPERTIES=/usr/local/blazegraph/conf/RWStore.properties
#Probably some unused properties below, but copied all to be safe.
cat <<EOT >> $LOAD_PROP_FILE
quiet=false
verbose=0
closure=false
durableQueues=true
#Needed for quads
#defaultGraph=
com.bigdata.rdf.store.DataLoader.flush=false
com.bigdata.rdf.store.DataLoader.bufferCapacity=100000
com.bigdata.rdf.store.DataLoader.queueCapacity=10
#Namespace to load
namespace=kb
#Files to load
fileOrDirs=$1
#Property file (if creating a new namespace)
propertyFile=$NSS_DATALOAD_PROPERTIES
EOT
echo "Loading with properties..."
cat $LOAD_PROP_FILE
curl -X POST --data-binary @${LOAD_PROP_FILE} --header 'Content-Type:text/plain' http://localhost:${JETTY_PORT}/blazegraph/dataloader
#Let the output go to STDOUT/ERR to allow script redirection
rm -f $LOAD_PROP_FILE
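The script simply writes a properties file and POSTs it to the dataloader endpoint. If you do not want the wrapper, a minimal sketch of the same request, assuming a local server on the script's default port 9999 (adjust the namespace, paths, and RWStore.properties location for your installation):

cat > /tmp/dataloader.properties <<EOT
quiet=false
verbose=0
closure=false
durableQueues=true
namespace=kb
fileOrDirs=/opt/data/upload/
propertyFile=/usr/local/blazegraph/conf/RWStore.properties
EOT
curl -X POST --data-binary @/tmp/dataloader.properties --header 'Content-Type:text/plain' http://localhost:9999/blazegraph/dataloader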
Configuring
Parsing, insert, and removal on the database are decoupled from the index writes using the StatementBuffer. The StatementBuffer decouples the producer writing onto it from the read/resolve/write pattern onto the triple store. This is done with a blocking queue. The caller's process drives inserts into a StatementBuffer and can be adding or removing statements via the BigdataSailConnection, incremental truth maintenance, or the DataLoader. The StatementBuffer absorbs these inserts in batches of up to the configured bufferCapacity. Once the batch threshold is reached, the StatementBuffer evicts a batch onto a blocking queue. The writer drains the blocking queue; if there are multiple batches in the queue, they are combined into a single larger batch. The writer then performs the necessary add/resolve for the RDF Values and adds or removes the statements from the triple store. As a special case, if the producer finishes before the first batch has been filled, the batch is written from the producer's thread to avoid the overhead of cloning the backing arrays. The main parameters for the StatementBuffer are the buffer capacity and the queue capacity.
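These and the related DataLoader properties can be set in the property file or overridden with -D. For example, to keep the StatementBuffer from being flushed between files (useful when loading many small files, as noted above; this is the same property the loadRestAPI.sh script sets):

com.bigdata.rdf.store.DataLoader.flush=false

or -D on the command line:

-Dcom.bigdata.rdf.store.DataLoader.flush=false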
Buffer Capacity
The capacity of the Statement[] buffer determines how many RDF Statements can be buffered before a batch is evicted to the queue. DataLoader.Options.BUFFER_CAPACITY is an optional property with a default of 100k.
DataLoader.Options.QUEUE_CAPACITY can increase the effective amount of data being buffered quite significantly. Use caution when overriding DataLoader.Options.BUFFER_CAPACITY in combination with a non-zero DataLoader.Options.QUEUE_CAPACITY. The best performance will probably come from small buffer capacity values (20k - 50k) combined with a queueCapacity of 5-20. Larger values increase the GC burden and could require a larger heap, but the net throughput might also increase.
Use the following parameter in the properties file to change the default value:
com.bigdata.rdf.store.DataLoader.bufferCapacity=100000
or -D on the command line:
-Dcom.bigdata.rdf.store.DataLoader.bufferCapacity=100000
Queue Capacity
DataLoader.Options.QUEUE_CAPACITY is an optional property specifying the capacity of the blocking queue used by the StatementBuffer, or ZERO (0) to disable the blocking queue and perform synchronous writes. The blocking queue holds parsed data pending writes onto the backing store and makes it possible for the parser to race ahead while the writer is blocked writing onto the database indices. The default is 10 batches. Since the writer will merge any batches waiting in the queue, the actual size of a write batch can be up to bufferCapacity x queueCapacity statements (10x the buffer capacity with the defaults).
Parameter in the properties file:
com.bigdata.rdf.store.DataLoader.queueCapacity=10
In command line:
-Dcom.bigdata.rdf.store.DataLoader.queueCapacity=10
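Putting the two options together, a tuning sketch along the lines suggested above (the specific values are illustrative only, not a recommendation for any particular data set):

com.bigdata.rdf.store.DataLoader.bufferCapacity=50000
com.bigdata.rdf.store.DataLoader.queueCapacity=10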
Ignore Fatal Parser Errors
When true, the loader will not break on unresolvable parse errors, but will instead skip the file containing the error. This option is useful when loading large inputs that may contain invalid RDF, so that the loading process does not fail completely when malformed files are encountered. Note that an error is still logged for each file that cannot be loaded, so the failed files can be tracked. The default value is false.
Parameter in the properties file:
com.bigdata.rdf.store.DataLoader.ignoreInvalidFiles=true
In command line:
-Dcom.bigdata.rdf.store.DataLoader.ignoreInvalidFiles=true
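For large, messy input sets this option can be combined with -durableQueues so that bad files are both skipped and renamed to .fail for later inspection. A sketch (paths as in the examples below):

java -Dcom.bigdata.rdf.store.DataLoader.ignoreInvalidFiles=true -cp *:*.jar com.bigdata.rdf.store.DataLoader -durableQueues /opt/data/upload/journal.properties /opt/data/upload/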
Examples
1. Load all files from the /opt/data/upload/ directory using the /opt/data/upload/journal.properties properties file:
java -cp *:*.jar com.bigdata.rdf.store.DataLoader /opt/data/upload/journal.properties /opt/data/upload/
2. Load the archive /opt/data/data.nt.gz into a specified namespace using the /opt/data/upload/journal.properties properties file:
java -cp *:*.jar com.bigdata.rdf.store.DataLoader -namespace someNameSpace /opt/data/upload/journal.properties /opt/data/data.nt.gz
If you are loading data with inferencing enabled, a temporary store will be created to compute the delta in entailments. That temporary store can grow extremely large when loading a big data set, which may cause a "no space left on device" error and interrupt the load. To avoid this, it is strongly recommended to set the DataLoader.Options.CLOSURE property to ClosureEnum.None in the properties file:
com.bigdata.rdf.store.DataLoader.closure=None
You may need to set the Java heap size to match the data size. In most cases 6G will be enough (add the java parameter -Xmx6g). Beware of setting the heap larger than 8G, as this increases garbage-collector pressure.
Then load the data using the DataLoader and pass it the -closure option:
java -Xmx6g -cp *:*.jar com.bigdata.rdf.store.DataLoader -closure /opt/data/upload/journal.properties /opt/data/upload/
The DataLoader will not do incremental truth maintenance during the load. Once the load is complete it will compute all entailments. This will be the "database-at-once" closure and will not use a temporary store to compute the delta in entailments. Thus the temporary store will not "eat your disk".
context not bound Exception
If you see an error like:
ERROR: SPORelation.java:2303: java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: context not bound: < TermId(4U), TermId(18484U), TermId(18667L) : Explicit >
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: context not bound: < TermId(4U), TermId(18484U), TermId(18667L) : Explicit >
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at com.bigdata.rdf.spo.SPORelation.logFuture(SPORelation.java:2298)
then you need to specify the default graph. See REST_API#Context_Not_Bound_Error_.28Quads_mode_without_defaultGraph.29.
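For the command-line DataLoader this means passing the -defaultGraph option listed in the parameter table above (assuming here that it takes the graph URI as its argument; the URI below is only a placeholder). For the REST-based load, set the defaultGraph property in the load properties file; it appears commented out in the loadRestAPI.sh script.

java -cp *:*.jar com.bigdata.rdf.store.DataLoader -defaultGraph http://example.org/defaultGraph /opt/data/upload/journal.properties /opt/data/data.nt.gz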