Bulk Data Load

DataLoader Utility

The DataLoader utility (API documentation: https://www.blazegraph.com/docs/api/com/bigdata/rdf/store/DataLoader.html) may be used to create and/or load RDF data into a local database instance. Directories are processed recursively. The data files may be compressed using zip or gzip, but the loader does not support multiple data files within a single archive.
The DataLoader provides some options that are not available through the standard NanoSparqlServer (NSS) interfaces: restart-safe queues when processing files in the file system (-durableQueues), the ability to skip flushing the StatementBuffer between files (important when loading many small files, such as for LUBM), and a convenient way to run the Journal with a very different performance configuration without modifying RWStore.properties (many relevant configuration properties can be overridden with -D on the command line, as shown in the sketch below).
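
For example, overriding Journal configuration properties with -D might look like the following sketch; the particular properties shown (the buffer mode and the B+Tree write retention queue capacity) are illustrative assumptions, not required settings:

java -cp *:*.jar \
    -Dcom.bigdata.journal.AbstractJournal.bufferMode=DiskRW \
    -Dcom.bigdata.btree.writeRetentionQueue.capacity=8000 \
    com.bigdata.rdf.store.DataLoader /opt/data/upload/journal.properties /opt/data/upload/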

Command line

java -cp *:*.jar com.bigdata.rdf.store.DataLoader [-quiet][-closure][-verbose][-durableQueues][-namespace namespace] propertyFile (fileOrDir)*

If you're using the executable jar:

java -cp bigdata-bundled.jar com.bigdata.rdf.store.DataLoader [-quiet][-closure][-verbose][-durableQueues][-namespace namespace] propertyFile (fileOrDir)*
Parameter        Definition
-quiet           Suppress all stdout messages.
-verbose         Show additional messages detailing the load performance.
-closure         Compute the RDF(S)+ closure.
-durableQueues   Supports restart patterns by renaming files to .good or .fail. All files loaded into a given commit are renamed to .good; any file that cannot be loaded successfully is renamed to .fail. The files remain in their original directories. (See the example after this table.)
-namespace       The namespace of the KB instance.
propertyFile     The configuration file for the database instance.
fileOrDir        Zero or more files or directories containing the data to be loaded.
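
For example, a restart-safe load of a directory might look like the following sketch (the .good/.fail file names shown are hypothetical outcomes of a run):

java -cp *:*.jar com.bigdata.rdf.store.DataLoader -durableQueues -verbose /opt/data/upload/journal.properties /opt/data/upload/

After the run, a file loaded into a commit appears as e.g. /opt/data/upload/file1.nt.good, while a file that failed to parse appears as /opt/data/upload/file2.nt.fail; failed files can be fixed in place and the loader rerun.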

Examples

1. Load all files from the /opt/data/upload/ directory using the /opt/data/upload/journal.properties properties file:

java -cp *:*.jar com.bigdata.rdf.store.DataLoader /opt/data/upload/journal.properties /opt/data/upload/
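
For reference, a minimal journal.properties for a plain-triples load might look like the following sketch; the journal file location and the property values are illustrative assumptions and should be adjusted to your deployment:

com.bigdata.journal.AbstractJournal.file=/opt/data/upload/blazegraph.jnl
com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
com.bigdata.rdf.store.AbstractTripleStore.quads=false
com.bigdata.rdf.store.AbstractTripleStore.textIndex=false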


2. Load an archive /opt/data/data.nt.gz into a specified namespace using the /opt/data/upload/journal.properties properties file:

java -cp *:*.jar com.bigdata.rdf.store.DataLoader -namespace someNameSpace /opt/data/upload/journal.properties /opt/data/data.nt.gz

If you are loading data with inference enabled, a temporary file will be created to compute the delta in entailments. This temporary file can grow extremely large when loading a big data set, which may cause a "no space left on device" error and, as a consequence, interrupt the data loading process. To avoid this, it is strongly recommended to set the DataLoader.Options.CLOSURE property to ClosureEnum.None in the properties file:

com.bigdata.rdf.store.DataLoader.closure=None
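
In context, the relevant fragment of the properties file might look like this sketch (the RdfsAxioms inference configuration is an illustrative assumption):

# KB with inference enabled (illustrative choice of RDFS axioms)
com.bigdata.rdf.store.AbstractTripleStore.axiomsClass=com.bigdata.rdf.axioms.RdfsAxioms
# defer closure so that no temporary store is used during the load
com.bigdata.rdf.store.DataLoader.closure=None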

You may need to set the Java heap size to match the size of the data. In most cases 6 GB is enough (add the Java parameter -Xmx6g). Also, beware of setting the heap larger than 8 GB due to garbage collection pressure.

Then load the data using the DataLoader and pass it the -closure option:

java -Xmx6g -cp *:*.jar com.bigdata.rdf.store.DataLoader -closure /opt/data/upload/journal.properties /opt/data/upload/

The DataLoader will not do incremental truth maintenance during the load. Once the load is complete, it will compute all entailments. This is the "database-at-once" closure; it does not use a temporary store to compute the delta in entailments, so the temporary store will not "eat your disk".