Bulk Data Load

Revision as of 18:40, 31 August 2015

The DataLoader utility can be used to create a local database instance and/or load RDF data into it. Directories are processed recursively. Data files may be compressed with zip or gzip, but the loader does not support multiple data files within a single archive.
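Because a single archive may contain only one data file, each file should be compressed individually before loading. A minimal shell sketch (paths and file names are illustrative, not from the original page):

```shell
# Create some sample N-Triples files (illustrative paths).
mkdir -p /tmp/upload-demo
printf '<urn:s> <urn:p> <urn:o1> .\n' > /tmp/upload-demo/a.nt
printf '<urn:s> <urn:p> <urn:o2> .\n' > /tmp/upload-demo/b.nt

# Compress each data file individually -- the DataLoader cannot read
# multiple data files out of a single archive.
for f in /tmp/upload-demo/*.nt; do
  gzip -f "$f"   # yields a.nt.gz, b.nt.gz: one data file per archive
done

ls /tmp/upload-demo
```

The resulting directory can then be passed to the DataLoader as a fileOrDir argument and will be processed recursively.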

Command line:

java -cp *:*.jar com.bigdata.rdf.store.DataLoader [-quiet][-closure][-verbose][-namespace namespace] propertyFile (fileOrDir)*

If you're using the executable jar:

java -cp bigdata-bundled.jar com.bigdata.rdf.store.DataLoader [-quiet][-closure][-verbose][-namespace namespace] propertyFile (fileOrDir)*
Parameters:

-quiet        Suppress all stdout messages.
-verbose      Show additional messages detailing the load performance.
-closure      Compute the RDF(S)+ closure.
-namespace    The namespace of the KB instance.
propertyFile  The configuration file for the database instance.
fileOrDir     Zero or more files or directories containing the data to be loaded.
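The propertyFile argument is a standard Java properties file configuring the database instance. A minimal sketch (the journal path is illustrative; consult the RWStore.properties file shipped with Blazegraph for the full set of options):

```properties
# Backing journal file for the database instance (path is illustrative)
com.bigdata.journal.AbstractJournal.file=/opt/data/upload/blazegraph.jnl
com.bigdata.journal.AbstractJournal.bufferMode=DiskRW

# Plain triples mode without a free-text index
com.bigdata.rdf.store.AbstractTripleStore.quads=false
com.bigdata.rdf.store.AbstractTripleStore.textIndex=false
```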

Examples:

1. Load all files from the /opt/data/upload/ directory using the /opt/data/upload/journal.properties properties file:

java -cp *:*.jar com.bigdata.rdf.store.DataLoader /opt/data/upload/journal.properties /opt/data/upload/


2. Load a gzipped file /opt/data/data.nt.gz using the /opt/data/upload/journal.properties properties file into a specified namespace:

java -cp *:*.jar com.bigdata.rdf.store.DataLoader -namespace someNameSpace /opt/data/upload/journal.properties /opt/data/data.nt.gz

If you are loading data with inferencing enabled, a temporary file is created to compute the delta in entailments. This temporary file can grow extremely large when loading a big data set, which may cause a "no space left on device" error and interrupt the load. To avoid this, it is strongly recommended to set the DataLoader.Options.CLOSURE property to ClosureEnum.None in the properties file:

com.bigdata.rdf.store.DataLoader.closure=None
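In context, the relevant fragment of the properties file might look like the following. The axioms class shown is an assumption, illustrating a typical inference-capable configuration; it is not prescribed by the original page:

```properties
# Defer closure computation; run it once after the load via -closure
com.bigdata.rdf.store.DataLoader.closure=None

# Keep an inference-capable KB (illustrative choice of axioms class)
com.bigdata.rdf.store.AbstractTripleStore.axiomsClass=com.bigdata.rdf.axioms.OwlAxioms
```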

You may need to increase the Java heap size to match the data size. In most cases 6 GB is enough (add the JVM parameter -Xmx6g). Be wary of setting the heap above 8 GB, as larger heaps increase garbage-collector pressure.

Then load the data using the DataLoader and pass it the -closure option:

java -Xmx6g -cp *:*.jar com.bigdata.rdf.store.DataLoader -closure /opt/data/upload/journal.properties /opt/data/upload/

The DataLoader will not perform incremental truth maintenance during the load. Once the load is complete, it computes all entailments at once. This is the "database-at-once" closure: it does not use a temporary store to compute the delta in entailments, so the temporary store cannot "eat your disk".