Get the code
The LUBM benchmark can be downloaded from . Directions on its use are available from the project home page. You can download a modified version of the LUBM benchmark which can make it a bit easier to use with bigdata from . Please contact the project maintainers if you have questions about this modified version of the LUBM benchmark.
Generate a data set
There are several "tricks" to getting a data set which you can work with.
- For easy access by the cluster, you should put this data onto some shared storage.
- The generator can overwhelm the file system because it puts all of the generated files into the same directory. Spread the files across multiple subdirectories instead; you can modify the generator code to do this automatically. This is a critical step, since otherwise the IO Wait while the OS locates files in one vast directory will swamp the load time of the cluster.
- The generator does not produce compressed files by default, so it is worthwhile to make the effort and compress them yourself. Use gzip to compress each file, giving you lots of ".gz" files. Again, you can modify the generator code to do this automatically.
- Make sure that the generated files can be read by whatever user/group is running bigdata.
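If you would rather not modify the generator, the last three points can be handled after generation with a small script. The following is only a sketch under stated assumptions: the function name, the numbered subdirectory layout, the default of 1000 files per directory, and the ".owl" file extension are all illustrative and are not part of the LUBM or bigdata distributions.

```shell
#!/usr/bin/env bash
# Sketch: spread generated files across numbered subdirectories, gzip each
# one, and make the result world-readable. All names here are hypothetical.

# shard_and_compress SRC_DIR DEST_DIR [FILES_PER_DIR]
# Prints the number of files processed.
shard_and_compress() {
  local src="$1" dest="$2" per_dir="${3:-1000}"
  local count=0 sub f

  mkdir -p "$dest"
  # Assumption: the generator wrote *.owl files directly into SRC_DIR.
  for f in "$src"/*.owl; do
    [ -e "$f" ] || continue   # nothing matched, the glob stayed literal
    # Files 0..per_dir-1 go to 00000, the next batch to 00001, and so on.
    sub=$(printf '%s/%05d' "$dest" $((count / per_dir)))
    mkdir -p "$sub"
    # gzip -c writes to stdout, so the original file is left in place.
    gzip -c "$f" > "$sub/$(basename "$f").gz"
    count=$((count + 1))
  done

  # Read access on files, read plus traverse on directories, for everyone.
  chmod -R a+rX "$dest"
  echo "$count"
}
```

For example, `shard_and_compress /data/lubm-raw /shared/lubm 1000` would write the compressed copies under `/shared/lubm/00000`, `/shared/lubm/00001`, and so on (both paths are placeholders). Tighten the `chmod` mode if only a specific user or group should read the data.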
Bulk load a data set
Edit the main bigdata configuration file and specify the data set to bulk load in the RDFDataLoadMaster configuration section. With the federation running, start the bulk load using RDFDataLoadMaster.sh. The same approach works with any data set. By default, the RDFDataLoadMaster is set up to load files from a shared volume, but the behavior is extensible and can be made to load from URLs, HDFS, etc.
nohup RDFDataLoadMaster.sh &
tail -f nohup.out
nohup is used because loading a large data set can run for hours and must survive the terminal session ending. If you have set up the ssh tunnel, you can watch the progress using the Excel worksheets.
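For a load that runs for hours, it can also help to record the process id so that you can check later whether the job is still alive. The helper below is a minimal sketch: the function names, the log file, and the pid file are hypothetical conveniences, not part of bigdata.

```shell
#!/usr/bin/env bash
# Sketch: launch a long-running command under nohup and remember its pid.
# Function and file names are illustrative only.

# start_load CMD LOG PIDFILE
# Runs CMD detached from the terminal, sending stdout and stderr to LOG.
start_load() {
  local cmd="$1" log="$2" pidfile="$3"
  nohup "$cmd" > "$log" 2>&1 &
  # nohup execs the command, so $! is the pid of the command itself.
  echo $! > "$pidfile"
}

# load_running PIDFILE
# Succeeds while the recorded process still exists.
load_running() {
  local pidfile="$1"
  # kill -0 sends no signal; it only tests whether the pid can be signalled.
  [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null
}
```

A session might then look like `start_load RDFDataLoadMaster.sh load.out load.pid` followed by `tail -f load.out`, with `load_running load.pid && echo "still loading"` run from any later login.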
Run the LUBM queries for the named KB instance. Start a NanoSparqlServer (NSS) instance from the command line (not the WAR) using the configuration file for the federation. You can then issue queries against the SPARQL endpoint exposed by the NSS. This procedure is nearly identical to the procedure for running LUBM against a single-machine bigdata deployment. See NanoSparqlServer#Scale-out_.28cluster_.2F_federation.29 for details.
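Once the NSS is up, any HTTP client that speaks the standard SPARQL 1.1 Protocol can submit queries. The sketch below uses curl to POST a form-encoded query. The function name is hypothetical, the endpoint URL in the usage comment is a placeholder (use the host, port, and path that your NSS instance reports), and the sample query is a generic smoke test rather than one of the LUBM queries.

```shell
#!/usr/bin/env bash
# Sketch: send a SPARQL query to an endpoint with curl.
# The function name and the example endpoint are illustrative only.

# sparql_select ENDPOINT QUERY
# POSTs QUERY as the form parameter "query" and prints the response body.
sparql_select() {
  local endpoint="$1" query="$2"
  # CURL can be overridden (for example with "echo") to inspect the
  # arguments without contacting a server.
  "${CURL:-curl}" -sS \
    -H 'Accept: application/sparql-results+xml' \
    --data-urlencode "query=$query" \
    "$endpoint"
}

# Usage, with a placeholder endpoint:
# sparql_select 'http://localhost:8080/sparql' \
#   'SELECT * WHERE { ?s ?p ?o } LIMIT 10'
```

Passing the query through `--data-urlencode` avoids hand-escaping the braces, spaces, and `?` characters that appear in almost every SPARQL query.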