IO Optimization

From Blazegraph
Jump to: navigation, search

See also

IO Profiles

Blazegraph has different IO profiles in the Journal and Scale-Out architectures. Briefly, range scans on indices on the Journal turn into random IOs. Key-range scans on the horizontally sharded cluster turn into sequential IOs because 99%+ of the data is in key-order on the disk in the scale-out architecture - this is a side effect of the dynamic sharding algorithm. In additional, writes on the Journal drive random IOs when pages are evicted to the disk. Writes in scale-out drive sequential IOs as blocks, are appended to the mutable journal and then periodically migrated onto the read-only, read-optimized index segments.

High IOPs

Blazegraph depends on fast disk access. Low-latency, and high IOPs disk is key to high performance. SSD offers a 10x performance boost for the Journal in the embedded, standalone, and highly available replication cluster deployment modes when compared to SAS or SATA disks. In addition, SATA disks are unable to reorder writes. This can cause a marked decrease in write throughput as the size of the backing file grows over time. Since 1.3.x, we have introduced a cache compaction algorithm into the Journal that significantly reduces IOs associated with large write transactions. This makes SATA more viable, but the performance gap to SSD is still an order of magnitude for query. SAS disks can reorder read and write operations, but still have a large rotational latency and lower seek times when compared to SSD. For enterprise class deployments, you can use either PCIe flash disk or SSD in combination with the HAJournalServer (high availability). AWS now offers an SSD backed instance that is a good match for Blazegraph.

File System Cache

If you have a lot of RAM, consider giving Blazegraph a relatively small JVM heap (4G) and leaving the rest for the OS to cache the file system. The relatively small JVM heap will help to avoid large GC pauses, while the file system cache will speed up disk access.

Note: With Java 7, the G1 garbage collector may allow you to use a larger JVM heap without causing large GC pauses.

Blazegraph Parameter Tuning

Write Retention Queue

The B+Tree buffers, nodes and leaves, on a hard reference queue known as the write retention queue. When a dirty node or leaf is evicted from the write retention queue with a reference count of zero, it is written onto the backing store. Since writes on the backing store drive IO, you can improve the write throughput substantially for heavy write workloads by increasing the capacity of the write retention queue. The default is 500, and uses less RAM. A good high throughput value is 8000, which is fine with larger heaps. This can be done for all indices, as illustrated below, or only for select indices as desired.

 com.bigdata.btree.writeRetentionQueue.capacity=8000

For applications where updates are made into a large store, a smaller retention value will reduce heap pressure and reduce node reads. Somewhat paradoxically a lower retention can result in more nodes held in memory.

 com.bigdata.btree.writeRetentionQueue.capacity=4000

Branching Factors

Blazegraph uses a configurable branching factor for each index. For the RWStore, if the configured branching factor causes the page to exceed an 8k boundary (after compression), then the page will be split across multiple backing pages. The out of the box configurations make some conservative assumptions, which can limit throughput, but help to minimize the chance for such overflow pages. The minimum branching factor is 3 and the maximum is 4096.

The best way to tune the branching factors is to load a representative data sample into a Journal and then use the DumpJournal utility with the -pages option to generate a histogram of the page slot sizes for each of the indices:

 com.bigdata.journal.DumpJournal -pages path-to-journal-file

If you are using the bundled-jar, you can execute

java -server -Xmx4g -cp bigdata-1.5.2-bundled.jar com.bigdata.journal.DumpJournal -pages (or -tuples) <path to journal file>

You can also do this using the REST API. Note that dump journal via the REST API is not supported while updates are occurring on the journal. See BLZG-1437.

 .../status?dumpJournal&dumpPages

This will provide you with a table containing lots of detail on the different indices. There is a tab-delimited table towards the bottom of the output. What you want to do is look at the histogram of the record sizes for each index. The last columns (newM and curM) will contain the current and recommended branching factors for each index (ignore any indices that are essentially empty since that skews the estimates). The recommended value for each index is in the newM field.

You do not want any indices to have a large number of page sizes that are over the 8k limit, since this means that the page will be broken into multiple pages. You also do not want a lot of small pages. The dictionary indices (lex.TERM2ID, lex.ID2TERM, and lex.BLOBS) generally have a lower branching factor than the statement indices. If you raised them all to 1024 then that will cause the dictionary indices to have a large proportion of overflow pages. This can cause a non-linear slowdown in the write rate over time. In general, we recommend a branching factor around 300 for TERM2ID and 800 for ID2TERM since the latter compresses better, but you should verify this for your own data sets.

The default branching factor can be set using:

com.bigdata.btree.BTree.branchingFactor=128

The branching factors can be overridden for the lexicon (aka dictionary) indices using:

# Bump up the branching factor for the lexicon indices on the default kb.
com.bigdata.namespace.kb.lex.com.bigdata.btree.BTree.branchingFactor=400

The branching factors can be overridden for the statement indices using:

# Bump up the branching factor for the statement indices on the default kb.
com.bigdata.namespace.kb.spo.com.bigdata.btree.BTree.branchingFactor=1024

You can override the branching factors for specific indices by giving the fully qualified name of the index in the override:

# Set the branching factor for "kb.lex.BLOBS" to the specified value.
com.bigdata.namespace.kb.lex.BLOBS.com.bigdata.btree.BTree.branchingFactor=256

Note: Bigdata supports multiple triple or quad stores in a single Journal or Scale-Out cluster instance. Each triple or quad store exists in its own namespace. The default namespace is "kb" - this is the value used in the examples above. If you are using a non-default namespace, then you need to replace the "kb" with your namespace.

Branching Example

Below is an example of this. The name=kb.spo.SPO is the name of the index. The m=1024 is the current branching factor. The newM=317 is the recommended branching factor.

name=kb.spo.SPO
Checkpoint{indexType=BTree,height=2,nnodes=115,nleaves=71673,nentries=50911902,..<snip/>}
addrMetadata=0, name=kb.spo.SPO, indexType=BTree, indexUUID=87c3c523-96bb-4ead-808f-6146d7e3cde7, branchingFactor=1024, ...<snip/> }
com.bigdata.btree.BTreePageStats{indexType=BTree,m=1024,...<snip/>...,newM=317}

In this example, to set the branching factor for the kb.spo.SPO index to the new recommended value of 317, create a new namespace with the property below.

com.bigdata.namespace.kb.spo.SPO.com.bigdata.btree.BTree.branchingFactor=317

You'll want to do this for each of the indices.

Getting the Branching Factors with Awk Using Curl or DumpJournal

If you have a running NanoSparqlServer, you can use curl and awk to produce the values. Assuming it is running on localhost port 9999, there is an example below using the LUBM benchmarking example. The namespace in the example is LUBM_U50.

The same awk commands can be run using the output of the DumpJournal utility as well for an embedded application.

Generating the Properties

Running

curl -d 'dumpPages' -d 'dumpJournal' 'http://localhost:9999/bigdata/status' | grep -E "^.*\.lex.*\t|^.*\.spo.*\t" | \
awk '{if($33 >3 && $33 <=4096) {printf("com.bigdata.namespace.%s.com.bigdata.btree.BTree.branchingFactor=%d;\n",$1,$33)}}' > branch.properties

or

 com.bigdata.journal.DumpJournal -pages path-to-journal-file | \
awk '{if($33 >3 && $33 <=4096) {printf("com.bigdata.namespace.%s.com.bigdata.btree.BTree.branchingFactor=%d;\n",$1,$33)}}' > branch.properties

produces the example (the specific values will be different for each namespace or KB).

$ more branch.properties 
com.bigdata.namespace.LUBM_U50.lex.BLOBS.com.bigdata.btree.BTree.branchingFactor=128;
com.bigdata.namespace.LUBM_U50.lex.ID2TERM.com.bigdata.btree.BTree.branchingFactor=898;
com.bigdata.namespace.LUBM_U50.lex.TERM2ID.com.bigdata.btree.BTree.branchingFactor=476;
com.bigdata.namespace.LUBM_U50.spo.OSP.com.bigdata.btree.BTree.branchingFactor=829;
com.bigdata.namespace.LUBM_U50.spo.POS.com.bigdata.btree.BTree.branchingFactor=1117;
com.bigdata.namespace.LUBM_U50.spo.SPO.com.bigdata.btree.BTree.branchingFactor=713;

Generating the Java Code

Running


curl -d 'dumpPages' -d 'dumpJournal' 'http://localhost:9999/bigdata/status' | grep -E "^.*\.lex.*\t|^.*\.spo.*\t" | \
awk '{if($33 >3 && $33 <=4096) {printf("props.setProperty(\"com.bigdata.namespace.%s.com.bigdata.btree.BTree.branchingFactor\",Integer.toString(%d));\n",\
$1,$33)}}' > branch.java

or

com.bigdata.journal.DumpJournal -pages path-to-journal-file | \
awk '{if($33 >3 && $33 <=4096) {printf("props.setProperty(\"com.bigdata.namespace.%s.com.bigdata.btree.BTree.branchingFactor\",Integer.toString(%d));\n",\
$1,$33)}}' > branch.java


produces the example (the specific values will be different for each namespace or KB).

$ more branch.java 
props.setProperty("com.bigdata.namespace.LUBM_U50.lex.BLOBS.com.bigdata.btree.BTree.branchingFactor",Integer.toString(128)); 
props.setProperty("com.bigdata.namespace.LUBM_U50.lex.ID2TERM.com.bigdata.btree.BTree.branchingFactor",Integer.toString(898));
props.setProperty("com.bigdata.namespace.LUBM_U50.lex.TERM2ID.com.bigdata.btree.BTree.branchingFactor",Integer.toString(476));
props.setProperty("com.bigdata.namespace.LUBM_U50.spo.OSP.com.bigdata.btree.BTree.branchingFactor",Integer.toString(829));
props.setProperty("com.bigdata.namespace.LUBM_U50.spo.POS.com.bigdata.btree.BTree.branchingFactor",Integer.toString(1117));
props.setProperty("com.bigdata.namespace.LUBM_U50.spo.SPO.com.bigdata.btree.BTree.branchingFactor",Integer.toString(713));

The Small Slot Optimization

The current advice to reduce IO in update transactions is:

  • Default the BTree branching factor of 256 .
  • Set the default BTree retention to 4000.
  • Enable the small slot optimization.
  • Override branching factors for OSP/OCSP and POS/POSC to 64.


To do this, you need to modify your properties file and/or specify the following when creating a new namespace within Blazegraph.

# Enable small slot optimization.
com.bigdata.rwstore.RWStore.smallSlotType=1024
# Set the default B+Tree branching factor.
com.bigdata.btree.BTree.branchingFactor=256
# Set the default B+Tree retention queue capacity.
com.bigdata.btree.writeRetentionQueue.capacity=4000

The branching factor overrides need to be made for each index in each triple store or quad store instance. For example, the following property will override the OSP index branching factor for the default bigdata namespace, which is “kb”. You need to do this for each namespace that you create.

com.bigdata.namespace.kb.lex.ID2TERM.com.bigdata.btree.BTree.branchingFactor=400
com.bigdata.namespace.kb.spo.SPO.com.bigdata.btree.BTree.branchingFactor=1024
com.bigdata.namespace.kb.spo.OSP.com.bigdata.btree.BTree.branchingFactor=64
com.bigdata.namespace.kb.spo.POS.com.bigdata.btree.BTree.branchingFactor=64

The small slot optimization will take effect when you restart bigdata. The changes to the write retention queue capacity and the branching factors will only take effect when a new triple store or quad store instance is created.

Write Cache Buffers

Blazegraph uses a pool of 1MB write cache buffers to defer the eviction of dirty records (mainly index pages) onto the disk. Starting with the 1.3.0 release, Blazegraph can also compact buffers and thereby defer their eviction to the disk. For large updates, this can lead to a very significant reduction in the write pressure on the disk. This feature is very important if you have slow disks, remote disks, or disks that cannot reorder the writes in order to reduce the seek latency (SATA disks can not do this). Write cache compaction is somewhat less important for SSD, but it can still significantly reduce the number of writes to the disk when you have large updates and a large number of write cache buffers.

To understand why this works, you need to understand a little bit about how the RWStore works. When a dirty index page is evicted from the write retention queue (see above), it is written onto an allocation slot on the RWStore. That write is then buffered on a write cache buffer, which is a 1MB block of native memory. As dirty index pages are evicted, the write cache buffers continue to fill up. Eventually, the list of clean write cache buffers will be reduced below the clean list threshold. At that point, the oldest dirty write cache buffer will be evicted to the disk. However, it is write common in long running transactions that index pages may become dirty more than once and evicted to the write cache. When this happens, the space for the old version of the record from the same update is immediately marked as "free space" on the corresponding write cache block. For a large update, this can happen many times thereby creating a large number of "holes" in the write cache buffers. When a dirty write cache buffer is evicted, the write cache service checks the amount of free space in the "holes" in that buffer. If there is more than a threshold of free space available (20%), the buffer is compacted and put back into the dirty pool. When the update commits, compaction is disabled and all dirty write caches are flushed to the disk.

# Set the number of 1MB write cache buffers (RWStore only).
com.bigdata.journal.AbstractJournal.writeCacheBufferCount=2000

The default value for this variable is either 6 (in the Java code) or 12 (in some of the template properties file) Since the write cache buffers are allocated on the native heap (using the Java NIO package, so this is 100% Java), you may need to increase the amount of native heap available to the JVM:

# Sample JVM options showing allocation of a 4GB managed object heap 
# and allowing a 3GB native heap. Always use the -server mode JVM for
# Blazegraph.
-server -Xmx4G -XX:MaxDirectMemorySize=3000m

As of 1.3.0, the write cache does NOT double as a read cache (there is some experimental code which implements this feature, but it is disabled by default). This means that data read from the disk is not installed into the write cache. Thus, you should size the write cache solely for your write workload. If you are bulk loading a large amount of data, then a larger number of write cache buffers is appropriate.

OS Specific Tuning

Linux (runs out of TCP ports)

A linux kernel can run out of free ports due to the TCP wait/reuse policy. This issue shows up as a NoRouteToHostException: {{{ java.net.NoRouteToHostException: Cannot assign requested address at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at sun.net.NetworkClient.doConnect(NetworkClient.java:180) at sun.net.www.http.HttpClient.openServer(HttpClient.java:378) at sun.net.www.http.HttpClient.openServer(HttpClient.java:473) at sun.net.www.http.HttpClient.<init>(HttpClient.java:203) at sun.net.www.http.HttpClient.New(HttpClient.java:290) at sun.net.www.http.HttpClient.New(HttpClient.java:306) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:995) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:931) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:849) at benchmark.testdriver.NetQuery.exec(NetQuery.java:79) at benchmark.testdriver.SPARQLConnection.executeQuery(SPARQLConnection.java:109) at benchmark.testdriver.ClientThread.run(ClientThread.java:74) java.net.NoRouteToHostException: Cannot assign requested address at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) }}}

The fix is to setup an appropriate TCP WAIT/Reuse policy as follows. This allows for thr reusing of sockets in the TIME_WAIT state for new connections when it is safe from the protocol viewpoint. The default value is 0 (disabled). Setting this allows benchmarks and heavily loaded servers to run without the exception traces above. See http://www.speedguide.net/articles/linux-tweaking-121 for the source for this fix/workaround {{{ echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse }}}