IO Optimization



IO Profiles

Bigdata has different IO profiles in the Journal and Scale-Out architectures. Briefly, key-range scans on indices on the Journal turn into random IOs. Key-range scans on the horizontally sharded cluster turn into sequential IOs because 99%+ of the data is in key order on the disk in the scale-out architecture - this is a side effect of the dynamic sharding algorithm. In addition, writes on the Journal drive random IOs when pages are evicted to the disk. Writes in scale-out drive sequential IOs as blocks are appended to the mutable journal and then periodically migrated onto the read-only, read-optimized index segments.

High IOPs

Bigdata depends on fast disk access. Low-latency, high IOPs disk is key to high performance. SSD offers a 10x performance boost for the Journal in the embedded, standalone, and highly available replication cluster deployment modes when compared to SAS or SATA disks. In addition, SATA disks are unable to reorder writes. This can cause a marked decrease in write throughput as the size of the backing file grows over time. Since 1.3.x, we have introduced a cache compaction algorithm into the Journal that significantly reduces the IOs associated with large write transactions. This makes SATA more viable, but the performance gap to SSD is still an order of magnitude for query. SAS disks can reorder read and write operations, but still have a large rotational latency and slower seeks than SSD. For enterprise class deployments, you can use either PCIe flash disk or SSD in combination with the HAJournalServer (high availability). AWS now offers an SSD-backed instance that is a good match for bigdata.

File System Cache

If you have a lot of RAM, consider giving bigdata a relatively small JVM heap (4G) and leaving the rest for the OS to cache the file system. The relatively small heap avoids large GC pauses, while the file system cache speeds up disk access.

Note: With Java 7, the G1 garbage collector may allow you to use a larger JVM heap without causing large GC pauses.
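For example (the exact heap size depends on your hardware and workload; this is only a sketch), you might combine a modest heap with G1:

 -server -Xmx4G -XX:+UseG1GC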

Bigdata Parameter Tuning

Write Retention Queue

The B+Tree buffers nodes and leaves on a hard reference queue known as the write retention queue. When a dirty node or leaf is evicted from the write retention queue with a reference count of zero, it is written onto the backing store. Since writes on the backing store drive IO, you can improve write throughput substantially for heavy write workloads by increasing the capacity of the write retention queue. The default is 500, which uses less RAM. A good high-throughput value is 8000, which is fine with larger heaps. This can be done for all indices, as illustrated below, or only for select indices as desired.

 com.bigdata.btree.writeRetentionQueue.capacity=8000

For applications where updates are made into a large store, a smaller retention value will reduce heap pressure and reduce node reads. Somewhat paradoxically, a lower retention value can result in more nodes being held in memory.

 com.bigdata.btree.writeRetentionQueue.capacity=4000
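If you configure bigdata programmatically rather than through a properties file, the same key can be set on the Properties object used to create the Journal or namespace, following the same pattern as the generated Java code shown later on this page:

 props.setProperty("com.bigdata.btree.writeRetentionQueue.capacity", "8000");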

Branching Factors

Bigdata uses a configurable branching factor for each index. For the RWStore, if the configured branching factor causes a page to exceed an 8k boundary (after compression), then the page will be split across multiple backing pages. The out-of-the-box configurations make some conservative assumptions which can limit throughput, but which help to minimize the chance of such overflow pages.

The best way to tune the branching factors is to load a representative data sample into a Journal and then use the DumpJournal utility with the -pages option to generate a histogram of the page slot sizes for each of the indices:

 com.bigdata.journal.DumpJournal -pages path-to-journal-file
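DumpJournal is run as a normal Java class, so the bigdata jar (or your application's classpath) must be on the classpath; the jar name below is only an illustration:

 java -cp bigdata.jar com.bigdata.journal.DumpJournal -pages path-to-journal-file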

You can also do this using the REST API:

 .../status?dumpJournal&dumpPages
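For example, assuming a NanoSparqlServer running on localhost port 9999 (as in the curl examples further below):

 curl 'http://localhost:9999/bigdata/status?dumpJournal&dumpPages'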

This will provide you with a table containing a lot of detail on the different indices. There is a tab-delimited table towards the bottom of the output. What you want to do is look at the histogram of the record sizes for each index. The last columns (curM and newM) contain the current and recommended branching factors for each index (ignore any indices that are essentially empty, since they skew the estimates). The recommended value for each index is in the newM field.

You do not want any index to have a large number of pages over the 8k limit, since such pages are broken into multiple backing pages. You also do not want a lot of small pages. The dictionary indices (lex.TERM2ID, lex.ID2TERM, and lex.BLOBS) generally have a lower branching factor than the statement indices. If you raised them all to 1024, the dictionary indices would have a large proportion of overflow pages. This can cause a non-linear slowdown in the write rate over time. In general, we recommend a branching factor around 300 for TERM2ID and 800 for ID2TERM since the latter compresses better, but you should verify this for your own data sets.

The default branching factor can be set using:

com.bigdata.btree.BTree.branchingFactor=128

The branching factors can be overridden for the lexicon (aka dictionary) indices using:

# Bump up the branching factor for the lexicon indices on the default kb.
com.bigdata.namespace.kb.lex.com.bigdata.btree.BTree.branchingFactor=400

The branching factors can be overridden for the statement indices using:

# Bump up the branching factor for the statement indices on the default kb.
com.bigdata.namespace.kb.spo.com.bigdata.btree.BTree.branchingFactor=1024

You can override the branching factors for specific indices by giving the fully qualified name of the index in the override:

# Set the branching factor for "kb.lex.BLOBS" to the specified value.
com.bigdata.namespace.kb.lex.BLOBS.com.bigdata.btree.BTree.branchingFactor=256
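For example, applying the starting points suggested above to the dictionary indices of the default kb namespace (treat these as starting points and verify them with DumpJournal against your own data):

 com.bigdata.namespace.kb.lex.TERM2ID.com.bigdata.btree.BTree.branchingFactor=300
 com.bigdata.namespace.kb.lex.ID2TERM.com.bigdata.btree.BTree.branchingFactor=800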

Note: Bigdata supports multiple triple or quad stores in a single Journal or Scale-Out cluster instance. Each triple or quad store exists in its own namespace. The default namespace is "kb" - this is the value used in the examples above. If you are using a non-default namespace, then you need to replace the "kb" with your namespace.
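For example, if your triple store were created in a namespace named LUBM_U50 (as in the examples further below), the statement index override above would become:

 com.bigdata.namespace.LUBM_U50.spo.com.bigdata.btree.BTree.branchingFactor=1024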

Branching Example

An example is shown below. The name=kb.spo.SPO is the name of the index, m=1024 is the current branching factor, and newM=317 is the recommended branching factor.

name=kb.spo.SPO
Checkpoint{indexType=BTree,height=2,nnodes=115,nleaves=71673,nentries=50911902,..<snip/>}
addrMetadata=0, name=kb.spo.SPO, indexType=BTree, indexUUID=87c3c523-96bb-4ead-808f-6146d7e3cde7, branchingFactor=1024, ...<snip/> }
com.bigdata.btree.BTreePageStats{indexType=BTree,m=1024,...<snip/>...,newM=317}

In this example, to set the branching factor for the kb.spo.SPO index to the new recommended value of 317, create a new namespace with the property below.

com.bigdata.namespace.kb.spo.SPO.com.bigdata.btree.BTree.branchingFactor=317

You'll want to do this for each of the indices.

Getting the Branching Factors with Awk Using Curl or DumpJournal

If you have a running NanoSparqlServer, you can use curl and awk to produce the values. Assuming it is running on localhost port 9999, the example below uses data from the LUBM benchmark. The namespace in the example is LUBM_U50.

The same awk commands can be run using the output of the DumpJournal utility as well for an embedded application.

Generating the Properties

Running

curl -d 'dumpPages' -d 'dumpJournal' 'http://localhost:9999/bigdata/status' | grep -E "^.*\.lex.*\t|^.*\.spo.*\t" | \
awk '{if($33 >3 && $33 <=4096) {printf("com.bigdata.namespace.%s.com.bigdata.btree.BTree.branchingFactor=%d;\n",$1,$33)}}' > branch.properties

or

 com.bigdata.journal.DumpJournal -pages path-to-journal-file | \
awk '{if($33 >3 && $33 <=4096) {printf("com.bigdata.namespace.%s.com.bigdata.btree.BTree.branchingFactor=%d;\n",$1,$33)}}' > branch.properties

produces output like the example below (the specific values will be different for each namespace or KB). The awk filter on $33 (the recommended branching factor, newM) skips essentially empty indices, whose estimates would otherwise skew the recommendations.

$ more branch.properties 
com.bigdata.namespace.LUBM_U50.lex.BLOBS.com.bigdata.btree.BTree.branchingFactor=128;
com.bigdata.namespace.LUBM_U50.lex.ID2TERM.com.bigdata.btree.BTree.branchingFactor=898;
com.bigdata.namespace.LUBM_U50.lex.TERM2ID.com.bigdata.btree.BTree.branchingFactor=476;
com.bigdata.namespace.LUBM_U50.spo.OSP.com.bigdata.btree.BTree.branchingFactor=829;
com.bigdata.namespace.LUBM_U50.spo.POS.com.bigdata.btree.BTree.branchingFactor=1117;
com.bigdata.namespace.LUBM_U50.spo.SPO.com.bigdata.btree.BTree.branchingFactor=713;

Generating the Java Code

Running


curl -d 'dumpPages' -d 'dumpJournal' 'http://localhost:9999/bigdata/status' | grep -E "^.*\.lex.*\t|^.*\.spo.*\t" | \
awk '{if($33 >3 && $33 <=4096) {printf("props.setProperty(\"com.bigdata.namespace.%s.com.bigdata.btree.BTree.branchingFactor\", \
Integer.toString(%d));\n",$1,$33)}}' > branch.java

or

com.bigdata.journal.DumpJournal -pages path-to-journal-file | \
 awk '{if($33 >3 && $33 <=4096) {printf("props.setProperty(\"com.bigdata.namespace.%s.com.bigdata.btree.BTree.branchingFactor\", \
Integer.toString(%d));\n",$1,$33)}}' > branch.java


produces output like the example below (the specific values will be different for each namespace or KB).

$ more branch.java 
props.setProperty("com.bigdata.namespace.LUBM_U50.lex.BLOBS.com.bigdata.btree.BTree.branchingFactor",Integer.toString(128)); 
props.setProperty("com.bigdata.namespace.LUBM_U50.lex.ID2TERM.com.bigdata.btree.BTree.branchingFactor",Integer.toString(898));
props.setProperty("com.bigdata.namespace.LUBM_U50.lex.TERM2ID.com.bigdata.btree.BTree.branchingFactor",Integer.toString(476));
props.setProperty("com.bigdata.namespace.LUBM_U50.spo.OSP.com.bigdata.btree.BTree.branchingFactor",Integer.toString(829));
props.setProperty("com.bigdata.namespace.LUBM_U50.spo.POS.com.bigdata.btree.BTree.branchingFactor",Integer.toString(1117));
props.setProperty("com.bigdata.namespace.LUBM_U50.spo.SPO.com.bigdata.btree.BTree.branchingFactor",Integer.toString(713));

Write Cache Buffers

Bigdata uses a pool of 1MB write cache buffers to defer the eviction of dirty records (mainly index pages) onto the disk. Starting with the 1.3.0 release, bigdata can also compact buffers and thereby defer their eviction to the disk. For large updates, this can lead to a very significant reduction in the write pressure on the disk. This feature is very important if you have slow disks, remote disks, or disks that cannot reorder writes in order to reduce seek latency (SATA disks cannot do this). Write cache compaction is somewhat less important for SSD, but it can still significantly reduce the number of writes to the disk when you have large updates and a large number of write cache buffers.

To understand why this works, you need to understand a little bit about how the RWStore works. When a dirty index page is evicted from the write retention queue (see above), it is written onto an allocation slot on the RWStore. That write is then buffered on a write cache buffer, which is a 1MB block of native memory. As dirty index pages are evicted, the write cache buffers continue to fill up. Eventually, the list of clean write cache buffers will be reduced below the clean list threshold. At that point, the oldest dirty write cache buffer will be evicted to the disk. However, it is quite common in long-running transactions for index pages to become dirty and be evicted to the write cache more than once. When this happens, the space for the old version of the record from the same update is immediately marked as "free space" on the corresponding write cache block. For a large update, this can happen many times, creating a large number of "holes" in the write cache buffers. When a dirty write cache buffer is evicted, the write cache service checks the amount of free space in the "holes" in that buffer. If more than a threshold (20%) of the buffer is free, the buffer is compacted and put back into the dirty pool. When the update commits, compaction is disabled and all dirty write caches are flushed to the disk.

# Set the number of 1MB write cache buffers (RWStore only).
com.bigdata.journal.AbstractJournal.writeCacheBufferCount=2000

Since the write cache buffers are allocated on the native heap (using the Java NIO package, so this is 100% Java), you may need to increase the amount of native heap available to the JVM:

# Sample JVM options showing allocation of a 4GB managed object heap 
# and allowing a 3GB native heap. Always use the -server mode JVM for
# bigdata.
-server -Xmx4G -XX:MaxDirectMemorySize=3000m
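As a rough sanity check, 2000 write cache buffers at 1MB each is about 2GB of native memory, so the 3GB MaxDirectMemorySize in the example above leaves some headroom for other direct buffer use.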

As of 1.3.0, the write cache does NOT double as a read cache (there is some experimental code which implements this feature, but it is disabled by default). This means that data read from the disk is not installed into the write cache. Thus, you should size the write cache solely for your write workload. If you are bulk loading a large amount of data, then a larger number of write cache buffers is appropriate.

OS Specific Tuning

Linux (runs out of TCP ports)

A Linux kernel can run out of free ports due to the TCP wait/reuse policy. This issue shows up as a NoRouteToHostException:

 java.net.NoRouteToHostException: Cannot assign requested address
     at java.net.PlainSocketImpl.socketConnect(Native Method)
     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
     at java.net.Socket.connect(Socket.java:579)
     at java.net.Socket.connect(Socket.java:528)
     at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
     at sun.net.www.http.HttpClient.openServer(HttpClient.java:378)
     at sun.net.www.http.HttpClient.openServer(HttpClient.java:473)
     at sun.net.www.http.HttpClient.<init>(HttpClient.java:203)
     at sun.net.www.http.HttpClient.New(HttpClient.java:290)
     at sun.net.www.http.HttpClient.New(HttpClient.java:306)
     at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:995)
     at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:931)
     at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:849)
     at benchmark.testdriver.NetQuery.exec(NetQuery.java:79)
     at benchmark.testdriver.SPARQLConnection.executeQuery(SPARQLConnection.java:109)
     at benchmark.testdriver.ClientThread.run(ClientThread.java:74)

The fix is to set up an appropriate TCP WAIT/reuse policy as follows. This allows sockets in the TIME_WAIT state to be reused for new connections when it is safe from a protocol viewpoint. The default value is 0 (disabled). Setting this allows benchmarks and heavily loaded servers to run without the exception traces above. See http://www.speedguide.net/articles/linux-tweaking-121 for the source of this fix/workaround.

 echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
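The echo above takes effect immediately but does not survive a reboot. A common way to make it persistent (standard Linux administration, not specific to bigdata) is to add the setting to /etc/sysctl.conf and reload it:

 # /etc/sysctl.conf
 net.ipv4.tcp_tw_reuse = 1

 # apply the change without rebooting
 sysctl -p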