Performance Optimization

From Blazegraph
Jump to: navigation, search

See also

Query performance or incremental data load are erratic (fast,slow)

For Java 6, we recommend using the parallel old generation GC mode:

 -XX:+UseParallelOldGC

This provides a reasonably stable performance throughout load and query workloads with occasional pauses for full GC passes. In combination with reasonable "chunk" sizes for query buffers, this gives good results across a wide range of workloads. Unfortunately, the CMS-I (incremental) GC mode does not appear to work well under heavy query workload mixtures such as BSBM.

For Java 7, many customers have reported good results using the G1 collector. See http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html#G1Options for more options.

 -XX:+UseG1GC

Turn off auto-commit in the SAIL

The journal is an append only data structure. A write on the B+Tree never overwrites the old data. Instead it writes a new revision of the node or leaf. If you are doing a commit for each statement loaded then that is the worst possible case. The scale-out architecture uses the Journal as a write buffer. Each time the journal on a data service fills up, a new journal is created and the buffered writes from the old journal are migrated onto read-optimized B+Tree files known as index segments (.seg files).

Batch is beautiful

Larger operations allow Blazegraph to exploit ordered reads and writes against the B+Tree data structures. This can be a 10x or better throughput improvement for many applications. Larger operations also reduce commit processing, which is slow because it must sync the disks.

Batch Size

If you are using the BigdataSail interface, SPARQL UPDATE, or the REST API, then the batch size for incremental index writes is specified using BigdataSail.Options.BUFFER_CAPACITY. For example, the following property will trigger incremental index writes after parsing 100,000 statements. This does NOT affect the ACID properties of the update, just the batch size for the parser before handing off a chunk of statements to the index writers. The update will not become visible when the index writes are applied (unless you are using the shard-wise ACID scale-out architecture).

com.bigdata.rdf.sail.bufferCapacity=100000

Parallel is perfect

Operations that can be readily parallelized will help your application to scale. This is especially true for scale-out. Parallel client processing for scale-out can be aggregated using the asynchronous write API, which then scatters writes and gathers results. The potential parallelism of writes is directly proportional to the index shards since each shard can have a separate writer thread. Readers never block for writers, so read-only transactions can increase your potential parallelism tremendously.

Do not use exact range counts

Blazegraph has fast range count estimates for its B+Trees. The fast range count is answered with just two key probes for any key range, even for scale-out. However, if delete markers are used (they are required for scale-out and transactional isolation) then the fast range count is an upper bound, and not an exact range count.

Blazegraph uses delete flags to mark deleted tuples and prevent read through to a historical version of the tuple. The deleted tuples persist in the B+Tree view until a compacting merge is done for that shard. One consequence of this architecture is that exact range counts in scale-out are expensive. They require a key-range scan across the relevant shards. That scan can be parallelized, but it is still very costly, it can flush your cache, etc.

Always use the fast range counts unless you have a very specific requirement and then be prepared to pay the price for the exact range count. A similar problem exists if you request the range count with an iterator filter. The filter should then be discarded and the fast range count should be requested unless you really need to know the exact as filtered range count.

The fast range count API is also exposed through the REST API.

Platform specific performance tips

Linux Performance Tips

By default, Linux will start swapping out processes proactively once about 1/2 of the RAM has been allocated. This is done to ensure that RAM remains available for processes which might run, but it basically denies you 1/2 of your RAM for the processes that you want to run. Changing vm.swappiness from its default (60) to ZERO (0) fixes this. This is NOT the same as disabling swap. The host will just wait until the free memory is exhausted before it begins to swap.

To change the swappiness kernel parameter, do the following as root on each host:

sysctl -w vm.swappiness=0

Blazegraph uses threads to provide a concurrent evaluation for queries, operators with a query, and even concurrent evaluation passes for the same operator in a query (for thread safe operators). On Linux, native threads are counted as processes. The default limits on some platforms and user accounts may restrict your ability to start enough threads leading to the following root cause:

Caused by: java.lang.OutOfMemoryError: unable to create new native thread

You can use ulimit -u to review and change the maximum number of processes for the user.

To report the maximum number of processes for the current user:

ulimit -u 

To set the maximum number of processes for the current user:

ulimit -u 20000

There are also configuration files that may be used to set this limit. Review the documentation for your platform.

Windows Performance Tips

  1. Set to optimize scheduling for programs (rather than background processes) and memory for programs (rather than the system cache).
  2. On win64 platforms, the kernel can wind up consuming all available memory for IO heavy applications, resulting in heavy swapping and poor performance. The links below can help you to understand this issue and include a utility to specify the size of the file system cache. [As of Windows Server 2008 R2, the dynamic file system cache size problem is supposed to be fixed and the DynCache tool is no longer supported. See the first link below for more information on changes in R2.]
     - Microsoft Windows Dynamic Cache Service
     
     - Ntdebugging Blog

Compliance Matrix

We generally develop and test using Java 6 and Java 7 (but not the earlier builds or either releases). Most testing is done under Centos 5.4, Ubantu 12.04, Windows XP, and Windows Server 2008. The clustered database and HA are tested under Ubantu 12.04. For the clustered database install, we recommend Sun JDK 1.6.0_17 as Sun JDK 1.6.0_18, 19, and 20 are known to cause segfaults for the clustered database (but not when running a standalone Journal).

Prior to 1.6.0_18 there is a JVM bug, which can cause lost wakeups. We generally recommend 1.6.0_17 because it also works for scale-out, but you might want to specify the following option on the command line as a work around.

-XX:+UseMembar