Standalone Guide

There are two distinct persistence store modes for standalone bigdata instances. This page briefly describes the two modes and offers guidance on when to choose one over the other. Both the WORM and RW modes support HighAvailability based on replicating writes from a master along a failover chain.

Standalone Modes

WORM

The WORM (Write-Once, Read-Many) mode is the traditional log-structured, append-only journal. It was designed for very fast write rates and is used to buffer writes for scale-out. It is a good choice for immortal databases where access to ALL history is required. It scales to several billion triples.

The WORM mode is selected with the following option.

 com.bigdata.journal.AbstractJournal.bufferMode=Disk
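
The same selection can be made programmatically when embedding bigdata. This is a minimal sketch, assuming the Journal(Properties) constructor and the Options/BufferMode constants from com.bigdata.journal; verify them against the javadoc for your release.

 import java.util.Properties;
 
 import com.bigdata.journal.BufferMode;
 import com.bigdata.journal.Journal;
 import com.bigdata.journal.Options;
 
 public class WormExample {
     public static void main(String[] args) {
         final Properties p = new Properties();
         // "Disk" selects the WORM journal, per the property line above.
         p.setProperty(Options.BUFFER_MODE, BufferMode.Disk.toString());
         p.setProperty(Options.FILE, "worm.jnl");
         final Journal jnl = new Journal(p);
         try {
             // ... application reads and writes ...
         } finally {
             jnl.close();
         }
     }
 }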

RW

The RW (Read-Write) store supports the recycling of allocation slots on the backing file. It may be used as a time-bounded version of an immortal database where history is aged off the database over time. This is a good choice for standalone workloads where updates arrive continuously and older database states may be released. The RW store is also less sensitive to data skew because it can recycle B+Tree node and leaf revisions within a commit group during large data set loads. It scales to 50B+ triples or quads.

The RW mode is selected with the following option.

 com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
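
For the RW mode, the sketch below also registers a named B+Tree index, writes one tuple, and commits, making the write restart-safe. It assumes the IndexMetadata and registerIndex/getIndex APIs of bigdata 1.x releases; check the javadoc for your version.

 import java.util.Properties;
 import java.util.UUID;
 
 import com.bigdata.btree.IndexMetadata;
 import com.bigdata.journal.BufferMode;
 import com.bigdata.journal.Journal;
 import com.bigdata.journal.Options;
 
 public class RWExample {
     public static void main(String[] args) {
         final Properties p = new Properties();
         // "DiskRW" selects the RW store, per the property line above.
         p.setProperty(Options.BUFFER_MODE, BufferMode.DiskRW.toString());
         p.setProperty(Options.FILE, "rw.jnl");
         final Journal jnl = new Journal(p);
         try {
             // Register a named B+Tree index, insert one tuple, and commit.
             jnl.registerIndex(new IndexMetadata("example", UUID.randomUUID()));
             jnl.getIndex("example").insert(new byte[]{1}, new byte[]{42});
             jnl.commit();
         } finally {
             jnl.close();
         }
     }
 }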

Scale-out

Both stores play a role in scale-out. The RW store is used for the metadata service (the shard locator) and to aggregate performance counters (the load balancer). The WORM is used to buffer writes at the data services. Buffered writes on a data service are migrated asynchronously onto read-optimized B+Tree files (called IndexSegment files). Periodically, those index segments are merged. When the merged segments for a shard grow large enough, the shard is split.

Given the scaling characteristics of the RW store, people may ask when they need scale-out. The answer is when you want the aggregate throughput of a cluster, which is much higher than that of a single node, or when the total data scale far exceeds what is reasonable on a single node. The point of scale-out is to break down machine boundaries by managing the data in dynamically allocated shards, so that data sets from tens of billions of triples to trillions of triples can be handled through incremental scaling.
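
A deliberately simplified sketch of this overflow cycle follows. None of the names below are bigdata's actual scale-out API; the types and the split threshold are hypothetical and only illustrate the buffer, migrate, merge, and split flow.

 import java.util.ArrayList;
 import java.util.List;
 
 public class OverflowSketch {
 
     // Hypothetical split threshold; the real policy is configurable.
     static final long MAX_SHARD_BYTES = 200L * 1024 * 1024;
 
     static class Shard {
         long bufferedBytes; // writes buffered on the WORM journal
         final List<Long> segments = new ArrayList<>(); // IndexSegment file sizes
     }
 
     // Asynchronous migration of buffered writes onto a read-optimized IndexSegment.
     static void buildSegment(Shard s) {
         s.segments.add(s.bufferedBytes);
         s.bufferedBytes = 0;
     }
 
     // Periodic merge; returns true when the merged segment is large enough to split.
     static boolean mergeAndCheckSplit(Shard s) {
         long merged = s.segments.stream().mapToLong(Long::longValue).sum();
         s.segments.clear();
         s.segments.add(merged);
         return merged > MAX_SHARD_BYTES;
     }
 }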

Binary compatibility

The RW and WORM modes DO NOT have binary compatibility. They use very different internal structures to manage their allocations on the persistence store. However, they have very good logical compatibility. For the most part, applications should run over either store without change. The only exceptions would be applications that deliberately take advantage of the RW store's features for aging out history.
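
Because the compatibility is logical rather than binary, the store mode can be left to deployment-time configuration: in the minimal sketch below, the same application code opens either mode by reading the bufferMode property from a file. Note that an existing journal file cannot simply be reopened under the other mode, so switching modes means exporting and reloading the data.

 import java.io.FileInputStream;
 import java.io.IOException;
 import java.util.Properties;
 
 import com.bigdata.journal.Journal;
 
 public class OpenFromConfig {
     public static void main(String[] args) throws IOException {
         final Properties p = new Properties();
         // journal.properties sets ...bufferMode=Disk or ...bufferMode=DiskRW.
         try (FileInputStream in = new FileInputStream("journal.properties")) {
             p.load(in);
         }
         final Journal jnl = new Journal(p);
         try {
             // ... identical application logic for either store mode ...
         } finally {
             jnl.close();
         }
     }
 }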

Embedded use of standalone modes

Both the WORM and RW modes can operate with either a very small memory footprint or a very large heap if you choose the appropriate configuration options. Beyond the memory allocated to the JVM, the configuration options that most influence performance are the size and number of write cache buffers (these are direct memory buffers), the size of the write retention queue, and the capacity of the global LRU.
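
As a sketch, those knobs might be set as shown below. The property names are assumptions drawn from the Options interfaces of bigdata 1.x releases and should be verified against the javadoc (com.bigdata.journal.Options and com.bigdata.btree.IndexMetadata.Options) before use.

 import java.util.Properties;
 
 import com.bigdata.journal.Journal;
 
 public class TuningSketch {
     public static void main(String[] args) {
         final Properties p = new Properties();
         // Number of direct-memory write cache buffers (assumed option name).
         p.setProperty("com.bigdata.journal.AbstractJournal.writeCacheBufferCount", "12");
         // Write retention queue capacity for B+Tree indices (assumed option name).
         p.setProperty("com.bigdata.btree.writeRetentionQueue.capacity", "8000");
         // The global LRU capacity is also configurable; see the com.bigdata.LRUNexus options.
         final Journal jnl = new Journal(p);
         jnl.close();
     }
 }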

The set of dependencies can also be pruned, depending on your application requirements and deployment goals.

OpenRDF Sesame Server

See Using_Bigdata_with_the_OpenRDF_Sesame_HTTP_Server if you are trying to install bigdata with the Sesame Server.