Cluster Startup FAQ

From Blazegraph

Please see the ClusterGuide for the general bigdata configuration and federation start procedure. This page provides some additional tips to help you debug your configuration and get the federation up and running.

General Debug Procedure

The general procedure is to visit the host which will be starting the various misc services, source the installed bigdataenv script ("source .../bin/bigdataenv") to set the environment, and then run "bigdata start" by hand while the cluster run state is at "status". Monitor the error log and the console to see whether or not the misc services start correctly. Once zookeeper and jini are up, the other services can start as well.
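The manual-start steps above can be sketched roughly as follows. The /nas/bigdata path here is a hypothetical stand-in for your actual ${NAS} shared directory:

```shell
# A sketch only; /nas/bigdata stands in for your actual ${NAS} directory.
NAS=/nas/bigdata

source "${NAS}/bin/bigdataenv"   # set the bigdata environment variables
cat "${NAS}/state"               # confirm the target run state is "status"
"${NAS}/bin/bigdata" start       # start the misc services by hand
tail -f "${NAS}/error.log"       # watch for startup errors
```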

When bringing up a new cluster it is a good idea to follow this procedure on the misc services node(s) and then once on a node of each of the other service types (a ClientService node and a DataService node). That way you know that all of the different service classes can start correctly. At that point you can change the run state to "start" and the rest of the nodes should come up, unless there are configuration or networking issues with the nodes themselves.

Once the services are starting normally, listServices.sh will report on which services are running on which nodes. Compare the output of this to your configuration plan to verify that all services are running.

The various files mentioned above have the following locations:

* bigdataenv is in ${NAS}/bin
* bigdata is in ${NAS}/bin
* state is ${NAS}/state (this is the target run state file)
* error.log is ${NAS}/error.log
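A quick way to sanity-check an install is to verify that these files exist. This is a sketch; the default path is hypothetical and should be replaced with your real ${NAS}:

```shell
# Check the installed files; uses ${NAS} if set, else a hypothetical default.
NAS="${NAS:-/nas/bigdata}"
for f in bin/bigdataenv bin/bigdata state error.log; do
    if [ -e "${NAS}/${f}" ]; then
        echo "found:   ${NAS}/${f}"
    else
        echo "MISSING: ${NAS}/${f}"
    fi
done
```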

For more detail on how to configure and start a bigdata federation, please see the ClusterGuide.

Common problems

Unknown host 'XXX'

There are several host name values that must be configured in the top-level build.properties file prior to a cluster install. Failure to edit build.properties properly before installing can result in exceptions such as the following:

com.bigdata.service.jini.MetadataServer : log4j:ERROR Could not find address of [XXX].
com.bigdata.service.jini.MetadataServer : java.net.UnknownHostException: XXX
com.bigdata.service.jini.MetadataServer : at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
com.bigdata.service.jini.MetadataServer : at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850) 
com.bigdata.service.jini.MetadataServer : at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201) 
com.bigdata.service.jini.MetadataServer : at java.net.InetAddress.getAllByName0(InetAddress.java:1154) 
com.bigdata.service.jini.MetadataServer : at java.net.InetAddress.getAllByName(InetAddress.java:1084) 
com.bigdata.service.jini.MetadataServer : at java.net.InetAddress.getAllByName(InetAddress.java:1020) 
com.bigdata.service.jini.MetadataServer : at java.net.InetAddress.getByName(InetAddress.java:970) 

Exception occurred during unicast discovery

INFO: exception occured during unicast discovery to 192.168.0.3:4160
with constraints InvocationConstraints[reqs: {}, prefs: {}]
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:519)
        at java.net.Socket.connect(Socket.java:469)

This stack trace is normal. Jini logs this message if registrar discovery lookup fails for a given IP address. However, this is how bigdata tests whether or not jini is running on a host where jini is configured to start. If the lookup fails, then jini SHOULD be started automatically.

Configuration requires exact match for host names (build.properties only)

The build.properties file has some values that are host names. For the build.properties file ONLY, it is critical that the configured value for each such property be exactly the value reported on that host by the *hostname* command. The bigdata and bigdataup shell scripts do exact string comparisons on the host names in order to handle some conditional service bootstrapping. Those string comparisons will fail if the hostname command reports a different value. The main configuration file is more flexible, since the host names it contains are resolved using DNS.
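Because the scripts use exact string comparison, it is worth checking the match by hand on each host. In this sketch, "blade1" is a hypothetical value copied out of build.properties:

```shell
# "blade1" is a hypothetical host name taken from build.properties.
CONFIGURED="blade1"
if [ "$(hostname)" = "${CONFIGURED}" ]; then
    echo "hostname matches build.properties"
else
    echo "MISMATCH: hostname reports '$(hostname)', build.properties says '${CONFIGURED}'"
fi
```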

Bad /etc/hosts

The jini services can have problems connecting to hosts in the cluster if DNS is not set up correctly. It is generally sufficient to have correct entries in /etc/hosts on each host.
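For example, an /etc/hosts along these lines (the host names and addresses are hypothetical) lets every node resolve every other node without relying on an external DNS server:

```
127.0.0.1     localhost
192.168.0.2   blade1
192.168.0.3   blade2
192.168.0.4   blade3
```

The same entries should appear on every host so that each name resolves to the same address cluster-wide.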

Swapping

Make sure that you have issued the following command on each node. This turns down the Linux kernel's eagerness to swap out process memory and will let you use the entire RAM on the node without swapping. It DOES NOT disable swapping.

sysctl -w vm.swappiness=0
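Note that sysctl -w only changes the value for the running kernel. To make the setting survive a reboot, it also needs to go into /etc/sysctl.conf (both commands require root):

```shell
sysctl -w vm.swappiness=0                     # takes effect immediately
echo 'vm.swappiness = 0' >> /etc/sysctl.conf  # re-applied at boot
```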

Also, make sure that you are not assigning too much heap to your Java processes. You need to leave enough heap available for the operating system, for the file cache, and for the C heap of the JVM itself.

Problems forking child processes or Spurious RMI Exceptions

Make sure that you have enough swap space allocated on the nodes. If there is not enough, the kernel may decide that it cannot commit to supporting more potential allocations and refuse to fork a child process ("Could not allocate memory"). It appears that this can also cause RMI failures under some conditions. You can mitigate this by reducing the size of the direct buffers allocated by bigdata or by adjusting the kernel's overcommit behavior. See this thread for more details: http://forums.sun.com/thread.jspa?messageID=9834041#9834041
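To inspect the current situation on a node, the following Linux-specific commands show the configured swap and the kernel's overcommit policy:

```shell
free -m                              # the "Swap:" row shows total/used/free swap in MB
cat /proc/sys/vm/overcommit_memory   # 0 = heuristic, 1 = always allow, 2 = strict accounting
```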

NanoSparqlServer

ClassCastException

The full stack trace for this might not be visible, as it is logged at WARN rather than ERROR:

java.lang.ClassCastException: com.bigdata.btree.BTree cannot be cast to com.bigdata.service.ndx.IClientIndex
	at com.bigdata.relation.rule.eval.AbstractJoinNexus.locatorScan(AbstractJoinNexus.java:433)
	at com.bigdata.relation.rule.eval.pipeline.DistributedJoinMasterTask.mapBindingSet(DistributedJoinMasterTask.java:335)
	at com.bigdata.relation.rule.eval.pipeline.DistributedJoinMasterTask.start(DistributedJoinMasterTask.java:274)
	at com.bigdata.relation.rule.eval.pipeline.JoinMasterTask.call(JoinMasterTask.java:385)
	at com.bigdata.relation.rule.eval.RunRuleAndFlushBufferTask.call(RunRuleAndFlushBufferTask.java:59)
	at com.bigdata.relation.rule.eval.RunRuleAndFlushBufferTask.call(RunRuleAndFlushBufferTask.java:19)
	at com.bigdata.relation.rule.eval.AbstractStepTask.runOne(AbstractStepTask.java:307)

The root cause is an RDF database in scale-out with truth maintenance enabled. The specific exception is thrown because truth maintenance uses a local (non-sharded) temporary triple store to compute the fixed point of the statements added to or removed from the KB. However, the sharded evaluation of the rules is not expecting a purely local index. This is why it sees a BTree when it is expecting an IClientIndex (the remote API for a sharded index).

Truth maintenance requires that ALL commits be serialized, which is a non-starter in scale-out. The pattern for using inference in scale-out relies on bulk updates followed by re-computation of the RDF(S)+ closure. The clients are then notified of the new read-behind point once the entailments have been computed.

The most likely reason you are seeing this stack trace is that you did not explicitly set up the correct properties in the cluster configuration file when you started the NanoSparqlServer. It then registered an RDF database using defaults, which are neither appropriate nor optimized for scale-out. The proper procedure for creating a scale-out KB in this manner is described in the deployment section of the NanoSparqlServer page. To see the as-configured KB metadata, you can enable logging in com.bigdata.rdf.sail.webapp.BigdataRDFServletContextListener or use the NanoSparqlServer status query with the URL query option "?showKBInfo=true".
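For example, assuming the server is listening on localhost port 8080 (the host, port, and context path are all deployment-specific), the status query can be issued with curl:

```
curl 'http://localhost:8080/status?showKBInfo=true'
```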