SOLR External Fulltext Search

This is an example of setting up Blazegraph ExternalFullTextSearch with SOLR 6.1.0. The example assumes you are using Ubuntu. It uses N-Triples data and indexes the rdf-schema#label predicate into a SOLR core called blazegraph.

Prerequisites

Install Java 8

SOLR 6.1.0 requires Java 8.

apt-get install python-software-properties
add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java8-installer

Download Solr 6.1.0

Download the Solr 6.1.0 archive and unpack it under /opt. Run as root:

cd /opt
wget http://mirrors.ocf.berkeley.edu/apache/lucene/solr/6.1.0/solr-6.1.0.tgz
tar zxf solr-6.1.0.tgz
cd solr-6.1.0

Start SOLR.

root@blazegraph:/opt/solr-6.1.0# ./bin/solr start
Waiting up to 30 seconds to see Solr running on port 8983 [/]  
Started Solr server on port 8983 (pid=16296). Happy searching!

SOLR Setup

Create a directory for the Core

Create the directories for the configuration and data, then copy in the default basic_configs configuration files.

cd /opt/solr-6.1.0
mkdir -p server/solr/blazegraph/conf
mkdir -p server/solr/blazegraph/data
cp -rf server/solr/configsets/basic_configs/conf/* server/solr/blazegraph/conf/

Create the Core

Now create a new core named blazegraph to index the data:

curl -F action=CREATE \
-F name=blazegraph \
-F instanceDir=/opt/solr-6.1.0/server/solr/blazegraph \
-F config=solrconfig.xml \
-F dataDir=data \
http://localhost:8983/solr/admin/cores
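
To confirm that the core was registered, the CoreAdmin STATUS action can be queried. This is a minimal sketch, assuming SOLR is still listening on localhost:8983 as started above; the variable name STATUS_URL is only illustrative.

```shell
# Assumes SOLR is running on localhost:8983 and the core is named blazegraph.
STATUS_URL="http://localhost:8983/solr/admin/cores?action=STATUS&core=blazegraph&wt=json"

# Print the core status as JSON, or a hint if SOLR cannot be reached.
curl -s "$STATUS_URL" || echo "SOLR is not reachable at localhost:8983" >&2
```

If the core exists, the response should include a status entry for blazegraph that reports its instance and data directories.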

SOLR Indexing

The next step is to load the data. In this example, a small shell script uses a Perl regular expression to extract the rdfs:label values from data in the N-Triples format and format them as JSON documents to be indexed into SOLR. JSON was chosen because SOLR's JSON loader proved more robust to special characters than the CSV representation.

The URI (subject) is stored in the id field and the text of the English label is stored in the label_t field. The _t suffix means that it is a dynamic SOLR schema field of type text.

label2JSON.sh

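#!/bin/bash
# Usage: label2JSON.sh [file.nt]   (reads stdin when no file is given)
# Emits a JSON array with one document per English rdfs:label triple.

echo "[ "
# Keep rdfs:label triples tagged exactly @en (drop regional tags such as @en-gb).
grep "rdf-schema#label" "${1:-/dev/stdin}" | grep "@en" | grep -v "@en-" | \
perl -n -e 'if (/<([^>]+)>[^<]+<([^>]+)>.*\"(.*)\"@.*$/) { printf("%s { \"id\" : \"%s\", \"label_t\":  \"%s\" }\n", $comma, $1, $3); $comma = " , " }'
echo " ]"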

Load the data using the label2JSON.sh script

Use the SOLR post tool.

zcat /data/rdf/rdfdata.nt.gz | \
./label2JSON.sh  | \
./bin/post -type application/json -c blazegraph -out yes -d 

Wait for this to complete. You may want to run it with nohup or wrap it in a script.
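
One way to do that is sketched below. The wrapper name load_labels.sh and the log file load.log are hypothetical, and the sketch assumes it is run from /opt/solr-6.1.0 with label2JSON.sh in the same directory.

```shell
# Write a small wrapper around the load pipeline (file name is arbitrary).
cat > load_labels.sh <<'EOF'
#!/bin/bash
# Hypothetical wrapper; run it from /opt/solr-6.1.0.
set -o pipefail
zcat /data/rdf/rdfdata.nt.gz | ./label2JSON.sh | \
  ./bin/post -type application/json -c blazegraph -out yes -d
# Ask SOLR how many documents the core now holds (rows=0 returns only the count).
curl -s "http://localhost:8983/solr/blazegraph/select?q=*:*&rows=0&wt=json"
EOF
chmod +x load_labels.sh

# Run the load in the background so it keeps going if the terminal closes.
nohup ./load_labels.sh > load.log 2>&1 &
```

Progress and the final document count can then be followed with tail -f load.log.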

Example Queries

You can now use the ExternalFullTextSearch within your SPARQL queries.

PREFIX fts: <http://www.bigdata.com/rdf/fts#>
SELECT ?res ?score ?snippet WHERE {
  ?res fts:search "Blazegraph" .
  ?res fts:endpoint "http://localhost:8983/solr/blazegraph/select" .
  ?res fts:endpointType  "SOLR" .
  ?res fts:timeout "100000" .
  ?res fts:score ?score .
  ?res fts:snippet ?snippet . 
  ?res fts:params "fl=id,label_t" .
  ?res fts:searchField "id" .
  ?res fts:fieldToSearch "label_t" .
  ?res fts:snippetField "label_t" .
  ?res fts:searchResultType "URI" .
}
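
The query can also be submitted from the shell. This is a minimal sketch, assuming Blazegraph's SPARQL endpoint is at http://localhost:9999/blazegraph/sparql; that URL is an assumption and should be adjusted to your deployment. The file name search.rq is arbitrary.

```shell
# Assumed endpoint; change it to match your Blazegraph deployment.
BLAZEGRAPH_URL="http://localhost:9999/blazegraph/sparql"

# Save the example query to a file (search.rq is an arbitrary name).
cat > search.rq <<'EOF'
PREFIX fts: <http://www.bigdata.com/rdf/fts#>
SELECT ?res ?score ?snippet WHERE {
  ?res fts:search "Blazegraph" .
  ?res fts:endpoint "http://localhost:8983/solr/blazegraph/select" .
  ?res fts:endpointType  "SOLR" .
  ?res fts:timeout "100000" .
  ?res fts:score ?score .
  ?res fts:snippet ?snippet .
  ?res fts:params "fl=id,label_t" .
  ?res fts:searchField "id" .
  ?res fts:fieldToSearch "label_t" .
  ?res fts:snippetField "label_t" .
  ?res fts:searchResultType "URI" .
}
EOF

# POST the URL-encoded query and request SPARQL results as JSON.
curl -s -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query@search.rq' "$BLAZEGRAPH_URL" \
  || echo "Blazegraph is not reachable at $BLAZEGRAPH_URL" >&2
```

Keeping the query in a file avoids shell quoting problems with the double-quoted literals inside the SPARQL text.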