SOLR External Fulltext Search
This is an example of setting up Blazegraph ExternalFullTextSearch with SOLR 6.1.0. The example assumes you are using Ubuntu. It uses N-Triples data and indexes the rdf-schema#label predicate into a SOLR core called blazegraph.
Prerequisites
Install Java 8
apt-get install python-software-properties
add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java8-installer
Download Solr 6.1.0
Download Solr 6.1.0 from http://archive.apache.org/dist/lucene/solr/6.1.0/. Then run as root:
cd /opt
wget http://mirrors.ocf.berkeley.edu/apache/lucene/solr/6.1.0/solr-6.1.0.tgz
tar zxf solr-6.1.0.tgz
cd solr-6.1.0
Start SOLR.
root@blazegraph:/opt/solr-6.1.0# ./bin/solr start
Waiting up to 30 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=16296). Happy searching!
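When the setup is scripted rather than typed interactively, it can help to wait until Solr actually answers HTTP requests before moving on. The following is a minimal sketch, not part of the original setup: wait_for_url is a hypothetical helper name, and the URL in the usage comment assumes the default port 8983 shown in the startup output.

```shell
# Poll a URL once per second until it answers or the attempt budget runs out.
# $1 = URL to probe, $2 = maximum number of attempts.
# Returns 0 as soon as curl gets a successful response, 1 otherwise.
wait_for_url() {
  attempts=0
  while [ "$attempts" -lt "$2" ]; do
    # -s: quiet, -f: treat HTTP errors as failures, -o /dev/null: discard the body
    if curl -sf -o /dev/null "$1"; then
      return 0
    fi
    attempts=$((attempts + 1))
    sleep 1
  done
  return 1
}

# Usage (requires a running Solr on the assumed default port):
#   wait_for_url "http://localhost:8983/solr/admin/cores?action=STATUS" 30
```

The bundled ./bin/solr status command reports the same information interactively; the helper is only useful inside a longer script.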
SOLR Setup
Create a directory for the Core
Create the directories for the configuration and data, using the default basic_configs as the starting configuration.
cd /opt/solr-6.1.0
mkdir -p server/solr/blazegraph/conf
mkdir -p server/solr/blazegraph/data
cp -rf server/solr/configsets/basic_configs/conf/* server/solr/blazegraph/conf/
Create the Core
Now create the new CORE named blazegraph to index the data:
curl -F action=CREATE \
     -F name=blazegraph \
     -F instanceDir=/opt/solr-6.1.0/server/solr/blazegraph \
     -F config=solrconfig.xml \
     -F dataDir=data \
     http://localhost:8983/solr/admin/cores
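To confirm that the core was registered, the same CoreAdmin endpoint can be queried with action=STATUS. This is a minimal sketch under the assumptions used above (Solr on localhost:8983, core named blazegraph); core_status_url and core_status are hypothetical helper names introduced only for illustration.

```shell
# Assumed defaults taken from the steps above; override via the environment if needed.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-blazegraph}"

# Build the CoreAdmin STATUS URL for one core.
# $1 = Solr base URL, $2 = core name. wt=json requests a JSON response.
core_status_url() {
  printf '%s/admin/cores?action=STATUS&core=%s&wt=json\n' "$1" "$2"
}

# Fetch the status document for the configured core (requires a running Solr).
core_status() {
  curl -s "$(core_status_url "$SOLR_URL" "$CORE")"
}

# Usage:
#   core_status
```

If the core exists, the response should contain an entry for blazegraph including its instanceDir and dataDir.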
SOLR Indexing
The next step is to load the data. In this example, we have written a small shell script with a Perl regex to extract the rdfs:label from data in the N-Triples format and format it as JSON documents to be indexed into SOLR. JSON was chosen because SOLR's loader handled special characters more robustly in JSON than in the CSV representation.
The URI (subject) is stored in the id field and the text of the English label is stored in the label_t field. The _t suffix means that it is a dynamic SOLR schema field of type text.
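The mapping from one triple to one SOLR document can be illustrated on a single line of input. The sketch below is a simplified stand-in using sed rather than the Perl-based script that follows; ntriple_to_doc is a hypothetical helper name and the example.org URI is made-up sample data.

```shell
# Convert one English rdfs:label triple into a JSON document with id and label_t.
# $1 = a single N-Triples line. Prints nothing when the line is not an @en label.
ntriple_to_doc() {
  printf '%s\n' "$1" |
    sed -n 's/^<\([^>]*\)> <[^>]*> "\(.*\)"@en \.$/{ "id": "\1", "label_t": "\2" }/p'
}

# Sample input (hypothetical data):
ntriple_to_doc '<http://example.org/resource/1> <http://www.w3.org/2000/01/rdf-schema#label> "Blazegraph"@en .'
# prints: { "id": "http://example.org/resource/1", "label_t": "Blazegraph" }
```

The subject URI becomes the document id, and the literal text without its language tag becomes label_t.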
label2JSON.sh
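#!/bin/bash
# Convert English rdfs:label triples (N-Triples) into a JSON array of SOLR documents.
# Reads the file given as the first argument, or stdin if no argument is supplied.
echo "[ "
# Keep only rdfs:label triples tagged @en; drop regional variants such as @en-gb.
cat "${1:-/dev/stdin}" | grep "rdf-schema#label" | grep "@en" | grep -v "@en-" | \
  perl -n -e 'if (/<([^>]+)>[^<]+<([^>]+)>.*\"(.*)\"@.*$/) { printf("%s { \"id\" : \"%s\", \"label_t\": \"%s\" }\n", $comma, $1, $3); $comma = " , " }'
# $comma is set only after a document has been printed, so the array never starts with a comma.
echo " ]"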
Load the data using the label2JSON.sh script
Use the SOLR post tool.
zcat /data/rdf/rdfdata.nt.gz | \
  ./label2JSON.sh | \
  ./bin/post -type application/json -c blazegraph -out yes -d
Wait for this to complete. You may want to run it under nohup or wrap it in a script.
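Once the load has finished, a quick sanity check is to ask SOLR how many documents the core now holds. The following is a rough sketch, assuming the core URL used in this example and assuming the JSON response contains "numFound":N without whitespace; extract_num_found and count_documents are hypothetical helper names, and the sed extraction is a shortcut rather than a real JSON parser.

```shell
# Assumed core URL from the steps above; override via the environment if needed.
SOLR_CORE_URL="${SOLR_CORE_URL:-http://localhost:8983/solr/blazegraph}"

# Pull the numFound value out of a Solr JSON response read from stdin.
extract_num_found() {
  sed -n 's/.*"numFound":\([0-9][0-9]*\).*/\1/p'
}

# Match all documents but return none of them (rows=0), so only the count comes back.
# Requires a running Solr.
count_documents() {
  curl -s "$SOLR_CORE_URL/select?q=*:*&rows=0&wt=json" | extract_num_found
}

# Usage:
#   count_documents
```

The number printed should roughly match the number of English labels in the input data.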
Example Queries
Now you can use the ExternalFullTextSearch within your SPARQL queries.
PREFIX fts: <http://www.bigdata.com/rdf/fts#>
SELECT ?res ?score ?snippet WHERE {
  ?res fts:search "Blazegraph" .
  ?res fts:endpoint "http://localhost:8983/solr/blazegraph/select" .
  ?res fts:endpointType "SOLR" .
  ?res fts:timeout "100000" .
  ?res fts:score ?score .
  ?res fts:snippet ?snippet .
  ?res fts:params "fl=id,label_t" .
  ?res fts:searchField "id" .
  ?res fts:fieldToSearch "label_t" .
  ?res fts:snippetField "label_t" .
  ?res fts:searchResultType "URI" .
}
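The query above can also be submitted from the command line. This is a minimal sketch, not part of the original example: the Blazegraph endpoint URL http://localhost:9999/blazegraph/sparql is an assumption about a default standalone deployment, and QUERY_FILE and run_query are hypothetical names.

```shell
# Assumption: Blazegraph SPARQL endpoint; adjust host, port, and path to your deployment.
BLAZEGRAPH_SPARQL="${BLAZEGRAPH_SPARQL:-http://localhost:9999/blazegraph/sparql}"
QUERY_FILE="${QUERY_FILE:-fts_query.rq}"

# Save the example query to a file so that no shell quoting is needed.
cat > "$QUERY_FILE" <<'EOF'
PREFIX fts: <http://www.bigdata.com/rdf/fts#>
SELECT ?res ?score ?snippet WHERE {
  ?res fts:search "Blazegraph" .
  ?res fts:endpoint "http://localhost:8983/solr/blazegraph/select" .
  ?res fts:endpointType "SOLR" .
  ?res fts:timeout "100000" .
  ?res fts:score ?score .
  ?res fts:snippet ?snippet .
  ?res fts:params "fl=id,label_t" .
  ?res fts:searchField "id" .
  ?res fts:fieldToSearch "label_t" .
  ?res fts:snippetField "label_t" .
  ?res fts:searchResultType "URI" .
}
EOF

# POST the query; curl URL-encodes the file contents into the "query" form field.
# Requires running Blazegraph and Solr instances.
run_query() {
  curl -s "$BLAZEGRAPH_SPARQL" \
    -H 'Accept: application/sparql-results+json' \
    --data-urlencode "query@$QUERY_FILE"
}

# Usage:
#   run_query
```

Keeping the query in a file avoids escaping the double quotes inside the fts: parameters on the command line.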