External Full Text Search

Introduction

In addition to its internal FullTextSearch capabilities, Blazegraph provides the means to query external fulltext indices from within SPARQL. The current implementation supports the widely used fulltext index Apache Solr (via the Solr HTTP API). In principle, the design is flexible: other fulltext search services can be supported by implementing and registering a simple connector class.

Note that this feature is not intended to (and does not) replace Blazegraph's internal full text index as described at FullTextSearch. Rather, the purpose of the current Solr implementation is to support hybrid search against external resources (which may contain data going beyond what is stored in your Blazegraph instance), not to identify matches within Blazegraph's internal graph database.

Using the External Fulltext Search Service

To integrate external free text search services with high-level query, Blazegraph defines several magic predicates that are given special meaning: when encountered in a SPARQL query, they are interpreted as service calls to the text index. The full list of magic predicates related to external free text search is defined and documented in the class FTS. The simplest way to integrate free text search into a SPARQL query in Blazegraph is to use the magic predicate fts:search inside a SPARQL join group and to specify the (e.g. Solr) endpoint using the magic predicate fts:endpoint in the same join group, where the prefix fts is defined as

PREFIX fts: <http://www.bigdata.com/rdf/fts#>

As a simple example, consider the query

PREFIX fts: <http://www.bigdata.com/rdf/fts#>
SELECT ?res WHERE {
  ?res fts:search "Alice" .
  ?res fts:endpoint "http://localhost:1234/solr/blazegraph/select" .
}

In the example, the predicate fts:search is used to query the Solr full text index running at http://localhost:1234/solr/blazegraph/select using the search string Alice. Returned matches are bound (by default, as literals) to the variable appearing in the subject position, namely ?res in the example.
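
For client code that submits such a query programmatically, the following is a minimal Java sketch. It only assembles the query text and a SPARQL 1.1 Protocol POST request and does not send anything. The class name FtsQuerySketch, its method names, and any SPARQL endpoint URL passed to buildRequest are assumptions made for illustration; they are not part of Blazegraph.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Blazegraph): builds the minimal external
// fulltext search query and wraps it in a SPARQL 1.1 Protocol POST request.
class FtsQuerySketch {

    // Escape backslashes and double quotes so the text can be embedded
    // in a double-quoted SPARQL string literal.
    static String escapeLiteral(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build the two-pattern join group from the example: search string plus endpoint.
    static String buildQuery(String searchString, String solrEndpoint) {
        return "PREFIX fts: <http://www.bigdata.com/rdf/fts#>\n"
             + "SELECT ?res WHERE {\n"
             + "  ?res fts:search \"" + escapeLiteral(searchString) + "\" .\n"
             + "  ?res fts:endpoint \"" + escapeLiteral(solrEndpoint) + "\" .\n"
             + "}";
    }

    // Form-encode the query as the "query" parameter of the POST body.
    static String formBody(String sparql) {
        return "query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
    }

    // Create (but do not send) the HTTP request. The SPARQL endpoint URL of the
    // Blazegraph instance is supplied by the caller and is an assumption here.
    static HttpRequest buildRequest(String sparqlEndpoint, String sparql) {
        return HttpRequest.newBuilder(URI.create(sparqlEndpoint))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Accept", "application/sparql-results+json")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(sparql)))
                .build();
    }
}
```

The request object can then be passed to a java.net.http.HttpClient. Escaping the search string matters as soon as it comes from user input, since an unescaped double quote would terminate the SPARQL literal early.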


Configuring search

The behavior of the fulltext search can be customized through a set of magic predicates defined inside the SPARQL query (also defined in the FTS class). As an example, consider the following external fulltext search query:

PREFIX fts: <http://www.bigdata.com/rdf/fts#>
SELECT ?res ?score ?snippet WHERE {
  ?res fts:search "Alice | Bob" .
  ?res fts:endpoint "http://localhost:1234/solr/blazegraph/select" .
  ?res fts:endpointType  "SOLR" .
  ?res fts:timeout "100000" .
  ?res fts:score ?score .
  ?res fts:snippet ?snippet . 
  ?res fts:params "fl=uri,description,score" .
  ?res fts:searchField "uri" .
  ?res fts:fieldToSearch "text" .
  ?res fts:scoreField "score" .
  ?res fts:snippetField "description" .
  ?res fts:searchResultType "URI" .
}

The magic predicates have the following semantics:

  • Search string and endpoint
    • fts:search: the search query to submit
    • fts:endpoint: the URL of the fulltext search endpoint
    • fts:endpointType: the type of endpoint (currently, only "SOLR" is supported; this is the default, so the predicate may be omitted)
    • fts:timeout: the timeout for the query in milliseconds
  • Additional output variables
    • fts:score: specifies a variable to which the score of the search result is bound (requires specification of the fts:scoreField, see below)
    • fts:snippet: specifies a variable to which the snippet of the search result is bound (requires specification of the fts:snippetField, see below)
  • Search parameters
    • fts:params: the parameter for the search, as a parameter string (URL encoded, where necessary); in the example above, we specify the fields that are returned by the Solr service, which are then used as input for the result type mapping discussed next
  • Search result type mapping
    • fts:searchField: the name of the Solr result field that is used to initialize the main output variable (i.e., the variable in subject position, namely ?res in our example); defaults to the Solr standard field id, if not specified
    • fts:fieldToSearch: the name of the Solr field that is searched; defaults to the Solr standard field text, if not specified
    • fts:scoreField: the name of the Solr result field that contains the score to be returned in the variable bound through fts:score; note that this field must be explicitly listed as a return field using fts:params, as illustrated by the sample query; no default, if not specified, the fts:score variable is left unbound
    • fts:snippetField: the name of the Solr result field that contains the snippet to be returned in the variable bound through fts:snippet; note that this field must be explicitly listed as a return field using fts:params, as illustrated by the sample query; no default, if not specified, the fts:snippet variable is left unbound
    • fts:searchResultType: the type of the search result, either URI or LITERAL (which is the default); note that URI can only be chosen if the fts:searchField contains strings that can be cast to URIs (if not, a runtime exception will be thrown)

Default Values for Magic Predicates

The default values for the external fulltext search magic predicates are documented at FTS, see the Options.DEFAULT_* members. A default applies whenever the corresponding magic predicate is absent from the query; when the predicate is present, its value overrides the default.
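
The override rule can be pictured as a simple map merge. The Java sketch below is only an illustration of that rule, not Blazegraph's code: the class FtsDefaultsSketch is hypothetical, and it lists only the defaults stated on this page. The authoritative values are the Options.DEFAULT_* members of FTS.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of "defaults unless overridden"; not Blazegraph code.
class FtsDefaultsSketch {

    // Only the defaults mentioned in this page. fts:scoreField and
    // fts:snippetField have no default, so they are deliberately absent.
    static final Map<String, String> DEFAULTS = new LinkedHashMap<>();
    static {
        DEFAULTS.put("fts:endpointType", "SOLR");
        DEFAULTS.put("fts:searchField", "id");
        DEFAULTS.put("fts:fieldToSearch", "text");
        DEFAULTS.put("fts:searchResultType", "LITERAL");
    }

    // Start from the defaults, then let every predicate used in the query win.
    static Map<String, String> effectiveOptions(Map<String, String> fromQuery) {
        Map<String, String> result = new LinkedHashMap<>(DEFAULTS);
        result.putAll(fromQuery);
        return result;
    }
}
```

For instance, a query that only sets fts:searchField to "uri" would still search the field text and return literals under this model.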


Background and Advanced Usage

Blazegraph's internal evaluation approach for external fulltext search is to collect all basic graph patterns involving external fulltext search magic predicates and group them into a SERVICE clause. To give an example, Blazegraph translates the query

PREFIX fts: <http://www.bigdata.com/rdf/fts#>
SELECT ?res ?uri WHERE {
  ?res fts:search "Alice | Bob" .
  ?res fts:endpoint "http://localhost:1234/solr/blazegraph/select" .
  ?res fts:params "fl=id,score" .
  ?res fts:scoreField "score" .
  ?res fts:score ?score .
  ?uri rdfs:label ?res .
}

into the following query:

PREFIX fts: <http://www.bigdata.com/rdf/fts#>
SELECT ?res ?uri WHERE {
  SERVICE <http://www.bigdata.com/rdf/fts#search> {
    ?res fts:search "Alice | Bob" .
    ?res fts:endpoint "http://localhost:1234/solr/blazegraph/select" .
    ?res fts:params "fl=id,score" .
    ?res fts:scoreField "score" .
    ?res fts:score ?score .
  }
  ?uri rdfs:label ?res .
}


By default, the SERVICE clause is always evaluated right at the beginning of the join group.

While in the example above the service is parameterized with constants only, in some cases it might be desirable to pass in variables that specify the behavior of the external fulltext search calls. For instance, assume you have your search terms stored as instance data in your RDF database and you want to execute a keyword search for each of them. In that case, rather than specifying your search string as a fixed literal (namely "Alice | Bob" in the example above), you can pass in a variable that is bound dynamically at runtime by a given graph pattern. The following example illustrates the idea:

PREFIX : <http://my.namespace/>
PREFIX fts: <http://www.bigdata.com/rdf/fts#>
SELECT ?searchTerm ?res WHERE {

  SERVICE <http://www.bigdata.com/rdf/fts#search> {
    ?res fts:search ?searchString .
    ?res fts:endpoint "http://localhost:1234/solr/blazegraph/select" .
    ?res fts:params "fl=id,score" .
    ?res fts:scoreField "score" .
    ?res fts:score ?score .
  }
  ?searchTerm rdf:type :SearchTerm .
  ?searchTerm :searchString ?searchString .
}

In the example, we assume that the triple patterns ?searchTerm rdf:type :SearchTerm . ?searchTerm :searchString ?searchString . bind the variable ?searchString to a set of search terms. Note that Blazegraph analyzes the dependencies between variables and delays the execution of the service call until all required variables have been bound. Here, this means that the two triple patterns are executed prior to the service call.
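
To make the idea of delaying the service call concrete, here is a toy Java model of dependency-based ordering. It is a sketch under simplifying assumptions and does not reproduce Blazegraph's query optimizer: the class ServiceOrderingSketch, the Step type, and the greedy strategy are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model (not Blazegraph's optimizer): schedule each step only once all
// variables it needs as input have been bound by earlier steps.
class ServiceOrderingSketch {

    static final class Step {
        final String name;
        final Set<String> requires; // variables that must already be bound
        final Set<String> binds;    // variables this step binds

        Step(String name, Set<String> requires, Set<String> binds) {
            this.name = name;
            this.requires = requires;
            this.binds = binds;
        }
    }

    // Greedy ordering: repeatedly pick the first pending step whose inputs are bound.
    static List<String> order(List<Step> steps) {
        List<Step> pending = new ArrayList<>(steps);
        Set<String> bound = new HashSet<>();
        List<String> plan = new ArrayList<>();
        while (!pending.isEmpty()) {
            Step next = null;
            for (Step s : pending) {
                if (bound.containsAll(s.requires)) {
                    next = s;
                    break;
                }
            }
            if (next == null) {
                throw new IllegalStateException(
                        "Cannot schedule " + pending.get(0).name + ": required variables never bound");
            }
            pending.remove(next);
            bound.addAll(next.binds);
            plan.add(next.name);
        }
        return plan;
    }
}
```

If the service step is listed first and requires no variables (the constants-only case), this model schedules it first. If it requires searchString, it is pushed behind the steps that bind that variable. A step whose inputs are never bound leads to an exception in this model.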

As an alternative (or for even more complex cases), you may also consider using Blazegraph's WITH extension for named subqueries, which lets you evaluate subqueries in any order you want and thus gives you full control over them.

Pitfalls

If used improperly, external fulltext search might result in runtime exceptions. This might happen particularly if

  • the search string is not bound or empty
  • the specified fulltext search endpoint is not provided, empty, or incorrect
  • the fts:searchResultType is set to URI, but the specified fts:searchField contains strings that cannot be converted to URI
  • the query fts:timeout is exceeded

In such cases, the exception message should give pointers to the problem.
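
Some of these conditions can be caught on the client side before the query is submitted. The Java sketch below shows one possible pre-flight check for the first two pitfalls. The class FtsPreflightSketch and its messages are hypothetical, and the check is purely syntactic: it cannot tell whether a well-formed endpoint URL actually points to a running Solr service.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side check; it does not replace the server-side validation.
class FtsPreflightSketch {

    // Returns a list of human-readable problems; an empty list means "looks fine".
    static List<String> problems(String searchString, String endpoint) {
        List<String> problems = new ArrayList<>();
        if (searchString == null || searchString.trim().isEmpty()) {
            problems.add("search string is unbound or empty");
        }
        if (endpoint == null || endpoint.trim().isEmpty()) {
            problems.add("endpoint is missing or empty");
        } else {
            try {
                URI uri = new URI(endpoint.trim());
                String scheme = uri.getScheme();
                boolean http = scheme != null
                        && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
                if (!http || uri.getHost() == null) {
                    problems.add("endpoint is not an absolute http(s) URL: " + endpoint);
                }
            } catch (URISyntaxException e) {
                problems.add("endpoint is not a valid URL: " + endpoint);
            }
        }
        return problems;
    }
}
```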

Example Application

There is a tutorial provided at SOLR_External_Fulltext_Search.

For Developers: Implementing Other External Fulltext Services

The fts:endpointType predicate specifies the type of the endpoint. Currently, only Solr endpoints are supported (fts:endpointType equals "SOLR"), but this is the switch that you could use to implement connectors to other types of endpoints. To do so, the following needs to be done:

  • Define a new endpoint type (say, e.g., "ElasticSearch") and extend the enum EndpointType accordingly
  • Implement a class, say ElasticFulltextSearchImpl, that implements the IFulltextSearch<FulltextSearchHit> interface. The method
@Override
public FulltextSearchHiterator<FulltextSearchHit> search(
       com.bigdata.service.fts.IFulltextSearch.FulltextSearchQuery query) {

   // your code goes here 

}

receives an in-memory object of the fulltext search query (i.e., an in-memory representation of what is specified in the SPARQL query) as input and must output an iterator over search hits. You may want to take a look at SolrFulltextSearchImpl as the reference implementation for Solr.

  • In class FulltextSearchServiceFactory, extend the switch over the endpointType, resolving the new enum value to your new implementation
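
The three steps fit together roughly as in the following self-contained Java sketch. All types in it (SketchEndpointType, SketchFulltextSearch, SketchSolrSearch, SketchElasticSearch, SketchSearchFactory) are simplified, hypothetical stand-ins so that the example compiles on its own. The real Blazegraph types use FulltextSearchQuery and FulltextSearchHiterator as shown in the method signature above, and the real enum constant names may differ.

```java
import java.util.Collections;
import java.util.Iterator;

// Step 1 (stand-in for the enum EndpointType): add a constant for the new endpoint type.
enum SketchEndpointType { SOLR, ELASTIC_SEARCH }

// Simplified stand-in for IFulltextSearch: a query string in, an iterator over hits out.
interface SketchFulltextSearch {
    Iterator<String> search(String query);
}

// Stand-in for the existing Solr reference implementation.
class SketchSolrSearch implements SketchFulltextSearch {
    @Override
    public Iterator<String> search(String query) {
        return Collections.emptyIterator(); // placeholder, no real Solr call
    }
}

// Step 2: the new connector for the new endpoint type.
class SketchElasticSearch implements SketchFulltextSearch {
    @Override
    public Iterator<String> search(String query) {
        return Collections.emptyIterator(); // your code goes here
    }
}

// Step 3 (stand-in for FulltextSearchServiceFactory): resolve the enum value
// to the matching implementation.
class SketchSearchFactory {
    static SketchFulltextSearch forType(SketchEndpointType type) {
        switch (type) {
            case SOLR:
                return new SketchSolrSearch();
            case ELASTIC_SEARCH:
                return new SketchElasticSearch();
            default:
                throw new IllegalArgumentException("Unsupported endpoint type: " + type);
        }
    }
}
```

Keeping the dispatch in a single switch means that a query using an endpoint type without a connector fails in one well-defined place.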