Full Text Search

Bigdata provides an integrated full text indexing and search facility.

Before you get started, make sure the free text index is enabled in your properties file. The option defaults to "true" when it is not set explicitly, so you only need to check that it has not been set to "false":

com.bigdata.rdf.store.AbstractTripleStore.textIndex=true
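
As a rough illustration of the default described above, the following Python sketch inspects the text of a properties file and reports whether the text index would be enabled. The function name is hypothetical, the parsing handles only simple `key=value` lines, and treating any value other than "true" as disabled is an assumption modeled on Java's `Boolean.parseBoolean`, not confirmed behavior.

```python
TEXT_INDEX_KEY = "com.bigdata.rdf.store.AbstractTripleStore.textIndex"


def text_index_enabled(properties_text):
    """Return True when the key is absent (the documented default).

    Assumption: an explicit value enables the index only if it is "true"
    (case-insensitive), as Java's Boolean.parseBoolean would decide.
    """
    enabled = True
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments ("#" or "!" in .properties files).
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == TEXT_INDEX_KEY:
            # A later assignment overrides an earlier one.
            enabled = value.strip().lower() == "true"
    return enabled
```

Calling `text_index_enabled("")` reflects the default case in which the option is not mentioned at all.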


Introduction

The bigdata full text indexing and search facility is built using the same B+Tree components as the bigdata RDF database. It provides fast, scalable full text search and retrieval. The index is a B+Tree over tokens extracted by applying a configurable Analyzer to tokenize RDF Literals. The index supports a fast prefix match on tokens, a fast exact match on tokens, and a fast match on multiple tokens. It does not accelerate arbitrary regular expressions except when the regular expression is a trailing wildcard.
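
To make the exact and prefix matching behavior concrete, here is a toy model in Python. It is not how bigdata implements its index: a sorted list stands in for the B+Tree, whitespace splitting with lowercasing stands in for the configurable Analyzer, and all function names are hypothetical. The point is only that both exact and prefix matches reduce to one contiguous scan over ordered tokens, which is why a trailing wildcard is cheap and a leading wildcard is not.

```python
from bisect import bisect_left


def build_index(literals):
    """Return a sorted list of (token, literal) pairs."""
    entries = set()
    for literal in literals:
        for token in literal.lower().split():
            entries.add((token, literal))
    return sorted(entries)


def _scan(index, key, matches):
    """Seek to the first token >= key, then read until tokens stop matching."""
    start = bisect_left(index, (key,))
    hits = []
    for i in range(start, len(index)):
        token, literal = index[i]
        if not matches(token):
            break
        if literal not in hits:  # a literal may match on several tokens
            hits.append(literal)
    return hits


def exact_match(index, token):
    return _scan(index, token, lambda t: t == token)


def prefix_match(index, prefix):
    return _scan(index, prefix, lambda t: t.startswith(prefix))
```

For example, with `build_index(["Mike Personick", "Michael Smith", "Bryan Thompson"])`, a prefix match on "mi" visits only the adjacent "michael" and "mike" entries and stops at "personick".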

The text index must be enabled via a property when the database is created; alternatively, it can be created later using the Rebuild Text Index procedure. If enabled, each literal added to the database (by appearing in the “O” position of a statement) is also added to the text index. The index can then be accessed through SPARQL queries.

To accomplish this integration of free text search with high-level query, bigdata defines several magic predicates that are given special meaning, and when encountered in a SPARQL query are interpreted as service calls to the text index. The full list of magic predicates related to free text search is defined and documented in the class BDS. The simplest way to integrate a free text search into a SPARQL query in bigdata is to use the magic predicate bds:search inside of a SPARQL join group.

PREFIX bds: <http://www.bigdata.com/rdf/search#>

The predicate bds:search is used to search the full text index using the pattern in the “O” position of the search and to bind the hits (Literals) to the variable defined in the S position of the search. For example:

?lit bds:search "mike" .

will search the full text index for literals that contain the token “mike” and bind those literals onto the ?lit variable for use in subsequent joins. To find statements that use literals that contain the token mike, the SPARQL query would look as follows:

prefix bds: <http://www.bigdata.com/rdf/search#>
select ?s ?p ?o
where {
  ?o bds:search "mike" .
  ?s ?p ?o .
}
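
A client application might issue this query over HTTP. The sketch below uses only the Python standard library. The endpoint URL is an assumption about a local deployment and must be adjusted, and both function names are hypothetical. The query is sent as a URL-encoded POST, as the SPARQL Protocol allows, and JSON results are requested.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a SPARQL endpoint at this URL; adjust for your deployment.
ENDPOINT = "http://localhost:9999/blazegraph/sparql"


def build_search_query(terms):
    """Build the statement search query for the given search terms."""
    # Escape characters that would otherwise end or break the string literal.
    escaped = (
        terms.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return (
        "PREFIX bds: <http://www.bigdata.com/rdf/search#>\n"
        "SELECT ?s ?p ?o\n"
        "WHERE {\n"
        f'  ?o bds:search "{escaped}" .\n'
        "  ?s ?p ?o .\n"
        "}"
    )


def run_query(query, endpoint=ENDPOINT):
    """POST the query to the endpoint and return the parsed JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `run_query(build_search_query("mike"))` would return the bindings for ?s, ?p, and ?o.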

Search Metadata

In addition to simple search, additional metadata about the search can be defined inside the SPARQL query using other magic predicates (also defined in the BDS class). These predicates, when attached to the same variable as the search, will help narrow the search or bind additional metadata about search hits to other variables. We could expand the SPARQL query as follows:

prefix bds: <http://www.bigdata.com/rdf/search#>
select ?s ?p ?o ?score ?rank
where {
  ?o bds:search "mike personick" .
  ?o bds:matchAllTerms "true" .
  ?o bds:minRelevance "0.25" .
  ?o bds:relevance ?score .
  ?o bds:maxRank "1000" .
  ?o bds:rank ?rank .
  ?s ?p ?o .
}

Match All Terms

The magic predicate bds:matchAllTerms indicates that only literals containing all of the specified search terms should be considered. Similarly, literals can be constrained by minimum and maximum relevance (a score from 0 to 1 signifying how closely the literal matches the search terms) and by minimum and maximum rank (hits are ordered by relevance, and the rank describes where the literal appears in that ordered list). If the application needs the relevance or rank, those values can be bound to variables in the search results using the predicates bds:relevance and bds:rank.
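
The relationship between relevance and rank can be sketched in a few lines of Python. This is a toy model of the semantics described above, not the search service's code. It assumes that rank counts from 1 and that both thresholds are inclusive, neither of which is stated in the text, and the function name is hypothetical.

```python
def filter_hits(scored_hits, min_relevance=0.0, max_rank=None):
    """Order hits by relevance, assign ranks, and apply the two limits.

    scored_hits: iterable of (literal, relevance) pairs, relevance in [0, 1].
    Returns a list of (literal, relevance, rank) tuples.
    """
    # Rank is the position in the list ordered by descending relevance.
    ordered = sorted(scored_hits, key=lambda hit: hit[1], reverse=True)
    results = []
    for rank, (literal, relevance) in enumerate(ordered, start=1):
        if max_rank is not None and rank > max_rank:
            break  # every later hit has an even higher rank number
        if relevance >= min_relevance:
            results.append((literal, relevance, rank))
    return results
```

Under these assumptions, a minimum relevance of 0.25 drops weak matches, while a maximum rank of 1000 caps the number of hits regardless of their scores.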

Prefix Match

Prefix matches are supported by the bds:search magic predicate. The following query will find all literals having tokens beginning with "mi". This query is answered using the full text index. It is translated into a prefix scan of the tokens matching "mi*".

PREFIX bds: <http://www.bigdata.com/rdf/search#>

SELECT ?subj ?label 
WHERE {
      ?label bds:search "mi*" .
      ?label bds:relevance ?cosine .
      ?subj ?p ?label .
}

Regular Expressions

Trailing wildcard queries are answered using an index as illustrated above. If you have a leading wildcard such as "*foo" then that needs to be expressed using the SPARQL REGEX() function within a SPARQL filter. Bigdata does not use an index to accelerate REGEX() filters. Instead, it performs the joins, materializes the RDF Values, and then applies the REGEX() function to filter the solutions.
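
The choice between the two mechanisms can be sketched as a small Python helper that turns a user-supplied wildcard term into the matching SPARQL fragment. The helper name is hypothetical, and the sketch deliberately accepts only plain alphanumeric terms so that no regex or string escaping is needed. Note one difference in meaning: the REGEX() form tests the whole literal string and is case sensitive, whereas bds:search matches individual tokens.

```python
import re

# Optional leading "*", an alphanumeric body, optional trailing "*".
_TERM = re.compile(r"(\*?)([A-Za-z0-9]+)(\*?)")


def wildcard_pattern(var, term):
    """Return a SPARQL fragment that matches `term` against ?var."""
    match = _TERM.fullmatch(term)
    if match is None:
        raise ValueError("this sketch only handles plain alphanumeric terms")
    leading, body, trailing = match.groups()
    if leading:
        # Leading wildcard: no index support, so fall back to a REGEX filter.
        # Without a trailing wildcard the literal must end with the body.
        anchor = "" if trailing else "$"
        return f'FILTER(REGEX(STR(?{var}), "{body}{anchor}"))'
    # No wildcard or a trailing wildcard: answered from the full text index.
    return f'?{var} bds:search "{body}{trailing}" .'
```

The FILTER fragment only restricts solutions, so the variable must also be bound by another pattern in the same group, such as `?s ?p ?o .`.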

More Options

There are many more options for configuring the full text search facility inside of bigdata.

  • FullTextIndex.Options controls the behavior of the FullTextIndex.
  • AbstractTripleStore.Options controls the integration of the FullTextIndex with the AbstractTripleStore, whether the text indexing is enabled for a given triple or quad store, whether or not datatype literals are indexed, the text indexer implementation class, etc.
  • KeyBuilder.Options controls the Unicode collation ordering, etc.

Among other things, these options allow you to override the following:

  • The analyzer that breaks the literals down into tokens for search and indexing. By default, the analyzer will recognize language code literals and use an Apache Analyzer that is appropriate for that language family. This can be overridden if you want to index part numbers or other kinds of literals that follow different patterns.
  • The collation strength. This controls whether the search service is case sensitive or not.

Bigdata handles RDF Values that can be inlined specially. Inline values are typically small and fixed length, though xsd:integer and xsd:decimal are also inlined. Inline values are inserted directly into the statement indices to avoid the overhead of indirecting through the dictionary indices. Large objects are inserted into the BLOBS index, which keeps down the stride in the dictionary indices. RDF Values that are neither inlined nor BLOBS are inserted into the forward and reverse dictionary indices to obtain a numeric identifier. That numeric identifier is then used in the statement indices.

Both the forward dictionary index and the OSP(C) statement indices also support range scans:

  • Using the forward dictionary index (TERM2ID), you can do fast Unicode aware prefix scans of non-inline, non-blob URIs and Literals.
  • Using the OSP(C) statement index, you can do fast prefix scans of inline RDF Values.

SERVICE

Internally, an AST optimizer translates the magic predicates into a SERVICE clause. As of bigdata 1.2, you can also write the SERVICE clause directly. For example:

SELECT ?subj ?score
 WHERE {
   ?lit bds:search "mike" .
   ?lit bds:relevance ?score .
   ?subj ?p ?lit .
 }

This is translated internally into:

SELECT ?subj ?score
 WHERE {
   SERVICE <http://www.bigdata.com/rdf/search#search> {
     ?lit bds:search "mike" .
     ?lit bds:relevance ?score .
   }
   ?subj ?p ?lit .
}

If you are writing your own Custom Service, you no longer need to provide an AST translation. You can simply generate the SPARQL SERVICE clause directly in your application. When your custom service is invoked, it will have access to the ServiceNode and can extract the group graph pattern and interpret the magic predicates in a manner appropriate for your service's semantics.
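
Generating the SERVICE clause in an application can be as simple as wrapping the triple patterns in a group. The Python sketch below shows one way to do that. The helper name is hypothetical, and a custom service would use its own service URI and magic predicates in place of the bds ones shown here.

```python
BDS_SEARCH_SERVICE = "http://www.bigdata.com/rdf/search#search"


def service_clause(service_uri, patterns):
    """Wrap the given triple patterns in a SPARQL SERVICE group."""
    lines = [f"SERVICE <{service_uri}> {{"]
    # Indent each pattern inside the group graph pattern.
    lines += [f"  {pattern}" for pattern in patterns]
    lines.append("}")
    return "\n".join(lines)
```

Calling `service_clause(BDS_SEARCH_SERVICE, ['?lit bds:search "mike" .', '?lit bds:relevance ?score .'])` produces the SERVICE group from the translated query above, ready to be placed inside a WHERE clause.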


See Full Text Search in Bigdata for more information.