Unicode

From Blazegraph

This page covers internationalization and Unicode considerations for the bigdata platform.

KeyBuilder

The KeyBuilder class is responsible for generating the sort keys used in the B+Tree indices. When the keys include Unicode fields, it is important that you consider the impact of the collation ordering on the generated sort keys. The collation ordering can be configured using the options specified by the com.bigdata.btree.keys.KeyBuilder.Options interface.
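To illustrate why the collation ordering matters, here is a minimal sketch that sorts the same strings by raw code unit value and by a locale-aware collator. It uses only the JDK's java.text.Collator, not the KeyBuilder itself; the class and method names are hypothetical and chosen for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of the bigdata code base.
class CollationOrderSketch {

    // Sorts by String.compareTo, i.e. by UTF-16 code unit value.
    static List<String> sortByCodeUnit(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Sorts using the JDK collator for the given locale.
    static List<String> sortByCollator(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator); // Collator implements Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Zebra", "apple", "Banana");
        // Upper-case letters have lower code units than lower-case ones.
        System.out.println(sortByCodeUnit(words));                 // [Banana, Zebra, apple]
        // The collator orders alphabetically, regardless of case.
        System.out.println(sortByCollator(words, Locale.ENGLISH)); // [apple, Banana, Zebra]
    }
}
```

The same keys end up in a different order depending on the collator, which is why the choice should be made deliberately for indices whose keys contain text.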

The default collator is ICU4J, selected by the value "ICU". ICU4J has the advantage of supporting compressed Unicode sort keys. You can also specify "JDK" or "ASCII" for the collator.

For example, the following will force the use of ASCII collation keys on all B+Trees that have text fields in the keys:

 com.bigdata.btree.keys.KeyBuilder.collator=ASCII

while the following corresponds to the default behavior:

 com.bigdata.btree.keys.KeyBuilder.collator=ICU

These options may be specified once when the database (Journal or Federation) is created. They may also be overridden when a specific KB instance or index is created. See the com.bigdata.btree.keys.KeyBuilder.Options interface for additional options.
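As a rough sketch, the property lines shown above can be read into a java.util.Properties object. This assumes the lines follow the standard Java properties format. How the resulting Properties object is handed to the Journal or Federation is not covered on this page and is left out here. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper; only the property name and values come from this page.
class CollatorPropertiesSketch {

    static final String COLLATOR_KEY = "com.bigdata.btree.keys.KeyBuilder.collator";

    // Parses properties-file text such as "key=value" lines.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    // Returns the configured collator, falling back to the documented default.
    static String collatorOf(Properties p) {
        return p.getProperty(COLLATOR_KEY, "ICU");
    }

    public static void main(String[] args) {
        Properties ascii = parse(COLLATOR_KEY + "=ASCII\n");
        System.out.println(collatorOf(ascii));            // ASCII
        System.out.println(collatorOf(new Properties())); // ICU
    }
}
```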

Tomcat

Tomcat requires a Filter that explicitly sets the character encoding of incoming POST requests to UTF-8. Without it, Tomcat assumes the content is ISO-8859-1 unless the request carries explicit charset information, which browsers typically do not send:

[1] [2]

To get a UTF-8-clean NanoSparqlServer web interface, either avoid Tomcat (use Jetty or another container) or implement the Filter solution suggested in the links above.

See [3] for the original discussion thread.
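The effect of the wrong default can be reproduced without a servlet container. The sketch below, using only the JDK, decodes a UTF-8 encoded request body once as ISO-8859-1 and once as UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of what happens to a UTF-8 POST body
// when the receiving side assumes ISO-8859-1.
class PostEncodingSketch {

    // Decodes raw request bytes using the given charset.
    static String decodeBody(byte[] body, Charset assumed) {
        return new String(body, assumed);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the final letter.
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Wrong assumption: each of the two UTF-8 bytes of U+00E9
        // becomes its own character (U+00C3 followed by U+00A9).
        String garbled = decodeBody(body, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 5

        // Correct assumption: the original four characters come back.
        String clean = decodeBody(body, StandardCharsets.UTF_8);
        System.out.println(clean.length()); // 4
    }
}
```

A Filter of the kind described in the links would presumably call ServletRequest.setCharacterEncoding("UTF-8") before the request parameters are read. That method is part of the standard Servlet API, but the exact filter is not reproduced here.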

SPARQL Regex Operator

In versions up to 1.2.2, the SPARQL REGEX operator has a bug: it does not perform case-folding correctly for Unicode data. You can work around this by specifying the 'u' flag. After r7018, the 'u' flag is enabled automatically whenever the 'i' flag is specified.

See [4] (SPARQL REGEX operator does not perform case-folding correctly for Unicode data).
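The same behavior can be observed in java.util.regex, where case-insensitive matching covers only US-ASCII unless Unicode case folding is also requested. The sketch below assumes, without confirmation from this page, that the 'i' and 'u' flags correspond to Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; assumes the REGEX flags map onto
// java.util.regex.Pattern flags, which this page does not state.
class RegexCaseFoldingSketch {

    // Matches the whole input case-insensitively, optionally with
    // Unicode-aware case folding.
    static boolean matchesIgnoringCase(String regex, String input, boolean unicodeCase) {
        int flags = Pattern.CASE_INSENSITIVE;
        if (unicodeCase) {
            flags |= Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00e9"; // e with acute accent, lower case
        String upper = "\u00c9"; // e with acute accent, upper case

        // Without Unicode case folding, only ASCII letters are folded.
        System.out.println(matchesIgnoringCase(lower, upper, false)); // false
        // With Unicode case folding, the two forms match.
        System.out.println(matchesIgnoringCase(lower, upper, true));  // true
        // ASCII letters match either way.
        System.out.println(matchesIgnoringCase("abc", "ABC", false)); // true
    }
}
```

Going by the flag names above, the workaround in a query would be to pass both flags together, for example "iu", as the third argument of the regex function.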