Mapping

Concepts

Geocodr supports multiple classes of objects. You can use classes to separate different types of objects, for example parcel information, addresses or points of interest. End-users can select one or more classes when calling the API.

Each class contains one or more collections. You can use collections to build different indices for objects with different attributes. For example, addresses have house numbers and street names, but a neighborhood has only a name and the city it belongs to. A Geocodr collection is tied to one Solr collection.

The tutorial contained a single class (address) and two collections (boroughs and streets). If you had a dataset with all house numbers, then you would add that to a new collection (e.g. addresses or housenumbers) belonging to the same address class. A dataset with shops would belong in a different class (e.g. pois).

Each collection is stored in a separate Solr collection. Each collection can have a different set of fields and you can configure which fields should be indexed and how they should be processed before indexing. This is configured in the Solr schema for each collection and is not specific to Geocodr.

Geocodr queries each collection with your search query. These queries run in parallel. The results are merged into a single result set. The sort order is defined by the scoring, which can be adjusted in your collection definition.
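The fan-out and merge step can be pictured with a small sketch. This is not Geocodr's implementation: search_all, fake_search and FAKE_INDEX are hypothetical names, and the fake index stands in for real Solr requests. It only illustrates querying every collection in parallel and sorting the combined hits by score.

```python
from concurrent.futures import ThreadPoolExecutor

def search_all(collections, query, search_fn):
    """Call search_fn(collection, query) for every collection in parallel
    and merge all hits into one list, highest score first."""
    with ThreadPoolExecutor(max_workers=max(1, len(collections))) as pool:
        per_collection = pool.map(lambda c: search_fn(c, query), collections)
        merged = [hit for hits in per_collection for hit in hits]
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)

# Hypothetical stand-in for the Solr collections (assumption, not real data)
FAKE_INDEX = {
    "streets": [{"name": "Amberg", "score": 3.0}],
    "boroughs": [
        {"name": "Rostock", "score": 2.0},
        {"name": "Amberg-Nord", "score": 3.5},
    ],
}

def fake_search(collection, query):
    # A real implementation would send the query to Solr here.
    return [dict(hit, collection=collection) for hit in FAKE_INDEX[collection]]

results = search_all(["streets", "boroughs"], "amberg", fake_search)
for hit in results:
    print(hit["collection"], hit["name"], hit["score"])
```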

Searching

Solr schema

We use a simple analyzer for string fields like street or city names, with a standard tokenizer and normalization filters applicable to our language. You can improve results by harmonizing your data and queries with a charFilter (PatternReplaceCharFilterFactory) as necessary. For example, the example schemas abbreviate the common suffix Straße (street) to str.

We use NGramFilterFactory and EdgeNGramFilterFactory for fuzzy and prefix matches. See below.

Please refer to the geocodr-mv repository for a complex example.

Geocodr mapping

Geocodr loads the definition of your collections from a mapping file. The geocodr and geocodr-api tools use the --mapping option to pass the file name of this mapping file.

The mapping file is a Python script. Each collection is a subclass of geocodr.search.Collection.

Most mapping options can be set by class variables on your custom collection classes. Basic options are class_ and name.

Here is a minimal example:

from geocodr.search import Collection

class Street(Collection):
   class_ = 'address'         # class for grouping similar collections
   class_title = 'Addresses'  # human readable class name
   name = 'streets'           # name of the Solr collection
   title = 'Streets'          # human readable name of the collection

Please refer to the geocodr-mv repository for a complex mapping file.

Exact matches

You need to define qfields with a list (or tuple) of all fields that should be used for searching. Geocodr splits the query string into terms, and each term needs to be found in at least one of the qfields. You can apply different boosts by appending the boost value with ^.

For exact matches:

class Street(Collection):
   qfields = (
      SimpleField('city_name') ^ 2.0,
      SimpleField('street_name') ^ 3.0,
   )

A query for ‘rostock amberg’ will query Solr with a query similar to (city_name:rostock or street_name:rostock) and (city_name:amberg or street_name:amberg). Remember that Solr passes each term through the tokenizer and filters of each field. Depending on your schema, these exact matches are still case insensitive, diacritics (like ä, é, ñ) are normalized, etc.
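The structure of such a query can be sketched in a few lines. The helper build_query below is hypothetical and is not Geocodr's query builder; it assumes whitespace splitting and plain (field, boost) pairs, and only shows how every term must match at least one field while the boosts are attached per field.

```python
def build_query(query, qfields):
    """Build a Solr-like boolean query string.

    qfields is a list of (field_name, boost) pairs. Each term of the
    query must match at least one field (OR inside a group), and all
    terms must match (AND between groups).
    """
    groups = []
    for term in query.split():
        alternatives = " OR ".join(
            f"{field}:{term}^{boost}" for field, boost in qfields
        )
        groups.append(f"({alternatives})")
    return " AND ".join(groups)

q = build_query("rostock amberg", [("city_name", 2.0), ("street_name", 3.0)])
print(q)
```

A real query additionally needs escaping of special characters in the terms, which this sketch leaves out.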

Fuzzy matches

We use NGramField for fuzzy search, which accepts incomplete terms and spelling errors:

class Street(Collection):
   qfields = (
      SimpleField('city_name') ^ 2.0,
      NGramField('city_name_ngram') ^ 1.0,
      SimpleField('street_name') ^ 3.0,
      NGramField('street_name_ngram') ^ 1.0,
   )

NGramField requires a Solr field with an NGramFilterFactory filter. NGramField builds 3-grams by default. The search term amberg generates the tokens amb, mbe, ber and erg. Solr would return results as soon as a single 3-gram is matched. Since this is not desired for our geocoding, Geocodr builds a Solr phrase search to get finer control of the results. For amberg we search for the phrases amb mbe ber erg and require that at least 3 phrases (n-grams) match. A longer search term tolerates more missing n-grams.

The boost for n-gram fields should be lower so that exact matches score higher.

Note that Solr analyzers and filters are not applied. You need to implement any filter in your own subclass. GermanNGramField is such a subclass for fields with NGramFilterFactory and GermanNormalizationFilterFactory.
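The n-gram matching described in this section can be illustrated with a simplified sketch. The functions ngrams and fuzzy_match are hypothetical and not part of Geocodr. The sketch counts shared 3-grams with a set lookup instead of building a Solr phrase search, and it takes the required number of matches as a parameter, because the rule Geocodr uses for terms of other lengths is not described here.

```python
def ngrams(term, n=3):
    """Return all overlapping n-grams of term, in order."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def fuzzy_match(query_term, indexed_term, min_matches, n=3):
    """True if at least min_matches n-grams of the query term
    also occur in the indexed term (order is ignored here)."""
    indexed = set(ngrams(indexed_term, n))
    hits = sum(1 for gram in ngrams(query_term, n) if gram in indexed)
    return hits >= min_matches

print(ngrams("amberg"))
# 'amber' shares amb, mbe and ber with 'amberg'
print(fuzzy_match("amberg", "amber", min_matches=3))
# 'hamburg' shares only amb with 'amberg'
print(fuzzy_match("amberg", "hamburg", min_matches=3))
```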

Prefix matches

A PrefixField can be used to match (partial) post codes:

class Postcodes(Collection):
   qfields = (
      PrefixField('postcode'),
   )

PrefixField requires a Solr field with an EdgeNGramFilterFactory filter. Terms are matched from the left. The term 123 will generate a Solr query similar to postcode:123*.
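To see why an edge n-gram index supports matching from the left, consider this minimal sketch. The function edge_ngrams is a hypothetical illustration of what such a filter produces at index time, not Solr's or Geocodr's code, and the postcode 18055 is only an example value.

```python
def edge_ngrams(value, min_gram=1, max_gram=None):
    """Return the prefixes of value with lengths min_gram..max_gram."""
    if max_gram is None:
        max_gram = len(value)
    max_gram = min(max_gram, len(value))
    return [value[:i] for i in range(min_gram, max_gram + 1)]

tokens = edge_ngrams("18055")
print(tokens)

# A partial postcode matches only if it is a prefix of the indexed value.
print("180" in tokens)
print("055" in tokens)
```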