Chapter 17. What are Indexers and Why Should I Care?

Peter Karman

Table of Contents

Introduction
Some Definitions
Database and Index: Working Together
Overview
Features
Summary
Methodologies
What's in a word?
To stem or not to stem
Hide and seek
Feedback
Conclusion

Introduction

Searching databases is your stock-in-trade, right? But chances are pretty good that when you enter your query into that little search box, you're not really searching the database at all, you're searching an index of the database. What's the difference, you might ask. The difference between a database and an index is not mere technical nuance; it's the difference between a book and the index at the back of the book. If you want to find some information, which method is faster and more accurate: searching the book page by page, or searching the index? For anyone designing or maintaining an information storage and retrieval system, it's an essential difference to grasp. This article will look at the differences between a database and an index; compare some of the most popular open source indexing tools; and discuss some common methodologies for designing and building an index.

Some Definitions

A database (sometimes referred to more accurately as a relational database management system (RDBMS)) is a storage and organizational system for data. It's built around the concept of the "table": columns (fields) and rows (records) layed out like a spreadsheet. Multiple tables can be related to one other through "key" fields. The result is a powerful, flexible and robust system for organizing data into conceptual chunks and then relating those chunks to one another. In the library, a database is ideal for cataloging collections: books, audiovisuals, periodicals, media of all types. A database allows you to turn data into information through the miracle of tabled organization.

An index (sometimes called an inverted index) is also a storage and organizational system for data. Unlike a database, which is organized around tables, an index is organized around something much simpler: words. An index is simply a list of words abstracted from some other data source, along with a pointer to where each word appears in the data source (a filename, a URL, a database record ID, etc.). In the case of an index at the back of a paper book, the data source is the book's text, and the book index lists words and the page numbers where they can be found. In the case of a database index, the data source is the tables, and the database index holds words and the table records in which they can be found.

Consider this example of a mailing list database:


    +-----------------------------------------------------------------+
    | ID | first | last  | address          | city    | state | zip   |
    +-----------------------------------------------------------------+
    | 1  | Joe   | Smith | 1234 Main Street | Nowhere | MO    | 12345 |
    +-----------------------------------------------------------------+
    | 2  | Sally | Jones | 5678 Some Street | Nowhere | MO    | 12345 |
    +-----------------------------------------------------------------+

   

While an index of this same data might look like:

Joe, record1
Smith, record1
Sally, record2
Jones, record2
1234, record1
5678, record2
Main, record1
Some, record2
Street, record1, record2
Nowhere, record1, record2
MO, record1, record2
12345, record1, record2

A search of the index for 12345 would quickly turn up two results: record1 and record2. Consider a database 10000x bigger, and you can begin to see how an index would provide a much faster search of the same data.

There are distinct advantages to using a database in the library, where vast amounts of data must be managed. But databases require that their users know how to ask for that data in ways that the database understands. The standard method of asking a database for information is known as Structured Query Language (SQL). While powerful and flexible, SQL is built around the same organizing principle as databases themselves: the table. SQL is a computer language; it is designed to tell a computer exactly what to find in the database and what to do with the results. Like most computer languages, it is exacting and precise, because like computers, it is built around a binary system.

Human language, on the other hand, is often muddy and ambiguous, because it has evolved around a different principle: the word. And so here we come to the crux: in the world of information retrieval, nothing is so intuitive and natural for human beings as the word. Thus the index is the most intuitive and natural tool for human beings who are searching for information, because like human language, the index is organized around the word.

Databases offer the power and range of a computer's ability to hold and retrieve vast quanities of organized data; indexes offer a humane way of accessing that data. Now we'll look at how the two systems can work well together.

Database and Index: Working Together

Many popular database software packages (Oracle, MySQL, etc.) offer native full-text search features. Full-text search is a fancy way of saying "search using words, not SQL". In order to make full-text search a reasonably useful (read: fast) feature, these packages often leverage an index, rather than looking at every field in every record of every table each time a search is performed. An index thus serves as a kind of "word cache" of the data in the database, in order to speed up the search.

While native full-text search is rapidly becoming standard for database software, it is often more useful and flexible to use a separate indexing system to create an independent index of your database. An independent index offers several advantages because it:

  • Frees you to offer more advanced search features than the database does.

  • Reduces the load on the database server. Particularly for applications where users are mostly retrieving data (rather than storing it), this can save tremendous strain on a network and server resources.

  • Can be distributed amongst multiple machines. An index is a write-once/read-many kind of application, so you might create it and then copy it to different servers or sites.

  • Allows you to carve out a subset of your data, or re-organize it in ways that might be more useful to some audiences or that "hide" data from certain audiences. You might, for example, create two indexes from a catalog database: one with only collection titles, authors, and subject listings for the general public, and another that includes ISBN codes, inventory and circulation information for library personnel.

One convenient way of taking the data from a database and inserting it into an index is to export the database data in XML. XML (eXtensible Markup Language) is an ideal means for the interchange of data between different organizing structures because it retains the semantic meaning of groups of words in a plain text format. So a common workflow might look like:

database -> XML -> index

Many database software packages have a 'XML export' feature, and if your favorite database package does not, it is fairly simple to write a script (often in Perl) to transform the data into XML.

Once your data is in XML, the next step is feeding the XML to the indexer. There are a number of open source indexing solutions available. Some of them are known as "search engines" because they have historically been used to index web sites or collections of documents. The same principles of indexing (extract words and keep track of where they appear) apply to documents as to database records, though as we will see, the methods for retrieving and ranking the search results can differ between the two. Let's look at three popular open source indexing projects: Xapian, Lucene and Swish-e.

Overview

Lucene[9] is a Java library. The package includes example indexing and searching code, but the examples won't suffice for use with a database. You'll need to either write your own Java application using the Lucene library, or adapt an existing project to the task.[10]

Xapian[11] is a C++ library. Like Lucene, you would need to either write your own application using one of the many supported languages (C/C++, Perl, Python, PHP, Java etc.) or adapt an existing project (Omega and dbi2omega are a frequent starting point).

Swish-e[12] is a command line program written in C that includes a search library and API. Unlike Lucene and Xapian, Swish-e is a complete system: the swish-e command line tool can parse, index and search data. An API is available for writing alternate searching applications in either C, PHP or Perl.

Features

The paramount question when evaluating an indexer is: does it do what I need it to? The following table summarizes some key features of the three projects:

Table 17.1. Feature Summary

LuceneSwish-eXapian
Incremental indexingxx[a]x
Synchronous search and updatex x
UTF-8 (international languages)x x
Command line interface x
Built-in XML parser x
Indexing APIx x
Searching APIxxx
Stemmingxxx
Boolean searchxxx
Wildcard searchxxx
Phrase searchxxx
Proximity searchx x
Stored document textxxx
Portable index formatxxx
LicenseApacheGPL[b]GPL
Multiple ranking algorithmsxxx

[a] Experimental feature

[b] GPL plus special clause for linking against the search library

Parsing

Parsing is reading data and breaking it down into its component parts. In our case, parsing involves discerning what is a word and what is not a word. Consider this example:

<tag>the quick brown <foo>fox</foo> jumped.</tag>

An XML parser would recognize 'the,' 'quick,' 'brown,' 'fox,' and 'jumped' as words, ignoring the whitespace, the period at the end of 'jumped' and the XML tags inside the <> delimiters.

An indexer is not a parser; an indexer takes words that have already been parsed and inserts them into an index. So an indexer is dependent upon a parser to take raw data and turn it into the building blocks (words) of an index.

Incremental indexing

One of the primary features of a database is that the data is dynamic: you can add, update and delete data in the database and your changes are instantly accessible. In order for your index to keep pace with the changes in your database, the index needs to be able to increment (change) without needing to re-create the whole index from scratch. Ideally, your index should be searchable while it is being updated, so that you do not suffer from any interruption in service. Keeping the data in your database and your index in sync is very important, lest any differences cause confusion, fear or mistrust amongst your users.

UTF-8 (Unicode)

In the Good Old Days of early computing, there were only 128 text characters that a computer needed to understand. Those 128 characters were known as American Standard Code for Information Interchange (ASCII). Of course, 128 characters is about the minimum number for any one human language, and as computers increased in popularity around the world, they needed to deal with more human languages and thus more characters. So the standard character set was doubled to include 256 characters (the maximum number of mathematical possibilities in 1 byte (8 bits) of computer memory). But many human languages contain far more than 256 characters, and so the Great Brains of the computer world finally came up with a system that allows for 1000s of characters. That system is known as Unicode. UTF-8 (short for Unicode Transformation Format -- the 8 is an allusion to its byte orientation) is one very common implementation of Unicode, and is in fact the default character set for XML.

Modern computer programs (meaning they have been written since the mid-1990s) usually support UTF-8, though exceptions of course exist. If the data in your database uses UTF-8 (or any other Unicode character set, sometimes called "multibyte encodings") then you will need to use an indexer that supports Unicode.

Searching

Once you've got your index created, you'll want to search it, of course. Experience with internet search engines tells us that most folks enter just one or two words when they search. [13]But a good search tool allows for more complicated queries. Here are some common features:

Stemming

If a user enters the word run into a search box, should your search tool return matches for the words running and runner as well? If you think those should count as valid matches, then you need stemming. Stemming means reducing a word to its root base. There are several models for deducing the stem of a word, and many indexing tools contain different models to choose from.

Stemming has other benefits: it will reduce the size of your index because fewer unique words will be added to the index, and it can increase the chances of finding just the right thing when you aren't sure of the exact form of a word. On the other hand, if you only want matches to the word 'run' then stemming will actually be a detriment, since you'll get false positives amongst your matching set. See the Methodologies section at the end of this article for more approaches to stemming.

Boolean

Boolean is a fancy word for joining your search terms with logical operators like AND, OR and NOT. For example, to find all the records that include both the word dog and the word cat, you might search for dog AND cat. To find all the records that include either of the words, you might search for dog OR cat. OR searches, as a rule, will result in a bigger set of matches than AND searches. NOT is a way of omitting records that match a word. A search for dog NOT cat would find only records that contain the word dog but not the word cat.

Wildcard

Wildcards allow you to guess at the spelling of a word by using a special character to represent any character. For example, a query like car* will match all records that contain the words car, cartesian, and cart. The * character is the wildcard character that means "match zero or more characters of any kind." Similarly, the ? character might mean "match exactly one character of any kind." Wildcards allow you more flexibility in your searches.

Note

Wildcards are not the same thing as stemming, even though they seem to behave the same way. Stemming has to do with the semantic root of a word, while wildcards have to do with matching patterns of characters. The stemmed root of runner is run, while a search for runn* will only find runner, not run.

Phrases

A phrase is a group of two or more words in the exact order and position they appear in the record. Usually a phrase is indicated through the use of "double quotation marks." For example, a search for "fast runner" would match records that have the words fast and runner adjacent, in that order. It would not match records that contain both words in a different order (like "runner fast") or in a different location ("fast is the sparrow, slow is the runner").

Proximity

Proximity searching is similar to phrase searching, only with greater freedom with respect to word order and location. In other words, you can search for words that are near one another, but not adjacent. For example, a search for runner NEAR fast might find records with runner fast, fast runner, and fast like a runner, but not records where fast and runner do not appear near one another. "Near" can be defined as a range, as in "within 10 words of one another." The proximity range should be configurable by the user, as in runner NEAR:10 fast.

Ranking

Ranking is one of the most important search features an index can offer. Ranking is a mathematical way of measuring the assumed relevance of a set of matches. Relevance can be measured in lots of different ways, and for any given user and query, one relevance method might be better than another. The best search tools allow for ranking to be adjusted or configured by the user, so that the user (who is ultimately the best judge of relevance) can tweak her search results to help her find the records she's looking for more quickly.

One common ranking method is called Inverse Document Frequency. It uses a variety of mathematical algorithms that boil down to this: the more frequently a word appears in a given record, the more likely it is that record is relevant to your search. However, if the word appears frequently in many records in the index, then that word is likely not a good word for measuring relevance. For example, if you search for the fast runner, the word the probably appears too many times in the index as a whole and should be ignored when determining relevance. The word fast might appear often, but not as often as runner, so runner is the most important word in the query. The ranking algorithm takes all that into account when sorting the matches and presenting them to the user.

Summary

Lucene

Pros

  • most flexible search syntax (wildcards, ranking, etc.)

  • fastest search for complex queries

  • ports to other languages (Clucene in C++, Plucene in Perl, etc.) shows development interest

Cons

  • little out-of-the-box support

  • requires additional (Java) programming to integrate with database

Xapian

Pros

  • probablistic ranking

  • best scaling (index size)

  • most extensive API support

Cons

  • poor documentation

  • requires additional programming to integrate with database

Swish-e

Pros

  • best out-of-the-box support

  • fastest search for simple queries

  • good documentation and examples

Cons

  • no UTF-8 (international language) support

  • no stable incremental indexing (ability to add/update/delete records)

  • limited search syntax

  • simplistic ranking

  • no indexing API/library

Which tool you choose depends on a lot of factors, among them: the size and type of data you intend to index; the technical resources you have available in setting up your system; and what kind of query flexibility your users will require.

Swish-e is the easiest to use and configure; allows the most flexibility with respect to input and defining what a 'word' is; and is fastest for most queries. However, Swish-e lacks UTF-8 support, incremental indexing, and does not scale particularly well for large (>million) data sets.

Lucene and Xapian offer great flexibility with indexing and searching, and can handle large data sets. However, neither of them provide much out-of-the-box support, nor do either of them parse your incoming database data, leaving that task to some other application.

Methodologies

Like human languages, indexes suffer from some limitations when they meet the cruel binary world of the computer. This section suggests some tips to keep in mind when you create an index.

What's in a word?

What combination of characters constitutes a "word" in your data set? That might seem like an obvious question, but it can have a significant impact on the usability of your search tool. Consider this sentence:

If you click on the link marked http://foobar.com/ it doesn't work.

Which series of characters should be considered words, and which should not? Splitting up the sentence on whitespace is a start. If all non-whitespace characters are considered valid word characters, then you would end up with a word list like:

If
you
click
on
the
link
marked
http://foobar.com/
it
doesn't
work.

But do you really want work. treated as a word? Shouldn't it be work without the trailing period? Otherwise, someone searching for work won't find that sentence. But that's trivial. Consider this sentence:

If you click, beware: it doesn't work; instead, a message like 'help!' appears.

While that's a perfectly valid (albeit awkward) sentence, splitting the characters up on whitespace alone leaves you with:

If
you
click,
beware:
it
doesn't
work;
instead,
a
message
like
'help!'
appears.

Ah! You say. Just ignore all the punctuation marks: ,;!.?' But is it that simple? Look again at the first example. The characters http://foobar.com/ would get split up into an awkward series of words like:

http
foobar
com

which isn't ideal either, especially when someone wants to find the URL foobar.com, as opposed to foobar.org or foobar.edu. A similar false splitting would happen to the word doesn't and all contractions like it.

Obviously, what constitutes a "word" isn't as easy as it first appears. Some search tools get around the awkward splitting of words by making everything a "phrase" search by default. Depending on your data set, you may need more control over the technical definition of a "word" than some indexing tools will offer.

Tip

If your indexer supports the definition of word characters, configure it to ignore punctuation only when it appears at the beginning or end of a string of non-whitespace characters. That way doesn't gets indexed as doesn't but 'help!' gets indexed as help.

To stem or not to stem

Stemming is a powerful feature that helps reduce the size of your index and helps users find related information more quickly. But stemming is also imprecise and fuzzy, and can lead to lots of "false positives" when you're looking for specific information. Stemming can sometimes hurt, more than help, the effectiveness of your search.

Tip

To increase the usability of your search system, consider creating multiple indexes: one with stemming on and another with stemming off. Then create a search interface that offers your users a choice between stemming and no stemming. While it will likely more than double the size of your filesystem requirements, hard drive space is cheap these days. And offering your users a choice when searching is well worth it.

Hide and seek

Full text search is attractive because it mirrors the way humans think: in terms of words. But as everyone who has ever used a search tool knows, figuring out which words to use in a query can sometimes be confusing (at best) and infuriating (at worst).

When you create your indexes, you have the opportunity to make the search process easier for your users by manipulating the data you put into the index. Stemming is one common approach; there are many others. Consider misspellings, for example. It's common for users to misspell query words, resulting in no matches when, if they spelled their words correctly, there are many matches to be found. Synonyms are another example: a user enters run and misses all the records that mention melt, flow, tend, incline, etc.

Tip

Just as in Stemming Tip, create multiple indexes. Include the literal data, common misspellings, synonyms, related words, etc., in one index, and just the literal data in another index.

Feedback

No indexer or search tool is perfect. One of the hottest research topics in current computer science is how to more accurately store and retrieve vast quantities of diverse information. Ranking is a huge issue, especially in the area of e-commerce, where the top ranked sites in search engines like Google and Yahoo! are getting the most dollars flowing their way.

Every audience is different, and uses search engines for different reasons. But patterns emerge. Discovering what the search patterns are amongst your user groups can make a dramatic difference in improving the quality of their search experience.

Tip

Every decent search tool offers logging of user queries, whether in a web server log or similar tool. Study how your users search, try some usability studies, and accomodate the patterns you discover by creating multiple indexes, offering search tips, accounting for commonly misspelled words, adding additional data to your indexes, and making recommendations based on previous searches. And most important: listen to your users. There are patterns lurking in their comments and questions.

Conclusion

A good indexer and search tool can help tame your databases and bring the data closer to your users. It takes practice, experimentation, and quality feedback to help tune your indexing methods to better serve your needs. Be patient. Be persistent. Join search tool e-mail lists and ask questions. It's likely that someone has run into a situation similar to yours before. And if at all possible, have fun.