Table of Contents
Searching databases is your stock-in-trade, right? But chances are pretty good that when you enter your query into that little search box, you're not really searching the database at all, you're searching an index of the database. What's the difference, you might ask. The difference between a database and an index is not mere technical nuance; it's the difference between a book and the index at the back of the book. If you want to find some information, which method is faster and more accurate: searching the book page by page, or searching the index? For anyone designing or maintaining an information storage and retrieval system, it's an essential difference to grasp. This article will look at the differences between a database and an index; compare some of the most popular open source indexing tools; and discuss some common methodologies for designing and building an index.
A database (sometimes referred to more accurately as a relational database management system (RDBMS)) is a storage and organizational system for data. It's built around the concept of the "table": columns (fields) and rows (records) layed out like a spreadsheet. Multiple tables can be related to one other through "key" fields. The result is a powerful, flexible and robust system for organizing data into conceptual chunks and then relating those chunks to one another. In the library, a database is ideal for cataloging collections: books, audiovisuals, periodicals, media of all types. A database allows you to turn data into information through the miracle of tabled organization.
An index (sometimes called an inverted index) is also a storage and organizational system for data. Unlike a database, which is organized around tables, an index is organized around something much simpler: words. An index is simply a list of words abstracted from some other data source, along with a pointer to where each word appears in the data source (a filename, a URL, a database record ID, etc.). In the case of an index at the back of a paper book, the data source is the book's text, and the book index lists words and the page numbers where they can be found. In the case of a database index, the data source is the tables, and the database index holds words and the table records in which they can be found.
Consider this example of a mailing list database:
+-----------------------------------------------------------------+
| ID | first | last | address | city | state | zip |
+-----------------------------------------------------------------+
| 1 | Joe | Smith | 1234 Main Street | Nowhere | MO | 12345 |
+-----------------------------------------------------------------+
| 2 | Sally | Jones | 5678 Some Street | Nowhere | MO | 12345 |
+-----------------------------------------------------------------+
While an index of this same data might look like:
Joe, record1
Smith, record1
Sally, record2
Jones, record2
1234, record1
5678, record2
Main, record1
Some, record2
Street, record1, record2
Nowhere, record1, record2
MO, record1, record2
12345, record1, record2
A search of the index for 12345 would
quickly turn up two results: record1 and
record2. Consider a database 10000x bigger, and you can
begin to see how an index would provide a much faster search of the same
data.
There are distinct advantages to using a database in the library, where vast amounts of data must be managed. But databases require that their users know how to ask for that data in ways that the database understands. The standard method of asking a database for information is known as Structured Query Language (SQL). While powerful and flexible, SQL is built around the same organizing principle as databases themselves: the table. SQL is a computer language; it is designed to tell a computer exactly what to find in the database and what to do with the results. Like most computer languages, it is exacting and precise, because like computers, it is built around a binary system.
Human language, on the other hand, is often muddy and ambiguous, because it has evolved around a different principle: the word. And so here we come to the crux: in the world of information retrieval, nothing is so intuitive and natural for human beings as the word. Thus the index is the most intuitive and natural tool for human beings who are searching for information, because like human language, the index is organized around the word.
Databases offer the power and range of a computer's ability to hold and retrieve vast quanities of organized data; indexes offer a humane way of accessing that data. Now we'll look at how the two systems can work well together.
Many popular database software packages (Oracle, MySQL, etc.) offer native full-text search features. Full-text search is a fancy way of saying "search using words, not SQL". In order to make full-text search a reasonably useful (read: fast) feature, these packages often leverage an index, rather than looking at every field in every record of every table each time a search is performed. An index thus serves as a kind of "word cache" of the data in the database, in order to speed up the search.
While native full-text search is rapidly becoming standard for database software, it is often more useful and flexible to use a separate indexing system to create an independent index of your database. An independent index offers several advantages because it:
Frees you to offer more advanced search features than the database does.
Reduces the load on the database server. Particularly for applications where users are mostly retrieving data (rather than storing it), this can save tremendous strain on a network and server resources.
Can be distributed amongst multiple machines. An index is a write-once/read-many kind of application, so you might create it and then copy it to different servers or sites.
Allows you to carve out a subset of your data, or re-organize it in ways that might be more useful to some audiences or that "hide" data from certain audiences. You might, for example, create two indexes from a catalog database: one with only collection titles, authors, and subject listings for the general public, and another that includes ISBN codes, inventory and circulation information for library personnel.
One convenient way of taking the data from a database and inserting it into an index is to export the database data in XML. XML (eXtensible Markup Language) is an ideal means for the interchange of data between different organizing structures because it retains the semantic meaning of groups of words in a plain text format. So a common workflow might look like:
database -> XML -> index
Many database software packages have a 'XML export' feature, and if your favorite database package does not, it is fairly simple to write a script (often in Perl) to transform the data into XML.
Once your data is in XML, the next step is feeding the XML to the indexer. There are a number of open source indexing solutions available. Some of them are known as "search engines" because they have historically been used to index web sites or collections of documents. The same principles of indexing (extract words and keep track of where they appear) apply to documents as to database records, though as we will see, the methods for retrieving and ranking the search results can differ between the two. Let's look at three popular open source indexing projects: Xapian, Lucene and Swish-e.
Lucene[9] is a Java library. The package includes example indexing and searching code, but the examples won't suffice for use with a database. You'll need to either write your own Java application using the Lucene library, or adapt an existing project to the task.[10]
Xapian[11] is a C++ library. Like Lucene, you would need to either write your own application using one of the many supported languages (C/C++, Perl, Python, PHP, Java etc.) or adapt an existing project (Omega and dbi2omega are a frequent starting point).
Swish-e[12] is a command line program written in C that includes a search library and API. Unlike Lucene and Xapian, Swish-e is a complete system: the swish-e command line tool can parse, index and search data. An API is available for writing alternate searching applications in either C, PHP or Perl.
The paramount question when evaluating an indexer is: does it do what I need it to? The following table summarizes some key features of the three projects:
Table 17.1. Feature Summary
| Lucene | Swish-e | Xapian | |
| Incremental indexing | x | x[a] | x |
| Synchronous search and update | x | x | |
| UTF-8 (international languages) | x | x | |
| Command line interface | x | ||
| Built-in XML parser | x | ||
| Indexing API | x | x | |
| Searching API | x | x | x |
| Stemming | x | x | x |
| Boolean search | x | x | x |
| Wildcard search | x | x | x |
| Phrase search | x | x | x |
| Proximity search | x | x | |
| Stored document text | x | x | x |
| Portable index format | x | x | x |
| License | Apache | GPL[b] | GPL |
| Multiple ranking algorithms | x | x | x |
[a] Experimental feature [b] GPL plus special clause for linking against the search library | |||
Parsing is reading data and breaking it down into its component parts. In our case, parsing involves discerning what is a word and what is not a word. Consider this example:
<tag>the quick brown <foo>fox</foo> jumped.</tag>
An XML parser would recognize 'the,' 'quick,' 'brown,' 'fox,' and 'jumped' as words, ignoring the whitespace, the period at the end of 'jumped' and the XML tags inside the <> delimiters.
An indexer is not a parser; an indexer takes words that have already been parsed and inserts them into an index. So an indexer is dependent upon a parser to take raw data and turn it into the building blocks (words) of an index.
One of the primary features of a database is that the data is dynamic: you can add, update and delete data in the database and your changes are instantly accessible. In order for your index to keep pace with the changes in your database, the index needs to be able to increment (change) without needing to re-create the whole index from scratch. Ideally, your index should be searchable while it is being updated, so that you do not suffer from any interruption in service. Keeping the data in your database and your index in sync is very important, lest any differences cause confusion, fear or mistrust amongst your users.
In the Good Old Days of early computing, there were only 128 text characters that a computer needed to understand. Those 128 characters were known as American Standard Code for Information Interchange (ASCII). Of course, 128 characters is about the minimum number for any one human language, and as computers increased in popularity around the world, they needed to deal with more human languages and thus more characters. So the standard character set was doubled to include 256 characters (the maximum number of mathematical possibilities in 1 byte (8 bits) of computer memory). But many human languages contain far more than 256 characters, and so the Great Brains of the computer world finally came up with a system that allows for 1000s of characters. That system is known as Unicode. UTF-8 (short for Unicode Transformation Format -- the 8 is an allusion to its byte orientation) is one very common implementation of Unicode, and is in fact the default character set for XML.
Modern computer programs (meaning they have been written since the mid-1990s) usually support UTF-8, though exceptions of course exist. If the data in your database uses UTF-8 (or any other Unicode character set, sometimes called "multibyte encodings") then you will need to use an indexer that supports Unicode.
Once you've got your index created, you'll want to search it, of course. Experience with internet search engines tells us that most folks enter just one or two words when they search. [13]But a good search tool allows for more complicated queries. Here are some common features:
If a user enters the word run into a
search box, should your search tool return matches for the
words running and runner
as well? If you think those should count as valid matches,
then you need stemming. Stemming means reducing a word to its
root base. There are several models for deducing the stem of a
word, and many indexing tools contain different models to
choose from.
Stemming has other benefits: it will reduce the size of your index because fewer unique words will be added to the index, and it can increase the chances of finding just the right thing when you aren't sure of the exact form of a word. On the other hand, if you only want matches to the word 'run' then stemming will actually be a detriment, since you'll get false positives amongst your matching set. See the Methodologies section at the end of this article for more approaches to stemming.
Boolean is a fancy word for joining your search terms
with logical operators like AND, OR and NOT. For example, to
find all the records that include both the word
dog and the word cat,
you might search for dog AND cat. To find
all the records that include either of the words, you might
search for dog OR cat. OR searches, as a
rule, will result in a bigger set of matches than AND searches.
NOT is a way of omitting records that match a word. A search
for dog NOT cat would find only
records that contain the word dog but not
the word cat.
Wildcards allow you to guess at the spelling of a word
by using a special character to represent any character. For
example, a query like car* will match all
records that contain the words car,
cartesian, and cart. The
* character is the wildcard character that means "match zero
or more characters of any kind." Similarly, the ? character
might mean "match exactly one character of any kind."
Wildcards allow you more flexibility in your searches.
Wildcards are not the same thing as stemming, even
though they seem to behave the same way. Stemming has to
do with the semantic root of a word, while wildcards have
to do with matching patterns of characters. The stemmed
root of runner is
run, while a search for
runn* will only find
runner, not
run.
A phrase is a group of two or more words in the exact
order and position they appear in the record. Usually a phrase
is indicated through the use of "double quotation marks." For
example, a search for "fast runner" would
match records that have the words fast and
runner adjacent, in that order. It would
not match records that contain both words in a different order
(like "runner fast") or in a different
location ("fast is the sparrow, slow is the
runner").
Proximity searching is similar to phrase searching, only
with greater freedom with respect to word order and location.
In other words, you can search for words that are near one
another, but not adjacent. For example, a search for
runner NEAR fast might find records with
runner fast, fast
runner, and fast like a runner,
but not records where fast and
runner do not appear near one another.
"Near" can be defined as a range, as in "within 10 words of
one another." The proximity range should be configurable by
the user, as in runner NEAR:10 fast.
Ranking is one of the most important search features an index can offer. Ranking is a mathematical way of measuring the assumed relevance of a set of matches. Relevance can be measured in lots of different ways, and for any given user and query, one relevance method might be better than another. The best search tools allow for ranking to be adjusted or configured by the user, so that the user (who is ultimately the best judge of relevance) can tweak her search results to help her find the records she's looking for more quickly.
One common ranking method is called Inverse Document
Frequency. It uses a variety of mathematical algorithms that
boil down to this: the more frequently a word appears in a
given record, the more likely it is that record is relevant to
your search. However, if the word appears frequently in many
records in the index, then that word is likely not a good word
for measuring relevance. For example, if you search for
the fast runner, the word
the probably appears too many times in the
index as a whole and should be ignored when determining
relevance. The word fast might appear
often, but not as often as runner, so
runner is the most important word in the
query. The ranking algorithm takes all that into account when
sorting the matches and presenting them to the user.
most flexible search syntax (wildcards, ranking, etc.)
fastest search for complex queries
ports to other languages (Clucene in C++, Plucene in Perl, etc.) shows development interest
little out-of-the-box support
requires additional (Java) programming to integrate with database
probablistic ranking
best scaling (index size)
most extensive API support
poor documentation
requires additional programming to integrate with database
best out-of-the-box support
fastest search for simple queries
good documentation and examples
no UTF-8 (international language) support
no stable incremental indexing (ability to add/update/delete records)
limited search syntax
simplistic ranking
no indexing API/library
Which tool you choose depends on a lot of factors, among them: the size and type of data you intend to index; the technical resources you have available in setting up your system; and what kind of query flexibility your users will require.
Swish-e is the easiest to use and configure; allows the most flexibility with respect to input and defining what a 'word' is; and is fastest for most queries. However, Swish-e lacks UTF-8 support, incremental indexing, and does not scale particularly well for large (>million) data sets.
Lucene and Xapian offer great flexibility with indexing and searching, and can handle large data sets. However, neither of them provide much out-of-the-box support, nor do either of them parse your incoming database data, leaving that task to some other application.
Like human languages, indexes suffer from some limitations when they meet the cruel binary world of the computer. This section suggests some tips to keep in mind when you create an index.
What combination of characters constitutes a "word" in your data set? That might seem like an obvious question, but it can have a significant impact on the usability of your search tool. Consider this sentence:
If you click on the link marked http://foobar.com/ it doesn't work.
Which series of characters should be considered words, and which should not? Splitting up the sentence on whitespace is a start. If all non-whitespace characters are considered valid word characters, then you would end up with a word list like:
| If |
| you |
| click |
| on |
| the |
| link |
| marked |
| http://foobar.com/ |
| it |
| doesn't |
| work. |
But do you really want work. treated as a word?
Shouldn't it be work without the trailing period?
Otherwise, someone searching for work won't find that
sentence. But that's trivial. Consider this sentence:
If you click, beware: it doesn't work; instead, a message like 'help!' appears.
While that's a perfectly valid (albeit awkward) sentence, splitting the characters up on whitespace alone leaves you with:
| If |
| you |
| click, |
| beware: |
| it |
| doesn't |
| work; |
| instead, |
| a |
| message |
| like |
| 'help!' |
| appears. |
Ah! You say. Just ignore all the punctuation marks:
,;!.?' But is it that simple? Look again at the first
example. The characters http://foobar.com/ would get
split up into an awkward series of words like:
| http |
| foobar |
| com |
which isn't ideal either, especially when someone wants to find
the URL foobar.com, as opposed to
foobar.org or foobar.edu. A
similar false splitting would happen to the word
doesn't and all contractions like it.
Obviously, what constitutes a "word" isn't as easy as it first appears. Some search tools get around the awkward splitting of words by making everything a "phrase" search by default. Depending on your data set, you may need more control over the technical definition of a "word" than some indexing tools will offer.
If your indexer supports the definition of word characters,
configure it to ignore punctuation only when it appears at the
beginning or end of a string of non-whitespace characters. That way
doesn't gets indexed as
doesn't but 'help!' gets
indexed as help.
Stemming is a powerful feature that helps reduce the size of your index and helps users find related information more quickly. But stemming is also imprecise and fuzzy, and can lead to lots of "false positives" when you're looking for specific information. Stemming can sometimes hurt, more than help, the effectiveness of your search.
To increase the usability of your search system, consider creating multiple indexes: one with stemming on and another with stemming off. Then create a search interface that offers your users a choice between stemming and no stemming. While it will likely more than double the size of your filesystem requirements, hard drive space is cheap these days. And offering your users a choice when searching is well worth it.
Full text search is attractive because it mirrors the way humans think: in terms of words. But as everyone who has ever used a search tool knows, figuring out which words to use in a query can sometimes be confusing (at best) and infuriating (at worst).
When you create your indexes, you have the opportunity to make the
search process easier for your users by manipulating the data you put
into the index. Stemming is one common approach; there are many others.
Consider misspellings, for example. It's common for users to misspell
query words, resulting in no matches when, if they spelled their words
correctly, there are many matches to be found. Synonyms are another
example: a user enters run and misses all the records
that mention melt, flow,
tend, incline, etc.
Just as in Stemming Tip, create multiple indexes. Include the literal data, common misspellings, synonyms, related words, etc., in one index, and just the literal data in another index.
No indexer or search tool is perfect. One of the hottest research topics in current computer science is how to more accurately store and retrieve vast quantities of diverse information. Ranking is a huge issue, especially in the area of e-commerce, where the top ranked sites in search engines like Google and Yahoo! are getting the most dollars flowing their way.
Every audience is different, and uses search engines for different reasons. But patterns emerge. Discovering what the search patterns are amongst your user groups can make a dramatic difference in improving the quality of their search experience.
Every decent search tool offers logging of user queries, whether in a web server log or similar tool. Study how your users search, try some usability studies, and accomodate the patterns you discover by creating multiple indexes, offering search tips, accounting for commonly misspelled words, adding additional data to your indexes, and making recommendations based on previous searches. And most important: listen to your users. There are patterns lurking in their comments and questions.
A good indexer and search tool can help tame your databases and bring the data closer to your users. It takes practice, experimentation, and quality feedback to help tune your indexing methods to better serve your needs. Be patient. Be persistent. Join search tool e-mail lists and ask questions. It's likely that someone has run into a situation similar to yours before. And if at all possible, have fun.