Tagging - Part 2

I have made some initial steps regarding my tagging application. See:

http://mylibrary.library.nd.edu/demos/tagging/

 

I started with the Reading List application since it included bunches o’ data and the framework for managing users. I then used the “kewl” details hack from the Articles Index demonstration to create a pop-up form for inputting tags. Finally, I got some JavaScript to echo the input: a resource ID, the username of the patron, and the tag. My next step will be to use an AJAX technique to create facet/term combinations on behalf of the user and echo the results on the screen. On my mark. Get set. Go.

Tagging

I am going to try to write a MyLibrary application that implements tagging.

First I will create a system that asks you to log in. I will then list information resources. Next, I will supply a widget labeled “Add a tag”. The widget will display an input box. Enter a tag. Click go. The tag gets associated with the resource as well as the user, and the tag gets displayed along with the resource. This is the first go. Wish me luck.

A problem to solve

Open source software is filled with people who like to solve problems, so here’s a problem I hope someone here will be able to help me solve.

How can I make the subroutine, get_location, faster? It is given a resource object and a scalar representing the human-readable location type. The routine then finds all the locations for the given resource, loops through them, and returns the location whose name equals the desired type.

my $isbn = &get_location( $resource, 'ISBN' );

sub get_location {

  # get the input
  my $resource      = shift;
  my $location_type = shift;

  # initialize
  my $location = '';

  # process each location
  foreach ( $resource->resource_locations ) {

    my $type = MyLibrary::Resource::Location::Type->new( id => $_->resource_location_type );
    if ( $type->name eq $location_type ) {

      $location = $_->location;
      last;

    }

  }

  # done
  return $location;

}

The problem is the code is really slow. In my particular application each resource has three locations, and it seems to take about .5 seconds to process the foreach loop. Since get_location is called twice for every resource, and since there are about 25 resources per batch, doing this lookup takes about 25 seconds per batch. Way too slow. What are we doing wrong!?
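
One possible direction, offered as a sketch rather than a diagnosis: if the time goes into creating a MyLibrary::Resource::Location::Type object for every location (presumably a database query each time), then remembering the type id to type name mapping would remove most of those queries. In the sketch below, lookup_type_name, the plain-hash locations, and the type ids are hypothetical stand-ins so the example runs without MyLibrary.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the expensive call. In the real application this
# would be MyLibrary::Resource::Location::Type->new( id => $id )->name.
our $lookups = 0;    # counts how often the expensive call is made
my %type_names = ( 1 => 'URL', 2 => 'ISBN', 3 => 'Call number' );
sub lookup_type_name {
  my $id = shift;
  $lookups++;
  return $type_names{ $id };
}

# cache of type id => type name, filled the first time each id is seen
my %type_cache;
sub cached_type_name {
  my $id = shift;
  $type_cache{ $id } = lookup_type_name( $id ) unless exists $type_cache{ $id };
  return $type_cache{ $id };
}

# same logic as the original routine, but the locations are plain hashes here
sub get_location {
  my ( $locations, $location_type ) = @_;
  foreach my $location ( @$locations ) {
    return $location->{ location }
      if cached_type_name( $location->{ type_id } ) eq $location_type;
  }
  return '';
}

# hypothetical locations for one resource
my @locations = (
  { type_id => 1, location => 'http://example.org/book' },
  { type_id => 2, location => '0123456789' },
);
print get_location( \@locations, 'ISBN' ), "\n";
```

With a cache like this, each location type is looked up once per run instead of once per location per call. Whether that is where the half second really goes is an assumption that would need to be checked with a profiler.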

More imitation

More imitation as the sincerest form of flattery.

This past week I had the opportunity to attend the Charleston Conference. During one of the sessions someone reported the results of their various usability studies. To resolve some of the issues the studies uncovered, they decided to create a system where people would log in and be presented with a set of suggested library resources for getting their learning, teaching, and research done. They were going to call this system My Library.

On one hand, this was sort of frustrating. I had been at this point ten years ago, and My Library was being presented as a new and novel idea. On the other hand, it is nice to see that something conceived ten years ago is still seen as a viable option today. Satisfying. Change takes time.

Reference FAQ

I believe we will be implementing a “Reference Desk FAQ” using MyLibrary.

Over the past year the reference department has facilitated online reference chats. They have kept these chats in a database, but the database’s support has dried up. They now use simpler tools for chat reference, like IM. Given the year’s worth of data, they noticed patterns: frequently asked questions. They desire to make the questions (and their answers) available on the Web. Here is what we will do:

  1. Distill the questions to a useful few, about one or two hundred.
  2. Answer each question.
  3. Classify/tag each question/answer pair using facet/term combinations.
  4. Import the questions as titles in a MyLibrary instance.
  5. Import the answers as descriptions in the same instance.
  6. Associate each question/answer with a facet/term.
  7. Create a browsable list of questions/answers from the facet/terms.
  8. Create an index to the questions and answers using something like KinoSearch.
  9. Allow users to comment on the content of the system through the use of MyLibrary Review objects.
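
Steps 3, 6, and 7 above can be sketched without MyLibrary at all. The following is only an illustration of the data shape: the questions, answers, and facet/term combinations are hypothetical, and browse_list is a made-up helper, not part of the MyLibrary API.

```perl
use strict;
use warnings;

# hypothetical question/answer pairs, each classified with facet/term combinations
my @faq = (
  { question => 'How do I renew a book?',
    answer   => 'Sign in to your library account and choose Renew.',
    tags     => [ 'Services/Circulation' ] },
  { question => 'Where can I find journal articles?',
    answer   => 'Start with one of the article indexes.',
    tags     => [ 'Tools/Indexes', 'Formats/Journals' ] },
  { question => 'Can I renew a journal?',
    answer   => 'Journals do not circulate.',
    tags     => [ 'Services/Circulation', 'Formats/Journals' ] },
);

# group the questions under each facet/term they are associated with
sub browse_list {
  my %list;
  foreach my $pair ( @_ ) {
    push @{ $list{ $_ } }, $pair->{ question } foreach @{ $pair->{ tags } };
  }
  return \%list;
}

# print the browsable list, one facet/term heading at a time
my $list = browse_list( @faq );
foreach my $tag ( sort keys %$list ) {
  print "$tag\n";
  print "  * $_\n" foreach sort @{ $list->{ $tag } };
}
```

In a real instance the pairs would be resource objects and the grouping would come from the facet/term relationships in the database, but the report has the same outline.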

When you’ve got a hammer, everything begins to look like a nail.

CRRA (“Catholic Portal”)

For a few days last week I went to Boston College and attended a meeting for the Catholic Research Resources Alliance (CRRA), also known as the “Catholic Portal”. In its present state, the portal is really an index to rare and unique materials of interest to Catholic scholars. The system runs on top of MyLibrary. Presently it ingests data from EAD files, saves the metadata to MyLibrary, indexes the EAD using KinoSearch, and provides an SRU interface to the index and a browsable interface to the database. At the meeting I learned the ingestion process now needs to support MARC records as well as EAD files. I don’t think this will be very difficult. On our mark. Get set. Go!

Tiny Facebook steps

I have taken some tiny Facebook steps.

I have created a simple Facebook application, but it probably doesn’t work for you. The code is here. The idea is to read information from a person’s profile, use some textual analysis, and recommend one or more information resources for the user to… use. I might be able to use other people’s profiles as well to enable something like, “I trust my friend. What is recommended for them? Maybe I’ll use that.” The problem with my current implementation is that no one but me seems to be able to access it. Why?

Supporting search

When you want to provide search against your MyLibrary content, use an indexer.

MyLibrary is great for storing and manipulating the information of digital libraries. This is because it uses a database underneath. Ironically, databases are weak when it comes to search because queries always need to be mapped to fields. Moreover, databases, unless they exploit some sort of “-ism”, do not support relevance ranking. This is where indexers come in. They do not require you to denote a field to search, and they do support relevance ranking.
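
To make the difference concrete, here is a toy inverted index in plain Perl. It is only a sketch: the records are hypothetical, and ranking by raw word counts is a crude stand-in for the relevance ranking a real indexer provides. Notice that the search routine never names a field.

```perl
use strict;
use warnings;

# hypothetical records; all fields are indexed together
my %records = (
  1 => { title => 'Astronomy for beginners', note => 'Stars and planets' },
  2 => { title => 'English literature',      note => 'Poems about stars, stars, and more stars' },
);

# build the inverted index: word => { record id => number of occurrences }
my %index;
foreach my $id ( keys %records ) {
  foreach my $field ( values %{ $records{ $id } } ) {
    $index{ lc $_ }{ $id }++ foreach $field =~ /\w+/g;
  }
}

# search without naming a field; the more often a word occurs, the higher the rank
sub search {
  my $word = lc shift;
  my $hits = $index{ $word } || {};
  return sort { $hits->{ $b } <=> $hits->{ $a } || $a <=> $b } keys %$hits;
}

print join( ', ', search( 'stars' ) ), "\n";
```

A query against a database would instead need to say which column to look in, and the rows would come back in no particular order of usefulness.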

When you want to support search against MyLibrary, write a report against MyLibrary and feed the content to your indexer of choice. While Solr/Lucene seems to be the gold standard these days, I like KinoSearch because it uses the same query language as Lucene, and the Lucene query language is supported by my SRU client.

Here is some code that loops through each MyLibrary resource object, extracts some metadata, and adds it to a KinoSearch index:

# define
use constant INDEX => '../etc/index';

# require/include
use KinoSearch::InvIndexer;
use KinoSearch::Analysis::PolyAnalyzer;
use MyLibrary::Core;

# configure
MyLibrary::Config->instance( 'catalog' );

# create an index
$analyzer   = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
$invindexer = KinoSearch::InvIndexer->new(
  invindex => INDEX,
  create   => 1,
  analyzer => $analyzer
);
$invindexer->spec_field( name => 'id' );
$invindexer->spec_field( name => 'fkey' );
$invindexer->spec_field( name => 'title' );
$invindexer->spec_field( name => 'creator' );
$invindexer->spec_field( name => 'subject' );
$invindexer->spec_field( name => 'description' );

# process each resource
my @ids = MyLibrary::Resource->get_ids;
foreach ( @ids ) {

  # get this resource
  my $resource = MyLibrary::Resource->new( id => $_ );

  # create, fill, and commit a document with content
  my $doc = $invindexer->new_doc;
  $doc->set_value ( id          => $resource->id );
  $doc->set_value ( fkey        => $resource->fkey );
  $doc->set_value ( title       => $resource->name )    unless ( ! $resource->name );
  $doc->set_value ( creator     => $resource->creator ) unless ( ! $resource->creator );
  $doc->set_value ( subject     => $resource->subject ) unless ( ! $resource->subject );
  $doc->set_value ( description => $resource->note )    unless ( ! $resource->note );

  # done
  $invindexer->add_doc( $doc );

}

# clean up
print "\noptimizing... ";
$invindexer->finish( optimize => 1 );

# done
exit;

Here is some code that searches the resulting index:

# define
use constant INDEX => '../etc/index';

# require/include
use KinoSearch::Searcher;
use KinoSearch::Analysis::PolyAnalyzer;
use MyLibrary::Core;

# configure
MyLibrary::Config->instance( 'catalog' );

my $query = shift;
if ( ! $query ) {

  # get the query
  print "Enter a query. "; chomp ( $query = <STDIN> );

}

# open an index
$analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
$searcher = KinoSearch::Searcher->new(
  invindex => INDEX,
  analyzer => $analyzer
);

# search
$hits = $searcher->search( qq($query) );

# get the number of hits and report result
$total_hits = $hits->total_hits;
print "Your query ($query) found $total_hits record(s).\n\n";

# loop through the results
while ( my $hit = $hits->fetch_hit_hashref ) {

  &listOneResource( $hit->{ 'id' } );

}
print "\n";

sub listOneResource {

  my $id = shift;
  my $resource = MyLibrary::Resource->new( id => $id );
  print "           id = " . $resource->id   . "\n";
  print "         name = " . $resource->name . "\n";
  print "         date = " . $resource->date . "\n";
  print "         note = " . $resource->note . "\n";
  print "     creators = ";
  foreach ( split /\|/, $resource->creator ) { print "$_; " }
  print "\n";
  my @resource_terms = $resource->related_terms();
  print "      term(s) = ";
  foreach (@resource_terms) {

    my $term = MyLibrary::Term->new(id => $_);
    print $term->term_name, " ($_)", '; ';

  }
  print "\n";
  my @locations = $resource->resource_locations();
  print "  location(s) = ";
  foreach (@locations) { print $_->location, "; " }
  print "\n\n";

}

Infinitely deep facet/term combinations

We sometimes wonder whether or not the facet/term combinations should be infinitely deep.

As of right now the facet/term combinations are simple two-dimensional hierarchies. Examples include:

  • Audience/Freshman
  • Audience/Sophomore
  • Audience/Junior
  • Tools/Dictionaries
  • Tools/Catalogs
  • Tools/Indexes
  • Formats/Books
  • Formats/Journals
  • Formats/Movies

These two-dimensional hierarchies work, most of the time, but what do you do when it comes to subjects? This works just fine, as far as it goes:

  • Subjects/Astronomy
  • Subjects/Mathematics
  • Subjects/Literature
  • Subjects/Music

To make a subjects hierarchy deeper you can start to list things in an inverted order. I call this Library Speak:

  • Subjects/Literature, English
  • Subjects/Literature, French
  • Subjects/Literature, Spanish

Such an approach can quickly get out of hand, especially when you want to list things as English Literature of the 20th Century, not Literature, English (20th Century). Incidentally, the addition of “20th Century” is not a subject but a time period. You could create a facet called time periods and a term called “20th Century”. A better example would be Elizabethan Literature.

To overcome this problem, the database could be redesigned. The facets table could have four fields instead of three:

  1. facet id
  2. facet name
  3. facet note
  4. parent id

In such a scenario there is no need for a terms table. Root facets (terms) would have a parent id of 0. All other facets (terms) would have a parent id of some other facet. For example, here is how a hierarchical list of subjects might be created:

  id  name                note                 parent id
  1   Subjects            Aboutness            0
  2   Literature          Writing              1
  3   Astronomy           Stars                1
  4   English Literature  Writing from the UK  2
  5   French Literature   Writing from France  2
Re-implementing the facet/term structure in this way would be a lot of work. It would change the database as well as the Perl API. Migrating applications from the current version of MyLibrary to another would be difficult. Moreover, how often do we really need infinitely deep hierarchies?

This is a possible problem to be solved.

Dublin Core is for initial discovery

I have heard people say, “MyLibrary is cool, but I can’t stuff all my data into its Dublin Core-based database structure.” When I hear this, I feel people don’t get the point; MyLibrary is not intended to contain all of your metadata, just the metadata primarily useful for initial discovery.

Once again, the framework of MyLibrary supports resources, people (librarians and patrons), and a simple controlled vocabulary scheme of your own design (facets and terms). The structure of resource objects is very Dublin Core-esque. There are additional attributes of resource objects that lend themselves very well as pointers to outside objects. FKey attributes and location objects come to mind. Save much of your metadata in MyLibrary, and point to the full record somewhere else.

Suppose then you had a set of MARC records (or XML files) and you wanted to use them as the foundation of your resource objects. You could implement a cross-walk from MARC to Dublin Core to MyLibrary. This would only give you some of the rich MARC data. Suppose too you knew exactly where an individual MARC record resided on your file system. You could then save this location as an FKey attribute. Next, you could full-text index your MARC records. Search results could return whatever you desired (titles, authors, etc.) or they could return MyLibrary resource ids or FKey values. Given this you could retrieve the full MARC record from the file system for final display.
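
Here is a rough sketch of such a cross-walk. To keep it self-contained, the MARC record is a plain hash of tag to values rather than a parsed record, and the result is a plain hash rather than a MyLibrary resource object. The attribute names (name, creator, subject, note, fkey) mirror the ones used in the indexing code in the search posting; the choice of tags, the pipe delimiter for repeated values, the file path, and the marc2resource routine itself are all assumptions for illustration.

```perl
use strict;
use warnings;

# assumed cross-walk: MARC tag => resource attribute (Dublin Core element)
my %crosswalk = (
  '100' => 'creator',  # main entry, personal name => creator
  '245' => 'name',     # title statement           => title
  '520' => 'note',     # summary                   => description
  '650' => 'subject',  # topical subject heading   => subject
);

# map one record (tag => list of values) to a hash of resource attributes;
# $path is where the full MARC record lives, saved as the FKey
sub marc2resource {
  my ( $marc, $path ) = @_;
  my %resource = ( fkey => $path );
  foreach my $tag ( sort keys %crosswalk ) {
    next unless $marc->{ $tag };
    $resource{ $crosswalk{ $tag } } = join '|', @{ $marc->{ $tag } };
  }
  return \%resource;
}

# hypothetical record and path
my $marc = {
  '100' => [ 'Austen, Jane' ],
  '245' => [ 'Pride and prejudice' ],
  '650' => [ 'Courtship', 'Sisters' ],
  '260' => [ 'London : Egerton, 1813' ],  # not in the cross-walk, so it stays in the MARC file
};
my $resource = marc2resource( $marc, '/data/marc/0001.mrc' );
print $_, ' = ', $resource->{ $_ }, "\n" foreach sort keys %$resource;
```

Anything the cross-walk drops, such as the imprint in field 260, is not lost. It is still in the full record that the FKey points to.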

“Okay,” you say, “but why, in this case, use MyLibrary at all?” The answer lies in the fact that there is more to digital libraries than search (indexing). You might want to create a browsable interface to your collection. That is a report against your database, not your index. You might want to create an OAI repository against your collection. That is another report against your database. You might want to transform your entire collection into another format, say for example, a printed catalog. (Heresy!) You might want to create relationships between different objects (resources to resources, resources to patrons, patrons to librarians, librarians to resources, etc.). All of these things are data manipulation functions, functions for your database.

MyLibrary is not expected to be a container for all your data. It is intended to save information about resources and people. Yes, the Dublin Core elements are limiting, but they are not useless. You do not need to save each and every metadata element of each and every resource in a MyLibrary resource object, just the most important ones, and I assert those important ones will almost always boil down to Dublin Core elements.