The summary: I was looking for an easy way to search through minutes of the DAWG, given that some but not all of the minutes are reproduced in plain text within a mailing list message. All minutes are (in one way or another) URL accessible, however, so I set up Apache Nutch to crawl, index, and search the minutes. I learned a few things along the way, and that's what the rest of this post shares.
One of the first things I'm doing as I'm getting up to speed in my new role as DAWG chair is finding the issues the DAWG has not yet resolved and determining whether we're on target to address the issues. One of the issues raised a few months ago was the syntactical order of the LIMIT and OFFSET keywords within queries. I had remembered that the group had reached a decision about this issue, but did not remember the details. I wanted to find the minutes which recorded the decision.
I could have searched the mailing list for limit and offset and probably found what I needed by perusing the search results. But not all minutes make it into mailing list messages as something other than links or attachments, and I didn't want to wade through general discussion. I'd rather be able to search the minutes explicitly. So here's what I did:
(I work in a Windows XP environment with a standard Cygwin installation.)
- Updated the DAWG homepage, adding links to minutes of the past few months' teleconferences.
- Dug up a script I'd written last year to pull links from a Web page where the text of the link matches a certain pattern. Invoked this script with the pattern '\d+\s+?\w{3}' against the URL http://www.w3.org/2001/sw/DataAccess/ to pull out all the links to minutes from the Web page. This heuristic approach works well, but it would feel far more elegant to have the markup authoritatively tell me which links were links to minutes. Via RDFa, perhaps. I redirected the list of links produced by this script to the text file, dawg-minutes/root-urls/minutes.
- Downloaded the latest version of Apache Nutch and unzipped it, adding a symlink from nutch-install-dir/bin/nutch such that nutch ended up in my path.
- Followed instructions #2 and #3 from the Nutch user manual. This involves supplying a name to the user agent which Nutch crawls the Web with and also specifying a URL filter that decides which pages to crawl (or which pages not to crawl). To be on the safe side, I added these two lines to nutch-install-dir/conf/crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*w3c.org/
+^http://([a-z0-9]*\.)*w3.org
- The next step was to crawl the list of links I had already generated. I didn't want to follow any other links from these URLs, so this was a pretty simple invocation of Nutch. I did get trapped for a bit by the fact that earlier versions of Nutch required the command-line argument to be a text file with the list of URLs while the current version requires the argument to be the directory containing lists of links. I ended up invoking nutch as:
cd dawg-minutes ; nutch crawl root-urls -dir nutch/ -depth 1
This fetched, crawled, and indexed the set of DAWG minutes (but no other links thanks to the -depth 1) and stored the resulting data structures within the nutch subdirectory.
- At this point, I had (still unresolved) trouble getting the command-line search tool to work:
nutch org.apache.nutch.searcher.NutchBean apache
Regardless of the working directory from which I executed this, I always received Total hits: 0. This problem led me to discover Luke, the Lucene Index Toolbox, which confirmed for me that my indexes had been properly created and populated.
- I pressed ahead with getting Nutch's Web interface set up. I already had an installation of Apache Tomcat 5.5, so no installation was needed there. Instead, I copied the file nutch-install-dir/nutch-version.war to nutch.war at the root of my Tomcat webapps directory.
- I started Tomcat from the dawg-minutes/nutch directory (where Nutch had put all of its indexes and other data structures), and launched a Web browser to http://localhost:5000/nutch. (The default Tomcat install runs on port 8080, I believe; I have too many programs clamoring for my port 8080.)
- The Nutch search interface appeared, but again any searches that I performed led to no hits being returned!
- Some Web searching led me to a mailing-list message which suggested investigating the searcher.dir property in webapps/nutch/WEB-INF/classes/nutch-site.xml. I added this property with a value of c:/documents and settings/.../dawg-minutes/nutch and restarted Tomcat.
- All's well that ends well.
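The link-pulling step near the top of the list can be sketched in Python roughly as follows. This is not the original script, only a minimal illustration of the same heuristic: collect every link on a page and keep those whose link text matches a pattern. The names LinkTextCollector and links_matching, and the example.org sample page, are made up for illustration; only the pattern comes from the steps above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTextCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs for each <a href="..."> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def links_matching(html, base_url, pattern):
    """Return the URLs of links whose text contains a match for pattern."""
    collector = LinkTextCollector(base_url)
    collector.feed(html)
    collector.close()
    regex = re.compile(pattern)
    return [url for url, text in collector.links if regex.search(text)]


# Hypothetical sample page; the real input was the DAWG homepage.
SAMPLE_HTML = """
<ul>
  <li><a href="minutes/2006-01-10.html">10 Jan</a> teleconference</li>
  <li><a href="about.html">About the group</a></li>
</ul>
"""

matches = links_matching(SAMPLE_HTML, "https://example.org/group/", r"\d+\s+?\w{3}")
print("\n".join(matches))
```

Redirecting the printed list to a file gives one URL per line, which is the shape of seed list the crawl step reads from the root-urls directory. Fetching a live page instead of the inline sample would only need urllib.request.urlopen in front of links_matching.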
So I ran into a few speed bumps, but in the end I've got a relatively lightweight system for indexing and searching DAWG minutes. Hooray!
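A postscript for anyone adapting the URL filter: the two filter lines can be sanity-checked outside Nutch before a crawl. The sketch below assumes, as the comments in the stock crawl-urlfilter.txt describe, that each line is a + or - followed by a regular expression, that the first matching line decides, and that a URL matching nothing is ignored. It also assumes Python's re module treats these particular patterns the same way Nutch's Java regexes do. The function name accepts is hypothetical.

```python
import re

# The two filter lines from the crawl-urlfilter.txt step, copied verbatim.
FILTER_LINES = [
    r"+^http://([a-z0-9]*\.)*w3c.org/",
    r"+^http://([a-z0-9]*\.)*w3.org",
]


def accepts(url, filter_lines=FILTER_LINES):
    """Mimic first-match-wins filtering: '+' includes, '-' excludes."""
    for line in filter_lines:
        sign, pattern = line[0], line[1:]
        if re.search(pattern, url):
            return sign == "+"
    # No line matched, so the URL is ignored.
    return False


for candidate in (
    "http://www.w3.org/2001/sw/DataAccess/",
    "http://lists.w3.org/Archives/Public/public-rdf-dawg/",
    "http://example.org/",
):
    print(candidate, accepts(candidate))
```

A URL from the seed list that comes back False here would likely be skipped by the crawl as well, which is a cheaper way to find a filter mistake than an empty index.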