Package edu.umd.cloud9.collection.trec

Provides classes for working with the TREC collection (particularly disks 4 and 5).


Class Summary
BuildTrecForwardIndex Tool for building a document forward index for TREC collections.
DemoCountTrecDocuments Simple demo program that counts all the documents in the TREC collection.
NumberTrecDocuments Program that builds the mapping from TREC docids (String identifiers) to docnos (sequentially-numbered ints).
TrecDocnoMapping Object that maps between TREC docids (String identifiers) and docnos (sequentially-numbered ints).
TrecDocument Object representing a TREC document.
TrecDocumentInputFormat Hadoop InputFormat for processing the TREC collection.
TrecDocumentInputFormat.TrecDocumentRecordReader Hadoop RecordReader for reading TREC-formatted documents.
TrecForwardIndex Object representing a document forward index for TREC collections.
 

Package edu.umd.cloud9.collection.trec Description

Provides classes for working with the TREC collection (particularly disks 4 and 5). TREC disks 4 and 5 represent one of the standard collections used in information retrieval research. There are two common "views" of the collection:

Here are the two steps for preparing the collection for processing with Hadoop:

  1. The distribution of the collection consists of many individual small files (listed above). Since Hadoop works better with large files, it is advisable to cat the individual files together (e.g., with a simple Perl script).
  2. Since many information retrieval algorithms require a sequential numbering of documents, it is necessary to build a mapping between docids (e.g., LA123190-0134) and docnos (sequentially-numbered ints). The class NumberTrecDocuments accomplishes this. Here is a sample invocation:
     hadoop jar cloud9.jar edu.umd.cloud9.collection.trec.NumberTrecDocuments \
       /umd/collections/trec/trec4-5_noCRFR.xml \
       /user/jimmylin/trec-docid-tmp \
       /user/jimmylin/docno.mapping 100
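
To make the docid-to-docno idea in step 2 concrete, here is a minimal single-machine sketch. It is not the implementation used by NumberTrecDocuments or TrecDocnoMapping. The class name DocnoMappingSketch, the choice to sort docids lexicographically, and the choice to start numbering at 1 are all assumptions made for illustration. Apart from LA123190-0134, the docids are made-up examples.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: assigns sequential docnos to sorted docids.
// Not the actual Cloud9 TrecDocnoMapping implementation.
class DocnoMappingSketch {
    // Docids in sorted order; the docno is the array position plus one (assumption).
    private final String[] sortedDocids;

    DocnoMappingSketch(List<String> docids) {
        // De-duplicate and sort so that each docid gets exactly one docno.
        sortedDocids = docids.stream().distinct().sorted().toArray(String[]::new);
    }

    // Returns the 1-based docno for a docid, or -1 if the docid is unknown.
    int getDocno(String docid) {
        int index = Arrays.binarySearch(sortedDocids, docid);
        return index < 0 ? -1 : index + 1;
    }

    // Returns the docid for a 1-based docno, or null if it is out of range.
    String getDocid(int docno) {
        if (docno < 1 || docno > sortedDocids.length) {
            return null;
        }
        return sortedDocids[docno - 1];
    }

    public static void main(String[] args) {
        // Illustrative docids only.
        DocnoMappingSketch mapping = new DocnoMappingSketch(
            Arrays.asList("LA123190-0134", "FBIS3-1", "FT911-3"));
        System.out.println(mapping.getDocno("LA123190-0134"));
        System.out.println(mapping.getDocid(1));
    }
}
```

Keeping the docids in a sorted array means the docid-to-docno lookup is a binary search and the reverse lookup is a direct array access, so a single structure serves both directions.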
    

After the corpus has been prepared, it is ready for processing with Hadoop. The class DemoCountTrecDocuments is a simple demo program that counts all documents in the collection. It provides a skeleton for MapReduce programs that process the collection. Here is a sample invocation:

hadoop jar cloud9.jar edu.umd.cloud9.collection.trec.DemoCountTrecDocuments \
/umd/collections/trec/trec4-5_noCRFR.xml \
/user/jimmylin/count-tmp \
/user/jimmylin/docno.mapping 100

The output key-value pairs of this sample program are the docid-to-docno mappings.
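
For readers who want to see what counting documents amounts to without a Hadoop cluster, the following is a rough local sketch. It does not use DemoCountTrecDocuments, TrecDocumentInputFormat, or any other Cloud9 class. The class name TrecDocidScanSketch is hypothetical, and the code assumes that each document's opening and closing DOCNO tags appear on the same line, which may not hold for every file in the collection.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: scans TREC-formatted text and collects docids.
// Not part of Cloud9; a local stand-in for what the demo job counts.
class TrecDocidScanSketch {
    private static final String OPEN = "<DOCNO>";
    private static final String CLOSE = "</DOCNO>";

    // Returns every docid found, in order of appearance.
    // Assumption: <DOCNO> and </DOCNO> sit on the same line.
    static List<String> extractDocids(String text) {
        List<String> docids = new ArrayList<>();
        for (String line : text.split("\n")) {
            int start = line.indexOf(OPEN);
            int end = line.indexOf(CLOSE);
            if (start >= 0 && end > start) {
                docids.add(line.substring(start + OPEN.length(), end).trim());
            }
        }
        return docids;
    }

    public static void main(String[] args) {
        // Two tiny made-up documents in TREC markup.
        String sample =
            "<DOC>\n<DOCNO> LA123190-0134 </DOCNO>\n<TEXT>\nfirst\n</TEXT>\n</DOC>\n"
            + "<DOC>\n<DOCNO> FBIS3-1 </DOCNO>\n<TEXT>\nsecond\n</TEXT>\n</DOC>\n";
        List<String> docids = extractDocids(sample);
        System.out.println("documents: " + docids.size());
        System.out.println("docids: " + docids);
    }
}
```

The number of docids collected is the document count, provided every document carries exactly one DOCNO element.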