| Class Summary | |
|---|---|
| BuildTrecForwardIndex | Tool for building a document forward index for TREC collections. |
| DemoCountTrecDocuments | Simple demo program that counts all the documents in the TREC collection. |
| NumberTrecDocuments | Program that builds the mapping from TREC docids (String identifiers) to docnos (sequentially-numbered ints). |
| TrecDocnoMapping | Object that maps between TREC docids (String identifiers) and docnos (sequentially-numbered ints). |
| TrecDocument | Object representing a TREC document. |
| TrecDocumentInputFormat | Hadoop InputFormat for processing the TREC collection. |
| TrecDocumentInputFormat.TrecDocumentRecordReader | Hadoop RecordReader for reading TREC-formatted documents. |
| TrecForwardIndex | Object representing a document forward index for TREC collections. |
Provides classes for working with the TREC collection (particularly disks 4 and 5). TREC disks 4 and 5 represent one of the standard collections used in information retrieval research. There are two common "views" of the collection:
Here are the two steps for preparing the collection for processing with Hadoop:
Create the mapping between docids (String identifiers, e.g., LA123190-0134) and docnos
(sequentially-numbered ints). The
class NumberTrecDocuments
accomplishes this. Here is a sample invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.trec.NumberTrecDocuments \
  /umd/collections/trec/trec4-5_noCRFR.xml \
  /user/jimmylin/trec-docid-tmp \
  /user/jimmylin/docno.mapping 100
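To make the docid/docno relationship concrete, here is a minimal plain-Java sketch of such a mapping. It is not the Cloud9 TrecDocnoMapping class: the class name DocnoMappingSketch, its method names, the sorted-order numbering, and the choice to start docnos at 1 are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a docid <-> docno mapping; not the Cloud9 implementation. */
public class DocnoMappingSketch {
    // Docids in sorted order; the docno of sortedDocids[i] is i + 1
    // (assumption: docnos are numbered sequentially starting at 1).
    private final String[] sortedDocids;

    public DocnoMappingSketch(List<String> docids) {
        sortedDocids = docids.toArray(new String[0]);
        Arrays.sort(sortedDocids);
    }

    /** Returns the docno for a docid, or -1 if the docid is unknown. */
    public int getDocno(String docid) {
        int idx = Arrays.binarySearch(sortedDocids, docid);
        return idx < 0 ? -1 : idx + 1;
    }

    /** Returns the docid for a docno in the range [1, size()]. */
    public String getDocid(int docno) {
        return sortedDocids[docno - 1];
    }

    public int size() {
        return sortedDocids.length;
    }

    public static void main(String[] args) {
        // Made-up docids in the style of TREC identifiers.
        DocnoMappingSketch mapping = new DocnoMappingSketch(
            List.of("LA123190-0134", "FBIS3-1", "FT911-3"));
        for (int docno = 1; docno <= mapping.size(); docno++) {
            System.out.println(mapping.getDocid(docno) + "\t" + docno);
        }
    }
}
```

Keeping the docids in a sorted array means the docno-to-docid lookup is a direct index and the reverse lookup is a binary search, so no second data structure is needed.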
After the corpus has been prepared, it is ready for processing with
Hadoop. The
class DemoCountTrecDocuments
is a simple demo program that counts all documents in the collection.
It provides a skeleton for MapReduce programs that process the
collection. Here is a sample invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.trec.DemoCountTrecDocuments \
  /umd/collections/trec/trec4-5_noCRFR.xml \
  /user/jimmylin/count-tmp \
  /user/jimmylin/docno.mapping 100
The output key-value pairs of this sample program are the docid-to-docno mappings.
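As a rough illustration of what counting documents involves, the following plain-Java sketch (no Hadoop) scans TREC-formatted text and extracts one docid per document. It assumes the standard TREC markup, in which each document is wrapped in `<DOC>...</DOC>` and its String identifier appears inside a `<DOCNO>` tag (the tag name is TREC's; this package calls the identifier a docid). The class name TrecCountSketch and the sample text are made up, and the code is not taken from DemoCountTrecDocuments or TrecDocumentInputFormat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: extract docids from TREC-formatted text and count documents. */
public class TrecCountSketch {
    // One TREC document spans <DOC> ... </DOC>.
    private static final Pattern DOC =
        Pattern.compile("<DOC>(.*?)</DOC>", Pattern.DOTALL);
    // The String identifier sits inside <DOCNO> ... </DOCNO>, padded by whitespace.
    private static final Pattern DOCNO =
        Pattern.compile("<DOCNO>\\s*(.*?)\\s*</DOCNO>", Pattern.DOTALL);

    /** Returns the docid of every document found in the text, in order of appearance. */
    public static List<String> extractDocids(String text) {
        List<String> docids = new ArrayList<>();
        Matcher doc = DOC.matcher(text);
        while (doc.find()) {
            Matcher id = DOCNO.matcher(doc.group(1));
            if (id.find()) {
                docids.add(id.group(1));
            }
        }
        return docids;
    }

    public static void main(String[] args) {
        // Two made-up documents in TREC-style markup.
        String sample =
            "<DOC>\n<DOCNO> LA123190-0134 </DOCNO>\n<TEXT>first</TEXT>\n</DOC>\n"
          + "<DOC>\n<DOCNO> FT911-3 </DOCNO>\n<TEXT>second</TEXT>\n</DOC>\n";
        List<String> docids = extractDocids(sample);
        System.out.println("documents: " + docids.size());
        for (String docid : docids) {
            System.out.println(docid);
        }
    }
}
```

A MapReduce version would distribute this scan across input splits; the regular expressions here are only a compact way to show the record boundaries.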