Package edu.umd.cloud9.collection.clue

Provides classes for working with the ClueWeb09 collection.

Class Summary
BuildClueWarcForwardIndex Tool for building a document forward index for the ClueWeb09 collection.
ClueCollectionPathConstants Class that provides convenience methods for processing portions of the ClueWeb09 collection with Hadoop.
ClueWarcDocnoMapping Object that maps between WARC-TREC-IDs (String identifiers) and docnos (sequentially numbered ints).
ClueWarcForwardIndex  
ClueWarcInputFormat  
ClueWarcInputFormat.ClueWarcRecordReader  
ClueWarcRecord  
DemoCountClueWarcRecords Simple demo program to count the number of records in the ClueWeb09 collection, from either the original source WARC files or repacked SequenceFiles (controlled by the first command-line parameter).
RepackClueWarcRecords Program to uncompress the ClueWeb09 collection from the original distribution WARC files and repack as SequenceFiles.
ScanBlockCompressedSequenceFile  
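
To make the idea of a docno mapping concrete, the following is a minimal sketch of a two-way mapping between WARC-TREC-IDs and sequentially numbered ints. It is not the ClueWarcDocnoMapping implementation: the class name DocnoMappingSketch, its method names, the sorted-array approach, and the choice to start docnos at 1 are all assumptions made for illustration, and the sample identifiers are made up in the ClueWeb09 naming style.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is NOT Cloud9's ClueWarcDocnoMapping.
// Assumes the identifiers are unique and that docnos start at 1.
class DocnoMappingSketch {
    // WARC-TREC-IDs kept in sorted order; position i holds docno i + 1.
    private final String[] sortedIds;

    DocnoMappingSketch(String[] warcTrecIds) {
        sortedIds = warcTrecIds.clone();
        Arrays.sort(sortedIds);
    }

    // Returns the docno for an identifier, or -1 if the identifier is unknown.
    int getDocno(String warcTrecId) {
        int pos = Arrays.binarySearch(sortedIds, warcTrecId);
        return pos < 0 ? -1 : pos + 1;
    }

    // Returns the identifier for a docno, or null if the docno is out of range.
    String getDocid(int docno) {
        if (docno < 1 || docno > sortedIds.length) {
            return null;
        }
        return sortedIds[docno - 1];
    }

    public static void main(String[] args) {
        // Made-up identifiers for demonstration.
        DocnoMappingSketch mapping = new DocnoMappingSketch(new String[] {
            "clueweb09-en0000-00-00001",
            "clueweb09-en0000-00-00000",
            "clueweb09-en0000-01-00000"
        });
        System.out.println(mapping.getDocno("clueweb09-en0000-00-00001"));
        System.out.println(mapping.getDocid(3));
    }
}
```

Working with compact int docnos instead of String identifiers keeps keys small in Hadoop jobs; the sorted array here trades a binary search per lookup for a simple, memory-light structure.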
 

Package edu.umd.cloud9.collection.clue Description

Provides classes for working with the ClueWeb09 collection. The dataset consists of one billion web pages (5 TB compressed, 25 TB uncompressed), in ten languages, collected in January and February 2009. Its creation, supported by the U.S. National Science Foundation (NSF), was led by Jamie Callan of the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies.