| Class Summary | |
|---|---|
| BuildClueWarcForwardIndex | Tool for building a document forward index for the ClueWeb09 collection. |
| ClueCollectionPathConstants | Class that provides convenience methods for processing portions of the ClueWeb09 collection with Hadoop. |
| ClueWarcDocnoMapping | Object that maps between WARC-TREC-IDs (String identifiers) and docnos (sequentially-numbered ints). |
| ClueWarcForwardIndex | |
| ClueWarcInputFormat | |
| ClueWarcInputFormat.ClueWarcRecordReader | |
| ClueWarcRecord | |
| DemoCountClueWarcRecords | Simple demo program to count the number of records in the ClueWeb09 collection, from either the original source WARC files or repacked SequenceFiles (controlled by the first command-line parameter). |
| RepackClueWarcRecords | Program to uncompress the ClueWeb09 collection from the original distribution WARC files and repack as SequenceFiles. |
| ScanBlockCompressedSequenceFile | |

Provides classes for working with the ClueWeb09 collection. The dataset consists of one billion web pages (5 TB compressed, 25 TB uncompressed) in ten languages, collected in January and February 2009. Its creation, supported by the U.S. National Science Foundation (NSF), was led by Jamie Callan of the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies.
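
The class summary describes ClueWarcDocnoMapping as a two-way mapping between WARC-TREC-IDs and sequentially-numbered docnos. The following is a minimal sketch of that general idea, not the class's actual implementation: it assumes docnos are 1-based positions in a lexicographically sorted array of identifiers, and the class name `DocnoMappingSketch`, its method names, and the sample identifiers are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is NOT ClueWarcDocnoMapping.
// Assumption: a docno is the 1-based rank of an ID in sorted order.
class DocnoMappingSketch {
    private final String[] sortedIds;

    DocnoMappingSketch(String[] ids) {
        // Copy so the caller's array is left untouched, then sort once.
        sortedIds = ids.clone();
        Arrays.sort(sortedIds);
    }

    // String identifier -> docno, or -1 if the identifier is unknown.
    int getDocno(String docid) {
        int pos = Arrays.binarySearch(sortedIds, docid);
        return pos >= 0 ? pos + 1 : -1;
    }

    // docno -> String identifier, or null if the docno is out of range.
    String getDocid(int docno) {
        if (docno < 1 || docno > sortedIds.length) {
            return null;
        }
        return sortedIds[docno - 1];
    }
}
```

With this layout, a lookup by identifier is a binary search and a lookup by docno is a single array access, so only the sorted identifier list needs to be kept in memory. Consult the ClueWarcDocnoMapping class page for the methods it actually exposes.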