edu.umd.cloud9.collection.clue
Class ClueCollectionPathConstants

java.lang.Object
  extended by edu.umd.cloud9.collection.clue.ClueCollectionPathConstants

public class ClueCollectionPathConstants
extends Object

Class that provides convenience methods for processing portions of the Clue Web collection with Hadoop. Static methods in this class allow the user to easily "select" different portions of the collection to serve as input to a MapReduce job.

Author:
Jimmy Lin

Method Summary
static void addEnglishCollectionPart(JobConf conf, String base, int i)
          Adds a part (segment) of the Clue Web English collection to a Hadoop JobConf object.
static void addEnglishCompleteCollection(JobConf conf, String base)
          Adds the complete Clue Web English collection to a Hadoop JobConf object.
static void addEnglishSmallCollection(JobConf conf, String base)
          Adds the first part (segment) of the Clue Web English collection to a Hadoop JobConf object.
static void addEnglishTestFile(JobConf conf, String base)
          Adds a sample compressed WARC archive to a Hadoop JobConf object.
static void addEnglishTinyCollection(JobConf conf, String base)
          Adds the first section of the Clue Web English collection to a Hadoop JobConf object.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

addEnglishTestFile

public static void addEnglishTestFile(JobConf conf,
                                      String base)
Adds a sample compressed WARC archive to a Hadoop JobConf object. The specific archive is ClueWeb09_English_1/en0000/00.warc.gz, which contains 35,582 Web pages.

Parameters:
conf - Hadoop JobConf
base - base path for the Clue Web collection
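As a rough illustration of how the base path relates to the archive named above, the following sketch joins a base directory with the relative location ClueWeb09_English_1/en0000/00.warc.gz. The class ClueWebPathSketch and its resolve method are hypothetical helpers written for this example; they are not part of Cloud9 or Hadoop, and the assumption that the base and relative path are joined with a single "/" is not confirmed by this documentation.

```java
// Illustrative only: ClueWebPathSketch is a hypothetical helper,
// not part of Cloud9 or Hadoop.
final class ClueWebPathSketch {
    // Relative location of the sample archive, as named in the method description.
    static final String TEST_FILE = "ClueWeb09_English_1/en0000/00.warc.gz";

    // Joins a base path and a relative path with exactly one "/" between them
    // (assumed joining rule; the real method may behave differently).
    static String resolve(String base, String relative) {
        return base.endsWith("/") ? base + relative : base + "/" + relative;
    }

    public static void main(String[] args) {
        // The default base path here is a made-up example location.
        String base = args.length > 0 ? args[0] : "/collections/ClueWeb09";
        System.out.println(resolve(base, TEST_FILE));
    }
}
```

Passing the same base value to addEnglishTestFile would, under that assumption, select this single file as job input.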

addEnglishTinyCollection

public static void addEnglishTinyCollection(JobConf conf,
                                            String base)
Adds the first section of the Clue Web English collection to a Hadoop JobConf object. Specifically, this method adds the contents of ClueWeb09_English_1/en0000/, which contains 3,382,356 pages.

Parameters:
conf - Hadoop JobConf
base - base path for the Clue Web collection

addEnglishSmallCollection

public static void addEnglishSmallCollection(JobConf conf,
                                             String base)
Adds the first part (segment) of the Clue Web English collection to a Hadoop JobConf object. Specifically, this method adds the contents of ClueWeb09_English_1/, which contains 50,220,423 pages.

Parameters:
conf - Hadoop JobConf
base - base path for the Clue Web collection

addEnglishCompleteCollection

public static void addEnglishCompleteCollection(JobConf conf,
                                                String base)
Adds the complete Clue Web English collection to a Hadoop JobConf object. Specifically, this method adds the contents of ClueWeb09_English_1/ through ClueWeb09_English_10/, which together contain 503,903,810 pages.

Parameters:
conf - Hadoop JobConf
base - base path for the Clue Web collection
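When choosing how much of the collection to feed to a job, it can help to compare the page counts quoted in the method descriptions above. The sketch below records those four figures in a hypothetical enum (ClueWebSubsetSketch is an illustrative name, not a Cloud9 type) and computes each subset's share of the complete English collection.

```java
// Illustrative only: this enum is a hypothetical summary of the figures quoted
// in the method descriptions; it is not part of Cloud9 or Hadoop.
enum ClueWebSubsetSketch {
    TEST_FILE("ClueWeb09_English_1/en0000/00.warc.gz", 35_582L),
    TINY("ClueWeb09_English_1/en0000/", 3_382_356L),
    SMALL("ClueWeb09_English_1/", 50_220_423L),
    COMPLETE("ClueWeb09_English_1/ through ClueWeb09_English_10/", 503_903_810L);

    final String contents; // what the corresponding method adds
    final long pages;      // page count as stated in the documentation

    ClueWebSubsetSketch(String contents, long pages) {
        this.contents = contents;
        this.pages = pages;
    }

    // Fraction of the complete English collection covered by this subset.
    double fractionOfComplete() {
        return (double) pages / COMPLETE.pages;
    }

    public static void main(String[] args) {
        for (ClueWebSubsetSketch s : values()) {
            System.out.printf("%-10s %,12d pages (%.4f%% of complete)%n",
                    s, s.pages, 100.0 * s.fractionOfComplete());
        }
    }
}
```

By these figures the "small" collection is just under one tenth of the complete collection, which makes it a reasonable intermediate step between the "tiny" collection and a full run.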

addEnglishCollectionPart

public static void addEnglishCollectionPart(JobConf conf,
                                            String base,
                                            int i)
Adds a part (segment) of the Clue Web English collection to a Hadoop JobConf object. Part 1 corresponds to the contents of ClueWeb09_English_1/ (i.e., the "small" collection), and so on through part 10, which corresponds to ClueWeb09_English_10/. Note that adding all ten parts is equivalent to adding the complete English collection.

Parameters:
conf - Hadoop JobConf
base - base path for the Clue Web collection
i - number of the part (segment) to add, from 1 through 10
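The mapping from part number to directory, and the equivalence between adding all ten parts and adding the complete collection, can be sketched as follows. ClueWebPartsSketch, partDir, and allPartDirs are hypothetical names introduced for this example and are not part of Cloud9; the range check and the way the base path is joined are assumptions, since this documentation does not say how the real method handles out-of-range values or trailing slashes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: ClueWebPartsSketch is a hypothetical helper,
// not part of Cloud9 or Hadoop.
final class ClueWebPartsSketch {
    // The English collection is split into parts 1 through 10.
    static final int NUM_PARTS = 10;

    // Returns the directory for part i under the given base path.
    // Rejecting values outside 1..10 is an assumption made for this sketch.
    static String partDir(String base, int i) {
        if (i < 1 || i > NUM_PARTS) {
            throw new IllegalArgumentException(
                    "part must be in 1.." + NUM_PARTS + ", got " + i);
        }
        String sep = base.endsWith("/") ? "" : "/";
        return base + sep + "ClueWeb09_English_" + i + "/";
    }

    // Lists the directories of all ten parts, i.e. the complete English collection.
    static List<String> allPartDirs(String base) {
        List<String> dirs = new ArrayList<>();
        for (int i = 1; i <= NUM_PARTS; i++) {
            dirs.add(partDir(base, i));
        }
        return dirs;
    }
}
```

In the same spirit, calling addEnglishCollectionPart in a loop for i = 1 through 10 should select the same input as a single call to addEnglishCompleteCollection.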