edu.umd.cloud9.collection.wikipedia
Class BuildWikipediaDocnoMapping
java.lang.Object
org.apache.hadoop.conf.Configured
edu.umd.cloud9.collection.wikipedia.BuildWikipediaDocnoMapping
- All Implemented Interfaces:
- Configurable, Tool
public class BuildWikipediaDocnoMapping
- extends Configured
- implements Tool
Tool for building the mapping between Wikipedia internal ids (docids) and
sequentially numbered integers (docnos). The program takes four command-line
arguments:
- [input] path to the Wikipedia XML dump file
- [output-dir] path to temporary MapReduce output directory
- [output-file] path to the mappings file to be written
- [num-mappers] number of mappers to run
Here's a sample invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.wikipedia.BuildWikipediaDocnoMapping \
-libjars bliki-core-3.0.15.jar,commons-lang-2.5.jar \
/user/jimmy/Wikipedia/raw/enwiki-20101011-pages-articles.xml tmp \
/user/jimmy/Wikipedia/docno-en-20101011.dat 100
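To make the docid/docno relationship concrete, here is a minimal sketch of one way such a mapping could be held in memory. It is an illustration only, not the Cloud9 implementation. The class name DocnoMappingSketch and its methods are hypothetical. Numbering docnos from 1 in ascending docid order is an assumption, and the sketch says nothing about the actual layout of the mappings file.

```java
import java.util.Arrays;

/**
 * Hypothetical illustration of a docid-to-docno mapping; not part of Cloud9.
 * Assumption: docnos are assigned 1, 2, 3, ... in ascending docid order.
 */
final class DocnoMappingSketch {
    // Distinct docids in ascending order; the docno is the array index plus one.
    private final int[] sortedDocids;

    private DocnoMappingSketch(int[] sortedDocids) {
        this.sortedDocids = sortedDocids;
    }

    /** Builds the mapping from docids given in any order; duplicates are dropped. */
    static DocnoMappingSketch fromDocids(int[] docids) {
        int[] sorted = Arrays.stream(docids).distinct().sorted().toArray();
        return new DocnoMappingSketch(sorted);
    }

    /** Returns the 1-based docno for a docid, or -1 if the docid is unknown. */
    int getDocno(int docid) {
        int index = Arrays.binarySearch(sortedDocids, docid);
        return index < 0 ? -1 : index + 1;
    }

    /** Returns the docid for a 1-based docno, or -1 if the docno is out of range. */
    int getDocid(int docno) {
        if (docno < 1 || docno > sortedDocids.length) {
            return -1;
        }
        return sortedDocids[docno - 1];
    }

    /** Returns the number of documents in the mapping. */
    int size() {
        return sortedDocids.length;
    }
}
```

Keeping the docids in a sorted array makes docno-to-docid lookup a direct array access and docid-to-docno lookup a binary search, which is one reason a dense sequential numbering is convenient for downstream processing.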
- Author:
- Jimmy Lin
BuildWikipediaDocnoMapping
public BuildWikipediaDocnoMapping()
run
public int run(String[] args)
throws Exception
- Specified by:
run in interface Tool
- Throws:
Exception
main
public static void main(String[] args)
throws Exception
- Throws:
Exception