edu.umd.cloud9.collection.wikipedia
Class RepackWikipedia
java.lang.Object
org.apache.hadoop.conf.Configured
edu.umd.cloud9.collection.wikipedia.RepackWikipedia
- All Implemented Interfaces:
- Configurable, Tool
public class RepackWikipedia
- extends Configured
- implements Tool
Tool for repacking Wikipedia XML dumps into SequenceFiles.
The program takes the following command-line arguments, in this order:
- [xml-dump-file] XML dump file
- [output-path] output path
- [docno-mapping-data-file] docno mapping data file
- (block|record|none) to indicate block compression, record compression, or no compression
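The argument list above can be illustrated with a small validation sketch. This is not the tool's own code: the class RepackArgsSketch, its fields, and the parse method are hypothetical names introduced here, and the sketch assumes exactly four positional arguments with the compression type matched as a lowercase literal. The real tool may handle its arguments differently.

```java
/**
 * Hypothetical helper (not part of Cloud9): checks the four positional
 * arguments described in the class documentation.
 */
final class RepackArgsSketch {

    /** The three compression settings named in the argument list. */
    enum Compression { BLOCK, RECORD, NONE }

    final String xmlDumpFile;
    final String outputPath;
    final String docnoMappingFile;
    final Compression compression;

    private RepackArgsSketch(String xmlDumpFile, String outputPath,
                             String docnoMappingFile, Compression compression) {
        this.xmlDumpFile = xmlDumpFile;
        this.outputPath = outputPath;
        this.docnoMappingFile = docnoMappingFile;
        this.compression = compression;
    }

    /**
     * Parses [xml-dump-file] [output-path] [docno-mapping-data-file]
     * (block|record|none). Assumption: any other shape is rejected.
     */
    static RepackArgsSketch parse(String[] args) {
        if (args == null || args.length != 4) {
            throw new IllegalArgumentException(
                "usage: [xml-dump-file] [output-path] "
                + "[docno-mapping-data-file] (block|record|none)");
        }
        Compression compression;
        switch (args[3]) {
            case "block":
                compression = Compression.BLOCK;
                break;
            case "record":
                compression = Compression.RECORD;
                break;
            case "none":
                compression = Compression.NONE;
                break;
            default:
                throw new IllegalArgumentException(
                    "compression type must be block, record, or none: " + args[3]);
        }
        return new RepackArgsSketch(args[0], args[1], args[2], compression);
    }
}
```

Calling RepackArgsSketch.parse with the four values from a command line returns an object whose compression field is one of BLOCK, RECORD, or NONE. Any other argument count or compression keyword raises IllegalArgumentException.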
Here's a sample invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.wikipedia.RepackWikipedia \
-libjars bliki-core-3.0.15.jar,commons-lang-2.5.jar \
/user/jimmy/Wikipedia/raw/enwiki-20101011-pages-articles.xml \
/user/jimmy/Wikipedia/compressed.block/en-20101011 \
/user/jimmy/Wikipedia/docno-en-20101011.dat block
- Author:
- Jimmy Lin
RepackWikipedia
public RepackWikipedia()
run
public int run(String[] args)
throws Exception
- Runs this tool.
- Specified by:
run in interface Tool
- Throws:
Exception
main
public static void main(String[] args)
throws Exception
- Throws:
Exception