edu.umd.cloud9.collection.wikipedia
Class DumpWikipediaToPlainText
java.lang.Object
org.apache.hadoop.conf.Configured
edu.umd.cloud9.collection.wikipedia.DumpWikipediaToPlainText
- All Implemented Interfaces:
- Configurable, Tool
public class DumpWikipediaToPlainText
- extends Configured
- implements Tool
Tool that takes a Wikipedia XML dump file and writes the articles to a flat
text file (article title and content, separated by a tab).
Here's a sample invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.wikipedia.DumpWikipediaToPlainText \
-libjars bliki-core-3.0.15.jar,commons-lang-2.5.jar \
/user/jimmy/Wikipedia/raw/enwiki-20101011-pages-articles.xml /user/jimmy/Wikipedia/txt
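The output format described above (title, tab, content) can be read back with a few lines of Java. The following is a minimal sketch, not part of Cloud9: the class name ArticleLine and its parse method are hypothetical, and it assumes each article occupies one line of the output and that only the first tab on a line separates the title from the content.

```java
// Hypothetical helper, not part of Cloud9: holds one parsed line of the
// plain-text dump (assumed layout: title, a tab, then the article content).
final class ArticleLine {
    final String title;
    final String content;

    ArticleLine(String title, String content) {
        this.title = title;
        this.content = content;
    }

    // Splits at the first tab only; any later tabs are assumed to belong
    // to the article content and are kept as-is.
    static ArticleLine parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no tab separator in line: " + line);
        }
        return new ArticleLine(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        // Made-up sample line, for illustration only.
        ArticleLine article = parse("Anarchism\tAnarchism is a political philosophy.");
        System.out.println(article.title + ": " + article.content);
    }
}
```

Splitting on the first tab rather than calling String.split keeps any tabs inside the content intact, which matters if the dumped article text is not otherwise sanitized.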
- Author:
- Jimmy Lin
DumpWikipediaToPlainText
public DumpWikipediaToPlainText()
run
public int run(String[] args)
throws Exception
- Specified by:
run in interface Tool
- Throws:
Exception
main
public static void main(String[] args)
throws Exception
- Throws:
Exception