edu.umd.cloud9.collection.wikipedia
Class DemoCountWikipediaPages
java.lang.Object
org.apache.hadoop.conf.Configured
edu.umd.cloud9.collection.wikipedia.DemoCountWikipediaPages
- All Implemented Interfaces:
- Configurable, Tool
public class DemoCountWikipediaPages
extends Configured
implements Tool
Tool for counting the number of pages in a particular Wikipedia XML dump
file. This program keeps track of the total number of pages, redirect pages,
disambiguation pages, empty pages, actual articles (including stubs), stubs,
and non-articles ("File:", "Category:", "Wikipedia:", etc.). It also
provides a skeleton for MapReduce programs that process the collection. The
program takes a single command-line argument: the path to the Wikipedia XML
dump file.
Here's a sample invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.wikipedia.DemoCountWikipediaPages \
-libjars bliki-core-3.0.15.jar,commons-lang-2.5.jar \
/user/jimmy/Wikipedia/raw/enwiki-20101011-pages-articles.xml
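The counters described above can be illustrated with a small, Hadoop-free sketch. It is not the class's actual implementation: `PageCountSketch`, `Page`, `PageType`, and the text heuristics (a leading `#REDIRECT`, a `{{disambig` template, a `-stub}}` template suffix, a fixed list of namespace prefixes) are all assumptions made for illustration. The real program would rely on the collection's own page parsing inside a MapReduce job.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class PageCountSketch {
    // Counter names mirror the categories listed in the class description.
    enum PageType { TOTAL, REDIRECT, DISAMBIGUATION, EMPTY, ARTICLE, STUB, NON_ARTICLE }

    // Hypothetical stand-in for a parsed page: a title plus raw wiki markup.
    static final class Page {
        final String title;
        final String text;
        Page(String title, String text) { this.title = title; this.text = text; }
    }

    // Namespace prefixes treated as non-articles here (illustrative, not exhaustive).
    private static final List<String> NON_ARTICLE_PREFIXES =
        Arrays.asList("File:", "Category:", "Wikipedia:");

    private static boolean isNonArticle(String title) {
        for (String prefix : NON_ARTICLE_PREFIXES) {
            if (title.startsWith(prefix)) return true;
        }
        return false;
    }

    private static void bump(Map<PageType, Integer> counts, PageType type) {
        counts.put(type, counts.get(type) + 1);
    }

    // Classifies each page once and tallies it; stubs are counted as articles too.
    static Map<PageType, Integer> count(List<Page> pages) {
        Map<PageType, Integer> counts = new EnumMap<>(PageType.class);
        for (PageType type : PageType.values()) counts.put(type, 0);

        for (Page page : pages) {
            bump(counts, PageType.TOTAL);
            String text = page.text.trim();
            String lower = text.toLowerCase(Locale.ROOT);
            if (lower.startsWith("#redirect")) {           // assumed heuristic
                bump(counts, PageType.REDIRECT);
            } else if (lower.contains("{{disambig")) {     // assumed heuristic
                bump(counts, PageType.DISAMBIGUATION);
            } else if (text.isEmpty()) {
                bump(counts, PageType.EMPTY);
            } else if (isNonArticle(page.title)) {
                bump(counts, PageType.NON_ARTICLE);
            } else {
                bump(counts, PageType.ARTICLE);
                if (lower.contains("-stub}}")) {           // assumed heuristic
                    bump(counts, PageType.STUB);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
            new Page("Foo", "#REDIRECT [[Bar]]"),
            new Page("Quark", "A quark is a particle. {{physics-stub}}"),
            new Page("Category:Physics", "[[Category:Science]]"));
        System.out.println(count(pages));
    }
}
```

In a MapReduce setting, each `bump` call would correspond to incrementing a job counter in the mapper, so the totals are aggregated by the framework rather than in a local map.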
- Author:
- Jimmy Lin
DemoCountWikipediaPages
public DemoCountWikipediaPages()
run
public int run(String[] args)
throws Exception
- Specified by:
run in interface Tool
- Throws:
Exception
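Because run returns an int status and the program expects exactly one argument, the argument check can be sketched as below. This is a minimal stand-in that uses no Hadoop types: `SingleArgSketch`, the usage message, and the specific return values (-1 on misuse, 0 on success) are assumptions for illustration, not the documented behavior of this class.

```java
class SingleArgSketch {
    // Mirrors the int-returning shape of a run(String[]) method.
    // Assumption: -1 signals bad usage and 0 signals success.
    static int run(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: SingleArgSketch <path-to-wikipedia-xml-dump>");
            return -1;
        }
        String inputPath = args[0];
        // A real job would be configured and submitted here; this sketch only echoes the path.
        System.out.println("input path: " + inputPath);
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("status: " + run(args));
    }
}
```

When a Tool is launched through Hadoop's ToolRunner, generic options such as -libjars are parsed first, so a check like this sees only the remaining program arguments.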
main
public static void main(String[] args)
throws Exception
- Throws:
Exception