edu.umd.cloud9.collection.wikipedia
Class DemoCountWikipediaPages

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by edu.umd.cloud9.collection.wikipedia.DemoCountWikipediaPages
All Implemented Interfaces:
Configurable, Tool

public class DemoCountWikipediaPages
extends Configured
implements Tool

Tool for counting the pages in a Wikipedia XML dump file. The program tracks the total number of pages, redirect pages, disambiguation pages, empty pages, actual articles (including stubs), stubs, and non-articles ("File:", "Category:", "Wikipedia:", etc.). It also serves as a skeleton for MapReduce programs that process the collection. The program takes a single command-line argument: the path to the Wikipedia XML dump file.

Here's a sample invocation:

 hadoop jar cloud9.jar edu.umd.cloud9.collection.wikipedia.DemoCountWikipediaPages \
   -libjars bliki-core-3.0.15.jar,commons-lang-2.5.jar \
   /user/jimmy/Wikipedia/raw/enwiki-20101011-pages-articles.xml
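
The page categories named in the class description can be illustrated with a small, Hadoop-free sketch. This is not the source of DemoCountWikipediaPages: the PageCountSketch class, its Page type, the PageType enum, and the order in which the checks are applied are all assumptions made for illustration. The only details taken from the description are the category names, the example non-article prefixes, and the fact that stubs are also counted as articles.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: tallies pages by the categories listed in the class description. */
final class PageCountSketch {

    /** Counter names mirroring the description; the enum itself is hypothetical. */
    enum PageType { TOTAL, REDIRECT, DISAMBIGUATION, EMPTY, ARTICLE, STUB, NON_ARTICLE }

    /** Hypothetical stand-in for a parsed page; not the Cloud9 page class. */
    static final class Page {
        final String title;
        final boolean redirect;
        final boolean disambiguation;
        final boolean empty;
        final boolean stub;

        Page(String title, boolean redirect, boolean disambiguation,
             boolean empty, boolean stub) {
            this.title = title;
            this.redirect = redirect;
            this.disambiguation = disambiguation;
            this.empty = empty;
            this.stub = stub;
        }
    }

    /** Example prefixes from the description; the real list is longer ("etc."). */
    private static final String[] NON_ARTICLE_PREFIXES = { "File:", "Category:", "Wikipedia:" };

    static boolean hasNonArticleTitle(String title) {
        for (String prefix : NON_ARTICLE_PREFIXES) {
            if (title.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Counts every page once under TOTAL and once under a category (stubs twice). */
    static Map<PageType, Long> count(List<Page> pages) {
        Map<PageType, Long> counts = new EnumMap<>(PageType.class);
        for (PageType type : PageType.values()) {
            counts.put(type, 0L); // start every counter at zero so lookups never return null
        }
        for (Page page : pages) {
            counts.merge(PageType.TOTAL, 1L, Long::sum);
            // Assumed precedence: redirect, disambiguation, empty, non-article, article.
            if (page.redirect) {
                counts.merge(PageType.REDIRECT, 1L, Long::sum);
            } else if (page.disambiguation) {
                counts.merge(PageType.DISAMBIGUATION, 1L, Long::sum);
            } else if (page.empty) {
                counts.merge(PageType.EMPTY, 1L, Long::sum);
            } else if (hasNonArticleTitle(page.title)) {
                counts.merge(PageType.NON_ARTICLE, 1L, Long::sum);
            } else {
                counts.merge(PageType.ARTICLE, 1L, Long::sum);
                if (page.stub) {
                    // Stubs are articles too, so they increment both counters.
                    counts.merge(PageType.STUB, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
                new Page("Hadoop", false, false, false, false),
                new Page("Old name", true, false, false, false),
                new Page("Category:Software", false, false, false, false));
        System.out.println(count(pages));
    }
}
```

In a MapReduce job, each of these tallies would typically be a job counter incremented inside the mapper rather than an entry in an in-memory map; the sketch only shows how the categories relate to one another.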
 

Author:
Jimmy Lin

Constructor Summary
DemoCountWikipediaPages()
           
 
Method Summary
static void main(String[] args)
           
 int run(String[] args)
           
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Constructor Detail

DemoCountWikipediaPages

public DemoCountWikipediaPages()
Method Detail

run

public int run(String[] args)
        throws Exception
Specified by:
run in interface Tool
Throws:
Exception

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception