Package edu.umd.cloud9.collection.wikipedia

Provides classes for working with Wikipedia XML dumps.


Class Summary
BuildWikipediaDocnoMapping Tool for building the mapping between Wikipedia internal ids (docids) and sequentially-numbered ints (docnos).
BuildWikipediaForwardIndex Tool for building a document forward index for Wikipedia.
BuildWikipediaLinkGraph Tool for extracting the link graph out of Wikipedia.
DemoCountWikipediaPages Tool for counting the number of pages in a particular Wikipedia XML dump file.
DumpWikipediaToPlainText Tool for converting a Wikipedia XML dump file into a flat text file of articles, one per line (article title and content, separated by a tab).
LookupWikipediaArticle Tool for providing command-line access to page titles given either a docno or a docid.
RepackWikipedia Tool for repacking Wikipedia XML dumps into SequenceFiles.
WikipediaDocnoMapping Provides a mapping between Wikipedia internal ids (docids) and sequentially-numbered ints (docnos).
WikipediaForwardIndex Forward index for Wikipedia collections.
WikipediaPage A page from Wikipedia.
WikipediaPageInputFormat Hadoop InputFormat for processing Wikipedia pages from the XML dumps.
WikipediaPageInputFormat.WikipediaPageRecordReader Hadoop RecordReader for reading Wikipedia pages from the XML dumps.
WikipediaPagesBz2InputStream Class for working with bz2-compressed Wikipedia article dump files on local disk.
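
The docid/docno relationship described for WikipediaDocnoMapping can be illustrated with a minimal sketch. It assumes docnos are assigned 1..N in ascending docid order and that docids are unique; the class DocnoMappingSketch and its methods are hypothetical names for illustration only and are not the package's actual API.

```java
import java.util.Arrays;

/**
 * Hypothetical illustration of a docid <-> docno mapping.
 * Assumption: docnos are 1-based positions in the sorted list of docids.
 * This is NOT the WikipediaDocnoMapping class from this package.
 */
final class DocnoMappingSketch {
    private final int[] sortedDocids;

    DocnoMappingSketch(int[] docids) {
        // Copy defensively, then sort so that array position determines the docno.
        int[] copy = docids.clone();
        Arrays.sort(copy);
        this.sortedDocids = copy;
    }

    /** Returns the 1-based docno for a docid, or -1 if the docid is unknown. */
    int getDocno(int docid) {
        int idx = Arrays.binarySearch(sortedDocids, docid);
        return idx < 0 ? -1 : idx + 1;
    }

    /** Returns the docid stored at the given 1-based docno. */
    int getDocid(int docno) {
        if (docno < 1 || docno > sortedDocids.length) {
            throw new IllegalArgumentException("docno out of range: " + docno);
        }
        return sortedDocids[docno - 1];
    }
}
```

With the docids 303, 12 and 25, this sketch assigns docno 1 to docid 12, docno 2 to docid 25, and docno 3 to docid 303. Dense sequential docnos are convenient as array indices, which sparse internal ids are not.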
 

Package edu.umd.cloud9.collection.wikipedia Description

Provides classes for working with Wikipedia XML dumps.
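
As a rough picture of what processing such a dump involves, the sketch below counts page elements in a stream of dump XML using the JDK's StAX parser. It relies only on the fact that each article in a dump is wrapped in a page element; the class PageCountSketch and the method countPages are hypothetical names, and the code does not use or reproduce the classes in this package.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/**
 * Hypothetical illustration: counts <page> elements in Wikipedia dump XML.
 * This is NOT DemoCountWikipediaPages; it is a standalone, single-machine sketch.
 */
final class PageCountSketch {

    /** Streams through the XML and returns the number of page start tags seen. */
    static int countPages(Reader xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(xml);
            int count = 0;
            try {
                while (reader.hasNext()) {
                    // Only start tags named "page" mark the beginning of a page.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "page".equals(reader.getLocalName())) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
            return count;
        } catch (XMLStreamException e) {
            // Wrapped so that callers need not declare a checked exception.
            throw new IllegalStateException("Malformed dump XML", e);
        }
    }
}
```

A streaming parser is used here because full dumps are far too large to load into memory as a DOM tree.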