
Package co.cask.cdap.examples.wikipedia

This package contains the WikipediaPipeline Application that demonstrates a CDAP Workflow for processing and analyzing Wikipedia data.


The app contains a CDAP Workflow that runs in either online or offline mode. In offline mode, it expects Wikipedia data to be available in a Stream. In online mode, it attempts to download Wikipedia data for a provided set of page titles (formatted as the output of the Facebook Likes API). Once Wikipedia data is available, the Workflow runs a map-only job that filters bad records and normalizes data formatted as text/wiki-text into text/plain. It then runs two analyses on the plain-text data in a fork (see the Workflow sketch after the list below):

  1. ScalaSparkLDA runs topic modeling on the Wikipedia data using Latent Dirichlet Allocation (LDA).
  2. TopNMapReduce produces the top N terms in the supplied Wikipedia data.

The output of these analyses is stored in the following datasets:
    • A Table named lda, which contains the output of the Spark LDA program.
    • A KeyValueTable named topn, which contains the output of the TopNMapReduce program.
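A minimal sketch of how such a Workflow could be wired up with the CDAP Workflow API is shown below, assuming the normalizer program and the workflow name (WikiContentValidatorAndNormalizer, WikipediaPipelineWorkflow); those two names are illustrative assumptions based on the description above, while ScalaSparkLDA and TopNMapReduce are the analysis programs named in this package.

import co.cask.cdap.api.workflow.AbstractWorkflow;

/**
 * Sketch of a Workflow that normalizes Wikipedia data and then forks into
 * two parallel analyses (Spark LDA and a Top-N MapReduce).
 * The workflow and normalizer names are assumptions for illustration.
 */
public class WikipediaPipelineWorkflow extends AbstractWorkflow {

  @Override
  protected void configure() {
    setName("WikipediaPipelineWorkflow");
    setDescription("Normalizes Wikipedia data, then runs LDA and Top-N analyses in parallel.");

    // Map-only job that filters bad records and converts text/wiki-text to text/plain
    addMapReduce("WikiContentValidatorAndNormalizer");

    // Fork: run both analyses on the normalized plain-text data in parallel
    fork()
      .addSpark("ScalaSparkLDA")       // writes its results to the 'lda' Table
    .also()
      .addMapReduce("TopNMapReduce")   // writes its results to the 'topn' KeyValueTable
    .join();
  }
}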

One of the main purposes of this application is to demonstrate how the flow of a typical data pipeline can be controlled using Workflow Tokens.
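For example, a program in the pipeline can record a value in the WorkflowToken, and a condition later in the Workflow can read that value to decide whether (or which) downstream nodes run. The sketch below illustrates the general pattern with an assumed token key (normalized.record.count) and threshold; it is not the application's exact logic.

import co.cask.cdap.api.Predicate;
import co.cask.cdap.api.workflow.Value;
import co.cask.cdap.api.workflow.WorkflowContext;
import co.cask.cdap.api.workflow.WorkflowToken;

/**
 * Sketch of a Workflow condition that reads a value from the WorkflowToken
 * to decide whether the downstream analyses should run. The token key and
 * threshold are assumptions for illustration.
 */
public class EnoughDataToProceed implements Predicate<WorkflowContext> {

  @Override
  public boolean apply(WorkflowContext context) {
    WorkflowToken token = context.getToken();
    // An upstream node (e.g. the normalizer) would have recorded this value,
    // for instance with: token.put("normalized.record.count", String.valueOf(count));
    Value count = token.get("normalized.record.count");
    return count != null && count.getAsLong() > 0;
  }
}

Such a predicate could then be attached in the Workflow's configure() method via condition(new EnoughDataToProceed()) ... otherwise() ... end(), so that the token value written by one node controls which branch of the pipeline executes next.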


Copyright © 2018 Cask Data, Inc. All rights reserved.