Spark program that executes in a workflow and analyzes Wikipedia data.
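For context, the analysis this program performs can be pictured with Spark MLlib's LDA API. The example's own class (ScalaSparkLDA, described further below) is written in Scala; this is a hedged Java equivalent that assumes the pages have already been tokenized and turned into (documentId, termCountVector) pairs:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;

public final class LdaSketch {
  // corpus: (documentId, termCountVector) pairs built from the plain-text pages
  static void run(JavaPairRDD<Long, Vector> corpus) {
    LDA lda = new LDA().setK(10).setMaxIterations(50);    // hypothetical settings
    DistributedLDAModel model = (DistributedLDAModel) lda.run(corpus);
    Matrix topics = model.topicsMatrix();                 // term-by-topic weights
    System.out.println("Learned " + topics.numCols() + " topics over "
        + model.vocabSize() + " terms");
  }
}
```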
MapReduce program that dumps events from a stream to a dataset.
Mapper that dumps stream events to a dataset.
Mapper that dumps raw Wikipedia data from a stream to a dataset.
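A minimal sketch of such a dump, written with plain Hadoop MapReduce types rather than the app's actual classes, is an identity-style map-only job:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical identity mapper: copies each stream event to the output dataset as-is.
public class DumpEventsMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
  @Override
  protected void map(LongWritable key, Text event, Context context)
      throws IOException, InterruptedException {
    context.write(NullWritable.get(), event);
  }
}
```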
MapReduce program that outputs the top N words from the input text.
Mapper that emits tokens.
Reducer that outputs top N tokens.
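Sketched with plain Hadoop types, this mapper/reducer pair looks roughly like the following; the class names and the fixed N are assumptions, not the example's actual code:

```java
import java.io.IOException;
import java.util.AbstractMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public final class TopNSketch {

  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken().toLowerCase());
        context.write(word, ONE);                 // emit (token, 1) per occurrence
      }
    }
  }

  // Run with a single reducer so one min-heap sees every word's total count.
  public static class TopNReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int N = 10;              // assumed; could come from job config
    private final PriorityQueue<Map.Entry<String, Integer>> topN =
        new PriorityQueue<>(Map.Entry.comparingByValue());

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context) {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      topN.offer(new AbstractMap.SimpleEntry<>(word.toString(), sum));
      if (topN.size() > N) {
        topN.poll();                              // evict the current minimum
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      while (!topN.isEmpty()) {                   // emits in ascending count order
        Map.Entry<String, Integer> e = topN.poll();
        context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
      }
    }
  }
}
```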
MapReduce program that validates and extracts content from raw Wikipedia data blobs.
Mapper that filters records that are null, empty, or cannot be parsed as JSON, and normalizes the remaining records from wiki-text to plain text.
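A hedged sketch of the filtering half, using Gson to test for parseable JSON (the example's own parsing and wiki-text normalization details may differ):

```java
import java.io.IOException;
import com.google.gson.JsonParser;
import com.google.gson.JsonSyntaxException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map-only filter; the wiki-text-to-plain-text step is omitted here.
public class ValidatingMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
  @Override
  protected void map(LongWritable key, Text record, Context context)
      throws IOException, InterruptedException {
    String raw = record.toString();
    if (raw.trim().isEmpty()) {
      return;                       // drop empty records
    }
    try {
      JsonParser.parseString(raw);  // drop records that are not well-formed JSON
    } catch (JsonSyntaxException e) {
      return;
    }
    context.write(NullWritable.get(), record);  // keep valid records
  }
}
```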
MapReduce job that downloads Wikipedia data and stores it in a dataset.
Mapper that downloads Wikipedia data for each input record.
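One plausible shape for such a mapper is a per-record HTTP fetch against the public MediaWiki API; the class name and query parameters here are illustrative assumptions:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: fetches the content of each input page title over HTTP.
public class WikipediaDownloaderMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
  @Override
  protected void map(LongWritable key, Text pageTitle, Context context)
      throws IOException, InterruptedException {
    String url = "https://en.wikipedia.org/w/api.php?action=query&prop=revisions"
        + "&rvprop=content&format=json&titles="
        + URLEncoder.encode(pageTitle.toString(), "UTF-8");
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    try (InputStream in = conn.getInputStream()) {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      byte[] chunk = new byte[8192];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
      context.write(NullWritable.get(), new Text(buf.toString("UTF-8")));
    } finally {
      conn.disconnect();
    }
  }
}
```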
App to demonstrate a data pipeline that processes Wikipedia data using a CDAP Workflow.
Config for Wikipedia App.
Workflow for the Wikipedia data pipeline.
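A rough skeleton of how these pieces could be wired together, assuming CDAP's AbstractApplication API; the stream names are invented and the registered programs' constructors are elided:

```java
import co.cask.cdap.api.app.AbstractApplication;
import co.cask.cdap.api.data.stream.Stream;

public class WikipediaPipelineApp extends AbstractApplication {
  @Override
  public void configure() {
    setName("WikipediaPipelineApp");
    setDescription("Wikipedia data pipeline driven by a CDAP Workflow");
    addStream(new Stream("pageTitleStream"));     // page titles for online mode (assumed name)
    addStream(new Stream("wikiStream"));          // raw Wikipedia data for offline mode (assumed name)
    addWorkflow(new WikipediaPipelineWorkflow()); // orchestrates the MapReduce/Spark jobs
    addService(new WikipediaService());           // exposes analysis results over HTTP
  }
}
```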
Service to retrieve results of analyses of Wikipedia data.
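A hedged sketch of one handler such a service could contain, assuming CDAP's HTTP service handler API; the endpoint path and response payload are placeholders:

```java
import co.cask.cdap.api.service.http.AbstractHttpServiceHandler;
import co.cask.cdap.api.service.http.HttpServiceRequest;
import co.cask.cdap.api.service.http.HttpServiceResponder;
import javax.ws.rs.GET;
import javax.ws.rs.Path;

public class WikipediaResultsHandler extends AbstractHttpServiceHandler {
  @GET
  @Path("topn/results")
  public void topNResults(HttpServiceRequest request, HttpServiceResponder responder) {
    // A real handler would read the TopNMapReduce output dataset via getContext().
    responder.sendJson(new String[] {"example-term"});
  }
}
```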
The app contains a CDAP Workflow that runs in either online or offline mode. In offline mode, it expects Wikipedia data to already be available in a Stream. In online mode, it attempts to download Wikipedia data for a provided set of page titles (formatted as the output of the Facebook Likes API). Once Wikipedia data is available, the Workflow runs a map-only job that filters bad records and normalizes data from text/wiki-text into text/plain. It then runs two analyses on the plain-text data in a fork:
- ScalaSparkLDA, which runs topic modeling on the Wikipedia data using Latent Dirichlet Allocation (LDA).
- TopNMapReduce, which produces the top N terms in the supplied Wikipedia data.
One of the main purposes of this application is to demonstrate how the flow of a typical data pipeline can be controlled using Workflow Tokens.
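As a hedged illustration of token-driven control, a predicate can read a value that an upstream job stored in the WorkflowToken and gate the analyses fork on it. The node names below come from the class summaries above, but the token key, condition placement, and exact configurer calls are assumptions based on CDAP's Workflow API:

```java
import co.cask.cdap.api.Predicate;
import co.cask.cdap.api.workflow.AbstractWorkflow;
import co.cask.cdap.api.workflow.Value;
import co.cask.cdap.api.workflow.WorkflowContext;

public class WikipediaWorkflowSketch extends AbstractWorkflow {
  @Override
  protected void configure() {
    setName("WikipediaWorkflowSketch");
    setDescription("Filters, normalizes, and analyzes Wikipedia data");
    addMapReduce("StreamToDataset");                   // offline mode: dump the stream
    addMapReduce("WikiContentValidatorAndNormalizer"); // filter + wiki-text -> plain text
    condition(new EnoughValidRecords())                // gate the analyses on a token value
      .fork()
        .addSpark("ScalaSparkLDA")                     // topic-modeling branch
      .also()
        .addMapReduce("TopNMapReduce")                 // top-N terms branch
      .join()
    .end();
  }

  // Reads a count that an upstream job is assumed to have put into the WorkflowToken.
  static class EnoughValidRecords implements Predicate<WorkflowContext> {
    @Override
    public boolean apply(WorkflowContext context) {
      Value count = context.getToken().get("valid.records"); // assumed token key
      return count != null && count.getAsLong() > 0;
    }
  }
}
```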
Copyright © 2018 Cask Data, Inc. All rights reserved.