Package co.cask.cdap.examples.datacleansing

This package contains the DataCleansing Application, which filters out records that do not match a given schema.

Package co.cask.cdap.examples.datacleansing Description

The DataCleansing Application consists of the following programs and datasets:

  1. A Service named DataCleansingService that allows writing to the rawRecords PartitionedFileSet.
  2. A MapReduce named DataCleansingMapReduce that reads the files from the rawRecords PartitionedFileSet, filters out "unclean" records (those that do not match a particular schema), and writes the remaining records to the cleanRecords PartitionedFileSet. Each time the job runs, it processes only the files of the newly created partitions.
  3. Three Datasets used by the MapReduce and Service:
    • A PartitionedFileSet named rawRecords, which serves as the input data for DataCleansingMapReduce.
    • A PartitionedFileSet named cleanRecords, which serves as the output for DataCleansingMapReduce.
    • A KeyValueTable named consumingState, which keeps track of the state of DataCleansingMapReduce so that each run processes only the files of newly created partitions.
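
The flow described above can be sketched in plain Java without any CDAP dependency. This is only an illustration under stated assumptions, not the application's actual code: the in-memory maps stand in for the three datasets, the partition key is assumed to be a timestamp, and the class name DataCleansingSketch, the field list REQUIRED_FIELDS, the state key "lastProcessed", and the helper record() are all hypothetical. The real application works with PartitionedFileSet and KeyValueTable, and its schema check may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DataCleansingSketch {
    // Hypothetical schema: field names that every clean record must carry.
    static final List<String> REQUIRED_FIELDS = Arrays.asList("pid", "name", "zip");

    // In-memory stand-ins for the datasets, keyed by an assumed partition timestamp.
    final TreeMap<Long, List<Map<String, String>>> rawRecords = new TreeMap<>();
    final TreeMap<Long, List<Map<String, String>>> cleanRecords = new TreeMap<>();
    // Stand-in for the consumingState table: remembers the last processed partition.
    final Map<String, Long> consumingState = new HashMap<>();

    // A record is "clean" if every required field is present and non-empty.
    static boolean isClean(Map<String, String> record) {
        for (String field : REQUIRED_FIELDS) {
            String value = record.get(field);
            if (value == null || value.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // One run: process only partitions newer than the saved state, then advance the state.
    int runOnce() {
        long last = consumingState.getOrDefault("lastProcessed", Long.MIN_VALUE);
        int processed = 0;
        for (Map.Entry<Long, List<Map<String, String>>> partition
                : rawRecords.tailMap(last, false).entrySet()) {
            List<Map<String, String>> kept = new ArrayList<>();
            for (Map<String, String> record : partition.getValue()) {
                if (isClean(record)) {
                    kept.add(record);
                }
            }
            cleanRecords.put(partition.getKey(), kept);
            consumingState.put("lastProcessed", partition.getKey());
            processed++;
        }
        return processed;
    }

    // Hypothetical helper: builds a record from alternating key/value strings.
    static Map<String, String> record(String... keyValues) {
        Map<String, String> result = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            result.put(keyValues[i], keyValues[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        DataCleansingSketch app = new DataCleansingSketch();
        app.rawRecords.put(1L, Arrays.asList(
            record("pid", "1", "name", "Ann", "zip", "94301"),
            record("pid", "2", "name", "Bob")));  // missing "zip", so it is dropped
        System.out.println("partitions processed: " + app.runOnce());
        System.out.println("partitions processed on rerun: " + app.runOnce());
    }
}
```

Running the job a second time without adding a partition processes nothing, because the saved state already points at the newest partition. Keeping that state outside the job is what lets each run pick up only the new data.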

Copyright © 2018 Cask Data, Inc. All rights reserved.