Data Cleansing sample Application.
A simple MapReduce that reads records from the rawRecords PartitionedFileSet and writes all records that match a particular schema to an output PartitionedFileSet.
A Mapper which skips text that doesn't match a given schema.
Partitions the records based upon a runtime argument (time) and a field extracted from the text being written (zip).
A handler that allows writing to the 'rawRecords' PartitionedFileSet.
Implements a simple Workflow to run the DataCleansingMapReduce MapReduce.
A schema matcher for flat-record schemas with simple (or nullable of simple) fields.
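As an illustration of what matching a flat-record schema with simple or nullable fields can look like, here is a minimal sketch in plain Java. It is not the CDAP implementation: the class `FlatSchemaMatcherSketch`, the `SimpleType` enum, and the choice to represent an already-parsed record as a `Map<String, Object>` are all assumptions made for this example. The sketch also assumes that every schema field must be present in the record and that extra record fields are ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical flat-record schema matcher; an illustration only, not the CDAP class. */
class FlatSchemaMatcherSketch {

  /** Simple field types assumed for this sketch. */
  enum SimpleType { STRING, LONG, BOOLEAN }

  /** One schema field: a name, a simple type, and whether null is an accepted value. */
  static final class Field {
    final String name;
    final SimpleType type;
    final boolean nullable;

    Field(String name, SimpleType type, boolean nullable) {
      this.name = name;
      this.type = type;
      this.nullable = nullable;
    }
  }

  private final List<Field> fields = new ArrayList<>();

  /** Adds a field to the schema and returns this matcher so calls can be chained. */
  FlatSchemaMatcherSketch field(String name, SimpleType type, boolean nullable) {
    fields.add(new Field(name, type, nullable));
    return this;
  }

  /**
   * Returns true only if every schema field is present in the record and holds
   * a value of the declared type, or null when the field is nullable.
   * Record fields that the schema does not mention are ignored (an assumption).
   */
  boolean matches(Map<String, Object> record) {
    for (Field f : fields) {
      if (!record.containsKey(f.name)) {
        return false;
      }
      Object value = record.get(f.name);
      if (value == null) {
        if (!f.nullable) {
          return false;
        }
        continue;
      }
      if (!typeMatches(f.type, value)) {
        return false;
      }
    }
    return true;
  }

  private static boolean typeMatches(SimpleType type, Object value) {
    switch (type) {
      case STRING:
        return value instanceof String;
      case LONG:
        return value instanceof Long || value instanceof Integer;
      case BOOLEAN:
        return value instanceof Boolean;
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    FlatSchemaMatcherSketch matcher = new FlatSchemaMatcherSketch()
        .field("pid", SimpleType.LONG, false)
        .field("zip", SimpleType.STRING, true);
    Map<String, Object> record = new HashMap<>();
    record.put("pid", 223986723L);
    record.put("zip", "84125");
    System.out.println(matcher.matches(record));
  }
}
```

A Mapper that skips non-matching text could call a check like `matches` on each parsed record and emit only the records for which it returns true.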
This package contains the DataCleansing Application, which filters out records that do not match a given schema. The DataCleansing Application consists of these programs and datasets:
DataCleansingService that allows writing to the rawRecords PartitionedFileSet.
DataCleansingMapReduce that reads the files from a PartitionedFileSet, applies a filter to remove "unclean" records, based upon a particular schema, and outputs the records to an output PartitionedFileSet. Each time the job runs, it processes only the files of the newly created partitions.
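The incremental behavior described above, together with the time and zip partitioning of the output, can be sketched in plain Java as follows. This is a simplified stand-in and does not use the CDAP APIs: the class `IncrementalCleanseSketch`, the `Partition` type, the in-memory `lastProcessedTime` marker, and the `"time/zip"` string key are all hypothetical. In particular, tracking progress by partition creation time is an assumption of this sketch; the source does not say how newly created partitions are detected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of one incremental cleansing run; not the CDAP implementation. */
class IncrementalCleanseSketch {

  /** An input partition: its creation time plus the raw records it holds. */
  static final class Partition {
    final long time;
    final List<Map<String, Object>> records;

    Partition(long time, List<Map<String, Object>> records) {
      this.time = time;
      this.records = records;
    }
  }

  // Creation time of the newest partition consumed so far (assumed state, kept in memory).
  private long lastProcessedTime = Long.MIN_VALUE;

  /**
   * Consumes only partitions created after the previous run, drops records that
   * fail the isClean check, and groups the rest under an output key built from
   * the run time (a runtime argument) and the record's zip field.
   * The isClean predicate is expected to reject records without a usable zip.
   */
  Map<String, List<Map<String, Object>>> run(List<Partition> input,
                                             long runTime,
                                             Predicate<Map<String, Object>> isClean) {
    Map<String, List<Map<String, Object>>> output = new LinkedHashMap<>();
    long newest = lastProcessedTime;
    for (Partition partition : input) {
      if (partition.time <= lastProcessedTime) {
        continue; // already handled by an earlier run
      }
      for (Map<String, Object> record : partition.records) {
        if (!isClean.test(record)) {
          continue; // skip "unclean" records
        }
        String key = runTime + "/" + record.get("zip");
        output.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
      }
      newest = Math.max(newest, partition.time);
    }
    lastProcessedTime = newest;
    return output;
  }

  public static void main(String[] args) {
    Map<String, Object> record = new HashMap<>();
    record.put("pid", 1L);
    record.put("zip", "84125");
    Partition partition = new Partition(10L, Arrays.asList(record));

    IncrementalCleanseSketch job = new IncrementalCleanseSketch();
    Predicate<Map<String, Object>> isClean = r -> r.get("zip") instanceof String;
    System.out.println(job.run(Arrays.asList(partition), 100L, isClean));
    // The second run sees no new partitions, so it has nothing to process.
    System.out.println(job.run(Arrays.asList(partition), 200L, isClean));
  }
}
```

Keeping the progress marker outside the job's input means a rerun over the same input partitions produces no duplicate output, which is the property the description above relies on.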
Copyright © 2018 Cask Data, Inc. All rights reserved.