Pentaho Data Integration

Pentaho Data Integration is an advanced, open-source business intelligence tool that can execute transformations of data coming from various sources. This section shows how to connect it to CDAP datasets using the CDAP JDBC driver.

  1. Before opening the Pentaho Data Integration application, copy the co.cask.cdap.cdap-explore-jdbc-5.1.2.jar file to the lib directory of Pentaho Data Integration, located at the root of the application's directory.

  2. Open Pentaho Data Integration.

  3. In the toolbar, select File -> New -> Database Connection....

  4. In the General section, enter a Connection Name, such as CDAP Sandbox. For the Connection Type, select Generic database. For the Access field, select Native (JDBC). In the Custom Connection URL field, enter the URL of your CDAP instance; in this example, which connects to a CDAP Sandbox, it is jdbc:cdap://localhost:11015. In the Custom Driver Class Name field, enter co.cask.cdap.explore.jdbc.ExploreDriver.

  5. Click on OK.

  6. To use this connection, navigate to the Design tab on the left of the main view. In the Input menu, double-click Table input. This creates a new transformation containing this input.

  7. Right-click on Table input in your transformation and select Edit step. You can specify an appropriate name for this input such as CDAP datasets query. Under Connection, select the newly created database connection; in this example, CDAP Sandbox. Enter a valid SQL query in the main SQL field. This will define the data available to your transformation.

  8. Click on OK. Your input is now ready to be used in your transformation, and it will contain data coming from the results of the SQL query on the CDAP datasets.

  9. For more information on how to add components to a transformation and link them together, see the Pentaho Data Integration page.
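
The connection settings from step 4 can also be checked outside Pentaho with a small JDBC program, which helps separate driver or URL problems from Pentaho configuration problems. The following is a minimal sketch under stated assumptions, not code from the CDAP or Pentaho documentation: the class name CdapJdbcCheck, its helper methods, and the default query against a table named dataset_mydataset are hypothetical. It assumes the driver JAR from step 1 is on the classpath and that a CDAP Sandbox is listening on localhost:11015.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper class for checking the connection settings from step 4.
class CdapJdbcCheck {
    // Driver class name as entered in the Custom Driver Class Name field.
    static final String DRIVER_CLASS = "co.cask.cdap.explore.jdbc.ExploreDriver";

    // Builds a URL in the same form as the Custom Connection URL in step 4.
    public static String connectionUrl(String host, int port) {
        return "jdbc:cdap://" + host + ":" + port;
    }

    // Returns true if the driver class can be loaded from the classpath.
    public static boolean driverAvailable() {
        try {
            Class.forName(DRIVER_CLASS);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String url = connectionUrl("localhost", 11015);

        if (!driverAvailable()) {
            System.err.println("Driver not found on classpath: " + DRIVER_CLASS);
            return;
        }

        // Assumption: "dataset_mydataset" is a placeholder table name.
        // Pass your own query as the first argument to override it.
        String query = args.length > 0
                ? args[0]
                : "SELECT * FROM dataset_mydataset LIMIT 5";

        try (Connection connection = DriverManager.getConnection(url);
             Statement statement = connection.createStatement();
             ResultSet results = statement.executeQuery(query)) {
            // Print the first column of each returned row.
            while (results.next()) {
                System.out.println(results.getString(1));
            }
        } catch (SQLException e) {
            System.err.println("Query failed against " + url + ": " + e.getMessage());
        }
    }
}
```

If this program reports that the driver is missing, the same JAR is unlikely to load in Pentaho either. If it connects and returns rows, the same URL, driver class name, and query should be usable in the Table input step.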