Affects Version/s: 3.4.0
Fix Version/s: 3.5.0
Release Notes: The "CDAP Applications" section in the documentation has been split into two separate sections, now included under "CDAP Extensions": "Cask Hydrator" and "Cask Tracker".
Improvements to the Cask Hydrator & ETL Pipelines Documentation
Review and revise the documentation for Cask Hydrator and ETL Pipelines. This covers changes to:
- "CDAP Applications" (http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/cdap-apps/index.html)
- "Cask Hydrator Plugins Repository" (https://github.com/caskdata/hydrator-plugins/tree/develop)
It also incorporates material from the Developers' Manual > Building Blocks > Plugins:
- "Plugins" (http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/developers-manual/building-blocks/plugins.html)
As this can affect the entire manual ("CDAP Applications"), it's best to consider it in its entirety.
1. Users of Hydrator, creating pipelines, working with datasets
2. Administrators, installing plugins, running CDAP
3. Developers of custom plugins and applications
- (was "CDAP Applications" page, http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/cdap-apps/index.html)
Introduction to [extensions? applications? add-ons?]
- Cask Hydrator
- Cask Tracker
- Data Quality Application
Question: Are all these now "extensions"? or "add-ons"?
(was "Cask Hydrator and ETL", http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/cdap-apps/hydrator/index.html)
- Video (Building, publishing, running a pipeline for a defined scenario, using the SDK)
- Constructing and running the pipeline from UI
- Diagram that explains the lifecycle (of pipelines?)
- What is a pipeline?
- Batch (Q: Is this now a "Data pipeline"?)
- Branching of Pipelines
- Joining in Pipelines
- What is a plugin?
- Currently Available Hydrator Plugins
- Link to https://github.com/caskdata/hydrator-plugins/tree/develop
- List of existing plugins that we include in CDAP
- Show version used to generate docs
- Creating Pipelines
- Details of using Hydrator
- Especially details that aren't obvious in UI
- Editing an Existing Pipeline
- Running a Pipeline
- Runtime Arguments and Preferences
- Managing Pipeline Lifecycle
- Pipelines, Metrics, and Logs
- CDAP and Hydrator
- Talk about the integration of the two?
- Logical view of a pipeline
- Configuring a plugin
- Configuring a pipeline
- Validating a pipeline
- Publishing a pipeline
- Physical view
- Data pipeline workflow
- MapReduce or Spark
- Advanced: Creating a Hydrator Pipeline Programmatically (or something similar)
(was "Creating an ETL Application", http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/cdap-apps/hydrator/creating.html)
- Installing Plugins
- Upgrading Hydrator Pipelines (was "Upgrading ETL Applications", http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/cdap-apps/hydrator/upgrade.html)
- How to create custom plugins
- (was "Creating Custom ETL Plugins", http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/cdap-apps/hydrator/custom.html)
- Plugin Types and Maven Archetypes
- Available Annotations
- Creating a Batch Source
- Creating a Batch Sink
- Creating a Real-time Source
- Creating a Real-time Sink
- Creating a Transformation
- Test Framework for Plugins
- Source State in a Real-time Source
- (was "Plugin Packaging and Deployment", http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/cdap-apps/hydrator/custom.html#plugin-packaging-and-deployment)
- Plugin Packaging
- Plugin JSON
- Plugin Widget JSON
- Uploading plugin artifacts
- Deploying as a System Artifact
- Deploying as a User Artifact
- Deployment Verification
- Note: this material is currently a "Sphinx include" from the Developers' Manual > Building Blocks > Plugins
(existing material at https://github.com/caskdata/hydrator-plugins/tree/develop)
- Oriented around developers, not around users of the plugins.
- Note in documentation which branch should be used
- Add a summary of what the plugins do or which ones are available.
- As all of these plugins ship with the SDK, not crucial, but should say that they do, and perhaps have different sections for users and developers.
- Installation instructions could explain exactly when this is required. You don't need to do this for the SDK; what about distributed CDAP?
- Perhaps add links to each of the READMEs in the lower directories, and then add the missing lower directories.
- Some of the directories have README.rst listing what's in them
- Directories missing a README.rst:
- Simplify titles from (for example) "Elasticsearch Source and Sink Plugin Collection" to "Elasticsearch Source and Sink Plugins"
- Once CDAP docs are revised, the links on the README.rst(s) need to point to relevant docs, with correct versions.
- One issue is the "develop" branch README.rst points to "current" CDAP branch, when it actually needs to point to
the current develop branch ("3.5.0-SNAPSHOT"). A solution would be to add a redirect ("develop") that, like
"current", is updated every release and always points to the current development version of the docs.
- A quick-ref showing which plugins are available and what they do (one-liners or titles) would be helpful.
We might not have sufficient levels to organize Hydrator adequately; should Cask Hydrator be promoted to its own manual?
(from "Glossary", http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/reference-manual/glossary.html)
- Application: A collection of programs and services that read and write through the data abstraction layer in CDAP. Question: how do Applications exist? As a config file? As a Java class? Both?
- Artifact: A JAR file containing Java classes and resources required to create and run an Application. Multiple applications can be created from the same artifact.
- Cask Hydrator: Cask Hydrator (Studio?) provides an easy method of configuring pipelines using a visual editor. You drag and drop sources, transformations, and sinks, configuring a Hydrator pipeline within minutes. It provides an operational view of the resulting ETL pipeline that allows for lifecycle control and monitoring of metrics, logs, and other run-time information.
- ETL: Refers to the Extract, Transform and Load of data.
- Hydrator Pipeline (was "ETL Application"): An application created from one of the two ETL system artifacts shipped with CDAP: cdap-etl-batch and cdap-etl-realtime. A Hydrator Pipeline defines a source to read from, zero or more transformations to perform on the data read from the source, and one or more sinks to write the transformed data to.
- Hydrator Plugin (was "ETL Plugin"): A plugin of type BatchSource, RealtimeSource, BatchSink, RealtimeSink, or Transformation, packaged in a JAR file format, for use as a plugin in a Hydrator Pipeline.
- Plugin: A plugin extends an application by implementing an interface expected by the application. Plugins are packaged in an artifact.
- Structured Record: The data format used to exchange events between most of the pre-built Cask Hydrator plugins.
- CDAP Extension: Definition required
- ETL Pipeline: goes away?
- Hydrator Pipeline: replaces ETL Application, ETL Pipeline
- Different Pipelines: Data pipeline (replaces Batch pipeline?), Real-time pipeline
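To help reviewers check the revised "Hydrator Pipeline" wording, the shape it describes (one source, zero or more transformations, one or more sinks) can be sketched as a small data structure. This is only an illustration under stated assumptions: `PipelineSpec`, its fields, and the stage names below are hypothetical and are not the CDAP API or its configuration format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PipelineSpec:
    """Hypothetical model of the glossary definition; not a CDAP class."""
    source: str                      # exactly one source to read from
    sinks: List[str]                 # one or more sinks to write to
    transforms: List[str] = field(default_factory=list)  # zero or more

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the shape is valid."""
        problems = []
        if not self.source:
            problems.append("a pipeline needs a source")
        if not self.sinks:
            problems.append("a pipeline needs at least one sink")
        return problems


# Placeholder stage names, chosen only for this example.
valid = PipelineSpec(source="example-source", sinks=["example-sink"])
invalid = PipelineSpec(source="example-source", sinks=[], transforms=["example-transform"])
print(valid.validate())
print(invalid.validate())
```

A pipeline with no transformations passes the check, while one with no sinks does not, which matches the "zero or more" and "one or more" wording in the definition.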
(existing material at http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/cdap-apps/tracker/index.html)
This material is not being revised as part of this work, except for moving it to a new manual if required.
(existing material at http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/cdap-apps/data-quality/index.html)
- Current Material:
- Creating the Application
- Deploying the Application
- End-to-End Example
- Built-in Library of Aggregation Functions
- Extending the Application
- Question: Is Data Quality an "application" or an "extension"?