CDAP / CDAP-5263

Revisions to CDAP Applications Manual

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.5.0
    • Component/s: Docs
    • Labels:
    • Release Notes:
      The "CDAP Applications" section in the documentation has been split into two separate sections, now included under "CDAP Extensions": "Cask Hydrator" and "Cask Tracker".

      Description

      Improvements to the Cask Hydrator & ETL Pipelines Documentation

      Review and revise the documentation for Cask Hydrator and ETL Pipelines. This covers changes to:

      This includes incorporating material from the Developers' Manual > Building Blocks > Plugins.

      As this can affect the entire manual ("CDAP Applications"), it's best to consider it in its entirety.

      Audiences

      1. Users of Hydrator, creating pipelines, working with datasets
      2. Administrators, installing plugins, running CDAP
      3. Developers of custom plugins and applications

      Outline

      CDAP Extensions

      Question: Are all of these now "extensions" or "add-ons"?

      Cask Hydrator

      (was "Cask Hydrator and ETL", http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/cdap-apps/hydrator/index.html)

      Hydrator Quickstart

      • Video (Building, publishing, running a pipeline for a defined scenario, using the SDK)
      • Constructing and running the pipeline from UI
      • Concepts
        • Pipelines
        • Plugins
      • Diagram that explains the lifecycle (of pipelines?)
        • Constructing
        • Configuring
        • Publishing
        • Running
      • Types
        • Batch
          • M/R
          • Spark
        • Real-time

      Using Cask Hydrator and Cask Hydrator Studio

      • What is a pipeline?
        • Batch (Q: Is this now a "Data pipeline"?)
        • Real-time
        • Branching of Pipelines
          • Joining in Pipelines
      • Creating Pipelines
        • Details of using Hydrator
        • Especially details that aren't obvious in UI
      • Editing an Existing Pipeline
      • Running a Pipeline
        • Runtime Arguments and Preferences
      • Managing Pipeline Lifecycle
      • Pipelines, Metrics, and Logs
      • CDAP and Hydrator
        • Talk about the integration of the two?
      • Logical view of a pipeline
        • Plugins
        • Configuring a plugin
        • Configuring a pipeline
        • Validating a pipeline
        • Publishing a pipeline
      • Physical view
        • Data pipeline workflow
        • MapReduce or Spark

      Cask Hydrator Administration

      • Installing Plugins

      Custom Hydrator Plugins

      • How to create custom plugins

      Cask Hydrator Plugins Repository

      (existing material at https://github.com/caskdata/hydrator-plugins/tree/develop)

      • Oriented around developers, not around users of the plugins.
      • Note in documentation which branch should be used
      • Add a summary of what the plugins do or which ones are available.
      • As all of these plugins ship with the SDK, this is not crucial, but the documentation should say that they do, and perhaps have different sections for users and developers.
      • Installation instructions could explain exactly when this is required. You don't need to do this for the SDK; what about distributed CDAP?
      • Perhaps add links to each of the READMEs in the lower directories, and then add the missing lower directories.
      • Some of the directories have README.rst listing what's in them
      • Directories missing a README.rst:
        • core-plugins
        • database-plugins
        • hbase-plugins
        • hdfs-plugins
        • hydrator-common
      • Simplify titles from (for example) "Elasticsearch Source and Sink Plugin Collection" to "Elasticsearch Source and Sink Plugins"
      • Once CDAP docs are revised, the links on the README.rst(s) need to point to relevant docs, with correct versions.
      • One issue is the "develop" branch README.rst points to "current" CDAP branch, when it actually needs to point to
        the current develop branch ("3.5.0-SNAPSHOT"). A solution would be to add a redirect ("develop") that, like
        "current", is updated every release and always points to the current development version of the docs.
      • A quick-ref showing which plugins are available and what they do (one-liners or titles) would be helpful.
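
      The check for directories missing a README.rst could be automated. The following is a minimal sketch, assuming a local checkout of the plugins repository; the function name dirs_missing_readme and its parameters are hypothetical and not part of any existing tooling.

```python
from pathlib import Path


def dirs_missing_readme(repo_root, readme_name="README.rst"):
    """Return the sorted names of top-level subdirectories that lack a README.

    Hypothetical helper: repo_root is assumed to be a local checkout of the
    plugins repository. Hidden directories (such as .git) are skipped.
    """
    root = Path(repo_root)
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir()
        and not child.name.startswith(".")
        and not (child / readme_name).is_file()
    )
```

      Calling dirs_missing_readme(".") from the root of a checkout returns a list that can be compared against the directories named above. Only top-level directories are inspected; nested directories would need a recursive walk.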

      Documentation Presentation

      We might not have sufficient levels to organize Hydrator adequately; should Cask Hydrator be promoted to its own manual?

      Glossary of Terms

      (from "Glossary", http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/reference-manual/glossary.html)

      Current Terms (that may require revision or deletion)

      • Application: A collection of programs and services that read and write through the data abstraction layer in CDAP. Question: how do Applications exist? As a config file? As a Java class? Both?
      • Artifact: A JAR file containing Java classes and resources required to create and run an Application. Multiple applications can be created from the same artifact.
      • Cask Hydrator: Cask Hydrator (Studio?) provides an easy method of configuring pipelines using a visual editor. You drag and drop sources, transformations, and sinks, configuring a Hydrator pipeline (was "an ETL pipeline") within minutes. It provides an operational view of the resulting ETL pipeline that allows for lifecycle control and monitoring of metrics, logs, and other run-time information.
      • ETL: Refers to the Extract, Transform and Load of data.
      • Hydrator Pipeline (was "ETL Application"): A type of application created from one of the two ETL system artifacts shipped with CDAP: cdap-etl-batch and cdap-etl-realtime. A Hydrator Pipeline defines a source to read from, zero or more transformations to perform on the data read from the source, and one or more sinks to write the transformed data to.
      • Hydrator Plugin (was "ETL Plugin"): A plugin of type BatchSource, RealtimeSource, BatchSink, RealtimeSink, or Transformation, packaged in a JAR file format, for use as a plugin in a Hydrator Pipeline (was "an ETL Application").
      • Plugin: A plugin extends an application by implementing an interface expected by the application. Plugins are packaged in an artifact.
      • Structured Record: The data format used to exchange events between most of the pre-built Cask Hydrator plugins (was "CDAP ETL plugins").
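
      The structural rule in the pipeline entry above (one source, zero or more transformations, one or more sinks) can be illustrated with a small model. This is a sketch only: PipelineSketch and its fields are hypothetical names invented for illustration and do not reflect CDAP's classes or configuration format. It also assumes a linear pipeline and ignores branching and joining.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PipelineSketch:
    """Hypothetical model of the glossary definition; not a CDAP class."""

    source: str                      # exactly one source to read from
    sinks: List[str]                 # one or more sinks to write to
    transforms: List[str] = field(default_factory=list)  # zero or more

    def validate(self):
        """Check the shape described in the glossary entry."""
        if not self.source:
            raise ValueError("a pipeline needs a source")
        if not self.sinks:
            raise ValueError("a pipeline needs at least one sink")
        return True

    def stages(self):
        """Return stage names in order: source, transforms, then sinks."""
        return [self.source, *self.transforms, *self.sinks]
```

      A definition with a source and a sink but no transformations passes validate(), while one with an empty list of sinks raises ValueError, which matches the "zero or more" and "one or more" wording of the entry.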

      Additional Terms

      • CDAP Extension: Definition required
      • ETL Pipeline: goes away?
      • Hydrator Pipeline: replaces ETL Application, ETL Pipeline
      • Different Pipelines: Data pipeline (replaces Batch pipeline?), Real-time pipeline

      Cask Tracker

      (existing material at http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/cdap-apps/tracker/index.html)
      This is not being revised as part of this task, except for moving it to a new manual if required.

      Data Quality Application

      (existing material at http://docs.cask.co/cdap/3.5.0-SNAPSHOT/en/cdap-apps/data-quality/index.html)

      • Current Material:
        • Overview
        • Creating the Application
        • Deploying the Application
        • End-to-End Example
        • Built-in Library of Aggregation Functions
        • Extending the Application
      • Question: Is Data Quality an "application" or an "extension"?

              People

              • Assignee: John Jackson
              • Reporter: John Jackson