  1. CDAP
  2. CDAP-7081

Concurrency protection for datasets and partitions

      Description

      Transactions provide snapshot isolation for data that is stored in transaction-aware datasets. That is, even if the data is changed or deleted, a program that has started a transaction has a consistent view of the data until the transaction ends.

      However, this does not protect the program from deletion or modification of a dataset as a whole while the program is running. For example, a table can be deleted, or its schema can be altered, while a program is accessing it, and the program may then fail to access that table.

      Also, transactions do not protect non-transaction-aware dataset types such as file sets, or hybrids such as PartitionedFileSet, which has transactional metadata but non-transactional partition files.

      Transactions also only protect data cells in the same dataset. In the case of a partitioned file set, a MapReduce job may read a partition's files in its mappers. The partition metadata is read during preparation of the job, most likely in a separate transaction. By the time a mapper runs, that transaction has ended, so deleting the partition causes no conflict on the partition metadata. The partition and its files will disappear and the MapReduce job will fail. We need to protect the job from that.
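
      The gap between job preparation and mapper execution can be illustrated with a small in-memory model. This is only a sketch: ToyPartitionedFileSet and all of its methods are hypothetical stand-ins invented for illustration, not CDAP classes, and the two maps merely imitate the split between partition metadata and partition files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a partitioned file set; not a CDAP class.
class ToyPartitionedFileSet {
    // Partition key -> file path (stands in for the transactional metadata).
    private final Map<String, String> metadata = new HashMap<>();
    // File path -> contents (stands in for the non-transactional files).
    private final Map<String, String> files = new HashMap<>();

    void addPartition(String key, String path, String contents) {
        files.put(path, contents);
        metadata.put(key, path);
    }

    // Job preparation: copy the partition paths, as if read in a short transaction.
    List<String> listPartitionPaths() {
        return new ArrayList<>(metadata.values());
    }

    // Dropping a partition removes both the metadata entry and the file.
    void dropPartition(String key) {
        String path = metadata.remove(key);
        if (path != null) {
            files.remove(path);
        }
    }

    // Mapper side: read a file by path; fails if the file no longer exists.
    String readFile(String path) {
        String contents = files.get(path);
        if (contents == null) {
            throw new IllegalStateException("partition file missing: " + path);
        }
        return contents;
    }
}
```

      If a caller lists the partition paths, then drops the partition, and only afterwards calls readFile on a listed path, the read throws even though no conflict was ever detected on the metadata. That is the failure mode described above.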

      We need to distinguish between protection for the duration of a transaction and protection for the duration of a program run. For example, a service that responds to data requests may receive the name of the table it reads from as part of the request and dynamically instantiate that dataset inside a transaction. After the transaction is finished, it is OK to delete the dataset. On the other hand, a flow may instantiate a dataset once (@UseDataSet) and keep using it in subsequent transactions. In this case we would want to protect the dataset from deletion for the duration of the program run. (CDAP-6062 does this for Hive queries, piggy-backing on Hive concurrency control, but not for MapReduce.)
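
      One way to picture the two protection scopes is a registry of usage leases that a delete operation consults. The following is a minimal sketch under that assumption, not a proposed design: DatasetLeaseRegistry, Scope, acquire, release and canDelete are all hypothetical names, and a real implementation would need to be distributed and survive program crashes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names are CDAP APIs.
class DatasetLeaseRegistry {
    enum Scope { TRANSACTION, PROGRAM_RUN }

    // Dataset name -> number of outstanding leases, indexed by Scope.ordinal().
    private final Map<String, int[]> leases = new HashMap<>();

    // Record that a transaction or a program run is using the dataset.
    synchronized void acquire(String dataset, Scope scope) {
        int[] counts = leases.computeIfAbsent(dataset, k -> new int[Scope.values().length]);
        counts[scope.ordinal()]++;
    }

    // Release a lease when the transaction ends or the program run finishes.
    synchronized void release(String dataset, Scope scope) {
        int[] counts = leases.get(dataset);
        if (counts == null || counts[scope.ordinal()] == 0) {
            throw new IllegalStateException("no " + scope + " lease held on " + dataset);
        }
        counts[scope.ordinal()]--;
        // Forget the dataset once no lease of any scope remains.
        for (int count : counts) {
            if (count > 0) {
                return;
            }
        }
        leases.remove(dataset);
    }

    // A delete or schema change would be allowed only when nothing holds a lease.
    synchronized boolean canDelete(String dataset) {
        return !leases.containsKey(dataset);
    }
}
```

      In this model, the service from the example would acquire and release a TRANSACTION lease around each request, so the dataset becomes deletable between requests. The flow would hold a PROGRAM_RUN lease from start to stop, so the dataset stays protected even between its transactions.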

      This is a large piece of work that requires changes in many components. This Jira will have sub-tasks for the individual work items once a design is established.

            People

            • Assignee:
              andreas Andreas Neumann
              Reporter:
              andreas Andreas Neumann