  1. CDAP
  2. CDAP-9523

Column Dropdown - Extract text using Regex Groups

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1.2
    • Component/s: Data Prep, UI
    • Labels:
    • Epic Link:
    • Sprint:
      UP3 Sprint 1, UP3 Sprint 2
    • Release Notes:
      Added the ability to extract text using regex patterns in the Data Prep UI as a point-and-click directive.

      Description

      https://github.com/hydrator/wrangler-transform/blob/release/1.0/docs/directives/extract-regex-groups.md

      • Change the top-level menu from "Split to Columns" to "Extract Text".
      • "Extract Text" has two sub-menu items: "Using Split" and "Using Regex Groups".
      • "Using Split" keeps the current "Split to Columns" flow, with the submenu moving to a modal.
      • "Using Regex Groups" opens a modal titled "Select Regex Pattern".
      • The modal has some default patterns in a dropdown:
      • Email, Regex:
        ((?:\S+|".*?")+@[a-zA-Z0-9\.-]+(?:\.[a-zA-Z]{2,6})?)
      • Phone numbers, Regex:
        ((?:\+\d{1,3}[\s-]?)?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4})
      • SSN, Regex:
        (\d{3}[-\s]?\d{2}[-\s]?\d{4})
      • Credit Cards, Regex:
        ((?:\d{4}[-\s]?){4})
      • URL, Regex:
        ((?:https?://)?[a-zA-Z0-9\.-]+\.[a-zA-Z]{2,6}(?:/[\w\.-]+)*(?:\?[\w\.&=\-]+)?)
      • IP Address, Regex:
        ((?:(?:0|(?:25[0-5])|(?:2[0-4]\d)|(?:1\d\d)|(?:[1-9]\d?))\.){3}(?:(?:0|(?:25[0-5])|(?:2[0-4]\d)|(?:1\d\d)|(?:[1-9]\d?))))
      • Mac Addresses, Regex:
        ((?:\p{XDigit}{2}[:-]){5}(?:\p{XDigit}{2}))
      • HTML Tag, Regex:
        <([a-zA-Z]+)(?:\s+[a-zA-Z]+=".*?")*(?:(?:>(.*)</\1>)|(?:\s*/?>))
      • HTML Hyperlinks, Regex:
        <[aA](?:\s+[a-zA-Z]+=".*?")*\s+[hH][rR][eE][fF]="(.*?)"(?:\s+[a-zA-Z]+=".*?")*>(.*)</[aA]>
      • Date, Regex:
        ((?:(?:\d{4}|\d{2})(?:(?:[.,]\s)|[-/.\s])(?:(?:1[0-2])|(?:0?\d)|(?:[a-zA-Z]{3}))(?:(?:[.,]\s)|[-/.\s])(?:\d{1,2}))|(?:(?:(?:\d{1,2})(?:(?:[.,]\s)|[-/.\s])(?:(?:1[0-2])|(?:0?\d)|(?:[a-zA-Z]{3}))|(?:(?:1[0-2])|(?:0?\d)|(?:[a-zA-Z]{3}))(?:(?:[.,]\s)|[-/.\s])(?:\d{1,2}))(?:(?:[.,]\s)|[-/.\s])(?:\d{4}|\d{2})))
        
      • Time, Regex:
        ((?:(?:2[0-3])|(?:[01]?\d))[h:\s][0-5]\d(?::(?:(?:[0-5]\d)|(?:60)))?(?:\s[aApP][mM])?(?:Z|(?:[+-](?:1[0-2])|(?:0?\d):[0-5]\d)|(?:\s[[a-zA-Z]\s]+))?)
      • Datetime, Regex:
        ((?:(?:(?:\d{4}|\d{2})(?:(?:[.,]\s)|[-/.\s])(?:(?:1[0-2])|(?:0?\d)|(?:[a-zA-Z]{3}))(?:(?:[.,]\s)|[-/.\s])(?:\d{1,2}))|(?:(?:(?:\d{1,2})(?:(?:[.,]\s)|[-/.\s])(?:(?:1[0-2])|(?:0?\d)|(?:[a-zA-Z]{3}))|(?:(?:1[0-2])|(?:0?\d)|(?:[a-zA-Z]{3}))(?:(?:[.,]\s)|[-/.\s])(?:\d{1,2}))(?:(?:[.,]\s)|[-/.\s])(?:\d{4}|\d{2})))[T\s](?:(?:(?:2[0-3])|(?:[01]?\d))[h:\s][0-5]\d(?::(?:(?:[0-5]\d)|(?:60)))?(?:\s[aApP][mM])?(?:Z|(?:[+-](?:1[0-2])|(?:0?\d):[0-5]\d)|(?:\s[[a-zA-Z]\s]+))?))
      • UPS Codes, Regex:
        (1Z\s?[0-9a-zA-Z]{3}\s?[0-9a-zA-Z]{3}\s?[0-9a-zA-Z]{2}\s?\d{4}\s?\d{4})
      • ISBN Codes, Regex:
        ((?:97[89]-?)?(?:\d-?){9}[\dxX])
      • N Digits Numbers, Regex:
        
        
      • Start/End Pattern, Regex:
        
        
      • Custom Pattern. This option has a text box with the placeholder "E.g. [^(]+\(([0-9]{4})\).*".
        It has two buttons: "Extract" and "Cancel". Extract performs the operation and generates the columns. Cancel aborts it.
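
      To illustrate what "Extract" is expected to do with a selected pattern, here is a minimal sketch in Python. It is not the directive's implementation (the linked directive documentation is authoritative). The function name extract_regex_groups, the row-as-dict representation, and the <column>_<match>_<group> naming of the generated columns are assumptions made for illustration. The SSN pattern and the custom-pattern placeholder are copied from the list above. Both happen to be portable to Python's re module, which is not true of every pattern in the list (for example, \p{XDigit} is Java syntax).

```python
import re

# Patterns copied verbatim from the ticket.
SSN = r"(\d{3}[-\s]?\d{2}[-\s]?\d{4})"
CUSTOM = r"[^(]+\(([0-9]{4})\).*"


def extract_regex_groups(row, column, pattern):
    """Return a copy of `row` with one new column per captured group.

    Hypothetical naming scheme: <column>_<match index>_<group index>,
    both indices starting at 1. Groups that did not participate in a
    match yield None.
    """
    out = dict(row)
    matches = re.finditer(pattern, row[column])
    for match_idx, match in enumerate(matches, start=1):
        for group_idx, value in enumerate(match.groups(), start=1):
            out[f"{column}_{match_idx}_{group_idx}"] = value
    return out


# Custom pattern: one match, one group (the four-digit year).
print(extract_regex_groups({"title": "Toy Story (1995)"}, "title", CUSTOM))
# {'title': 'Toy Story (1995)', 'title_1_1': '1995'}

# SSN pattern: two matches in the same cell, one group each.
print(extract_regex_groups({"body": "a 123-45-6789 b 987 65 4321"}, "body", SSN))
# {'body': 'a 123-45-6789 b 987 65 4321', 'body_1_1': '123-45-6789', 'body_2_1': '987 65 4321'}
```

      The source column is kept and new columns are appended, so the operation stays non-destructive. Whether the real directive does the same is not stated in this ticket.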


              People

              • Assignee:
                Ajai Narayan (ajai)
              • Reporter:
                Lea (lea)
