Data Contract CLI Source

Data Caterer reading from Data Contract CLI file for schema metadata and data quality

Creating a data generator for a CSV file based on metadata stored in Data Contract CLI.

Requirements

10 minutes
Git
Gradle
Docker

Get Started

First, we will clone the data-caterer-example repo which will already have the base project setup required.

JavaScalaYAMLUI

git clone git@github.com:data-catering/data-caterer-example.git

git clone git@github.com:data-catering/data-caterer-example.git

git clone git@github.com:data-catering/data-caterer-example.git

Run Data Caterer UI via the 'Quick Start' found here.

Data Contract CLI Setup

We will be using the following Data Contract CLI file for this example.

Plan Setup

Create a new Java/Scala class or YAML file.

Java: src/main/java/io/github/datacatering/plan/MyAdvancedDataContractCliJavaPlanRun.java
Scala: src/main/scala/io/github/datacatering/plan/MyAdvancedDataContractCliPlanRun.scala
YAML: docker/data/customer/plan/my-datacontract-cli.yaml

Make sure your class extends PlanRun.

JavaScalaYAMLUI

import io.github.datacatering.datacaterer.java.api.PlanRun;
...

public class MyAdvancedDataContractCliJavaPlanRun extends PlanRun {
    {
        var conf = configuration().enableGeneratePlanAndTasks(true)
            .generatedReportsFolderPath("/opt/app/data/report");
    }
}

import io.github.datacatering.datacaterer.api.PlanRun
...

class MyAdvancedDataContractCliPlanRun extends PlanRun {
  val conf = configuration.enableGeneratePlanAndTasks(true)
    .generatedReportsFolderPath("/opt/app/data/report")
}

In docker/data/custom/plan/my-datacontract-cli.yaml:

name: "my_datacontract_cli_plan"
description: "Create account data in CSV via Data Contract CLI metadata"
tasks:
  - name: "csv_account_file"
    dataSourceName: "customer_accounts"
    enabled: true

In application.conf:

flags {
  enableUniqueCheck = true
}
folders {
  generatedReportsFolderPath = "/opt/app/data/report"
}

Click on Advanced Configuration towards the bottom of the screen
Click on Flag and click on Unique Check
Click on Folder and enter /tmp/data-caterer/report for Generated Reports Folder Path

We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.

Schema

We can point the schema of a data source to our Data Contract CLI file.

JavaScalaYAMLUI

var accountTask = csv("my_csv", "/opt/app/data/account-datacontract-cli", Map.of("header", "true"))
        .schema(metadataSource().dataContractCli("/opt/app/mount/datacontract-cli/datacontract.yaml"))
        .count(count().records(100));

val accountTask = csv("customer_accounts", "/opt/app/data/customer/account-datacontract-cli", Map("header" -> "true"))
  .schema(metadataSource.dataContractCli("/opt/app/mount/datacontract-cli/datacontract.yaml"))
  .count(count.records(100))

In docker/data/custom/task/file/csv/csv-datacontract-cli-account-task.yaml:

name: "csv_account_file"
steps:
  - name: "accounts"
    type: "csv"
    options:
      path: "/opt/app/data/csv/account-datacontract-cli"
      metadataSourceType: "dataContractCli"
      dataContractFile: "/opt/app/mount/datacontract-cli/datacontract.yaml"
    count:
      records: 100

Click on Connection tab at the top
Select Data Contract CLI as the data source and enter example-datacontract-cli
Copy this file into /tmp/datacontract-cli/datacontract.yaml
Enter /tmp/datacontract-cli/datacontract.yaml as the Contract File

The above defines that the schema will come from Data Contract CLI, which is a type of metadata source that contains information about schemas. Specifically, it points to the schema provided here in the docker/mount/datacontract-cli folder of data-caterer-example repo.

Run

Let's try run and see what happens.

JavaScalaYAMLUI

./run.sh MyAdvancedDataContractCliJavaPlanRun
head docker/sample/customer/account-datacontract-cli/part-00000-*

./run.sh MyAdvancedDataContractCliPlanRun
head docker/sample/customer/account-datacontract-cli/part-00000-*

./run.sh my-datacontract-cli.yaml
head docker/sample/customer/account-datacontract-cli/part-00000-*

Click on Execute at the top

head /tmp/data-caterer/customer/account-datacontract-cli/part-00000*

It should look something like this.

province_state,latitude,confirmed,fips,longitude,country_region,last_update,combined_key,admin2
fwFaFV F73BAIfFd,69977.84296117249,17533,ln9 CRbGkQ9IEyuW,793.3222856184141,87YVVqgS1podHa S,2024-02-10T10:25:39.176Z,sAnv74T9xOyA6MZI,06iRhvBBy40WBlVf
W9N6z1 s7CYyc4L3,54580.231575916325,96761,4mxWLbwArVKOhg6E,58977.422371028944,TkCABcFIYJf87okg,2024-09-07T17:45:27.641Z,9GDm6MGk3WfPdorc,TQdRvrCSgCXg ioP
dp2E6zXwoSKJ5 J2,13368.961196453121,18606,wGJ3iQNg5SdaN4ad,22482.40836235147,r4 Ka6J9ZNKQVEHK,2024-01-25T14:01:09.224Z,RYh6Kl5 46QvOZFR,eEad607OtQX15Vlw
sfQG0neaO5hS7PlV,17461.556283773938,40155,DeSwWCpYwa4WFx5F,81371.85361585379,F2 tzIJS9JsTlhuE,2024-06-13T08:44:55.555Z,JnnGplRjkjo6SgOX,8B5h7UuV2r965wD4
rAISjVikM0ScAsRX,65831.49716656232,36392,vKhuncOokeDgia7e,67677.50911541228,zZVJkymK09ef5oFC,2024-01-01T14:32:02.881Z,lLdHa4JExfuN2FXv,ebcPhXgYJMYTAla1

Looks like we have some data now. But we can do better and add some enhancements to it.

Custom metadata

We can see from the data generated, that it isn't quite what we want. Sometimes, the metadata is not sufficient for us to produce production-like data yet, and we want to manually edit it. Let's try to add some enhancements to it.

Let's make the latitude and longitude fields make sense. latitude is meant to be between -90 to 90 whilst longitude is between -180 to 180. country_region should also represent a state name. For the full guide on data generation options, check the following page.

JavaScalaYAMLUI

var accountTask = csv("my_csv", "/opt/app/data/account-datacontract-cli", Map.of("header", "true"))
            .schema(metadata...)
            .schema(
                field().name("latitude").min(-90).max(90),
                field().name("longitude").min(-180).max(180),
                field().name("country_region").expression("#{Address.state}")
            )
            .count(count().records(100));

val accountTask = csv("customer_accounts", "/opt/app/data/customer/account-datacontract-cli", Map("header" -> "true"))
  .schema(metadata...)
  .schema(
    field.name("latitude").min(-90).max(90),
    field.name("longitude").min(-180).max(180),
    field.name("country_region").expression("#{Address.state}")
  )
  .count(count.records(100))

In docker/data/custom/task/file/csv/csv-odcs-account-task.yaml:

name: "csv_account_file"
steps:
  - name: "accounts"
    type: "csv"
    options:
      path: "/opt/app/data/csv/account-datacontract-cli"
      metadataSourceType: "dataContractCli"
      dataContractFile: "/opt/app/mount/datacontract-cli/datacontract.yaml"
    count:
      records: 100
    schema:
      fields:
        - name: "latitude"
          options:
            min: -90
            max: 90
        - name: "longitude"
          options:
            min: -180
            max: 180
        - name: "country_region"
          options:
            expression: "#{Address.state}"

Click on Generation and tick the Manual checkbox
Click on + Field
1. Go to latitude field
2. Select data type as double
3. Click on + dropdown next to double data type
4. Click Min and enter -90
5. Click Max and enter 90
Click on + Field
1. Go to longitude field
2. Select data type as double
3. Click on + dropdown next to double data type
4. Click Min and enter -180
5. Click Max and enter 180
Click on + Field
1. Go to country_region field
2. Click on + dropdown next to string data type
3. Click Faker Expression and enter #{Address.state}

Let's test it out by running it again

JavaScalaYAMLUI

./run.sh MyAdvancedDataContractCliJavaPlanRun
head docker/sample/customer/account-datacontract-cli/part-00000-*

./run.sh MyAdvancedDataContractCliPlanRun
head docker/sample/customer/account-datacontract-cli/part-00000-*

./run.sh my-datacontract-cli.yaml
head docker/sample/customer/account-datacontract-cli/part-00000-*

Click on Execute at the top

head /tmp/data-caterer/customer/account-datacontract-cli/part-00000*

province_state,latitude,confirmed,fips,longitude,country_region,last_update,combined_key,admin2
HY5GstfIPnXT0em,35.73941132584518,63652,6YS4JJvZ8N9JsqT,27.037747952451554,Connecticut,2023-12-24T12:42:08.798Z,qIPco7WUo5jXA D,ODADv25VyKsf6Qn
vnkQrkwgf9oj xR,81.87829759208316,73064,cPgrOuPwBVnxK2b,-146.20644012308924,Illinois,2024-03-14T10:24:52.327Z,7NYzdyaM87VjlfH,KUpbi4msmXWZYS4
jnSwW Pk6zj1LsC,82.87970774482852,72341,rL5XqKZtM5unS9x,-153.1279291007243,Mississippi,2024-08-29T15:30:56.338Z,NouXv6EXlWY1Ihe,mirpEgTno0OEDH8
ZmNNb9C5g t8CgJ,43.58312642271184,73116,NFlRmB8p0egkFqG,179.56650534615852,Indiana,2024-01-22T17:05:51.968Z,Fkxf0l3CC a42o5,JznmesYH8ReGhg3
Uf5QH6luS4u5SnO,-75.64320251178277,6232,yRQLBU2OQvm5uqC,-31.025626492871083,New Jersey,2024-09-25T02:35:03.477Z,7IXVfeL6BEpkRbf,f7wUqnigV8WU4B

Great! Now we have the ability to get schema information from an external source, add our own metadata and generate data.

Data validation

To find out what data validation options are available, check this link.

Another aspect of Data Contract CLI that can be leveraged is the definition of data quality rules. In a later version of Data Caterer, the data quality rules could be later imported and all run within Data Caterer. Once available, it will be as easy as enabling data validations via enableGenerateValidations in configuration.

JavaScalaYAMLUI

var conf = configuration().enableGeneratePlanAndTasks(true)
    .enableGenerateValidations(true)
    .generatedReportsFolderPath("/opt/app/data/report");

execute(conf, accountTask);

val conf = configuration.enableGeneratePlanAndTasks(true)
  .enableGenerateValidations(true)
  .generatedReportsFolderPath("/opt/app/data/report")

execute(conf, accountTask)

In application.conf:

flags {
  enableGenerateValidations = true
}

Click on Advanced Configuration towards the bottom of the screen
Click on Flag and click on Generate Validations

Check out the full example under AdvancedDataContractCliSourcePlanRun in the example repo.