Data Contract CLI Source
Creating a data generator for a CSV file based on metadata stored in Data Contract CLI.
Requirements
- 10 minutes
- Git
- Gradle
- Docker
Get Started
First, we will clone the data-caterer-example repo, which already has the base project setup required.
git clone git@github.com:data-catering/data-caterer-example.git
Data Contract CLI Setup
We will be using the Data Contract CLI file stored under docker/mount/datacontract-cli/datacontract.yaml in the example repo.
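If you do not want to open the repo just yet, a minimal sketch of what such a contract could look like is shown below, following the Data Contract Specification format. The id, title, version and field types here are assumptions for illustration; the field names are taken from the generated output later in this guide.

dataContractSpecification: 0.9.3
id: covid_cases            # assumed id, for illustration only
info:
  title: COVID-19 cases    # assumed title
  version: 0.0.1
models:
  covid_cases:
    type: table
    fields:
      province_state:
        type: string
      latitude:
        type: double
      confirmed:
        type: integer
      fips:
        type: string
      longitude:
        type: double
      country_region:
        type: string
      last_update:
        type: timestamp
      combined_key:
        type: string
      admin2:
        type: string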
Plan Setup
Create a new Java/Scala class or YAML file.
- Java:
src/main/java/io/github/datacatering/plan/MyAdvancedDataContractCliJavaPlanRun.java
- Scala:
src/main/scala/io/github/datacatering/plan/MyAdvancedDataContractCliPlanRun.scala
- YAML:
docker/data/custom/plan/my-datacontract-cli.yaml
Make sure your class extends PlanRun.
import io.github.datacatering.datacaterer.java.api.PlanRun;
...
public class MyAdvancedDataContractCliJavaPlanRun extends PlanRun {
    {
        var conf = configuration().enableGeneratePlanAndTasks(true)
                .generatedReportsFolderPath("/opt/app/data/report");
    }
}
import io.github.datacatering.datacaterer.api.PlanRun
...
class MyAdvancedDataContractCliPlanRun extends PlanRun {
  val conf = configuration.enableGeneratePlanAndTasks(true)
    .generatedReportsFolderPath("/opt/app/data/report")
}
In docker/data/custom/plan/my-datacontract-cli.yaml:
name: "my_datacontract_cli_plan"
description: "Create account data in CSV via Data Contract CLI metadata"
tasks:
- name: "csv_account_file"
dataSourceName: "customer_accounts"
enabled: true
In application.conf:
flags {
  enableUniqueCheck = true
}
folders {
  generatedReportsFolderPath = "/opt/app/data/report"
}
- Click on Advanced Configuration towards the bottom of the screen
- Click on Flag and click on Unique Check
- Click on Folder and enter /tmp/data-caterer/report for Generated Reports Folder Path
We enable the enableGeneratePlanAndTasks flag so that we can read metadata from external sources, and set a reports folder we can easily access.
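If you are following the YAML path, the same behaviour is controlled from application.conf. A minimal sketch, assuming the flag name in application.conf mirrors the API method above:

flags {
  enableGeneratePlanAndTasks = true
  enableUniqueCheck = true
}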
Schema
We can point the schema of a data source to our Data Contract CLI file.
var accountTask = csv("my_csv", "/opt/app/data/customer/account-datacontract-cli", Map.of("header", "true"))
        .schema(metadataSource().dataContractCli("/opt/app/mount/datacontract-cli/datacontract.yaml"))
        .count(count().records(100));
val accountTask = csv("customer_accounts", "/opt/app/data/customer/account-datacontract-cli", Map("header" -> "true"))
  .schema(metadataSource.dataContractCli("/opt/app/mount/datacontract-cli/datacontract.yaml"))
  .count(count.records(100))
In docker/data/custom/task/file/csv/csv-datacontract-cli-account-task.yaml:
name: "csv_account_file"
steps:
- name: "accounts"
type: "csv"
options:
path: "/opt/app/data/csv/account-datacontract-cli"
metadataSourceType: "dataContractCli"
dataContractFile: "/opt/app/mount/datacontract-cli/datacontract.yaml"
count:
records: 100
- Click on the Connection tab at the top
- Select Data Contract CLI as the data source and enter example-datacontract-cli
- Copy the datacontract.yaml file into /tmp/datacontract-cli/datacontract.yaml
- Enter /tmp/datacontract-cli/datacontract.yaml as the Contract File
The above defines that the schema will come from Data Contract CLI, which is a type of metadata source that contains information about schemas. Specifically, it points to the schema file under the docker/mount/datacontract-cli folder of the data-caterer-example repo.
Run
Let's try running it and see what happens.
- Java:
./run.sh MyAdvancedDataContractCliJavaPlanRun
head docker/sample/customer/account-datacontract-cli/part-00000-*
- Scala:
./run.sh MyAdvancedDataContractCliPlanRun
head docker/sample/customer/account-datacontract-cli/part-00000-*
- YAML:
./run.sh my-datacontract-cli.yaml
head docker/sample/customer/account-datacontract-cli/part-00000-*
- UI: Click on Execute at the top, then run:
head /tmp/data-caterer/customer/account-datacontract-cli/part-00000*
It should look something like this.
province_state,latitude,confirmed,fips,longitude,country_region,last_update,combined_key,admin2
fwFaFV F73BAIfFd,69977.84296117249,17533,ln9 CRbGkQ9IEyuW,793.3222856184141,87YVVqgS1podHa S,2024-02-10T10:25:39.176Z,sAnv74T9xOyA6MZI,06iRhvBBy40WBlVf
W9N6z1 s7CYyc4L3,54580.231575916325,96761,4mxWLbwArVKOhg6E,58977.422371028944,TkCABcFIYJf87okg,2024-09-07T17:45:27.641Z,9GDm6MGk3WfPdorc,TQdRvrCSgCXg ioP
dp2E6zXwoSKJ5 J2,13368.961196453121,18606,wGJ3iQNg5SdaN4ad,22482.40836235147,r4 Ka6J9ZNKQVEHK,2024-01-25T14:01:09.224Z,RYh6Kl5 46QvOZFR,eEad607OtQX15Vlw
sfQG0neaO5hS7PlV,17461.556283773938,40155,DeSwWCpYwa4WFx5F,81371.85361585379,F2 tzIJS9JsTlhuE,2024-06-13T08:44:55.555Z,JnnGplRjkjo6SgOX,8B5h7UuV2r965wD4
rAISjVikM0ScAsRX,65831.49716656232,36392,vKhuncOokeDgia7e,67677.50911541228,zZVJkymK09ef5oFC,2024-01-01T14:32:02.881Z,lLdHa4JExfuN2FXv,ebcPhXgYJMYTAla1
Looks like we have some data now. But we can do better and add some enhancements to it.
Custom metadata
We can see from the data generated that it isn't quite what we want. Sometimes the metadata is not sufficient for us to produce production-like data yet, and we want to edit it manually. Let's try to add some enhancements to it.
Let's make the latitude and longitude fields make sense. latitude is meant to be between -90 and 90, whilst longitude is between -180 and 180. country_region should also represent a state name.
For the full guide on data generation options, check the data generation documentation.
var accountTask = csv("my_csv", "/opt/app/data/customer/account-datacontract-cli", Map.of("header", "true"))
        .schema(metadata...)
        .schema(
                field().name("latitude").min(-90).max(90),
                field().name("longitude").min(-180).max(180),
                field().name("country_region").expression("#{Address.state}")
        )
        .count(count().records(100));
val accountTask = csv("customer_accounts", "/opt/app/data/customer/account-datacontract-cli", Map("header" -> "true"))
  .schema(metadata...)
  .schema(
    field.name("latitude").min(-90).max(90),
    field.name("longitude").min(-180).max(180),
    field.name("country_region").expression("#{Address.state}")
  )
  .count(count.records(100))
In docker/data/custom/task/file/csv/csv-datacontract-cli-account-task.yaml:
name: "csv_account_file"
steps:
- name: "accounts"
type: "csv"
options:
path: "/opt/app/data/csv/account-datacontract-cli"
metadataSourceType: "dataContractCli"
dataContractFile: "/opt/app/mount/datacontract-cli/datacontract.yaml"
count:
records: 100
schema:
fields:
- name: "latitude"
options:
min: -90
max: 90
- name: "longitude"
options:
min: -180
max: 180
- name: "country_region"
options:
expression: "#{Address.state}"
- Click on Generation and tick the Manual checkbox
- Click on + Field
- Go to the latitude field
- Select data type as double
- Click on the + dropdown next to the double data type
- Click Min and enter -90
- Click Max and enter 90
- Click on + Field
- Go to the longitude field
- Select data type as double
- Click on the + dropdown next to the double data type
- Click Min and enter -180
- Click Max and enter 180
- Click on + Field
- Go to the country_region field
- Click on the + dropdown next to the string data type
- Click Faker Expression and enter #{Address.state}
Let's test it out by running it again.
- Java:
./run.sh MyAdvancedDataContractCliJavaPlanRun
head docker/sample/customer/account-datacontract-cli/part-00000-*
- Scala:
./run.sh MyAdvancedDataContractCliPlanRun
head docker/sample/customer/account-datacontract-cli/part-00000-*
- YAML:
./run.sh my-datacontract-cli.yaml
head docker/sample/customer/account-datacontract-cli/part-00000-*
- UI: Click on Execute at the top, then run:
head /tmp/data-caterer/customer/account-datacontract-cli/part-00000*
province_state,latitude,confirmed,fips,longitude,country_region,last_update,combined_key,admin2
HY5GstfIPnXT0em,35.73941132584518,63652,6YS4JJvZ8N9JsqT,27.037747952451554,Connecticut,2023-12-24T12:42:08.798Z,qIPco7WUo5jXA D,ODADv25VyKsf6Qn
vnkQrkwgf9oj xR,81.87829759208316,73064,cPgrOuPwBVnxK2b,-146.20644012308924,Illinois,2024-03-14T10:24:52.327Z,7NYzdyaM87VjlfH,KUpbi4msmXWZYS4
jnSwW Pk6zj1LsC,82.87970774482852,72341,rL5XqKZtM5unS9x,-153.1279291007243,Mississippi,2024-08-29T15:30:56.338Z,NouXv6EXlWY1Ihe,mirpEgTno0OEDH8
ZmNNb9C5g t8CgJ,43.58312642271184,73116,NFlRmB8p0egkFqG,179.56650534615852,Indiana,2024-01-22T17:05:51.968Z,Fkxf0l3CC a42o5,JznmesYH8ReGhg3
Uf5QH6luS4u5SnO,-75.64320251178277,6232,yRQLBU2OQvm5uqC,-31.025626492871083,New Jersey,2024-09-25T02:35:03.477Z,7IXVfeL6BEpkRbf,f7wUqnigV8WU4B
Great! Now we have the ability to get schema information from an external source, add our own metadata and generate data.
Data validation
To find out what data validation options are available, check the data validation documentation.
Another aspect of Data Contract CLI that can be leveraged is its definition of data quality rules. In a later version of Data Caterer, these data quality rules could be imported and run within Data Caterer. Once available, it will be as easy as enabling data validations via enableGenerateValidations in configuration.
var conf = configuration().enableGeneratePlanAndTasks(true)
        .enableGenerateValidations(true)
        .generatedReportsFolderPath("/opt/app/data/report");

execute(conf, accountTask);
val conf = configuration.enableGeneratePlanAndTasks(true)
  .enableGenerateValidations(true)
  .generatedReportsFolderPath("/opt/app/data/report")

execute(conf, accountTask)
In application.conf:
flags {
  enableGenerateValidations = true
}
- Click on Advanced Configuration towards the bottom of the screen
- Click on Flag and click on Generate Validations
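In the meantime, validations can be defined manually on the same task. A minimal Scala sketch, assuming the validation.expr API and reusing the latitude/longitude bounds from earlier (illustrative only, not the contract-derived rules):

val accountTask = csv("customer_accounts", "/opt/app/data/customer/account-datacontract-cli", Map("header" -> "true"))
  .schema(metadataSource.dataContractCli("/opt/app/mount/datacontract-cli/datacontract.yaml"))
  .validations(
    validation.expr("latitude BETWEEN -90 AND 90"),   // same bounds we set during generation
    validation.expr("longitude BETWEEN -180 AND 180") // validation.expr API is an assumption here
  )
  .count(count.records(100))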
Check out the full example under AdvancedDataContractCliSourcePlanRun in the example repo.