CSV

This guide walks through creating a data generator for CSV. By the end, you will be able to generate and validate CSV files via Docker.

Scala Example | Java Example | YAML Example

Requirements

10 minutes
Git
Gradle
Docker

Get Started

First, clone the data-caterer repo, which already contains the required base project setup.

Java, Scala and YAML:

git clone git@github.com:data-catering/data-caterer.git
cd data-caterer/example

UI:

Run Data Caterer UI via the 'Quick Start' found here.

Plan Setup

Create a file depending on which interface you want to use.

Java: src/main/java/io/github/datacatering/plan/MyCSVJavaPlan.java
Scala: src/main/scala/io/github/datacatering/plan/MyCSVPlan.scala
YAML: docker/data/custom/plan/my-csv.yaml

Java:

import io.github.datacatering.datacaterer.java.api.PlanRun;

public class MyCSVJavaPlan extends PlanRun {
}

Scala:

import io.github.datacatering.datacaterer.api.PlanRun

class MyCSVPlan extends PlanRun {
}

YAML (in docker/data/custom/plan/my-csv.yaml):

name: "my_csv_plan"
description: "Create account data in CSV"
tasks:
  - name: "csv_task"
    dataSourceName: "my_csv"

UI:

Click on Connection towards the top of the screen
For connection name, set to my_csv
Click on Select data source type.. and select CSV
Set Path as /tmp/custom/csv/accounts (optionally, we could also set the number of partitions and columns to partition by)
Click on Create
You should see your connection my_csv show under Existing connections
Click on Home towards the top of the screen
Set plan name to my_csv_plan
Set task name to csv_task
Click on Select data source.. and select my_csv

This class is where we define all of our configurations for generating data. Helper variables and methods are provided to make it simple and easy to use.

Connection Configuration

Within our class, we can start by defining the connection properties used to read from and write to CSV.

Java:

var accountTask = csv(
    "customer_accounts",              //name
    "/opt/app/data/customer/account", //path
    Map.of("header", "true")          //additional options (needs import java.util.Map)
);

Scala:

val accountTask = csv(
  "customer_accounts",              //name
  "/opt/app/data/customer/account", //path
  Map("header" -> "true")           //additional options
)

Additional options, such as including a header row, can be found here.

YAML (in docker/data/custom/application.conf):

csv {
    my_csv {
        "header": "true"
    }
}

UI:

We have already created the connection details in the Plan Setup step.
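To illustrate what the header option changes in the output, here is a minimal sketch. It assumes that a file written with header set to true starts with one row of field names, and it uses a made-up sample file with hypothetical column names (account_id, name) rather than real generated data.

```shell
# Hypothetical stand-in for a generated part file written with header=true.
# The column names below are assumptions, not taken from a real run.
sample=$(mktemp)
printf '%s\n' \
  'account_id,name' \
  'ACC00000001,Jane Doe' \
  'ACC00000002,John Smith' \
  > "$sample"

# The first line holds the field names
header=$(head -n 1 "$sample")
echo "$header"

# Skip the header line so that only data rows are counted
row_count=$(tail -n +2 "$sample" | wc -l | tr -d ' ')
echo "$row_count"
```

When a header row is present, remember to skip it in any row counts or comparisons you run against the generated files.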

Schema

Depending on how you want to define the schema, follow the below:

Manual schema guide
Automatically detect schema from the data source by enabling configuration.enableGeneratePlanAndTasks(true)
Automatically detect schema from a metadata source

Additional Configurations

At the end of data generation, a report is generated that summarises the actions performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure that any unique fields have unique values generated.

Java:

var config = configuration()
        .generatedReportsFolderPath("/opt/app/data/report")
        .enableUniqueCheck(true);

execute(myPlan, config, accountTask, transactionTask);

Scala:

val config = configuration
  .generatedReportsFolderPath("/opt/app/data/report")
  .enableUniqueCheck(true)

execute(myPlan, config, accountTask, transactionTask)

YAML (in docker/data/custom/application.conf):

flags {
  enableUniqueCheck = true
}
folders {
  generatedReportsFolderPath = "/opt/app/data/report"
}

UI:

Click on Advanced Configuration towards the bottom of the screen
Click on Flag and click on Unique Check
Click on Folder and enter /tmp/data-caterer/report for Generated Reports Folder Path
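One way to spot-check the effect of the unique check after a run is to look for repeated values in a column that should be unique. The following is a minimal sketch, not part of Data Caterer itself. It assumes a single made-up file with no header row and a hypothetical layout where the unique field is the first column.

```shell
# Hypothetical stand-in for a generated part file (no header row).
# Assumed layout: unique account ID in column 1, name in column 2.
sample=$(mktemp)
printf '%s\n' \
  'ACC00000001,Jane Doe' \
  'ACC00000002,John Smith' \
  'ACC00000003,Ana Lima' \
  > "$sample"

# Collect every first-column value that occurs more than once
duplicates=$(cut -d "," -f 1 "$sample" | sort | uniq -d)

if [ -z "$duplicates" ]; then
  result="all values unique"
else
  result="duplicates found"
fi
echo "$result"
```

To run the same check against real output, point the cut command at your generated part file and adjust the field number to match your schema.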

Run

Now we can run the class we just created via the ./run.sh script in the top-level directory of the data-caterer/example folder.

Java:

./run.sh MyCSVJavaPlan
account=$(tail -1 docker/sample/customer/account/part-00000* | awk -F "," '{print $1 "," $4}')
echo "$account"
cat docker/sample/customer/transaction/part-00000* | grep "$account"

Scala:

./run.sh MyCSVPlan
account=$(tail -1 docker/sample/customer/account/part-00000* | awk -F "," '{print $1 "," $4}')
echo "$account"
cat docker/sample/customer/transaction/part-00000* | grep "$account"

YAML:

./run.sh my-csv.yaml
account=$(tail -1 docker/sample/customer/account/part-00000* | awk -F "," '{print $1 "," $4}')
echo "$account"
cat docker/sample/customer/transaction/part-00000* | grep "$account"

UI:

Click the button Execute at the top
Progress updates will show in the bottom right corner
Click on History at the top
Check for your plan name and see the result summary
Click on Report on the right side to see more details of what was executed

It should look something like this.

ACC29117767,Willodean Sauer
ACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14
ACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22
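The check commands above take the last account row, keep its first and fourth comma-separated fields, and search the transaction files for that pair. The sketch below reproduces that pipeline on made-up data so it can be tried without running a plan. The rows, the temporary directory, and the column layouts are assumptions for illustration only; the real columns depend on your schema.

```shell
# Hypothetical sample data; real files are produced by ./run.sh
workdir=$(mktemp -d)
mkdir -p "$workdir/account" "$workdir/transaction"

# Assumed account layout: account_id,balance,created_by,name
printf '%s\n' \
  'ACC00000001,10.50,admin,Jane Doe' \
  'ACC00000002,99.10,admin,John Smith' \
  > "$workdir/account/part-00000.csv"

# Assumed transaction layout: account_id,name,amount
printf '%s\n' \
  'ACC00000001,Jane Doe,12.00' \
  'ACC00000002,John Smith,7.25' \
  'ACC00000002,John Smith,3.40' \
  > "$workdir/transaction/part-00000.csv"

# Take the last account row and keep fields 1 and 4 (ID and name)
account=$(tail -1 "$workdir"/account/part-00000* | awk -F "," '{print $1 "," $4}')
echo "$account"

# Quote the variable: the name contains a space, and an unquoted
# expansion would be split into a pattern plus a file name
matches=$(cat "$workdir"/transaction/part-00000* | grep "$account")
echo "$matches"

# Count the matching transaction rows
match_count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
```

Splitting on commas with awk is only safe when no field contains a quoted comma; for such data a CSV-aware tool is a better fit.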

Congratulations! You have now made a data generator that simulates a real-world data scenario. You can also compare your plan against the CSVJavaPlan.java or CSVPlan.scala files to check that it is the same.

Validation

If you want to validate data from a CSV source, follow the validation documentation found here.