Skip to content

JSON

Creating a data generator for JSON. You will have the ability to generate and validate JSON files via Docker.

Requirements

  • 10 minutes
  • Git
  • Gradle
  • Docker

Get Started

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:data-catering/data-caterer-example.git
git clone git@github.com:data-catering/data-caterer-example.git
git clone git@github.com:data-catering/data-caterer-example.git

Plan Setup

Create a file depending on which interface you want to use.

  • Java: src/main/java/io/github/datacatering/plan/MyJSONJavaPlan.java
  • Scala: src/main/scala/io/github/datacatering/plan/MyJSONPlan.scala
  • YAML: docker/data/custom/plan/my-json.yaml
import io.github.datacatering.datacaterer.java.api.PlanRun;

public class MyJSONJavaPlan extends PlanRun {
}
import io.github.datacatering.datacaterer.api.PlanRun

class MyJSONPlan extends PlanRun {
}

In docker/data/custom/plan/my-json.yaml:

name: "my_json_plan"
description: "Create account data in JSON"
tasks:
  - name: "json_task"
    dataSourceName: "my_json"

  1. Click on Connection towards the top of the screen
  2. For connection name, set to my_json
  3. Click on Select data source type.. and select JSON
  4. Set Path as /tmp/custom/json/accounts
    1. Optionally, we could set the number of partitions and columns to partition by
  5. Click on Create
  6. You should see your connection my_json show under Existing connections
  7. Click on Home towards the top of the screen
  8. Set plan name to my_json_plan
  9. Set task name to json_task
  10. Click on Select data source.. and select my_json

This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.

Connection Configuration

Within our class, we can start by defining the connection properties to read/write from/to JSON.

var accountTask = json(
    "customer_accounts",                    //name
    "/opt/app/data/customer/account_json",  //path
    Map.of()                                //additional options
);

Additional options can be found here.

val accountTask = json(
  "customer_accounts",                    //name         
  "/opt/app/data/customer/account_json",  //path
  Map()                                   //additional options
)

Additional options can be found here.

In docker/data/custom/application.conf:

json {
    my_json {
        "dateFormat": "dd-MM-yyyy"
    }
}

  1. We have already created the connection details in this step

Schema

Depending on how you want to define the schema, follow the below:

Let's create a task for generating data as accounts and then generate data for transactions, which will be related to the accounts generated.

var accountTask = json("customer_accounts", "/opt/app/data/customer/account_json")
        .fields(
            field().name("account_id"),
            field().name("balance").type(new DecimalType(5, 2)),
            field().name("created_by"),
            field().name("open_time").type(TimestampType.instance()),
            field().name("status"),
            field().name("customer_details")
                .fields(
                    field().name("name"),
                    field().name("age").type(IntegerType.instance()),
                    field().name("city")
                )
        );
val accountTask = json("customer_accounts", "/opt/app/data/customer/account_json")
  .fields(
    field.name("account_id"),
    field.name("balance").`type`(new DecimalType(5, 2)),
    field.name("created_by"),
    field.name("open_time").`type`(TimestampType),
    field.name("status"),
    field.name("customer_details")
      .fields(
        field.name("name"),
        field.name("age").`type`(IntegerType),
        field.name("city")
      )
  )

In docker/data/custom/task/json/json-task.yaml:

name: "json_task"
steps:
  - name: "accounts"
    type: "json"
    fields:
      - name: "account_id"
      - name: "balance"
        type: "double"
      - name: "created_by"
      - name: "open_time"
        type: "timestamp"
      - name: "status"
      - name: "customer_details"
        fields:
          - name: "name"
          - name: "age"
            type: "integer"
          - name: "city"

  1. Click on Generation and tick the Manual checkbox
  2. Click on + Field
  3. Add name as account_id
  4. Click on Select data type and select string
  5. Click on + Field and add name as balance
  6. Click on Select data type and select double
  7. Click on + Field and add name as created_by
  8. Click on Select data type and select string
  9. Click on + Field and add name as open_time
  10. Click on Select data type and select timestamp
  11. Click on + Field and add name as status
  12. Click on Select data type and select string
  13. Click on + Field and add name as customer_details
  14. Click on Select data type and select struct
    1. Under customer_details, click on + Field and add name name
    2. Under customer_details, click on + Field and add name age, set data type as integer
    3. Under customer_details, click on + Field and add name city

Additional Configurations

At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.

var config = configuration()
        .generatedReportsFolderPath("/opt/app/data/report")
        .enableUniqueCheck(true);

execute(myPlan, config, accountTask, transactionTask);
val config = configuration
  .generatedReportsFolderPath("/opt/app/data/report")
  .enableUniqueCheck(true)

execute(myPlan, config, accountTask, transactionTask)

In docker/data/custom/application.conf:

flags {
  enableUniqueCheck = true
}
folders {
  generatedReportsFolderPath = "/opt/app/data/report"
}

  1. Click on Advanced Configuration towards the bottom of the screen
  2. Click on Flag and click on Unique Check
  3. Click on Folder and enter /tmp/data-caterer/report for Generated Reports Folder Path

Run

Now we can run via the script ./run.sh that is in the top level directory of the data-caterer-example to run the class we just created.

./run.sh MyJSONJavaPlan
account=$(head -1 docker/sample/customer/account_json/part-00000-* | sed -nr 's/.*account_id":"(.+)","balance.*/\1/p')
echo "Head account record:"
head -1 docker/sample/customer/account_json/part-00000-*
echo $account
echo "Transaction records:"
cat docker/sample/customer/transaction_json/part-0000* | grep $account
./run.sh MyJSONPlan
account=$(head -1 docker/sample/customer/account_json/part-00000-* | sed -nr 's/.*account_id":"(.+)","balance.*/\1/p')
echo "Head account record:"
head -1 docker/sample/customer/account_json/part-00000-*
echo $account
echo "Transaction records:"
cat docker/sample/customer/transaction_json/part-0000* | grep $account
./run.sh my-json.yaml
account=$(head -1 docker/sample/customer/account_json/part-00000-* | sed -nr 's/.*account_id":"(.+)","balance.*/\1/p')
echo "Head account record:"
head -1 docker/sample/customer/account_json/part-00000-*
echo $account
echo "Transaction records:"
cat docker/sample/customer/transaction_json/part-0000* | grep $account
  1. Click the button Execute at the top
  2. Progress updates will show in the bottom right corner
  3. Click on History at the top
  4. Check for your plan name and see the result summary
  5. Click on Report on the right side to see more details of what was executed

It should look something like this.

Head account record:
{"account_id":"ACC00047541","balance":445.62,"created_by":"event","open_time":"2024-03-13T00:31:38.836Z","status":"suspended","customer_details":{"name":"Joey Gaylord","age":44,"city":"Lake Jose"}}
ACC00047541
Transaction records:
{"account_id":"ACC00047541","full_name":"Joey Gaylord","amount":31.485424217447527,"time":"2023-11-07T04:50:20.875Z","date":"2023-11-07"}
{"account_id":"ACC00047541","full_name":"Joey Gaylord","amount":79.22177964401857,"time":"2024-02-01T15:15:38.289Z","date":"2024-02-01"}
{"account_id":"ACC00047541","full_name":"Joey Gaylord","amount":56.06230355456882,"time":"2024-02-29T21:42:42.473Z","date":"2024-02-29"}

Congratulations! You have now made a data generator that has simulated a real world data scenario. You can check the JsonJavaPlan.java or JsonPlan.scala files as well to check that your plan is the same.

Validation

If you want to validate data from a JSON source, follow the validation documentation found here to help guide you.