JSON
This guide walks through creating a data generator for JSON. By the end, you will be able to generate and validate JSON files via Docker.
Requirements
- 10 minutes
- Git
- Gradle
- Docker
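Before starting, it can help to confirm the tools listed above are installed. The following is a minimal sketch, assuming each tool is available on the PATH under its usual command name (`git`, `gradle`, `docker`); `missing_tools` is a hypothetical helper, not part of data-caterer-example.

```shell
# Print the name of every given tool that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools git gradle docker)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All required tools found"
fi
```

If the project ships its own Gradle wrapper, a missing global `gradle` may not matter; that depends on the repo setup.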
Get Started
First, clone the data-caterer-example repo, which already has the base project setup required.
Plan Setup
Create a file depending on which interface you want to use.
- Java: src/main/java/io/github/datacatering/plan/MyJSONJavaPlan.java
- Scala: src/main/scala/io/github/datacatering/plan/MyJSONPlan.scala
- YAML: docker/data/custom/plan/my-json.yaml
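The three paths above can be created from the top level directory of data-caterer-example with a few shell commands. This is only a sketch: `plan_file_for` is a hypothetical helper that maps an interface name to the path listed above, and the file it creates is empty until you add your plan to it.

```shell
# Map an interface name (java, scala or yaml) to the plan file path
# listed in this guide. Returns non-zero for anything else.
plan_file_for() {
  case "$1" in
    java)  echo "src/main/java/io/github/datacatering/plan/MyJSONJavaPlan.java" ;;
    scala) echo "src/main/scala/io/github/datacatering/plan/MyJSONPlan.scala" ;;
    yaml)  echo "docker/data/custom/plan/my-json.yaml" ;;
    *)     echo "unknown interface: $1" >&2; return 1 ;;
  esac
}

# Create the Java plan file; swap "java" for "scala" or "yaml" as needed.
plan_file=$(plan_file_for java)
mkdir -p "$(dirname "$plan_file")"
touch "$plan_file"   # touch leaves an existing file's contents unchanged
echo "Created $plan_file"
```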
In docker/data/custom/plan/my-json.yaml:
In the UI:
- Click on Connection towards the top of the screen
- For connection name, set to my_json
- Click on Select data source type.. and select JSON
- Set Path as /tmp/custom/json/accounts
- Optionally, we could set the number of partitions and columns to partition by
- Click on Create
- You should see your connection my_json show under Existing connections
- Click on Home towards the top of the screen
- Set plan name to my_json_plan
- Set task name to json_task
- Click on Select data source.. and select my_json
This class is where we define all of our configurations for generating data. Helper variables and methods are provided to make it simple and easy to use.
Connection Configuration
Within our class, we can start by defining the connection properties to read/write from/to JSON.
- Java:
var accountTask = json(
    "customer_accounts",                    //name
    "/opt/app/data/customer/account_json",  //path
    Map.of()                                //additional options
);
- Scala:
val accountTask = json(
    "customer_accounts",                    //name
    "/opt/app/data/customer/account_json",  //path
    Map()                                   //additional options
)
Additional options can be found here.
In the UI, we have already created the connection details in the Plan Setup steps above.
Schema
Depending on how you want to define the schema, follow the below:
- Manual schema guide
- Automatically detect schema from the data source by enabling configuration.enableGeneratePlanAndTasks(true)
- Automatically detect schema from a metadata source
Let's create a task for generating data as accounts and then generate data for transactions, which will be related to the accounts generated.
- Java:
var accountTask = json("customer_accounts", "/opt/app/data/customer/account_json")
    .fields(
        field().name("account_id"),
        field().name("balance").type(new DecimalType(5, 2)),
        field().name("created_by"),
        field().name("open_time").type(TimestampType.instance()),
        field().name("status"),
        field().name("customer_details")
            .fields(
                field().name("name"),
                field().name("age").type(IntegerType.instance()),
                field().name("city")
            )
    );
- Scala:
val accountTask = json("customer_accounts", "/opt/app/data/customer/account_json")
    .fields(
        field.name("account_id"),
        field.name("balance").`type`(new DecimalType(5, 2)),
        field.name("created_by"),
        field.name("open_time").`type`(TimestampType),
        field.name("status"),
        field.name("customer_details")
            .fields(
                field.name("name"),
                field.name("age").`type`(IntegerType),
                field.name("city")
            )
    )
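Once data has been generated, one quick way to confirm that a record carries the fields defined above is a plain text check from the shell. The sketch below uses a sample record copied from the expected output later in this guide; `missing_fields` is a hypothetical helper, and it does a simple substring match on key names rather than parsing the JSON.

```shell
# Print every expected field name that does not appear as a key in the record.
# Usage: missing_fields '<one JSON line>' field1 field2 ...
missing_fields() {
  rec=$1
  shift
  for field in "$@"; do
    case "$rec" in
      *"\"$field\":"*) ;;                # key found, nothing to report
      *) printf '%s\n' "$field" ;;       # key not found
    esac
  done
}

# Sample record taken from the expected output in this guide.
record='{"account_id":"ACC00047541","balance":445.62,"created_by":"event","open_time":"2024-03-13T00:31:38.836Z","status":"suspended","customer_details":{"name":"Joey Gaylord","age":44,"city":"Lake Jose"}}'

missing=$(missing_fields "$record" account_id balance created_by open_time status customer_details name age city)
if [ -z "$missing" ]; then
  echo "All expected fields are present"
else
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing fields:" $missing
fi
```

Because the match is textual, it does not check nesting or data types; it only catches fields that are absent altogether.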
In docker/data/custom/task/json/json-task.yaml:
In the UI:
- Click on Generation and tick the Manual checkbox
- Click on + Field
- Add name as account_id
- Click on Select data type and select string
- Click on + Field and add name as balance
- Click on Select data type and select double
- Click on + Field and add name as created_by
- Click on Select data type and select string
- Click on + Field and add name as open_time
- Click on Select data type and select timestamp
- Click on + Field and add name as status
- Click on Select data type and select string
- Click on + Field and add name as customer_details
- Click on Select data type and select struct
- Under customer_details, click on + Field and add name name
- Under customer_details, click on + Field and add name age, set data type as integer
- Under customer_details, click on + Field and add name city
- Under
Additional Configurations
At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.
In docker/data/custom/application.conf:
In the UI:
- Click on Advanced Configuration towards the bottom of the screen
- Click on Flag and click on Unique Check
- Click on Folder and enter /tmp/data-caterer/report for Generated Reports Folder Path
Run
Now we can run the class we just created via the script ./run.sh, which is in the top level directory of data-caterer-example.
- Java:
./run.sh MyJSONJavaPlan
- Scala:
./run.sh MyJSONPlan
- YAML:
./run.sh my-json.yaml
After the run finishes, check the generated records:
# Extract the account_id of the first generated account record
account=$(head -1 docker/sample/customer/account_json/part-00000-* | sed -nE 's/.*account_id":"(.+)","balance.*/\1/p')
echo "Head account record:"
head -1 docker/sample/customer/account_json/part-00000-*
echo "$account"
echo "Transaction records:"
# List the transactions generated for that account
cat docker/sample/customer/transaction_json/part-0000* | grep "$account"
In the UI:
- Click the button Execute at the top
- Progress updates will show in the bottom right corner
- Click on History at the top
- Check for your plan name and see the result summary
- Click on Report on the right side to see more details of what was executed
It should look something like this.
Head account record:
{"account_id":"ACC00047541","balance":445.62,"created_by":"event","open_time":"2024-03-13T00:31:38.836Z","status":"suspended","customer_details":{"name":"Joey Gaylord","age":44,"city":"Lake Jose"}}
ACC00047541
Transaction records:
{"account_id":"ACC00047541","full_name":"Joey Gaylord","amount":31.485424217447527,"time":"2023-11-07T04:50:20.875Z","date":"2023-11-07"}
{"account_id":"ACC00047541","full_name":"Joey Gaylord","amount":79.22177964401857,"time":"2024-02-01T15:15:38.289Z","date":"2024-02-01"}
{"account_id":"ACC00047541","full_name":"Joey Gaylord","amount":56.06230355456882,"time":"2024-02-29T21:42:42.473Z","date":"2024-02-29"}
Congratulations! You have now made a data generator that simulates a real-world data scenario. You can also compare your plan against the JsonJavaPlan.java or JsonPlan.scala files to check that it is the same.
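To see how the account and transaction records relate, the check from the run step can be tried against the sample output shown above without running the generator. This is a sketch under stated assumptions: the temporary directory and the file names accounts.json and transactions.json are hypothetical stand-ins for the part-* files under docker/sample/customer, and the sed pattern is a slightly tighter variant of the one used in the run step.

```shell
# Work in a throwaway directory so the sketch does not touch real output.
workdir=$(mktemp -d)

# Sample records copied from the expected output in this guide.
cat > "$workdir/accounts.json" <<'EOF'
{"account_id":"ACC00047541","balance":445.62,"created_by":"event","open_time":"2024-03-13T00:31:38.836Z","status":"suspended","customer_details":{"name":"Joey Gaylord","age":44,"city":"Lake Jose"}}
EOF
cat > "$workdir/transactions.json" <<'EOF'
{"account_id":"ACC00047541","full_name":"Joey Gaylord","amount":31.485424217447527,"time":"2023-11-07T04:50:20.875Z","date":"2023-11-07"}
{"account_id":"ACC00047541","full_name":"Joey Gaylord","amount":79.22177964401857,"time":"2024-02-01T15:15:38.289Z","date":"2024-02-01"}
{"account_id":"ACC00047541","full_name":"Joey Gaylord","amount":56.06230355456882,"time":"2024-02-29T21:42:42.473Z","date":"2024-02-29"}
EOF

# Pull the account_id out of the first account record.
account=$(head -1 "$workdir/accounts.json" | sed -nE 's/.*"account_id":"([^"]+)".*/\1/p')

# Count the transaction records that reference that account.
matches=$(grep -c "\"account_id\":\"$account\"" "$workdir/transactions.json")

echo "Account $account has $matches transaction(s)"
rm -rf "$workdir"
```

With the sample records above this prints `Account ACC00047541 has 3 transaction(s)`. A count of zero against real output would suggest the transactions were not linked to the generated accounts.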
Validation
If you want to validate data from a JSON source, follow the validation documentation found here to help guide you.