JSON Schema Source

Data Caterer reading from JSON Schema file for schema metadata

Creating a data generator for JSON files based on metadata stored in JSON Schema format. JSON Schema provides a powerful way to describe and validate the structure of JSON data, making it an excellent metadata source for generating realistic test data.

Requirements

10 minutes
Git
Gradle
Docker

Get Started

First, we will clone the data-caterer-example repo which will already have the base project setup required.

JavaScalaYAMLUI

git clone git@github.com:data-catering/data-caterer-example.git

git clone git@github.com:data-catering/data-caterer-example.git

git clone git@github.com:data-catering/data-caterer-example.git

Run Data Caterer UI via the 'Quick Start' found here.

JSON Schema Setup

We will be using a JSON Schema file that defines the structure for a financial payment system. You can use your own JSON Schema file by placing it in the appropriate mount folder and following the steps below.

Example JSON Schema structure:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "customer_direct_debit_initiation_v11": {
      "type": "object",
      "properties": {
        "group_header": {
          "type": "object",
          "properties": {
            "message_identification": {"type": "string"},
            "creation_date_time": {"type": "string", "format": "date-time"},
            "number_of_transactions": {"type": "integer"},
            "control_sum": {"type": "number"},
            "initiating_party": {
              "type": "object",
              "properties": {
                "name": {"type": "string"}
              }
            }
          }
        },
        "payment_information": {
          "type": "object",
          "properties": {
            "payment_information_identification": {"type": "string"},
            "payment_method": {"type": "string"},
            "batch_booking": {"type": "boolean"},
            "direct_debit_transaction_information": {
              "type": "object",
              "properties": {
                "payment_identification": {
                  "type": "object",
                  "properties": {
                    "end_to_end_identification": {"type": "string"}
                  }
                },
                "instructed_amount": {
                  "type": "object",
                  "properties": {
                    "value": {"type": "number"},
                    "currency": {"type": "string"}
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Plan Setup

Create a file depending on which interface you want to use.

Java: src/main/java/io/github/datacatering/plan/MyJSONSchemaJavaPlan.java
Scala: src/main/scala/io/github/datacatering/plan/MyJSONSchemaPlan.scala
YAML: docker/data/custom/plan/my-json-schema.yaml

JavaScalaYAMLUI

import io.github.datacatering.datacaterer.java.api.PlanRun;

public class MyJSONSchemaJavaPlan extends PlanRun {
    {
        var conf = configuration().enableGeneratePlanAndTasks(true)
            .generatedReportsFolderPath("/opt/app/data/report");
    }
}

import io.github.datacatering.datacaterer.api.PlanRun

class MyJSONSchemaPlan extends PlanRun {
  val conf = configuration.enableGeneratePlanAndTasks(true)
    .generatedReportsFolderPath("/opt/app/data/report")
}

In docker/data/custom/plan/my-json-schema.yaml:

name: "my_json_schema_plan"
description: "Create JSON data via JSON Schema metadata"
tasks:
  - name: "json_schema_task"
    dataSourceName: "my_json_schema"

In docker/data/custom/application.conf:

flags {
  enableGeneratePlanAndTasks = true
  enableUniqueCheck = true
}
folders {
  generatedReportsFolderPath = "/opt/app/data/report"
}

Click on Advanced Configuration towards the bottom of the screen
Click on Flag and click on Generate Plan And Tasks
Click on Flag and click on Unique Check
Click on Folder and enter /tmp/data-caterer/report for Generated Reports Folder Path

We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.

Connection Configuration

Within our class, we can start by defining the connection properties to read/write from/to JSON and specify the JSON Schema metadata source.

JavaScalaYAMLUI

var jsonSchemaTask = json(
    "my_json_schema",                           //name
    "/opt/app/data/json-schema-output",         //path
    Map.of("saveMode", "overwrite")             //additional options
);

val jsonSchemaTask = json(
  "my_json_schema",                           //name         
  "/opt/app/data/json-schema-output",         //path
  Map("saveMode" -> "overwrite")              //additional options
)

In docker/data/custom/application.conf:

json {
    my_json_schema {
        "saveMode": "overwrite"
    }
}

Click on Connection towards the top of the screen
For connection name, set to my_json_schema
Click on Select data source type.. and select JSON
Set Path as /tmp/custom/json-schema/output
Click on Create

Schema

We can point the schema of a data source to our JSON Schema file. The metadata source will automatically parse the JSON Schema and generate appropriate field definitions.

JavaScalaYAMLUI

var jsonSchemaTask = json("my_json_schema", "/opt/app/data/json-schema-output", Map.of("saveMode", "overwrite"))
        .fields(metadataSource().jsonSchema("/opt/app/mount/json-schema/payment-schema.json"))
        .count(count().records(10));

val jsonSchemaTask = json("my_json_schema", "/opt/app/data/json-schema-output", Map("saveMode" -> "overwrite"))
  .fields(metadataSource.jsonSchema("/opt/app/mount/json-schema/payment-schema.json"))
  .count(count.records(10))

In docker/data/custom/task/file/json/json-schema-task.yaml:

name: "json_schema_task"
steps:
  - name: "json_data"
    type: "json"
    options:
      path: "/opt/app/data/json-schema-output"
      saveMode: "overwrite"
      metadataSourceType: "jsonSchema"
      schemaFile: "/opt/app/mount/json-schema/payment-schema.json"
    count:
      records: 10

Click on Connection tab at the top
Select JSON Schema as the data source and enter my-json-schema-metadata
Create your JSON Schema file at /tmp/json-schema/payment-schema.json
Enter /tmp/json-schema/payment-schema.json as the Schema File
Click on Generation and select the metadata source connection

Field Filtering Options

JSON Schema metadata source supports powerful field filtering capabilities to control which fields are included or excluded from data generation:

JavaScalaYAMLUI

var jsonSchemaTask = json("my_json_schema", "/opt/app/data/json-schema-output", Map.of("saveMode", "overwrite"))
        .fields(metadataSource().jsonSchema("/opt/app/mount/json-schema/payment-schema.json"))
        // Include specific fields only
        .includeFields(List.of(
            "customer_direct_debit_initiation_v11.group_header.message_identification",
            "customer_direct_debit_initiation_v11.group_header.creation_date_time",
            "customer_direct_debit_initiation_v11.payment_information.payment_information_identification",
            "customer_direct_debit_initiation_v11.payment_information.direct_debit_transaction_information.instructed_amount.value"
        ))
        // Or exclude specific fields
        // .excludeFields(List.of(
        //     "customer_direct_debit_initiation_v11.group_header.control_sum",
        //     "customer_direct_debit_initiation_v11.payment_information.batch_booking"
        // ))
        // Or include fields matching patterns
        // .includeFieldPatterns(List.of(".*amount.*", ".*identification.*"))
        // Or exclude fields matching patterns  
        // .excludeFieldPatterns(List.of(".*internal.*", ".*debug.*"))
        .count(count().records(10));

val jsonSchemaTask = json("my_json_schema", "/opt/app/data/json-schema-output", Map("saveMode" -> "overwrite"))
  .fields(metadataSource.jsonSchema("/opt/app/mount/json-schema/payment-schema.json"))
  // Include specific fields only
  .includeFields(
    "customer_direct_debit_initiation_v11.group_header.message_identification",
    "customer_direct_debit_initiation_v11.group_header.creation_date_time",
    "customer_direct_debit_initiation_v11.payment_information.payment_information_identification",
    "customer_direct_debit_initiation_v11.payment_information.direct_debit_transaction_information.instructed_amount.value"
  )
  // Or exclude specific fields
  // .excludeFields(
  //   "customer_direct_debit_initiation_v11.group_header.control_sum",
  //   "customer_direct_debit_initiation_v11.payment_information.batch_booking"
  // )
  // Or include fields matching patterns
  // .includeFieldPatterns(".*amount.*", ".*identification.*")
  // Or exclude fields matching patterns  
  // .excludeFieldPatterns(".*internal.*", ".*debug.*")
  .count(count.records(10))

In docker/data/custom/task/file/json/json-schema-task.yaml:

ste

href="#__codelineno-16-1">name: "json_schema_task" ps: - name: "json_data" type: "json" options: path: "/opt/app/data/json-schema-output" saveMode: "overwrite" metadataSourceType: "jsonSchema" schemaFile: "/opt/app/mount/json-schema/payment-schema.json" # Include specific fields only includeFields: - "customer_direct_debit_initiation_v11.group_header.message_identification" - "customer_direct_debit_initiation_v11.group_header.creation_date_time" - "customer_direct_debit_initiation_v11.payment_information.payment_information_identification" # Or exclude specific fields # excludeFields: # - "customer_direct_debit_initiation_v11.group_header.control_sum" # - "customer_direct_debit_initiation_v11.payment_information.batch_booking" # Or include fields matching patterns # includeFieldPatterns: # - ".*amount.*" # - ".*identification.*" # Or exclude fields matching patterns # excludeFieldPatterns: # - ".*internal.*" # - ".*debug.*" count: records: 10

In the connection configuration, expand Advanced Options
Add Include Fields and enter comma-separated field paths
Or add Exclude Fields for fields to exclude
Or use Include Field Patterns with regex patterns
Or use Exclude Field Patterns with regex patterns

The field filtering options provide flexibility to:

includeFields: Only generate data for the specified field paths
excludeFields: Generate data for all fields except the specified ones
includeFieldPatterns: Include fields matching the regex patterns
excludeFieldPatterns: Exclude fields matching the regex patterns

Field paths use dot notation to navigate nested structures (e.g., parent.child.grandchild).

Additional Configurations

At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations.

JavaScalaYAMLUI

var config = configuration()
        .generatedReportsFolderPath("/opt/app/data/report")
        .enableGeneratePlanAndTasks(true)
        .enableUniqueCheck(true);

execute(config, jsonSchemaTask);

val config = configuration
  .generatedReportsFolderPath("/opt/app/data/report")
  .enableGeneratePlanAndTasks(true)
  .enableUniqueCheck(true)

execute(config, jsonSchemaTask)

In docker/data/custom/application.conf:

flags {
  enableGeneratePlanAndTasks = true
  enableUniqueCheck = true
}
folders {
  generatedReportsFolderPath = "/opt/app/data/report"
}

Click on Advanced Configuration towards the bottom of the screen
Click on Flag and click on Generate Plan And Tasks
Click on Flag and click on Unique Check
Click on Folder and enter /tmp/data-caterer/report for Generated Reports Folder Path

Run

Now we can run via the script ./run.sh that is in the top level directory of the data-caterer-example to run the class we just created.

JavaScalaYAMLUI

./run.sh MyJSONSchemaJavaPlan
head docker/sample/json-schema-output/part-00000-*

./run.sh MyJSONSchemaPlan
head docker/sample/json-schema-output/part-00000-*

./run.sh my-json-schema.yaml
head docker/sample/json-schema-output/part-00000-*

Click the button Execute at the top
Progress updates will show in the bottom right corner
Click on History at the top
Check for your plan name and see the result summary
Click on Report on the right side to see more details of what was executed

It should look something like this.

{
    "customer_direct_debit_initiation_v11": {
        "group_header": {
            "message_identification": "MSG001",
            "creation_date_time": "2024-03-15T10:30:45Z",
            "number_of_transactions": 1,
            "control_sum": 100.50,
            "initiating_party": {
                "name": "ACME Corp"
            }
        },
        "payment_information": {
            "payment_information_identification": "PMT001",
            "payment_method": "DD",
            "batch_booking": true,
            "direct_debit_transaction_information": {
                "payment_identification": {
                    "end_to_end_identification": "TXN001"
                },
                "instructed_amount": {
                    "value": 100.50,
                    "currency": "EUR"
                }
            }
        }
    }
}

Congratulations! You have now made a data generator that uses JSON Schema as a metadata source to generate realistic test data following your schema specifications.

Complex Schema Support

JSON Schema metadata source supports various JSON Schema features:

Data Types

String: Generates random strings with optional length constraints
Number/Integer: Generates numeric values within specified ranges
Boolean: Generates true/false values
Array: Generates arrays with configurable item types and sizes
Object: Generates nested objects with all defined properties

Constraints

enum: Selects from predefined values
pattern: Generates strings matching regex patterns
minimum/maximum: Numeric range constraints
minLength/maxLength: String length constraints
format: Built-in formats like date-time, email, uri, etc.

Nested Structures

Complex nested objects and arrays are fully supported, allowing you to generate data that matches sophisticated JSON Schema definitions.

Validation

If you want to validate data against a JSON Schema, you can use the generated data with validation frameworks or tools that support JSON Schema validation.

The JSON Schema metadata source in Data Caterer focuses on data generation based on the schema structure, ensuring that the generated data conforms to the defined schema constraints and types.