Guides
Below is a list of guides you can follow to create your first Data Caterer job for your use case.
Data Sources
Databases
Files
HTTP
- REST API - Generate data for REST APIs
Messaging
Metadata
- Data Contract CLI - Generate data based on metadata in data contract files in Data Contract CLI format
- Great Expectations - Use validations from Great Expectations for testing
- Marquez - Generate data based on metadata in Marquez
- OpenMetadata - Generate data based on metadata in OpenMetadata
- Open Data Contract Standard (ODCS) - Generate data based on metadata in data contract files in ODCS format
Scenarios
- Auto Generate From Data Connection - Automatically generate data by just defining data sources
- Data Generation - Generate production-like data
- Data Validations - Run data validations after generating data
- Delete Generated Data - Delete the generated data whilst leaving other data intact
- First Data Generation - If you are new, this is the place to start
- Foreign Keys Across Data Sources - Generate matching values across generated data sets
- Generate Batch and Event Data - Generate matching batch and event data
- Multiple Records Per Column Value - Generate multiple records per set of column values
YAML Files
Base Concept
The execution of the data generator is based on the concept of plans and tasks. A plan represents the set of tasks that need to be executed, along with other information that spans across tasks, such as foreign keys between data sources. A task represents the component(s) of a data source and its associated metadata so that it understands what the data should look like and how many steps (sub data sources) there are (e.g. tables in a database, topics in Kafka). Tasks can define one or more steps.
Plan
Foreign Keys
Define foreign keys across data sources in your plan to ensure generated data can match. The plan links each foreign key to its associated tasks.
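As a rough sketch, a plan file can look like the following. The key names (`tasks`, `dataSourceName`, `sinkOptions.foreignKeys`) and the foreign key notation are illustrative assumptions, so check the sample plans for the exact structure:

```yaml
# Hypothetical plan: key names and foreign key notation are illustrative
name: "account_create_plan"
description: "Create accounts in Postgres and matching transactions in a CSV file"
tasks:
  # Each entry points to a task (defined in its own task file) and the data source it runs against
  - name: "postgres_account_task"
    dataSourceName: "my_postgres"
  - name: "csv_transaction_task"
    dataSourceName: "my_csv"
sinkOptions:
  foreignKeys:
    # Source field -> fields in other data sources that should reuse the same generated values
    "my_postgres.accounts.account_id":
      - "my_csv.transactions.account_id"
```

With a foreign key like this, the account_id values generated for the Postgres table are reused for the CSV records, so the two generated data sets can be matched.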
Task
Data Source Type | Data Source | Sample Task | Notes |
---|---|---|---|
Database | Postgres | Sample | |
Database | MySQL | Sample | |
Database | Cassandra | Sample | |
File | CSV | Sample | |
File | JSON | Sample | Contains nested schemas and use of SQL for generated values |
File | Parquet | Sample | Partitioned by year column |
Messaging System | Kafka | Sample | Specific base schema to be used, define headers, key, value, etc. |
Messaging System | Solace | Sample | JSON formatted message |
HTTP | PUT | Sample | JSON formatted PUT body |
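A task file then describes the steps and schema for one data source. The sketch below uses a CSV step; the option and field names are illustrative assumptions, so refer to the sample tasks above for the exact syntax:

```yaml
# Hypothetical task: option and schema field names are illustrative
name: "csv_transaction_task"
steps:
  # Each step maps to one sub data source, e.g. a table, topic or file path
  - name: "transactions"
    type: "csv"
    options:
      path: "/tmp/data/transactions"
    schema:
      fields:
        - name: "account_id"
          type: "string"
        - name: "amount"
          type: "double"
        - name: "created_date"
          type: "date"
```

A task can define more than one step, for example one step per table when generating data for several tables in the same database.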
Configuration
Docker-compose
To see how it runs against different data sources, you can run it using docker-compose and set the DATA_SOURCE environment variable as shown below:
./gradlew build
cd docker
DATA_SOURCE=postgres docker-compose up -d datacaterer
You can set DATA_SOURCE to one of the following:
- postgres
- mysql
- cassandra
- solace
- kafka
- http