

Below is a list of guides you can follow to create your first Data Catering job for your use case.

Data Sources


  • Cassandra - Generate/validate data for Cassandra
  • MySQL - Generate/validate data for MySQL
  • Postgres - Generate/validate data for Postgres


  • CSV - Generate/validate data for CSV
  • Delta Lake - Generate/validate data for Delta Lake
  • Iceberg - Generate/validate data for Iceberg tables
  • JSON - Generate/validate data for JSON
  • ORC - Generate/validate data for ORC
  • Parquet - Generate/validate data for Parquet


  • REST API - Generate data for REST APIs


  • Kafka - Generate data for Kafka topics
  • Solace - Generate data for Solace messages


  • Great Expectations - Use validations from Great Expectations for testing
  • Marquez - Generate data based on metadata in Marquez
  • OpenMetadata - Generate data based on metadata in OpenMetadata


YAML Files

Base Concept

The execution of the data generator is based on the concept of plans and tasks. A plan represents the set of tasks that need to be executed, along with other information that spans tasks, such as foreign keys between data sources.
A task represents the component(s) of a data source and its associated metadata, so that the generator understands what the data should look like and how many steps (sub data sources) there are (e.g. tables in a database, topics in Kafka). A task can define one or more steps.


Foreign Keys

Define foreign keys across data sources in your plan so that generated values match between them.
Link to associated task 1
Link to associated task 2


Data Source Type | Data Source | Sample Task | Notes
---------------- | ----------- | ----------- | -----
Database         | Postgres    | Sample      |
Database         | MySQL       | Sample      |
Database         | Cassandra   | Sample      |
File             | CSV         | Sample      |
File             | JSON        | Sample      | Contains nested schemas and use of SQL for generated values
File             | Parquet     | Sample      | Partition by year column
Messaging System | Kafka       | Sample      | Specific base schema to be used, define headers, key, value, etc.
Messaging System | Solace      | Sample      | JSON formatted message
HTTP             | PUT         | Sample      | JSON formatted PUT body


Basic configuration


To see how it runs against different data sources, build the project and start it with docker-compose, setting DATA_SOURCE to the data source you want to try:

./gradlew build
cd docker
DATA_SOURCE=postgres docker-compose up -d datacaterer

DATA_SOURCE can be set to one of the following:

  • postgres
  • mysql
  • cassandra
  • solace
  • kafka
  • http
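
A mistyped DATA_SOURCE value may be easier to catch before the container starts. The following is a minimal sketch of a wrapper that checks the value against the list above and then runs the same docker-compose command. The function names `is_supported_source` and `run_data_source` are hypothetical helpers written for this illustration and are not part of Data Caterer. The sketch assumes it is sourced from the docker directory after `./gradlew build` has completed.

```shell
#!/usr/bin/env bash
# Hypothetical helper functions (not shipped with Data Caterer).
# Values copied from the list of supported DATA_SOURCE settings above.
SUPPORTED_SOURCES="postgres mysql cassandra solace kafka http"

# Returns 0 if the given name is in the supported list, 1 otherwise.
is_supported_source() {
  local candidate="$1"
  local source
  for source in $SUPPORTED_SOURCES; do
    if [ "$source" = "$candidate" ]; then
      return 0
    fi
  done
  return 1
}

# Starts the datacaterer service for one data source.
# Assumes the current directory is the project's docker directory.
run_data_source() {
  local name="$1"
  if ! is_supported_source "$name"; then
    echo "Unsupported DATA_SOURCE: $name" >&2
    return 1
  fi
  DATA_SOURCE="$name" docker-compose up -d datacaterer
}
```

After sourcing the file, `run_data_source kafka` would start the Kafka variant, while an unknown value such as `run_data_source oracle` prints an error and returns without invoking docker-compose.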