Guides
Below is a list of guides you can follow to create your first Data Caterer job for your use case.
Data Sources
Databases
Files
HTTP
- REST API - Generate data for REST APIs
Messaging
Metadata
- Data Contract CLI - Generate data based on metadata in data contract files in Data Contract CLI format
- Great Expectations - Use validations from Great Expectations for testing
- Marquez - Generate data based on metadata in Marquez
- OpenMetadata - Generate data based on metadata in OpenMetadata
- Open Data Contract Standard (ODCS) - Generate data based on metadata in data contract files in ODCS format
Scenarios
- Auto Generate From Data Connection - Automatically generate data by just defining data sources
- Data Generation - Generate production-like data
- Data Validations - Run data validations after generating data
- Delete Generated Data - Delete the generated data whilst leaving other data intact
- First Data Generation - If you are new, this is the place to start
- Foreign Keys Across Data Sources - Generate matching values across generated data sets
- Generate Batch and Event Data - Generate matching batch and event data
- Multiple Records Per Column Value - Generate multiple records per set of column values
YAML Files
Base Concept
The execution of the data generator is based on the concept of plans and tasks. A plan represents the set of tasks that need to be executed, along with other information that spans across tasks, such as foreign keys between data sources. A task represents the component(s) of a data source and its associated metadata so that it understands what the data should look like and how many steps (sub data sources) there are (e.g. tables in a database, topics in Kafka). Tasks can define one or more steps.
Plan
Foreign Keys
Define foreign keys across data sources in your plan to ensure generated data can match. The plan links each foreign key to its associated tasks.
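As a rough sketch, a plan file can look like the following. The key names (`tasks`, `dataSourceName`, `sinkOptions.foreignKeys`) and the foreign key notation are illustrative assumptions, so check the sample plans for the exact structure:

```yaml
# Hypothetical plan: key names and foreign key notation are illustrative
name: "account_create_plan"
description: "Create accounts in Postgres and matching transactions in a CSV file"
tasks:
  # Each entry points to a task (defined in its own task file) and the data source it runs against
  - name: "postgres_account_task"
    dataSourceName: "my_postgres"
  - name: "csv_transaction_task"
    dataSourceName: "my_csv"
sinkOptions:
  foreignKeys:
    # Source field -> fields in other data sources that should reuse the same generated values
    "my_postgres.accounts.account_id":
      - "my_csv.transactions.account_id"
```

With a foreign key like this, the account_id values generated for the Postgres table are reused for the CSV records, so the two generated data sets can be matched.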
Task
Data Source Type | Data Source | Sample Task | Notes |
---|---|---|---|
Database | Postgres | Sample | |
Database | MySQL | Sample | |
Database | Cassandra | Sample | |
File | CSV | Sample | |
File | JSON | Sample | Contains nested schemas and use of SQL for generated values |
File | Parquet | Sample | Partitioned by year column |
Messaging System | Kafka | Sample | Specific base schema to be used, define headers, key, value, etc. |
Messaging System | Solace | Sample | JSON formatted message |
HTTP | PUT | Sample | JSON formatted PUT body |
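A task file then describes the steps and schema for one data source. The sketch below uses a CSV step; the option and field names are illustrative assumptions, so refer to the sample tasks above for the exact syntax:

```yaml
# Hypothetical task: option and schema field names are illustrative
name: "csv_transaction_task"
steps:
  # Each step maps to one sub data source, e.g. a table, topic or file path
  - name: "transactions"
    type: "csv"
    options:
      path: "/tmp/data/transactions"
    schema:
      fields:
        - name: "account_id"
          type: "string"
        - name: "amount"
          type: "double"
        - name: "created_date"
          type: "date"
```

A task can define more than one step, for example one step per table when generating data for several tables in the same database.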
Configuration
Docker-compose
To see how it runs against different data sources, you can run it using docker-compose and set the DATA_SOURCE environment variable as shown below:
./gradlew build
cd docker
DATA_SOURCE=postgres docker-compose up -d datacaterer
You can set DATA_SOURCE to one of the following:
- postgres
- mysql
- cassandra
- solace
- kafka
- http