Roadmap

Items below summarise the roadmap of Data Caterer. As each task gets completed, it will be documented and linked.

Feature	Description	Sub Tasks
Data source support	Batch or real-time data sources that can be added to Data Caterer. Support the data sources that users want	- AWS, GCP and Azure related data services (cloud storage) - Delta Lake - Iceberg - RabbitMQ - ActiveMQ - MongoDB - Elasticsearch - Snowflake - Databricks - Pulsar
Metadata discovery	Allow for schema and data profiling from external metadata sources	- HTTP (OpenAPI spec) - JMS - Read from samples - OpenLineage metadata (Marquez) - OpenMetadata - Open Data Contract Standard (ODCS) - Amundsen - DataHub - Solace Event Portal - Airflow - dbt - Manually insert create table statement from UI
Developer API	Scala/Java interface for developers/testers to create data generation and validation tasks	- Scala - Java
Report generation	Generate a report that summarises the data generation or validation results	- Report for data generated and validation rules
UI portal	Allow users to access a UI to input data generation or validation tasks and to view report results	- Base UI with create, edit and delete plan, connections and history - Run on Mac, Linux and Windows - Metadata stored in database - Store data generation/validation run information in file/database - Preview of generated data - Additional dialog to confirm delete and execute plan
Integration with data validation tools	Derive data validation rules from existing data validation tools	- Great Expectations - dbt constraints - SodaCL - Monte Carlo - OpenMetadata
Data validation rule suggestions	Based on metadata, generate data validation rules appropriate for the dataset	- Suggest basic data validations (yet to document)
Wait conditions before data validation	Define certain conditions to be met before starting data validations	- Webhook - File exists - Data exists via SQL expression - Pause
Validation types	Ability to define simple/complex data validations	- Basic validations - Aggregates (e.g. sum of amount per account is > 500) - Ordering (e.g. transactions are ordered by date) - Relationship (e.g. at least one account entry in history table per account in accounts table) - Data profile (how close the generated data profile is to the expected data profile) - Column name (check column count, column names, ordering) - Pre-conditions before validating data
Data generation record count	Generate scenarios with one-to-many and many-to-many relationships in record counts. Also the ability to cover all edge cases or scenarios	- Cover all possible cases (e.g. record for each combination of oneOf values, positive/negative values, pairwise etc.) - Ability to override edge cases
Alerting	Ability to define alerts, based on certain conditions, that are sent when tasks have completed	- Slack - Email
Metadata enhancements	Enrich existing metadata based on data profiling or inference	- PII detection (can integrate with Presidio) - Relationship detection across data sources - SQL generation - Ordering information
Data cleanup	Ability to clean up generated data	- Clean up generated data - Clean up data in consumer data sinks - Clean up data from real-time sources (e.g. DELETE HTTP endpoint, delete events in JMS)
Trial version	Trial version of the full app for users to test out all the features	- Trial app to try out all features
Code generation	Based on metadata or existing classes, code for data generation and validation could be generated	- Code generation - Schema generation from Scala/Java class
Real-time response data validations	Ability to define data validations based on the response from real-time data sources (e.g. HTTP response)	- HTTP response data validation
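
To make the "Validation types" row more concrete, the aggregate, ordering and relationship examples given there can be sketched in plain Python. This is only an illustration of what each check means; it is not Data Caterer's API, and the function names, field names and sample records are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample data; table and field names are illustrative only.
transactions = [
    {"account_id": "A1", "amount": 300.0, "date": date(2024, 1, 1)},
    {"account_id": "A1", "amount": 250.0, "date": date(2024, 1, 2)},
    {"account_id": "A2", "amount": 900.0, "date": date(2024, 1, 3)},
]
accounts = [{"account_id": "A1"}, {"account_id": "A2"}]
history = [{"account_id": "A1"}, {"account_id": "A2"}, {"account_id": "A2"}]


def aggregate_sum_above(rows, key, field, threshold):
    """Aggregate check: the sum of `field` per `key` is greater than `threshold`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[field]
    return all(total > threshold for total in totals.values())


def is_ordered_by(rows, field):
    """Ordering check: rows appear in non-decreasing order of `field`."""
    values = [row[field] for row in rows]
    return all(a <= b for a, b in zip(values, values[1:]))


def has_related_entry(parents, children, key):
    """Relationship check: every parent key appears at least once in `children`."""
    child_keys = {row[key] for row in children}
    return all(row[key] in child_keys for row in parents)


# Run the three checks against the sample data.
results = {
    "sum of amount per account > 500": aggregate_sum_above(
        transactions, "account_id", "amount", 500
    ),
    "transactions ordered by date": is_ordered_by(transactions, "date"),
    "history entry per account": has_related_entry(accounts, history, "account_id"),
}

for name, passed in results.items():
    print(f"{name}: {'passed' if passed else 'failed'}")
```

Each check returns a single boolean for the whole dataset, which keeps the sketch short; a real validation tool would more likely report which records or groups failed.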