
# Roadmap

The items below summarise the Data Caterer roadmap. As each task is completed, it will be documented and linked here.

| Feature | Description | Sub Tasks |
| --- | --- | --- |
| Data source support | Batch or real-time data sources that can be added to Data Caterer, supporting the data sources that users want | - ✅ AWS, GCP and Azure related data services (cloud storage)<br>- ✅ Delta Lake<br>- ✅ Iceberg<br>- ✅ Hudi<br>- ✅ RabbitMQ<br>- ✅ Solace<br>- ✅ BigQuery<br>- ActiveMQ<br>- MongoDB<br>- Elasticsearch<br>- Snowflake<br>- Databricks<br>- Pulsar |
| Metadata discovery | Allow schema and data profiling from external metadata sources | - ✅ HTTP (OpenAPI spec)<br>- ✅ JSON Schema<br>- ✅ YAML configurations<br>- ✅ OpenLineage metadata (Marquez)<br>- ✅ OpenMetadata<br>- ✅ Open Data Contract Standard (ODCS)<br>- ✅ Data Contract CLI<br>- ✅ Confluent Schema Registry<br>- Amundsen<br>- DataHub<br>- Solace Event Portal<br>- Airflow<br>- dbt<br>- Manually insert create table statement from UI |
| Developer API | Scala/Java interface for developers/testers to create data generation and validation tasks | - ✅ Scala<br>- ✅ Java<br>- ✅ YAML<br>- Python<br>- JavaScript |
| Report generation | Generate a report that summarises the data generation or validation results | - ✅ Report for data generated and validation rules |
| UI portal | Allow users to access a UI to input data generation or validation tasks, and view report results | - ✅ Base UI with create, edit and delete for plans, connections and history<br>- ✅ Run on Mac, Linux and Windows<br>- ✅ User authentication and usage tracking<br>- ✅ Store data generation/validation run information in file/database<br>- ✅ Preview of generated data via sample endpoints<br>- Metadata stored in database<br>- Additional dialog to confirm delete and execute plan |
| Integration with data validation tools | Derive data validation rules from existing data validation tools | - ✅ Great Expectations<br>- ✅ OpenMetadata<br>- dbt constraints<br>- SodaCL<br>- MonteCarlo |
| Data validation rule suggestions | Based on metadata, generate data validation rules appropriate for the dataset | - ✅ Suggest basic data validations (yet to document) |
| Wait conditions before data validation | Define certain conditions to be met before starting data validations | - ✅ Webhook<br>- ✅ File exists<br>- ✅ Data exists via SQL expression<br>- ✅ Pause |
| Validation types | Ability to define simple/complex data validations | - ✅ Basic validations<br>- ✅ Aggregates (sum of amount per account is > 500)<br>- ✅ Relationship (at least one account entry in the history table per account in the accounts table)<br>- ✅ Field name (check field count, field names, ordering)<br>- ✅ Pre-conditions before validating data<br>- Ordering (transactions are ordered by date)<br>- Data profile (how close the generated data profile is to the expected data profile) |
| Data generation features | Advanced data generation capabilities for realistic test data | - ✅ Custom transformations (per-record and whole-file)<br>- ✅ Distribution-based generation (normal, exponential)<br>- ✅ Weighted value selection<br>- ✅ Reference mode for foreign keys<br>- ✅ Field filtering (include/exclude patterns)<br>- ✅ Cover all possible cases (i.e. a record for each combination of oneOf values, positive/negative values, pairwise, etc.)<br>- Ability to override edge cases |
| Performance optimization | Features to improve data generation speed and efficiency | - ✅ Fast regex generation (SQL-based, ~5-6x faster)<br>- ✅ Unique value optimization with Bloom filters<br>- ✅ Fast generation mode (automatic optimizations)<br>- ✅ Performance testing infrastructure<br>- ✅ HTTP rate limiting |
| Alerting | When tasks have completed, ability to define alerts based on certain conditions | - ✅ Slack<br>- Email |
| Metadata enhancements | Based on data profiling or inference, can add to existing metadata | - PII detection (can integrate with Presidio)<br>- Relationship detection across data sources<br>- SQL generation<br>- Ordering information |
| Data cleanup | Ability to clean up generated data | - ✅ Clean up generated data<br>- ✅ Clean up data in consumer data sinks<br>- Clean up data from real-time sources (i.e. DELETE HTTP endpoint, delete events in JMS) |
| Trial version | Trial version of the full app for users to test out all the features | - ✅ Trial app to try out all features |
| Code generation | Based on metadata or existing classes, generate code for data generation and validation | - Code generation<br>- Schema generation from Scala/Java class |
| Real-time response data validations | Ability to define data validations based on the response from real-time data sources (e.g. HTTP response) | - ✅ HTTP response data validation |
| Infrastructure & CI/CD | Infrastructure improvements for development and deployment | - ✅ Pre-plan and post-plan processors<br>- ✅ Benchmark results tracking |
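The aggregate and relationship validations listed under "Validation types" can be pictured as plain SQL checks over the dataset. The sketch below is only an illustration of those semantics, not Data Caterer's API: the `accounts` and `history` tables, their columns, and the sample data are hypothetical, using an in-memory SQLite database.

```python
import sqlite3

# Hypothetical tables and data for illustration only (not Data Caterer's API).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (account_id TEXT);
CREATE TABLE history (account_id TEXT, amount REAL);
INSERT INTO accounts VALUES ('ACC1'), ('ACC2'), ('ACC3');
INSERT INTO history VALUES ('ACC1', 300), ('ACC1', 400), ('ACC2', 200);
""")

# Aggregate validation: sum of amount per account must be > 500.
# Any row returned is a failing account.
agg_failures = conn.execute("""
    SELECT account_id, SUM(amount) AS total
    FROM history
    GROUP BY account_id
    HAVING total <= 500
""").fetchall()

# Relationship validation: at least one history entry per account.
# A left anti-join finds accounts with no matching history record.
rel_failures = conn.execute("""
    SELECT a.account_id
    FROM accounts a
    LEFT JOIN history h ON a.account_id = h.account_id
    WHERE h.account_id IS NULL
""").fetchall()

print(agg_failures)  # ACC2's total (200) fails the > 500 rule
print(rel_failures)  # ACC3 has no history entries
```

A validation passes when its failure query returns no rows, which is the general pattern behind expressing data quality rules as SQL.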