
Roadmap

The items below summarise the Data Caterer roadmap. As each task is completed, it will be documented and linked here.

| Feature | Description | Sub Tasks |
| ------- | ----------- | --------- |
| Data source support | Batch or real-time data sources that can be added to Data Caterer. Support data sources that users want. | AWS, GCP and Azure related data services (✅ cloud storage)<br>✅ Delta Lake<br>✅ Iceberg<br>RabbitMQ<br>ActiveMQ<br>MongoDB<br>Elasticsearch<br>Snowflake<br>Databricks<br>Pulsar |
| Metadata discovery | Allow for schema and data profiling from external metadata sources. | ✅ HTTP (OpenAPI spec)<br>JMS<br>Read from samples<br>✅ OpenLineage metadata (Marquez)<br>✅ OpenMetadata<br>✅ Open Data Contract Standard (ODCS)<br>Amundsen<br>DataHub<br>Solace Event Portal<br>Airflow<br>DBT<br>Manually insert create table statement from UI |
| Developer API | Scala/Java interface for developers/testers to create data generation and validation tasks (see the sketch below the table). | ✅ Scala<br>✅ Java |
| Report generation | Generate a report that summarises the data generation or validation results. | ✅ Report for data generated and validation rules |
| UI portal | Allow users to access a UI to input data generation or validation tasks, and to view report results. | ✅ Base UI with create, edit and delete for plans, connections and history<br>✅ Run on Mac, Linux and Windows<br>Metadata stored in database<br>✅ Store data generation/validation run information in file/database<br>Preview of generated data<br>Additional dialog to confirm delete and execute plan |
| Integration with data validation tools | Derive data validation rules from existing data validation tools. | ✅ Great Expectations<br>DBT constraints<br>SodaCL<br>Monte Carlo<br>✅ OpenMetadata |
| Data validation rule suggestions | Based on metadata, generate data validation rules appropriate for the dataset. | ✅ Suggest basic data validations (yet to be documented) |
| Wait conditions before data validation | Define certain conditions to be met before starting data validations (see the sketch below the table). | ✅ Webhook<br>✅ File exists<br>✅ Data exists via SQL expression<br>✅ Pause |
| Validation types | Ability to define simple/complex data validations (see the sketch below the table). | ✅ Basic validations<br>✅ Aggregates (e.g. sum of amount per account is greater than 500)<br>Ordering (e.g. transactions are ordered by date)<br>✅ Relationship (e.g. at least one account entry in the history table per account in the accounts table)<br>Data profile (how closely the generated data profile matches the expected data profile)<br>✅ Column name (check column count, column names, ordering)<br>✅ Pre-conditions before validating data |
| Data generation record count | Generate scenarios with one-to-many or many-to-many situations relating to record count, with the ability to cover all edge cases or scenarios. | ✅ Cover all possible cases (i.e. a record for each combination of oneOf values, positive/negative values, pairwise, etc.)<br>Ability to override edge cases |
| Alerting | When tasks have completed, ability to define alerts based on certain conditions. | ✅ Slack<br>Email |
| Metadata enhancements | Based on data profiling or inference, add to existing metadata. | PII detection (can integrate with Presidio)<br>Relationship detection across data sources<br>SQL generation<br>Ordering information |
| Data cleanup | Ability to clean up generated data. | ✅ Clean up generated data<br>✅ Clean up data in consumer data sinks<br>Clean up data from real-time sources (i.e. DELETE HTTP endpoint, delete events in JMS) |
| Trial version | Trial version of the full app for users to test out all the features. | ✅ Trial app to try out all features |
| Code generation | Based on metadata or existing classes, code for data generation and validation could be generated. | Code generation<br>Schema generation from Scala/Java class |
| Real time response data validations | Ability to define data validations based on the response from real-time data sources (e.g. HTTP response). | HTTP response data validation |
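
To give a feel for the Developer API row above, here is a minimal Scala sketch of defining a data generation task. It is illustrative only: the package, builder and type names (`PlanRun`, `csv`, `field`, `count`, `DoubleType`) follow the style of the published Scala API but should be treated as assumptions, not exact current signatures.

```scala
import io.github.datacatering.datacaterer.api.PlanRun
import io.github.datacatering.datacaterer.api.model.{DateType, DoubleType}

// Illustrative sketch only: builder names follow the documented Scala API
// style but may differ from the exact current signatures.
class AccountPlan extends PlanRun {
  // Generate 1,000 account records into a headered CSV file
  val accountTask = csv("accounts", "/tmp/data/accounts", Map("header" -> "true"))
    .schema(
      field.name("account_id").regex("ACC[0-9]{8}"),
      field.name("balance").`type`(DoubleType).min(0).max(10000),
      field.name("open_date").`type`(DateType)
    )
    .count(count.records(1000))

  execute(accountTask)
}
```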
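
The Validation types row covers basic, aggregate and relationship checks. The sketch below shows roughly how the aggregate example from the table ("sum of amount per account is greater than 500") could be expressed; the `validations`, `validation.col` and `groupBy` builder names are likewise assumptions modelled on the documented API style.

```scala
import io.github.datacatering.datacaterer.api.PlanRun

// Illustrative sketch: validation builder names are assumptions.
class TransactionValidationPlan extends PlanRun {
  val transactionValidation = csv("transactions", "/tmp/data/transactions", Map("header" -> "true"))
    .validations(
      // Basic validation: every amount is non-negative
      validation.col("amount").greaterThanOrEqual(0),
      // Aggregate validation: sum of amount per account is greater than 500
      validation.groupBy("account_id").sum("amount").greaterThan(500)
    )

  execute(transactionValidation)
}
```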
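
Finally, for the wait conditions row, a validation can be held back until, for example, a marker file exists. A minimal sketch, assuming `validationWait` and `waitCondition` builders in the same style:

```scala
import io.github.datacatering.datacaterer.api.PlanRun

// Illustrative sketch: waitCondition builder names are assumptions.
class WaitForFilePlan extends PlanRun {
  val accountValidation = csv("accounts", "/tmp/data/accounts", Map("header" -> "true"))
    .validations(validation.col("account_id").isNotNull)
    // Pause validation until the upstream job writes its success marker file
    .validationWait(waitCondition.file("/tmp/data/accounts/_SUCCESS"))

  execute(accountValidation)
}
```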