# Roadmap
The items below summarise the roadmap of Data Caterer. As each task is completed, it will be documented and linked here.
Feature | Description | Sub Tasks |
---|---|---|
Data source support | Batch or real-time data sources that can be added to Data Caterer, prioritised by the data sources users request | - AWS, GCP and Azure data services (cloud storage)<br>- Delta Lake<br>- Iceberg<br>- RabbitMQ<br>- ActiveMQ<br>- MongoDB<br>- Elasticsearch<br>- Snowflake<br>- Databricks<br>- Pulsar |
Metadata discovery | Allow for schema and data profiling from external metadata sources | - HTTP (OpenAPI spec)<br>- JMS<br>- Read from samples<br>- OpenLineage metadata (Marquez)<br>- OpenMetadata<br>- Open Data Contract Standard (ODCS)<br>- Amundsen<br>- Datahub<br>- Solace Event Portal<br>- Airflow<br>- DBT<br>- Manually insert a CREATE TABLE statement from the UI |
Developer API | Scala/Java interface for developers/testers to create data generation and validation tasks (see the builder sketch below the table) | - Scala<br>- Java |
Report generation | Generate a report that summarises the data generation or validation results | - Report covering generated data and validation rules |
UI portal | Provide a UI where users can define data generation or validation tasks and view report results | - Base UI to create, edit and delete plans, connections and history<br>- Run on Mac, Linux and Windows<br>- Metadata stored in a database<br>- Store data generation/validation run information in file/database<br>- Preview of generated data<br>- Additional dialogs to confirm deleting and executing a plan |
Integration with data validation tools | Derive data validation rules from existing data validation tools | - Great Expectations<br>- DBT constraints<br>- SodaCL<br>- MonteCarlo<br>- OpenMetadata |
Data validation rule suggestions | Based on metadata, generate data validation rules appropriate for the dataset | - Suggest basic data validations (yet to be documented) |
Wait conditions before data validation | Define conditions to be met before starting data validations (see the wait-condition sketch below the table) | - Webhook<br>- File exists<br>- Data exists via SQL expression<br>- Pause |
Validation types | Ability to define simple and complex data validations (see the aggregate-validation sketch below the table) | - Basic validations<br>- Aggregates (e.g. sum of amount per account is > 500)<br>- Ordering (e.g. transactions are ordered by date)<br>- Relationship (e.g. at least one account entry in the history table per account in the accounts table)<br>- Data profile (how close the generated data profile is to the expected data profile)<br>- Column names (check column count, column names and ordering)<br>- Pre-conditions before validating data |
Data generation record count | Generate scenarios with one-to-many and many-to-many record-count relationships. Also cover all edge cases or scenarios (see the combination-coverage sketch below the table) | - Cover all possible cases (i.e. a record for each combination of oneOf values, positive/negative values, pairwise, etc.)<br>- Ability to override edge cases |
Alerting | Define alerts, based on certain conditions, that fire when tasks complete | - Slack |
Metadata enhancements | Enrich existing metadata based on data profiling or inference | - PII detection (can integrate with Presidio)<br>- Relationship detection across data sources<br>- SQL generation<br>- Ordering information |
Data cleanup | Ability to clean up generated data | - Clean up generated data<br>- Clean up data in consumer data sinks<br>- Clean up data from real-time sources (e.g. DELETE HTTP endpoint, delete events in JMS) |
Trial version | Trial version of the full app for users to test out all the features | - Trial app to try out all features |
Code generation | Generate data generation and validation code from metadata or existing classes | - Code generation<br>- Schema generation from Scala/Java class |
Real-time response data validations | Ability to define data validations based on the response from real-time data sources (e.g. HTTP response; see the response-validation sketch below the table) | - HTTP response data validation |
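
To make the Developer API item concrete, below is a minimal, self-contained Scala sketch of what a builder-style interface for defining a data generation task with an attached validation could look like. All names here (`DataTask`, `field`, `validation`, the generator expressions) are illustrative assumptions for this sketch, not the final Data Caterer API.

```scala
// Toy builder for a data generation task with validations attached.
// Everything here is an illustrative assumption, not the real API.
final case class DataTask(
  name: String,
  fields: Map[String, String] = Map.empty, // field name -> generator expression
  validations: Seq[String] = Seq.empty     // SQL-like validation expressions
) {
  def field(fieldName: String, generator: String): DataTask =
    copy(fields = fields + (fieldName -> generator))

  def validation(rule: String): DataTask =
    copy(validations = validations :+ rule)
}

object DeveloperApiExample {
  def main(args: Array[String]): Unit = {
    // Define an "accounts" task with two generated fields and one validation.
    val accounts = DataTask("accounts")
      .field("account_id", "regex:ACC[0-9]{8}")
      .field("amount", "range:1..1000")
      .validation("amount > 0")
    println(accounts)
  }
}
```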
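
The "File exists" and "Pause" wait conditions can be sketched as a simple polling loop. The polling interval, timeout and file path below are illustrative assumptions.

```scala
import java.nio.file.{Files, Paths}

// Sketch of two wait conditions: "file exists" polling plus a "pause"
// between checks. Interval, timeout and path are illustrative assumptions.
object WaitConditions {
  // Block until the file at `path` exists, checking every `intervalMs`
  // and giving up after `timeoutMs`. Returns true if the file appeared.
  def waitForFile(path: String, intervalMs: Long, timeoutMs: Long): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (!Files.exists(Paths.get(path))) {
      if (System.currentTimeMillis() > deadline) return false
      Thread.sleep(intervalMs) // the "pause" condition is a fixed sleep
    }
    true
  }

  def main(args: Array[String]): Unit = {
    val ready = waitForFile("/tmp/data/_SUCCESS", intervalMs = 500, timeoutMs = 10000)
    println(s"Data ready for validation: $ready")
  }
}
```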
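
The aggregate validation example from the table ("sum of amount per account is > 500") amounts to a group-by, a sum and a threshold check. The record type and threshold below are illustrative assumptions.

```scala
// Sketch of the "aggregates" validation type: check that the sum of
// `amount` per account exceeds a threshold. Types/values are assumptions.
object AggregateValidation {
  final case class Transaction(accountId: String, amount: Double)

  // Returns a pass/fail result per account.
  def sumPerAccountExceeds(txns: Seq[Transaction], threshold: Double): Map[String, Boolean] =
    txns
      .groupBy(_.accountId)                       // aggregate per account
      .view
      .mapValues(_.map(_.amount).sum > threshold) // apply the rule per group
      .toMap

  def main(args: Array[String]): Unit = {
    val txns = Seq(
      Transaction("ACC1", 300.0), Transaction("ACC1", 250.0), // 550 -> pass
      Transaction("ACC2", 100.0)                              // 100 -> fail
    )
    println(sumPerAccountExceeds(txns, threshold = 500.0))
    // Map(ACC1 -> true, ACC2 -> false)
  }
}
```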
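
"Cover all possible cases" for oneOf values means emitting one record per combination, i.e. a cartesian product over the candidate values of each field. The field names and values below are illustrative assumptions.

```scala
// Sketch of "cover all possible cases": one record per combination of
// oneOf values. Field names and candidate values are assumptions.
object CombinationCoverage {
  // Cartesian product over the candidate values of each field.
  def allCombinations(oneOfFields: Map[String, Seq[String]]): Seq[Map[String, String]] =
    oneOfFields.foldLeft(Seq(Map.empty[String, String])) {
      case (partials, (fieldName, values)) =>
        for {
          partial <- partials
          value   <- values
        } yield partial + (fieldName -> value)
    }

  def main(args: Array[String]): Unit = {
    val fields = Map(
      "status"   -> Seq("open", "closed"),
      "currency" -> Seq("AUD", "USD", "EUR")
    )
    allCombinations(fields).foreach(println) // 2 * 3 = 6 records
  }
}
```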
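
A real-time response validation reduces to asserting on the fields of the response once it arrives. The HTTP status code and expected body value below are illustrative assumptions.

```scala
// Sketch of an HTTP response data validation: assert on the status code
// and a body field. Expected values are illustrative assumptions.
object HttpResponseValidation {
  final case class HttpResponse(status: Int, body: Map[String, String])

  def validate(resp: HttpResponse): Boolean =
    resp.status == 200 && resp.body.get("result").contains("success")

  def main(args: Array[String]): Unit = {
    println(validate(HttpResponse(200, Map("result" -> "success")))) // true
    println(validate(HttpResponse(500, Map.empty)))                  // false
  }
}
```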