
# Roadmap

The items below summarise the Data Caterer roadmap. As each task is completed, it will be documented and linked here.

| Feature | Description | Sub Tasks |
| --- | --- | --- |
| Data source support | Batch or real-time data sources that can be added to Data Caterer, supporting the data sources that users want | - ✅ AWS, GCP and Azure related data services (cloud storage)<br>- ✅ Delta Lake<br>- ✅ Iceberg<br>- ✅ Hudi<br>- ✅ RabbitMQ<br>- ✅ Solace<br>- ✅ BigQuery<br>- ActiveMQ<br>- MongoDB<br>- Elasticsearch<br>- Snowflake<br>- Databricks<br>- Pulsar |
| Metadata discovery | Allow schema and data profiling from external metadata sources | - ✅ HTTP (OpenAPI spec)<br>- ✅ JSON Schema<br>- ✅ YAML configurations<br>- ✅ OpenLineage metadata (Marquez)<br>- ✅ OpenMetadata<br>- ✅ Open Data Contract Standard (ODCS)<br>- ✅ Data Contract CLI<br>- ✅ Confluent Schema Registry<br>- Amundsen<br>- DataHub<br>- Solace Event Portal<br>- Airflow<br>- dbt<br>- Manually insert create table statement from UI |
| Developer API | Scala/Java interface for developers/testers to create data generation and validation tasks | - ✅ Scala<br>- ✅ Java<br>- ✅ YAML<br>- Python<br>- JavaScript |
| Report generation | Generate a report that summarises the data generation or validation results | - ✅ Report for data generated and validation rules |
| UI portal | Allow users to access a UI to input data generation or validation tasks, and view report results | - ✅ Base UI with create, edit and delete for plans, connections and history<br>- ✅ Run on Mac, Linux and Windows<br>- ✅ User authentication and usage tracking<br>- ✅ Store data generation/validation run information in file/database<br>- ✅ Preview of generated data via sample endpoints<br>- Metadata stored in database<br>- Additional dialog to confirm delete and execute plan |
| Integration with data validation tools | Derive data validation rules from existing data validation tools | - ✅ Great Expectations<br>- ✅ OpenMetadata<br>- dbt constraints<br>- SodaCL<br>- MonteCarlo |
| Data validation rule suggestions | Based on metadata, generate data validation rules appropriate for the dataset | - ✅ Suggest basic data validations (yet to document) |
| Wait conditions before data validation | Define certain conditions to be met before starting data validations | - ✅ Webhook<br>- ✅ File exists<br>- ✅ Data exists via SQL expression<br>- ✅ Pause |
| Validation types | Ability to define simple/complex data validations | - ✅ Basic validations<br>- ✅ Aggregates (sum of amount per account is > 500)<br>- ✅ Relationship (at least one account entry in the history table per account in the accounts table)<br>- ✅ Field name (check field count, field names, ordering)<br>- ✅ Pre-conditions before validating data<br>- Ordering (transactions are ordered by date)<br>- Data profile (how close the generated data profile is to the expected data profile) |
| Data generation features | Advanced data generation capabilities for realistic test data | - ✅ Custom transformations (per-record and whole-file)<br>- ✅ Distribution-based generation (normal, exponential)<br>- ✅ Weighted value selection<br>- ✅ Reference mode for foreign keys<br>- ✅ Field filtering (include/exclude patterns)<br>- ✅ Cover all possible cases (i.e. a record for each combination of oneOf values, positive/negative values, pairwise, etc.)<br>- Ability to override edge cases |
| Performance optimization | Features to improve data generation speed and efficiency | - ✅ Fast regex generation (SQL-based, ~5-6x faster)<br>- ✅ Unique value optimization with Bloom filters<br>- ✅ Fast generation mode (automatic optimizations)<br>- ✅ Performance testing infrastructure<br>- ✅ HTTP rate limiting |
| Alerting | When tasks have completed, ability to define alerts based on certain conditions | - ✅ Slack<br>- Email |
| Metadata enhancements | Based on data profiling or inference, can add to existing metadata | - PII detection (can integrate with Presidio)<br>- Relationship detection across data sources<br>- SQL generation<br>- Ordering information |
| Data cleanup | Ability to clean up generated data | - ✅ Clean up generated data<br>- ✅ Clean up data in consumer data sinks<br>- Clean up data from real-time sources (i.e. DELETE HTTP endpoint, delete events in JMS) |
| Trial version | Trial version of the full app for users to test out all the features | - ✅ Trial app to try out all features |
| Code generation | Based on metadata or existing classes, generate code for data generation and validation | - Code generation<br>- Schema generation from Scala/Java class |
| Real-time response data validations | Ability to define data validations based on the response from real-time data sources (e.g. HTTP response) | - ✅ HTTP response data validation |
| Infrastructure & CI/CD | Infrastructure improvements for development and deployment | - ✅ Pre-plan and post-plan processors<br>- ✅ Benchmark results tracking |
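The aggregate and relationship validations listed under "Validation types" can be pictured as plain SQL checks over the dataset. The sketch below is only an illustration of those semantics, not Data Caterer's API: the `accounts` and `history` tables, their columns, and the sample data are hypothetical, using an in-memory SQLite database.

```python
import sqlite3

# Hypothetical tables and data for illustration only (not Data Caterer's API).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (account_id TEXT);
CREATE TABLE history (account_id TEXT, amount REAL);
INSERT INTO accounts VALUES ('ACC1'), ('ACC2'), ('ACC3');
INSERT INTO history VALUES ('ACC1', 300), ('ACC1', 400), ('ACC2', 200);
""")

# Aggregate validation: sum of amount per account must be > 500.
# Any row returned is a failing account.
agg_failures = conn.execute("""
    SELECT account_id, SUM(amount) AS total
    FROM history
    GROUP BY account_id
    HAVING total <= 500
""").fetchall()

# Relationship validation: at least one history entry per account.
# A left anti-join finds accounts with no matching history record.
rel_failures = conn.execute("""
    SELECT a.account_id
    FROM accounts a
    LEFT JOIN history h ON a.account_id = h.account_id
    WHERE h.account_id IS NULL
""").fetchall()

print(agg_failures)  # ACC2's total (200) fails the > 500 rule
print(rel_failures)  # ACC3 has no history entries
```

A validation passes when its failure query returns no rows, which is the general pattern behind expressing data quality rules as SQL.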