Comparison to similar tools
I have tried to include all the companies found in the list from the Mostly AI blog post, using only information that is publicly available.
Companies/products not shown below were left out for at least one of these reasons:
- their website has insufficient information about the technology side of data generation/validation
- they have little or no documentation
- they do not have a free, no sign-up version of their app to use
Data Generation
Tool | Description | Cost | Pros | Cons |
---|---|---|---|---|
Data Catering | Scala based data generation and validation tool via metadata | Free (Open Source); Sponsorship model for support or additional features; Starts at $100 per month | Data generation and validation; Batch and event generation; Maintain referential integrity; Scala/Java SDK; Customisable scenarios and validations; Open source; Metadata driven; Report generation; Use validation rules from existing tools; Data clean up; UI; Alerting | No load testing metrics; No validation of real time data sources |
Clearbox AI | Python based data generation tool via ML | Unclear | Python SDK; UI interface; Detect private data; Report generation | Batch data only; No data clean up; Limited/no documentation |
Curiosity Software | Platform solution for test data management | Unclear | Extensive documentation; Generate data based off test cases; UI interface; Web/API/UI/mobile testing | No quick start; No SDK; Many components that may not be required; No event generation support |
DataCebo Synthetic Data Vault | Python based data generation tool via ML | Unclear | Python SDK; Report generation; Data quality checks; Business logic constraints | No data connection support; No data clean up; No foreign key support |
Datafaker | Realistic data generation library | Free | SDK for many languages; Simple, easy to use; Extensible; Open source; Generate realistic values | No data connection support; No data clean up; No validation; No foreign key support |
DBLDatagen | Python based data generation tool | Free | Python SDK; Open source; Good documentation; Customisable scenarios; Customisable column generation; Generate from existing data/schemas; Plugin third-party libraries | Limited support if issues; Code required; No data clean up; No data validation |
Gatling | HTTP API load testing tool | Free (Open Source); Gatling Enterprise, usage based, starts from €89 per month, 1 user, 6.25 hours of testing | Kotlin, Java & Scala SDK; Widely used; Open source; Clear documentation; Extensive testing/validation support; Customisable scenarios; Report generation | Only supports HTTP, JMS and JDBC; No data clean up; Data feeders not based off metadata |
Gretel | Python based data generation tool via ML | Usage based, starts from $295 per month, $2.20 per credit, assumed USD | CLI & Python SDK; UI interface; Training and re-use of models; Detect private data; Customisable scenarios | Batch data only; No relationships between data sources; Only simple foreign key relations defined; No data clean up; Charge by usage |
Howso | Python based data generation tool via ML | Unclear | Python SDK; Playground to try; Open source library; Customisable scenarios | No support for data sources; No data validation; No data clean up |
Mostly AI | Python based data generation tool via ML | Usage based, Enterprise 1 user, 100 columns, 100K rows $3,100 per month, assumed USD | Report generation; Non-technical users can use UI; Customisable scenarios | Charge by usage; Batch data only; No data clean up; Confusing use of 'smart select' for multiple foreign keys; Limited custom column generation logic; Multiple deployment components; No SDK |
Octopize | Python based data generation tool via ML | Unclear | Python & R SDK; Report generation; API for metadata; Customisable scenarios | Input data source is only CSV; Multiple manual steps before starting; Quickstart is not a quickstart; Documentation lacks code examples |
Synthesized | Python based data generation tool via ML | Unclear | CLI & Python SDK; API for metadata; IDE setup; Data quality checks | Not sure what is SDK & TDK; Charge by usage; No report of what was generated; No relationships between data sources |
Tonic | Platform solution for generating data | Unclear | UI interface; Good documentation; Detect private data; Support for encrypted columns; Report generation; Alerting | Batch data only; Multiple deployment components; No relationships between data sources; No data validation; No data clean up; No SDK (only API); Difficult to embed complex business logic |
YData | Python based data generation tool via ML. Platform solution as well | Unclear | Python SDK; Open source; Detect private data; Compare datasets; Report generation | No data connection support; Batch data only; No data clean up; Separate data generation and data validation; No foreign key support |
Use of ML models
You may notice that the majority of data generators use machine learning (ML) models to learn from your existing datasets to generate new data. Below are some pros and cons of this approach.
Pros
- Simple setup
- Ability to reproduce complex logic
- Flexible enough to accept all types of data
Cons
- Model training can take a long time
- Generation logic is a black box
- ML models need to be maintained, stored and updated
- Requires a connection to production data
- Potential to leak PII data from production
- Restrictions on input data lengths
- May not maintain referential integrity
- Requires a deeper understanding of ML models for fine-tuning
- Accuracy may be worse than non-ML models
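
To make the referential integrity point in the list above concrete, here is a minimal Python sketch. It is not taken from any of the tools listed: the table names (`accounts`, `transactions`), the helper functions and the `ACC00000` key format are all hypothetical. The "independent" generator is only a stand-in for a generator that learns each table in isolation; it is not an actual ML model.

```python
import random


def generate_accounts(rng, n):
    """Generate parent rows, each with a unique primary key (hypothetical schema)."""
    return [
        {"account_id": f"ACC{i:05d}", "balance": round(rng.uniform(0, 1000), 2)}
        for i in range(n)
    ]


def generate_transactions_independent(rng, n):
    """Child rows whose foreign key is sampled with no knowledge of the parent table.

    Assumption: this mimics a generator that only knows the key *format*,
    not which keys actually exist in the parent table.
    """
    return [
        {"txn_id": i, "account_id": f"ACC{rng.randrange(100000):05d}"}
        for i in range(n)
    ]


def generate_transactions_linked(rng, n, accounts):
    """Child rows whose foreign key is drawn from the generated parent keys."""
    keys = [a["account_id"] for a in accounts]
    return [{"txn_id": i, "account_id": rng.choice(keys)} for i in range(n)]


def count_orphans(children, parents, key="account_id"):
    """Count child rows whose foreign key has no matching parent row."""
    parent_keys = {p[key] for p in parents}
    return sum(1 for c in children if c[key] not in parent_keys)


rng = random.Random(42)  # fixed seed so the run is repeatable
accounts = generate_accounts(rng, 100)
independent = generate_transactions_independent(rng, 1000)
linked = generate_transactions_linked(rng, 1000, accounts)

print("orphans (independent):", count_orphans(independent, accounts))
print("orphans (linked):", count_orphans(linked, accounts))
```

In this sketch the linked generator can never produce an orphaned transaction, because every foreign key is taken from the generated parent rows. The independent generator almost always does, since only 100 of the 100,000 possible keys exist. A check like `count_orphans` is one simple way to test the output of any generator, ML-based or not.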