Comparison to similar tools

I have tried to include all the companies listed in the Mostly AI blog post, using only publicly available information.

The companies/products not shown below have at least one of the following:

  • a website with insufficient information about the technology side of data generation/validation
  • little or no documentation
  • no free, no-sign-up version of their app to use

Data Generation

| Tool | Description | Cost | Pros | Cons |
|---|---|---|---|---|
| Data Catering | Scala based data generation and validation tool via metadata | Free (Open Source)<br>Sponsorship model for support or additional features<br>Starts at $100 per month | ✅ Data generation and validation<br>✅ Batch and event generation<br>✅ Maintain referential integrity<br>✅ Scala/Java SDK<br>✅ Customisable scenarios and validations<br>✅ Open source<br>✅ Metadata driven<br>✅ Report generation<br>✅ Use validation rules from existing tools<br>✅ Data clean up<br>✅ UI<br>✅ Alerting | No load testing metrics<br>No validation of real time data sources |
| Clearbox AI | Python based data generation tool via ML | Unclear | ✅ Python SDK<br>✅ UI<br>✅ Detect private data<br>✅ Report generation | Batch data only<br>No data clean up<br>Limited/no documentation |
| Curiosity Software | Platform solution for test data management | Unclear | ✅ Extensive documentation<br>✅ Generate data based on test cases<br>✅ UI<br>✅ Web/API/UI/mobile testing | No quick start<br>No SDK<br>Many components that may not be required<br>No event generation support |
| DataCebo Synthetic Data Vault | Python based data generation tool via ML | Unclear | ✅ Python SDK<br>✅ Report generation<br>✅ Data quality checks<br>✅ Business logic constraints | No data connection support<br>No data clean up<br>No foreign key support |
| Datafaker | Realistic data generation library | Free | ✅ SDK for many languages<br>✅ Simple, easy to use<br>✅ Extensible<br>✅ Open source<br>✅ Generate realistic values | No data connection support<br>No data clean up<br>No validation<br>No foreign key support |
| DBLDatagen | Python based data generation tool | Free | ✅ Python SDK<br>✅ Open source<br>✅ Good documentation<br>✅ Customisable scenarios<br>✅ Customisable column generation<br>✅ Generate from existing data/schemas<br>✅ Plug in third-party libraries | Limited support if issues arise<br>Code required<br>No data clean up<br>No data validation |
| Gatling | HTTP API load testing tool | Free (Open Source)<br>Gatling Enterprise, usage based, starts from €89 per month, 1 user, 6.25 hours of testing | ✅ Kotlin, Java & Scala SDK<br>✅ Widely used<br>✅ Open source<br>✅ Clear documentation<br>✅ Extensive testing/validation support<br>✅ Customisable scenarios<br>✅ Report generation | Only supports HTTP, JMS and JDBC<br>No data clean up<br>Data feeders not based on metadata |
| Gretel | Python based data generation tool via ML | Usage based, starts from $295 per month, $2.20 per credit (assumed USD) | ✅ CLI & Python SDK<br>✅ UI<br>✅ Training and re-use of models<br>✅ Detect private data<br>✅ Customisable scenarios | Batch data only<br>No relationships between data sources<br>Only simple foreign key relations defined<br>No data clean up<br>Charge by usage |
| Howso | Python based data generation tool via ML | Unclear | ✅ Python SDK<br>✅ Playground to try<br>✅ Open source library<br>✅ Customisable scenarios | No support for data sources<br>No data validation<br>No data clean up |
| Mostly AI | Python based data generation tool via ML | Usage based, Enterprise 1 user, 100 columns, 100K rows, $3,100 per month (assumed USD) | ✅ Report generation<br>✅ Non-technical users can use UI<br>✅ Customisable scenarios | Charge by usage<br>Batch data only<br>No data clean up<br>Confusing use of 'smart select' for multiple foreign keys<br>Limited custom column generation logic<br>Multiple deployment components<br>No SDK |
| Octopize | Python based data generation tool via ML | Unclear | ✅ Python & R SDK<br>✅ Report generation<br>✅ API for metadata<br>✅ Customisable scenarios | Input data source is only CSV<br>Multiple manual steps before starting<br>Quickstart is not a quickstart<br>Documentation lacks code examples |
| Synthesized | Python based data generation tool via ML | Unclear | ✅ CLI & Python SDK<br>✅ API for metadata<br>✅ IDE setup<br>✅ Data quality checks | Unclear what SDK & TDK refer to<br>Charge by usage<br>No report of what was generated<br>No relationships between data sources |
| Tonic | Platform solution for generating data | Unclear | ✅ UI<br>✅ Good documentation<br>✅ Detect private data<br>✅ Support for encrypted columns<br>✅ Report generation<br>✅ Alerting | Batch data only<br>Multiple deployment components<br>No relationships between data sources<br>No data validation<br>No data clean up<br>No SDK (only API)<br>Difficult to embed complex business logic |
| YData | Python based data generation tool via ML. Platform solution as well | Unclear | ✅ Python SDK<br>✅ Open source<br>✅ Detect private data<br>✅ Compare datasets<br>✅ Report generation | No data connection support<br>Batch data only<br>No data clean up<br>Separate data generation and data validation<br>No foreign key support |
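
Two criteria come up repeatedly in the comparison: whether a tool is metadata driven and whether it maintains referential integrity (foreign keys) across generated datasets. As a rough illustration of what those two terms mean in practice, the Python sketch below generates two related tables from a small schema description. The `SCHEMA` layout, the column type names, and the `generate_table` and `generate_all` helpers are all hypothetical and use only the standard library; this is not the API of Data Catering or of any other tool listed here.

```python
import random
import string

# Hypothetical metadata: each column declares how its values are produced.
SCHEMA = {
    "accounts": {
        "account_id": {"type": "id", "prefix": "ACC"},
        "name": {"type": "string", "length": 8},
    },
    "transactions": {
        "txn_id": {"type": "id", "prefix": "TXN"},
        # Foreign key: values must come from accounts.account_id.
        "account_id": {"type": "foreign_key", "ref": ("accounts", "account_id")},
        "amount": {"type": "decimal", "min": 1.0, "max": 500.0},
    },
}


def generate_table(name, rows, generated, rng):
    """Generate `rows` records for table `name` using only SCHEMA."""
    records = []
    for i in range(rows):
        record = {}
        for column, spec in SCHEMA[name].items():
            kind = spec["type"]
            if kind == "id":
                record[column] = f"{spec['prefix']}{i:05d}"
            elif kind == "string":
                letters = rng.choices(string.ascii_lowercase, k=spec["length"])
                record[column] = "".join(letters)
            elif kind == "decimal":
                record[column] = round(rng.uniform(spec["min"], spec["max"]), 2)
            elif kind == "foreign_key":
                # Pick the value from an already generated parent row,
                # so every child row points at an existing parent.
                ref_table, ref_column = spec["ref"]
                record[column] = rng.choice(generated[ref_table])[ref_column]
            else:
                raise ValueError(f"Unsupported column type: {kind}")
        records.append(record)
    return records


def generate_all(row_counts, seed=0):
    """Generate every table in `row_counts`; parents must be listed first."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    generated = {}
    for name, rows in row_counts.items():
        generated[name] = generate_table(name, rows, generated, rng)
    return generated


data = generate_all({"accounts": 5, "transactions": 20})
```

Because foreign key values are drawn from rows that already exist, no transaction can reference a missing account. A tool marked "No foreign key support" leaves that bookkeeping to the user.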

Use of ML models

You may notice that the majority of data generators use machine learning (ML) models that learn from your existing datasets in order to generate new data. Below are some pros and cons of this approach.

Pros

  • Simple setup
  • Ability to reproduce complex logic
  • Flexible enough to accept all types of data

Cons

  • Model training can take a long time
  • Generation logic is a black box
  • ML models need to be maintained, stored and updated
  • Requires a connection to production data
  • Potential to leak PII from production
  • Restrictions on input data lengths
  • May not maintain referential integrity
  • Requires a deeper understanding of ML models for fine-tuning
  • Accuracy may be worse than non-ML models
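
Two of these cons can be checked mechanically on any generated dataset, regardless of which tool produced it: broken referential integrity and values copied verbatim from production. The sketch below shows one simple way to do both in plain Python. The function names, column names, and sample rows are made up for illustration, and exact-match comparison is only a crude first check for leakage, not a privacy guarantee.

```python
def orphaned_keys(child_rows, parent_rows, child_key, parent_key):
    """Return child key values that have no matching parent row."""
    parent_values = {row[parent_key] for row in parent_rows}
    child_values = {row[child_key] for row in child_rows}
    return sorted(child_values - parent_values)


def leaked_values(synthetic_rows, production_rows, sensitive_columns):
    """Return (column, value) pairs that appear verbatim in production."""
    leaks = set()
    for column in sensitive_columns:
        production_values = {row[column] for row in production_rows}
        for row in synthetic_rows:
            if row[column] in production_values:
                leaks.add((column, row[column]))
    return sorted(leaks)


# Made-up sample data for illustration only.
production_customers = [
    {"customer_id": "C1", "email": "alice@example.com"},
    {"customer_id": "C2", "email": "bob@example.com"},
]
synthetic_customers = [
    {"customer_id": "S1", "email": "user1@example.org"},
    {"customer_id": "S2", "email": "bob@example.com"},  # copied from production
]
synthetic_orders = [
    {"order_id": "O1", "customer_id": "S1"},
    {"order_id": "O2", "customer_id": "S9"},  # no such synthetic customer
]

print(orphaned_keys(synthetic_orders, synthetic_customers,
                    "customer_id", "customer_id"))
# prints ['S9']
print(leaked_values(synthetic_customers, production_customers, ["email"]))
# prints [('email', 'bob@example.com')]
```

Checks like these are cheap to run after every generation job, which helps offset the black-box nature of an ML model's output.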