Comparison to similar tools

I have tried to include all the companies found in the list from the Mostly AI blog post, using only information that is publicly available.

The companies/products not shown below have at least one of the following:

a website with insufficient information about the technical side of data generation/validation
little or no documentation
no free, no-sign-up version of their app to try

Data Generation

Tool	Description	Cost	Pros	Cons
Data Catering	Scala-based data generation and validation tool via metadata	Free (Open Source); Sponsorship model for support or additional features	Data generation and validation; Batch and event generation; Maintain referential integrity; Scala/Java SDK; Customisable scenarios and validations; Open source; Metadata driven; Report generation; Use validation rules from existing tools; Data clean up; UI; Alerting	No load testing metrics; No validation of real time data sources
Clearbox AI	Python-based data generation tool via ML	Unclear	Python SDK; UI; Detect private data; Report generation	Batch data only; No data clean up; Limited/no documentation
Curiosity Software	Platform solution for test data management	Unclear	Extensive documentation; Generate data based off test cases; UI; Web/API/UI/mobile testing	No quick start; No SDK; Many components that may not be required; No event generation support
DataCebo Synthetic Data Vault	Python-based data generation tool via ML	Unclear	Python SDK; Report generation; Data quality checks; Business logic constraints	No data connection support; No data clean up; No foreign key support
Datafaker	Realistic data generation library	Free	SDK for many languages; Simple, easy to use; Extensible; Open source; Generate realistic values	No data connection support; No data clean up; No validation; No foreign key support
DBLDatagen	Python-based data generation tool	Free	Python SDK; Open source; Good documentation; Customisable scenarios; Customisable column generation; Generate from existing data/schemas; Plug in third-party libraries	Limited support if issues arise; Code required; No data clean up; No data validation
Gatling	HTTP API load testing tool	Free (Open Source); Gatling Enterprise, usage based, starts from €89 per month, 1 user, 6.25 hours of testing	Kotlin, Java & Scala SDK; Widely used; Open source; Clear documentation; Extensive testing/validation support; Customisable scenarios; Report generation	Only supports HTTP, JMS and JDBC; No data clean up; Data feeders not based off metadata
Gretel	Python-based data generation tool via ML	Usage based, starts from $295 per month, $2.20 per credit, assumed USD	CLI & Python SDK; UI; Training and re-use of models; Detect private data; Customisable scenarios	Batch data only; No relationships between data sources; Only simple foreign key relations defined; No data clean up; Charge by usage
Howso	Python-based data generation tool via ML	Unclear	Python SDK; Playground to try; Open source library; Customisable scenarios	No support for data sources; No data validation; No data clean up
Mostly AI	Python-based data generation tool via ML	Usage based, Enterprise 1 user, 100 columns, 100K rows $3,100 per month, assumed USD	Report generation; Non-technical users can use UI; Customisable scenarios	Charge by usage; Batch data only; No data clean up; Confusing use of 'smart select' for multiple foreign keys; Limited custom column generation logic; Multiple deployment components; No SDK
Octopize	Python-based data generation tool via ML	Unclear	Python & R SDK; Report generation; API for metadata; Customisable scenarios	Input data source is only CSV; Multiple manual steps before starting; Quickstart is not a quickstart; Documentation lacks code examples
Synthesized	Python-based data generation tool via ML	Unclear	CLI & Python SDK; API for metadata; IDE setup; Data quality checks	Unclear difference between SDK and TDK; Charge by usage; No report of what was generated; No relationships between data sources
Tonic	Platform solution for generating data	Unclear	UI; Good documentation; Detect private data; Support for encrypted columns; Report generation; Alerting	Batch data only; Multiple deployment components; No relationships between data sources; No data validation; No data clean up; No SDK (only API); Difficult to embed complex business logic
YData	Python-based data generation tool via ML; also a platform solution	Unclear	Python SDK; Open source; Detect private data; Compare datasets; Report generation	No data connection support; Batch data only; No data clean up; Separate data generation and data validation; No foreign key support

Use of ML models

You may notice that the majority of data generators use machine learning (ML) models that learn from your existing datasets in order to generate new data. Below are some pros and cons of this approach.

Pros

Simple setup
Ability to reproduce complex logic
Flexible enough to accept all types of data

Cons

Model training can take a long time
Generation logic is a black box
ML models need to be maintained, stored and updated
Requires a connection to production data
Potential to leak PII from production
Restrictions on input data length
May not maintain referential integrity
Requires a deeper understanding of ML models for fine-tuning
Accuracy may be worse than non-ML models
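
To make the referential integrity point concrete, here is a minimal Python sketch. It is not taken from any of the tools above; the record layout and the function names (generate_accounts, generate_transactions, find_orphans) are hypothetical and chosen only for illustration. It shows a rule-based generator that draws foreign keys from the parent records, plus a simple check that flags child records pointing at a parent that does not exist.

```python
import random


def generate_accounts(n, rng):
    """Generate n parent records with unique account IDs (hypothetical schema)."""
    return [
        {"account_id": f"ACC{i:04d}", "balance": round(rng.uniform(0, 1000), 2)}
        for i in range(n)
    ]


def generate_transactions(accounts, n, rng):
    """Generate n child records whose foreign key is drawn from existing parents."""
    ids = [a["account_id"] for a in accounts]
    return [
        {"txn_id": i, "account_id": rng.choice(ids), "amount": round(rng.uniform(1, 100), 2)}
        for i in range(n)
    ]


def find_orphans(parents, children, key):
    """Return child records whose key value has no matching parent record."""
    parent_keys = {p[key] for p in parents}
    return [c for c in children if c[key] not in parent_keys]


rng = random.Random(42)  # fixed seed so the run is repeatable
accounts = generate_accounts(5, rng)
transactions = generate_transactions(accounts, 20, rng)

# Foreign keys are sampled from the parent IDs, so no orphans are expected.
print(find_orphans(accounts, transactions, "account_id"))

# Simulate a generator that emits a key that never existed in the parent data.
broken = transactions + [{"txn_id": 99, "account_id": "ACC9999", "amount": 1.0}]
print(find_orphans(accounts, broken, "account_id"))
```

A check like find_orphans can be run over the output of any generator, ML-based or not, to see whether relationships between datasets survived generation.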