0.16.11
Deployed: 10-10-2025
Latest features and fixes for Data Caterer include:
Open Data Contract Standard (ODCS) Enhancements
- Data Validations Support: Now extracts and applies data quality rules from ODCS v3 contracts
    - Automatically converts ODCS quality checks to Data Caterer validations
    - Supports both schema-level and field-level quality definitions
    - Check here for documentation
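For reference, ODCS v3 contracts can declare quality checks at both the schema and field level, which Data Caterer now picks up as validations. The fragment below is illustrative only (table name, rule, and thresholds are made up for the example):

```yaml
schema:
  - name: orders
    quality:                     # schema-level quality check
      - type: sql
        query: SELECT COUNT(*) FROM orders WHERE amount < 0
        mustBe: 0
    properties:
      - name: order_id
        logicalType: string
        quality:                 # field-level quality check
          - type: library
          - rule: duplicateCount
            mustBeLessThan: 1
```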
- Extended Metadata Extraction: Enhanced field-level metadata mapping from ODCS contracts
    - String constraints: `minLength`, `maxLength`, `pattern` (regex), `format`
    - Numeric constraints: `minimum`, `maximum`
    - Examples: Extracted as metadata for documentation/reference
    - Classification: Data classification tags (e.g., `public`, `restricted`)
    - Constraints automatically applied to data generation
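These constraints come from ODCS property definitions. An illustrative contract fragment (field names and values are made up for the example):

```yaml
properties:
  - name: username
    logicalType: string
    logicalTypeOptions:
      minLength: 3
      maxLength: 20
      pattern: "^[a-z0-9_]+$"
    classification: public
    examples:
      - jane_doe
  - name: age
    logicalType: integer
    logicalTypeOptions:
      minimum: 0
      maximum: 120
    classification: restricted
```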
- Relationship Modeling: Added support for ODCS relationship definitions
    - Property-level relationships (`OpenDataContractStandardRelationshipV3`)
    - Schema-level relationships (`OpenDataContractStandardRelationshipSchemaLevelV3`)
    - Support for foreign key and reference relationships
Single File Output for File Data Sources
- Automatic Part File Consolidation: When the path includes a file extension (e.g., `output.json`, `data.csv`), Spark part files are automatically consolidated into a single file
    - Supports JSON, CSV, Parquet, ORC, XML, and TXT formats
    - Intelligent header handling for CSV files (preserves the header from the first part, skips headers from subsequent parts)
    - Automatic cleanup of temporary Spark directories
    - Uses `coalesce(1)` for efficient single-partition writes
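The CSV header handling can be sketched as a small pure function: keep every line of the first part file, and drop the first (header) line of each subsequent part. This is an illustrative sketch, not the actual implementation, and it assumes every part file starts with the same header row:

```scala
object CsvConsolidationSketch {
  // parts: lines of each Spark part file, in write order
  def consolidateCsvWithHeaders(parts: Seq[Seq[String]]): Seq[String] =
    parts match {
      case first +: rest =>
        // keep the header from the first part, skip it in all later parts
        first ++ rest.flatMap(_.drop(1))
      case _ => Seq.empty
    }

  def main(args: Array[String]): Unit = {
    val part0 = Seq("id,name", "1,alice")
    val part1 = Seq("id,name", "2,bob")
    println(consolidateCsvWithHeaders(Seq(part0, part1)))
    // List(id,name, 1,alice, 2,bob)
  }
}
```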
- File Suffix Detection: Automatically detects whether a path should produce a single file or a directory
    - Directory output: `/path/to/data` → Spark default behavior with part files
    - Single file output: `/path/to/data.json` → Consolidated single file
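The detection and consolidation steps above can be sketched as follows. Method names mirror the release notes, but the bodies are assumptions for illustration, not the actual Data Caterer implementation:

```scala
import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._

object SingleFileOutputSketch {
  private val singleFileSuffixes = Set("json", "csv", "parquet", "orc", "xml", "txt")

  // "/path/to/data.json" -> Some("json"); "/path/to/data" -> None (directory output)
  def detectFileSuffix(path: String): Option[String] = {
    val name = Paths.get(path).getFileName.toString
    val dot = name.lastIndexOf('.')
    if (dot > 0) Some(name.substring(dot + 1).toLowerCase).filter(singleFileSuffixes)
    else None
  }

  // After df.coalesce(1).write into tempDir, move the single part file to the
  // requested target path and remove the temporary Spark directory.
  def consolidatePartFiles(tempDir: Path, target: Path): Unit = {
    val children = Files.list(tempDir).iterator().asScala.toList
    val part = children
      .find(_.getFileName.toString.startsWith("part-"))
      .getOrElse(throw new IllegalStateException(s"No part file found in $tempDir"))
    Files.move(part, target)
    // clean up _SUCCESS markers and checksum files, then the directory itself
    children.filterNot(_ == part).foreach(p => Files.deleteIfExists(p))
    Files.delete(tempDir)
  }
}
```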
Core Engine Improvements
- `OpenDataContractStandardDataSourceMetadata`:
    - Implement `getDataSourceValidations()` to extract validations from ODCS v3 contracts
    - Parse ODCS YAML and extract quality checks from schema definitions
    - Support both v2 (no validations) and v3 (with validations) contract formats
- `OpenDataContractStandardV3Mapper`:
    - Add `getLogicalTypeOptionsMetadata()` to extract field constraints
    - Map ODCS `logicalTypeOptions` to Data Caterer field metadata
    - Extract examples, classification tags, and relationship information
    - Enhanced metadata provides richer data generation context
- `SinkFactory`:
    - Add `consolidatePartFiles()` method for single file output
    - Add `detectFileSuffix()` to identify when consolidation is needed
    - Add `consolidateCsvWithHeaders()` for proper CSV header handling
    - Add `cleanupDirectory()` for temporary file cleanup
    - Enhanced JSON save handling to support both array unwrapping and single file output
- `OpenDataContractStandardV3Models`:
    - Add `relationships` field to `OpenDataContractStandardSchemaV3`
    - Add `relationships` field to `OpenDataContractStandardElementV3`
    - New case classes: `OpenDataContractStandardRelationshipV3`, `OpenDataContractStandardRelationshipSchemaLevelV3`
Tests
- `OpenDataContractStandardDataSourceMetadataTest`:
    - Update assertions to handle v3 additional metadata (examples, classification)
    - Verify metadata extraction for both v2 and v3 contracts
- `SinkFactoryTest`:
    - New test: "Should consolidate part files into single JSON file when path has .json suffix"
    - New test: "Should consolidate part files into single CSV file when path has .csv suffix"
    - New test: "Should consolidate part files into single Parquet file when path has .parquet suffix"
    - New test: "Should not consolidate when path is a directory (no file extension)"
    - New test: "Should handle CSV header consolidation correctly with multiple part files"
    - New test: "Should use default Spark behavior when path is directory without file extension"
Bug Fixes
- ODCS Metadata Extraction: Fixed issue where examples and classification were not extracted in v3 contracts
- CSV Header Duplication: Fixed header duplication when consolidating multiple CSV part files
Benefits
- Richer Metadata: ODCS contracts now provide more complete field constraints for data generation
- Better Data Quality: Automatic validation extraction from ODCS contracts ensures generated data meets contract requirements
- Simplified Output: Users can specify exact file names instead of working with Spark's part file directories
- CSV Compatibility: Consolidated CSV files have proper header structure for tools expecting single CSV files
- Clean Workspaces: Automatic cleanup of temporary Spark directories prevents clutter