0.16.11

Deployed: 10-10-2025

Latest features and fixes for Data Catering include:

Open Data Contract Standard (ODCS) Enhancements

  • Data Validations Support: Now extracts and applies data quality rules from ODCS v3 contracts

    • Automatically converts ODCS quality checks to Data Caterer validations
    • Supports both schema-level and field-level quality definitions
    • See the documentation for details
  • Extended Metadata Extraction: Enhanced field-level metadata mapping from ODCS contracts

    • String constraints: minLength, maxLength, pattern (regex), format
    • Numeric constraints: minimum, maximum
    • Examples: Extracted as metadata for documentation/reference
    • Classification: Data classification tags (e.g., public, restricted)
    • Constraints automatically applied to data generation
  • Relationship Modeling: Added support for ODCS relationship definitions

    • Property-level relationships (OpenDataContractStandardRelationshipV3)
    • Schema-level relationships (OpenDataContractStandardRelationshipSchemaLevelV3)
    • Support for foreign key and reference relationships
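
To illustrate the schema-level and field-level quality definitions mentioned above, here is a minimal Python sketch that walks a parsed ODCS v3 contract and collects every quality entry along with where it was declared. It is not Data Caterer's implementation. The function name `collect_quality_checks` and the sample contract are hypothetical, and the `type`/`rule` shape of the quality entries is an assumption based on ODCS v3.0 library checks.

```python
def collect_quality_checks(contract):
    """Return schema-level and field-level quality entries with their location."""
    found = []
    for schema in contract.get("schema") or []:
        table = schema.get("name")
        # Quality rules declared directly on the schema object.
        for check in schema.get("quality") or []:
            found.append({"schema": table, "field": None, "check": check})
        # Quality rules declared on individual properties (fields).
        for prop in schema.get("properties") or []:
            for check in prop.get("quality") or []:
                found.append({"schema": table, "field": prop.get("name"), "check": check})
    return found


# Invented contract, shaped like the result of parsing an ODCS v3 YAML file.
contract = {
    "apiVersion": "v3.0.0",
    "kind": "DataContract",
    "schema": [
        {
            "name": "accounts",
            "quality": [{"type": "library", "rule": "duplicateCount", "mustBeLessThan": 1}],
            "properties": [
                {
                    "name": "account_id",
                    "logicalType": "string",
                    "quality": [{"type": "library", "rule": "nullCheck"}],
                },
                {"name": "balance", "logicalType": "number"},
            ],
        }
    ],
}

checks = collect_quality_checks(contract)
for entry in checks:
    scope = entry["field"] or "<schema>"
    print(entry["schema"], scope, entry["check"]["rule"])
```

Each collected entry would then need to be translated into a concrete validation. That translation step is specific to Data Caterer and is not shown here.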

Single File Output for File Data Sources

  • Automatic Part File Consolidation: When the path includes a file extension (e.g., output.json, data.csv), Spark part files are automatically consolidated into a single file

    • Supports JSON, CSV, Parquet, ORC, XML, and TXT formats
    • Intelligent header handling for CSV files (preserves header from first part, skips headers from subsequent parts)
    • Automatic cleanup of temporary Spark directories
    • Uses coalesce(1) for efficient single-partition writes
  • File Suffix Detection: Automatically detects whether the path should produce a single file or a directory

    • Directory output: /path/to/data → Spark default behavior with part files
    • Single file output: /path/to/data.json → Consolidated single file
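
The suffix rule above can be sketched in a few lines of Python. This only illustrates the decision. The helper name `wants_single_file` is hypothetical, the extension set is taken from the supported-format list above, and the actual detection logic in Data Caterer may differ.

```python
from pathlib import PurePosixPath

# Extensions taken from the supported formats listed above.
SINGLE_FILE_SUFFIXES = {".json", ".csv", ".parquet", ".orc", ".xml", ".txt"}


def wants_single_file(path: str) -> bool:
    """Return True when the path ends in a recognised file extension."""
    return PurePosixPath(path).suffix.lower() in SINGLE_FILE_SUFFIXES


print(wants_single_file("/path/to/data"))       # False
print(wants_single_file("/path/to/data.json"))  # True
```

A path with no extension, or with an unrecognised one, falls through to the directory case and keeps Spark's default part-file layout.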

Core Engine Improvements

  • OpenDataContractStandardDataSourceMetadata:

    • Implement getDataSourceValidations() to extract validations from ODCS v3 contracts
    • Parse ODCS YAML and extract quality checks from schema definitions
    • Support both v2 (no validations) and v3 (with validations) contract formats
  • OpenDataContractStandardV3Mapper:

    • Add getLogicalTypeOptionsMetadata() to extract field constraints
    • Map ODCS logicalTypeOptions to Data Caterer field metadata
    • Extract examples, classification tags, and relationship information
    • Enhanced metadata provides richer data generation context
  • SinkFactory:

    • Add consolidatePartFiles() method for single file output
    • Add detectFileSuffix() to identify when consolidation is needed
    • Add consolidateCsvWithHeaders() for proper CSV header handling
    • Add cleanupDirectory() for temporary file cleanup
    • Enhance JSON save handling to support both array unwrapping and single file output
  • OpenDataContractStandardV3Models:

    • Add relationships field to OpenDataContractStandardSchemaV3
    • Add relationships field to OpenDataContractStandardElementV3
    • New case classes: OpenDataContractStandardRelationshipV3, OpenDataContractStandardRelationshipSchemaLevelV3
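
As a rough picture of the logicalTypeOptions mapping described for OpenDataContractStandardV3Mapper above, the Python sketch below flattens one ODCS v3 property into a dictionary of generation hints. The hint names on the right-hand side of `OPTION_TO_HINT` are hypothetical and are not Data Caterer's actual metadata keys. The sample property is invented for illustration.

```python
# Hypothetical mapping from ODCS logicalTypeOptions keys to generator hint names.
OPTION_TO_HINT = {
    "minLength": "min_length",
    "maxLength": "max_length",
    "pattern": "regex",
    "format": "format",
    "minimum": "min",
    "maximum": "max",
}


def field_hints(prop):
    """Flatten one ODCS v3 property into a dict of generation hints."""
    options = prop.get("logicalTypeOptions") or {}
    hints = {hint: options[key] for key, hint in OPTION_TO_HINT.items() if key in options}
    # Examples and classification are carried along as reference metadata only.
    if prop.get("examples"):
        hints["examples"] = list(prop["examples"])
    if prop.get("classification"):
        hints["classification"] = prop["classification"]
    return hints


# Invented property, shaped like one entry of a parsed ODCS v3 schema.
account_id = {
    "name": "account_id",
    "logicalType": "string",
    "logicalTypeOptions": {"minLength": 8, "maxLength": 12, "pattern": "ACC[0-9]{5,9}"},
    "examples": ["ACC12345"],
    "classification": "restricted",
}

print(field_hints(account_id))
```

Keeping the key translation in one lookup table means a new constraint can be supported by adding a single entry.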

Tests

  • OpenDataContractStandardDataSourceMetadataTest:

    • Update assertions to handle v3 additional metadata (examples, classification)
    • Verify metadata extraction for both v2 and v3 contracts
  • SinkFactoryTest:

    • New test: "Should consolidate part files into single JSON file when path has .json suffix"
    • New test: "Should consolidate part files into single CSV file when path has .csv suffix"
    • New test: "Should consolidate part files into single Parquet file when path has .parquet suffix"
    • New test: "Should not consolidate when path is a directory (no file extension)"
    • New test: "Should handle CSV header consolidation correctly with multiple part files"
    • New test: "Should use default Spark behavior when path is directory without file extension"

Bug Fixes

  • ODCS Metadata Extraction: Fixed an issue where examples and classification were not extracted from v3 contracts
  • CSV Header Duplication: Fixed header duplication when consolidating multiple CSV part files
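
The header handling behind the CSV fix can be sketched as follows: write the header from the first non-empty part file once, then append only the data rows of every part. This is a standalone Python illustration, not the code in SinkFactory. The function name `consolidate_csv_parts` and the `part-*.csv` naming pattern are assumptions, and the sketch assumes the header fits on a single physical line.

```python
import tempfile
from pathlib import Path


def consolidate_csv_parts(part_dir: Path, target: Path) -> None:
    """Concatenate part-*.csv files, writing the header only once."""
    header_written = False
    with target.open("w", encoding="utf-8", newline="") as out:
        for part in sorted(part_dir.glob("part-*.csv")):
            with part.open("r", encoding="utf-8", newline="") as src:
                header = src.readline()
                if not header:
                    continue  # skip empty part files
                if not header_written:
                    out.write(header)
                    header_written = True
                out.writelines(src)  # copy the remaining data rows


# Demo with two small part files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    parts_dir = tmp_path / "spark_out"
    parts_dir.mkdir()
    (parts_dir / "part-00000.csv").write_text("id,name\n1,a\n", encoding="utf-8")
    (parts_dir / "part-00001.csv").write_text("id,name\n2,b\n", encoding="utf-8")
    merged = tmp_path / "data.csv"
    consolidate_csv_parts(parts_dir, merged)
    result = merged.read_text(encoding="utf-8")

print(result)
```

Sorting the part files keeps the row order deterministic. A production version would also need to handle quoted headers that span multiple lines.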

Benefits

  • Richer Metadata: ODCS contracts now provide more complete field constraints for data generation
  • Better Data Quality: Automatic validation extraction from ODCS contracts ensures generated data meets contract requirements
  • Simplified Output: Users can specify exact file names instead of working with Spark's part file directories
  • CSV Compatibility: Consolidated CSV files have proper header structure for tools expecting single CSV files
  • Clean Workspaces: Automatic cleanup of temporary Spark directories prevents clutter