Great Expectations Source
Creating a data generator for a JSON file and validating the data based on expectations in Great Expectations.
Requirements
- 10 minutes
- Git
- Gradle
- Docker
Get Started
First, clone the data-caterer repo, which already contains the base project setup.
Great Expectations Setup
A sample expectations file that will be used for this guide can be found here.
If you want to use your own expectations file, add it to the docker/mount/ge folder and follow the steps below.
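For readers new to Great Expectations, an expectations file is essentially a list of rules, each naming an expectation type and the column it applies to. The plain-Java sketch below mimics how two such rules could be evaluated against a single record. It does not use the Data Caterer or Great Expectations APIs: the class, the helper methods, the sample rules and the sample row are all hypothetical, and only the two expectation type names are taken from Great Expectations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration only: none of these classes belong to Data Caterer or Great Expectations.
final class ExpectationSketch {

    // One rule: its Great Expectations style type name, the target column, and the check to apply.
    static final class Expectation {
        final String type;
        final String column;
        final Predicate<Object> check;

        Expectation(String type, String column, Predicate<Object> check) {
            this.type = type;
            this.column = column;
            this.check = check;
        }
    }

    static Expectation notNull(String column) {
        return new Expectation("expect_column_values_to_not_be_null", column, value -> value != null);
    }

    // Inclusive bounds; a missing or non-numeric value fails here, which is a simplification of this sketch.
    static Expectation between(String column, double min, double max) {
        return new Expectation("expect_column_values_to_be_between", column, value ->
                value instanceof Number
                        && ((Number) value).doubleValue() >= min
                        && ((Number) value).doubleValue() <= max);
    }

    // Returns "type(column)" for every expectation the row does not satisfy.
    static List<String> failedExpectations(Map<String, Object> row, List<Expectation> suite) {
        List<String> failed = new ArrayList<>();
        for (Expectation expectation : suite) {
            if (!expectation.check.test(row.get(expectation.column))) {
                failed.add(expectation.type + "(" + expectation.column + ")");
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Made-up suite and row; the real taxi-expectations.json may define different rules.
        List<Expectation> suite = List.of(notNull("vendor_id"), between("passenger_count", 1, 6));
        Map<String, Object> row = Map.of("passenger_count", 9);
        System.out.println(failedExpectations(row, suite));
    }
}
```

In the guide itself none of this is written by hand: Data Caterer reads the rules from the expectations file and applies them to the generated data.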
Plan Setup
Create a file depending on which interface you want to use.
- Java: src/main/java/io/github/datacatering/plan/MyAdvancedGreatExpectationsJavaPlanRun.java
- Scala: src/main/scala/io/github/datacatering/plan/MyAdvancedGreatExpectationsPlanRun.scala
- YAML: docker/data/custom/plan/my-great-expectations.yaml
In docker/data/custom/plan/my-great-expectations.yaml:
name: "my_great_expectations_plan"
description: "Create account data in JSON format and validate via Great Expectations metadata"
tasks:
  - name: "json_task"
    dataSourceName: "my_json"
In docker/data/custom/application.conf:
In the UI:
- Click on Advanced Configuration towards the bottom of the screen
- Click on Flag and click on Generate Validations
- Click on Folder and enter /tmp/data-caterer/report for Generated Reports Folder Path
We enable the generate validations flag so that validations can be read from external sources, and we save the reports under a folder we can easily access.
Great Expectations
To point to a specific expectations file, we create a metadata source as seen below.
In docker/data/custom/task/file/json/json-great-expectations-task.yaml:
In the UI:
- Click on Connection tab at the top
- Select Great Expectations as the data source and enter my-great-expectations
- Copy this file into /tmp/ge/taxi-expectations.json
- Enter /tmp/ge/taxi-expectations.json as the Expectations File
Schema & Validation
To simulate a scenario where we have an existing data source, we will manually create a sample dataset.
At the end, we point to our expectations metadata source to use those validations to validate the data.
Java:
// "greatExpectations" refers to the Great Expectations metadata source created in the previous step
var jsonTask = json("my_json", "/opt/app/data/json", Map.of("saveMode", "overwrite"))
        .fields(
                field().name("vendor_id"),
                field().name("pickup_datetime").type(TimestampType.instance()),
                field().name("dropoff_datetime").type(TimestampType.instance()),
                field().name("passenger_count").type(IntegerType.instance()),
                field().name("trip_distance").type(DoubleType.instance()),
                field().name("rate_code_id"),
                field().name("store_and_fwd_flag"),
                field().name("pickup_location_id"),
                field().name("dropoff_location_id"),
                field().name("payment_type"),
                field().name("fare_amount").type(DoubleType.instance()),
                field().name("extra"),
                field().name("mta_tax").type(DoubleType.instance()),
                field().name("tip_amount").type(DoubleType.instance()),
                field().name("tolls_amount").type(DoubleType.instance()),
                field().name("improvement_surcharge").type(DoubleType.instance()),
                field().name("total_amount").type(DoubleType.instance()),
                field().name("congestion_surcharge").type(DoubleType.instance())
        )
        .validations(greatExpectations);
Scala:
// "greatExpectationsSource" refers to the Great Expectations metadata source created in the previous step
val jsonTask = json("my_json", "/opt/app/data/taxi_json", Map("saveMode" -> "overwrite"))
  .fields(
    field.name("vendor_id"),
    field.name("pickup_datetime").`type`(TimestampType),
    field.name("dropoff_datetime").`type`(TimestampType),
    field.name("passenger_count").`type`(IntegerType),
    field.name("trip_distance").`type`(DoubleType),
    field.name("rate_code_id"),
    field.name("store_and_fwd_flag"),
    field.name("pickup_location_id"),
    field.name("dropoff_location_id"),
    field.name("payment_type"),
    field.name("fare_amount").`type`(DoubleType),
    field.name("extra"),
    field.name("mta_tax").`type`(DoubleType),
    field.name("tip_amount").`type`(DoubleType),
    field.name("tolls_amount").`type`(DoubleType),
    field.name("improvement_surcharge").`type`(DoubleType),
    field.name("total_amount").`type`(DoubleType),
    field.name("congestion_surcharge").`type`(DoubleType),
  )
  .validations(greatExpectationsSource)
In docker/data/custom/task/json/json-great-expectations-task.yaml:
name: "json_task"
steps:
  - name: "accounts"
    type: "json"
    options:
      path: "/opt/app/data/json"
      metadataSourceType: "greatExpectations"
      expectationsFile: "/opt/app/mount/ge/taxi-expectations.json"
    fields:
      - name: "vendor_id"
      - name: "pickup_datetime"
        type: "timestamp"
      - name: "dropoff_datetime"
        type: "timestamp"
      - name: "passenger_count"
        type: "integer"
      - name: "trip_distance"
        type: "double"
      - name: "rate_code_id"
      - name: "store_and_fwd_flag"
      - name: "pickup_location_id"
      - name: "dropoff_location_id"
      - name: "payment_type"
      - name: "fare_amount"
        type: "double"
      - name: "extra"
      - name: "mta_tax"
        type: "double"
      - name: "tip_amount"
        type: "double"
      - name: "tolls_amount"
        type: "double"
      - name: "improvement_surcharge"
        type: "double"
      - name: "total_amount"
        type: "double"
      - name: "congestion_surcharge"
        type: "double"
In the UI:
- Click on Generation and tick the Manual checkbox
- Click on + Field
- Add name as vendor_id
- Click on Select data type and select string
- Continue with other fields and data types
Run
Let's run it and see what happens.
cd ..
./run.sh
#input class MyAdvancedGreatExpectationsJavaPlanRun or MyAdvancedGreatExpectationsPlanRun
#after completing
#open docker/sample/report/index.html
In the UI:
- Click the button Execute at the top
- Progress updates will show in the bottom right corner
- Click on History at the top
- Check for your plan name and see the result summary
- Click on Report on the right side to see more details of what was executed
It should look something like this.

We have just validated our data source using expectations read from a Great Expectations file. This gives us an easy way to reuse existing data validations and apply them to generated data in test environments.
We may still want to add our own validations beyond those defined in Great Expectations.
Custom validation
Suppose we also need to check that trip_distance is less than 500, a rule that is not included in the Great Expectations file. We can add it here alongside the existing expectations.
In docker/data/custom/validation/great-expectations-validation.yaml:
In the UI:
- Under Validation, click on Manual
- Click on + Validation and go to Select validation type... as Field
- Set Field to trip_distance
- Click on + next to the field name and select Less Than
- Enter 500 as the value
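To make the rule concrete, here is a minimal plain-Java sketch of what a "less than 500" check on trip_distance amounts to when applied to a batch of values. It is not the Data Caterer validation API; the class name, method names and sample values are hypothetical.

```java
import java.util.List;

// Hypothetical illustration only: not part of the Data Caterer API.
final class TripDistanceValidation {

    static final double LIMIT = 500.0;

    // "Less than" is strict, so a trip_distance of exactly 500 does not pass.
    static boolean isValid(double tripDistance) {
        return tripDistance < LIMIT;
    }

    // Counts how many values in the batch break the rule.
    static long countFailures(List<Double> tripDistances) {
        return tripDistances.stream()
                .filter(distance -> !isValid(distance))
                .count();
    }

    public static void main(String[] args) {
        // Made-up sample values for trip_distance.
        List<Double> sample = List.of(1.2, 499.9, 500.0, 742.5);
        System.out.println("failed records: " + countFailures(sample));
    }
}
```

Because the comparison is strict, a boundary value of 500 counts as a failure; if the boundary should pass, a "less than or equal" style rule would be the one to pick instead.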
Let's test it out by running it again.

Check out the full example under GreatExpectationsPlanRun in the example repo.