Great Expectations Source
Creating a data generator for a JSON file and validating the data based on expectations in Great Expectations.
Requirements
- 10 minutes
- Git
- Gradle
- Docker
Get Started
First, we will clone the data-caterer-example repo, which already has the required base project setup.
Great Expectations Setup
A sample expectations file that will be used for this guide can be found here.
If you want to use your own expectations file, add it to the docker/mount/ge folder and follow the steps below.
Plan Setup
Create a file depending on which interface you want to use.
- Java:
src/main/java/io/github/datacatering/plan/MyAdvancedGreatExpectationsJavaPlanRun.java
- Scala:
src/main/scala/io/github/datacatering/plan/MyAdvancedGreatExpectationsPlanRun.scala
- YAML:
docker/data/custom/plan/my-great-expectations.yaml
In docker/data/custom/plan/my-great-expectations.yaml:
name: "my_great_expectations_plan"
description: "Create account data in JSON format and validate via Great Expectations metadata"
tasks:
  - name: "json_task"
    dataSourceName: "my_json"
In docker/data/custom/application.conf:
- Click on Advanced Configuration towards the bottom of the screen
- Click on Flag and click on Generate Validations
- Click on Folder and enter /tmp/data-caterer/report for Generated Reports Folder Path
We enable the generate validations flag so that validations can be read from external sources, and we save the reports under a folder we can easily access.
Great Expectations
To point to a specific expectations file, we create a metadata source as seen below.
In docker/data/custom/task/file/json/json-great-expectations-task.yaml:
- Click on Connection tab at the top
- Select Great Expectations as the data source and enter my-great-expectations
- Copy this file into /tmp/ge/taxi-expectations.json
- Enter /tmp/ge/taxi-expectations.json as the Expectations File
Schema & Validation
To simulate a scenario where we have an existing data source, we will manually create a sample dataset.
At the end, we point to our expectations metadata source to use those validations to validate the data.
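To make the idea of validating records against an expectation more concrete, here is a minimal standalone sketch in plain Java. It does not use the Data Caterer or Great Expectations APIs and is not how either tool is implemented; the RangeRule class, the passenger_count bounds, and the sample rows are all hypothetical and chosen only for illustration.

```java
import java.util.List;
import java.util.Map;

public class RangeExpectationSketch {
    // Hypothetical, simplified stand-in for one range rule read from an expectations file.
    static final class RangeRule {
        final String column;
        final double minValue;
        final double maxValue;

        RangeRule(String column, double minValue, double maxValue) {
            this.column = column;
            this.minValue = minValue;
            this.maxValue = maxValue;
        }

        // A row violates the rule when the column is present and its value
        // falls outside the inclusive range [minValue, maxValue].
        boolean isViolatedBy(Map<String, Double> row) {
            Double value = row.get(column);
            return value != null && (value < minValue || value > maxValue);
        }
    }

    // Counts the rows that violate the rule; rows without the column are skipped in this sketch.
    static long countViolations(RangeRule rule, List<Map<String, Double>> rows) {
        return rows.stream().filter(rule::isViolatedBy).count();
    }

    public static void main(String[] args) {
        RangeRule rule = new RangeRule("passenger_count", 1, 6); // made-up bounds
        List<Map<String, Double>> rows = List.of(
                Map.of("passenger_count", 2.0),
                Map.of("passenger_count", 9.0),
                Map.of("trip_distance", 3.4));
        System.out.println("violations: " + countViolations(rule, rows));
    }
}
```

In the guide itself this work is delegated to Data Caterer, which reads the rules from the expectations file instead of having them hard-coded as above.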
var jsonTask = json("my_json", "/opt/app/data/json", Map.of("saveMode", "overwrite"))
        .fields(
                field().name("vendor_id"),
                field().name("pickup_datetime").type(TimestampType.instance()),
                field().name("dropoff_datetime").type(TimestampType.instance()),
                field().name("passenger_count").type(IntegerType.instance()),
                field().name("trip_distance").type(DoubleType.instance()),
                field().name("rate_code_id"),
                field().name("store_and_fwd_flag"),
                field().name("pickup_location_id"),
                field().name("dropoff_location_id"),
                field().name("payment_type"),
                field().name("fare_amount").type(DoubleType.instance()),
                field().name("extra"),
                field().name("mta_tax").type(DoubleType.instance()),
                field().name("tip_amount").type(DoubleType.instance()),
                field().name("tolls_amount").type(DoubleType.instance()),
                field().name("improvement_surcharge").type(DoubleType.instance()),
                field().name("total_amount").type(DoubleType.instance()),
                field().name("congestion_surcharge").type(DoubleType.instance())
        )
        .validations(greatExpectations);
val jsonTask = json("my_json", "/opt/app/data/taxi_json", Map("saveMode" -> "overwrite"))
  .fields(
    field.name("vendor_id"),
    field.name("pickup_datetime").`type`(TimestampType),
    field.name("dropoff_datetime").`type`(TimestampType),
    field.name("passenger_count").`type`(IntegerType),
    field.name("trip_distance").`type`(DoubleType),
    field.name("rate_code_id"),
    field.name("store_and_fwd_flag"),
    field.name("pickup_location_id"),
    field.name("dropoff_location_id"),
    field.name("payment_type"),
    field.name("fare_amount").`type`(DoubleType),
    field.name("extra"),
    field.name("mta_tax").`type`(DoubleType),
    field.name("tip_amount").`type`(DoubleType),
    field.name("tolls_amount").`type`(DoubleType),
    field.name("improvement_surcharge").`type`(DoubleType),
    field.name("total_amount").`type`(DoubleType),
    field.name("congestion_surcharge").`type`(DoubleType),
  )
  .validations(greatExpectationsSource)
In docker/data/custom/task/json/json-great-expectations-task.yaml:
name: "json_task"
steps:
  - name: "accounts"
    type: "json"
    options:
      path: "/opt/app/data/json"
      metadataSourceType: "greatExpectations"
      expectationsFile: "/opt/app/mount/ge/taxi-expectations.json"
    fields:
      - name: "vendor_id"
      - name: "pickup_datetime"
        type: "timestamp"
      - name: "dropoff_datetime"
        type: "timestamp"
      - name: "passenger_count"
        type: "integer"
      - name: "trip_distance"
        type: "double"
      - name: "rate_code_id"
      - name: "store_and_fwd_flag"
      - name: "pickup_location_id"
      - name: "dropoff_location_id"
      - name: "payment_type"
      - name: "fare_amount"
        type: "double"
      - name: "extra"
      - name: "mta_tax"
        type: "double"
      - name: "tip_amount"
        type: "double"
      - name: "tolls_amount"
        type: "double"
      - name: "improvement_surcharge"
        type: "double"
      - name: "total_amount"
        type: "double"
      - name: "congestion_surcharge"
        type: "double"
- Click on Generation and tick the Manual checkbox
- Click on + Field
- Add name as vendor_id
- Click on Select data type and select string
- Continue with other fields and data types
Run
Let's run it and see what happens.
cd ..
./run.sh
#input class MyAdvancedGreatExpectationsJavaPlanRun or MyAdvancedGreatExpectationsPlanRun
#after completing
#open docker/sample/report/index.html
- Click the button Execute at the top
- Progress updates will show in the bottom right corner
- Click on History at the top
- Check for your plan name and see the result summary
- Click on Report on the right side to see more details of what was executed
It should look something like this.
So we were just able to validate our data source by reading a Great Expectations file. Simple! This gives us an easy way to reuse existing data validations and apply them to generated data in test environments.
But we may still want to add our own validations beyond what is in Great Expectations.
Custom validation
We found that we should also check that trip_distance is less than 500, but this check was not included in the Great Expectations file. No worries, we can simply add it alongside the existing expectations.
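As a rough illustration of what this extra rule checks, the following standalone Java sketch applies a strict less-than-500 test to a handful of made-up trip distances. It is not the Data Caterer API; the class name, method names, and sample values are hypothetical, and only the threshold of 500 comes from this guide.

```java
import java.util.List;

public class TripDistanceCheck {
    // Threshold from the guide: trip_distance has to be less than 500.
    static final double MAX_TRIP_DISTANCE = 500.0;

    // Returns true when a single value satisfies the strict "less than" rule.
    static boolean isValid(double tripDistance) {
        return tripDistance < MAX_TRIP_DISTANCE;
    }

    // Counts how many values break the rule.
    static long countFailures(List<Double> tripDistances) {
        return tripDistances.stream().filter(d -> !isValid(d)).count();
    }

    public static void main(String[] args) {
        // Made-up sample values; 500.0 itself fails because the comparison is strict.
        List<Double> sample = List.of(1.2, 17.5, 499.9, 500.0, 812.3);
        System.out.println("failures: " + countFailures(sample));
    }
}
```

Note that a value of exactly 500 fails under a strict less-than rule, which is worth keeping in mind when choosing between Less Than and a less-than-or-equal variant.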
In docker/data/custom/validation/great-expectations-validation.yaml:
- Under Validation, click on Manual
- Click on + Validation and set Select validation type... to Field
- Set Field to trip_distance
- Click on + next to the field name and select Less Than
- Enter 500 as the value
Let's test it out by running it again.
Check out the full example under GreatExpectationsPlanRun in the example repo.