Basic Validations
Run validations on a field to ensure the values adhere to your requirement. Can be set to complex validation logic via SQL expression as well if needed (see here).
Pre-filter
If you want to only run the validation on a specific subset of data, you can define pre-filter conditions. Find more details here.
Equal
Ensure all data in field is equal/not equal to certain value. Value can be of any data type. Can use isEqualField
to
define SQL expression that can reference other fields.
Null
Ensure all data in field is null or not null.
Contains
Ensure all data in field is contains/not contains a certain string. Field has to have type string.
Unique
Ensure all data in field is unique.
Less Than
Ensure all data in field is less than certain value. Can use lessThanField
to define SQL expression that can reference
other fields.
Greater Than
Ensure all data in field is greater than certain value. Can use greaterThanField
to define SQL expression
that can reference other fields.
Between
Ensure all data in field is between two values. Can use betweenFields
to define SQL expression that references other
fields.
In
Ensure all data in field is in set of defined values.
Matches
Ensure all data in field matches certain regex expression(s).
validation().field("account_id").matches("ACC[0-9]{8}")
validation().field("account_id").matchesList(List.of("ACC[0-9]{8}", "ACC[0-9]{10}")) //check matches all regexes
validation().field("account_id").matches("ACC[0-9]{8}", true) //check not matches
validation().field("account_id").matchesList(List.of("ACC[0-9]{8}", "ACC[0-9]{10}"), true, false) //check does not match all regexes
validation().field("account_id").matchesList(List.of("ACC[0-9]{8}", "ACC[0-9]{10}"), false, true) //check matches at least one regex
validation().field("account_id").matchesList(List.of("ACC[0-9]{8}", "ACC[0-9]{10}"), false, false) //check does not match at least one regex
validation.field("account_id").matches("ACC[0-9]{8}")
validation.field("account_id").matchesList(List("ACC[0-9]{8}", "ACC[0-9]{10}")) //check matches all regexes
validation.field("account_id").matches("ACC[0-9]{8}", true) //check not matches
validation.field("account_id").matchesList(List("ACC[0-9]{8}", "ACC[0-9]{10}"), true, false) //check does not match all regexes
validation.field("account_id").matchesList(List("ACC[0-9]{8}", "ACC[0-9]{10}"), false, true) //check matches at least one regex
validation.field("account_id").matchesList(List("ACC[0-9]{8}", "ACC[0-9]{10}"), false, false) //check does not match at least one regex
---
name: "account_checks"
dataSources:
...
validations:
- expr: "REGEXP(account_id, ACC[0-9]{8})"
- field: "account_id"
validation:
- type: "matches"
regex: "ACC[0-9]{8}"
- type: "matches"
regex: "ACC[0-9]{8}"
negate: true
- type: "matchesList"
regexes: ["ACC[0-9]{8}", "ACC[0-9]{10}"]
matchAll: true
negate: true
Starts With
Ensure all data in field starts with certain string. Field has to have type string.
Ends With
Ensure all data in field ends with certain string. Field has to have type string.
Size
Ensure all data in field has certain size. Field has to have type array or map.
Less Than Size
Ensure all data in field has size less than certain value. Field has to have type array or map.
Greater Than Size
Ensure all data in field has size greater than certain value. Field has to have type array or map.
Luhn Check
Ensure all data in field passes Luhn check. Luhn check is used to validate credit card numbers and certain identification numbers (see here for more details).
Has Type
Ensure all data in field has certain data type.
---
name: "account_checks"
dataSources:
...
validations:
- expr: "TYPEOF(id) == 'string'"
- field: "id"
validation:
- type: "hasType"
value: "string"
- type: "hasType"
value: "string"
negate: true
- type: "hasTypes"
values: ["string", "double"]
- type: "hasType"
values: ["string", "double"]
negate: true
Distinct Values In Set
Check if distinct values of field exist in set.
Distinct Values Contains Set
Check if distinct values of field contains set of values.
Distinct Values Equal
Check if distinct values of field equals set of values.
Max, Mean, Median, Min, Standard Deviation, Sum Between
Check if aggregation of values for field is between set of values.
validation().field("amount").maxBetween(1, 100)
validation().field("amount").meanBetween(1, 100)
validation().field("amount").medianBetween(1, 100)
validation().field("amount").minBetween(1, 100)
validation().field("amount").stdDevBetween(1, 100)
validation().field("amount").sumBetween(1, 100)
validation().field("amount").maxBetween(1, 100, true) //check max amount is not between 1 and 100
validation().field("amount").meanBetween(1, 100, true)
validation().field("amount").medianBetween(1, 100, true)
validation().field("amount").minBetween(1, 100, true)
validation().field("amount").stdDevBetween(1, 100, true)
validation().field("amount").sumBetween(1, 100, true)
validation.field("amount").maxBetween(1, 100)
validation.field("amount").meanBetween(1, 100)
validation.field("amount").medianBetween(1, 100)
validation.field("amount").minBetween(1, 100)
validation.field("amount").stdDevBetween(1, 100)
validation.field("amount").sumBetween(1, 100)
validation.field("amount").maxBetween(1, 100, true) //check max amount is not between 1 and 100
validation.field("amount").meanBetween(1, 100, true)
validation.field("amount").medianBetween(1, 100, true)
validation.field("amount").minBetween(1, 100, true)
validation.field("amount").stdDevBetween(1, 100, true)
validation.field("amount").sumBetween(1, 100, true)
---
name: "account_checks"
dataSources:
...
validations:
- field: "amount"
validation:
- type: "maxBetween"
min: 1
max: 100
- type: "meanBetween"
min: 1
max: 100
- type: "medianBetween"
min: 1
max: 100
- type: "minBetween"
min: 1
max: 100
- type: "stdDevBetween"
min: 1
max: 100
- type: "sumBetween"
min: 1
max: 100
negate: true
Length Equal/Between
Check if length of field values is between or equal to value(s).
Is Increasing/Decreasing
Check if values of a field are increasing or decreasing.
Is JSON Parsable
Check if values of a field are JSON parsable.
Match JSON Schema
Check if values of a field match JSON schema.
Match Date Time Format
Check if values of a field match date time format (defined formats).
Most Common Value In Set
Check if the most common field value exists in set of values.
Unique Values Proportion Between
Check if the fields proportion of unique values is between two values.
Quantile Values Between
Check if quantiles of field values is within range.
Expression
Ensure all data in field adheres to SQL expression defined that returns back a boolean. You can define complex logic in here that could combine multiple fields.
For example, CASE WHEN status == 'open' THEN balance > 0 ELSE balance == 0 END
would check all rows with status
open to have balance
greater than 0, otherwise, check the balance
is 0.
var csvTxns = csv("transactions", "/tmp/csv", Map.of("header", "true"))
.validations(
validation().expr("amount < 100"),
validation().expr("year == 2021").errorThreshold(0.1), //equivalent to if error percentage is > 10%, then fail
validation().expr("REGEXP_LIKE(name, 'Peter .*')").errorThreshold(200) //equivalent to if number of errors is > 200, then fail
);
var conf = configuration().enableValidation(true);
val csvTxns = csv("transactions", "/tmp/csv", Map("header" -> "true"))
.validations(
validation.expr("amount < 100"),
validation.expr("year == 2021").errorThreshold(0.1), //equivalent to if error percentage is > 10%, then fail
validation.expr("REGEXP_LIKE(name, 'Peter .*')").errorThreshold(200) //equivalent to if number of errors is > 200, then fail
)
val conf = configuration.enableValidation(true)
---
name: "account_checks"
dataSources:
transactions:
options:
path: "/tmp/csv"
validations:
- expr: "amount < 100"
- expr: "year == 2021"
errorThreshold: 0.1 #equivalent to if error percentage is > 10%, then fail
- expr: "REGEXP_LIKE(name, 'Peter .*')"
errorThreshold: 200 #equivalent to if number of errors is > 200, then fail
description: "Should be lots of Peters"
#enableValidation inside application.conf