Skip to content

Basic Validations

Run validations on a column to ensure the values adhere to your requirement. Can be set to complex validation logic via SQL expression as well if needed (see here).

Pre-filter

If you want to only run the validation on a specific subset of data, you can define pre-filter conditions. Find more details here.

Equal

Ensure all data in column is equal to certain value. Value can be of any data type. Can use isEqualCol to define SQL expression that can reference other columns.

validation().col("year").isEqual(2021),
validation().col("year").isEqualCol("YEAR(date)"),
validation.col("year").isEqual(2021),
validation.col("year").isEqualCol("YEAR(date)"),
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "year == 2021"

Not Equal

Ensure all data in column is not equal to certain value. Value can be of any data type. Can use isNotEqualCol to define SQL expression that can reference other columns.

validation().col("year").isNotEqual(2021),
validation().col("year").isNotEqualCol("YEAR(date)"),
validation.col("year").isNotEqual(2021)
validation.col("year").isEqualCol("YEAR(date)"),
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "year != 2021"

Null

Ensure all data in column is null.

validation().col("year").isNull()
validation.col("year").isNull
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "ISNULL(year)"

Not Null

Ensure all data in column is not null.

validation().col("year").isNotNull()
validation.col("year").isNotNull
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "ISNOTNULL(year)"

Contains

Ensure all data in column is contains certain string. Column has to have type string.

validation().col("name").contains("peter")
validation.col("name").contains("peter")
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "CONTAINS(name, 'peter')"

Not Contains

Ensure all data in column does not contain certain string. Column has to have type string.

validation().col("name").notContains("peter")
validation.col("name").notContains("peter")
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "!CONTAINS(name, 'peter')"

Unique

Ensure all data in column is unique.

validation().unique("account_id", "name")
validation.unique("account_id", "name")
---
name: "account_checks"
dataSources:
  ...
    validations:
      - unique: ["account_id", "name"]

Less Than

Ensure all data in column is less than certain value. Can use lessThanCol to define SQL expression that can reference other columns.

validation().col("amount").lessThan(100),
validation().col("amount").lessThanCol("balance + 1"),
validation.col("amount").lessThan(100),
validation.col("amount").lessThanCol("balance + 1"),
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "amount < 100"
      - whereExpr: "amount < balance + 1"

Less Than Or Equal

Ensure all data in column is less than or equal to certain value. Can use lessThanOrEqualCol to define SQL expression that can reference other columns.

validation().col("amount").lessThanOrEqual(100),
validation().col("amount").lessThanOrEqualCol("balance + 1"),
validation.col("amount").lessThanOrEqual(100),
validation.col("amount").lessThanCol("balance + 1"),
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "amount <= 100"
      - whereExpr: "amount <= balance + 1"

Greater Than

Ensure all data in column is greater than certain value. Can use greaterThanCol to define SQL expression that can reference other columns.

validation().col("amount").greaterThan(100),
validation().col("amount").greaterThanCol("balance"),
validation.col("amount").greaterThan(100),
validation.col("amount").greaterThanCol("balance"),
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "amount > 100"
      - whereExpr: "amount > balance"

Greater Than Or Equal

Ensure all data in column is greater than or equal to certain value. Can use greaterThanOrEqualCol to define SQL expression that can reference other columns.

validation().col("amount").greaterThanOrEqual(100),
validation().col("amount").greaterThanOrEqualCol("balance"),
validation.col("amount").greaterThanOrEqual(100),
validation.col("amount").greaterThanOrEqualCol("balance"),
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "amount >= 100"
      - whereExpr: "amount >= balance"

Between

Ensure all data in column is between two values. Can use betweenCol to define SQL expression that references other columns.

validation().col("amount").between(100, 200),
validation().col("amount").betweenCol("balance * 0.9", "balance * 1.1"),
validation.col("amount").between(100, 200),
validation.col("amount").betweenCol("balance * 0.9", "balance * 1.1"),
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "amount BETWEEN 100 AND 200"
      - whereExpr: "amount BETWEEN balance * 0.9 AND balance * 1.1"

Not Between

Ensure all data in column is not between two values. Can use notBetweenCol to define SQL expression that references other columns.

validation().col("amount").notBetween(100, 200),
validation().col("amount").notBetweenCol("balance * 0.9", "balance * 1.1"),
validation.col("amount").notBetween(100, 200)
validation.col("amount").notBetweenCol("balance * 0.9", "balance * 1.1"),
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "amount NOT BETWEEN 100 AND 200"
      - whereExpr: "amount NOT BETWEEN balance * 0.9 AND balance * 1.1"

In

Ensure all data in column is in set of defined values.

validation().col("status").in("open", "closed")
validation.col("status").in("open", "closed")
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "status IN ('open', 'closed')"

Matches

Ensure all data in column matches certain regex expression.

validation().col("account_id").matches("ACC[0-9]{8}")
validation.col("account_id").matches("ACC[0-9]{8}")
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "REGEXP(account_id, ACC[0-9]{8})"

Not Matches

Ensure all data in column does not match certain regex expression.

validation().col("account_id").notMatches("^acc.*")
validation.col("account_id").notMatches("^acc.*")
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "!REGEXP(account_id, '^acc.*')"

Starts With

Ensure all data in column starts with certain string. Column has to have type string.

validation().col("account_id").startsWith("ACC")
validation.col("account_id").startsWith("ACC")
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "STARTSWITH(account_id, 'ACC')"

Not Starts With

Ensure all data in column does not start with certain string. Column has to have type string.

validation().col("account_id").notStartsWith("ACC")
validation.col("account_id").notStartsWith("ACC")
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "!STARTSWITH(account_id, 'ACC')"

Ends With

Ensure all data in column ends with certain string. Column has to have type string.

validation().col("account_id").endsWith("ACC")
validation.col("account_id").endsWith("ACC")
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "ENDWITH(account_id, 'ACC')"

Not Ends With

Ensure all data in column does not end with certain string. Column has to have type string.

validation().col("account_id").notEndsWith("ACC")
validation.col("account_id").notEndsWith("ACC")
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "!ENDWITH(account_id, 'ACC')"

Size

Ensure all data in column has certain size. Column has to have type array or map.

validation().col("transactions").size(5)
validation.col("transactions").size(5)
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "SIZE(transactions, 5)"

Not Size

Ensure all data in column does not have certain size. Column has to have type array or map.

validation().col("transactions").notSize(5)
validation.col("transactions").notSize(5)
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "SIZE(transactions) != 5"

Less Than Size

Ensure all data in column has size less than certain value. Column has to have type array or map.

validation().col("transactions").lessThanSize(5)
validation.col("transactions").lessThanSize(5)
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "SIZE(transactions) < 5"

Less Than Or Equal Size

Ensure all data in column has size less than or equal to certain value. Column has to have type array or map.

validation().col("transactions").lessThanOrEqualSize(5)
validation.col("transactions").lessThanOrEqualSize(5)
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "SIZE(transactions) <= 5"

Greater Than Size

Ensure all data in column has size greater than certain value. Column has to have type array or map.

validation().col("transactions").greaterThanSize(5)
validation.col("transactions").greaterThanSize(5)
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "SIZE(transactions) > 5"

Greater Than Or Equal Size

Ensure all data in column has size greater than or equal to certain value. Column has to have type array or map.

validation().col("transactions").greaterThanOrEqualSize(5)
validation.col("transactions").greaterThanOrEqualSize(5)
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "SIZE(transactions) >= 5"

Luhn Check

Ensure all data in column passes luhn check. Luhn check is used to validate credit card numbers and certain identification numbers (see here for more details).

validation().col("credit_card").luhnCheck()
validation.col("credit_card").luhnCheck
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "LUHN_CHECK(credit_card)"

Has Type

Ensure all data in column has certain data type.

validation().col("id").hasType("string")
validation.col("id").hasType("string")
---
name: "account_checks"
dataSources:
  ...
    validations:
      - whereExpr: "TYPEOF(id) == 'string'"

Expression

Ensure all data in column adheres to SQL expression defined that returns back a boolean. You can define complex logic in here that could combine multiple columns.

For example, CASE WHEN status == 'open' THEN balance > 0 ELSE balance == 0 END would check all rows with status open to have balance greater than 0, otherwise, check the balance is 0.

var csvTxns = csv("transactions", "/tmp/csv", Map.of("header", "true"))
  .validations(
    validation().expr("amount < 100"),
    validation().expr("year == 2021").errorThreshold(0.1),  //equivalent to if error percentage is > 10%, then fail
    validation().expr("REGEXP_LIKE(name, 'Peter .*')").errorThreshold(200)  //equivalent to if number of errors is > 200, then fail
  );

var conf = configuration().enableValidation(true);
val csvTxns = csv("transactions", "/tmp/csv", Map("header" -> "true"))
  .validations(
    validation.expr("amount < 100"),
    validation.expr("year == 2021").errorThreshold(0.1),  //equivalent to if error percentage is > 10%, then fail
    validation.expr("REGEXP_LIKE(name, 'Peter .*')").errorThreshold(200)  //equivalent to if number of errors is > 200, then fail
  )

val conf = configuration.enableValidation(true)
---
name: "account_checks"
dataSources:
  transactions:
    options:
      path: "/tmp/csv"
    validations:
      - whereExpr: "amount < 100"
      - whereExpr: "year == 2021"
        errorThreshold: 0.1   #equivalent to if error percentage is > 10%, then fail
      - whereExpr: "REGEXP_LIKE(name, 'Peter .*')"
        errorThreshold: 200   #equivalent to if number of errors is > 200, then fail
        description: "Should be lots of Peters"

#enableValidation inside application.conf