Iceberg

This guide covers data testing for Iceberg. By the end, you will be able to generate and validate Iceberg tables.

Requirements

5 minutes
Git
Gradle
Docker

Get Started

First, clone the data-caterer-example repo, which already has the required base project set up.

Java, Scala, or YAML:

git clone git@github.com:data-catering/data-caterer-example.git

UI:

Run Data Caterer UI via the 'Quick Start' found here.

Plan Setup

Create a new Java or Scala class, or a YAML plan file.

Java: src/main/java/io/github/datacatering/plan/MyIcebergJavaPlan.java
Scala: src/main/scala/io/github/datacatering/plan/MyIcebergPlan.scala
YAML: docker/data/custom/plan/my-iceberg.yaml

If you are using Java or Scala, make sure your class extends PlanRun.

Java:

import io.github.datacatering.datacaterer.java.api.PlanRun;

public class MyIcebergJavaPlan extends PlanRun {
}

Scala:

import io.github.datacatering.datacaterer.api.PlanRun

class MyIcebergPlan extends PlanRun {
}

YAML:

In docker/data/custom/plan/my-iceberg.yaml:

name: "my_iceberg_plan"
description: "Create account data in Iceberg table"
tasks:
  - name: "iceberg_account_table"
    dataSourceName: "customer_accounts"

UI:

Go here.

This class is where we define all of our configurations for generating data. Helper variables and methods are provided to keep this simple.

Connection Configuration

Within our class, we can start by defining the connection properties used to read from and write to Iceberg.

Java:

var accountTask = iceberg(
        "customer_accounts",              //name
        "account.accounts",               //table name
        "/opt/app/data/customer/iceberg", //warehouse path
        "hadoop",                         //catalog type
        "",                               //catalog uri
        Map.of()                          //additional options
);

Additional options can be found here.

Scala:

val accountTask = iceberg(
  "customer_accounts",              //name
  "account.accounts",               //table name
  "/opt/app/data/customer/iceberg", //warehouse path
  "hadoop",                         //catalog type
  "",                               //catalog uri
  Map()                             //additional options
)

Additional options can be found here.

YAML:

In application.conf:

iceberg {
  customer_accounts {
    path = "/opt/app/data/customer/iceberg"
    path = ${?ICEBERG_WAREHOUSE_PATH}
    catalogType = "hadoop"
    catalogType = ${?ICEBERG_CATALOG_TYPE}
    catalogUri = ""
    catalogUri = ${?ICEBERG_CATALOG_URI}
  }
}

UI:

1. Go to Connection tab in the top bar
2. Select data source as Iceberg
    1. Enter in data source name customer_accounts
    2. Select catalog type hadoop
    3. Enter warehouse path as /opt/app/data/customer/iceberg
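
In the application.conf snippet for this step, each key is assigned twice: first a literal default, then an optional substitution such as ${?ICEBERG_WAREHOUSE_PATH}. In HOCON, the second assignment replaces the default only when that environment variable is set. The Java sketch below mimics that resolution order using only the standard library. The class and method names are hypothetical, and this is an illustration of the override rule rather than how Data Caterer itself loads configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: mimics `key = default` followed by `key = ${?ENV_VAR}` in HOCON.
public class IcebergConnectionDefaults {

    // Returns the environment value when the variable is set, otherwise the default.
    static String resolve(Map<String, String> env, String envVar, String defaultValue) {
        String override = env.get(envVar);
        return override != null ? override : defaultValue;
    }

    // Builds the three connection settings from the application.conf snippet.
    static Map<String, String> connectionSettings(Map<String, String> env) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("path", resolve(env, "ICEBERG_WAREHOUSE_PATH", "/opt/app/data/customer/iceberg"));
        settings.put("catalogType", resolve(env, "ICEBERG_CATALOG_TYPE", "hadoop"));
        settings.put("catalogUri", resolve(env, "ICEBERG_CATALOG_URI", ""));
        return settings;
    }

    public static void main(String[] args) {
        // Uses the real process environment; defaults apply for any variable not exported.
        System.out.println(connectionSettings(System.getenv()));
    }
}
```

For example, exporting ICEBERG_CATALOG_TYPE before a run would change only the catalog type and leave the warehouse path and catalog URI at their defaults.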

Schema

Depending on how you want to define the schema, follow one of the options below:

Manual schema guide
Automatically detect schema from the data source by enabling configuration.enableGeneratePlanAndTasks(true)
Automatically detect schema from a metadata source

Additional Configurations

At the end of data generation, Data Caterer produces a report that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check so that any fields marked as unique have unique values generated.

Java:

var config = configuration()
        .generatedReportsFolderPath("/opt/app/data/report")
        .enableUniqueCheck(true);

execute(config, accountTask);

Scala:

val config = configuration
  .generatedReportsFolderPath("/opt/app/data/report")
  .enableUniqueCheck(true)

execute(config, accountTask)

YAML:

In application.conf:

flags {
  enableUniqueCheck = true
}
folders {
  generatedReportsFolderPath = "/opt/app/data/report"
}

UI:

1. Click on Advanced Configuration towards the bottom of the screen
2. Click on Flag and click on Unique Check
3. Click on Folder and enter /tmp/data-caterer/report for Generated Reports Folder Path
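
To make the idea behind the unique check concrete, here is a minimal conceptual sketch in plain Java: candidate values are generated at random and a value is kept only the first time it appears. This is not Data Caterer's implementation. The class name, the account ID format, and the seed are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Conceptual sketch only: shows what "unique values for a unique field" means.
public class UniqueCheckSketch {

    private static final int ID_SPACE = 10_000; // hypothetical IDs: ACC0000 to ACC9999

    // Draws random account IDs until `count` distinct values have been collected.
    static List<String> generateUniqueAccountIds(int count, long seed) {
        if (count < 0 || count > ID_SPACE) {
            throw new IllegalArgumentException("count must be between 0 and " + ID_SPACE);
        }
        Random random = new Random(seed);
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        while (result.size() < count) {
            String candidate = String.format("ACC%04d", random.nextInt(ID_SPACE));
            // Set.add returns false when the value was already generated, so duplicates are skipped.
            if (seen.add(candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(generateUniqueAccountIds(5, 42L));
    }
}
```

The guard on count matters in any scheme like this: asking for more unique values than the format can express would otherwise never terminate.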

Run

Now we can run the plan we just created using the ./run.sh script in the top-level directory of data-caterer-example.

Java:

./run.sh MyIcebergJavaPlan

Scala:

./run.sh MyIcebergPlan

YAML:

./run.sh my-iceberg.yaml

UI:

Click on Execute at the top

Congratulations! You have now built a data generator that simulates a real-world data scenario. You can compare your plan against the IcebergJavaPlan.java or IcebergPlan.scala files to check that it matches.
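
If you want to inspect what was written, it helps to know where a hadoop-type catalog places the table. By default, Iceberg's Hadoop catalog stores a table under the warehouse path, with one directory per namespace level followed by the table name. The sketch below derives that location for the values used in this guide. The class and method names are hypothetical, the default layout is assumed, and the resulting path refers to the environment where the job runs (for example, inside the Docker container).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: derives the default table directory for a Hadoop-type Iceberg catalog.
public class IcebergTableLocation {

    // Appends each dot-separated part of the identifier (namespace levels, then table) to the warehouse path.
    static Path tableLocation(String warehousePath, String tableIdentifier) {
        Path location = Paths.get(warehousePath);
        for (String part : tableIdentifier.split("\\.")) {
            location = location.resolve(part);
        }
        return location;
    }

    public static void main(String[] args) {
        // Values taken from the connection configuration in this guide.
        System.out.println(tableLocation("/opt/app/data/customer/iceberg", "account.accounts"));
    }
}
```

Under that directory, an Iceberg table normally keeps a metadata folder and a data folder, which is a quick way to confirm that the run produced output.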

Validation

If you want to validate data from an Iceberg source, follow the validation documentation found here.