Iceberg
This guide covers data testing for Iceberg. By the end, you will be able to generate and validate Iceberg tables.
Requirements
- 5 minutes
- Git
- Gradle
- Docker
Get Started
First, clone the data-caterer-example repo, which already contains the required base project setup.
Plan Setup
Create a file depending on which interface you want to use.
- Java: src/main/java/io/github/datacatering/plan/MyIcebergJavaPlan.java
- Scala: src/main/scala/io/github/datacatering/plan/MyIcebergPlan.scala
- YAML: docker/data/custom/plan/my-iceberg.yaml
In docker/data/custom/plan/my-iceberg.yaml:
In the UI:
- Go to the Connection tab in the top bar
- Select data source as Iceberg
- Enter the data source name my_iceberg
- Select catalog type hadoop
- Enter warehouse path as /opt/app/data/customer/iceberg
This class is where we define all of our configurations for generating data. Helper variables and methods are provided to make it simple and easy to use.
Connection Configuration
Within our class, we can start by defining the connection properties to read/write from/to Iceberg.
Java:
var accountTask = iceberg(
    "customer_accounts",               // name
    "account.accounts",                // table name
    "/opt/app/data/customer/iceberg",  // warehouse path
    "hadoop",                          // catalog type
    "",                                // catalog uri
    Map.of()                           // additional options
);
Additional options can be found here.
Scala:
val accountTask = iceberg(
  "customer_accounts",               // name
  "account.accounts",                // table name
  "/opt/app/data/customer/iceberg",  // warehouse path
  "hadoop",                          // catalog type
  "",                                // catalog uri
  Map()                              // additional options
)
In docker/data/custom/application.conf:
In the UI:
- We have already created the connection details in the Plan Setup step
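To make the relationship between the warehouse path and the table name concrete: with the hadoop catalog type, Iceberg's Hadoop catalog stores each table in a directory under the warehouse path, using one directory level per namespace part followed by the table name. The sketch below only illustrates that layout with plain string handling. The class and method names (IcebergTableLocation, tableLocation) are hypothetical and are not part of Data Caterer or Iceberg.

```java
import java.util.Objects;

// Hypothetical helper, for illustration only: derives the directory that a
// Hadoop-catalog table such as "account.accounts" occupies in a warehouse.
final class IcebergTableLocation {
    private IcebergTableLocation() {}

    /** Joins a warehouse path and a dotted table identifier into a table directory. */
    static String tableLocation(String warehousePath, String tableIdentifier) {
        Objects.requireNonNull(warehousePath, "warehousePath");
        Objects.requireNonNull(tableIdentifier, "tableIdentifier");
        // Drop trailing slashes so the result never contains "//".
        String base = warehousePath.replaceAll("/+$", "");
        // Each namespace level and the table name become one directory level.
        return base + "/" + tableIdentifier.replace('.', '/');
    }

    public static void main(String[] args) {
        // Values taken from the connection definition in this guide.
        System.out.println(tableLocation("/opt/app/data/customer/iceberg", "account.accounts"));
        // prints /opt/app/data/customer/iceberg/account/accounts
    }
}
```

With the values used in this guide, the generated table should therefore end up in /opt/app/data/customer/iceberg/account/accounts, as seen from wherever the warehouse path is resolved.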
Schema
Depending on how you want to define the schema, follow one of the options below:
- Manual schema guide
- To automatically detect the schema from the data source, enable configuration.enableGeneratePlanAndTasks(true)
- Automatically detect schema from a metadata source
Additional Configurations
At the end of data generation, a report is generated that summarises the actions performed. We can control the output folder of that report via configuration. We will also enable the unique check so that any fields marked as unique have unique values generated.
In docker/data/custom/application.conf:
In the UI:
- Click on Advanced Configuration towards the bottom of the screen
- Click on Flag and click on Unique Check
- Click on Folder and enter /tmp/data-caterer/report for Generated Reports Folder Path
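For the Java interface, the same two settings could look roughly like the sketch below. It is an assumption, not something shown in this guide, that the plan class extends PlanRun from io.github.datacatering.datacaterer.java.api and that the builder methods configuration(), generatedReportsFolderPath, enableUniqueCheck and execute exist with these signatures. Check them against the Data Caterer version used in data-caterer-example before relying on them. The fragment needs the Data Caterer dependency from that project to compile.

```java
package io.github.datacatering.plan;

// Assumption: base class location as used in the data-caterer-example project.
import io.github.datacatering.datacaterer.java.api.PlanRun;

import java.util.Map;

public class MyIcebergJavaPlan extends PlanRun {
    {
        // Connection definition, as shown in Connection Configuration.
        var accountTask = iceberg(
                "customer_accounts",               // name
                "account.accounts",                // table name
                "/opt/app/data/customer/iceberg",  // warehouse path
                "hadoop",                          // catalog type
                "",                                // catalog uri
                Map.of()                           // additional options
        );

        // Assumed builder methods: set the report folder and enable the unique check.
        var config = configuration()
                .generatedReportsFolderPath("/tmp/data-caterer/report")
                .enableUniqueCheck(true);

        // Assumed overload: run the plan with this configuration and task.
        execute(config, accountTask);
    }
}
```

The report folder here matches the value entered in the UI steps, so both interfaces write the report to the same place.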
Run
Now we can run the class we just created via the ./run.sh script in the top-level directory of data-caterer-example.
In the UI:
- Click the Execute button at the top
- Progress updates will show in the bottom right corner
- Click on History at the top
- Check for your plan name and see the result summary
- Click on Report on the right side to see more details of what was executed
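Besides the report, a quick file-level check can confirm that a table was written. Iceberg's Hadoop catalog normally writes a metadata/version-hint.text file inside the table directory on each commit, so its presence is a reasonable sign that the run produced a table. The sketch below is a minimal check under that assumption. The class and method names are hypothetical, and the default path is the in-container location used in this guide, so pass a different directory as the first argument when checking from elsewhere.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
final class IcebergOutputCheck {
    private IcebergOutputCheck() {}

    /**
     * Returns true when tableDir contains metadata/version-hint.text, the
     * pointer file that Iceberg's Hadoop catalog normally writes on commit.
     */
    static boolean looksLikeHadoopCatalogTable(Path tableDir) {
        Path versionHint = tableDir.resolve("metadata").resolve("version-hint.text");
        return Files.isRegularFile(versionHint);
    }

    public static void main(String[] args) {
        // Default: warehouse path plus table name from this guide.
        Path tableDir = Path.of(args.length > 0
                ? args[0]
                : "/opt/app/data/customer/iceberg/account/accounts");
        System.out.println(tableDir + " -> " + looksLikeHadoopCatalogTable(tableDir));
    }
}
```

This only checks that the table layout exists. It says nothing about row counts or field values, which the generated report and the validation documentation cover.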
Congratulations! You have now made a data generator that simulates a real-world data scenario. You can also compare your plan against the IcebergJavaPlan.java or IcebergPlan.scala files to check that it is the same.
Validation
If you want to validate data from an Iceberg source, follow the validation documentation found here to help guide you.