ORC
Creating a data generator for ORC. You will have the ability to generate and validate ORC files via Docker.
Requirements
- 10 minutes
- Git
- Gradle
- Docker
Get Started
First, we will clone the data-caterer repo, which already contains the required base project setup.
Plan Setup
Create a file depending on which interface you want to use.
- Java: src/main/java/io/github/datacatering/plan/MyORCJavaPlan.java
- Scala: src/main/scala/io/github/datacatering/plan/MyORCPlan.scala
- YAML: docker/data/custom/plan/my-orc.yaml
In docker/data/custom/plan/my-orc.yaml:
- Click on Connection towards the top of the screen
- For connection name, set to my_orc
- Click on Select data source type.. and select ORC
- Set Path as /tmp/custom/orc/accounts
  - Optionally, we could set the number of partitions and columns to partition by
- Click on Create
- You should see your connection my_orc show under Existing connections
- Click on Home towards the top of the screen
- Set plan name to my_orc_plan
- Set task name to orc_task
- Click on Select data source.. and select my_orc
This class is where we define all of our configurations for generating data. It comes with helper variables and methods that make it simple and easy to use.
Connection Configuration
Within our class, we can start by defining the connection properties to read/write from/to ORC.
var accountTask = orc(
    "customer_accounts",                  //name
    "/opt/app/data/customer/account_orc", //path
    Map.of()                              //additional options
);
Additional options can be found here.
val accountTask = orc(
  "customer_accounts",                  //name
  "/opt/app/data/customer/account_orc", //path
  Map()                                 //additional options
)
In docker/data/custom/application.conf:
Additional options can be found here.
- We have already created the connection details in this step
Schema
Depending on how you want to define the schema, follow one of the options below:
- Manual schema guide
- Automatically detect the schema from the data source by enabling configuration.enableGeneratePlanAndTasks(true)
- Automatically detect the schema from a metadata source
Additional Configurations
At the end of data generation, a report is generated that summarises the actions that were performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure that any unique fields have unique values generated.
In docker/data/custom/application.conf:
- Click on Advanced Configuration towards the bottom of the screen
- Click on Flag and click on Unique Check
- Click on Folder and enter /tmp/data-caterer/report for Generated Reports Folder Path
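For the Java interface, the connection, report folder and unique check described above could be combined as in the sketch below. It assumes the Java API in the example project provides a PlanRun base class with configuration(), generatedReportsFolderPath(...), enableUniqueCheck(...) and execute(...), so check these names against the version you are using. The report path /opt/app/data/report is an illustrative value, not one taken from this guide.

```java
package io.github.datacatering.plan;

// Assumed import: the PlanRun base class from the Data Caterer Java API
import io.github.datacatering.datacaterer.java.api.PlanRun;

import java.util.Map;

// Sketch only: the class name and package follow the file path listed under Plan Setup
public class MyORCJavaPlan extends PlanRun {
    {
        // Connection to the ORC files, same values as in Connection Configuration
        var accountTask = orc(
                "customer_accounts",                  // name
                "/opt/app/data/customer/account_orc", // path
                Map.of()                              // additional options
        );

        // Report output folder (illustrative path) and unique check flag
        var config = configuration()
                .generatedReportsFolderPath("/opt/app/data/report")
                .enableUniqueCheck(true);

        // Register the configuration and the task so they are picked up at run time
        execute(config, accountTask);
    }
}
```

Keeping the configuration in its own variable makes it easy to add further flags later, such as the schema detection flag mentioned under Schema.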
Run
Now we can run the class we just created via the ./run.sh script, which is in the top-level directory of the data-caterer/example folder.
- Click the button Execute at the top
- Progress updates will show in the bottom right corner
- Click on History at the top
- Check for your plan name and see the result summary
- Click on Report on the right side to see more details of what was executed
Congratulations! You have now made a data generator that simulates a real-world data scenario. You can also compare your plan against the ORCJavaPlan.java or ORCPlan.scala files to check that it is the same.
Validation
If you want to validate data from an ORC source, follow the validation documentation found here.