Skip to content

Multiple Records Per Column

Creating a data generator for a CSV file where there are multiple records per column values.

Requirements

  • 5 minutes
  • Git
  • Gradle
  • Docker

Get Started

First, we will clone the data-caterer-example repo which will already have the base project setup required.

git clone git@github.com:data-catering/data-caterer-example.git
git clone git@github.com:data-catering/data-caterer-example.git
git clone git@github.com:data-catering/data-caterer-example.git

Plan Setup

Create a new Java or Scala class.

  • Java: src/main/java/io/github/datacatering/plan/MyMultipleRecordsPerColJavaPlan.java
  • Scala: src/main/scala/io/github/datacatering/plan/MyMultipleRecordsPerColPlan.scala

Make sure your class extends PlanRun.

import io.github.datacatering.datacaterer.java.api.PlanRun;
...

public class MyMultipleRecordsPerColJavaPlan extends PlanRun {
    {
        var transactionTask = csv("customer_transactions", "/opt/app/data/customer/transaction", Map.of("header", "true"))
                .schema(
                        field().name("account_id"),
                        field().name("full_name"),
                        field().name("amount").type(DoubleType.instance()).min(1).max(100),
                        field().name("time").type(TimestampType.instance()).min(java.sql.Date.valueOf("2022-01-01")),
                        field().name("date").type(DateType.instance()).sql("DATE(time)")
                );

        var config = configuration()
                .generatedReportsFolderPath("/opt/app/data/report")
                .enableUniqueCheck(true);

        execute(config, transactionTask);
    }
}
import io.github.datacatering.datacaterer.api.PlanRun
...

class MyMultipleRecordsPerColPlan extends PlanRun {

  val transactionTask = csv("customer_transactions", "/opt/app/data/customer/transaction", Map("header" -> "true"))
    .schema(
      field.name("account_id").regex("ACC[0-9]{8}"), 
      field.name("full_name").expression("#{Name.name}"), 
      field.name("amount").`type`(DoubleType.instance).min(1).max(100),
      field.name("time").`type`(TimestampType.instance).min(java.sql.Date.valueOf("2022-01-01")), 
      field.name("date").`type`(DateType.instance).sql("DATE(time)")
    )

  val config = configuration
    .generatedReportsFolderPath("/opt/app/data/report")

  execute(config, transactionTask)
}

Record Count

By default, tasks will generate 1000 records. You can alter this value via the count configuration which can be applied to individual tasks. For example, in Scala, csv(...).count(count.records(100)) to generate only 100 records.

Records Per Column

In this scenario, for a given account_id, full_name, there should be multiple records for it as we want to simulate a customer having multiple transactions. We can achieve this through defining the number of records to generate in the count function.

var transactionTask = csv("customer_transactions", "/opt/app/data/customer/transaction", Map.of("header", "true"))
        .schema(
                ...
        )
        .count(count().recordsPerColumn(5, "account_id", "full_name"));
val transactionTask = csv("customer_transactions", "/opt/app/data/customer/transaction", Map("header" -> "true"))
  .schema(
    ...
  )
  .count(count.recordsPerColumn(5, "account_id", "full_name"))

This will generate 1000 * 5 = 5000 records as the default number of records is set (1000) and per account_id, full_name from the initial 1000 records, 5 records will be generated.

Random Records Per Column

Generating 5 records per column is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate for this via defining a random number of records per column.

var transactionTask = csv("customer_transactions", "/opt/app/data/customer/transaction", Map.of("header", "true"))
        .schema(
                ...
        )
        .count(count().recordsPerColumnGenerator(generator().min(0).max(5), "account_id", "full_name"));
val transactionTask = csv("customer_transactions", "/opt/app/data/customer/transaction", Map("header" -> "true"))
  .schema(
    ...
  )
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id", "full_name"))

Here we set the minimum number of records per column to be 0 and the maximum to 5. This will follow a uniform distribution so the average number of records per account is 2.5. We could also define other metadata, just like we did with fields, when defining the generator. For example, we could set standardDeviation and mean for the number of records generated per column to follow a normal distribution.

Run

Let's try run.

#clean up old data
rm -rf docker/sample/customer/account
./run.sh
#input class MyMultipleRecordsPerColJavaPlan or MyMultipleRecordsPerColPlan
#after completing
head docker/sample/customer/transaction/part-00000*

It should look something like this.

ACC29117767,Willodean Sauer
ACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14
ACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22

You can now look to play around with other count configurations found here.