Guiding data processing with adverbial::step_by_step() in R

Implementing the right functions is essential for efficiently sharing data processing knowledge. However, finding the right balance between usability and customization can be challenging.

One approach to achieving this balance is to break the data processing flow into multiple functions. To ensure the effectiveness of this approach, users must have a clear understanding of the overall workflow.
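
As a minimal sketch of this approach in base R, the flow below is split into three small functions. The toy data and the function names (`drop_missing()`, `to_kg()`, `mean_by_island()`) are made up for illustration and are not part of any package.

```r
# Hypothetical toy data; not the penguins dataset.
weights <- data.frame(
  island = c("A", "A", "B", "B"),
  body_mass = c(3500, NA, 4200, 4000) # grams
)

# Each function covers one stage of the flow.
drop_missing <- function(data) {
  data[!is.na(data$body_mass), , drop = FALSE]
}

to_kg <- function(data) {
  transform(data, body_mass = body_mass / 1000)
}

mean_by_island <- function(data) {
  aggregate(body_mass ~ island, data = data, FUN = mean)
}

# Nothing in the functions themselves says in which order to call them;
# the user has to know the workflow.
result <- weights |>
  drop_missing() |>
  to_kg() |>
  mean_by_island()

print(result)
```

Each function is easy to customise, but the intended order lives only in the user's head or in separate documentation.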

One of the advantages of programming, as I see it, is that it lets me make explicit the implicit knowledge that people previously had to understand and remember accurately themselves. This principle inspired the implementation of step-by-step data processing functions in the R adverbial package.

The step-by-step data processing functionality provided by this package consists mainly of the following three functions.

step_by_step() defines a step-by-step data processing workflow.
as_step() converts a function into a step that can be used in the workflow.
end_step() ends the step-by-step workflow.

These functions provide a clear framework for data processing, making it easier to share knowledge and collaborate on similarly structured data analysis tasks.

Example

Let’s turn the following data processing pipeline into a step-by-step one. It uses the penguins data to calculate the average body mass of penguins by island, species and year.

library(tidyverse)

as_tibble(penguins) |>
  select(species, island, body_mass, year) |>
  filter(!is.na(body_mass)) |>
  mutate(
    # Convert body_mass to kg
    body_mass = units::set_units(body_mass, g) |>
      units::set_units("kg")
  ) |> 
  summarise(
    mean_body_mass = mean(body_mass),
    .by = c(island, species, year)
  ) |> 
  arrange(island, species, year)

# A tibble: 15 × 4
   island    species    year mean_body_mass
   <fct>     <fct>     <int>           [kg]
 1 Biscoe    Adelie     2007           3.62
 2 Biscoe    Adelie     2008           3.63
 3 Biscoe    Adelie     2009           3.86
 4 Biscoe    Gentoo     2007           5.07
 5 Biscoe    Gentoo     2008           5.02
 6 Biscoe    Gentoo     2009           5.14
 7 Dream     Adelie     2007           3.67
 8 Dream     Adelie     2008           3.76
 9 Dream     Adelie     2009           3.65
10 Dream     Chinstrap  2007           3.69
11 Dream     Chinstrap  2008           3.8 
12 Dream     Chinstrap  2009           3.72
13 Torgersen Adelie     2007           3.76
14 Torgersen Adelie     2008           3.86
15 Torgersen Adelie     2009           3.49

Defining a step-by-step data processing workflow

This process can be broken down into the following steps:

Select columns from the data frame with select().
Filter rows from the data frame with filter().
Mutate columns in the data frame with mutate().
Summarise the data frame with summarise().
Order rows in the data frame with arrange().

Thus, we can define a step-by-step data processing workflow using the adverbial package as follows. We can then use data_wrangler() to process the data step by step. The header of the printed output lists the steps in the workflow and names the next function to call, making the workflow easier to understand and customise.

library(adverbial)

data_wrangler <- step_by_step(
  c(
    select_step = "Select columns from the data frame",
    filter_step = "Filter rows from data frames",
    mutate_step = "Mutate columns in the data frame",
    summarise_step = "Summarise the data frame",
    arrange_step = "Order rows in the data frame"
  )
)

select_step <- as_step(select, "select_step")
filter_step <- as_step(filter, "filter_step")
mutate_step <- as_step(mutate, "mutate_step")
summarise_step <- as_step(summarise, "summarise_step")
arrange_step <- as_step(arrange, "arrange_step")

data_wrangler(as_tibble(penguins))

# Steps:
# ☐ 1. select_step:    Select columns from the data frame
# ☐ 2. filter_step:    Filter rows from data frames
# ☐ 3. mutate_step:    Mutate columns in the data frame
# ☐ 4. summarise_step: Summarise the data frame
# ☐ 5. arrange_step:   Order rows in the data frame
# ℹ Please call `select_step()` to continue.
#
# A tibble: 344 × 8
   species island    bill_len bill_dep flipper_len body_mass sex     year
 * <fct>   <fct>        <dbl>    <dbl>       <int>     <int> <fct>  <int>
 1 Adelie  Torgersen     39.1     18.7         181      3750 male    2007
 2 Adelie  Torgersen     39.5     17.4         186      3800 female  2007
 3 Adelie  Torgersen     40.3     18           195      3250 female  2007
 4 Adelie  Torgersen     NA       NA            NA        NA <NA>    2007
 5 Adelie  Torgersen     36.7     19.3         193      3450 female  2007
 6 Adelie  Torgersen     39.3     20.6         190      3650 male    2007
 7 Adelie  Torgersen     38.9     17.8         181      3625 female  2007
 8 Adelie  Torgersen     39.2     19.6         195      4675 male    2007
 9 Adelie  Torgersen     34.1     18.1         193      3475 <NA>    2007
10 Adelie  Torgersen     42       20.2         190      4250 <NA>    2007
# ℹ 334 more rows
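
The header can be thought of as a checklist rendered from the named vector of steps. The following base R sketch imitates that rendering. `format_steps()` is a hypothetical helper written for this illustration, not part of adverbial, and it assumes that ☒ marks a completed step and ☐ a pending one.

```r
# Hypothetical helper, not part of adverbial: build checklist lines from a
# named character vector of steps and the number of completed steps.
format_steps <- function(steps, n_done = 0) {
  index <- seq_along(steps)
  # "\u2612" is the checked box, "\u2610" the empty box.
  box <- ifelse(index <= n_done, "\u2612", "\u2610")
  # format() pads the labels to a common width, left-justified.
  labels <- format(paste0(names(steps), ":"))
  lines <- sprintf("%s %d. %s %s", box, index, labels, unname(steps))
  if (n_done < length(steps)) {
    next_step <- names(steps)[n_done + 1]
    lines <- c(lines, sprintf("Please call `%s()` to continue.", next_step))
  }
  lines
}

steps <- c(
  select_step = "Select columns from the data frame",
  filter_step = "Filter rows from data frames"
)

# Checklist after one completed step.
cat(format_steps(steps, n_done = 1), sep = "\n")
```

Keeping the descriptions in a single named vector means the documentation and the order of the steps come from the same place.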

Processing data step by step

You can process the data step by step as follows (here, up to step 3). You can also wrap any other function with as_step(), without giving a step name, to apply it partway through the workflow. If you specify an incorrect step name, an error occurs.

data_wrangler(as_tibble(penguins)) |> 
  select_step(species, island, body_mass, year) |>
  filter_step(!is.na(body_mass)) |>
  mutate_step(
    # Convert body_mass to kg
    body_mass = units::set_units(body_mass, g) |>
      units::set_units("kg")
  ) |> 
  # You can use another function during the step-by-step process.
  as_step(head)(n = 3)

# Steps:
# ☒ 1. select_step:    Select columns from the data frame
# ☒ 2. filter_step:    Filter rows from data frames
# ☒ 3. mutate_step:    Mutate columns in the data frame
# ☐ 4. summarise_step: Summarise the data frame
# ☐ 5. arrange_step:   Order rows in the data frame
# ℹ Please call `summarise_step()` to continue.
#
# A tibble: 3 × 4
  species island    body_mass  year
* <fct>   <fct>          [kg] <int>
1 Adelie  Torgersen      3.75  2007
2 Adelie  Torgersen      3.8   2007
3 Adelie  Torgersen      3.25  2007
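
To make the bookkeeping behind this kind of output more concrete, here is a toy imitation in base R. It is only a sketch and not how adverbial is implemented: `toy_workflow()`, `toy_step()` and `toy_end()` are hypothetical names, and the rule that a named step is rejected unless it is the next expected one is an assumption made for the example.

```r
# Toy base R imitation of step tracking; not part of adverbial.
toy_workflow <- function(steps) {
  function(data) {
    attr(data, "steps") <- names(steps) # expected order of step names
    attr(data, "n_done") <- 0L          # number of completed steps
    data
  }
}

toy_step <- function(f, name = NULL) {
  function(data, ...) {
    steps <- attr(data, "steps")
    n_done <- attr(data, "n_done")
    if (!is.null(name)) {
      expected <- steps[n_done + 1L]
      # Assumption: a named step must be the next expected one.
      if (!identical(name, expected)) {
        stop("Expected `", expected, "`, but `", name, "` was called.")
      }
      n_done <- n_done + 1L
    }
    out <- f(data, ...)
    # Re-attach the bookkeeping in case `f` dropped the attributes.
    attr(out, "steps") <- steps
    attr(out, "n_done") <- n_done
    out
  }
}

toy_end <- function(data) {
  # Strip the bookkeeping and return the plain result.
  attr(data, "steps") <- NULL
  attr(data, "n_done") <- NULL
  data
}

workflow <- toy_workflow(c(
  square_step = "Square the input",
  sum_step = "Sum the squares"
))
square_step <- toy_step(function(x) x^2, "square_step")
sum_step <- toy_step(sum, "sum_step")

result <- workflow(1:3) |>
  square_step() |>
  toy_step(rev)() |> # unnamed step: applied without advancing the checklist
  sum_step() |>
  toy_end()

# Calling a step out of order raises an error in this toy version.
msg <- tryCatch(
  workflow(1:3) |> sum_step(),
  error = function(e) conditionMessage(e)
)
```

The state travels with the data as attributes, so each step can check where the workflow stands before it runs.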

Completing the data processing

You can complete the data processing by adding the remaining steps and ending the workflow with end_step(). The following code performs almost the same processing as the original code. The only difference is that, after each step, it tells the user which function to apply next.

data_wrangler(as_tibble(penguins)) |> 
  select_step(species, island, body_mass, year) |>
  filter_step(!is.na(body_mass)) |>
  mutate_step(
    # Convert body_mass to kg
    body_mass = units::set_units(body_mass, g) |>
      units::set_units("kg")
  ) |> 
  summarise_step(
    mean_body_mass = mean(body_mass),
    .by = c(island, species, year)
  ) |> 
  arrange_step(island, species, year) |> 
  end_step()

# A tibble: 15 × 4
   island    species    year mean_body_mass
 * <fct>     <fct>     <int>           [kg]
 1 Biscoe    Adelie     2007           3.62
 2 Biscoe    Adelie     2008           3.63
 3 Biscoe    Adelie     2009           3.86
 4 Biscoe    Gentoo     2007           5.07
 5 Biscoe    Gentoo     2008           5.02
 6 Biscoe    Gentoo     2009           5.14
 7 Dream     Adelie     2007           3.67
 8 Dream     Adelie     2008           3.76
 9 Dream     Adelie     2009           3.65
10 Dream     Chinstrap  2007           3.69
11 Dream     Chinstrap  2008           3.8 
12 Dream     Chinstrap  2009           3.72
13 Torgersen Adelie     2007           3.76
14 Torgersen Adelie     2008           3.86
15 Torgersen Adelie     2009           3.49

Conclusion

The adverbial package provides a way of breaking down data processing tasks into smaller, more manageable steps. This approach allows users to understand the overall workflow and adapt it to their needs. The aim is to spare users from having to remember every step of a mentally demanding data processing workflow.