Guiding data processing with adverbial::step_by_step() in R
adverbial
English
Published: May 25, 2025
Implementing the right functions is essential for efficiently sharing data processing knowledge. However, finding the right balance between usability and customization can be challenging.
One approach to achieving this balance is to break the data processing flow into multiple functions. To ensure the effectiveness of this approach, users must have a clear understanding of the overall workflow.
One of the advantages of programming, as I see it, is that it lets me make explicit the implicit knowledge that people previously had to understand accurately on their own. This principle inspired the implementation of step-by-step data processing functions in the R adverbial package.
The step-by-step data processing functionality provided by this package consists mainly of the following three functions.
step_by_step() defines a step-by-step data processing workflow.
as_step() converts a function into a step that can be used in the workflow.
end_step() ends the step-by-step workflow.
These functions provide a clear framework for data processing, making it easier to share knowledge and collaborate on similarly structured data analysis tasks.
Example
Let’s turn the following data process into a step-by-step data process. This process uses the penguins data to calculate the average weight of penguins by island, species and year.
This process can be broken down into the following steps:
Select columns from the data frame with select().
Filter rows from data frames with filter().
Mutate columns in the data frame with mutate().
Summarise the data frame with summarise().
Order rows in the data frame with arrange().
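For reference, the five steps above correspond to an ordinary dplyr pipeline along the following lines. This is only a sketch: it assumes R 4.5 or later (where `penguins` ships with the datasets package) and dplyr 1.1.0 or later (for `.by`). The division by 1000 is a simplified stand-in for a grams-to-kilograms conversion, and the sort order in `arrange()` is an assumption made for illustration.

```r
library(dplyr)

# Plain dplyr version of the five steps (a sketch, not the step-by-step API)
result <- as_tibble(penguins) |>
  # 1. Keep only the columns needed for the summary
  select(species, island, body_mass, year) |>
  # 2. Drop penguins with no recorded body mass
  filter(!is.na(body_mass)) |>
  # 3. Convert body mass from grams to kilograms (simplified conversion)
  mutate(body_mass = body_mass / 1000) |>
  # 4. Average body mass by island, species and year
  summarise(body_mass = mean(body_mass), .by = c(island, species, year)) |>
  # 5. Order the rows (ordering chosen here for illustration)
  arrange(island, species, year)

result
```

Nothing in this pipeline tells a reader which verb is expected next or how many stages there are, which is the gap the step-by-step workflow is meant to fill.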
Thus, we can define a step-by-step data processing workflow using the adverbial package as follows. Calling data_wrangler() then starts the process: it prints the steps in a header above the data, making the workflow easier to understand and customise.
library(dplyr)
library(adverbial)

data_wrangler <- step_by_step(c(
  select_step = "Select columns from the data frame",
  filter_step = "Filter rows from data frames",
  mutate_step = "Mutate columns in the data frame",
  summarise_step = "Summarise the data frame",
  arrange_step = "Order rows in the data frame"
))

# Wrap each dplyr verb as the step with the matching name
select_step <- as_step(select, "select_step")
filter_step <- as_step(filter, "filter_step")
mutate_step <- as_step(mutate, "mutate_step")
summarise_step <- as_step(summarise, "summarise_step")
arrange_step <- as_step(arrange, "arrange_step")
data_wrangler(as_tibble(penguins))
# Steps:
# ☒ 1. select_step: Select columns from the data frame
# ☐ 2. filter_step: Filter rows from data frames
# ☐ 3. mutate_step: Mutate columns in the data frame
# ☐ 4. summarise_step: Summarise the data frame
# ☐ 5. arrange_step: Order rows in the data frame
# ℹ Please call `select_step()` to continue.
#
# A tibble: 344 × 8
species island bill_len bill_dep flipper_len body_mass sex year
* <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 Adelie Torgersen 40.3 18 195 3250 female 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 female 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# ℹ 334 more rows
Processing data step by step
You can process the data step by step as follows (up to step 3). as_step() can also wrap any other function, such as head(), so that it can be used in the middle of the workflow. If you specify an incorrect step name, an error occurs.
data_wrangler(as_tibble(penguins)) |>
  select_step(species, island, body_mass, year) |>
  filter_step(!is.na(body_mass)) |>
  mutate_step(
    # Convert body_mass to kg
    body_mass = units::set_units(body_mass, g) |>
      units::set_units("kg")
  ) |>
  # You can use another function during the step-by-step process.
  as_step(head)(n = 3)
# Steps:
# ☒ 1. select_step: Select columns from the data frame
# ☒ 2. filter_step: Filter rows from data frames
# ☒ 3. mutate_step: Mutate columns in the data frame
# ☐ 4. summarise_step: Summarise the data frame
# ☐ 5. arrange_step: Order rows in the data frame
# ℹ Please call `summarise_step()` to continue.
#
# A tibble: 3 × 8
species island bill_len bill_dep flipper_len body_mass sex year
* <fct> <fct> <dbl> <dbl> <int> [kg] <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3.75 male 2007
2 Adelie Torgersen 39.5 17.4 186 3.8 female 2007
3 Adelie Torgersen 40.3 18 195 3.25 female 2007
Completing the data processing
You can complete the data processing by adding the remaining steps and ending the workflow with end_step(). The following code performs almost the same processing as the original code, the only difference being that it successively informs the user which function to apply next.
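A sketch of what the completed workflow might look like. It repeats the workflow definition so that it runs on its own, and it rests on several assumptions: that end_step() accepts the piped workflow object and returns the processed data, that arguments such as `.by` are passed through to the wrapped dplyr verb, that dividing by 1000 is an acceptable stand-in for the units conversion, and that the ordering in the last step is chosen only for illustration.

```r
library(dplyr)
library(adverbial)

# Workflow definition, repeated here so the sketch is self-contained
data_wrangler <- step_by_step(c(
  select_step = "Select columns from the data frame",
  filter_step = "Filter rows from data frames",
  mutate_step = "Mutate columns in the data frame",
  summarise_step = "Summarise the data frame",
  arrange_step = "Order rows in the data frame"
))

select_step <- as_step(select, "select_step")
filter_step <- as_step(filter, "filter_step")
mutate_step <- as_step(mutate, "mutate_step")
summarise_step <- as_step(summarise, "summarise_step")
arrange_step <- as_step(arrange, "arrange_step")

result <- data_wrangler(as_tibble(penguins)) |>
  select_step(species, island, body_mass, year) |>
  filter_step(!is.na(body_mass)) |>
  # Grams to kilograms (simplified conversion, an assumption of this sketch)
  mutate_step(body_mass = body_mass / 1000) |>
  # Assumes `.by` is forwarded to dplyr::summarise()
  summarise_step(body_mass = mean(body_mass), .by = c(island, species, year)) |>
  # Ordering chosen for illustration
  arrange_step(island, species, year) |>
  # Assumed to close the workflow and return the processed data
  end_step()

result
```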
The adverbial package provides a way of breaking down data processing tasks into smaller, more manageable steps. This approach allows users to understand the overall workflow and adapt it to their needs. The aim is to spare analysts from having to memorise every mentally demanding step of a data processing workflow.