Skip to contents

The {dm} package offers functions to work with relational data models in R.

This document introduces you to filtering functions, and shows how to apply them to the data that is separated into multiple tables.

Our example data is drawn from the {nycflights13} package that contains five inter-linked tables.

First, we will load the packages that we need:

Data: nycflights13

To explore filtering with {dm}, we’ll use the {nycflights13} data with its flights, planes, airlines, airports and weather tables.

This dataset contains information about the 336 776 flights that departed from New York City in 2013, with 3322 different planes and 1458 airports involved. The data comes from the US Bureau of Transportation Statistics, and is documented in ?nycflights13::flights.

To start with our exploration, we have to create a dm object from the {nycflights13} data. The built-in dm::dm_nycflights13() function takes care of this.

By default it only uses a subset of the complete data though: only the flights on the 10th of each month are considered, reducing the number of rows in the flights table to 11 227.

A data model object contains data from the source tables, and metadata about the tables.

If you would like to create a dm object from tables other than the example data, you can use the new_dm(), dm() or as_dm() functions. See vignette("howto-dm-df") for details.

The console output of the ’dm` object shows its data and metadata, and is colored for clarity:

dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 4

Now we know that there are five tables in our dm object. But how are they connected? These relations are best displayed as a visualization of the entity-relationship model:

%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumoriginorigin, time_hourflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weatherorigin, time_hourflights:origin, time_hour->weather:origin, time_hour

You can look at a single table with tbl. To print the airports table, call

tbl(dm, "airports")
#> Warning: `tbl.dm()` was deprecated in dm 0.2.0.
#> Use `dm[[table_name]]` instead to access a specific table.
#> # A tibble: 86 × 8
#>    faa   name                            lat    lon   alt    tz dst   tzone
#>    <chr> <chr>                         <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
#>  1 ALB   Albany Intl                    42.7  -73.8   285    -5 A     Amer…
#>  2 ATL   Hartsfield Jackson Atlanta I…  33.6  -84.4  1026    -5 A     Amer…
#>  3 AUS   Austin Bergstrom Intl          30.2  -97.7   542    -6 A     Amer…
#>  4 BDL   Bradley Intl                   41.9  -72.7   173    -5 A     Amer…
#>  5 BHM   Birmingham Intl                33.6  -86.8   644    -6 A     Amer…
#>  6 BNA   Nashville Intl                 36.1  -86.7   599    -6 A     Amer…
#>  7 BOS   General Edward Lawrence Loga…  42.4  -71.0    19    -5 A     Amer…
#>  8 BTV   Burlington Intl                44.5  -73.2   335    -5 A     Amer…
#>  9 BUF   Buffalo Niagara Intl           42.9  -78.7   724    -5 A     Amer…
#> 10 BUR   Bob Hope                       34.2 -118.    778    -8 A     Amer…
#> # … with 76 more rows

Filtering a dm object

dm_filter() allows you to select a subset of a dm object.

How it works

Filtering a dm object is not that different from filtering a dataframe or tibble with dplyr::filter().

The corresponding {dm} function is dm::dm_filter(). With this function one or more filtering conditions can be set for one of the tables of the dm object. These conditions are immediately evaluated for their respective tables and for all related tables. For each resulting table, all related tables (directly or indirectly) with a filter condition them are taken into account in the following way: - filtering semi-joins are successively performed along the paths from each of the filtered tables to the requested table, each join reducing the left-hand side tables of the joins to only those of their rows with key values that have corresponding values in key columns of the right-hand side tables of the join. - eventually the requested table is returned, containing only the the remaining rows after the filtering joins

Currently, this only works if the graph induced by the foreign key relations is cycle free. Fortunately, this is the default for dm_nycflights13().

Filtering Examples

Let’s see filtering in action:

We only want the data that is related to John F. Kennedy International Airport.

filtered_dm <-
  dm %>%
  dm_filter(airports = (name == "John F Kennedy Intl"))
filtered_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 4

You can get the numbers of rows of each table with dm_nrow().

rows_per_table <-
  filtered_dm %>%
  dm_nrow()
rows_per_table
#> airlines airports  flights   planes  weather 
#>       10        1      602      336       38
sum(rows_per_table)
#> [1] 987
sum_nrow <- sum(dm_nrow(dm))
sum_nrow_filtered <- sum(dm_nrow(dm_apply_filters(filtered_dm)))
#> Warning: `dm_apply_filters()` was deprecated in dm 1.0.0.
#> Calling `dm_apply_filters()` after `dm_filter()` is no longer necessary.

The total number of rows in the dm drops from 2 951 to 987 (the only unaffected table is the disconnected weather table).

Next example:

Get a dm object containing data for flights from New York to the Dulles International Airport in Washington D.C., abbreviated with IAD.

dm %>%
  dm_filter(flights = (dest == "IAD")) %>%
  dm_nrow()
#> airlines airports  flights   planes  weather 
#>        4        3       32       28       30

Applying multiple filters to different tables is also supported.

An example:

Get all January 2013 flights from Delta Air Lines which didn’t depart from John F. Kennedy International Airport.

dm_delta_may <-
  dm %>%
  dm_filter(
    airlines = (name == "Delta Air Lines Inc."),
    airports = (name != "John F Kennedy Intl"),
    flights = (month == 1)
  )
dm_delta_may
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 4
dm_delta_may %>%
  dm_nrow()
#> airlines airports  flights   planes  weather 
#>        1        2       75       58       25

You can inspect the filtered tables by subsetting them.

In the airlines table, Delta is the only remaining carrier:

dm_delta_may$airlines
#> # A tibble: 1 × 2
#>   carrier name                
#>   <chr>   <chr>               
#> 1 DL      Delta Air Lines Inc.

Which planes were used to service these flights?

dm_delta_may$planes
#> # A tibble: 58 × 9
#>    tailnum  year type              manuf…¹ model engines seats speed engine
#>    <chr>   <int> <chr>             <chr>   <chr>   <int> <int> <int> <chr> 
#>  1 N302NB   1999 Fixed wing multi… AIRBUS… A319…       2   145    NA Turbo…
#>  2 N304DQ   2008 Fixed wing multi… BOEING  737-…       2   149    NA Turbo…
#>  3 N306DQ   2009 Fixed wing multi… BOEING  737-…       2   149    NA Turbo…
#>  4 N307DQ   2009 Fixed wing multi… BOEING  737-…       2   149    NA Turbo…
#>  5 N309US   1990 Fixed wing multi… AIRBUS… A320…       2   182    NA Turbo…
#>  6 N316US   1991 Fixed wing multi… AIRBUS… A320…       2   182    NA Turbo…
#>  7 N317NB   2000 Fixed wing multi… AIRBUS… A319…       2   145    NA Turbo…
#>  8 N318NB   2000 Fixed wing multi… AIRBUS… A319…       2   145    NA Turbo…
#>  9 N318US   1991 Fixed wing multi… AIRBUS… A320…       2   182    NA Turbo…
#> 10 N322NB   2001 Fixed wing multi… AIRBUS… A319…       2   145    NA Turbo…
#> # … with 48 more rows, and abbreviated variable name ¹​manufacturer

And indeed, all included flights departed in January (month == 1):

dm_delta_may$flights %>%
  count(month)
#> # A tibble: 1 × 2
#>   month     n
#>   <int> <int>
#> 1     1    75

For comparison, let’s review the equivalent manual query for flights in dplyr syntax:

airlines_filtered <- filter(airlines, name == "Delta Air Lines Inc.")
airports_filtered <- filter(airports, name != "John F Kennedy Intl")
flights %>%
  semi_join(airlines_filtered, by = "carrier") %>%
  semi_join(airports_filtered, by = c("origin" = "faa")) %>%
  filter(month == 5)
#> # A tibble: 2,340 × 19
#>     year month   day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#>    <int> <int> <int>    <int>         <int>   <dbl>   <int>   <int>   <dbl>
#>  1  2013     5     1      554           600      -6     731     756     -25
#>  2  2013     5     1      555           600      -5     819     827      -8
#>  3  2013     5     1      603           610      -7     754     809     -15
#>  4  2013     5     1      622           630      -8     848     844       4
#>  5  2013     5     1      654           700      -6     931     950     -19
#>  6  2013     5     1      655           700      -5     944    1007     -23
#>  7  2013     5     1      656           705      -9    1005    1011      -6
#>  8  2013     5     1      658           700      -2     925     928      -3
#>  9  2013     5     1      743           745      -2    1014    1004      10
#> 10  2013     5     1      755           800      -5     929    1001     -32
#> # … with 2,330 more rows, 10 more variables: carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and
#> #   abbreviated variable names ¹​sched_dep_time, ²​dep_delay, ³​arr_time,
#> #   ⁴​sched_arr_time, ⁵​arr_delay

The {dm} code is leaner because the foreign key information is encoded in the object.

SQL statements behind filtering a dm object on a database

{dm} is meant to work with relational data models, locally as well as on databases. In your project, the data is probably not stored locally but in a remote relational database that can be queried with SQL statements.

You can check the queries by using sql_render() from the {dbplyr} package.

Example:

Print the SQL statements for getting all January 2013 flights from Delta Air Lines, which did not depart from John F. Kennedy International Airport, with the data stored in a sqlite database.

To show the SQL query behind a dm_filter(), we copy the flights, airlines and airports tables from the nyflights13 dataset to a temporary in-memory database using the built-in function copy_dm_to() and dbplyr::src_memdb.

Then we filter the data, and print the corresponding SQL statement with dbplyr::sql_render().

dm %>%
  dm_select_tbl(flights, airlines, airports) %>%
  copy_dm_to(dbplyr::src_memdb(), .) %>%
  dm_filter(
    airlines = (name == "Delta Air Lines Inc."),
    airports = (name != "John F Kennedy Intl"),
    flights = (month == 1)
  ) %>%
  dm_get_tables() %>%
  map(dbplyr::sql_render)
#> $flights
#> <SQL> SELECT * FROM (
#>   SELECT * FROM (
#>     SELECT *
#>     FROM `flights_1_2020_08_28_07_13_03_12345`
#>     WHERE ((`month` = 1.0))
#> ) AS `LHS`
#>   WHERE EXISTS (
#>     SELECT 1 FROM (
#>     SELECT *
#>     FROM `airlines_1_2020_08_28_07_13_03_12345`
#>     WHERE ((`name` = 'Delta Air Lines Inc.'))
#> ) AS `RHS`
#>     WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#>   )
#> ) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (
#>   SELECT *
#>   FROM `airports_1_2020_08_28_07_13_03_12345`
#>   WHERE ((`name` != 'John F Kennedy Intl'))
#> ) AS `RHS`
#>   WHERE (`LHS`.`origin` = `RHS`.`faa`)
#> )
#> 
#> $airlines
#> <SQL> SELECT * FROM (
#>   SELECT *
#>   FROM `airlines_1_2020_08_28_07_13_03_12345`
#>   WHERE ((`name` = 'Delta Air Lines Inc.'))
#> ) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (
#>   SELECT * FROM (
#>     SELECT *
#>     FROM `flights_1_2020_08_28_07_13_03_12345`
#>     WHERE ((`month` = 1.0))
#> ) AS `LHS`
#>   WHERE EXISTS (
#>     SELECT 1 FROM (
#>     SELECT *
#>     FROM `airports_1_2020_08_28_07_13_03_12345`
#>     WHERE ((`name` != 'John F Kennedy Intl'))
#> ) AS `RHS`
#>     WHERE (`LHS`.`origin` = `RHS`.`faa`)
#>   )
#> ) AS `RHS`
#>   WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#> )
#> 
#> $airports
#> <SQL> SELECT * FROM (
#>   SELECT *
#>   FROM `airports_1_2020_08_28_07_13_03_12345`
#>   WHERE ((`name` != 'John F Kennedy Intl'))
#> ) AS `LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM (
#>   SELECT * FROM (
#>     SELECT *
#>     FROM `flights_1_2020_08_28_07_13_03_12345`
#>     WHERE ((`month` = 1.0))
#> ) AS `LHS`
#>   WHERE EXISTS (
#>     SELECT 1 FROM (
#>     SELECT *
#>     FROM `airlines_1_2020_08_28_07_13_03_12345`
#>     WHERE ((`name` = 'Delta Air Lines Inc.'))
#> ) AS `RHS`
#>     WHERE (`LHS`.`carrier` = `RHS`.`carrier`)
#>   )
#> ) AS `RHS`
#>   WHERE (`LHS`.`faa` = `RHS`.`origin`)
#> )

Further reading: {dm}’s function for copying data from and to databases.