Skip to contents

dm allows you to create your own relational data models from local data frames. Once your data model is complete, you can deploy it to a range of database management systems (DBMS) using {dm}.

Creating a dm object from data frames

The example data set that we will be using is available through the nycflights13 package. The five tables that we are working with contain information about all flights that departed from the airports of New York to other destinations in the United States in 2013:

  • flights represents the trips taken by planes
  • airlines includes
    • the names of transport organizations (name)
    • their abbreviated codes (carrier)
  • airports indicates the ports of departure (origin) and of destination (dest)
  • weather contains meteorological information at each hour
  • planes describes characteristics of the aircraft

Once we’ve loaded {nycflights13}, the aforementioned tables are all in our work environment, ready to be accessed.

#> # A tibble: 1,458 × 8
#>    faa   name                            lat    lon   alt    tz dst   tzone
#>    <chr> <chr>                         <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
#>  1 04G   Lansdowne Airport              41.1  -80.6  1044    -5 A     Amer…
#>  2 06A   Moton Field Municipal Airport  32.5  -85.7   264    -6 A     Amer…
#>  3 06C   Schaumburg Regional            42.0  -88.1   801    -6 A     Amer…
#>  4 06N   Randall Airport                41.4  -74.4   523    -5 A     Amer…
#>  5 09J   Jekyll Island Airport          31.1  -81.4    11    -5 A     Amer…
#>  6 0A9   Elizabethton Municipal Airpo…  36.4  -82.2  1593    -5 A     Amer…
#>  7 0G6   Williams County Airport        41.5  -84.5   730    -5 A     Amer…
#>  8 0G7   Finger Lakes Regional Airport  42.9  -76.8   492    -5 A     Amer…
#>  9 0P2   Shoestring Aviation Airfield   39.8  -76.6  1000    -5 U     Amer…
#> 10 0S9   Jefferson County Intl          48.1 -123.    108    -8 A     Amer…
#> # ℹ 1,448 more rows

Your own data will probably not be available as an R package. Whatever format it is in, you will need to be able to load it as data frames into your R session. If the data is too large, consider using dm to connect to the database instead. See vignette("howto-dm-db") for details on using dm with databases.

Adding Tables

Our first step will be to tell dm which tables we want to work with and how they are connected. For that we can use dm(), passing in the table names as arguments.

library(dm)

flights_dm_no_keys <- dm(airlines, airports, flights, planes, weather)
flights_dm_no_keys
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0

The as_dm() function is an alternative that works if you already have a list of tables.

A dm is a list

dm objects behave like lists with a user- and console-friendly print format. In fact, using a dm as a nicer list for organizing your data frames in your environment is an easy first step towards using dm and its data modeling functionality.

Subsetting syntax for a dm object (either by subscript or by name) is similar to syntax for lists, and so you don’t need to learn any additional syntax to work with dm objects.

names(flights_dm_no_keys)
#> [1] "airlines" "airports" "flights"  "planes"   "weather"
flights_dm_no_keys$airports
#> # A tibble: 1,458 × 8
#>    faa   name                            lat    lon   alt    tz dst   tzone
#>    <chr> <chr>                         <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
#>  1 04G   Lansdowne Airport              41.1  -80.6  1044    -5 A     Amer…
#>  2 06A   Moton Field Municipal Airport  32.5  -85.7   264    -6 A     Amer…
#>  3 06C   Schaumburg Regional            42.0  -88.1   801    -6 A     Amer…
#>  4 06N   Randall Airport                41.4  -74.4   523    -5 A     Amer…
#>  5 09J   Jekyll Island Airport          31.1  -81.4    11    -5 A     Amer…
#>  6 0A9   Elizabethton Municipal Airpo…  36.4  -82.2  1593    -5 A     Amer…
#>  7 0G6   Williams County Airport        41.5  -84.5   730    -5 A     Amer…
#>  8 0G7   Finger Lakes Regional Airport  42.9  -76.8   492    -5 A     Amer…
#>  9 0P2   Shoestring Aviation Airfield   39.8  -76.6  1000    -5 U     Amer…
#> 10 0S9   Jefferson County Intl          48.1 -123.    108    -8 A     Amer…
#> # ℹ 1,448 more rows
flights_dm_no_keys[c("airports", "flights")]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airports`, `flights`
#> Columns: 27
#> Primary keys: 0
#> Foreign keys: 0

Defining Keys

Even though we now have a dm object that contains all our data, we have not specified how our five tables are connected. To do this, we need to define primary keys and foreign keys on the tables.

Primary keys and foreign keys are how relational database tables are linked with each other. A primary key is a column or column tuple that has a unique value for each row within a table. A foreign key is a column or column tuple containing the primary key for a row in another table. Foreign keys act as cross references between tables. They specify the relationships that gives us the relational database. For more information on keys and a crash course on databases, see vignette("howto-dm-theory").

Primary Keys

dm offers dm_enum_pk_candidates() to identify viable primary keys for a table in the dm object, and dm_add_pk() to add them.

dm_enum_pk_candidates(
  dm = flights_dm_no_keys,
  table = planes
)
#> # A tibble: 9 × 3
#>   columns      candidate why                                               
#>   <keys>       <lgl>     <chr>                                             
#> 1 tailnum      TRUE      ""                                                
#> 2 year         FALSE     "has duplicate values: 2001 (284), 2000 (244), 20…
#> 3 type         FALSE     "has duplicate values: Fixed wing multi engine (3…
#> 4 manufacturer FALSE     "has duplicate values: BOEING (1630), AIRBUS INDU…
#> 5 model        FALSE     "has duplicate values: 737-7H4 (361), A320-232 (2…
#> 6 engines      FALSE     "has duplicate values: 2 (3288), 1 (27), 4 (4), 3…
#> 7 seats        FALSE     "has duplicate values: 149 (452), 140 (411), 55 (…
#> 8 speed        FALSE     "has 3299 missing values, and duplicate values: 4…
#> 9 engine       FALSE     "has duplicate values: Turbo-fan (2750), Turbo-je…

Now, we can add the identified primary keys:

flights_dm_only_pks <-
  flights_dm_no_keys %>%
  dm_add_pk(table = airlines, columns = carrier) %>%
  dm_add_pk(airports, faa) %>%
  dm_add_pk(planes, tailnum) %>%
  dm_add_pk(weather, c(origin, time_hour))
flights_dm_only_pks
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 0

Note that {dm} functions work with both named and positional argument specification, and compound keys can be specified using a vector argument.

Foreign Keys

To define how our tables are related, we use dm_add_fk() to add foreign keys. Naturally, this function will deal with two tables: a table looking for a reference, and a table that is providing the reference. Accordingly, while calling dm_add_fk(), the table argument specifies the table that needs a foreign key to link it to a second table, and the ref_table argument specifies the table to be linked to, which needs a primary key already defined for it.

dm_enum_fk_candidates(
  dm = flights_dm_only_pks,
  table = flights,
  ref_table = airlines
)
#> # A tibble: 19 × 3
#>    columns        candidate why                                            
#>    <keys>         <lgl>     <chr>                                          
#>  1 carrier        TRUE      ""                                             
#>  2 year           FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#>  3 month          FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#>  4 day            FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#>  5 dep_time       FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#>  6 sched_dep_time FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#>  7 dep_delay      FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#>  8 arr_time       FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#>  9 sched_arr_time FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 10 arr_delay      FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 11 flight         FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 12 tailnum        FALSE     "values of `flights$tailnum` not in `airlines$…
#> 13 origin         FALSE     "values of `flights$origin` not in `airlines$c…
#> 14 dest           FALSE     "values of `flights$dest` not in `airlines$car…
#> 15 air_time       FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 16 distance       FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 17 hour           FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 18 minute         FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 19 time_hour      FALSE     "\u001b[1m\u001b[22mCan't join `x$value1` with…

Having chosen a column from the successful candidates provided by dm_enum_fk_candidates(), we use the dm_add_fk() function to establish the foreign key linking the tables. In the second call to dm_add_fk() we complete the process for the flights and airlines tables that we started above. The carrier column in the airlines table will be added as the foreign key in flights.

flights_dm_all_keys <-
  flights_dm_only_pks %>%
  dm_add_fk(table = flights, columns = tailnum, ref_table = planes) %>%
  dm_add_fk(flights, carrier, airlines) %>%
  dm_add_fk(flights, origin, airports) %>%
  dm_add_fk(flights, c(origin, time_hour), weather)
flights_dm_all_keys
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 4

Having created the required primary and foreign keys to link all the tables together, we now have a relational data model we can work with.

Visualization

Visualizing a data model is a quick and easy way to verify that we have successfully created the model we were aiming for. We can use dm_draw() at any stage of the process to generate a visual representation of the tables and the links between them:

flights_dm_no_keys %>%
  dm_draw(rankdir = "TB", view_type = "all")
%0 airlines airlinescarriernameairports airportsfaanamelatlonalttzdsttzoneflights flightsyearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hourplanes planestailnumyeartypemanufacturermodelenginesseatsspeedengineweather weatheroriginyearmonthdayhourtempdewphumidwind_dirwind_speedwind_gustprecippressurevisibtime_hour
flights_dm_no_keys %>%
  dm_add_pk(airlines, carrier) %>%
  dm_draw()
%0 airlines airlinescarrierairports airportsflights flightsplanes planesweather weather
flights_dm_only_pks %>%
  dm_add_fk(flights, tailnum, planes) %>%
  dm_draw()
%0 airlines airlinescarrierairports airportsfaaflights flightstailnumplanes planestailnumflights:tailnum->planes:tailnum weather weatherorigin, time_hour
flights_dm_all_keys %>%
  dm_draw()
%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumoriginorigin, time_hourflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weatherorigin, time_hourflights:origin, time_hour->weather:origin, time_hour

Integrity Checks

As well as checking our data model visually, dm can examine the constraints we have created by the addition of keys and verify that they are sensible.

flights_dm_no_keys %>%
  dm_examine_constraints()
#> ℹ No constraints defined.

flights_dm_only_pks %>%
  dm_examine_constraints()
#> ℹ All constraints satisfied.

flights_dm_all_keys %>%
  dm_examine_constraints()
#> ! Unsatisfied constraints:
#> • Table `flights`: foreign key `tailnum` into table `planes`: values of `flights$tailnum` not in `planes$tailnum`: N725MQ (575), N722MQ (513), N723MQ (507), N713MQ (483), N735MQ (396), …
#> • Table `flights`: foreign key `origin`, `time_hour` into table `weather`: values of `flights$origin`, `flights$time_hour` not in `weather$origin`, `weather$time_hour`: EWR, 2013-10-23 06:00:00 (34), EWR, 2013-08-19 17:00:00 (26), EWR, 2013-12-31 06:00:00 (26), EWR, 2013-12-31 07:00:00 (26), JFK, 2013-08-19 17:00:00 (26), …

The results are presented in a human-readable form, and available as a tibble for programmatic inspection.

Programming

Helper functions are available to access details on keys and check results.

A data frame of primary keys is retrieved with dm_get_all_pks():

flights_dm_only_pks %>%
  dm_get_all_pks()
#> # A tibble: 4 × 3
#>   table    pk_col            autoincrement
#>   <chr>    <keys>            <lgl>        
#> 1 airlines carrier           FALSE        
#> 2 airports faa               FALSE        
#> 3 planes   tailnum           FALSE        
#> 4 weather  origin, time_hour FALSE

Similarly, a data frame of foreign keys is retrieved with dm_get_all_fks():

flights_dm_all_keys %>%
  dm_get_all_fks()
#> # A tibble: 4 × 5
#>   child_table child_fk_cols     parent_table parent_key_cols   on_delete
#>   <chr>       <keys>            <chr>        <keys>            <chr>    
#> 1 flights     carrier           airlines     carrier           no_action
#> 2 flights     origin            airports     faa               no_action
#> 3 flights     tailnum           planes       tailnum           no_action
#> 4 flights     origin, time_hour weather      origin, time_hour no_action

We can use tibble::as_tibble() on the result of dm_examine_constraints() to programmatically inspect which constraints are not satisfied:

flights_dm_all_keys %>%
  dm_examine_constraints() %>%
  tibble::as_tibble()
#> # A tibble: 8 × 6
#>   table    kind  columns           ref_table is_key problem                
#>   <chr>    <chr> <keys>            <chr>     <lgl>  <chr>                  
#> 1 flights  FK    tailnum           planes    FALSE  "values of `flights$ta…
#> 2 flights  FK    origin, time_hour weather   FALSE  "values of `flights$or…
#> 3 airlines PK    carrier           NA        TRUE   ""                     
#> 4 airports PK    faa               NA        TRUE   ""                     
#> 5 planes   PK    tailnum           NA        TRUE   ""                     
#> 6 weather  PK    origin, time_hour NA        TRUE   ""                     
#> 7 flights  FK    carrier           airlines  TRUE   ""                     
#> 8 flights  FK    origin            airports  TRUE   ""

Conclusion

In this tutorial, we have demonstrated how simple it is to create relational data models from local data frames using {dm}, including setting primary and foreign keys and visualizing the resulting relational model.

Further reading

vignette("howto-dm-db") – This article covers accessing and working with RDBMSs within your R session, including manipulating data, filling in missing relationships between tables, getting data out of the RDBMS and into your model, and deploying your data model to an RDBMS.

vignette("howto-dm-theory") – Do you know all about data frames but very little about relational data models? This quick introduction will walk you through the key similarities and differences, and show you how to move from individual data frames to a relational data model.