vignettes/extract_data.Rmd
inspectEHR contains a number of helper functions for wrangling data. In due course, these will be ported over to cleanEHR to keep the roles of the two packages clear; in theory cleanEHR = cleaning/wrangling, inspectEHR = verification/validation.
First, install inspectEHR from GitHub, or download the development release directly.
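A minimal installation sketch follows; the "CC-HIC/inspectEHR" repository path is an assumption, so check the package README for the canonical location before running.
# Install the development version from GitHub.
# The repository path below is assumed; confirm it against the README.
install.packages("remotes")
remotes::install_github("CC-HIC/inspectEHR")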
Establish a database connection. You will most likely be working with SQLite, as the PostgreSQL instance is confined to the UCL IDHS. inspectEHR ships with a synthetic database that you can play around with. The content of the data is more or less nonsense, but it is structurally sound.
library(inspectEHR)
# Synthetic database ships with inspectEHR
db_pth <- system.file("testdata/synthetic_db.sqlite3", package = "inspectEHR")
ctn <- connect(sqlite_file = db_pth)
Enter the fields you want to extract using HIC codes. See the cleanEHR repo for more information on these codes. Alternatively, the qref object from inspectEHR contains all the information on what data exists in CC-HIC. I write the codes here as a tribble to keep everything in the same document. If you want to rename columns, it is important to keep the new names and the code names together and in the same order.
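If you want to browse the available data items interactively, you can look at qref directly. A quick sketch, assuming qref is available once the package is loaded (its exact column layout is not reproduced here):
# Browse the CC-HIC data item reference that ships with inspectEHR.
# Treat this as a way to explore qref, not a documented schema.
library(inspectEHR)
head(qref)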
It’s best to separate out 1d (time invariant) and 2d (longitudinal) data.
# Grab non-longitudinal (1d) data ----------------
demo_codes <- tibble::tribble(
  ~hic_codes,          ~short_name,
  "NIHR_HIC_ICU_0399", "primary_admission_reason",
  "NIHR_HIC_ICU_0097", "unit_mortality"
)
# Extract static variables. Rename on the fly.
dtb <- extract_demographics(
  connection = ctn,
  episode_ids = 13639:13643,
  code_names = demo_codes$hic_codes,
  rename = demo_codes$short_name
)
head(dtb)
#> # A tibble: 5 x 3
#>   episode_id unit_mortality primary_admission_reason
#>        <int> <chr>          <chr>
#> 1      13639 A              01.05.09.12.13
#> 2      13640 A              02.08.10.10.07
#> 3      13641 A              02.01.07.17.07
#> 4      13642 A              02.12.04.26.16
#> 5      13643 A              01.09.02.51.01
We can now extract 2d data. This returns a table with one row per patient per time unit, as specified by the cadance argument. For example, a cadance of 1 returns a table with one row per patient per hour. No row is created if nothing was recorded, so the rows might run 1, 2, 4 if nothing was recorded in hour 3. Note how a summary function can be provided to coalesce_rows. This determines the behaviour when collapsing data items into rows below the natural resolution of recording.
# Grab longitudinal (2d) data ---------------------
long_codes <- tibble::tribble(
  ~hic_codes,          ~short_name, ~func,
  "NIHR_HIC_ICU_0108", "hr",        mean
)
# Extract time varying variables. Rename on the fly.
ltb <- extract_timevarying(
  ctn,
  episode_ids = 13639:13643,
  code_names = long_codes$hic_codes,
  rename = long_codes$short_name,
  coalesce_rows = long_codes$func
)
#> 8e-05 hours to process
#> AHHH! How kickass was that?!
head(ltb)
#> # A tibble: 6 x 3
#>    time    hr episode_id
#>   <dbl> <int>      <int>
#> 1     0    99      13639
#> 2     1    84      13639
#> 3     2   102      13639
#> 4     3    95      13639
#> 5     4    69      13639
#> 6     5    76      13639
Depending upon how much data you want to pull into a long form for analysis, this can take considerable time (more than 24 hours if you want to convert all 50,000 episodes).
You can expand this table to a regular row cadence (one row per time unit, with no gaps) by inserting rows of NAs with expand_missing.
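A minimal sketch of that step, assuming expand_missing is a standalone helper that takes the extracted table as its first argument; check the package documentation for the actual signature.
# Fill in the gaps so every hour has a row (missing hours become NA).
# Assumes expand_missing() accepts the extracted table as its first argument.
ltb_full <- expand_missing(ltb)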
We can also pull out data at any arbitrary temporal resolution and define the behaviour for information recorded at a higher resolution than you are sampling. For example, if heart rates are recorded every hour and you want to pull out data every 6 hours, we need to know how to collapse those heart rates into a single row. You can supply any custom summary function to the coalesce_rows argument of extract_timevarying.
ltb_2 <- extract_timevarying(
  ctn,
  episode_ids = 13639:13643,
  code_names = long_codes$hic_codes,
  rename = long_codes$short_name,
  cadance = 6, # 1 row every 6 hours
  coalesce_rows = long_codes$func
)
head(ltb_2)
#> # A tibble: 6 x 3
#>    time    hr episode_id
#>   <dbl> <int>      <int>
#> 1     0    99      13639
#> 2     6    69      13639
#> 3    12    88      13639
#> 4    18   101      13639
#> 5    24    92      13639
#> 6    30    93      13639
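Any function that reduces a vector of values to a single value should work as the summary. As an illustration not taken from the vignette itself, you could keep the highest heart rate in each 6-hour window instead of the mean:
# Illustrative only: summarise each 6-hour window by its maximum heart rate.
hr_max_codes <- tibble::tribble(
  ~hic_codes,          ~short_name, ~func,
  "NIHR_HIC_ICU_0108", "hr_max",    function(x) max(x, na.rm = TRUE)
)

ltb_max <- extract_timevarying(
  ctn,
  episode_ids = 13639:13643,
  code_names = hr_max_codes$hic_codes,
  rename = hr_max_codes$short_name,
  cadance = 6,
  coalesce_rows = hr_max_codes$func
)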
DBI::dbDisconnect(ctn)
You can also run extract_timevarying with cadance = "timestamp", which pulls out the exact timing of events and labels each row with its timestamp. I advise against this unless you have a specific requirement, as it is a very memory-intensive task.
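For completeness, a sketch of that call, mirroring the earlier examples; it assumes the connection is still open (re-run connect() if you have already disconnected).
# Illustrative only: keep every recorded value at its original timestamp.
# This can use a lot of memory for large episode ranges.
ltb_ts <- extract_timevarying(
  ctn,
  episode_ids = 13639:13643,
  code_names = long_codes$hic_codes,
  rename = long_codes$short_name,
  cadance = "timestamp",
  coalesce_rows = long_codes$func
)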