Robert A. Amezquita

3 minute read

Here I just wanted to take a quick second to share a small code snippet that illustrates munging a contiguous dataset to a wide format.

What do I mean by that? Consider the following example: let’s say we have a dataset where we have patients A, B, and C, and we take two tissues from each patient - one of Serum and one of TIF (tumor interstitial fluid), and from each tissue assay a whole host of metabolites, each with its own column. If we want to take the log2 fold-change of Serum vs. TIF for each patient, its tricky to do in Excel when the data is spread out contiguously, across rows.

So how can we tackle this sort of problem in R? Making use of the tidyverse, the key here is to first gather the metabolites, such that we have a new column named metabolite describing what is being assayed, and a value column with the readout.

Now, after this we still have our tissue (specimen type) being Serum vs. TIF. If we want to easily compare the two, it’s easier to spread these values out across columns, such that we have one row per patient, with two columns, Serum and TIF, deriving their values from our freshly created value column, with metabolite still describing each assay.

In code, this is what the process looks like below.

library(tidyverse)
## Create example data
dat <- tribble(
    ~Patient.ID, ~Tissue, ~Glucose, ~Pyruvate, ~Glutamate,
    'CR1',       'Serum',        1,         2,          3,
    'CR1',         'TIF',        4,         3,         10,
    'CR2',       'Serum',       11,         2,          5,
    'CR2',         'TIF',        5,         10,         4
)
dat
## # A tibble: 4 x 5
##   Patient.ID Tissue Glucose Pyruvate Glutamate
##        <chr>  <chr>   <dbl>    <dbl>     <dbl>
## 1        CR1  Serum       1        2         3
## 2        CR1    TIF       4        3        10
## 3        CR2  Serum      11        2         5
## 4        CR2    TIF       5       10         4
## Tidy-ify data for easy calculations
## Key steps here:
## - gather: this will take all the different readouts
##     and put them together under a new column specifying
##     the metabolite, and a separate column with the value
## - spread: this takes your data and instead of having
##     a separate row for each Serum and TIF reading, we
##     create two new columns each for Serum and TIF, and
##     spread our `value` column out. less tidy, but better
##     for log2foldchange calculations
tidy_dat <- dat %>%
    tidyr::gather(metabolite, value,
                  -Patient.ID, -Tissue) %>%
    spread(Tissue, value)
tidy_dat
## # A tibble: 6 x 4
##   Patient.ID metabolite Serum   TIF
## *      <chr>      <chr> <dbl> <dbl>
## 1        CR1    Glucose     1     4
## 2        CR1  Glutamate     3    10
## 3        CR1   Pyruvate     2     3
## 4        CR2    Glucose    11     5
## 5        CR2  Glutamate     5     4
## 6        CR2   Pyruvate     2    10
## Calculate fold-change across tissues per patient
## All we do here is a `mutate()` step, which
## creates a new column based on some input, in this case
## log2 foldchange of the column Serum over TIF
pc <- 0.01  # pseudocount for lfc calc

calc_dat <- tidy_dat %>%
    dplyr::mutate(
               lfc_Serum_TIF = log2((Serum + pc) / (TIF + pc))
           )
calc_dat
## # A tibble: 6 x 5
##   Patient.ID metabolite Serum   TIF lfc_Serum_TIF
##        <chr>      <chr> <dbl> <dbl>         <dbl>
## 1        CR1    Glucose     1     4        -1.989
## 2        CR1  Glutamate     3    10        -1.734
## 3        CR1   Pyruvate     2     3        -0.583
## 4        CR2    Glucose    11     5         1.136
## 5        CR2  Glutamate     5     4         0.321
## 6        CR2   Pyruvate     2    10        -2.316
comments powered by Disqus