Here I just wanted to take a quick second to share a small code snippet that illustrates munging a contiguous dataset to a wide format.
What do I mean by that? Consider the following example: let’s say we have a dataset where we have patients A, B, and C, and we take two tissues from each patient - one of Serum and one of TIF (tumor interstitial fluid), and from each tissue assay a whole host of metabolites, each with its own column. If we want to take the log2 fold-change of Serum vs. TIF for each patient, its tricky to do in Excel when the data is spread out contiguously, across rows.
So how can we tackle this sort of problem in R? Making use of the tidyverse
, the key here is to first gather the metabolites, such that we have a new column named metabolite
describing what is being assayed, and a value
column with the readout.
Now, after this we still have our tissue (specimen type) being Serum vs. TIF. If we want to easily compare the two, it’s easier to spread
these values out across columns, such that we have one row per patient, with two columns, Serum
and TIF
, deriving their values from our freshly created value
column, with metabolite
still describing each assay.
In code, this is what the process looks like below.
library(tidyverse)
## Create example data
dat <- tribble(
~Patient.ID, ~Tissue, ~Glucose, ~Pyruvate, ~Glutamate,
'CR1', 'Serum', 1, 2, 3,
'CR1', 'TIF', 4, 3, 10,
'CR2', 'Serum', 11, 2, 5,
'CR2', 'TIF', 5, 10, 4
)
dat
## # A tibble: 4 x 5
## Patient.ID Tissue Glucose Pyruvate Glutamate
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CR1 Serum 1 2 3
## 2 CR1 TIF 4 3 10
## 3 CR2 Serum 11 2 5
## 4 CR2 TIF 5 10 4
## Tidy-ify data for easy calculations
## Key steps here:
## - gather: this will take all the different readouts
## and put them together under a new column specifying
## the metabolite, and a separate column with the value
## - spread: this takes your data and instead of having
## a separate row for each Serum and TIF reading, we
## create two new columns each for Serum and TIF, and
## spread our `value` column out. less tidy, but better
## for log2foldchange calculations
tidy_dat <- dat %>%
tidyr::gather(metabolite, value,
-Patient.ID, -Tissue) %>%
spread(Tissue, value)
tidy_dat
## # A tibble: 6 x 4
## Patient.ID metabolite Serum TIF
## * <chr> <chr> <dbl> <dbl>
## 1 CR1 Glucose 1 4
## 2 CR1 Glutamate 3 10
## 3 CR1 Pyruvate 2 3
## 4 CR2 Glucose 11 5
## 5 CR2 Glutamate 5 4
## 6 CR2 Pyruvate 2 10
## Calculate fold-change across tissues per patient
## All we do here is a `mutate()` step, which
## creates a new column based on some input, in this case
## log2 foldchange of the column Serum over TIF
pc <- 0.01 # pseudocount for lfc calc
calc_dat <- tidy_dat %>%
dplyr::mutate(
lfc_Serum_TIF = log2((Serum + pc) / (TIF + pc))
)
calc_dat
## # A tibble: 6 x 5
## Patient.ID metabolite Serum TIF lfc_Serum_TIF
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CR1 Glucose 1 4 -1.989
## 2 CR1 Glutamate 3 10 -1.734
## 3 CR1 Pyruvate 2 3 -0.583
## 4 CR2 Glucose 11 5 1.136
## 5 CR2 Glutamate 5 4 0.321
## 6 CR2 Pyruvate 2 10 -2.316
Share this post
Twitter
Google+
Facebook
Reddit
LinkedIn
StumbleUpon
Email