marge
: An API for Analysis of Motifs Using HOMER in Rmarge-workflow.Rmd
Profiling of open chromatin regions through the assay for transposase-accessible chromatin (ATAC) and transcription factor binding via chromatin immunoprecipitation (ChIP) sequencing has increased our ability to resolve patterns of putative transcription factor binding sites at the genome-wide level. Popular tools such as HOMER and MEME have driven forward the analyses of sequence composition, deriving de novo motifs and searching for the enrichment of known motifs. However, their interfaces do not allow for the construction of parallel inquiries of multiple datasets. Furthermore, their results do not conform to formats amenable to ‘tidy’ analyses, presenting a significant bottleneck in motif analysis. Here, I present marge
, a companion R
package that interfaces with HOMER to facilitate the construction of queries and to tidy results for further downstream analyses.
marge
is an R
package that makes use of various tidyverse
conventions, allowing for modern conventions of R
programming and manipulations. marge
interfaces directly with HOMER to construct queries, execute motif enrichment searches, and organize results. While this requires a separate installation of HOMER and introduces this long-term dependency, core HOMER routines have been stable for a long time, with infrequent updates. Given the popularity of HOMER, the syntax and construction of queries will thus be familiar to the audience utilizing the marge
companion package. However, to conform to modern standards of software design, certain core routines, such as findMotifsGenome.pl
which encompasses many different tasks, have been partitioned into distinct R
functions. For example, motif analysis and motif location identification have been split up into distinct functions find_motifs_genome
and find_motifs_instances
, respectively.
marge
is distributed on Bitbucket as an R
package, and is compatible with Mac OS X, Windows, and Linux. The latest version of HOMER (v4.9 as of this writing) should be installed prior to marge
, and configured with the desired genomes and feature sets. Subsequently, marge
can be installed from the repository. R
Package dependencies and system requirements are documented in the repository and automatically installed in conjunction with the marge
package. Online documentation is also available at https://robertamezquita.github.io/marge/
.
To demonstrate the functionality and utility of marge
, we present a tutorial of core functionality and present potential use cases for high-throughput motif enrichment studies and subsequent downstream analyses.
Code objects shall be referred to in-line with the code
font, whereas R
functions will be referred to with a pair of parentheses following the function name, as in find_motifs_genome()
. R
code chunks will be presented as a block. Comments describing relevant portions of code are prepended by hashes (#
), and all else is runnable R
code. Results of code chunks are printed in separate block below, with each line of output prepended by a double hash (##
). An example of code chunk and output formatting follows:
salutation <- c('hello world')
salutation # view output
## [1] "hello world"
Lastly, documentation for functions can be accessed invoking through help utility in R
. Online documentation with the same content is also available under the ‘Reference’ tab.
The latest version of HOMER should be installed prior to installing marge
as documented online in the introduction/install section. Following installation, HOMER packages should additionally be installed, which contains the necessary sequence data to perform motif enrichment, such as those that follow:
## In a terminal/command-line
perl /path-to-homer/configureHomer.pl -install mm10 # mouse sequence data
## perl /path-to-homer/configureHomer.pl -install hg38 # human sequence data
Additionally, users should take care to ensure that HOMER is added to the executable path, such that entering findMotifsGenome.pl
on the command-line works as expected. For further details, we again refer to the introduction/install section of the HOMER documentation. For this vignette, we make use of the mm10
human sequence data package in HOMER.
What follows is a basic tutorial of a typical workflow in performing motif enrichment analysis from a given, single set of regions. This section will introduce the various verbs available in marge
for this task.
To check that HOMER was installed properly and is detected by marge
invoke the check_homer()
function. Sequence annotation data for the genome of interest should also have been installed prior to running HOMER or marge
. To check what is currently available via HOMER, simply run list_homer_packages()
. For this analysis, either ‘mm10’ or ‘mm10’ packages can be installed, which include the human and mouse genomes and annotations, respectively.
## Install `marge`:
## devtools::install_github('robertamezquita/marge', ref = 'master')
## library(marge)
devtools::load_all() ## load dev version from cloning GH repo
## Loading marge
## HOMER installation found.
## # A tibble: 5 x 4
## status package version description
## <chr> <chr> <chr> <chr>
## 1 + homer v4.10.4 Code/Executables, ontologies, motifs for HOMER
## 2 + mouse-o v6.0 Mus musculus (mouse) accession and ontology infor…
## 3 + human-o v6.0 Homo sapiens (human) accession and ontology infor…
## 4 + mm10 v6.0 mouse genome and annotation for UCSC mm10
## 5 + hg38 v6.0 human genome and annotation for UCSC hg38
To begin, regions of interest should first be identified. For example, peak calls derived from ATAC-seq or ChIP-seq data are of particular interest for motif enrichment analyses. At minimum, such regions should contain three columns: chrom
(chromosome), start
, and end
, describing the location of the region of interest. Additional columns are also allowed - suggested columns include a region identifier, gene annotations, distance to the gene, and properties such as expression/chromatin state.
A minimal set of regions is included with the marge
package for testing, and can be loaded as follows:
## Use included test regions from `marge` package
test.file <- system.file('extdata/test_regions.bed', package = 'marge')
dat <- readr::read_tsv(test.file)
dat
## # A tibble: 4 x 6
## chrom start end id value gene
## <chr> <dbl> <dbl> <chr> <dbl> <chr>
## 1 chr6 102207441 102207830 region_1 1 A
## 2 chr7 44688352 44688664 region_2 2 B
## 3 chr7 92093830 92094134 region_3 3 C
## 4 chr12 103669461 103669829 region_4 4 D
From the regions of interest, enriched de novo and/or known motifs can be performed via the find_motifs_genome()
function. This function connects to the HOMER utility findMotifsGenome.pl
, and uses much of the same syntax, and so should be familiar to users of HOMER. The following code shows all available options, with documentation available online and via R
for further details. Suffice it to say, a de novo and known motifs enrichment analysis is run on the test regions loaded prior, and writes results to the designated path directory (here, results_dir
). Thus, find_motifs_genome()
is run purely for its side-effect of creating a HOMER results directory.
## Create a temporary directory to write results
## This directory is erased once R session is closed
results_dir <- tempfile(pattern = 'test-dir_')
## Run a motif enrichment analysis
find_motifs_genome(
dat,
path = results_dir,
genome = 'mm10',
motif_length = 8,
scan_size = 50,
optimize_count = 2,
background = 'automatic',
local_background = FALSE,
only_known = FALSE, only_denovo = FALSE,
fdr_num = 5,
cores = 1, cache = 100,
overwrite = TRUE, keep_minimal = FALSE
)
Note in particular the argument keep_minimal
, which if set to TRUE
, removes all extraneous HTML and logo images, retaining only the relevant enrichment results necessary for downstream analyses. If FALSE
(the default), the output is exactly the same as if HOMER had been run from the command line with the given parameters.
HOMER produces two distinct files for de novo and known motif enrichments within the designated results directory, contained in homerMotifs.all.motif
and knownResults.txt
, respectively. These results can be read in using the respective read_*_results()
functions, where *
is either denovo
or known
.
known <- read_known_results(path = results_dir, homer_dir = TRUE)
denovo <- read_denovo_results(path = results_dir, homer_dir = TRUE)
known
denovo
## Adding missing grouping variables: `motif_id`
## # A tibble: 365 x 14
## motif_name motif_family experiment accession database consensus
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Fosl2 bZIP 3T3L1-Fos… GSE56872 Homer NATGASTC…
## 2 MafB bZIP BMM-Mafb-… GSE75722 Homer WNTGCTGA…
## 3 Fra2 bZIP Striatum-… GSE43429 Homer GGATGACT…
## 4 Fra1 bZIP BT549-Fra… GSE46166 Homer NNATGAST…
## 5 JunB bZIP Dendritic… GSE36099 Homer RATGASTC…
## 6 Atf3 bZIP GBM-ATF3-… GSE33912 Homer DATGASTC…
## 7 BATF bZIP Th17-BATF… GSE39756 Homer DATGASTC…
## 8 AP-1 bZIP ThioMac-P… GSE21512 Homer VTGACTCA…
## 9 RUNX Runt HPC7-Runx… GSE22178 Homer SAAACCAC…
## 10 NFY CCAAT Promoter <NA> Homer RGCCAATS…
## # … with 355 more rows, and 8 more variables: log_p_value <dbl>,
## # fdr <dbl>, tgt_num <dbl>, tgt_pct <dbl>, bgd_num <dbl>, bgd_pct <dbl>,
## # motif_pwm <named list>, log_odds_detection <dbl>
## # A tibble: 2 x 18
## consensus motif_name log_odds_detect… motif_id motif_pwm log_p_value_det…
## <chr> <chr> <dbl> <int> <list<df> <dbl>
## 1 TCGCATTG 1-TCGCATTG 8.24 1 [8 × 4] -7.02
## 2 GATTACTC 2-GATTACTC 8.24 2 [8 × 4] -6.73
## # … with 12 more variables: tgt_num <dbl>, tgt_pct <dbl>, bgd_num <dbl>,
## # bgd_pct <dbl>, log_p_value <dbl>, fdr <dbl>, tgt_pos <dbl>,
## # tgt_std <dbl>, bgd_pos <dbl>, bgd_std <dbl>, strand_bias <dbl>,
## # multiplicity <dbl>
These functions specifically are responsible for the core routine which tidies the HOMER results, and can be pointed to any existing HOMER motif enrichment result files by setting homer_dir
to FALSE
. The output provides a tidied version of the default HOMER output, capturing the various characteristics of enrichments, such as the identity of the motifs, target and background number/percents of the motifs across the provided sequences, statistics of enrichment, and finally the associated motif position weight matrices (PWM) in the list-column motif_pwm
. The columns are described further in the associated help files.
One core task of interest is working with the identified de novo motif PWMs that were enriched across the regions of interest. To access a specific motif PWM, the following shows several ways of accessing them from the motif_pwm
list-column in the known
and denovo
objects previously created.
## Known Motifs
known$motif_pwm[1] # positional access
known$motif_pwm['OCT:OCT'] # named access
## Denovo Motifs
denovo$motif_pwm[1] # positional access
## <list_of<
## tbl_df<
## A: double
## C: double
## G: double
## T: double
## >
## >[1]>
## $`1-TCGCATTG`
## # A tibble: 8 x 4
## A C G T
## <dbl> <dbl> <dbl> <dbl>
## 1 0.1 0.1 0.1 0.7
## 2 0.1 0.7 0.1 0.1
## 3 0.1 0.1 0.7 0.1
## 4 0.1 0.7 0.1 0.1
## 5 0.7 0.1 0.1 0.1
## 6 0.1 0.1 0.1 0.7
## 7 0.1 0.1 0.1 0.7
## 8 0.1 0.1 0.7 0.1
Note the column names identifying the nucleotide, the rows for the ordered nucleotide position, and the frequencies at each cell, where each row sums to 1.
To write a motif to a file, the helper function write_homer_motif()
can be used, and values can be manually specified for motifs of interest.
## Note double bracket to access list value;
## single bracket produces error
write_homer_motif(
motif_pwm = denovo$motif_pwm[['1-TCGCATTG']],
motif_name = 'my_first_motif',
log_odds_detection = 4.35,
consensus = 'CACATCCT',
file = paste0(results_dir, '/my_first_motif.motif')
)
Oftentimes, one wants to write all associated motif PWMs from a given de novo enrichment search. Given that write_homer_motif
produces a single motif at a time, we need a way to produce a file for each row of the denovo
object.
While there are various ways to iterate over the rows of the denovo
object, such as with for-loops
or the apply
family of functions, below is shown a method using the purrr
package from the tidyverse
family, using the map()
and walk()
functions. Briefly, these functions apply a function to each element of a vector, thus allowing for iteration over the rows of an object such as those we created, known
and denovo
. map
returns the transformed object, whereas walk
calls the function for its side-effect, such as in the case of writing a file as in write_homer_motif()
. Given that we need several varying inputs (all the arguments of write_homer_motif()
, we use the variant pwalk()
to allow for our many varying inputs. The resulting code below writes all of our denovo results (three motifs in total) to unique files in our temporary R
results directory.
## Write to multiple files - vary the `file` argument
library(purrr)
pwalk(
.l = list(
motif_pwm = denovo$motif_pwm,
motif_name = denovo$motif_name,
log_odds_detection = denovo$log_odds_detection,
consensus = denovo$consensus,
file = paste0(results_dir, '/', denovo$motif_name, '.motif')
),
.f = write_homer_motif
)
## Write to a single file with all motifs
## Keep the `file` argument constant
my_motifs_file <- paste0(results_dir, '/my-motifs.motif')
## Append by default is TRUE, e.g. can write more than one
## motif at once, but this means don't run this code more
## than once!
pwalk(
.l = list(
motif_pwm = denovo$motif_pwm,
motif_name = denovo$motif_name,
log_odds_detection = denovo$log_odds_detection,
consensus = denovo$consensus
),
.f = write_homer_motif,
file = my_motifs_file,
append = TRUE
)
Now that we have produced our motif PWM as a file, it can be used to map the motif back to where it occurs in our set of regions using the find_motifs_instances()
function. While it is also a wrapper for findMotifsGenome.pl
, it partitions out a unique piece of functionality that presents an entirely different type of results. Note that this function requires the use of a HOMER motif file that is written out (either by hand or by the write_homer_motif()
function) as it contains the necessary parametrization for motif finding.
Also note that we can search for either a single motif or multiple motifs concurrently, where the multiple motifs are written into a single file, as shown above. Here, we will search for all three of the de novo motifs identified initially in our regions.
motif_instances_file <- paste0(results_dir, '/motif-instances.txt')
find_motifs_instances(
dat,
path = motif_instances_file,
genome = 'mm10',
motif_file = my_motifs_file,
scan_size = 50,
cores = 1, cache = 100
)
This function, similarly to find_motifs_genome()
and write_homer_motif()
, produces output that is written to into a file outside of R
, and similarly relies on a helper function to read the resulting output.
The function read_motifs_instances()
works in tandem with find_motifs_instances()
to read its output, returning the location and scores of the specified motifs from the previous step.
motif_instances <- read_motifs_instances(motif_instances_file)
motif_instances
## # A tibble: 2 x 6
## region_id offset sequence motif_name strand motif_score
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 region_2 -24 TCGCATTG 1-TCGCATTG + 8.24
## 2 region_1 -5 GATTACTC 2-GATTACTC + 8.24
Using the verbs learned in the previous section, this part will focus on tying it all together to create a framework for testing for motif enrichment and subsequently analysing the results across multiple sets of regions in a tidy fashion. This section relies heavily on the tidyverse
framework, and in particular will use the tools from purrr
, map()/walk()
, introduced in the previous section. Additionally, we will simulate a dataset using the valr
package function bed_random()
. While outside of the scope of this tutorial, valr
provides additional functionality to work with genomic intervals using similar tidyverse
conventions, and is recommended for manipulating regions of interest from high-throughput experiments.
Below are the libraries required for this next section.
## Required libraries for advanced usage tutorial
library(valr)
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
By bringing the usage of HOMER into R
via an API, it becomes possible to construct multiple queries, execute them all, and then read in the results from all the queries.
First, we will create a simulated set of intervals for which to test for motif enrichment, formatting the regions into a tibble
object which allows for the creation of list-columns, which tidily encapsulate the genomic intervals.
## 1. Construct multiple region samples -------------------------------------
## Create genome from first two chromosomes of mouse
genome <- data.frame(chrom = c('chr1', 'chr2'),
size = c(195471971, 182113224))
## Set seeds and ids for the simulated sets of regions
seed <- 1:5
id <- letters[1:5]
## Simulate different sets of regions (each with unique seeds) and munge
## into the tibble format, keeping the regions as a list-column
sim <- map(seed, ~ valr::bed_random(genome, length = 50, n = 50, seed = .x))
tbl_regions <- tibble(id = id, regions = sim)
## Inspect overall tibble organization
tbl_regions
## # A tibble: 5 x 2
## id regions
## <chr> <list>
## 1 a <tibble [50 × 3]>
## 2 b <tibble [50 × 3]>
## 3 c <tibble [50 × 3]>
## 4 d <tibble [50 × 3]>
## 5 e <tibble [50 × 3]>
## View regions sample
tbl_regions$regions[[1]] %>% print(n = 5)
## # A tibble: 50 x 3
## chrom start end
## <chr> <int> <int>
## 1 chr1 3322213 3322263
## 2 chr1 12952049 12952099
## 3 chr1 13419401 13419451
## 4 chr1 13496953 13497003
## 5 chr1 44639848 44639898
## # … with 45 more rows
The next step involves specifying the paths to the results directory for each respective set of regions. This is necessary for find_motifs_genome()
to output the motif enrichment results for each set of regions separately (otherwise the results will be overwritten multiple times).
## 2. Find Motifs in Genome -------------------------------------------------
## Create new (temp) results directory
results_dir <- tempfile(pattern = 'ht-dir_')
## Append results path to tibble of regions
tbl_regions <- tbl_regions %>%
mutate(path = paste0(results_dir, '/homer_', id))
tbl_regions[1, ] # inspect resulting first row
## # A tibble: 1 x 3
## id regions path
## <chr> <list> <chr>
## 1 a <tibble [50 ×… /var/folders/8t/qchxz4r950zd5qnp21glqt8w0000gp/T//R…
Now, using each set of regions and the unique paths for each set of regions, find_motifs_genome()
can be run, varying across these two parameters, and setting as constant the various motif enrichment search parameters for all the different sets.
One other potential application not shown here is testing how these various parameters affect motif enrichment results if we were instead to keep the regions tested as constant instead. This is left as an exercise for the interested reader.
## Iterate over all regions to perform motif finding
## Writing each set of regions results to different directory
## Only perform known motif results enrichments
pwalk(
## Varying parameters (regions, output directory)
.l = list(x = tbl_regions$regions, path = tbl_regions$path),
## Function to find motifs across genome
.f = find_motifs_genome,
## Constant parameters (motif search settings)
genome = 'mm10',
scan_size = 50,
optimize_count = 2,
only_known = TRUE,
cores = 2, cache = 100,
keep_minimal = TRUE, overwrite = TRUE
)
Following the motif enrichment search for each set of regions, the results can be read in, iterating over the variable path
within the tbl_regions
object to pull in the relevant results. The results are cleaned to remove the path
and regions
columns subsequently, and then expanded such that each individual motif enrichment result has its corresponding id
as a separate variable.
## 3. Read in All Results ----------------------------------------------------
## Read known motif enrichment results into a new list-column
## Remove the path and regions columns for cleanliness
## Unnest results to expand into plotting friendly format
tbl_results <- tbl_regions %>%
mutate(known_results = map(path, read_known_results)) %>%
select(-path, -regions) %>%
unnest()
## View sampling of results
tbl_results %>% sample_n(size = 5)
## # A tibble: 5 x 15
## id motif_name motif_family experiment accession database consensus
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 e Smad3 MAD NPC-Smad3… GSE36673 Homer TWGTCTGV
## 2 a Foxo1 Forkhead RAW-Foxo1… Fan_et_a… Homer CTGTTTAC
## 3 b CLOCK bHLH Liver-Clo… GSE39860 Homer GHCACGTG
## 4 a Tlx? NR NPC-H3K4m… GSE16256 Homer CTGGCAGS…
## 5 c Bcl6 Zf Liver-Bcl… GSE31578 Homer NNNCTTTC…
## # … with 8 more variables: log_p_value <dbl>, fdr <dbl>, tgt_num <dbl>,
## # tgt_pct <dbl>, bgd_num <dbl>, bgd_pct <dbl>, motif_pwm <list>,
## # log_odds_detection <dbl>
In this tidy form, it becomes possible to summarise results on the basis of motif families, as opposed to simply individual motifs. Below, we use the id
and motif_family
columns to group the results, and summarise a given motif family by taking the most significant individual motif appearance as the summary statistic (e.g., the max -log10 p-value).
## Summarise number of
tbl_summary <- tbl_results %>%
group_by(id, motif_family) %>%
summarise(top_log_p_value = max(log_p_value)) %>%
ungroup()
## Print example results
tbl_summary %>% sample_n(5)
## # A tibble: 5 x 3
## id motif_family top_log_p_value
## <chr> <chr> <dbl>
## 1 b EBV-virus 0
## 2 a Paired,Homeobox,longest 0
## 3 e NR,IR3 0
## 4 e Zf,DR8 0
## 5 d ETS,bHLH 0
Note that few motifs appear significant in this simulated dataset, as regions were chosen at random, without a true biological correlate.
marge
presents an R
centric form of performing and analyzing motif enrichment results using the popular HOMER suite of tools, providing an easy-to-use interface and results in line with modern R
idioms. We envision that marge
will serve as a valuable tool to assist researchers in performing motif enrichment analyses quickly, easily, and reproducibly.
Online documentation for HOMER and installation details can be found at: http://homer.ucsd.edu/homer/
marge
can be installed via devtools
using: devtools::install_github('robertamezquita/marge', ref = 'master')
The latest marge
source code is available at: https://github.com/robertamezquita/marge/
Stable versions of marge
are located in the Downloads section of the source code repository, under the tab ‘Tags’, at: https://github.com/robertamezquita/marge/downloads/