read_motif.Rd
The core function for reading in motif files, whether from the HOMER database, from HOMER de novo motif enrichment results, or from custom motifs. In all cases, the files must be in HOMER format. See below for more details.
read_motif(path)
| Argument | Description |
|---|---|
| path | location of the motif file |
at minimum, a tibble with the following columns:
consensus
the consensus sequence of the motif
motif_name
name of the motif
log_odds_detection
threshold used to determine bound vs. unbound sites
motif_pwm
a list column with PWMs for each motif
The following columns are present when available, either from complete *.motif* files or from HOMER results directories:
log_p_value_detection
from the original experiment used to ID motif
tgt_num
number of times motif appears in target sequences
tgt_pct
percent of times motif appears in target sequences
bgd_num
number of times motif appears in background sequences
bgd_pct
percent of times motif appears in background sequences
log_p_value
final enrichment from the experiment, as -log10(p-value)
tgt_pos
average position of motif in target sequences, where 0 = start of sequences
tgt_std
standard deviation of position in target sequences
bgd_pos
average position of motif in background sequences, where 0 = start of sequences
bgd_std
standard deviation of position in background sequences
strand_bias
log ratio of + strand occurrences to - strand occurrences
multiplicity
average number of occurrences per sequence in sequences with 1 or more binding sites
To read in a HOMER-formatted motif, the first three fields of the header line are required, at minimum, to identify the motif:
">" + Consensus sequence
The dominant or likeliest sequence
Motif name
Should be unique
Log odds detection threshold
determines bound vs. unbound sites
The remaining fields of HOMER-formatted motifs are described at the URL below and are primarily meant for interpreting motifs from HOMER's own database. To read more about the HOMER format, see: http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html
Note that HOMER also stores additional information about a motif's origin and identity in the motif name. See the internal function .parse_homer_subfields for more information and for a way to split this field into its parts.
Subsequent lines (after the ">" header line) describe the position weight matrix (PWM): one row per motif position, with columns in the order A, C, G, T giving the probability of each nucleotide at that position.
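As a rough illustration of the layout described above, the sketch below writes a made-up minimal motif to a temporary file and parses it with base R only. It assumes tab-separated fields, as in HOMER's motif files. It is not the implementation of read_motif, and the motif name, threshold, and probabilities are invented for the example.

```r
# Write a minimal, made-up HOMER-format motif to a temporary file.
# Fields are assumed to be tab-separated.
motif_file <- tempfile(fileext = ".motif")
writeLines(c(
  ">ACGT\texample_motif\t6.0",
  "0.97\t0.01\t0.01\t0.01",
  "0.01\t0.97\t0.01\t0.01",
  "0.01\t0.01\t0.97\t0.01",
  "0.01\t0.01\t0.01\t0.97"
), motif_file)

lines <- readLines(motif_file)

# Header line: strip the leading ">" and split on tabs.
header <- strsplit(sub("^>", "", lines[1]), "\t", fixed = TRUE)[[1]]
consensus <- header[1]
motif_name <- header[2]
log_odds_detection <- as.numeric(header[3])

# Remaining lines: one row per position, columns in A, C, G, T order.
pwm_rows <- lapply(strsplit(lines[-1], "\t", fixed = TRUE), as.numeric)
pwm <- do.call(rbind, pwm_rows)
colnames(pwm) <- c("A", "C", "G", "T")

print(pwm)
```

Each row of the resulting matrix should sum to roughly 1, which is a quick sanity check for a hand-written custom motif.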
Note that it is possible to combine complete (fully annotated, HOMER-formatted) motifs with minimal motifs. Use dplyr::bind_rows to concatenate them despite the differences in column specification.
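The sketch below shows how dplyr::bind_rows handles such a mix. The two tibbles are built by hand to stand in for read_motif output, and all values in them are invented for illustration. Columns missing from the minimal motif are filled with NA.

```r
library(dplyr)

# A placeholder PWM (uniform probabilities), used for both made-up motifs.
pwm <- matrix(0.25, nrow = 4, ncol = 4,
              dimnames = list(NULL, c("A", "C", "G", "T")))

# Stand-in for a minimal custom motif: only the required columns.
minimal <- tibble(
  consensus = "ACGT",
  motif_name = "custom_motif",
  log_odds_detection = 6,
  motif_pwm = list(pwm)
)

# Stand-in for a complete motif with some of the extra columns (values invented).
complete <- tibble(
  consensus = "TGCA",
  motif_name = "annotated_motif",
  log_odds_detection = 7.5,
  motif_pwm = list(pwm),
  log_p_value_detection = -20,
  tgt_num = 100
)

# bind_rows matches columns by name and fills the gaps with NA.
combined <- bind_rows(complete, minimal)
print(combined)
```

The list column motif_pwm is carried through unchanged, so each row of the combined tibble keeps its own matrix.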