Introduction to HOMER
The best way to learn about HOMER
is to go through the tutorial pages. We've tried to spell out
what happens in each step and explain the "why". A brief
description of the Motif Finding component of HOMER is found
below. Explanations of the sequencing analysis components of HOMER
are integrated into the tutorials.
General Introduction to Motif Discovery with HOMER
HOMER is a collection of tools
that are commonly needed for the
analysis of gene expression profiling (microarray) and genome-wide
location analysis experiments (ChIP-Seq or ChIP-Chip). There are
also routines for other types of sequencing experiments, such as
DNase-Seq or GRO-Seq.
Some of the things HOMER does NOT do: find
differentially expressed genes, cluster gene expression profiles, or
search for all instances of Transfac motifs in order to make you
hopelessly confused! The idea was to avoid reinventing
the wheel where possible.
Unfortunately, HOMER must be run as a command-line tool, and may be
difficult to use if you are new to UNIX. While commands have been
distilled to be as simple and user-friendly as possible, basic
knowledge of the UNIX environment and file system is critical (but can
probably be learned quickly after typing “unix tutorial” into
Google). I am proud to say that many of the people using HOMER are
completely new to UNIX, so it is indeed possible. In addition, a
spreadsheet program (e.g. Excel) is
needed to graph and visualize some of the results produced by HOMER.
Below is a description of how motif analysis is executed with
HOMER. The steps of analysis for Next-Gen Sequencing (genomic
position analysis) and Microarrays (gene-based analysis) are
documented in separate sections.
De Novo Motif Discovery Strategy
HOMER was designed as a de novo
motif discovery algorithm that scores motifs by looking for motifs with
differential enrichment between two sets of sequences. This means
that HOMER uses two sets of sequences when performing motif finding:
(1) target sequences of interest (e.g. promoters of genes that are
co-regulated) and (2) a set of background sequences (e.g. promoters of
genes that are not regulated). Without background sequences, a
motif discovery algorithm must guess what sequences are expected to be
found by chance, such as assuming background sequences are a random
collection of A, C, G, and T. This can be extremely dangerous
since real genomic sequence is anything but random.
In practice HOMER will try to select the appropriate background
sequences for you, but results can vary depending on what is used as
background and certain applications may require careful consideration
of these sequences. By default HOMER will use confident,
non-regulated promoters as background when analyzing promoters, and
sequences in the vicinity of genes for ChIP-Seq analysis (i.e. from
–50kb to +50kb). In each case sequences are matched for their GC
content to avoid bias from CpG Islands.
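
To make the idea of GC matching concrete, here is a minimal Python sketch. It is not HOMER's implementation: the function names, the number of GC bins, and the `per_target` ratio are all assumptions chosen for illustration. It simply picks background sequences so that each GC-content bin is represented in proportion to the target set.

```python
import random
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_bin(seq, n_bins=10):
    """Map a sequence to one of n_bins equal-width GC-content bins."""
    return min(int(gc_fraction(seq) * n_bins), n_bins - 1)


def select_gc_matched_background(targets, candidates, per_target=2,
                                 n_bins=10, seed=0):
    """Pick candidates so each GC bin holds per_target sequences per target.

    Hypothetical helper: bins with too few candidates are simply left short.
    """
    rng = random.Random(seed)
    pool = defaultdict(list)
    for seq in candidates:
        pool[gc_bin(seq, n_bins)].append(seq)
    for seqs in pool.values():
        rng.shuffle(seqs)

    wanted = defaultdict(int)
    for seq in targets:
        wanted[gc_bin(seq, n_bins)] += per_target

    background = []
    for bin_index, count in wanted.items():
        background.extend(pool[bin_index][:count])
    return background
```

With this kind of selection, a CpG-island-rich target set is compared against similarly GC-rich background, so GC-rich oligos do not look enriched merely because of base composition.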
Once target and background
sequences are chosen, HOMER looks for motifs of a specific length that
are overrepresented in the target set relative to the background
set. This enrichment is measured using the cumulative
hypergeometric
distribution (or the cumulative binomial distribution for large data
sets), and places no requirement on the degeneracy
of the motif or the number of occurrences. Motifs are found by
first exhaustively checking the enrichment of simple motifs, then
refining promising candidates into accurate probability matrices.
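
As an illustration of the enrichment score, the following Python sketch computes a cumulative hypergeometric p-value from sequence counts. The function and parameter names are assumptions made for this example, and the exact counting scheme HOMER uses internally is not shown here.

```python
from math import comb


def hypergeom_enrichment_pvalue(total, total_with_motif,
                                targets, targets_with_motif):
    """Probability of seeing at least targets_with_motif motif-containing
    sequences among `targets` sequences drawn without replacement from
    `total` sequences, of which `total_with_motif` contain the motif.
    """
    upper = min(targets, total_with_motif)
    denominator = comb(total, targets)
    numerator = sum(
        comb(total_with_motif, k) * comb(total - total_with_motif, targets - k)
        for k in range(targets_with_motif, upper + 1)
    )
    return numerator / denominator


# Toy example: 10 sequences in total (target + background), 4 contain the
# motif, 3 are targets, and 2 of those targets contain the motif.
p = hypergeom_enrichment_pvalue(10, 4, 3, 2)
```

For the toy counts above the result is 40/120, or about 0.33, which is nowhere near significant. Exact integer arithmetic keeps the sketch simple; for very large data sets a binomial approximation is cheaper, as the text notes.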
With v3.0 of HOMER, the motif discovery software has been rewritten and
modernized (the homer2 executable). There is a subtle, but very
important difference in how the new version of HOMER performs de novo motif analysis. The
original HOMER divided the input sequences into short oligos to perform
the analysis, and once a motif was found, only the oligos considered
"bound" by the motif were removed from the analysis. The problem
was that several oligos representing "offsets" of the original motif
(think GGAAGT vs. GAAGTg) were left for the 2nd round of motif
enrichment to find, creating results that often contained several
versions of the original motif. The new version revisits the input
sequences and removes all oligos that are slightly offset from the
optimal motifs, making it much more sensitive to co-enriched motifs.
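
The offset problem can be sketched in a few lines of Python. This is a simplified illustration using exact string matches, not the homer2 algorithm itself (which works with probability matrices); the function names and the `max_shift` parameter are assumptions.

```python
def is_offset_of(motif, oligo, max_shift=2):
    """True if oligo equals motif shifted left or right by 1..max_shift
    positions, judged on the overlapping part only."""
    if len(motif) != len(oligo):
        return False
    for shift in range(1, max_shift + 1):
        shifted_right = motif[shift:] == oligo[:-shift]
        shifted_left = oligo[shift:] == motif[:-shift]
        if shifted_right or shifted_left:
            return True
    return False


def remove_motif_and_offsets(oligos, motif, max_shift=2):
    """Drop the motif itself and any oligo that is a small offset of it."""
    return [
        oligo for oligo in oligos
        if oligo != motif and not is_offset_of(motif, oligo, max_shift)
    ]


remaining = remove_motif_and_offsets(
    ["GGAAGT", "GAAGTG", "AGGAAG", "TGACTC"], "GGAAGT"
)
```

Here `GAAGTG` and `AGGAAG` are removed along with `GGAAGT`, leaving only `TGACTC` for the next round of motif enrichment.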
Known Motif Discovery Strategy
The biggest problem when looking
for “known” motifs is defining how degenerate you should allow them to
be. To circumvent this problem, we loaded motifs derived from
published ChIP-Seq experiments that were already optimized for
degeneracy thresholds.
Interpretation of Motif Discovery Results
De Novo Results
Unfortunately, if you give HOMER
random data, HOMER will find motifs, and they may look
significant. Due to the finite amount of data and many degrees of
freedom in a motif probability matrix, it is easy to find a motif with
a seemingly significant p-value. Because of this, we can only
trust the most promising of motifs as likely to be real. For most
promoter datasets, motifs with a p-value greater than 1e-10 or even
1e-12 are likely to be false positives. In general the p-value
cutoff should be estimated by randomizing data labels and running the
algorithm several times. In practice you should start ignoring
results either when their p-values are larger (less significant) than
1e-10 or when the results start becoming very different from one
another (in terms of sequence) yet have similar p-values. In
addition, high quality motifs usually appear multiple times in the
list with different offsets (e.g. nnnTGACTCAnn
and nTGACTCAnnnn). HOMER attempts to remove extremely similar
motifs, but different offsets of motifs are likely to be present if the
signal is strong (remember motifs may appear as if on the negative
strand).
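
The label-randomization idea mentioned above can be sketched as follows. This is only an outline under stated assumptions: `best_pvalue_fn` is a hypothetical stand-in for whatever runs the motif finder on a given target/background labeling and returns the best p-value it found; it is not a HOMER function.

```python
import random


def null_best_pvalues(labels, best_pvalue_fn, n_rounds=20, seed=0):
    """Shuffle target/background labels n_rounds times and record the best
    p-value the (hypothetical) motif finder reports for each shuffle."""
    rng = random.Random(seed)
    shuffled = list(labels)  # work on a copy, leave the input untouched
    results = []
    for _ in range(n_rounds):
        rng.shuffle(shuffled)
        results.append(best_pvalue_fn(list(shuffled)))
    return results


def suggested_cutoff(null_pvalues):
    """Most significant p-value seen on randomized data; real motifs should
    be clearly more significant than this."""
    return min(null_pvalues)
```

Because the shuffled labels carry no biological signal, any motif found in those runs is a false positive by construction, so the p-values they reach give a rough floor for what to believe in the real analysis.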
Matching De Novo to Known Motifs
HOMER makes every attempt to tell you if the motifs it discovered
resemble a known motif. The difficulty of interpreting these
results SHOULD NOT BE UNDERESTIMATED! Consider the following:
- Databases of known motifs are a mixture of accurate and
inaccurate motifs
- Databases of known motifs are not complete
- The literature (especially motif finding papers) is full of
inaccurate assessments and motif annotations that are ludicrous.
HOMER tries to find the known
motifs with the best correlation between the known motif and de novo motif. It then aligns
the motifs from the top hits so that you can see them and judge the
alignments for yourself. The top known motif match is not always
the best match. The top match is not always annotated
correctly. If you feel something is worth pursuing, look up the
known binding sites of the transcription factor via PubMed.
Feedback I got when writing the program was to provide the name of the
motif in the main result table – this was promptly followed by the
misinterpretation of results because people are too lazy to look at the
alignment to figure out if it makes any sense. These results do
not write the paper for you – critical thinking and follow-up are
required.
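
To show what "best correlation" between two motifs can mean in practice, here is a minimal Python sketch that slides one probability matrix along another, on both strands, and reports the alignment with the highest Pearson correlation. The matrix layout (one `[A, C, G, T]` row per position), the function names, and the `min_overlap` parameter are assumptions for illustration; HOMER's actual scoring details may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def reverse_complement(matrix):
    """Reverse the positions and swap A<->T, C<->G in each [A, C, G, T] row."""
    return [row[::-1] for row in reversed(matrix)]


def best_alignment(de_novo, known, min_overlap=4):
    """Return (correlation, offset, strand) of the best-scoring alignment.

    Position i of de_novo is compared with position i - offset of known.
    """
    best = (-1.0, 0, "+")
    for strand, cand in (("+", known), ("-", reverse_complement(known))):
        for offset in range(-(len(cand) - min_overlap),
                            len(de_novo) - min_overlap + 1):
            x, y = [], []
            for i, column in enumerate(de_novo):
                j = i - offset
                if 0 <= j < len(cand):
                    x.extend(column)
                    y.extend(cand[j])
            if len(x) >= 4 * min_overlap:
                r = pearson(x, y)
                if r > best[0]:
                    best = (r, offset, strand)
    return best
```

A high correlation only says the matrices look alike over the overlapping columns. It does not say the database entry is accurate or correctly annotated, which is why the alignment still needs to be inspected by eye.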
Additional Reading: Tips for de novo motif finding
Known Motif Enrichment
First and most important: There
is a subtle but IMPORTANT difference between looking for motifs de novo
and looking for known motif enrichment. De novo motif discovery
allows you to directly query the sequence to discover which motifs are
the MOST enriched sequences in your target set. Known motif
discovery will simply tell you which of the known motifs is most
enriched in your target set.
This may not seem important but consider the following scenario:
You have a set of random GA-rich sequences and compare them to random
genomic sequences. De novo motif finding will likely return a
G/A-rich matrix that doesn’t look anything like a transcription
factor. Known motif finding will return astonishingly significant
p-values for motifs like PU.1 (GAGGAAGT) and ISRE (GAAACTGAAA).
Because of this, de novo motif finding results are generally more
trustworthy.
The greatest advantage to using known motifs is found when you have a
limited set of target sequences. The less data that is available,
or the weaker the true signal, the more difficult it is for de novo motif
finding to accurately define a signal that is significant. Known
motifs have the advantage of far fewer degrees of freedom and in many
cases find the correct motifs even when the enrichment fails to reach the 1e-10
threshold for believability used when considering de novo results.
A more detailed description of the motif finding procedure is available
in the Motif Finding Tutorial.