Quantifying Data and Motifs and Comparing Peaks/Regions
in
the Genome
Homer contains a useful, all-in-one program for performing
peak
annotation called annotatePeaks.pl.
In
addition
to
associating
peaks
with
nearby
genes,
annotatePeaks.pl
can perform Gene
Ontology Analysis, genomic feature association analysis
(Genome
Ontology), associate peaks with gene expression data,
calculate
ChIP-Seq Tag densities from different experiments, and find
motif
occurrences in peaks. annotatePeaks.pl
can also be used to create histograms and heatmaps.
Description
of the annotation functions are covered here,
while
quantification
of
tags,
motifs,
histograms, etc. are covered below.
annotatePeaks.pl
<peak/BED
file>
<genome>
[options]
>
<output
file>
i.e. annotatePeaks.pl
ERpeaks.txt
hg18 > outputfile.txt
Everything packed into one program
The great thing about
annotatePeaks.pl is that it combines many features into a
single
location. It is the primary program to investigate
how sequencing
reads, sequence motifs, and other information and
annotations interact.
Three primary options are available to specify types of
data that can
be processed by annotatePeaks.pl:
-d <tag directory1>
[tag
directory 2] ...
-m <motif file 1> [motif file 2] ...
-p <peak/BED file 1> [peak/BED file 2] ...
By default, these data types are processed relative to
each peak/region
provided in the primary input file. There are a
bunch of options
that help fine tune how each type of data is considered by
the program
covered below.
However, annotatePeaks.pl can take the same input data and
do other
things, such as make histograms and heatmaps, allowing you
to explore
the data in a different way.
-hist <# bin size>
Acceptable Input files
annotatePeaks.pl
accepts HOMER
peak files
or BED files:
HOMER peak files should have at minimum 5 columns
(separated by TABs,
additional columns will be ignored):
- Column1: Unique Peak ID
- Column2: chromosome
- Column3: starting position
- Column4: ending position
- Column5: Strand (+/- or 0/1, where 0="+", 1="-")
BED files should have at
minimum
6 columns (separated by TABs,
additional columns will be ignored)
- Column1: chromosome
- Column2: starting position
- Column3: ending position
- Column4: Unique Peak ID
- Column5: not used
- Column6: Strand (+/- or 0/1, where 0="+", 1="-")
In theory, HOMER will accept
BED
files with only 4 columns (+/- in the
4th column), and files without unique IDs, but this is NOT
recommended. For one, if you don't have unique IDs
for your
regions, it's hard to go back and figure out which region
contains
which peak.
Mac
Users: If using a EXCEL to prepare input files,
make sure to
save files as a "Text
(Windows)" if running MacOS - saving as "Tab delimited
text" in Mac
produces problems for the software. Otherwise, you
can run the
script " changeNewLine.pl
<filename>" to convert the Mac-formatted
text file to a
Windows/Dos/Unix formatted text file.
If errors occur, it is likely that the
file is not in the correct format, or the first column is
not actually
populated with unique identifiers.
TSS Mode
Instead of supplying a
peak file,
HOMER makes it easy to do analysis of features near
TSS. To
analyze transcription start sites as if they were peaks,
use " tss" as the
first argument.
annotatePeaks.pl
tss
hg18 ...
To restrict the analysis
to a
subset of TSS promoters, add the option "-list <file>"
where the file
is a Tab-delimited text file with the first column
containing gene
Identifiers. TSS mode also works with custom gene
definitions
specified with "-gtf
<GTF file>".
Specifying the Peak Size - the most important parameter
Pretty much everything
explained
in the following sections depends heavily on the " -size <#>"
parameter. A
couple of quick notes:
-size
<#> : will perform analysis on the # bp
surrounding the
peak centers [example: -size 1000]
-size <#,#>
: will
perform analysis from # to # relative to peak center
[example: -size
-200,50]
-size given :
will
perform analysis on different sized peaks - size given
by actual
coordinates in peak/BED file [example: -size given]
For example, if you peaks are actually transcription start
sites, you
might want to specify "-size -500,100" to perform the
analysis upstream
-500 bp to +100 bp downstream. If your peaks/regions
are actually
"transcript" regions, specifying " -size
given" will count reads along the entire
transcript. If it
doesn't make sense, watch Delta Force I and II back to
back. That
should numb the brain enough to get it.
Annotating Individual Peaks
Calculating ChIP-Seq Tag
Densities
across different
experiments
annotatePeaks.pl is useful
program for cross-referencing data from multiple
experiments. In
order to count the number of tags from different
sequencing
experiments,
you must first create tag
directories for
each of these experiments. Once created, tag counts
from these
directories in the vicinity of your peaks can be added by
specifying " -d <tag
directory 1> <tag
directory 2> ...". You can specify as
many tag
directories as you like. Tag totals for each
directory will be
placed in new columns starting on column 18. For
example:
annotatePeaks.pl
pu1peaks.txt
mm8
-size
400
-d
Macrophage-PU.1/
Bcell-PU.1/
>
output.txt
output.txt, when opened in EXCEL, will look like this:
HOMER automatically normalizes each directory by the total
number of
mapped tags such that each directory contains 10 million tags.
This total
can be changed by specifying " -norm
<#>" or by specifying " -noadj",
which
will
skip
this
normalization
step.
The other important parameter when counting tags is to
specify the size
of the region you would like to count tags in with " -size <#>".
For example,
" -size 1000" will
count tags in
the 1kb region centered on each peak,
while " -size 50"
will count
tags in the 50 bp region centered on the
peak (default is 200). The number of tags is not
normalized by
the size of the region.
One last thing to keep in mind is that in order to fairly
count tags,
HOMER will automatically center tags based on their
estimated
ChIP-fragment lengths. This is can be overridden by
specifying a
fixed ChIP-fragment length using " -len
<#>" or " -fragLength
<#>". This is important to consider
when trying to
count RNA tags, or things such as 5' RNA CAGE/TSS-Seq,
where you may
want to specify " -len 0"
so
that HOMER doesn't try to move the tags before counting
them.
HOMER can also quantify signal in bedGraph and WIG files
using " -bedGraph <bedgraph file1> ..." and " -wig
<wiggle file1> ...", respectively.
Making Scatter Plots
X-Y scatter plots are a
great way
to present information, and by counting tag densities from
different
tag directories, you can visualize the relative levels of
different
sequencing experiments. Because of the large range
of values, it
can be difficult to appreciate the relationship between
data sets
without log transforming the data (or sqrt to stay Poisson
friendly). Also, due to the digital nature of tag
counting, it
can be hard to properly assess the data from a X-Y scatter
plot since
may of the data points will have the same values and
overlap. To
assist with these issues, you can specify " -log" or " -sqrt" to transform
the data.
These functions will actually report "log(value+1+rand)"
and
"sqrt(value+rand)", respectively, where rand is a random
"fraction of a
tag" that adds jitter
to your
data so that data points with low tag counts will not have
exactly the
same value. For example, lets look at the distribution of
H3K4me1 and
H3K4me3 near Oct4 peaks in mouse embryonic stem cells:
annotatePeaks.pl Oct4.peaks.txt mm8 -size 1000 -d
H3K4me1-ChIP-Seq/
H3K4me3-ChIP-Seq/ > output.txt
Opening output.txt with
EXCEL
and plotting the last two columns:
Using EXCEL to take the log(base 2) of the data:
Now using the "-log" option:
annotatePeaks.pl Oct4.peaks.txt mm8 -size 1000 -log
-d
H3K4me1-ChIP-Seq/ H3K4me3-ChIP-Seq/ > output.txt
Believe it or not, all of these X-Y plots show the same
data.
Interesting, eh?
Finding instances of motifs
near
peaks
Figuring out which peaks
have
instances of motifs found with
findMotifsGenome.pl is very easy. Simply use " -m <motif file1>
<motif file
2>..." with annotatePeaks.pl.
(Motif
files
can
be
concatenated
into
a
single
file
for
ease
of
use)
This
will
search for each of these motifs near each peak in
your peak file. Use " -size
<#>" to specify the size of the region
around the peak
center you wish to search. Found instances of each
motif will be
reported in additional columns of the output file.
For example:
annotatePeaks.pl
pu1peaks.txt
mm8
-size
200 -m pu1.motif cebp.motif
> output.txt
Opening output.txt with EXCEL:
Each instance of the motif
is
specified in the following format (separated by commas):
Distance from
Peak Center(Sequence
Matching Motif,Strand,Average
Conservation)
The average conservation
will not
be reported unless you specify "-cons".
Also,
when
finding
motifs, the average CpG/GC content will
automatically be reported since it has to extract peak
sequences from
the genome anyway.
There are also a bunch of motif specific options for
specialized
analysis:
"-norevopp"
(only
search
+
strand
relative to peak strand for motifs)
"-nmotifs"
(just
report
the
total
number
of
motifs
per
peak)
"-mscore"
(report the maximum
log-odds score of the motif in each peak)
"-rmrevopp
<#>" (tries to avoid
double counting reverse opposites within # bp)
"-mdist"
(reports distance to
closest motif)
"-fm <motif file
1> [motif file
2]" (list of motifs to filter out of results if
found)
"-mfasta
<filename>"
(reports sites in a fasta file - for building new
motifs)
"-mbed <filename>"
(Output motif positions to a BED file to load at UCSC -
see below)
"-matrix
<filename>"
(outputs a motif co-occurrence matrix with the p-value
of co-occurrence
assuming instance of each motif are independently
distributed amongst
the peaks)
Visualizing Motif positions
in the
UCSC Genome Browser
This feature may seem
slightly
out of place, but since annotatePeaks.pl
is the workhorse of HOMER, you
can add " -mbed
<filename>"
in
conjunction
with
" -m <motif file
1> [motif file 2] ..." to produce a BED file
describing motif
positions near peaks that can be loaded as a custom track
in the UCSC
genome browser. In the example below, you would load
motif.bed
as a custom track:
annotatePeaks.pl
pu1peaks.txt mm8
-size 200 -m pu1.motif cebp.motif -mbed motif.bed
>
output.txt
Finding the distance to other
sets
of Peaks
In order to find the nearest
peak
from another set of peaks, use "-p
<peak file 1> [peak file 2] ...".
This will add
columns to the output spreadsheet that will specify the
nearest peak ID
and the distance to that peak. If all you want is
the distance
(so you can sort this column), add the option "-pdist" to the
command.
Otherwise, if you prefer to count the number of peaks in
the peak file
found within the indicated regions (i.e. with "-size <#>"), add
"-pcount".
Creating
Histograms from High-throughput Sequencing data and Motifs
HOMER can be used to make
histograms that document sequencing library and
motif densities relative to specific positions in the
genome.
This can be done near peaks, subsets of peaks, or near
promoters, exon
junctions or anywhere else you find interesting. To
make
histograms, use the annotatePeaks.pl
program but add the parameters "-hist
<#>" to produce a tab delimited text file
that can then be
visualized using EXCEL or other data visualization
software.
Basic usage:
annotatePeaks.pl
<peak
file>
<genome>
-size
<#> -hist
<#> -d
<tag directory
1> [tag directory2] ... -m <motif 1> <motif
2> ...
> <output matrix file>
i.e. annotatePeaks.pl
ERpeaks.txt
hg18 -size 6000 -hist 25
-d MCF7-H3K4me1/ MCF7-H3K4me2/
MCF7-H3K4me3/ > outputfile.txt
Running this command is very similar to creating
annotated
peak
files - in fact, most of the data can be used to
make both types of files - hence the reason for combining
this
functionality in the same command. Be default, HOMER
normalizes
the output histogram such that the resulting units are per
bp per peak,
on
top
of
the
standard
total
mapped
tag
normalization
of 10
million tags.
Histograms of Tag
Directories:
For each tag directory or
motif,
HOMER will output 3 columns in the histogram. In the
case of tag
directories, the first column will indicate ChIP-Fragment Coverage,
which is
calculated by extending tags by their estimated
ChIP-fragment length,
and is analogous to the profiles made for the UCSC Genome
Browser. The 2nd and 3rd columns report the density
of 5' and 3'
aligned tags, and are independent of fragment
length. For
example, lets look at H3K4me2 distribution near Androgen
Receptor (AR)
peaks before and after 16 hours of treatment with
testosterone (dht):
annotatePeaks.pl
ARpeaks.txt
hg18
-size
4000
-hist
10
-d
H3K4me2-control/
H3K4me2-dht-16h/
>
outputfile.txt
Opening outputfile.txt with EXCEL, we see:
Graphing columns B and E while using column A for the
x-coordinates, we
get the following:
However, if we graph only the 5' and 3' tags that come
from the
H3K4me2-dht-16h directory (columns F and G):

Here we can see how the 5' and 3' reads from the H3K4me2
marked
nucleosomes are distributed near the AR sites.
Histograms of Motif
Densities:
Making histograms out of
motif
occurrences is very similar to sequencing tag
distributions. Run
the annotatePeaks.pl program with " -hist
<#>" and " -m
<motif
file>" (you can also find motif densities and
tag densities
at the same time):
annotatePeaks.pl
ARpeaks.txt
hg18
-size
1000
-hist
5
-m
are.motif
fox.motif
ap1.motif >
outputfile.txt
Graphing outputfile.txt with EXCEL:
Centering Peaks
on Motifs
One cool analysis strategy
is to
center peaks on a specific
motif. For example, by centering peak for the
Androgen Receptor
Transcription
Factor on the ARE motif (GNACANNNTGTNC), you can map the
spacial
relationship between the motif and other sequence features
and sequencing reads.
To center peaks on a motif, run annotatePeaks.pl
with the following
options:
annotatePeaks.pl
<peak
file>
<genome>
-size
<#>
-center
<motif
file> > newpeakfile.txt
i.e. annotatePeaks.pl
ARpeaks.txt
hg18r -size 200 -center are.motif >
areCenteredPeaks.txt
Now the idea is to use the
new
peak file to perform analysis:
annotatePeaks.pl
areCenteredPeaks.txt
hg18 -size 6000 -hist 10 -d H3K4me2-notx/ ... >
output.txt
Creating Heatmaps
from High-throughput Sequencing data
HOMER is not capable of
generating actual Heatmaps per se,
but
it
will
generate the data matrix (similar to a gene expression
matrix) than can
then be visualized using standard gene expression heatmap
tools.
For example, I will generate a heatmap data matrix file
using HOMER,
and then open it with Cluster
3.0
(Micheal
Eisen/de
Hoon) to cluster it and/or visualize it with Java Tree View
(by Alok J.
Saldanha). In reality, you can use any
clustering and/or
heatmap visualization software (i.e. R).
Basic usage (add " -ghist"
when
making a histogram):
annotatePeaks.pl
<peak
file>
<genome>
-size
<#>
-hist
<#> -ghist
-d <tag directory
1> [tag directory2] ... > <output matrix
file>
i.e. annotatePeaks.pl
ARpeaks.txt hg18
-size 6000 -hist 25 -ghist
-d H3K4me2-control/ H3K4me2-dht-16h/ >
outputfile.txt
Running this command is very similar to making histograms
with annotatePeaks.pl.
In
fact, a heatmap isn't really all that
different from a histogram - basically, instead of
averaging all of the
data from each peak, we keep data from each peak separate
and visualize
it all together in a heatmap. The key difference
when making a
heat map or a histogram is that you must add "-ghist" when making a
heatmap.
Format of Data Matrix Output
File
The resulting file is a
tab-delimited text file where the first row is a header
file and the
remaining rows represent each peak from the input peak
file. The
first set of columns will describe the read densities from
the first tag directory
as a
function of distance from the center of the peak, with bin
sizes
corresponding to the parameter used with " -hist <#>".
After the
first block of columns, a second block will start over
with read
densities from the 2nd
tag directory, and so on. If you would like
to cluster
this file to help organize the patterns,
make sure you only cluster the "genes" (i.e. rows).
Example: annotatePeaks.pl
ARpeaks.txt hg18 -size
6000 -hist 25 -ghist -d H3K4me2-control/
H3K4me2-dht-16h/ >
outputfile.txt
After creating the file, I loaded into Cluster 3.0 to
cluster and
clustered the genes using "centered correlation" as the
distance
metric, then loaded that output into Java Tree View.
General Options to Control Data Analysis Behavior
-strand
<+|-|both> (only look for motif etc. on +
or - strand,
default both)
-fragLength <#> (Fragment
length,
default=auto,
might
want to set to 0 for RNA)
-pc <#>
(maximum
number of tags to count per bp, default=0 [no maximum])
-cons (Retrieve
conservation information for peaks/sites - creates new
column for this
information)
-CpG (Calculate
CpG/GC
content)
-norevopp (do not
search
for motifs on the opposite strand [works with -center
too])
-norm <#>
(normalize tags to this tag count, default=1e7, 0=average
tag count in
all directories)
Use -noadj to disable
tag normalization
for sequencing depth
Command line options for annotatePeaks.pl
Usage: annotatePeaks.pl <peak file |
tss>
<genome version> [additional options...]
Available Genomes (required argument):
(name,org,directory,default promoter set)
User defined annotation files (default is
UCSC refGene annotation):
annotatePeaks.pl
accepts GTF (gene transfer formatted) files to annotate
positions relative
to custom
annotations, such as those from de novo transcript discovery
or Gencode.
-gtf <gtf
format file> (-gff and -gff3 can work for those files,
but GTF is better)
Peak vs. tss/tts/rna mode (works with
custom GTF file):
If the first
argument is "tss" (i.e. annotatePeaks.pl tss hg18 ...) then
a TSS centric
analysis will be
carried out. Tag counts and motifs will be found
relative to the TSS.
(no position file
needed) ["tts" now works too - e.g. 3' end of gene]
["rna" specifies
gene bodies, will automaticall set "-size given"]
NOTE: The default
TSS peak size is 4000 bp, i.e. +/- 2kb (change with -size
option)
-list <gene id
list> (subset of genes to perform analysis [unigene, gene
id, accession,
probe, etc.], default = all promoters)
-cTSS
<promoter position file i.e. peak file> (should be
centered on TSS)
Primary Annotation Options:
-mask (Masked
repeats, can also add 'r' to end of genome name)
-m <motif file
1> [motif file 2] ... (list of motifs to find in peaks)
-mscore (reports the highest log-odds score within the peak)
-nmotifs (reports the number of motifs per peak)
-mdist (reports distance to closest motif)
-mfasta <filename> (reports sites in a fasta file -
for building new motifs)
-fm <motif file 1> [motif file 2] (list of motifs to
filter from above)
-rmrevopp <#> (only count sites found within <#>
on both strands once, i.e. palindromic)
-matrix <prefix> (outputs a motif co-occurrence files:
prefix.count.matrix.txt - number of peaks with motif
co-occurrence
prefix.ratio.matrix.txt - ratio of observed vs.
expected co-occurrence
prefix.logPvalue.matrix.txt - co-occurrence enrichment
prefix.stats.txt - table of pair-wise motif co-occurrence
statistics
additional options:
-matrixMinDist <#> (minimum distance between motif
pairs - to avoid overlap)
-matrixMaxDist <#> (maximum distance between motif
pairs)
-mbed <filename> (Output motif positions to a BED file
to load at UCSC (or -mpeak))
-d <tag
directory 1> [tag directory 2] ... (list of experiment
directories to show
tag counts for) NOTE: -dfile <file> where file is a
list of directories in first column
-bedGraph
<bedGraph file 1> [bedGraph file 2] ... (read coverage
counts from bedGraph files)
-wig <wiggle
file 1> [wiggle file 2] ... (read coverage counts from
wiggle files)
-p <peak
file> [peak file 2] ... (to find nearest peaks)
-pdist to report only distance (-pdist2 gives directional
distance)
-pcount to report number of peaks within region
-vcf <VCF
file> (annotate peaks with genetic variation infomation,
one col per individual)
-editDistance (Computes the # bp changes relative to
reference)
-individuals <name1> [name2] ... (restrict analysis to
these individuals)
-gene <data
file> ... (Adds additional data to result based on the
closest gene.
This is useful for adding gene expression data. The
file must have a header,
and the first column must be a GeneID, Accession number,
etc. If the peak
cannot be mapped to data in the file then the entry will be
left empty.
-go <output
directory> (perform GO analysis using genes near peaks)
-genomeOntology
<output directory> (perform genomeOntology analysis on
peaks)
-gsize <#> (Genome size for genomeOntology analysis,
default: 2e9)
Annotation vs. Histogram mode:
-hist <bin
size in bp> (i.e 1, 2, 5, 10, 20, 50, 100 etc.)
The -hist option
can be used to generate histograms of position dependent
features relative
to the center of
peaks. This is primarily meant to be used with -d and
-m options to map
distribution of
motifs and ChIP-Seq tags. For ChIP-Seq peaks for a
Transcription factor
you might want to
use the -center option (below) to center peaks on the known
motif
** If using
"-size given", histogram will be scaled to each region (i.e.
0-100%), with
the -hist
parameter being the number of bins to divide each region
into.
Histogram Mode specific Options:
-nuc (calculated mononucleotide frequencies at each
position,
Will report by default if extracting sequence for other
purposes like motifs)
-di (calculated dinucleotide frequencies at each position)
-histNorm <#> (normalize the total tag count for each
region to 1, where <#> is the
minimum tag total per region - use to avoid tag spikes from
low coverage
-ghist (outputs profiles for each gene, for peak shape
clustering)
-rm <#> (remove occurrences of same motif that occur
within # bp)
Peak Centering: (other options are
ignored)
-center <motif
file> (This will re-center peaks on the specified motif,
or remove peak
if there is no motif in the peak. ONLY recentering
will be performed, and all other
options will be ignored. This will output a new peak
file that can then be reanalyzed
to reveal fine-grain structure in peaks (It is advised to
use -size < 200) with this
to keep peaks from moving too far (-mirror flips the
position)
-multi (returns
genomic positions of all sites instead of just the closest
to center)
Advanced Options:
-len <#> /
-fragLength <#> (Fragment length, default=auto, might
want to set to 0 for RNA)
-size <#>
(Peak size[from center of peak], default=inferred from peak
file)
-size #,# (i.e. -size -10,50 count tags from -10 bp to +50
bp from center)
-size "given" (count tags etc. using the actual regions -
for variable length regions)
-log (output tag
counts as log2(x+1+rand) values - for scatter plots)
-sqrt (output tag
counts as sqrt(x+rand) values - for scatter plots)
-strand
<+|-|both> (Count tags on specific strands relative to
peak, default: both)
-pc <#>
(maximum number of tags to count per bp, default=0 [no
maximum])
-cons (Retrieve
conservation information for peaks/sites)
-CpG (Calculate
CpG/GC content)
-ratio (process
tag values as ratios - i.e. chip-seq, or mCpG/CpG)
-nfr (report
nuclesome free region scores instead of tag counts, also
-nfrSize <#>)
-norevopp (do not
search for motifs on the opposite strand [works with -center
too])
-noadj (do not
adjust the tag counts based on total tags sequenced)
-norm <#>
(normalize tags to this tag count, default=1e7, 0=average
tag count in all directories)
-pdist (only
report distance to nearest peak using -p, not peak name)
-map <mapping
file> (mapping between peak IDs and promoter IDs,
overrides closest assignment)
-noann, -nogene
(skip genome annotation step, skip TSS annotation)
-homer1/-homer2
(by default, the new version of homer [-homer2] is used for
finding motifs)
|