This is the old version of the documentation: New Version
ChIP-Seq Analysis: Finding Enriched Motifs in ChIP-Seq Peaks
HOMER was initially developed to automate the process of finding
enriched motifs in ChIP-Seq peaks. More generally, HOMER analyzes
genomic positions, not limited to only ChIP-Seq peaks, for enriched
motifs. The main idea is that all the user really needs is a file
containing genomic coordinates (i.e. peak file), and HOMER will
generally take care of the rest.
To analyze a peak file for motifs, run the following command:
findMotifsGenome.pl <peak file> <genome> <output directory> [options]
e.g. findMotifsGenome.pl ERpeaks.txt hg18r ER_MotifOutput/ -len 8,10,12
A variety of output files will be placed in the <output
directory>, including html pages showing the results.
The findMotifsGenome.pl program is a wrapper that helps set up the data
for analysis using the HOMER motif discovery algorithm. By
default this will perform de novo
motif discovery as well as check the enrichment of known motifs.
If you have not done so already, please look over this page describing how
HOMER analyzes sequences for enriched motifs.
An important prerequisite for analyzing genomic motifs is that the
appropriate genome must be configured for use with HOMER.
Acceptable Input files
findMotifsGenome.pl accepts files
in HOMER's "peak file" format. The minimum file requirements are
as follows (separated by TABs):
- Column1: Unique Peak ID
- Column2: chromosome
- Column3: starting position
- Column4: ending position
- Column5: Strand (+/- or 0/1, where 0="+", 1="-")
Additional columns will be ignored. If starting with a BED file,
convert it to a peak file using the bed2pos.pl program. If using
Excel on Mac OS, make sure to save files as "Text (Windows)". If
errors occur, it is likely that the file is not in the correct format,
or that the first column is not actually populated with unique
identifiers.
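The column requirements above can be checked with a short script before running HOMER. The following is a minimal sketch, not a substitute for HOMER's own checkPeakFile.pl; the function name `check_peak_lines` is hypothetical, and skipping blank lines and lines starting with `#` is an assumption.

```python
VALID_STRANDS = {"+", "-", "0", "1"}  # strand encodings listed for Column5


def check_peak_lines(lines):
    """Return a list of problems found in peak-file lines (hypothetical helper)."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        # Assumption: blank lines and '#' comment lines are not peaks.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            problems.append(
                f"line {number}: expected at least 5 tab-separated columns, "
                f"found {len(fields)}"
            )
            continue
        peak_id, chrom, start, end, strand = fields[:5]  # extra columns ignored
        if peak_id in seen_ids:
            problems.append(f"line {number}: duplicate peak ID {peak_id!r}")
        seen_ids.add(peak_id)
        if not (start.isdigit() and end.isdigit()):
            problems.append(f"line {number}: start/end are not integers")
        if strand not in VALID_STRANDS:
            problems.append(f"line {number}: unrecognized strand {strand!r}")
    return problems
```

Passing the lines of a file (for example `check_peak_lines(open("ERpeaks.txt"))`) returns an empty list when no problems are detected.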
!!! COMMON PROBLEM: If this
program isn't working, make sure you save your peak files as "Text
(Windows)" from Excel when on a Mac. Run the checkPeakFile.pl program to see if
your file is in the correct format, and use changeNewLine.pl if you didn't save
your file in "Text (Windows)" format.
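The underlying issue is presumably the line-ending convention: older Mac text exports end lines with a bare carriage return, which line-oriented tools read as one giant line. The sketch below shows the kind of normalization involved. It is an illustration only, not the implementation of changeNewLine.pl, and the function name `normalize_newlines` is hypothetical.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CR LF) and old Mac (CR) line endings to Unix (LF)."""
    # Replace CR LF first so that it does not turn into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Example with an in-memory peak file that uses CR-only line endings.
raw = b"peak1\tchr1\t100\t200\t+\rpeak2\tchr1\t300\t400\t-\r"
fixed = normalize_newlines(raw)
```

Working on bytes avoids Python's automatic newline translation, which would otherwise hide the problem when the file is opened in text mode.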
!!! OK - an even MORE common PROBLEM, particularly if
you use a different peak finding program: make SURE you use UNIQUE peak
IDs. The point of having a peak ID is so that you can tell peaks
apart, so having duplicates is a horrible idea!!! Repeated peak IDs
will cause older versions of HOMER to crash!! The program
renamePeaks.pl is now included to rename the peaks if you need help
with this.
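For readers preparing peak files in their own scripts, duplicate IDs can also be resolved before the file ever reaches HOMER. This is a minimal sketch of one possible renaming scheme (appending a numeric suffix); it is not how renamePeaks.pl works internally, and `make_ids_unique` is a hypothetical name.

```python
def make_ids_unique(peak_ids):
    """Return peak IDs in the same order, with repeats given a numeric suffix."""
    used = set()
    result = []
    for peak_id in peak_ids:
        candidate = peak_id
        suffix = 1
        # Keep trying "<id>-2", "<id>-3", ... until the name is unused.
        while candidate in used:
            suffix += 1
            candidate = f"{peak_id}-{suffix}"
        used.add(candidate)
        result.append(candidate)
    return result
```

The first occurrence of each ID is left unchanged, so files that already have unique IDs pass through unmodified.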
Important motif finding parameters
Region Size ("-size <#>") -
this specifies the size of the region (centered on the peak centers) to
search for motifs. I'd recommend 50 bp for establishing the primary
motif bound by a given transcription factor, 200 bp for finding
"co-enriched" motifs for a transcription factor, and 1000 bp for
searching H3K4me or H3/H4 acetylated regions.
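To make "centered on the peak centers" concrete, the sketch below computes a fixed-size window around the midpoint of a peak. The function name `resize_peak` is hypothetical, and the rounding of the midpoint and the absence of clipping at chromosome ends are assumptions rather than a description of HOMER's exact behavior.

```python
def resize_peak(start, end, size):
    """Return (new_start, new_end) for a window of `size` bp around the peak center."""
    center = (start + end) // 2      # assumption: midpoint rounded down
    new_start = center - size // 2   # no clipping at chromosome boundaries
    return new_start, new_start + size


# A 200 bp peak reduced to the 50 bp window that "-size 50" would focus on.
window = resize_peak(1000, 1200, 50)
```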
Motif length ("-len <#>" or "-len <#>,<#>,...") - specifies the length
of motifs to be found. HOMER will find motifs of each size
separately and then combine the results at the end. The time it takes
to find motifs increases greatly with increasing length. In general,
it's best to try out enrichment with shorter lengths (e.g. 8 and/or 10)
before trying longer lengths (e.g. 12 or 14). Much longer motifs can be
found with HOMER, but it's best to use smaller sets of sequence when
trying to find long motifs (e.g. use "-len 20 -size 50"); otherwise it
may take far too long (or use too much memory).
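One way to see why longer motifs cost so much more is to count the distinct DNA oligos of each length, which grows as 4 to the power of the length. The snippet below is only an illustration of that growth, not a description of HOMER's internal search; `oligo_space` is a hypothetical name.

```python
def oligo_space(length):
    """Number of distinct DNA sequences (A, C, G, T) of the given length."""
    return 4 ** length


# Compare the lengths mentioned in the text above.
for length in (8, 10, 12, 14, 20):
    print(f"-len {length}: {oligo_space(length):,} possible oligos")
```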
Number of motifs to find ("-S <#>") - specifies the number of motifs of
each length to find (recommend "-S 50" or "-S 100").
Normalize GC% content instead of CpG% content ("-gc"), or disable GC/CpG
normalization ("-noweight").
Use custom background regions ("-bg
<peak file of background regions>") - these will still be
normalized for CpG% or GC% content just like randomly chosen sequences.
By default, findMotifsGenome.pl
uses the binomial distribution to score motifs. This works well
when the background sequences greatly outnumber the target
sequences. However, if you are using the "-bg" option above and the
number of background sequences is smaller than the number of
target sequences, it is a good idea to use the hypergeometric
distribution instead ("-h").
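The difference between the two scoring choices can be sketched with textbook formulas. The code below assumes the binomial test takes the motif frequency in the background as a fixed success probability, while the hypergeometric test treats targets as a draw without replacement from the pooled target and background sequences. These are standard definitions, not HOMER's actual code, and all function names and counts are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p). Suitable for small counts only."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def hypergeometric_tail(k, n_target, hits_total, n_total):
    """P(X >= k) when n_target sequences are drawn from n_total, hits_total with the motif."""
    upper = min(n_target, hits_total)
    favorable = sum(
        comb(hits_total, i) * comb(n_total - hits_total, n_target - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(n_total, n_target)


# Hypothetical counts: 100 targets (30 with the motif), 50 background (5 with it).
p_background = 5 / 50
print(binomial_tail(30, 100, p_background))
print(hypergeometric_tail(30, 100, 30 + 5, 100 + 50))
```

With only 50 background sequences, the background frequency used by the binomial is itself a noisy estimate, which is one reason to prefer the hypergeometric form in that situation. Real tools typically work in log space to handle large counts.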
Find enrichment of individual oligos ("-oligo"). This creates output
files in the output directory named oligo.length.txt.
Force findMotifsGenome.pl to re-preparse genome for the given region
size ("-preparse").
How findMotifsGenome.pl works
There are a series of steps that
the program goes through to find quality motifs:
- Extract sequences from the genome corresponding to the
peaks in the input file
- Remove sequences with >70% Ns
- Calculate the CpG/GC content of input sequences
- (If not done during a previous run) Preparse genome for
control fragments of the specified size
- Randomly select background sequences matching
CpG characteristics of input sequences
- Perform de novo
motif finding
- Generate output files for de
novo motif finding
- Check enrichment of known motifs
- Generate output files for known motif enrichment
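The sequence filtering and content calculation steps above can be sketched as follows. The exact definitions HOMER uses are not given here, so this sketch assumes GC content is the fraction of G or C bases and CpG content is the number of CG dinucleotides divided by sequence length. All function names are hypothetical.

```python
def n_fraction(seq):
    """Fraction of bases that are N (unknown)."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 1.0


def gc_content(seq):
    """Fraction of bases that are G or C (assumed definition)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def cpg_content(seq):
    """CG dinucleotides per base (assumed definition)."""
    seq = seq.upper()
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    return seq.count("CG") / len(seq) if seq else 0.0


def keep_sequences(seqs, max_n=0.7):
    """Drop sequences whose N fraction exceeds max_n (70% by default)."""
    return [s for s in seqs if n_fraction(s) <= max_n]
```

Background selection would then pick control sequences whose content values are distributed similarly to those of the retained target sequences.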
Interpreting motif finding results
The format of the output files
generated by findMotifsGenome.pl
is identical to that of the files generated by the promoter-based version, findMotifs.pl (description).
In general, when analyzing ChIP-Seq / ChIP-Chip peaks you should expect
to see strong enrichment for a motif resembling the site recognized by
the DNA binding domain of the factor you are studying. Enrichment
p-values reported by HOMER should be highly significant (i.e.
<< 1e-50). If this is not the case, there is a strong
possibility that the experiment may have failed in one way or
another. For example, the peaks could be of low quality because
the factor is not highly expressed.
Practical Tips for Motif finding
Command-line options for findMotifsGenome.pl
Program will find de novo and known motifs in regions in the genome
Usage: findMotifsGenome.pl <pos file> <genome> <output directory> [additional options]
Example: findMotifsGenome.pl peaks.txt mm8r peakAnalysis -size 200 -len 8
Basic options:
-bg <background position file> (genomic positions to be used as background, default=automatic)
-len <#>[,<#>,<#>...] (motif length, default=10) [NOTE: values greater than 12 may cause the program to run out of memory - in these cases decrease the number of sequences analyzed (-N)]
-size <#> (fragment size to use for motif finding, default=200)
-S <#> (Number of motifs to optimize, default: 100)
-mis <#> (global optimization: searches for strings with # mismatches, default: 2)
-depth [low|med|high|allnight] (time spent on local optimization default: med)
Scanning sequence for motifs
-find <motif file> (This will cause the program to only scan for motifs)
Known Motif Options
-mcheck <motif file> (known motifs to check against de novo motifs, default: /bioinformatics/homer/data/knownTFs/all.motifs)
-mknown <motif file> (known motifs to check for enrichment, default: /bioinformatics/homer/data/knownTFs/known.motifs)
Sequence normalization options:
-tss (normalize based on distance from TSS)
-cgtss (normalize based on CpG content and distance from TSS)
-cg DEFAULT (normalize based on CpG content)
-noweight (no CG correction)
Advanced options:
-h (use hypergeometric for p-values, binomial is default)
-N <#> (Number of sequences to use for motif finding, default=max(50k, 2x input))
-noforce (will attempt to reuse sequence files etc. that are already in output directory)
-local <#> (use local background, # of equal size regions around peaks to use i.e. 2)
-gc (use GC% instead of CpG% for sequence content normalization [NOT WORKING...])
-noknown (don't search for known motif enrichment)
-nocheck (don't search for de novo vs. known motif similarity)
-nomotif (don't search for de novo motif enrichment)
-norevopp (don't search reverse strand for motifs)
-redundant <#> (Remove redundant sequences matching greater than # percent, i.e. -redundant 0.5)
-float (allow Homer to adjust the degeneracy threshold for known motifs to get best p-value [dangerous])
-mask <motif file1> [motif file 2]... (motifs to mask before motif finding)
-refine <motif file1> (motif to optimize)
-rand (randomize target and background sequences labels)
-ref <peak file> (use file for target and background - first argument is list of peak ids for targets)
-oligo (perform analysis of individual oligo enrichment)
-dumpFasta (Dump fasta files for target and background sequences for use with other programs)
-preparse (force new background files to be created)