|
Also check out this
page in the Basic NGS Tutorial (It's more
up-to-date)
Alignment of High-throughput Sequencing Data
Homer does not perform
alignment - this is something that must be done before
running homer. Several quality tools are available
for alignment of short reads to large genomes. Check
out this
link for a list of programs that do short read
alignment. BLAST, BLAT, and other traditional
alignment programs, while great at what they do, are not
practical for alignment of these types of data.
If you need help deciding on a program to use, I'll
recommend Bowtie
(it's nice and fast).
If you have a core that maps your data for you, don't
worry about this step. However, in many cases there
is public data available that hasn't been mapped to the
genome or mapped to a different version of the genome or
mapped with different parameters. In these cases it
is nice to be able to map data yourself to keep a nice,
consistent set of data for analysis.
Most types of ChIP-Seq/DNase-Seq/MNase-Seq and GRO-Seq
simply need to be mapped to the genome, as they represent
the sequencing of genomic DNA (or nascent RNA, which
should not be spliced yet). If analyzing RNA-Seq,
you may be throwing away interesting information about
splicing if you simply align the data to the genome.
If aligning RNA, I'd recommend sticking to the formal wear
and trying Tophat,
which does a good job of identifying splice junctions in
your data.
Which reference genome (version) should I map my reads
to?
Both the organism and the
exact version
(i.e. hg18, hg19) are very important when mapping
sequencing reads. Reads mapped to one version are
NOT interchangeable with reads mapped to a different
version. I would follow this recommendation list when
choosing a genome (Obviously try to match species or sub
species when selecting a genome):
- Do you have a favorite genome in the lab that
already has a bunch of experiments mapped to it?
Use that one.
- Do any of your collaborators have a favorite genome?
- Use the latest stable release - I would recommend
using genomes curated at UCSC so that you can easily
visualize your data later using the UCSC Genome Browser.
(i.e. mm9, hg18)
Q: I'm changing genome
versions, can I just "liftover" my data using UCSC
liftover tool, or do I need to remap it to the new
genome version?
If you want to do it right,
you need to remap it. This is because some regions
of the genome that are considered "unique" in one version
may suddenly be found multiple times in the new version
and vice versa, so using the liftover tool will yield
different results from remapping. However, liftover
is fine if you're looking for a quick and dirty solution.
If you fell like cheating, as Chuck often does, try convertCoordinates.pl.
- it's a wrapper that uses the "liftOver" program to
migrate peak files and whole Tag Directores.
Should I trim my reads when mapping to the genome?
Depends. In the old
days, the read quality dropped off quite a bit past ~30
bp, but these days even the end of sequencing reads are
pretty high quality. In the end, I would recommend
mapping ~32 bp reads with up the 3 mismatches, using only
the uniquely alignable reads for downstream
analysis. That will give you access to probably
80-90% of what is interesting in your data set.
I have barcodes and/or adpater sequences in my
reads. Should I remove them first or just map them?
Example - Alignment with bowtie:
Step 1 - Build Index (takes a while, but only do this
once):
After installing bowtie,
the reference genome must first be "indexed" so that
reads may be quickly aligned. You can download
pre-made indices from the bowtie website (check for
those here
first). Otherwise, to perform make your own from
FASTA files, do the following:
- Download FASTA files for the unmasked genome of
interest if you haven't already (i.e. from UCSC)
- From the directory containing the FASTA files, run
the "bowtie-build" command. For example, for
hg18:
- /path-to-bowtie-programs/bowtie-build
chr1.fa,chr2.fa,chr3.fa,...chrY.fa,chrM.fa hg18
- Where
...
are
the
rest
of
the
*.fa
files.
This
command
will
take
a
long
time
to run, but will produce several files named hg18.*.ebwt
- Copy the *.ebwt files to the bowtie indexes
directory so that bowtie knows where to find them
later:
- cp *.ebwt /path-to-bowtie-programs/indexes/
Step 2 - Align sequences with bowtie (perform for each
experiment):
The most common output
format for high-throughput sequencing is FASTQ format,
which contains information about the sequence (A,C,G,Ts)
and quality information which describes how certain the
sequencer is of the base calls that were made. In
the case of Illumina sequencing, the output is usually a
" s_1_sequence.txt"
file. In addition, much of the data available in
the SRA,
the primary archive of high-throughput sequencing data,
is in this format. To map this data, run the
following command:
/path-to-bowtie-programs/bowtie
-q --best -m 1 -p <# cpu> <genome>
<fastq file> <output filename>
Where <genome> would be hg18 from the
index made above, <fastq file> could be
"s_1_sequence.txt", and <output filename>
something like "s_1_sequence.hg18.alignment.txt"
The parameters "--best" and "-m 1" are needed to make
sure bowtie outputs only unique alignments. There
are many options and many different ways to perform
alignments, with different trade-offs for different
types of projects - well beyond the scope of what I am
describing here.
NOTE: HOMER
contains automated parsing for uniquely aligned reads
from output files generated with bowtie in this
fashion. Homer also accepts *eland_result.txt and
*_export.txt formats from the Illumina pipeline.
If different programs are used, or special parsing of
output files are needed, please parse/reformat alignment
files to general BED format, which is also accepted by
HOMER. HOMER also accepts SAM formatted
file. If using BAM files, use "samtools view
input.bam > output.sam" to convert to a SAM file.
|