homerTools - General sequence manipulation
homerTools is a utility program Chuck uses for basic sequence
manipulation: processing FASTQ files, extracting sequences from
genome FASTA files, and calculating nucleotide frequencies. To
run homerTools, do the following:
homerTools [command] [command specific options]
e.g. homerTools trim -3 AAAAAAAA s_1_sequence.txt
The following commands are available in homerTools:
barcodes - for separating and removing 5' barcodes from FASTQ/FASTA files
trim - for trimming by adapter sequence, specific lengths, etc. from FASTQ/FASTA files
freq - for calculating nucleotide frequencies in FASTQ/FASTA/txt sequence files
extract - for extracting specific regions of sequence from genomic FASTA files
Separating 5' Barcodes:
To separate and remove 5'
barcodes from sequencing data (where the first "x" base
pairs of the read are the barcode):
homerTools barcodes <# length of barcode> [options] <sequence file1> [sequence file2] ...
e.g. homerTools barcodes 3 s_1_sequence.txt (removes the first 3 bp as the barcode and sorts the reads by barcode)
The first argument after "barcodes" must be the length of the
5' barcode, i.e. the number of base pairs at the start of each
read that make up the barcode. By default, this command creates
files named "filename.barcode", such as s_1_sequence.txt.AAA,
s_1_sequence.txt.AAC, s_1_sequence.txt.AAG, etc. The parameter
"-min <#>" specifies the minimum barcode frequency to keep
(default is 0.02 [2%]). The frequency of each barcode is
recorded in the output file "filename.freq.txt". If important
barcodes were discarded, rerun the command with a smaller value
for "-min <#>".
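The barcode separation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not HOMER's actual implementation: the function name split_by_barcode and its arguments are hypothetical, reads are plain sequence strings rather than FASTQ records, and reads shorter than the barcode are simply skipped.

```python
from collections import Counter, defaultdict


def split_by_barcode(reads, barcode_len, min_freq=0.02):
    """Group reads by their first `barcode_len` bases and strip the barcode.

    Returns (groups, freqs). `groups` maps each kept barcode to the list of
    reads with the barcode removed; `freqs` maps every observed barcode to
    its fraction of all usable reads (what a freq file would record).
    """
    usable = [r for r in reads if len(r) >= barcode_len]
    counts = Counter(r[:barcode_len] for r in usable)
    total = sum(counts.values())
    freqs = {bc: n / total for bc, n in counts.items()}

    groups = defaultdict(list)
    for read in usable:
        barcode = read[:barcode_len]
        # Keep only barcodes at or above the minimum frequency
        # (0.02 mirrors the documented default of -min).
        if freqs[barcode] >= min_freq:
            groups[barcode].append(read[barcode_len:])
    return dict(groups), freqs


reads = ["AAACGT", "AAATTT", "AAAGGG", "CCCACG"]
groups, freqs = split_by_barcode(reads, 3, min_freq=0.5)
print(groups)  # {'AAA': ['CGT', 'TTT', 'GGG']}
print(freqs)   # {'AAA': 0.75, 'CCC': 0.25}
```

In this toy run the "CCC" barcode falls below the 0.5 cutoff and is dropped, which is the situation where lowering the minimum frequency would bring it back.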
Trimming Sequence Files
With all the fancy types of
sequencing being done, it is getting common to find
adapters as part of the sequences that are analyzed.
The trim command allows users to trim sequences from the 3' and
5' ends, either by removing a specific number of nucleotides or
by removing a specific adapter sequence. The basic command is
executed like this:
homerTools trim [options] <sequence file1> [sequence file2]
The output will be placed in files "filename.trimmed" and
the distribution of sequence lengths after trimming will
be in "filename.lengths" for each of the input
files. The following options control how homerTools
trims the sequences:
-len <#> (trim sequences to this length)
-min <#> (remove sequences that are shorter than this after trimming)
-3 <#> (trim this many bp off the 3' end of the sequence)
-5 <#> (trim this many bp off the 5' end of the sequence)
-3 <ACGT> (trim adapter sequence, e.g. "-3 GGAGGATTT", from the 3' end of the sequence)
-5 <ACGT> (trim adapter sequence, e.g. "-5 GGAGGATTT", from the 5' end of the sequence)
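As a rough illustration of the fixed-length options, the Python sketch below trims a single read. The function name trim_fixed and its parameters are made up for this example, and the order in which the steps are applied (5' trim, then 3' trim, then length cap, then minimum-length filter) is an assumption, not something confirmed for homerTools.

```python
def trim_fixed(read, trim5=0, trim3=0, keep_len=None, min_len=1):
    """Trim a read by fixed amounts; return None if it ends up too short.

    trim5 / trim3 : number of bases to remove from the 5' / 3' end
    keep_len      : if given, keep only the first `keep_len` bases
    min_len       : reads shorter than this after trimming are discarded
    NOTE: the order of these steps is an assumption made for this sketch.
    """
    trimmed = read[trim5:]
    if trim3 > 0:
        # Guard against trimming more bases than the read has left.
        trimmed = trimmed[:-trim3] if trim3 < len(trimmed) else ""
    if keep_len is not None:
        trimmed = trimmed[:keep_len]
    return trimmed if len(trimmed) >= min_len else None


print(trim_fixed("ACGTACGTAC", trim5=2, trim3=3))             # GTACG
print(trim_fixed("ACGTACGTAC", keep_len=4))                   # ACGT
print(trim_fixed("ACGTACGTAC", trim5=2, trim3=3, min_len=6))  # None
```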
For adapter sequence trimming, homerTools searches for the
first full match to the adapter and deletes it together with
the rest of the sequence. For example, if you specify "-3 AA",
it will find the first instance of "AA" and delete it along
with everything after it. It will also delete partial matches
if they are at the end of the sequence (or the beginning for
5'). As another example, our lab uses an amplification strategy
for RNA that results in the ligation of a polyA tail to the RNA
sequence. If the reads are long enough, the end of the read
will be just As, e.g.
GAGATTATCTACGTACCGAAAAAAAAAAAAAAAAAA
Trimming with "-3 AAAAAAAAA" will cleave the complete polyA
stretch.
In this example: GAGATTATCTACGTACCGTACTGCATGACGGGAAAA,
only the final 4 As would be trimmed.
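The two polyA examples above can be reproduced with a small Python sketch of 3' adapter trimming. It is an illustration under stated assumptions rather than HOMER's code: the function name trim_3prime_adapter is hypothetical, only exact matches are handled (no mismatches), and the minimum partial-match length defaults to half the adapter length, mirroring the default documented for -minMatchLength.

```python
def trim_3prime_adapter(read, adapter, min_match=None):
    """Remove a 3' adapter from `read` (exact matching only).

    1. If the full adapter occurs, cut at its first occurrence.
    2. Otherwise, if the read ends with a prefix of the adapter that is
       at least `min_match` bases long, cut that partial match off.
    """
    if min_match is None:
        # Assumed default: half the adapter length (integer division).
        min_match = max(1, len(adapter) // 2)

    # Step 1: first full match; drop the adapter and everything after it.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]

    # Step 2: longest partial match at the very end of the read.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_match - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


print(trim_3prime_adapter("GAGATTATCTACGTACCGAAAAAAAAAAAAAAAAAA", "AAAAAAAAA"))
# GAGATTATCTACGTACCG
print(trim_3prime_adapter("GAGATTATCTACGTACCGTACTGCATGACGGGAAAA", "AAAAAAAAA"))
# GAGATTATCTACGTACCGTACTGCATGACGGG
```

The first read contains a full copy of the adapter, so the whole polyA stretch is removed. The second read only ends with 4 As, which counts as a partial match at the edge of the read.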
Common trimming tasks
TruSeq adapter trimming:
homerTools trim -3 GATCGGAAGAGCACACGTCT -mis 1 -minMatchLength 4 -min 15 file.fastq
Small RNA adapter trimming:
homerTools trim -3 TCGTATGCCGTCTTCTGCTTGT -mis 1 -minMatchLength 4 -min 15 file.fastq
Extracting Genomic Sequences From FASTA Files
The extract command can be used to extract large numbers of
specific genomic sequences. The first input file you need is a
HOMER-style peak file or a BED file with genomic locations.
Next, you must have the genomic DNA sequences in one of two
formats: (1) a directory of chr1.fa, chr2.fa, ... FASTA files
(these can be masked files like *.fa.masked), or (2) a single
FASTA file with all of the chromosomes concatenated. The
sequences are sent to stdout as a tab-delimited file, or as a
FASTA-formatted file if "-fa" is added to the end of the
command. Save the output to a file by adding
"> outputfile.txt" to the end of the command. The program is
run like this:
homerTools extract <peak/BED file> <FASTA directory or file location> [-fa]
e.g. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ > outputSequences.txt
Or, to get FASTA output instead, e.g. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ -fa > outputSequence.fa
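To make the inputs and outputs of the extract step concrete, here is a minimal Python sketch that pulls BED regions out of FASTA sequences. It is not how homerTools is implemented. The helper names (parse_fasta, reverse_complement, extract_bed_regions) are hypothetical, coordinates are assumed to follow the standard BED convention (0-based start, end not included), minus-strand regions are assumed to be returned as the reverse complement, and the tab-delimited output is assumed to be "name<tab>sequence".

```python
def parse_fasta(lines):
    """Return {sequence name: sequence} from an iterable of FASTA lines."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_bed_regions(bed_lines, genome, as_fasta=False):
    """Return one output record per BED line found in `genome`."""
    records = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue  # skip headers and malformed lines
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        strand = fields[5] if len(fields) > 5 else "+"
        if chrom not in genome:
            continue
        seq = genome[chrom][start:end]  # BED: 0-based, end-exclusive
        if strand == "-":
            seq = reverse_complement(seq)
        records.append(f">{name}\n{seq}" if as_fasta else f"{name}\t{seq}")
    return records


genome = parse_fasta([">chr1 test", "ACGTACGTAC", "GGGGCCCC"])
bed = ["chr1\t0\t4\tp1\t0\t+", "chr1\t8\t12\tp2\t0\t-"]
print(extract_bed_regions(bed, genome))
# ['p1\tACGT', 'p2\tCCGT']
```

Passing as_fasta=True to the same call plays the role of the "-fa" switch in this sketch and returns records such as ">p1" followed by the sequence on the next line.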
Calculating Nucleotide Frequencies
The freq command will calculate nucleotide frequencies from
FASTQ, FASTA, or tab-delimited text sequence files. The program
tries to auto-detect the format, but it may help to specify the
format directly ("-format fastq", "-format fasta",
"-format tsv"). The program outputs a position-dependent
nucleotide/dinucleotide frequency file as a function of the
distance from the start of the sequencing reads. The output is
sent to stdout unless you specify "-o <outputfile.txt>". If you
specify "-gc <outputfile2.txt>", the program will also create a
file that specifies the cumulative frequency of CpG, total G+C,
total A+G, and total A+C in each individual sequence.
homerTools freq -format fastq s_1_sequence.txt > s_1.frequency.txt
homerTools freq -format fastq s_1_sequence.txt -gc GCdistribution.txt -o positionFrequency.txt
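What "position-dependent nucleotide frequency" means can be shown with a short Python sketch. It is a simplified illustration, not the freq command itself: the function names position_frequencies and gc_content are hypothetical, dinucleotide and CpG statistics are left out, and the default maximum length is taken from the first sequence, as the option list documents for -maxlen.

```python
from collections import Counter


def position_frequencies(seqs, max_len=None):
    """Return a list with one {base: frequency} dict per read position."""
    seqs = [s.upper() for s in seqs]
    if not seqs:
        return []
    if max_len is None:
        max_len = len(seqs[0])  # assumed default: length of the 1st sequence

    counts = [Counter() for _ in range(max_len)]
    for seq in seqs:
        for i, base in enumerate(seq[:max_len]):
            counts[i][base] += 1

    table = []
    for c in counts:
        total = sum(c.values())  # includes N or other symbols, if any
        table.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return table


def gc_content(seq):
    """Fraction of G and C bases in a single sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


table = position_frequencies(["ACGT", "AGGT", "ACGA", "TCGT"])
print(table[0])            # {'A': 0.75, 'C': 0.0, 'G': 0.0, 'T': 0.25}
print(table[2])            # {'A': 0.0, 'C': 0.0, 'G': 1.0, 'T': 0.0}
print(gc_content("ACGT"))  # 0.5
```

Each entry of the table corresponds to one row of a position-frequency file: the first dict describes the first base of every read, the second dict the second base, and so on.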
homerTools Command Line options:
Usage: homerTools <command> [--help | options]
Collection of tools for sequence manipulation
Commands: [type "homerTools <command>" to see individual command options]
barcodes - separate FASTQ file by barcodes
trim - trim adapter sequences or fixed sizes from FASTQ files(also splits)
freq - calculate position-dependent nucleotide/dinucleotide frequencies
extract - extract specific sequences from FASTA file(s)
decontaminate - remove bad tags from a contaminated tag directory
cluster - hierarchical clustering of a NxN distance matrix
special - specialized routines (i.e. only really useful for chuck)
Options for command: barcode
-min <#> (Minimum frequency of barcodes to keep: default=0.020
-freq <filename> (output file for barcode frequencies, default=file.freq.txt)
-qual <#> (Minimum quality score for barcode nucleotides, default=not used)
-qualBase <character> (Minimum quality character in FASTQ file, default=B)
Options for command: trim
-3 <#|[ACGT]> (trim # bp or adapter sequence from 3' end of sequences)
-5 <#|[ACGT]> (trim # bp or adapter sequence from 5' end of sequences)
-mis <#> (Maximum allowed mismatches in adapter sequence, default: 0)
-minMatchLength <#> (minimum adapter sequence at edge to match, default: half adapter length)
-len <#> (Keep first # bp of sequence - i.e. make them the same length)
-stats <filename> (Output trimming statistics to filename, default: sent to stdout)
-min <#> (Minimum size of trimmed sequence to keep, default: 1)
-max <#> (Maximum read length, default: 100000)
-suffix <filename suffix> (output is sent to InuptFileName.suffix, default: trimmed)
-lenSuffix <filename suffix> (length distribution is sent to InuptFileName.suffix, default: lengths)
-split <#> (Split reads into two reads at bp #, output to trimmed1 and trimmed2)
-revopp <#> (Return reverse opposite of read [if used with -split, only the 2nd half of the read will be retuned as reverse opposite])
Options for command: freq
-format <tsv|fasta|fastq> (sequence file format, default: auto detect)
-offset <#> (offset of first base in output file, default: 0)
-maxlen <#> (Maximum length of sequences to consider, default: length of 1st seq)
-o <filename> (Output filename, default: output sent to stdout)
-gc <filename> (calculate CpG/GC content per sequence output to "filename")
OutputFormat: name<tab>CpG<tab>GC<tab>AG<tab>AC<tab>Length
Options for command: extract
-fa (output sequences in FASTA format - default is tab-delimited format)
Alternate Usage: homerTools extract stats <Directory of FASTA files>
Displays stats about the genome files (such as length)
Options for command: decontaminate
-frac <#> (Estimate fraction of sample that is contaminated, default: auto)
-estimateOnly (Only estimate the contamination, do not decontaminate)
-o <output tag directory> (default: overrites contaminated tag directory)
-size <#> (Peak size for estimating contamination/Max distance from contaminant reads to remove contaminated reads, default: 250)
-min <#> (Minimum tag count to consider when estimating contamination, default: 20)