HOMER

Software for motif discovery and next-gen sequencing analysis

homerTools - General sequence manipulation

homerTools is a utility program Chuck uses for basic sequence manipulation of FASTQ files, extracting sequences from genome FASTA files, and calculating nucleotide frequencies. To run homerTools do the following:

homerTools [command] [command specific options]

e.g. homerTools trim -3 AAAAAAAA s_1_sequence.txt

The following commands are available in homerTools:

barcodes - for separating and removing 5' barcodes from FASTQ/FASTA files
trim - for trimming by adapter sequence, specific lengths, etc. from FASTQ/FASTA files
freq - for calculating nucleotide frequencies in FASTQ/FASTA/txt sequence files
extract - for extracting specific regions of sequence from genomic FASTA files

Separating 5' Barcodes:

To separate and remove 5' barcodes from sequencing data (where the first "x" base pairs of the read are the barcode):

homerTools barcodes <# length of barcode> [options] <sequence file1> [sequence file2] ...

e.g. homerTools barcodes 3 s_1_sequence.txt (removes the first 3 bp as the barcode and sorts the reads by barcode)

The argument immediately after "barcodes" (the 3rd word on the command line) must be the length of the 5' barcode, i.e. the number of base pairs at the start of each read that make up the barcode. By default, this command creates files named "filename.barcode", such as s_1_sequence.txt.AAA, s_1_sequence.txt.AAC, s_1_sequence.txt.AAG, etc. The parameter "-min <#>" specifies the minimum barcode frequency to keep (default is 0.02 [2%]). The frequency of each barcode is recorded in the output file "filename.freq.txt". If important barcodes were dropped, rerun the command with a smaller value for "-min <#>".
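
The splitting logic described above can be sketched in Python. This is only an illustration of the idea, not homerTools' actual implementation: the function name split_by_barcode, the decision to skip reads that are no longer than the barcode, and the use of ">=" for the frequency threshold are all assumptions.

```python
from collections import defaultdict


def split_by_barcode(reads, barcode_len, min_freq=0.02):
    """Group reads by their first `barcode_len` bases and strip the barcode.

    Returns (kept, freqs):
      kept  maps barcode -> list of reads with the barcode removed, for
            barcodes whose frequency is at least `min_freq`
      freqs maps every observed barcode -> fraction of reads carrying it
    """
    groups = defaultdict(list)
    for seq in reads:
        if len(seq) <= barcode_len:
            continue  # assumption: nothing would be left after the barcode
        groups[seq[:barcode_len]].append(seq[barcode_len:])

    total = sum(len(v) for v in groups.values())
    freqs = {bc: len(v) / total for bc, v in groups.items()}
    kept = {bc: v for bc, v in groups.items() if freqs[bc] >= min_freq}
    return kept, freqs


reads = ["AAACGTACGT", "AAATTTTGGG", "AACGGGCCCA", "AAAGGGTTTA"]
kept, freqs = split_by_barcode(reads, 3, min_freq=0.3)
print(freqs)         # {'AAA': 0.75, 'AAC': 0.25}
print(sorted(kept))  # ['AAA']
```

With a threshold of 0.3 the rare "AAC" barcode is dropped, which mirrors the situation where "-min <#>" needs to be lowered to recover a low-frequency barcode.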

Trimming Sequence Files

With all the fancy types of sequencing being done, it is increasingly common to find adapter sequences within the reads being analyzed. The trim command allows users to trim sequences from the 3' and 5' ends, either by a specific number of nucleotides or by removing a specific adapter sequence. The basic command is executed like this:

homerTools trim [options] <sequence file1> [sequence file2]

The output will be placed in files "filename.trimmed" and the distribution of sequence lengths after trimming will be in "filename.lengths" for each of the input files. The following options control how homerTools trims the sequences:

-len <#> (trim sequences to this length)
-min <#> (remove sequences that are shorter than this after trimming)
-3 <#> (trim this many bp off the 3' end of the sequence)
-5 <#> (trim this many bp off the 5' end of the sequence)
-3 <ACGT> (trim adapter sequence (e.g. "-3 GGAGGATTT") from the 3' end of the sequence)
-5 <ACGT> (trim adapter sequence (e.g. "-5 GGAGGATTT") from the 5' end of the sequence)
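
The fixed-length options above amount to simple string slicing. The Python sketch below shows one way they could combine. It is not homerTools' code: the function name trim_fixed is hypothetical, and the order of operations (5' trim, then 3' trim, then "-len", then the "-min" filter) is an assumption.

```python
def trim_fixed(seq, trim5=0, trim3=0, length=None, min_len=1):
    """Trim a fixed number of bases; return None if the result is too short."""
    seq = seq[trim5:]                          # like "-5 <#>"
    if trim3 > 0:
        seq = seq[:max(len(seq) - trim3, 0)]   # like "-3 <#>"
    if length is not None:
        seq = seq[:length]                     # like "-len <#>"
    # like "-min <#>": discard reads that end up shorter than min_len
    return seq if len(seq) >= min_len else None


print(trim_fixed("ACGTACGTAC", trim5=2, trim3=3))  # GTACG
print(trim_fixed("ACGTACGTAC", length=4))          # ACGT
print(trim_fixed("ACGT", trim3=3, min_len=2))      # None
```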

For adapter sequence trimming, homerTools searches for the first full match to the adapter and deletes the match together with the rest of the sequence. For example, if you specify "-3 AA", it will search for the first instance of "AA" and delete it and everything after it. It will also delete partial matches if they are at the end of the sequence (or the beginning for 5'). As another example, our lab uses an amplification strategy for RNA that results in the ligation of a polyA tail to the RNA sequence. If the reads are long enough, the end of the read will be just As.

i.e. GAGATTATCTACGTACCGAAAAAAAAAAAAAAAAAA

Trimming with "-3 AAAAAAAAA" will cleave the complete polyA stretch.

In this example: GAGATTATCTACGTACCGTACTGCATGACGGGAAAA, only the final 4 As would be trimmed.
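
The two polyA cases above can be reproduced with a short Python sketch of exact-match 3' adapter trimming. It is a simplified illustration rather than the homerTools algorithm: the function name trim_3prime_adapter is hypothetical, and mismatches ("-mis") and the minimum edge match length ("-minMatchLength") are deliberately ignored.

```python
def trim_3prime_adapter(seq, adapter):
    """Remove `adapter` and everything after it from the 3' end of `seq`."""
    # Full match: cut at the first occurrence of the whole adapter.
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    # Partial match: the read may end partway through the adapter, so look
    # for the longest adapter prefix that is also a suffix of the read.
    for k in range(min(len(adapter), len(seq)) - 1, 0, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k]
    return seq  # no adapter found


adapter = "AAAAAAAAA"
print(trim_3prime_adapter("GAGATTATCTACGTACCGAAAAAAAAAAAAAAAAAA", adapter))
# GAGATTATCTACGTACCG
print(trim_3prime_adapter("GAGATTATCTACGTACCGTACTGCATGACGGGAAAA", adapter))
# GAGATTATCTACGTACCGTACTGCATGACGGG
```

In the first read the full adapter is found, so the whole polyA stretch is removed. In the second read only a 4-base prefix of the adapter matches at the very end, so only those 4 As are removed.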

Common trimming tasks

TruSeq adapter trimming:

homerTools trim -3 GATCGGAAGAGCACACGTCT -mis 1 -minMatchLength 4 -min 15 file.fastq

Small RNA adapter trimming:

homerTools trim -3 TCGTATGCCGTCTTCTGCTTGT -mis 1 -minMatchLength 4 -min 15 file.fastq

Extracting Genomic Sequences From FASTA Files

The extract command can be used to extract large numbers of specific genomic sequences. The first input file you need is a HOMER-style peak file or a BED file with genomic locations. Next, you must have the genomic DNA sequences in one of two formats: (1) a directory of chr1.fa, chr2.fa FASTA files (these can be masked files like *.fa.masked), or (2) a single FASTA file with all of the chromosomes concatenated in one file. The sequences are sent to stdout as a tab-delimited file, or as a FASTA-formatted file if "-fa" is added to the end of the command. Save the output to a file by adding " > outputfile.txt" to the end of the command. The program is run like this:

homerTools extract <peak/BED file> <FASTA directory or file location> [-fa]

e.g. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ > outputSequences.txt
Or, to get FASTA-formatted output instead, e.g. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ -fa > outputSequence.fa
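
To make clear what kind of operation extract performs, here is a minimal Python sketch that pulls BED regions out of an in-memory FASTA. It is not how homerTools is implemented. The function names read_fasta and extract_bed are hypothetical, coordinates are assumed to follow the standard BED convention (0-based start, end not included), and regions on the "-" strand are assumed to be returned as the reverse complement.

```python
def read_fasta(lines):
    """Parse FASTA lines into {sequence name: sequence}."""
    parts, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(p) for n, p in parts.items()}


COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_bed(genome, bed_lines):
    """Yield (region name, sequence) for each usable BED record."""
    for line in bed_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 3 or f[0] not in genome:
            continue  # skip headers and unknown chromosomes
        chrom, start, end = f[0], int(f[1]), int(f[2])
        name = f[3] if len(f) > 3 else f"{chrom}:{start}-{end}"
        strand = f[5] if len(f) > 5 else "+"
        seq = genome[chrom][start:end]
        if strand == "-":
            seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
        yield name, seq


genome = read_fasta([">chr1", "ACGTACGTAC", "GGGGCCCCTT"])
bed = ["chr1\t2\t8\tpeakA\t0\t+", "chr1\t10\t16\tpeakB\t0\t-"]
for name, seq in extract_bed(genome, bed):
    print(f"{name}\t{seq}")  # tab-delimited, like the default output
# peakA	GTACGT
# peakB	GGCCCC
```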

Calculating Nucleotide Frequencies

The freq command will calculate nucleotide frequencies from FASTQ, FASTA, or tab-delimited text sequence files. The program tries to auto-detect the format, but it may help to specify the format directly ("-format fastq", "-format fasta", "-format tsv"). The program outputs a position-dependent nucleotide/dinucleotide frequency file as a function of the distance from the start of the sequencing reads. The output is sent to stdout, unless you specify "-o <outputfile.txt>". If you specify "-gc <outputfile2.txt>", the program will also create a file that specifies the cumulative frequency of CpG, total G+C, total A+G, and total A+C in each individual sequence.

homerTools freq -format fastq s_1_sequence.txt > s_1.frequency.txt
homerTools freq -format fastq s_1_sequence.txt -gc GCdistribution.txt -o positionFrequency.txt
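
As a rough illustration of what "position-dependent nucleotide frequency" means, the Python sketch below tabulates base fractions at each read position and a per-sequence G+C fraction. It is a simplified stand-in, not the freq implementation: the function names are hypothetical, dinucleotides and CpG are left out, and the table spans the longest read (whereas the "-maxlen" default listed in the help text is the length of the first sequence).

```python
def position_frequencies(seqs, alphabet="ACGT"):
    """Return one {base: fraction} dict per read position (0 = read start)."""
    maxlen = max((len(s) for s in seqs), default=0)
    counts = [dict.fromkeys(alphabet, 0) for _ in range(maxlen)]
    for s in seqs:
        for i, base in enumerate(s.upper()):
            if base in counts[i]:      # ignore N and other symbols
                counts[i][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({b: (n / total if total else 0.0) for b, n in c.items()})
    return freqs


def gc_fraction(seq):
    """Fraction of G and C bases in a single sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


seqs = ["ACGT", "AGGT", "ACTT", "TCGA"]
table = position_frequencies(seqs)
print(table[0])             # {'A': 0.75, 'C': 0.0, 'G': 0.0, 'T': 0.25}
print(gc_fraction("ACGT"))  # 0.5
```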

homerTools Command Line options:

        Usage: homerTools <command> [--help | options]

        Collection of tools for sequence manipulation

        Commands: [type "homerTools <command>" to see individual command options]
                barcodes - separate FASTQ file by barcodes
                trim - trim adapter sequences or fixed sizes from FASTQ files(also splits)
                freq - calculate position-dependent nucleotide/dinucleotide frequencies
                extract - extract specific sequences from FASTA file(s)
                decontaminate - remove bad tags from a contaminated tag directory
                cluster - hierarchical clustering of a NxN distance matrix
                special - specialized routines (i.e. only really useful for chuck)

        Options for command: barcode
                -min <#> (Minimum frequency of barcodes to keep: default=0.020
                -freq <filename> (output file for barcode frequencies, default=file.freq.txt)
                -qual <#> (Minimum quality score for barcode nucleotides, default=not used)
                -qualBase <character> (Minimum quality character in FASTQ file, default=B)

        Options for command: trim
                -3 <#|[ACGT]> (trim # bp or adapter sequence from 3' end of sequences)
                -5 <#|[ACGT]> (trim # bp or adapter sequence from 5' end of sequences)
                        -mis <#> (Maximum allowed mismatches in adapter sequence, default: 0)
                        -minMatchLength <#> (minimum adapter sequence at edge to match, default: half adapter length)
                -len <#> (Keep first # bp of sequence - i.e. make them the same length)
                -stats <filename> (Output trimming statistics to filename, default: sent to stdout)
                -min <#> (Minimum size of trimmed sequence to keep, default: 1)
                -max <#> (Maximum read length, default: 100000)
                -suffix <filename suffix> (output is sent to InuptFileName.suffix, default: trimmed)
                -lenSuffix <filename suffix> (length distribution is sent to InuptFileName.suffix, default: lengths)
                -split <#> (Split reads into two reads at bp #, output to trimmed1 and trimmed2)
                -revopp <#> (Return reverse opposite of read [if used with -split, only the 2nd
                                half of the read will be retuned as reverse opposite])

        Options for command: freq
                -format <tsv|fasta|fastq> (sequence file format, default: auto detect)
                -offset <#> (offset of first base in output file, default: 0)
                -maxlen <#> (Maximum length of sequences to consider, default: length of 1st seq)
                -o <filename> (Output filename, default: output sent to stdout)
                -gc <filename> (calculate CpG/GC content per sequence output to "filename")
                        OutputFormat: name<tab>CpG<tab>GC<tab>AG<tab>AC<tab>Length

        Options for command: extract
                -fa (output sequences in FASTA format - default is tab-delimited format)

        Alternate Usage: homerTools extract stats <Directory of FASTA files>
                Displays stats about the genome files (such as length)

        Options for command: decontaminate
                -frac <#> (Estimate fraction of sample that is contaminated, default: auto)
                -estimateOnly (Only estimate the contamination, do not decontaminate)
                -o <output tag directory> (default: overrites contaminated tag directory)
                -size <#> (Peak size for estimating contamination/Max distance from contaminant
                        reads to remove contaminated reads, default: 250)
                -min <#> (Minimum tag count to consider when estimating contamination, default: 20)

Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@salk.edu