HOMER

Software for motif discovery and ChIP-Seq analysis

This is the old version of the documentation: New Version

ChIP-Seq Analysis: Step 1, Creating a "Tag Directory"

To facilitate the analysis of ChIP-Seq (or any other type of short read re-sequencing data), it is useful to first transform the sequence alignment into a platform-independent data structure representing the experiment, analogous to loading the data into a database. HOMER does this by placing all relevant information about the experiment into a "Tag Directory", which is essentially a directory on your computer that contains several files describing your experiment.

To create a "Tag Directory", you must have alignment files in one of the following formats:

BED format
*.eland_result.txt or *_export.txt format from the Illumina pipeline
bowtie output format

If your alignment is in a different format, it is recommended that you convert it to BED format, with the following columns:

Column1: chromosome
Column2: start position
Column3: end position
Column4: Name (or strand +/-)
Column5: Number of reads at this position
Column6: Strand +/-
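As an illustration of the column layout above, the following Python sketch formats alignment records as six-column BED lines. The `to_bed_line` helper and the example records are hypothetical names and values used only for this example, and tab separation is assumed, as is usual for BED files.

```python
def to_bed_line(chrom, start, end, strand, name=".", count=1):
    """Format one alignment as a six-column, tab-separated BED line.

    Column order follows the list above: chromosome, start, end,
    name, number of reads at this position, strand.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    if end < start:
        raise ValueError("end must not be smaller than start")
    return "\t".join([chrom, str(start), str(end), name, str(count), strand])


# Hypothetical alignments: (chromosome, start, end, strand)
alignments = [("chr1", 1000, 1036, "+"), ("chr2", 5000, 5036, "-")]
bed_lines = [to_bed_line(c, s, e, strand) for c, s, e, strand in alignments]
print("\n".join(bed_lines))
```

Writing these lines to a file gives an input that can be passed to makeTagDirectory as a BED alignment file.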

Alternatively (or in combination), you can make tag directories from existing tag directories or from tag files (explained below).

To make a tag directory, run the following command:

makeTagDirectory <Output Directory Name> [options] <alignment file1> [alignment file 2] ...

The first argument (required) must be the output directory. If it does not exist, it will be created; if it does exist, it will be overwritten.
An example:

makeTagDirectory Macrophage-PU.1-ChIP-Seq/ pu1.lane1.bed pu1.lane2.bed pu1.lane3.bed

Several additional options exist for makeTagDirectory. The program attempts to guess the format of your alignment files, but if it is unsuccessful, you can force the format with "-format <X>". To combine tag directories, for example when combining two separate experiments into one, do the following:

makeTagDirectory Combined-PU.1-ChIP-Seq/ -d Exp1-ChIP-Seq/ Exp2-ChIP-Seq/ Exp3-ChIP-Seq/
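When many samples are processed from a script, it can help to assemble these command lines programmatically. The Python sketch below only builds the argument list; the helper name `make_tag_directory_cmd` and its parameters are hypothetical, and it uses only the "-format" and "-d" options described here, with the output directory first and the options before the input files.

```python
import shlex


def make_tag_directory_cmd(out_dir, inputs, fmt=None, combine=False):
    """Build the argument list for a makeTagDirectory call.

    out_dir : output tag directory (first argument, required)
    inputs  : alignment files, or existing tag directories if combine=True
    fmt     : optional value for -format, for when auto-detection fails
    """
    if not inputs:
        raise ValueError("at least one input is required")
    cmd = ["makeTagDirectory", out_dir]
    if fmt is not None:
        cmd += ["-format", fmt]
    if combine:
        cmd.append("-d")  # the inputs that follow are tag directories
    cmd += list(inputs)
    return cmd


cmd = make_tag_directory_cmd(
    "Macrophage-PU.1-ChIP-Seq/",
    ["pu1.lane1.bed", "pu1.lane2.bed"],
    fmt="bed",
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where HOMER is installed; the sketch itself does not execute anything.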

What does makeTagDirectory do?

makeTagDirectory parses the alignment file and splits the tags into separate files based on their chromosome. As a result, several *.tags.tsv files are created in the output directory. This layout lets downstream programs access the data very efficiently, and it helps speed up the analysis of very large data sets without running out of memory.
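The idea of splitting by chromosome can be sketched in a few lines of Python. This is a conceptual illustration only, not HOMER's implementation: the function names are hypothetical, the `<chromosome>.tags.tsv` file naming is an assumption, and the files written here simply copy the BED columns rather than reproduce HOMER's actual tag file layout.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def split_by_chromosome(bed_lines):
    """Group tab-separated BED records by the chromosome in column 1."""
    by_chrom = defaultdict(list)
    for line in bed_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        by_chrom[fields[0]].append(fields)
    return by_chrom


def write_per_chromosome(by_chrom, out_dir):
    """Write one file per chromosome and return the file names written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for chrom, records in sorted(by_chrom.items()):
        path = out_dir / f"{chrom}.tags.tsv"  # assumed naming scheme
        with open(path, "w") as handle:
            for fields in records:
                handle.write("\t".join(fields) + "\n")
        written.append(path.name)
    return written


# Hypothetical input records
bed_lines = [
    "chr1\t100\t136\t.\t1\t+",
    "chr2\t500\t536\t.\t1\t-",
    "chr1\t900\t936\t.\t1\t-",
]
with tempfile.TemporaryDirectory() as tmp:
    names = write_per_chromosome(split_by_chromosome(bed_lines), tmp)
print(names)
```

Because each chromosome lives in its own file, a later analysis step only needs to load one chromosome at a time, which is what keeps memory use bounded on large data sets.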

In the end, your output directory will contain several *.tags.tsv files, as well as a file named "tagInfo.txt". This file contains information about your sequencing run, including the total number of tags considered. This file is used by later programs to quickly reference information about the experiment, and can be manually modified to set certain parameters for analysis.
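A quick way to check that a tag directory looks complete is to list its *.tags.tsv files and confirm that tagInfo.txt is present. The Python sketch below does only that; `summarize_tag_directory` is a hypothetical helper, it does not parse tagInfo.txt, and the demonstration uses an empty mock directory rather than real HOMER output.

```python
from pathlib import Path
import tempfile


def summarize_tag_directory(tag_dir):
    """List the *.tags.tsv files and report whether tagInfo.txt exists."""
    tag_dir = Path(tag_dir)
    tag_files = sorted(p.name for p in tag_dir.glob("*.tags.tsv"))
    return {
        "tag_files": tag_files,
        "has_tag_info": (tag_dir / "tagInfo.txt").is_file(),
    }


# Mock directory with empty placeholder files, for demonstration only
with tempfile.TemporaryDirectory() as tmp:
    for name in ("chr1.tags.tsv", "chr2.tags.tsv", "tagInfo.txt"):
        (Path(tmp) / name).touch()
    summary = summarize_tag_directory(tmp)
print(summary)
```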

makeTagDirectory also performs several quality control steps which are covered in the next section.

Command line options of makeTagDirectory command:

        Usage: makeTagDirectory <directory> <alignment file 1> [file 2] ... [options]

        Creates a platform-independent 'tag directory' for later analysis.
        Currently BED, Eland, and bowtie files are accepted. The program will try to
        automatically detect the alignment format if not specified
        Existing tag directories can be added or combined to make a new one using -d/-t
        If more than one format is needed and the program cannot auto-detect it properly,
        make separate tag directories by running the program separately, then combine them.

        Options:
                -genome <genome name> (specify genome for later analysis)
                                To list available genomes, run "??"
                -name <experiment name> (optional, names the experiment)
                -format <X> where X can be: (with column specifications underneath)
                        bed - BED format files:
                                (1:chr,2:start,3:end,4:+/- or read name,5:# tags,6:+/-)
                        bowtie - output from bowtie (run with --best -k 2 options)
                                (1:read name,2:+/-,3:chr,4:position,5:seq,6:quality,
                                                        7:NA,8:mismatch info)
                        eland_result - output from basic eland
                                (1:read name,2:seq,3:code,4:#zeroMM,5:#oneMM,6:#twoMM,7:chr,
                                                        8:position,9:F/R,10-:mismatches)
                        eland_export - output from illumina pipeline (22 columns total)
                                (1-5:read name info,9:sequence,10:quality,11:chr,13:position,14:strand)
                -C (color space mapping with bowtie)
                -keep (keep one mapping of each read regardless if multiple equal mappings exist)
                -forceBED (if 5th column of BED file contains stupid values, like mapping quality
                                instead of number of tags, then ignore this column)
                -d <tag directory> [tag directory 2] ... (add Tag directory to new tag directory)
                -t <tag file> [tag file 2] ... (add tag file i.e. *.tags.tsv to new tag directory)

Next: Basic quality control (sequence bias, fragment length estimation)

Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@ucsd.edu