Running the Example Pipeline
This page familiarizes new users with the basic flow of running the ImmuneDB pipeline. The provided example input FASTQ files contain human B-cell heavy chain sequences.
Commands are listed as being run either in the Docker container or on the host.
To begin, run the Docker container as documented:
$ docker run -v $HOME/immunedb_share:/share \
    -p 8080:8080 -it arosenfeld/immunedb:v0.29.0
Before ImmuneDB can be run, metadata must be specified for each input file. For this example, a metadata file has already been created for you. To learn how to create a metadata file for your own data, see Creating a Metadata Sheet.
ImmuneDB Instance Creation
Next, we create a database for the data with:
$ immunedb_admin create example_db
This creates a new database named example_db.
Identifying or Importing Sequences
The first step of the pipeline is to annotate sequences and store the resulting
data in the newly created database. To do so, the immunedb_identify command is
used. It requires that V and J germline sequences be specified in two separate
FASTA files. The Docker image provides human and mouse IGH, TRA, and TRB
germlines; the human IGH files used below are in /root/germlines.
For this example, two input files are provided in /example along with a
metadata.tsv file. You can list them with:
$ ls /example
Given this, run the immunedb_identify command:
$ immunedb_identify example_db \
    /root/germlines/imgt_human_ighv.fasta \
    /root/germlines/imgt_human_ighj.fasta \
    /example
ImmuneDB determines the uniqueness of a sequence at both the sample and subject
level. For the latter, immunedb_collapse is used to find sequences that are the
same except at positions that have an N. Thus, a sequence such as ANCN would be
collapsed with any other sequence that matches it at every position where
neither has an N.
To collapse sequences, run:
$ immunedb_collapse example_db
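To make the collapsing rule concrete, here is a minimal Python sketch of the comparison it describes: two sequences are treated as the same if they agree everywhere except where either has an N. This is an illustration only, not ImmuneDB's implementation. The function name, the equal-length requirement, and the example sequences are assumptions made for this sketch.

```python
def matches_ignoring_n(seq_a, seq_b):
    """Return True if two sequences agree at every position where
    neither has an ambiguous base 'N'.

    Assumption for this sketch: sequences of different lengths never match.
    """
    if len(seq_a) != len(seq_b):
        return False
    return all(
        a == b or a == "N" or b == "N"
        for a, b in zip(seq_a.upper(), seq_b.upper())
    )


# Hypothetical sequences chosen for illustration only.
print(matches_ignoring_n("ANCN", "ATCG"))  # True: the only differences are at N positions
print(matches_ignoring_n("ANCN", "ATGG"))  # False: C and G differ at the third position
```

The real command works on the annotated sequences stored in the database; the sketch only shows the pairwise comparison.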
After sequences are assigned V and J genes, they can be clustered into clones
based on CDR3 amino acid similarity with the immunedb_clones command. This
command takes a number of arguments, which should be reviewed before use.
There are three ways to create clones: based on CDR3 AA similarity, T-cell exact CDR3 NT identity, and a lineage-based method. For this example we’ll use the similarity-based method with default parameters:
$ immunedb_clones example_db similarity
This will create clones where all sequences in a clone will have the same V-gene, J-gene, and (by default) 85% CDR3 AA identity.
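The clone criteria can be sketched as a pairwise check in Python. This is a rough illustration under stated assumptions, not ImmuneDB's clustering code: identity is assumed here to be the fraction of matching positions between equal-length CDR3 amino acid strings, and the record fields (v_gene, j_gene, cdr3_aa) and example values are hypothetical.

```python
def cdr3_identity(cdr3_a, cdr3_b):
    """Fraction of positions with identical amino acids.

    Assumption for this sketch: CDR3s of different lengths have identity 0.
    """
    if len(cdr3_a) != len(cdr3_b) or not cdr3_a:
        return 0.0
    matches = sum(a == b for a, b in zip(cdr3_a, cdr3_b))
    return matches / len(cdr3_a)


def could_share_clone(seq_a, seq_b, min_identity=0.85):
    """Check the three criteria from the text: same V gene, same J gene,
    and CDR3 AA identity of at least min_identity."""
    return (
        seq_a["v_gene"] == seq_b["v_gene"]
        and seq_a["j_gene"] == seq_b["j_gene"]
        and cdr3_identity(seq_a["cdr3_aa"], seq_b["cdr3_aa"]) >= min_identity
    )


# Hypothetical records; the CDR3s are 20 amino acids long.
base = {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3_aa": "CARDYYGSGSYYFDYWGQGT"}
near = {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3_aa": "CAREYYGSGSYYLDYWGQGT"}
far = {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3_aa": "CAREYYGTGTYYLDYWGQGT"}

print(could_share_clone(base, near))  # True: 18 of 20 positions match (0.90)
print(could_share_clone(base, far))   # False: 16 of 20 positions match (0.80)
```

How the actual command groups more than two sequences into a clone is not covered by this sketch.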
Two sets of statistics can be calculated in ImmuneDB:
- Clone Statistics: For each clone and sample combination, how many unique and total sequences appear as well as the mutations from the germline.
- Sample Statistics: Distribution of sequence and clone features on a per-sample basis, including V and J usage, nucleotides matching the germline, copy number, V length, and CDR3 length. It calculates all of these with and without outliers, and including and excluding partial reads.
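As a rough illustration of the unique and total counts in the clone statistics, the Python sketch below tallies them per clone and sample combination. It is not ImmuneDB's code. It assumes that "unique" means the number of distinct sequences and "total" means the sum of their copy numbers, and the record layout and values are hypothetical.

```python
from collections import defaultdict


def clone_sample_counts(records):
    """Count unique and total sequences for each (clone, sample) pair.

    Assumption for this sketch: 'unique' is the number of distinct
    sequences and 'total' is the sum of their copy numbers.
    """
    unique = defaultdict(set)
    total = defaultdict(int)
    for rec in records:
        key = (rec["clone_id"], rec["sample_id"])
        unique[key].add(rec["sequence"])
        total[key] += rec["copy_number"]
    return {key: {"unique": len(unique[key]), "total": total[key]} for key in unique}


# Hypothetical records for illustration only.
records = [
    {"clone_id": 1, "sample_id": "S1", "sequence": "ATCG", "copy_number": 3},
    {"clone_id": 1, "sample_id": "S1", "sequence": "ATCC", "copy_number": 1},
    {"clone_id": 1, "sample_id": "S2", "sequence": "ATCG", "copy_number": 2},
]
for key, counts in clone_sample_counts(records).items():
    print(key, counts)
```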
These are calculated with the immunedb_clone_stats and immunedb_sample_stats
commands, which must be run in that order.
$ immunedb_clone_stats example_db
$ immunedb_sample_stats example_db
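If you want to rerun the required steps up to this point as one unit, a small Python wrapper is one option. The sketch below copies the commands and arguments from this page. The function names and the dry-run default are assumptions made for illustration. It would need to run inside the Docker container, where the immunedb_* commands are available.

```python
import subprocess


def required_steps(db="example_db"):
    """Return the required pipeline commands from this page, in order."""
    return [
        ["immunedb_admin", "create", db],
        [
            "immunedb_identify", db,
            "/root/germlines/imgt_human_ighv.fasta",
            "/root/germlines/imgt_human_ighj.fasta",
            "/example",
        ],
        ["immunedb_collapse", db],
        ["immunedb_clones", db, "similarity"],
        ["immunedb_clone_stats", db],
        ["immunedb_sample_stats", db],
    ]


def run_pipeline(db="example_db", dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    check=True stops at the first failing step so that later steps
    never run against incomplete data.
    """
    for cmd in required_steps(db):
        print("$ " + " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run by default: only prints the commands.
run_pipeline()
```

Call run_pipeline(dry_run=False) to actually execute the steps. Stopping on the first error preserves the ordering requirement, for example clone statistics before sample statistics.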
Selection Pressure (Optional)
Selection pressure calculations are time-consuming, so you can skip this step if time is limited.
Selection pressure of clones can be calculated with Baseline. To do so, run:
$ immunedb_clone_pressure example_db \
    /apps/baseline/Baseline_Main.r
Note that this process is relatively slow and may take some time to complete.
Clone Trees (Optional)
Lineage trees for clones are generated with the immunedb_clone_trees
command. The only currently supported method is neighbor-joining.
Among other options, the --min-mut-copies parameter allows mutations to be
omitted if they have not occurred at least a specified number of times. This
can be useful to correct for sequencing error.
$ immunedb_clone_trees example_db --min-mut-copies 2
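The effect of a minimum-copies threshold can be sketched in Python as follows. This is one plausible reading of the parameter, not ImmuneDB's implementation. It assumes a mutation's count is the summed copy number of the sequences carrying it, and the mutation tuples (position, germline base, observed base) are a hypothetical representation.

```python
from collections import Counter


def filter_mutations(sequences, min_mut_copies=2):
    """Keep only mutations seen in at least min_mut_copies copies.

    sequences: iterable of (copy_number, set_of_mutations) pairs.
    Assumption for this sketch: a mutation's count is the summed copy
    number of every sequence that carries it.
    """
    counts = Counter()
    for copies, mutations in sequences:
        for mutation in mutations:
            counts[mutation] += copies
    return {mutation for mutation, n in counts.items() if n >= min_mut_copies}


# Hypothetical clone with three sequences.
sequences = [
    (1, {(10, "A", "G"), (25, "C", "T")}),
    (3, {(10, "A", "G")}),
    (1, {(40, "G", "A")}),
]
print(filter_mutations(sequences, min_mut_copies=2))  # {(10, 'A', 'G')}
```

Mutations seen only once, such as those at positions 25 and 40 in this example, are more likely to be sequencing errors, so dropping them gives a cleaner set of mutations to build the tree from.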