Exporting Data to Files

You can use the immunedb_export command to export your data in a variety of formats.

Exporting Samples

To export sample statistics, run the command:

$ immunedb_export PATH_TO_CONFIG samples

When the command completes, it writes a TSV file, samples.tsv, with one line per sample and the following headers:

Field                   Description
id                      Unique numeric sample identifier
name                    Name given to the sample
subject                 Subject from which the sample originated
input_sequences         Reads input into ImmuneDB
identified              Reads successfully annotated
in_frame                Reads in-frame
stops                   Reads with stop codons
functional              Functional reads (in-frame and no stop codons)
avg_clone_cdr3_num_nts  Average clonal CDR3 length in nucleotides
avg_clone_v_identity    Average clonal V-region identity
clones                  Total number of clones
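
As a minimal sketch of working with this file, the Python snippet below reads data in the samples.tsv layout with the standard csv module and computes, for each sample, the fraction of identified reads that are functional. The two data rows are invented for illustration and the helper name functional_fraction is hypothetical; the snippet assumes the file is tab-separated and that the count columns hold plain integers.

```python
import csv
import io

# Invented example content in the layout described above (not real output).
SAMPLES_TSV = (
    "id\tname\tsubject\tinput_sequences\tidentified\tin_frame\tstops\tfunctional\n"
    "1\tsample1\tS1\t1000\t800\t700\t50\t650\n"
    "2\tsample2\tS1\t2000\t1500\t1200\t300\t900\n"
)


def functional_fraction(handle):
    """Map each sample name to functional reads / identified reads."""
    fractions = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        identified = int(row["identified"])
        # Guard against samples in which no reads were annotated.
        fractions[row["name"]] = (
            int(row["functional"]) / identified if identified else 0.0
        )
    return fractions


# For a real export, pass open("samples.tsv", newline="") instead of StringIO.
fractions = functional_fraction(io.StringIO(SAMPLES_TSV))
for name, fraction in fractions.items():
    print(f"{name}\t{fraction:.3f}")
```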

Exporting Clones

In its most basic form, the command to export clones is:

$ immunedb_export PATH_TO_CONFIG clones

This will generate one file per sample, each with one line per clone and the fields below. Note that instances, copies, avg_v_identity, and top_copy_seq describe the clone in the context of that sample; those fields may differ for the same clone in different samples.

Field           Description
clone_id        Database-wide unique clone identifier. This number can be used to track clones across samples.
subject         Subject in which the clone was found
v_gene          V-gene of the clone
j_gene          J-gene of the clone
functional      Whether the clone is in-frame and its consensus contains no stop codon (T or F)
insertions      Insertions in the clone (deprecated)
deletions       Deletions in the clone (deprecated)
cdr3_nt         CDR3 nucleotide sequence
cdr3_num_nts    CDR3 nucleotide sequence length
cdr3_aa         CDR3 amino-acid sequence
uniques         Unique sequences in the clone overall
instances       Sequence instances in the clone in the associated sample
copies          Copies in the clone in the associated sample
germline        Clonal germline sequence
parent_id       Parent ID (deprecated)
avg_v_identity  Average V-gene identity to germline
top_copy_seq    Nucleotide sequence of top-copy sequence
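
Because clone_id is unique across the database, it can serve as a join key between the per-sample files. The sketch below shows one way to do that in Python; the file contents are invented, only three of the columns are included, and the helper name copies_by_clone is hypothetical. Real exports would be read from the generated files rather than from in-memory strings.

```python
import csv
import io
from collections import defaultdict

# Invented per-sample exports, reduced to three of the columns listed above.
EXPORTS = {
    "sample1": "clone_id\tcdr3_aa\tcopies\n10\tCARDYW\t120\n11\tCARGGW\t5\n",
    "sample2": "clone_id\tcdr3_aa\tcopies\n10\tCARDYW\t30\n12\tCTRSSW\t8\n",
}


def copies_by_clone(exports):
    """Return {clone_id: {sample: copies}} across all per-sample files."""
    table = defaultdict(dict)
    for sample, text in exports.items():
        for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
            table[row["clone_id"]][sample] = int(row["copies"])
    return dict(table)


clone_copies = copies_by_clone(EXPORTS)
# Keep only the clones that appear in more than one sample.
shared = {cid: counts for cid, counts in clone_copies.items() if len(counts) > 1}
print(shared)
```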

The --pool-on parameter changes how data is aggregated. By default it takes the value sample (as described above), but it also accepts subject or any custom metadata field(s).

For the purposes of illustration, assume we have samples with the associated metadata below.

sample   subject  tissue  subset
sample1  S1       blood   naive
sample2  S1       spleen  naive
sample3  S1       spleen  mature
sample4  S3       blood   native

Passing --pool-on subject will generate one file per subject, with the clone information aggregated across all samples in that subject. Alternatively, passing --pool-on tissue will generate one file per subject/tissue combination. You can also pass multiple metadata fields to the --pool-on parameter. For example, --pool-on tissue subset will generate one file per subject/tissue/subset combination.
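
The grouping described above can be modeled in a few lines of Python. This is only an illustration of which samples end up pooled together for the example metadata, not ImmuneDB's implementation; the function name pool_groups is hypothetical, and the metadata values are copied from the table above.

```python
from collections import defaultdict

# The illustrative metadata table from above, one dict per sample.
SAMPLES = [
    {"sample": "sample1", "subject": "S1", "tissue": "blood", "subset": "naive"},
    {"sample": "sample2", "subject": "S1", "tissue": "spleen", "subset": "naive"},
    {"sample": "sample3", "subject": "S1", "tissue": "spleen", "subset": "mature"},
    {"sample": "sample4", "subject": "S3", "tissue": "blood", "subset": "native"},
]


def pool_groups(samples, pool_on):
    """Group sample names by subject plus the requested pooling fields."""
    # Files are described as per subject/field combination, so subject is
    # always the first component of the grouping key.
    fields = ["subject"] + [field for field in pool_on if field != "subject"]
    groups = defaultdict(list)
    for meta in samples:
        key = tuple(meta[field] for field in fields)
        groups[key].append(meta["sample"])
    return dict(groups)


for pool_on in (["subject"], ["tissue"], ["tissue", "subset"]):
    groups = pool_groups(SAMPLES, pool_on)
    print(f"--pool-on {' '.join(pool_on)}: {len(groups)} file(s)")
    for key, names in groups.items():
        print("   ", "/".join(key), "<-", ", ".join(names))
```

For this metadata, pooling on subject yields two groups, pooling on tissue yields three, and pooling on tissue and subset yields four.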

Two other common parameters are --sample-ids, which restricts which samples are included in the export, and --format, which accepts immunedb (the default) or vdjtools for interoperability with the VDJtools suite.

Exporting Sequences

Sequences can be exported in Change-O and AIRR formats.

The basic command is:

$ immunedb_export PATH_TO_CONFIG sequences

This will generate one file per sample in Change-O format. To use AIRR format, specify --format airr. You can filter out sequences that were not assigned to a clone with the --clones-only flag.
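
If an export was made without --clones-only, a similar filter can be applied afterwards. The Python sketch below assumes an AIRR-format file that includes the clone_id column from the AIRR Rearrangement schema and that unassigned sequences have an empty value there; the exact columns ImmuneDB writes are not listed in this document, and the rows and the helper name rows_with_clone are invented for illustration.

```python
import csv
import io

# Invented rows using column names from the AIRR Rearrangement schema.
AIRR_TSV = (
    "sequence_id\tv_call\tj_call\tclone_id\n"
    "seq1\tIGHV3-23\tIGHJ4\t10\n"
    "seq2\tIGHV1-69\tIGHJ6\t\n"
    "seq3\tIGHV3-23\tIGHJ4\t10\n"
)


def rows_with_clone(handle):
    """Yield only the rows whose clone_id column is non-empty."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row.get("clone_id"):
            yield row


# For a real export, pass an open file handle instead of StringIO.
assigned = list(rows_with_clone(io.StringIO(AIRR_TSV)))
print([row["sequence_id"] for row in assigned])
```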

Exporting Selection Pressure

If selection pressure was calculated with the immunedb_clone_pressure command, the results can be exported in TSV format, with one row per clone/sample combination. Unless --filter samples is passed, there will also be one additional row per clone with All Samples in the sample field, which gives the overall selection pressure on the clone.

For more information on interpreting the values, see Uduman et al., 2011 and Yaari et al., 2012.

Field                 Description
clone_id              Clone ID
subject               Subject to which the clone belongs
sample                Sample within which the selection pressure was calculated. If All Samples, this is the overall selection pressure for the clone.
threshold             The threshold at which the selection pressure was calculated
expected_REGION_TYPE  The expected number of TYPE (r or s) mutations in REGION (cdr or fwr)
observed_REGION_TYPE  The observed number of TYPE (r or s) mutations in REGION (cdr or fwr)
sigma_REGION          The selection pressure in REGION
sigma_REGION_cilower  The lower bound of the confidence interval of selection in REGION
sigma_REGION_ciupper  The upper bound of the confidence interval of selection in REGION
sigma_p_REGION        The P-value of the selection in REGION
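
To make the REGION and TYPE placeholders concrete, the short Python sketch below expands the templated field names into the column names they imply. It assumes the placeholders are substituted literally with the lowercase values given in the table (cdr, fwr, r, s); the order of columns in an actual export may differ, and the function name selection_columns is hypothetical.

```python
from itertools import product

REGIONS = ("cdr", "fwr")  # values of REGION named in the table
TYPES = ("r", "s")        # values of TYPE named in the table


def selection_columns():
    """List the column names implied by the templated fields above."""
    columns = []
    for prefix in ("expected", "observed"):
        columns += [
            f"{prefix}_{region}_{kind}" for region, kind in product(REGIONS, TYPES)
        ]
    for region in REGIONS:
        columns += [
            f"sigma_{region}",
            f"sigma_{region}_cilower",
            f"sigma_{region}_ciupper",
            f"sigma_p_{region}",
        ]
    return columns


columns = selection_columns()
print("\n".join(columns))
```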

Exporting MySQL Data

The final method of exporting data is to dump the entire MySQL database to a file. This is meant as a backup method rather than as input for downstream analysis.

To back up, run:

$ immunedb_admin backup PATH_TO_CONFIG BACKUP_PATH

To restore a backup, run:

$ immunedb_admin restore PATH_TO_CONFIG BACKUP_PATH