Invoke with StarScope

Input

The input sample list file is csv file with five columns: sample, fastq_1, fastq_2, feature_types and expected_cells. The fourth column represents library types: GEX, VDJ-T or VDJ-B. A sample with only VDJ-T or VDJ-B library is supported, but all the samples must have at least one VDJ library. Please use scRNA-seq workflow if you only have GEX library. The first column sample indicates sample IDs, and multiple fastq files with the same sample ID will be concatenated before further processing (e.g. two pairs of human_pbmc_s1 fastq files will be cat to a single pair). Multiple samples in one single sample list will be submitted parallelly and processed asynchronously. The fourth column indicates expected number of cells used for starsolo --soloCellFilter parameter.

sample,fastq_1,fastq_2,feature_types,expected_cells
human_test,human_test_gex.R1.fq.gz,human_test_gex.R2.fq.gz,GEX,3000
human_test,human_test_tcr.R1.fq.gz,human_test_tcr.R2.fq.gz,VDJ-T,2000
human_pbmc_s1,human_pbmc_s1_gex_R1_001.fastq.gz,human_pbmc_s1_gex_R2_001.fastq.gz,GEX,8000
human_pbmc_s1,human_pbmc_s1_gex_R1_002.fastq.gz,human_pbmc_s1_gex_R2_002.fastq.gz,GEX,8000
human_pbmc_s1,human_pbmc_s1_bcr.R1.fastq.gz,human_pbmc_s1,human_pbmc_s1_bcr.R2.fastq.gz,VDJ-B,1000
Note

All samples in the sampleList will use the same options. Therefore, a combination of samples from different species is not supported.

Command

Invoke with config file

Due to the abundance of processes and options, we suggest utilizing a custom configuration file when invoking the pipeline. This allows for greater control over resource management. ThunderBio has released two versions of the chemistry, each with a distinct barcode structure. Please refer to the corresponding configuration examples provided below.

starscope vdj_gex --input sampleList.csv --config example_docker.config
params {
  genomeDir = "/refdata/human/starsolo/"
  genomeGTF = "/refdata/human/refdata-gex-GRCh38-2020-A/genes/genes.gtf"
  whitelist = "/starscope/whitelist/TB_v3_20240429.BC1.tsv /starscope/whitelist/TB_v3_20240429.BC2.tsv /starscope/whitelist/TB_v3_20240429.BC3.tsv"
  soloType = "CB_UMI_Complex"
  trimLength = 50
  soloAdapterSequence = "NNNNNNNNNGTGANNNNNNNNNGACANNNNNNNNNNNNNNNNN"
  soloCBposition = "2_0_2_8 2_13_2_21 2_26_2_34"
  soloUMIposition = "2_35_2_42"
  soloCBmatchWLtype = "1MM"
  publishSaturation = true
  
  // 5' VDJ specific params
  // trust4 reference
  trust4_vdj_refGenome_fasta = "/starscope/scRNA-seq/vdj/reference/hg38_bcrtcr.fa"
  trust4_vdj_imgt_fasta = "/starscope/scRNA-seq/vdj/reference/human_IMGT+C.fa"
  // strand reverse
  soloStrand = "Reverse"
}

// uncomment line below if using slurm, see https://www.nextflow.io/docs/latest/executor.html
// process.executor = 'slurm'

// uncomment below chunk if using conda
// process.conda = "/home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env"
// conda.enabled = true

// docker setting, comment chunk below if using conda
process.container = "registry-intl.cn-hangzhou.aliyuncs.com/thunderbio/starscope_scrnaseq_env:1.2.5"
docker.enabled = true
docker.userEmulation = true
docker.runOptions = '--init -u $(id -u):$(id -g) $(opt=""; for group in $(id -G); do opt=$opt" --group-add $group"; done; echo $opt)'

// Resouces for each process
process {
  withLabel: process_high {
    cpus = 16
    memory = 40.GB
  }
  withLabel: process_medium {
    cpus = 4
    memory = 20.GB
  }
  withLabel: process_low {
    cpus = 4
    memory = 20.GB
  }
  withName: CHECK_SATURATION {
    cpus = 4
    memory = 10.GB
  }
  withName: CAT_FASTQ {
    cpus = 2
    memory = 4.GB
  }
  withName: TRIM_FASTQ {
    cpus = 12
    memory = 20.GB
  }
  withName: MULTIQC {
    cpus = 4
    memory = 10.GB
  }
  withName: STARSOLO {
    cpus = 16
    memory = 40.GB
  }
  withName: REPORT {
    cpus = 4
    memory = 40.GB
  }
  withName: FEATURESTATS {
    cpus = 2
    memory = 8.GB
  }
  withName: GENECOVERAGE {
    cpus = 8
    memory = 10.GB
  }
  withName: VDJ_CELLCALLING_WITHOUTGEX {
    cpus = 10
    memory = 20.GB
  }
  withName: VDJ_CELLCALLING_WITHGEX {
    cpus = 10
    memory = 20.GB
  }
  withName: GET_VERSIONS_VDJ {
    cpus = 2
    memory = 10.GB
  }
  withName: VDJ_ASSEMBLY {
    cpus = 32
    memory = 30.GB
  }
  withName: VDJ_METRICS {
    cpus = 2
    memory = 10.GB
  }
  withName:REPORT_VDJ {
    cpus = 4
    memory = 40.GB
  }
}

Invoke with command line options

To invoke VDJ pipeline with conda environment:

starscope vdj_gex --conda \
                  --conda_env /path/to/conda/env \
                  --input sampleList.csv \
                  --genomeDir /path/to/STAR/reference/dir \
                  --genomeGTF /path/to/genomeGTF \
                  --whitelist "/path/to/TB_v3_20240429.BC1.tsv /path/to/TB_v3_20240429.BC2.tsv /path/to/TB_v3_20240429.BC3.tsv" \
                  --trust4_vdj_refGenome_fasta /path/to/refGenome_vdj_fasta \
                  --trust4_vdj_imgt_fasta /path/to/imgt_vdj_fasta \
                  --trimLength 28 \
                  --soloType CB_UMI_Complex \
                  --soloAdapterSequence NNNNNNNNNGTGANNNNNNNNNGACANNNNNNNNNNNNNNNNN \
                  --soloCBposition "2_0_2_8 2_13_2_21 2_26_2_34" \
                  --soloUMIposition 2_35_2_42 \
                  --soloCBmatchWLtype 1MM

user will have to add --conda and indicate conda env path with --conda_env. To check your env path, please use mamba env list:

# conda environments:
#
base                       /home/xzx/Tools/mambaforge
starscope_scRNAseq_env     /home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env

and provide the second column (e.g. /home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env).

Required Options

--genomeDir: STARsolo reference index path.

For instance, a typical file structure of the index folder will be like below:

## human hg38
/refdata/human/starsolo/
├── chrLength.txt
├── chrNameLength.txt
├── chrName.txt
├── chrStart.txt
├── exonGeTrInfo.tab
├── exonInfo.tab
├── geneInfo.tab
├── Genome
├── genomeParameters.txt
├── Log.out
├── SA
├── SAindex
├── sjdbInfo.txt
├── sjdbList.fromGTF.out.tab
├── sjdbList.out.tab
└── transcriptInfo.tab

To create the index above, use the command below:

STAR  --runMode genomeGenerate \
      --runThreadN 10 \
      --genomeDir /path/to/outputDir \
      --genomeFastaFiles /path/to/genome.fa  \
      --sjdbGTFfile /path/to/genes.gtf

--genomeGTF: reference genome GTF file path.

User could generate the “filtered” GTF file using 10X’s mkref tool: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references

--whitelist: The white list file(s) path. ThunderBio whitelist files were distributed with StarScope:

/starscope/whitelist/
├── TB_v3_20240429.BC1.tsv
├── TB_v3_20240429.BC2.tsv
├── TB_v3_20240429.BC3.tsv
└── V2_barcode_seq_210407_concat.txt.gz

For ThunderBio chemistry v3, please use

--whitelsit "/starscope/whitelist/TB_v3_20240429.BC1.tsv /starscope/whitelist/TB_v3_20240429.BC2.tsv /starscope/whitelist/TB_v3_20240429.BC3.tsv"

Outputs

Each sample will have a separated result folder named with sample ID. The sub-directory final contains most of the result files:

  • HTML report (e.g. TB_pbmc_test_VDJ_report.html)
  • TCR/BCR trust4 report tsv file (e.g. TB_pbmc_test_BCR_results.tsv).
  • TCR/BCR filtered report file, conains lineage information and only productive TCR/BCR (e.g. TB_pbmc_test_BCR_results.productiveOnly_withLineage.tsv).
  • TCR/BCR cloneType table, with VDJ gene annotation (e.g. TB_pbmc_test_BCR_clonotypes.tsv)

gene expression matrix were strored under starsolo’s result directory:

  • filtered matrix, contains cell associated barcodes only:
results/sampelID/starsolo/GEX/sampleID_GEX.matrix_filtered
  • raw matrix, contains all barcodes:
results/sampelID/starsolo/GEX/sampleID_GEX.matrix_raw

The pipeline_info directory contains statistics of the pipeline running resources.

Full output directory structure:

results/TB_pbmc_test/
├── cutqc
│   ├── TB_pbmc_test.cutadapt.json
│   ├── TB_pbmc_test_GEX.cutadapt.json
│   ├── TB_pbmc_test_VDJ-B.cutadapt.json
│   └── TB_pbmc_test_VDJ-T.cutadapt.json
├── final
│   ├── TB_pbmc_test_BCR_clonotypes.tsv
│   ├── TB_pbmc_test_BCR_results.productiveOnly_withLineage.tsv
│   ├── TB_pbmc_test_BCR_results.tsv
│   ├── TB_pbmc_test_GEX.saturation_out.json
│   ├── TB_pbmc_test_TCR_clonotypes.tsv
│   ├── TB_pbmc_test_TCR_results.productiveOnly_withLineage.tsv
│   ├── TB_pbmc_test_TCR_results.tsv
│   ├── TB_pbmc_test_VDJ-B.metrics.json
│   ├── TB_pbmc_test_VDJ-B.metrics.tsv
│   ├── TB_pbmc_test_VDJ_report.html
│   ├── TB_pbmc_test_VDJ-T.metrics.json
│   ├── TB_pbmc_test_VDJ-T.metrics.tsv
│   └── versions.json
├── multiqc
│   ├── TB_pbmc_test_GEX_multiqc_report.html
│   ├── TB_pbmc_test_multiqc_report.html
│   ├── TB_pbmc_test_VDJ-B_multiqc_report.html
│   └── TB_pbmc_test_VDJ-T_multiqc_report.html
├── saturation
│   └── TB_pbmc_test_GEX.saturation_out.json
├── starsolo
│   ├── GEX
│   │   ├── TB_pbmc_test_GEX.CellReads.stats
│   │   ├── TB_pbmc_test_GEX.Log.final.out
│   │   ├── TB_pbmc_test_GEX.Log.out
│   │   ├── TB_pbmc_test_GEX.Log.progress.out
│   │   ├── TB_pbmc_test_GEX.matrix_filtered
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   ├── TB_pbmc_test_GEX.matrix_raw
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   ├── TB_pbmc_test_GEX.SJ.out.tab
│   │   ├── TB_pbmc_test_GEX_summary.unique.csv
│   │   └── TB_pbmc_test_GEX_UMIperCellSorted.unique.txt
│   ├── VDJ-B
│   │   ├── TB_pbmc_test_VDJ-B.CellReads.stats
│   │   ├── TB_pbmc_test_VDJ-B.Log.final.out
│   │   ├── TB_pbmc_test_VDJ-B.Log.out
│   │   ├── TB_pbmc_test_VDJ-B.Log.progress.out
│   │   ├── TB_pbmc_test_VDJ-B.matrix_filtered
│   │   ├── TB_pbmc_test_VDJ-B.SJ.out.tab
│   │   ├── TB_pbmc_test_VDJ-B_summary.unique.csv
│   │   └── TB_pbmc_test_VDJ-B_UMIperCellSorted.unique.txt
│   └── VDJ-T
│       ├── TB_pbmc_test_VDJ-T.CellReads.stats
│       ├── TB_pbmc_test_VDJ-T.Log.final.out
│       ├── TB_pbmc_test_VDJ-T.Log.out
│       ├── TB_pbmc_test_VDJ-T.Log.progress.out
│       ├── TB_pbmc_test_VDJ-T.matrix_filtered
│       ├── TB_pbmc_test_VDJ-T.SJ.out.tab
│       ├── TB_pbmc_test_VDJ-T_summary.unique.csv
│       └── TB_pbmc_test_VDJ-T_UMIperCellSorted.unique.txt
└── trust4
    ├── VDJ-B
    │   ├── TB_pbmc_test_VDJ-B_barcode_airr.tsv
    │   ├── TB_pbmc_test_VDJ-B_barcode_report.filterDiffusion.tsv
    │   ├── TB_pbmc_test_VDJ-B.cloneType_out.tsv
    │   ├── TB_pbmc_test_VDJ-B_final.out
    │   ├── TB_pbmc_test_VDJ-B_readsAssign.out
    │   ├── TB_pbmc_test_VDJ-B.vdj_cellOut.tsv
    │   └── TB_pbmc_test_VDJ-B.vdj_metrics.json
    └── VDJ-T
        ├── TB_pbmc_test_VDJ-T_barcode_airr.tsv
        ├── TB_pbmc_test_VDJ-T_barcode_report.filterDiffusion.tsv
        ├── TB_pbmc_test_VDJ-T.cloneType_out.tsv
        ├── TB_pbmc_test_VDJ-T_final.out
        ├── TB_pbmc_test_VDJ-T_readsAssign.out
        ├── TB_pbmc_test_VDJ-T.vdj_cellOut.tsv
        └── TB_pbmc_test_VDJ-T.vdj_metrics.json

15 directories, 63 files

WorkDir

By default, the intermediate files will be written to sub-directory of work under the pipeline running directory, please feel free to remove it after all the processes finished successfully.