Chapter 2

scRNA-seq Workflow

Subsections of scRNA-seq Workflow

Invoke with StarScope

Input

The input sample list file is a simple csv with four columns. The first column sample indicates sample IDs, and multiple fastq files with the same sample ID will be concatenated before further processing (e.g. two pairs of human_pbmc_s1 fastq files will be cat to a single pair). Multiple samples in one single sample list will be submitted parallelly and processed asynchronously. The fourth column indicates expected number of cells used for starsolo --soloCellFilter parameter.

sample,fastq_1,fastq_2,expected_cells
human_test,human_test.R1.fq.gz,human_test.R2.fq.gz,3000
human_pbmc_s1,human_pbmc_s1_R1_001.fastq.gz,human_pbmc_s1_R2_001.fastq.gz,8000
human_pbmc_s1,human_pbmc_s1_R1_002.fastq.gz,human_pbmc_s1_R2_002.fastq.gz,8000
Note

All samples in the sampleList will use the same options. Therefore, a combination of samples from different species is not supported.

Command

Invoke with config file

Due to the abundance of processes and options, we suggest utilizing a custom configuration file when invoking the pipeline. This allows for greater control over resource management. Please refer to the corresponding configuration examples below to utilize docker or singularity as running environment.

starscope gex --input sampleList.csv --config example_docker.config
params {
  genomeDir = "/refdata/human/starsolo/"
  genomeGTF = "/refdata/human/refdata-gex-GRCh38-2020-A/genes/genes.gtf"
  whitelist = "starscope/whitelist/V2_barcode_seq_210407_concat.txt"
  trimLength = 50
  soloType = "CB_UMI_Simple"
  soloCBstart = 1
  soloCBlen = 29
  soloUMIstart = 30
  soloUMIlen = 10
  publishSaturation = true
}

// uncomment line below if using slurm, see https://www.nextflow.io/docs/latest/executor.html
// process.executor = 'slurm'

// uncomment below chunk if using conda
// process.conda = "/home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env"
// conda.enabled = true

// docker setting, comment chunk below if using conda
process.container = "registry-intl.cn-hangzhou.aliyuncs.com/thunderbio/starscope_scrnaseq_env:1.2.5"
docker.enabled = true
docker.userEmulation = true
docker.runOptions = '--init -u $(id -u):$(id -g) $(opt=""; for group in $(id -G); do opt=$opt" --group-add $group"; done; echo $opt)'

// Resouces for each process
process {
  withLabel: process_high {
    cpus = 16
    memory = 40.GB
  }
  withLabel: process_medium {
    cpus = 4
    memory = 20.GB
  }
  withLabel: process_low {
    cpus = 4
    memory = 20.GB
  }
  withName: CHECK_SATURATION {
    cpus = 4
    memory = 10.GB
  }
  withName: CAT_FASTQ {
    cpus = 2
    memory = 4.GB
  }
  withName: TRIM_FASTQ {
    cpus = 12
    memory = 20.GB
  }
  withName: MULTIQC {
    cpus = 4
    memory = 10.GB
  }
  withName: STARSOLO {
    cpus = 16
    memory = 40.GB
  }
  withName: REPORT {
    cpus = 4
    memory = 40.GB
  }
  withName: FEATURESTATS {
    cpus = 2
    memory = 8.GB
  }
  withName: GENECOVERAGE {
    cpus = 8
    memory = 10.GB
  }
}
params {
  genomeDir = "/refdata/human/starsolo/"
  genomeGTF = "/refdata/human/refdata-gex-GRCh38-2020-A/genes/genes.gtf"
  whitelist = "starscope/whitelist/V2_barcode_seq_210407_concat.txt"
  trimLength = 50
  soloType = "CB_UMI_Simple"
  soloCBstart = 1
  soloCBlen = 29
  soloUMIstart = 30
  soloUMIlen = 10
  publishSaturation = true
}

// uncomment line below if using slurm, see https://www.nextflow.io/docs/latest/executor.html
// process.executor = 'slurm'

// command below generate singularity image: starscope_scrnaseq_env_latest.sif
// singularity pull registry-intl.cn-hangzhou.aliyuncs.com/thunderbio/starscope_scrnaseq_env:latest
process.container = "starscope_scrnaseq_env_latest.sif"
singularity.enabled = true

// Resouces for each process
process {
  withLabel: process_high {
    cpus = 16
    memory = 40.GB
  }
  withLabel: process_medium {
    cpus = 4
    memory = 20.GB
  }
  withLabel: process_low {
    cpus = 4
    memory = 20.GB
  }
  withName: CHECK_SATURATION {
    cpus = 4
    memory = 10.GB
  }
  withName: CAT_FASTQ {
    cpus = 2
    memory = 4.GB
  }
  withName: TRIM_FASTQ {
    cpus = 12
    memory = 20.GB
  }
  withName: MULTIQC {
    cpus = 4
    memory = 10.GB
  }
  withName: STARSOLO {
    cpus = 16
    memory = 40.GB
  }
  withName: REPORT {
    cpus = 4
    memory = 40.GB
  }
  withName: FEATURESTATS {
    cpus = 2
    memory = 8.GB
  }
  withName: GENECOVERAGE {
    cpus = 8
    memory = 10.GB
  }
}
params {
  genomeDir = "/refdata/human/starsolo/"
  genomeGTF = "/refdata/human/refdata-gex-GRCh38-2020-A/genes/genes.gtf"
  whitelist = "3M-february-2018.txt"
  trimLength = 28
  soloType = "CB_UMI_Simple"
  soloCBstart = 1
  soloCBlen = 16
  soloUMIstart = 17
  soloUMIlen = 12
  publishSaturation = true
}

// uncomment line below if using slurm, see https://www.nextflow.io/docs/latest/executor.html
// process.executor = 'slurm'

// uncomment below chunk if using conda
// process.conda = "/home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env"
// conda.enabled = true

// docker setting, comment chunk below if using conda
process.container = "registry-intl.cn-hangzhou.aliyuncs.com/thunderbio/starscope_scrnaseq_env:1.2.5"
docker.enabled = true
docker.userEmulation = true
docker.runOptions = '--init -u $(id -u):$(id -g) $(opt=""; for group in $(id -G); do opt=$opt" --group-add $group"; done; echo $opt)'

// Resouces for each process
process {
  withLabel: process_high {
    cpus = 16
    memory = 40.GB
  }
  withLabel: process_medium {
    cpus = 4
    memory = 20.GB
  }
  withLabel: process_low {
    cpus = 4
    memory = 20.GB
  }
    withLabel: process_high {
    cpus = 32
    memory = 32.GB
  }
  withLabel: process_medium {
    cpus = 32
    memory = 20.GB
  }
  withLabel: process_low {
    cpus = 4
    memory = 20.GB
  }
  withName: CHECK_SATURATION {
    cpus = 4
    memory = 10.GB
  }
  withName: CAT_FASTQ {
    cpus = 2
    memory = 4.GB
  }
  withName: TRIM_FASTQ {
    cpus = 12
    memory = 20.GB
  }
  withName: MULTIQC {
    cpus = 4
    memory = 10.GB
  }
  withName: STARSOLO {
    cpus = 16
    memory = 40.GB
  }
  withName: REPORT {
    cpus = 4
    memory = 40.GB
  }
  withName: FEATURESTATS {
    cpus = 2
    memory = 8.GB
  }
  withName: GENECOVERAGE {
    cpus = 8
    memory = 10.GB
  }
}

Invoke with command line options

To invoke scRNA-seq pipeline with conda environment:

starscope gex --conda \
              --conda_env /path/to/env \
              --input sampleList.csv \
              --genomeDir /path/to/STAR/reference/dir \
              --genomeGTF /path/to/genomeGTF \
              --whitelist /path/to/whitelist/V2_barcode_seq_210407_concat.txt \
              --trimLength 28 \
              --soloType CB_UMI_Simple \
              --soloCBstart 1 \
              --soloCBlen 29 \
              --soloUMIstart 30 \
              --soloUMIlen 10
starscope gex --conda \
              --conda_env /path/to/env \
              --input sampleList.csv \
              --genomeDir /path/to/STAR/reference/dir \
              --genomeGTF /path/to/genomeGTF \
              --whitelist "/path/to/TB_v3_20240429.BC1.tsv /path/to/TB_v3_20240429.BC2.tsv /path/to/TB_v3_20240429.BC3.tsv" \
              --trimLength 28 \
              --soloType CB_UMI_Complex \
              --soloAdapterSequence NNNNNNNNNGTGANNNNNNNNNGACANNNNNNNNNNNNNNNNN \
              --soloCBposition "2_0_2_8 2_13_2_21 2_26_2_34" \
              --soloUMIposition 2_35_2_42 \
              --soloCBmatchWLtype 1MM

user will have to add --conda and indicate conda env path with --conda_env. To check your env path, please use mamba env list:

# conda environments:
#
base                       /home/xzx/Tools/mambaforge
starscope_scRNAseq_env     /home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env

and provide the second column (e.g. /home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env).

Required Options

--genomeDir: STARsolo reference index path.

For instance, a typical file structure of the index folder will be like:

## human hg38
/refdata/human/starsolo/
├── chrLength.txt
├── chrNameLength.txt
├── chrName.txt
├── chrStart.txt
├── exonGeTrInfo.tab
├── exonInfo.tab
├── geneInfo.tab
├── Genome
├── genomeParameters.txt
├── Log.out
├── SA
├── SAindex
├── sjdbInfo.txt
├── sjdbList.fromGTF.out.tab
├── sjdbList.out.tab
└── transcriptInfo.tab

To create the index above, use the command below:

STAR  --runMode genomeGenerate \
      --runThreadN 10 \
      --genomeDir /path/to/outputDir \
      --genomeFastaFiles /path/to/genome.fa  \
      --sjdbGTFfile /path/to/genes.gtf

--genomeGTF: reference genome GTF file path.

User could generate the “filtered” GTF file using 10X’s mkref tool: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references

--whitelist: The white list file(s) path. ThunderBio whitelist files were distributed with StarScope:

/starscope/whitelist/
├── TB_v3_20240429.BC1.tsv
├── TB_v3_20240429.BC2.tsv
├── TB_v3_20240429.BC3.tsv
└── V2_barcode_seq_210407_concat.txt.gz

For ThunderBio chemistry v3, please use

--whitelsit "/starscope/whitelist/TB_v3_20240429.BC1.tsv /starscope/whitelist/TB_v3_20240429.BC2.tsv /starscope/whitelist/TB_v3_20240429.BC3.tsv"

and use decompressed V2_barcode_seq_210407_concat.txt.gz for ThunderBio chemistry v2:

--whitelist /starscope/whitelist/V2_barcode_seq_210407_concat.txt

Other options are parsed to STARsolo

soloType                    None
    string(s): type of single-cell RNA-seq
                            CB_UMI_Simple   ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium.
                            CB_UMI_Complex  ... one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop.
                            CB_samTagOut (Not supported)    ... output Cell Barcode as CR and/or CB SAm tag. No UMI counting. --readFilesIn cDNA_read1 [cDNA_read2 if paired-end] CellBarcode_read . Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate] (Not supported by StarScope for now)
                            SmartSeq (Not supported)        ... Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases) (Not supported by StarScope for now)

soloCBstart                 1
    int>0: cell barcode start base

soloCBlen                   16
    int>0: cell barcode length

soloUMIstart                17
    int>0: UMI start base

soloUMIlen                  10
    int>0: UMI length

soloCBposition              -
    strings(s)              position of Cell Barcode(s) on the barcode read.
                            Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.
                            Format for each barcode: startAnchor_startPosition_endAnchor_endPosition
                            start(end)Anchor defines the Anchor Base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
                            start(end)Position is the 0-based position with of the CB start(end) with respect to the Anchor Base
                            String for different barcodes are separated by space.
                            Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
                            --soloCBposition  0_0_2_-1  3_1_3_8

soloUMIposition             -
    string                  position of the UMI on the barcode read, same as soloCBposition
                            Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
                            --soloCBposition  3_9_3_14

soloAdapterSequence         -
    string:                 adapter sequence to anchor barcodes.

soloCellFilter              CellRanger2.2 3000 0.99 10
    string(s):              cell filtering type and parameters
                            None            ... do not output filtered cells
                            TopCells        ... only report top cells by UMI count, followed by the exact number of cells
                            CellRanger2.2   ... simple filtering of CellRanger 2.2.
                                                Can be followed by numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
                                                The harcoded values are from CellRanger: nExpectedCells=3000;  maxPercentile=0.99;  maxMinRatio=10
                            EmptyDrops_CR   ... EmptyDrops filtering in CellRanger flavor. Please cite the original EmptyDrops paper: A.T.L Lun et al, Genome Biology, 20, 63 (2019): https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1662-y
                                                Can be followed by 10 numeric parameters:  nExpectedCells   maxPercentile   maxMinRatio   indMin   indMax   umiMin   umiMinFracMedian   candMaxN   FDR   simN
                                                The harcoded values are from CellRanger:             3000            0.99            10    45000    90000      500               0.01      20000  0.01  10000

Outputs

Each sample will have a separated result folder named with sample ID. The sub-directory final contains most of the result files, including html report, gene expression matrix (filtered: contains cell associated barcodes only; raw: contains all barcodes).

The pipeline_info directory contains statistics of the pipeline running resources.

results/human_test/
├── cutqc
│   └── human_test.cutadapt.json
├── final
│   ├── human_test_DEG.tsv
│   ├── human_test.matrix_filtered
│   │   ├── barcodes.tsv.gz
│   │   ├── features.tsv.gz
│   │   └── matrix.mtx.gz
│   ├── human_test.matrix_raw
│   │   ├── barcodes.tsv.gz
│   │   ├── features.tsv.gz
│   │   └── matrix.mtx.gz
│   ├── human_test.metrics.json
│   ├── human_test.metrics.tsv
│   ├── human_test_report.html
│   └── human_test.saturation_out.json
├── multiqc
│   └── human_test_multiqc_report.html
└── starsolo
    ├── human_test.CellReads.stats
    ├── human_test.Log.final.out
    ├── human_test.Log.out
    ├── human_test.Log.progress.out
    ├── human_test.matrix_filtered
    │   ├── barcodes.tsv.gz
    │   ├── features.tsv.gz
    │   └── matrix.mtx.gz
    ├── human_test.matrix_raw
    │   ├── barcodes.tsv.gz
    │   ├── features.tsv.gz
    │   └── matrix.mtx.gz
    ├── human_test.SJ.out.tab
    ├── human_test_summary.unique.csv
    └── human_test_UMIperCellSorted.unique.txt

8 directories, 26 files

WorkDir

By default, the intermediate files will be written to sub-directory of work under the pipeline running directory, please feel free to remove it after all the processes finished successfully.

Make Reference

To build a genome reference index, StarScope included a subcommand mkref to help user achieve this via nextflow run.

starscope mkref --conda \
--conda_env /path/to/env \
--genomeFasta genome.fa \
--genomeGTF genes.gtf 

The output index directory will be like:

results/STAR_reference/
├── chrLength.txt
├── chrNameLength.txt
├── chrName.txt
├── chrStart.txt
├── exonGeTrInfo.tab
├── exonInfo.tab
├── geneInfo.tab
├── Genome
├── genomeParameters.txt
├── Log.out
├── SA
├── SAindex
├── sjdbInfo.txt
├── sjdbList.fromGTF.out.tab
├── sjdbList.out.tab
└── transcriptInfo.tab

0 directories, 16 files

We also included a helper script to perform a 10X like filtration step on reference files downloaded from Ensembl, user could invoke with:

starscope/process_reference.sh --fasta origin_genome.fa.gz --gtf origin.gtf.gz

The output files are genome.fa and genes.gtf in the working directory.