StarScope

StarScope

StarScope provides a convenient way to run ThunderBio single cell analysis nextflow pipelines. scRNA-seq workflow uses starsolo (https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md) and Seurat package (https://satijalab.org/seurat/) as the core modules to process 3’ single cell RNA-seq data and generate a concise HTML report. We also integrated single cell immune repertopire and ATAC analysis workflows as other two subcommands, both of which were also built using nextflow.

Quick Start with ThunderBio Example Data

ThunderBio will provide human and mouse pre-built reference and two sets example data upon request. The demo_data fold contains two directories, human and mouse. In each directory, user could invoke the pipeline by running command below. Please note the reference index has to be placed in the same parent folder of the demo_data.

## Human Data Testing with conda or docker
cd demo_data/human
starscope run --input sampleList.csv --config thunderbio_human_conda.config
starscope run --input sampleList.csv --config thunderbio_human_docker.config
## Mouse Data Testing with conda or docekr
cd demo_data/mouse
demo_data/
├── human
│   ├── human_test.R1.fq.gz
│   ├── human_test.R2.fq.gz
│   ├── sampleList.csv
│   ├── thunderbio_human_conda.config
│   └── thunderbio_human_docker.config
├── mouse
│   ├── mouse_test.R1.fq.gz
│   ├── mouse_test.R2.fq.gz
│   ├── sampleList.csv
│   ├── thunderbio_mouse_conda.config
│   └── thunderbio_mouse_docker.config
└── V2_barcode_seq_210407_concat.txt

Release Note

StarScope v1.1.4

  • Updated script for saturation calculation, and included UMI and gene hist data for {preseqR} prediction

  • Now a tsv summary file will be outputted in the report process

StarScope v1.1.3

  • Added --outBAMsortingBinsN 300 option to STARsolo process to solve the RAM issue when sorting a very large BAM file, refer to STAR #870

StarScope v1.1.2

  • Use soft link instead of cat when there is only one fastq file

  • Added support for VDJ subworkflow to only analyze BCR or TCR dataset

  • Added support for publishing saturation json file

StarScope v1.1.1

  • Honer the memory option in the report process

  • Added option for publishing starsolo BAM output

  • Added option to enable conda env for nextflow 22.10, and the nextflow version was fixed to 22.04

StarScope v1.1.0

  • Added VDJ workflow (GEX+VDJ)

  • Removed time limit for process resources

  • Modified the margin of the tables in the 3’-scRNA-Seq report

  • Adjusted knee plot to fixed size and center positioned

StarScope v1.0.0

  • Added pipeline running information to the reports

  • Fixed issue for saturation calculation script when there is no whitelist provided

  • Output a full list of DEGs

StarScope v0.0.9

  • Updated STAR to v2.7.10a in both the docker image and the conda environment

  • Adapt the pipeline to generate report for starsolo --soloMultiMappers option, only “Unique” and “EM” are supported by now.

  • Changed the default parameter for --soloFeatures to “GeneFull”, which includes both exon and intron reads

StarScope v0.0.8

  • Fixed the issue that RunPCA uses up all cores and consumes too much memory (Seurat #3991)

  • Changed h5file format to H5Seurat

  • Disabled sass cache in rmd report

  • Changed future package strategy to multisession to avoid fork errors

StarScope v0.0.7

  • Raw gene/barcode count table will be stored in a loom file and published for subsequent analysis

  • Mapping metrics will be written to a json file

  • Added support for 5’ RNA-seq library

StarScope v0.0.6

  • Now the BAM output will not by published

  • Added support for counting intron reads

  • Added check_version command, to print all software versions including docker or conda engine information

  • Fixed STAR to 2.7.9a and samtools to 1.15

  • Fixed issue that no report generated if too few cells detected

StarScope v0.0.5

  • Fixed issue that no report generated if there are too few cells left after filtering by nFeature >= 200

StarScope v0.0.3, v0.0.4

  • Set thunderbio data running parameters as defaults instead of 10x parameters

  • Added whitelists for thunderbio V2, 10x V2 and V3

  • Fixed nextflow default path issue

  • Fixed conda env name issue

  • Implemented docker container running environment

  • Added a new script for user to generate 10x compatible STAR reference

StarScope v0.0.2

  • Implemented mkref command

  • Added output section for README

StarScope v0.0.1

  • Implemented run command

  • Added running env checking procedure

Subsections of StarScope

Chapter 1

Basics

Instruction of the workflows.

Subsections of Basics

Installation

Requirements

  • Java 11 or higher
  • Nextflow
  • Conda/miniconda
  • Docker

Java

Nextflow will need java 11 or higher to be installed. The recommanded way to install java is through SDKMAN. Please use the command below:

Install SDKMAN:

curl -s https://get.sdkman.io | bash

Open a new terminal and install Java

sdk install java 17.0.10-tem

Check java installation and comfirm it’s version

java -version

Nextflow

Nextflow binary was already included in the StarScope directory. User also could download binary from nextflow’s github release page.

By default, starscope will invoke the nextflow executable stored in the same directory, user could add both of the two executables to $PATH (e.g. ~/.local/bin)

## starscope executable
ln -s starscope/starscope ~/.local/bin/starscope

## nextflow
ln -s starscope/nextflow ~/.local/bin/nextflow

Confirm that nextflow runs properly with the command below (require network access to github):

NXF_VER=23.10.1 nextflow run hello

The output will be:

N E X T F L O W ~ version 22.04.5
Launching `https://github.com/nextflow-io/hello` [distraught_ride] DSL2 - revision: 4eab81bd42 [master]
executor > local (4)
[92/5fbfca] process > sayHello (4) [100%] 4 of 4 ✔
Bonjour world!

Hello world!

Ciao world!

Hola world!

Conda/miniconda

Install Conda

We usually use miniforge’s conda distro, user also could install via conda official installer, or use mamba directly, which is much faster.

Miniforge:

wget -c https://github.com/conda-forge/miniforge/releases/download/24.3.0-0/Mambaforge-24.3.0-0-Linux-x86_64.sh
bash Mambaforge-24.3.0-0-Linux-x86_64.sh

Official minicoda:

wget -c wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Micromamba, user may need to put micromamba binary into $PATH

# Linux Intel (x86_64):
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba

Create Environment

One could create conda environment with the yaml file in the workflow directory.

## scRNA-seq/VDJ environment
mamba env create -f starscope/scRNA-seq/scRNAseq_env.yml

## scATAC-seq environment
mamba env create -f starscope/scATAC-seq/scATAC_env.yml

Or extract environment from archive distributed by ThunderBio with conda-pack

# Unpack environment into directory `starscope_env`
$ mkdir -p starscope_env
$ tar -xzf starscope_env.tar.gz -C starscope_env
# Activate the environment. This adds `starscope_env/bin` to your path
$ source starscope_env/bin/activate
# Cleanup prefixes from in the active environment.
# Note that this command can also be run without activating the environment
# as long as some version of Python is already installed on the machine.
(starscope_env) $ conda-unpack
# deactivete env
$ source starscope_env/bin/deactivate

Docker

Using docker is much easier to integrate the workflow to large infrastructure like cloud platforms or HPC, thus is recommended. To install:

curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh

To use docker command without sudo, add your account to docker group:

sudo usermod -aG docker $(whoami)

Then log out and login again for the changes to take effect.

Please pull the pre-built image with:

## scATAC-seq image
docker pull registry-intl.cn-hangzhou.aliyuncs.com/thunderbio/starscope_scatac_env:latest

## scRNA-seq/VDJ image
docker pull registry-intl.cn-hangzhou.aliyuncs.com/thunderbio/starscope_scrnaseq_env:latest

Essentials

Running Information

After invoking the pipeline, nextflow will report the progress to stdout, with each row representing a process.

N E X T F L O W  ~  version 23.10.1
Launching `/thunderData/pipeline/starscope/scRNA-seq/main.nf` [adoring_ekeblad] DSL2 - revision: 8e27902b23
executor >  slurm (9)
[e0/1d00d4] process > scRNAseq:CAT_FASTQ (human_test)        [100%] 1 of 1 ✔
[37/8c0795] process > scRNAseq:TRIM_FASTQ (human_test)       [100%] 1 of 1 ✔
[20/1edf9b] process > scRNAseq:MULTIQC (human_test)          [100%] 1 of 1 ✔
[5a/e0becc] process > scRNAseq:STARSOLO (human_test)         [100%] 1 of 1 ✔
[02/15a3b1] process > scRNAseq:CHECK_SATURATION (human_test) [100%] 1 of 1 ✔
[09/e25428] process > scRNAseq:GET_VERSIONS (get_versions)   [100%] 1 of 1 ✔
[48/703c20] process > scRNAseq:FEATURESTATS (human_test)     [100%] 1 of 1 ✔
[79/cd2784] process > scRNAseq:GENECOVERAGE (human_test)     [100%] 1 of 1 ✔
[e6/808adf] process > scRNAseq:REPORT (human_test)           [100%] 1 of 1 ✔
Completed at: 09-May-2024 09:07:55
Duration    : 25m 9s
CPU hours   : 3.7
Succeeded   : 9

When encountering any error, nextflow will interrupt running and print error message to stderr directly.

User could also check the error message from running log file .nextflow.log

$ head .nextflow.log

May-09 08:42:37.523 [main] DEBUG nextflow.cli.Launcher - $> nextflow run /thunderData/pipeline/starscope/scRNA-seq -c /thunderData/pipeline/nf_scRNAseq_config/latest/thunderbio_human_config --input sampleList.csv
May-09 08:42:37.924 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 23.10.1
May-09 08:42:38.096 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; embedded=false; plugins-dir=/home/xzx/.nextflow/plugins; core-plugins: nf-amazon@2.1.4,nf-azure@1.3.3,nf-cloudcache@0.3.0,nf-codecommit@0.1.5,nf-console@1.0.6,nf-ga4gh@1.1.0,nf-google@1.8.3,nf-tower@1.6.3,nf-wave@1.0.1
May-09 08:42:38.147 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
May-09 08:42:38.150 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
May-09 08:42:38.163 [main] INFO  org.pf4j.DefaultPluginManager - PF4J version 3.4.1 in 'deployment' mode
May-09 08:42:38.234 [main] INFO  org.pf4j.AbstractPluginManager - No plugins
May-09 08:42:42.225 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /thunderData/pipeline/starscope/scRNA-seq/nextflow.config
May-09 08:42:42.231 [main] DEBUG nextflow.config.ConfigBuilder - User config file: /thunderData/pipeline/nf_scRNAseq_config/latest/thunderbio_human_config_v2
May-09 08:42:42.233 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /thunderData/pipeline/starscope/scRNA-seq/nextflow.config

Nextflow Log CLI

After each invokation, the pipeline running information could be retrieved by nextflow log command, and user could check the RUN NAME, STATUS and SESSION ID from the command output.

$ nextflow log

TIMESTAMP          	DURATION	RUN NAME       	STATUS	REVISION ID	SESSION ID                          	COMMAND                                                                                                                                                      
2024-05-09 08:42:44	25m 12s 	adoring_ekeblad	OK    	8e27902b23 	8670925f-ce5a-4f7a-b327-a98b288e6aa6	nextflow run /thunderData/pipeline/starscope/scRNA-seq -c /thunderData/pipeline/nf_scRNAseq_config/latest/thunderbio_human_config --input sampleList.csv

Work Dir and Intermediate Files

Each task of the process will be conducted in a sub-directory of the workDir set in nextflow configuration file. By default, StarScope set this to work folder under project running directory. To confirm each tasks’ working directory, user will have to check the task hash id with command below. The adoring_ekeblad is the RUN NAME from nextflow log output.

$ nextflow log adoring_ekeblad -f hash,name,exit,status

e0/1d00d4	scRNAseq:CAT_FASTQ (human_test)	0	COMPLETED
09/e25428	scRNAseq:GET_VERSIONS (get_versions)	0	COMPLETED
37/8c0795	scRNAseq:TRIM_FASTQ (human_test)	0	COMPLETED
20/1edf9b	scRNAseq:MULTIQC (human_test)	0	COMPLETED
5a/e0becc	scRNAseq:STARSOLO (human_test)	0	COMPLETED
79/cd2784	scRNAseq:GENECOVERAGE (human_test)	0	COMPLETED
48/703c20	scRNAseq:FEATURESTATS (human_test)	0	COMPLETED
02/15a3b1	scRNAseq:CHECK_SATURATION (human_test)	0	COMPLETED
e6/808adf	scRNAseq:REPORT (human_test)	0	COMPLETED

To check CAT_FASTQ process task working directory, we could use it’s hash_id (e0/1d00d4) to locate the folder in work:

$ ls -a work/e0/1d00d49d7d562790a4d4f5993852ba/

.   .command.begin  .command.log  .command.run  .command.trace  human_test_1.merged.fq.gz  human_test.R1.fq.gz
..  .command.err    .command.out  .command.sh   .exitcode       human_test_2.merged.fq.gz  human_test.R2.fq.gz

The work directory always contains several important hidden files:

  1. .command.out STDOUT from tool.
  2. .command.err STDERR from tool.
  3. .command.log contains both STDOUT and STDERR from tool.
  4. .command.begin created as soon as the job launches.
  5. .exitcode created when the job ends, with exit code.
  6. .command.trace logs of compute resource usage.
  7. .command.run wrapper script used to run the job.
  8. .command.sh process command used for this task.
$ cat work/e0/1d00d49d7d562790a4d4f5993852ba/.command.sh 

#!/bin/bash -ue
ln -s human_test.R1.fq.gz human_test_1.merged.fq.gz
ln -s human_test.R2.fq.gz human_test_2.merged.fq.gz

Running in Background

The nextflow pipeline could be execute in background, with -bg option:

starscope gex --input sampleList.csv --config custom_config -bg

Resume Previous Run

One of the core features of Nextflow is the ability to cache task executions and re-use them in subsequent runs to minimize duplicate work. Resumability is useful both for recovering from errors and for iteratively developing a pipeline. It is similar to checkpointing, a common practice used by HPC applications.

To resume from previous run, please use the command below after entering the project running directory:

starscope gex --input sampleList.csv --config custom_config -bg -resume

Or resume from a specific run with session ID (check from nextflow log output):

starscope gex --input sampleList.csv --config custom_config -bg -resume 8670925f-ce5a-4f7a-b327-a98b288e6aa6

Additional resources:

Pipeline Tracing

By default, each run of the pipeline will generate three tracing files in results/pipeline_info/, check nextflow document for details.

  • execution_trace_<timeStamp>.txt
  • execution_timeline_<timeStamp>.html
  • execution_report_<timeStamp>.html

Trace Report

Nextflow generates an execution tracing tsv file with valuable details on each process, including submission time, start time, completion time, CPU usage, and memory consumption.

The content of the trace report will be like:

task_id	hash	native_id	name	status	exit	submit	duration	realtime	%cpu	peak_rss	peak_vmem	rchar	wchar
1	02/3370e5	2820	CAT_FASTQ (ATAC05_test)	COMPLETED	0	2024-04-09 01:34:57.788	4.7s	3ms	94.1%	0	0	90.5 KB	208 B
2	ad/83d089	2821	CHECK_BARCODE (ATAC05_test)	COMPLETED	0	2024-04-09 01:35:02.520	48m 31s	48m 25s	284.3%	280 MB	1.5 GB	164.6 GB	164.2 GB
3	94/66efd8	2822	TRIM_FASTQ (ATAC05_test)	COMPLETED	0	2024-04-09 02:23:33.118	5m 55s	5m 49s	1530.7%	799.8 MB	3.4 GB	141 GB	139.5 GB
4	7a/cb0bc7	2823	BWA_MAPPING (ATAC05_test)	COMPLETED	0	2024-04-09 02:29:28.124	48m 30s	48m 29s	1492.5%	24.7 GB	54.6 GB	198.4 GB	172.4 GB
6	b1/9ce1a2	2824	DEDUP (ATAC05_test)	COMPLETED	0	2024-04-09 03:17:58.183	14m 45s	14m 39s	190.6%	2.1 GB	2.4 GB	21.6 GB	11 GB
8	19/4eca53	2827	MULTIQC (ATAC05_test)	COMPLETED	0	2024-04-09 03:32:43.215	12m 50s	12m 31s	326.5%	1.6 GB	5.4 GB	22.2 GB	23.4 MB
7	f7/98d61e	2826	CHECK_SATURATION (ATAC05_test)	COMPLETED	0	2024-04-09 03:17:58.200	1h 33m 30s	1h 33m 26s	248.1%	8 GB	18.1 GB	498.8 GB	344.1 GB
5	72/fd29b0	2825	GENERATE_FRAGMENTS (ATAC05_test)	COMPLETED	0	2024-04-09 03:17:58.193	1h 57m 30s	24m 2s	307.0%	17.3 GB	51.3 GB	17.4 GB	6.8 GB
9	6a/d6df90	2828	SIGNAC (ATAC05_test)	COMPLETED	0	2024-04-09 05:15:28.385	6m 15s	6m 11s	91.6%	5.3 GB	16.2 GB	9.1 GB	2.1 GB
10	6a/c6c563	2829	STATS (ATAC05_test)	COMPLETED	0	2024-04-09 05:21:43.412	34.9s	32.5s	270.9%	8.6 MB	374.9 MB	13.2 GB	2.1 GB
11	4b/13dee8	2830	REPORT (ATAC05_test)	COMPLETED	0	2024-04-09 05:22:18.374	1m 20s	1m 18s	103.7%	5.5 GB	1 TB	329 MB	78.5 MB

Timeline Report

Nextflow can also create an HTML timeline report for all pipeline processes. See example below:

Each bar represents a single process run, the length of the bar represents the duration time (wall-time). Colored part of the bar indicates real processing time, grey part represents scheduling wait time.

Execution report

The execution report is more concise, which logs running information in the summary section, summarizes task resource usage using plotly.js and collects task metrics like trace report with more terms in a table.

Chapter 2

scRNA-seq Workflow

Subsections of scRNA-seq Workflow

Invoke with StarScope

Input

The input sample list file is a simple csv with four columns. The first column sample indicates sample IDs, and multiple fastq files with the same sample ID will be concatenated before further processing (e.g. two pairs of human_pbmc_s1 fastq files will be cat to a single pair). Multiple samples in one single sample list will be submitted parallelly and processed asynchronously. The fourth column indicates expected number of cells used for starsolo --soloCellFilter parameter.

sample,fastq_1,fastq_2,expected_cells
human_test,human_test.R1.fq.gz,human_test.R2.fq.gz,3000
human_pbmc_s1,human_pbmc_s1_R1_001.fastq.gz,human_pbmc_s1_R2_001.fastq.gz,8000
human_pbmc_s1,human_pbmc_s1_R1_002.fastq.gz,human_pbmc_s1_R2_002.fastq.gz,8000
Note

All samples in the sampleList will use the same options. Therefore, a combination of samples from different species is not supported.

Command

Invoke with config file

Due to the abundance of processes and options, we suggest utilizing a custom configuration file when invoking the pipeline. This allows for greater control over resource management. Please refer to the corresponding configuration examples below to utilize docker or singularity as running environment.

starscope gex --input sampleList.csv --config example_docker.config
params {
  genomeDir = "/refdata/human/starsolo/"
  genomeGTF = "/refdata/human/refdata-gex-GRCh38-2020-A/genes/genes.gtf"
  whitelist = "starscope/whitelist/V2_barcode_seq_210407_concat.txt"
  trimLength = 50
  soloType = "CB_UMI_Simple"
  soloCBstart = 1
  soloCBlen = 29
  soloUMIstart = 30
  soloUMIlen = 10
  publishSaturation = true
}

// uncomment line below if using slurm, see https://www.nextflow.io/docs/latest/executor.html
// process.executor = 'slurm'

// uncomment below chunk if using conda
// process.conda = "/home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env"
// conda.enabled = true

// docker setting, comment chunk below if using conda
process.container = "registry-intl.cn-hangzhou.aliyuncs.com/thunderbio/starscope_scrnaseq_env:1.2.5"
docker.enabled = true
docker.userEmulation = true
docker.runOptions = '--init -u $(id -u):$(id -g) $(opt=""; for group in $(id -G); do opt=$opt" --group-add $group"; done; echo $opt)'

// Resouces for each process
process {
  withLabel: process_high {
    cpus = 16
    memory = 40.GB
  }
  withLabel: process_medium {
    cpus = 4
    memory = 20.GB
  }
  withLabel: process_low {
    cpus = 4
    memory = 20.GB
  }
  withName: CHECK_SATURATION {
    cpus = 4
    memory = 10.GB
  }
  withName: CAT_FASTQ {
    cpus = 2
    memory = 4.GB
  }
  withName: TRIM_FASTQ {
    cpus = 12
    memory = 20.GB
  }
  withName: MULTIQC {
    cpus = 4
    memory = 10.GB
  }
  withName: STARSOLO {
    cpus = 16
    memory = 40.GB
  }
  withName: REPORT {
    cpus = 4
    memory = 40.GB
  }
  withName: FEATURESTATS {
    cpus = 2
    memory = 8.GB
  }
  withName: GENECOVERAGE {
    cpus = 8
    memory = 10.GB
  }
}
params {
  genomeDir = "/refdata/human/starsolo/"
  genomeGTF = "/refdata/human/refdata-gex-GRCh38-2020-A/genes/genes.gtf"
  whitelist = "starscope/whitelist/V2_barcode_seq_210407_concat.txt"
  trimLength = 50
  soloType = "CB_UMI_Simple"
  soloCBstart = 1
  soloCBlen = 29
  soloUMIstart = 30
  soloUMIlen = 10
  publishSaturation = true
}

// uncomment line below if using slurm, see https://www.nextflow.io/docs/latest/executor.html
// process.executor = 'slurm'

// command below generate singularity image: starscope_scrnaseq_env_latest.sif
// singularity pull registry-intl.cn-hangzhou.aliyuncs.com/thunderbio/starscope_scrnaseq_env:latest
process.container = "starscope_scrnaseq_env_latest.sif"
singularity.enabled = true

// Resouces for each process
process {
  withLabel: process_high {
    cpus = 16
    memory = 40.GB
  }
  withLabel: process_medium {
    cpus = 4
    memory = 20.GB
  }
  withLabel: process_low {
    cpus = 4
    memory = 20.GB
  }
  withName: CHECK_SATURATION {
    cpus = 4
    memory = 10.GB
  }
  withName: CAT_FASTQ {
    cpus = 2
    memory = 4.GB
  }
  withName: TRIM_FASTQ {
    cpus = 12
    memory = 20.GB
  }
  withName: MULTIQC {
    cpus = 4
    memory = 10.GB
  }
  withName: STARSOLO {
    cpus = 16
    memory = 40.GB
  }
  withName: REPORT {
    cpus = 4
    memory = 40.GB
  }
  withName: FEATURESTATS {
    cpus = 2
    memory = 8.GB
  }
  withName: GENECOVERAGE {
    cpus = 8
    memory = 10.GB
  }
}
params {
  genomeDir = "/refdata/human/starsolo/"
  genomeGTF = "/refdata/human/refdata-gex-GRCh38-2020-A/genes/genes.gtf"
  whitelist = "3M-february-2018.txt"
  trimLength = 28
  soloType = "CB_UMI_Simple"
  soloCBstart = 1
  soloCBlen = 16
  soloUMIstart = 17
  soloUMIlen = 12
  publishSaturation = true
}

// uncomment line below if using slurm, see https://www.nextflow.io/docs/latest/executor.html
// process.executor = 'slurm'

// uncomment below chunk if using conda
// process.conda = "/home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env"
// conda.enabled = true

// docker setting, comment chunk below if using conda
process.container = "registry-intl.cn-hangzhou.aliyuncs.com/thunderbio/starscope_scrnaseq_env:1.2.5"
docker.enabled = true
docker.userEmulation = true
docker.runOptions = '--init -u $(id -u):$(id -g) $(opt=""; for group in $(id -G); do opt=$opt" --group-add $group"; done; echo $opt)'

// Resouces for each process
process {
  withLabel: process_high {
    cpus = 16
    memory = 40.GB
  }
  withLabel: process_medium {
    cpus = 4
    memory = 20.GB
  }
  withLabel: process_low {
    cpus = 4
    memory = 20.GB
  }
    withLabel: process_high {
    cpus = 32
    memory = 32.GB
  }
  withLabel: process_medium {
    cpus = 32
    memory = 20.GB
  }
  withLabel: process_low {
    cpus = 4
    memory = 20.GB
  }
  withName: CHECK_SATURATION {
    cpus = 4
    memory = 10.GB
  }
  withName: CAT_FASTQ {
    cpus = 2
    memory = 4.GB
  }
  withName: TRIM_FASTQ {
    cpus = 12
    memory = 20.GB
  }
  withName: MULTIQC {
    cpus = 4
    memory = 10.GB
  }
  withName: STARSOLO {
    cpus = 16
    memory = 40.GB
  }
  withName: REPORT {
    cpus = 4
    memory = 40.GB
  }
  withName: FEATURESTATS {
    cpus = 2
    memory = 8.GB
  }
  withName: GENECOVERAGE {
    cpus = 8
    memory = 10.GB
  }
}

Invoke with command line options

To invoke scRNA-seq pipeline with conda environment:

starscope gex --conda \
              --conda_env /path/to/env \
              --input sampleList.csv \
              --genomeDir /path/to/STAR/reference/dir \
              --genomeGTF /path/to/genomeGTF \
              --whitelist /path/to/whitelist/V2_barcode_seq_210407_concat.txt \
              --trimLength 28 \
              --soloType CB_UMI_Simple \
              --soloCBstart 1 \
              --soloCBlen 29 \
              --soloUMIstart 30 \
              --soloUMIlen 10
starscope gex --conda \
              --conda_env /path/to/env \
              --input sampleList.csv \
              --genomeDir /path/to/STAR/reference/dir \
              --genomeGTF /path/to/genomeGTF \
              --whitelist "/path/to/TB_v3_20240429.BC1.tsv /path/to/TB_v3_20240429.BC2.tsv /path/to/TB_v3_20240429.BC3.tsv" \
              --trimLength 28 \
              --soloType CB_UMI_Complex \
              --soloAdapterSequence NNNNNNNNNGTGANNNNNNNNNGACANNNNNNNNNNNNNNNNN \
              --soloCBposition "2_0_2_8 2_13_2_21 2_26_2_34" \
              --soloUMIposition 2_35_2_42 \
              --soloCBmatchWLtype 1MM

user will have to add --conda and indicate conda env path with --conda_env. To check your env path, please use mamba env list:

# conda environments:
#
base                       /home/xzx/Tools/mambaforge
starscope_scRNAseq_env     /home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env

and provide the second column (e.g. /home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env).

Required Options

--genomeDir: STARsolo reference index path.

For instance, a typical file structure of the index folder will be like:

## human hg38
/refdata/human/starsolo/
├── chrLength.txt
├── chrNameLength.txt
├── chrName.txt
├── chrStart.txt
├── exonGeTrInfo.tab
├── exonInfo.tab
├── geneInfo.tab
├── Genome
├── genomeParameters.txt
├── Log.out
├── SA
├── SAindex
├── sjdbInfo.txt
├── sjdbList.fromGTF.out.tab
├── sjdbList.out.tab
└── transcriptInfo.tab

To create the index above, use the command below:

STAR  --runMode genomeGenerate \
      --runThreadN 10 \
      --genomeDir /path/to/outputDir \
      --genomeFastaFiles /path/to/genome.fa  \
      --sjdbGTFfile /path/to/genes.gtf

--genomeGTF: reference genome GTF file path.

User could generate the “filtered” GTF file using 10X’s mkref tool: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references

--whitelist: The white list file(s) path. ThunderBio whitelist files were distributed with StarScope:

/starscope/whitelist/
├── TB_v3_20240429.BC1.tsv
├── TB_v3_20240429.BC2.tsv
├── TB_v3_20240429.BC3.tsv
└── V2_barcode_seq_210407_concat.txt.gz

For ThunderBio chemistry v3, please use

--whitelsit "/starscope/whitelist/TB_v3_20240429.BC1.tsv /starscope/whitelist/TB_v3_20240429.BC2.tsv /starscope/whitelist/TB_v3_20240429.BC3.tsv"

and use decompressed V2_barcode_seq_210407_concat.txt.gz for ThunderBio chemistry v2:

--whitelist /starscope/whitelist/V2_barcode_seq_210407_concat.txt

Other options are parsed to STARsolo

soloType                    None
    string(s): type of single-cell RNA-seq
                            CB_UMI_Simple   ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium.
                            CB_UMI_Complex  ... one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop.
                            CB_samTagOut (Not supported)    ... output Cell Barcode as CR and/or CB SAm tag. No UMI counting. --readFilesIn cDNA_read1 [cDNA_read2 if paired-end] CellBarcode_read . Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate] (Not supported by StarScope for now)
                            SmartSeq (Not supported)        ... Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases) (Not supported by StarScope for now)

soloCBstart                 1
    int>0: cell barcode start base

soloCBlen                   16
    int>0: cell barcode length

soloUMIstart                17
    int>0: UMI start base

soloUMIlen                  10
    int>0: UMI length

soloCBposition              -
    strings(s)              position of Cell Barcode(s) on the barcode read.
                            Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.
                            Format for each barcode: startAnchor_startPosition_endAnchor_endPosition
                            start(end)Anchor defines the Anchor Base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
                            start(end)Position is the 0-based position with of the CB start(end) with respect to the Anchor Base
                            String for different barcodes are separated by space.
                            Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
                            --soloCBposition  0_0_2_-1  3_1_3_8

soloUMIposition             -
    string                  position of the UMI on the barcode read, same as soloCBposition
                            Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
                            --soloCBposition  3_9_3_14

soloAdapterSequence         -
    string:                 adapter sequence to anchor barcodes.

soloCellFilter              CellRanger2.2 3000 0.99 10
    string(s):              cell filtering type and parameters
                            None            ... do not output filtered cells
                            TopCells        ... only report top cells by UMI count, followed by the exact number of cells
                            CellRanger2.2   ... simple filtering of CellRanger 2.2.
                                                Can be followed by numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
                                                The harcoded values are from CellRanger: nExpectedCells=3000;  maxPercentile=0.99;  maxMinRatio=10
                            EmptyDrops_CR   ... EmptyDrops filtering in CellRanger flavor. Please cite the original EmptyDrops paper: A.T.L Lun et al, Genome Biology, 20, 63 (2019): https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1662-y
                                                Can be followed by 10 numeric parameters:  nExpectedCells   maxPercentile   maxMinRatio   indMin   indMax   umiMin   umiMinFracMedian   candMaxN   FDR   simN
                                                The harcoded values are from CellRanger:             3000            0.99            10    45000    90000      500               0.01      20000  0.01  10000

Outputs

Each sample will have a separated result folder named with sample ID. The sub-directory final contains most of the result files, including html report, gene expression matrix (filtered: contains cell associated barcodes only; raw: contains all barcodes).

The pipeline_info directory contains statistics of the pipeline running resources.

results/human_test/
├── cutqc
│   └── human_test.cutadapt.json
├── final
│   ├── human_test_DEG.tsv
│   ├── human_test.matrix_filtered
│   │   ├── barcodes.tsv.gz
│   │   ├── features.tsv.gz
│   │   └── matrix.mtx.gz
│   ├── human_test.matrix_raw
│   │   ├── barcodes.tsv.gz
│   │   ├── features.tsv.gz
│   │   └── matrix.mtx.gz
│   ├── human_test.metrics.json
│   ├── human_test.metrics.tsv
│   ├── human_test_report.html
│   └── human_test.saturation_out.json
├── multiqc
│   └── human_test_multiqc_report.html
└── starsolo
    ├── human_test.CellReads.stats
    ├── human_test.Log.final.out
    ├── human_test.Log.out
    ├── human_test.Log.progress.out
    ├── human_test.matrix_filtered
    │   ├── barcodes.tsv.gz
    │   ├── features.tsv.gz
    │   └── matrix.mtx.gz
    ├── human_test.matrix_raw
    │   ├── barcodes.tsv.gz
    │   ├── features.tsv.gz
    │   └── matrix.mtx.gz
    ├── human_test.SJ.out.tab
    ├── human_test_summary.unique.csv
    └── human_test_UMIperCellSorted.unique.txt

8 directories, 26 files

WorkDir

By default, the intermediate files will be written to sub-directory of work under the pipeline running directory, please feel free to remove it after all the processes finished successfully.

Make Reference

To build a genome reference index, StarScope included a subcommand mkref to help user achieve this via nextflow run.

starscope mkref --conda \
--conda_env /path/to/env \
--genomeFasta genome.fa \
--genomeGTF genes.gtf 

The output index directory will be like:

results/STAR_reference/
├── chrLength.txt
├── chrNameLength.txt
├── chrName.txt
├── chrStart.txt
├── exonGeTrInfo.tab
├── exonInfo.tab
├── geneInfo.tab
├── Genome
├── genomeParameters.txt
├── Log.out
├── SA
├── SAindex
├── sjdbInfo.txt
├── sjdbList.fromGTF.out.tab
├── sjdbList.out.tab
└── transcriptInfo.tab

0 directories, 16 files

We also included a helper script to perform a 10X like filtration step on reference files downloaded from Ensembl, user could invoke with:

starscope/process_reference.sh --fasta origin_genome.fa.gz --gtf origin.gtf.gz

The output files are genome.fa and genes.gtf in the working directory.

Chapter 3

VDJ Workflow

Chapter homepage for VDJ immune repertoire analysis module.

Subsections of VDJ Workflow

Invoke with StarScope

Input

The input sample list file is csv file with five columns: sample, fastq_1, fastq_2, feature_types and expected_cells. The fourth column represents library types: GEX, VDJ-T or VDJ-B. A sample with only VDJ-T or VDJ-B library is supported, but all the samples must have at least one VDJ library. Please use scRNA-seq workflow if you only have GEX library. The first column sample indicates sample IDs, and multiple fastq files with the same sample ID will be concatenated before further processing (e.g. two pairs of human_pbmc_s1 fastq files will be cat to a single pair). Multiple samples in one single sample list will be submitted parallelly and processed asynchronously. The fourth column indicates expected number of cells used for starsolo --soloCellFilter parameter.

sample,fastq_1,fastq_2,feature_types,expected_cells
human_test,human_test_gex.R1.fq.gz,human_test_gex.R2.fq.gz,GEX,3000
human_test,human_test_tcr.R1.fq.gz,human_test_tcr.R2.fq.gz,VDJ-T,2000
human_pbmc_s1,human_pbmc_s1_gex_R1_001.fastq.gz,human_pbmc_s1_gex_R2_001.fastq.gz,GEX,8000
human_pbmc_s1,human_pbmc_s1_gex_R1_002.fastq.gz,human_pbmc_s1_gex_R2_002.fastq.gz,GEX,8000
human_pbmc_s1,human_pbmc_s1_bcr.R1.fastq.gz,human_pbmc_s1,human_pbmc_s1_bcr.R2.fastq.gz,VDJ-B,1000
Note

All samples in the sampleList will use the same options. Therefore, a combination of samples from different species is not supported.

Command

Invoke with config file

Due to the abundance of processes and options, we suggest utilizing a custom configuration file when invoking the pipeline. This allows for greater control over resource management. ThunderBio has released two versions of the chemistry, each with a distinct barcode structure. Please refer to the corresponding configuration examples provided below.

starscope vdj_gex --input sampleList.csv --config example_docker.config
params {
  genomeDir = "/refdata/human/starsolo/"
  genomeGTF = "/refdata/human/refdata-gex-GRCh38-2020-A/genes/genes.gtf"
  whitelist = "/starscope/whitelist/TB_v3_20240429.BC1.tsv /starscope/whitelist/TB_v3_20240429.BC2.tsv /starscope/whitelist/TB_v3_20240429.BC3.tsv"
  soloType = "CB_UMI_Complex"
  trimLength = 50
  soloAdapterSequence = "NNNNNNNNNGTGANNNNNNNNNGACANNNNNNNNNNNNNNNNN"
  soloCBposition = "2_0_2_8 2_13_2_21 2_26_2_34"
  soloUMIposition = "2_35_2_42"
  soloCBmatchWLtype = "1MM"
  publishSaturation = true
  
  // 5' VDJ specific params
  // trust4 reference
  trust4_vdj_refGenome_fasta = "/starscope/scRNA-seq/vdj/reference/hg38_bcrtcr.fa"
  trust4_vdj_imgt_fasta = "/starscope/scRNA-seq/vdj/reference/human_IMGT+C.fa"
  // strand reverse
  soloStrand = "Reverse"
}

// uncomment line below if using slurm, see https://www.nextflow.io/docs/latest/executor.html
// process.executor = 'slurm'

// uncomment below chunk if using conda
// process.conda = "/home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env"
// conda.enabled = true

// docker setting, comment chunk below if using conda
process.container = "registry-intl.cn-hangzhou.aliyuncs.com/thunderbio/starscope_scrnaseq_env:1.2.5"
docker.enabled = true
docker.userEmulation = true
docker.runOptions = '--init -u $(id -u):$(id -g) $(opt=""; for group in $(id -G); do opt=$opt" --group-add $group"; done; echo $opt)'

// Resouces for each process
process {
  withLabel: process_high {
    cpus = 16
    memory = 40.GB
  }
  withLabel: process_medium {
    cpus = 4
    memory = 20.GB
  }
  withLabel: process_low {
    cpus = 4
    memory = 20.GB
  }
  withName: CHECK_SATURATION {
    cpus = 4
    memory = 10.GB
  }
  withName: CAT_FASTQ {
    cpus = 2
    memory = 4.GB
  }
  withName: TRIM_FASTQ {
    cpus = 12
    memory = 20.GB
  }
  withName: MULTIQC {
    cpus = 4
    memory = 10.GB
  }
  withName: STARSOLO {
    cpus = 16
    memory = 40.GB
  }
  withName: REPORT {
    cpus = 4
    memory = 40.GB
  }
  withName: FEATURESTATS {
    cpus = 2
    memory = 8.GB
  }
  withName: GENECOVERAGE {
    cpus = 8
    memory = 10.GB
  }
  withName: VDJ_CELLCALLING_WITHOUTGEX {
    cpus = 10
    memory = 20.GB
  }
  withName: VDJ_CELLCALLING_WITHGEX {
    cpus = 10
    memory = 20.GB
  }
  withName: GET_VERSIONS_VDJ {
    cpus = 2
    memory = 10.GB
  }
  withName: VDJ_ASSEMBLY {
    cpus = 32
    memory = 30.GB
  }
  withName: VDJ_METRICS {
    cpus = 2
    memory = 10.GB
  }
  withName:REPORT_VDJ {
    cpus = 4
    memory = 40.GB
  }
}

Invoke with command line options

To invoke VDJ pipeline with conda environment:

starscope vdj_gex --conda \
                  --conda_env /path/to/conda/env \
                  --input sampleList.csv \
                  --genomeDir /path/to/STAR/reference/dir \
                  --genomeGTF /path/to/genomeGTF \
                  --whitelist "/path/to/TB_v3_20240429.BC1.tsv /path/to/TB_v3_20240429.BC2.tsv /path/to/TB_v3_20240429.BC3.tsv" \
                  --trust4_vdj_refGenome_fasta /path/to/refGenome_vdj_fasta \
                  --trust4_vdj_imgt_fasta /path/to/imgt_vdj_fasta \
                  --trimLength 28 \
                  --soloType CB_UMI_Complex \
                  --soloAdapterSequence NNNNNNNNNGTGANNNNNNNNNGACANNNNNNNNNNNNNNNNN \
                  --soloCBposition "2_0_2_8 2_13_2_21 2_26_2_34" \
                  --soloUMIposition 2_35_2_42 \
                  --soloCBmatchWLtype 1MM

user will have to add --conda and indicate conda env path with --conda_env. To check your env path, please use mamba env list:

# conda environments:
#
base                       /home/xzx/Tools/mambaforge
starscope_scRNAseq_env     /home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env

and provide the second column (e.g. /home/xzx/Tools/mambaforge/envs/starscope_scRNAseq_env).

Required Options

--genomeDir: STARsolo reference index path.

For instance, a typical file structure of the index folder will be like below:

## human hg38
/refdata/human/starsolo/
├── chrLength.txt
├── chrNameLength.txt
├── chrName.txt
├── chrStart.txt
├── exonGeTrInfo.tab
├── exonInfo.tab
├── geneInfo.tab
├── Genome
├── genomeParameters.txt
├── Log.out
├── SA
├── SAindex
├── sjdbInfo.txt
├── sjdbList.fromGTF.out.tab
├── sjdbList.out.tab
└── transcriptInfo.tab

To create the index above, use the command below:

STAR  --runMode genomeGenerate \
      --runThreadN 10 \
      --genomeDir /path/to/outputDir \
      --genomeFastaFiles /path/to/genome.fa  \
      --sjdbGTFfile /path/to/genes.gtf

--genomeGTF: reference genome GTF file path.

User could generate the “filtered” GTF file using 10X’s mkref tool: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references

--whitelist: The white list file(s) path. ThunderBio whitelist files were distributed with StarScope:

/starscope/whitelist/
├── TB_v3_20240429.BC1.tsv
├── TB_v3_20240429.BC2.tsv
├── TB_v3_20240429.BC3.tsv
└── V2_barcode_seq_210407_concat.txt.gz

For ThunderBio chemistry v3, please use

--whitelsit "/starscope/whitelist/TB_v3_20240429.BC1.tsv /starscope/whitelist/TB_v3_20240429.BC2.tsv /starscope/whitelist/TB_v3_20240429.BC3.tsv"

Outputs

Each sample will have a separated result folder named with sample ID. The sub-directory final contains most of the result files:

  • HTML report (e.g. TB_pbmc_test_VDJ_report.html)
  • TCR/BCR trust4 report tsv file (e.g. TB_pbmc_test_BCR_results.tsv).
  • TCR/BCR filtered report file, conains lineage information and only productive TCR/BCR (e.g. TB_pbmc_test_BCR_results.productiveOnly_withLineage.tsv).
  • TCR/BCR cloneType table, with VDJ gene annotation (e.g. TB_pbmc_test_BCR_clonotypes.tsv)

gene expression matrix were strored under starsolo’s result directory:

  • filtered matrix, contains cell associated barcodes only:
results/sampelID/starsolo/GEX/sampleID_GEX.matrix_filtered
  • raw matrix, contains all barcodes:
results/sampelID/starsolo/GEX/sampleID_GEX.matrix_raw

The pipeline_info directory contains statistics of the pipeline running resources.

Full output directory structure:

results/TB_pbmc_test/
├── cutqc
│   ├── TB_pbmc_test.cutadapt.json
│   ├── TB_pbmc_test_GEX.cutadapt.json
│   ├── TB_pbmc_test_VDJ-B.cutadapt.json
│   └── TB_pbmc_test_VDJ-T.cutadapt.json
├── final
│   ├── TB_pbmc_test_BCR_clonotypes.tsv
│   ├── TB_pbmc_test_BCR_results.productiveOnly_withLineage.tsv
│   ├── TB_pbmc_test_BCR_results.tsv
│   ├── TB_pbmc_test_GEX.saturation_out.json
│   ├── TB_pbmc_test_TCR_clonotypes.tsv
│   ├── TB_pbmc_test_TCR_results.productiveOnly_withLineage.tsv
│   ├── TB_pbmc_test_TCR_results.tsv
│   ├── TB_pbmc_test_VDJ-B.metrics.json
│   ├── TB_pbmc_test_VDJ-B.metrics.tsv
│   ├── TB_pbmc_test_VDJ_report.html
│   ├── TB_pbmc_test_VDJ-T.metrics.json
│   ├── TB_pbmc_test_VDJ-T.metrics.tsv
│   └── versions.json
├── multiqc
│   ├── TB_pbmc_test_GEX_multiqc_report.html
│   ├── TB_pbmc_test_multiqc_report.html
│   ├── TB_pbmc_test_VDJ-B_multiqc_report.html
│   └── TB_pbmc_test_VDJ-T_multiqc_report.html
├── saturation
│   └── TB_pbmc_test_GEX.saturation_out.json
├── starsolo
│   ├── GEX
│   │   ├── TB_pbmc_test_GEX.CellReads.stats
│   │   ├── TB_pbmc_test_GEX.Log.final.out
│   │   ├── TB_pbmc_test_GEX.Log.out
│   │   ├── TB_pbmc_test_GEX.Log.progress.out
│   │   ├── TB_pbmc_test_GEX.matrix_filtered
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   ├── TB_pbmc_test_GEX.matrix_raw
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   ├── TB_pbmc_test_GEX.SJ.out.tab
│   │   ├── TB_pbmc_test_GEX_summary.unique.csv
│   │   └── TB_pbmc_test_GEX_UMIperCellSorted.unique.txt
│   ├── VDJ-B
│   │   ├── TB_pbmc_test_VDJ-B.CellReads.stats
│   │   ├── TB_pbmc_test_VDJ-B.Log.final.out
│   │   ├── TB_pbmc_test_VDJ-B.Log.out
│   │   ├── TB_pbmc_test_VDJ-B.Log.progress.out
│   │   ├── TB_pbmc_test_VDJ-B.matrix_filtered
│   │   ├── TB_pbmc_test_VDJ-B.SJ.out.tab
│   │   ├── TB_pbmc_test_VDJ-B_summary.unique.csv
│   │   └── TB_pbmc_test_VDJ-B_UMIperCellSorted.unique.txt
│   └── VDJ-T
│       ├── TB_pbmc_test_VDJ-T.CellReads.stats
│       ├── TB_pbmc_test_VDJ-T.Log.final.out
│       ├── TB_pbmc_test_VDJ-T.Log.out
│       ├── TB_pbmc_test_VDJ-T.Log.progress.out
│       ├── TB_pbmc_test_VDJ-T.matrix_filtered
│       ├── TB_pbmc_test_VDJ-T.SJ.out.tab
│       ├── TB_pbmc_test_VDJ-T_summary.unique.csv
│       └── TB_pbmc_test_VDJ-T_UMIperCellSorted.unique.txt
└── trust4
    ├── VDJ-B
    │   ├── TB_pbmc_test_VDJ-B_barcode_airr.tsv
    │   ├── TB_pbmc_test_VDJ-B_barcode_report.filterDiffusion.tsv
    │   ├── TB_pbmc_test_VDJ-B.cloneType_out.tsv
    │   ├── TB_pbmc_test_VDJ-B_final.out
    │   ├── TB_pbmc_test_VDJ-B_readsAssign.out
    │   ├── TB_pbmc_test_VDJ-B.vdj_cellOut.tsv
    │   └── TB_pbmc_test_VDJ-B.vdj_metrics.json
    └── VDJ-T
        ├── TB_pbmc_test_VDJ-T_barcode_airr.tsv
        ├── TB_pbmc_test_VDJ-T_barcode_report.filterDiffusion.tsv
        ├── TB_pbmc_test_VDJ-T.cloneType_out.tsv
        ├── TB_pbmc_test_VDJ-T_final.out
        ├── TB_pbmc_test_VDJ-T_readsAssign.out
        ├── TB_pbmc_test_VDJ-T.vdj_cellOut.tsv
        └── TB_pbmc_test_VDJ-T.vdj_metrics.json

15 directories, 63 files

WorkDir

By default, the intermediate files will be written to sub-directory of work under the pipeline running directory, please feel free to remove it after all the processes finished successfully.

Chapter 4

scATAC workflow

Subsections of scATAC workflow

Invoke with StarScope

Input

The input sample list file is a simple csv with three columns. The first column sample indicates sample IDs, and multiple fastq files with the same sample ID will be concatenated before further processing (e.g. two pairs of human_pbmc_s1 fastq files will be cat to a single pair). Multiple samples in one single sample list will be submitted parallelly and processed asynchronously. The workflow supports both ThunderBio and 10X ATAC data, please use the suggested configuration below. User could check 10X ATAC library structure from https://teichlab.github.io/scg_lib_structs/.

sample,fastq_1,fastq_2
human_pbmc_s1,human_pbmc_s1_R1_001.fastq.gz,human_pbmc_s1_R2_001.fastq.gz
human_pbmc_s1,human_pbmc_s1_R1_002.fastq.gz,human_pbmc_s1_R2_002.fastq.gz
human_pbmc_s2,human_pbmc_s2_R1_001.fastq.gz,human_pbmc_s2_R2_001.fastq.gz
human_tissue_s1,human_tissue_s1_R1_001.fastq.gz,human_tissue_s1_R2_001.fastq.gz
sample,fastq_1,fastq_2,fastq_3
10X_pbmc_10k_v1,10X_pbmc_10k_v1_R1_001.fastq.gz,10X_pbmc_10k_v1_R2_001.fastq.gz,10X_pbmc_10k_v1_R3_001.fastq.gz
10X_pbmc_10k_v1,10X_pbmc_10k_v1_R1_002.fastq.gz,10X_pbmc_10k_v1_R2_002.fastq.gz,10X_pbmc_10k_v1_R3_002.fastq.gz
10X_pbmc_10k_v2,10X_pbmc_10k_v2_R1_001.fastq.gz,10X_pbmc_10k_v2_R2_001.fastq.gz,10X_pbmc_10k_v2_R3_001.fastq.gz
10X_pbmc_10k_v2,10X_pbmc_10k_v2_R1_002.fastq.gz,10X_pbmc_10k_v2_R2_002.fastq.gz,10X_pbmc_10k_v2_R3_002.fastq.gz
Note

All samples in the sampleList will use the same options. Therefore, a combination of samples from different species is not supported.

Warning

We found that seq library doesn’t support fastq file with quality score larger than 40 (denoted as J in the fourth line). Please check the quality encoding on FASTQ wiki page. The error message is like:

KeyError: J

Raised from: std.internal.types.collections.dict.Dict.__getitem__:0
/pipeline/starscope/scATAC-seq/codon/lib/codon/stdlib/internal/types/collections/dict.codon:74:9
/scATAC_test/work/09/284782e6058425c6966542b7a70944/.command.sh: line 9: 684639 Aborted                 (core dumped) /pipeline/starscope/scATAC-seq/codon/bin/codon run -release -plugin seq /pipeline/starscope/scATAC-seq/bin/extract_and_correct_thunderbio_barcode_from_fastq.codon TB_v3_20240429.BC1.tsv,TB_v3_20240429.BC2.tsv,TB_v3_20240429.BC3.tsv test_sample_1.merged.fq.gz test_sample_2.merged.fq.gz test_sample_1.barcode.fq test_sample_2.barcode.fq test_sample_barcode_stats.tsv 8

User could use the command below to convert “J” (Q=41) to “I” (Q=40):

for i in *.fq.gz; do zcat $i | awk 'NR%4==1{print; getline; print; getline; print; getline; gsub("J", "I"); print}' | gzip -c > ${i%%.fq.gz}".modified.fq.gz"; done

Command

Invoke with config file

Since there are too many processes and options, we recommend to invoke pipeline with custom configuration file, which gives users more control on resources management.

starscope atac --input sampleList.csv --config example_docker.config
params {
  bwaIndex = "/refdata/human/bwa2_index/GRCh38-2020-A"
  genomeGTF = "/refdata/human/refdata-gex-GRCh38-2020-A/genes/genes.gtf"
  whitelist = "/barcodes/BC1.tsv /barcodes/BC2.tsv /barcodes/BC3.tsv" // TB whitelist
  // trimming options used by cutadapt
  trimOpt = '-a AGATGTGTATAAGAGACAG...CTGTCTCTTATACACATCTCCGAGCCCACGAGAC -A GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG...CTGTCTCTTATACACATCT'
  // cell filtration
  minCell = 10 // feature obeseved in at least "minCell" cells
  minFeature = 200 // cells with at least "minFeature" features
  publishSaturation = true
  platform = "TB" // TB or 10X
}

// uncomment line below if using slurm, see https://www.nextflow.io/docs/latest/executor.html
// process.executor = 'slurm'

// uncomment below chunk if using conda
// process.conda = "/home/xzx/Tools/mambaforge/envs/starscope_scATAC_env"
// conda.enabled = true

// docker setting, comment chunk below if using conda
process.container = "registry-intl.cn-hangzhou.aliyuncs.com/thunderbio/starscope_scatac_env:latest"
docker.enabled = true
docker.userEmulation = true
docker.runOptions = '--init -u $(id -u):$(id -g) $(for group in $(id -G); do echo "--group-add $group"; done)'

// Resouces for each process
process {
  withLabel: process_high {
    cpus = 16
    memory = 32.GB
  }
  withLabel: process_medium {
    cpus = 16
    memory = 20.GB
  }
  withLabel: process_low {
    cpus = 16
    memory = 20.GB
  }
  withName: CAT_FASTQ {
    cpus = 2
    memory = 10.GB
  }
  withName: CAT_FASTQ_10X {
    cpus = 2
    memory = 10.GB
  }
  withName: REPORT {
    cpus = 8
    memory = 40.GB
  }
  withName: CHECK_BARCODE {
    cpus = 16
    memory = 20.GB
  }
  CHECK_BARCODE_10X {
    cpus = 16
    memory = 20.GB
  }
  withName: TRIM_FASTQ {
    cpus = 16
    memory = 20.GB
  }
  withName: BWA_MAPPING {
    cpus = 16
    memory = 32.GB
  }
  withName: STATS {
    cpus = 4
    memory = 20.GB
  }
  withName: DEDUP {
    cpus = 2
    memory = 20.GB
  }
  withName: CHECK_SATURATION {
    cpus = 4
    memory = 20.GB
  }
  withName: MULTIQC {
    cpus = 4
    memory = 20.GB
  }
  withName: GENERATE_FRAGMENTS {
    cpus = 16
    memory = 40.GB
  }
  withName: SIGNAC {
    cpus = 4
    memory = 40.GB
  }
}

Invoke with command line options

To invoke scATAC pipeline with conda environment:

starscope atac --conda \
               --conda_env /path/to/conda/env \
               --input sampleList.csv \
               --bwaIndex /path/to/STAR/reference/dir \
               --genomeGTF /path/to/genomeGTF \
               --whitelist "/path/to/BC1.tsv /path/to/BC2.tsv /path/to/BC3.tsv" \
               --refGenome hg38 \
               --platform TB

user will have to add --conda and indicate conda env path with --conda_env. To check your env path, please use mamba env list:

# conda environments:
#
base                     /home/xzx/Tools/mambaforge
starscope_scATAC_env     /home/xzx/Tools/mambaforge/envs/starscope_scATAC_env

and provide the second column (e.g. /home/xzx/Tools/mambaforge/envs/starscope_scATAC_env).

Required Options

--bwaIndex: bwa-mem2 reference index path and file prefix.

For instance, a typical file structure of the index folder will be like:

## human hg38
/thunderData/refdata/human/bwa2_index/
├── GRCh38-2020-A.0123
├── GRCh38-2020-A.amb
├── GRCh38-2020-A.ann
├── GRCh38-2020-A.bwt.2bit.64
└── GRCh38-2020-A.pac

To create the index above, use the command below:

bwa-mem2 index -p GRCh38-2020-A genome.fa

user will have to provide --bwaIndex /thunderData/refdata/human/bwa2_index/GRCh38-2020-A to starscope command.

--genomeGTF: reference genome GTF file path.

--refGenome: reference genome assembly ID.

This parameter is used to extract genome size when calling peaks with macs2. Here we support hg38, hg19, mm9 and mm10 for human and mosue genome. User could use --genomeSize to indicate genome size directly.

--platform: StarScope scATAC pipeline supports ATAC data from both the ThunderBio and 10X genomics. Use either TB or 10X here.

--whitelist: The white list file(s) path. For ThunerBio scATAC data, whitelist files were distributed with StarScope:

starscope/whitelist/
├── TB_v3_20240429.BC1.tsv
├── TB_v3_20240429.BC2.tsv
├── TB_v3_20240429.BC3.tsv
└── V2_barcode_seq_210407_concat.txt.gz

Please use the whilelist option below, and don’t forget to add double quote.

--whitelsit "/path/to/starscope/whitelist/TB_v3_20240429.BC1.tsv /path/to/starscope/whitelist/TB_v3_20240429.BC2.tsv /path/to/starscope/whitelist/TB_v3_20240429.BC3.tsv"

Outputs

Each sample will have a separated result folder named with sample ID. The subdirectory final contains all the result files, including html report, fragments file (with all the barcodes) and a list of cell associated barcodes identified (e.g. TB_PBMC_s1_raw_cells.tsv).

The pipeline_info directory contains statistics of the pipeline running resources.

results/
├── TB_PBMC_s1
│   ├── final
│   │   ├── TB_PBMC_s1_combined_stats.tsv
│   │   ├── TB_PBMC_s1_filtered_obj.rds
│   │   ├── TB_PBMC_s1.fragments.sorted.bed.gz
│   │   ├── TB_PBMC_s1.fragments.sorted.bed.gz.tbi
│   │   ├── TB_PBMC_s1_macs2_peaks.narrowPeak
│   │   ├── TB_PBMC_s1_raw_cells.tsv
│   │   ├── TB_PBMC_s1_raw_meta.tsv
│   │   └── TB_PBMC_s1_scATAC_report.html
│   ├── multiqc
│   │   └── TB_PBMC_s1_multiqc_report.html
│   └── saturation
│       └── TB_PBMC_s1.saturation_out.json
└── pipeline_info
    ├── execution_report_2024-05-02_22-34-11.html
    ├── execution_timeline_2024-05-02_22-34-11.html
    └── execution_trace_2024-05-02_22-34-11.txt

5 directories, 13 files

WorkDir

By default, the intermediate files will be written to subdirectory of work under the pipeline running directory, please feel free to remove it after all the processes finished successfully.