First Steps: Episode 1

Episode  Topic
0        How can I install the tools?
1        How can I use the static data?
2        How can I distribute my jobs on the cluster (Slurm)?
3        How can I organize my jobs with Snakemake?
4        How can I combine Snakemake and Slurm?

This is part one of the "First Steps" BIH Cluster Tutorial. Here we will build a small pipeline that aligns reads and calls variants. It assumes that you have installed the tools as described in Episode 0. For this episode, please make sure that you are working on a compute node. As a reminder, the command to get an interactive session on a compute node with the required resources is:

$ srun --time 7-00 --mem=8G --ntasks=8 --pty bash -i
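
To confirm that the shell really runs inside a Slurm allocation, one option is to look at the SLURM_JOB_ID environment variable, which Slurm sets for processes started through srun. The following is a minimal sketch; the helper name in_slurm_job is made up for illustration and is not part of the tutorial's tooling.

```shell
# Hypothetical helper: report whether this shell runs inside a Slurm job.
# SLURM_JOB_ID is set by Slurm for processes launched via srun or sbatch.
in_slurm_job() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "inside Slurm job ${SLURM_JOB_ID} on $(uname -n)"
    else
        echo "not inside a Slurm job (host: $(uname -n))"
        return 1
    fi
}

in_slurm_job || echo "start an interactive session with srun first"
```

If the function reports that you are not inside a job, you are most likely still on a login node.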

Tutorial Input Files

We provide some example FASTQ files, but you can use your own if you like. The example data is located here:

  • /data/cephfs-1/work/projects/cubit/tutorial/input/test_R1.fq.gz
  • /data/cephfs-1/work/projects/cubit/tutorial/input/test_R2.fq.gz
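
Before aligning paired-end data it can be useful to check that both files are readable and contain the same number of reads. The sketch below assumes plain four-line FASTQ records (no wrapped sequence lines); the helper name count_reads is hypothetical.

```shell
# Hypothetical helper: count reads in a gzipped FASTQ file.
# Assumes exactly four lines per record.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Only run the check where the tutorial input is actually available.
in_dir=/data/cephfs-1/work/projects/cubit/tutorial/input
if [ -r "$in_dir/test_R1.fq.gz" ] && [ -r "$in_dir/test_R2.fq.gz" ]; then
    echo "R1: $(count_reads "$in_dir/test_R1.fq.gz") reads"
    echo "R2: $(count_reads "$in_dir/test_R2.fq.gz") reads"
fi
```

The two counts should match; a mismatch usually points to a truncated or incomplete file.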

Creating a Project Directory

First, create a folder for the output of this tutorial. It is best placed in your work directory at /fast/users/$USER/work, because it is faster and there is more space available.

(first-steps) $ mkdir -p /fast/users/$USER/work/tutorial/episode1
(first-steps) $ pushd /fast/users/$USER/work/tutorial/episode1

Quotas / File System limits

  • Note that your home directory at /fast/users/$USER has a quota of 1 GB. The reason is that nightly snapshots and backups are created for this directory, and these are precious resources.
  • This limit does not apply to your work directory at /fast/users/$USER/work. The limits here are much higher, but no snapshots or backups are available.
  • There is no limit on your scratch directory at /fast/users/$USER/scratch. However, files placed here are automatically removed after 2 weeks, so it is only appropriate for temporary files, such as files that are being downloaded.

Creating a Directory for Temporary Files

In general, it is advisable to have a proper temporary directory available. You can create one in your ~/scratch folder and make it known to programs through the TMPDIR environment variable.

(first-steps) $ export TMPDIR=/fast/users/$USER/scratch/tmp
(first-steps) $ mkdir -p $TMPDIR
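
A quick way to verify the setting is to create and remove a probe file in the directory. This is a minimal sketch under the assumption that TMPDIR has been exported in the current shell; the helper name check_tmpdir is hypothetical.

```shell
# Hypothetical helper: check that TMPDIR is set, exists, and is writable
# by creating and removing a probe file in it.
check_tmpdir() {
    if [ -z "${TMPDIR:-}" ]; then
        echo "TMPDIR is not set" >&2
        return 1
    fi
    if [ ! -d "$TMPDIR" ] || [ ! -w "$TMPDIR" ]; then
        echo "TMPDIR=$TMPDIR is missing or not writable" >&2
        return 1
    fi
    # The XXXXXX suffix is replaced by mktemp with random characters.
    probe=$(mktemp "$TMPDIR/probe.XXXXXX") || return 1
    rm -f "$probe"
    echo "TMPDIR=$TMPDIR is usable"
}

check_tmpdir || echo "fix TMPDIR before continuing"
```

Note that an exported variable only lasts for the current shell session, so the export has to be repeated in every new session unless you add it to your shell startup file.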

Using the Cubit Static Data

The static data is located in /data/cephfs-1/work/projects/cubit/current/static_data. For our small example, the required reference genome and its BWA index can be found at:

  • /data/cephfs-1/work/projects/cubit/current/static_data/reference/GRCh37/g1k_phase1/human_g1k_v37.fasta
  • /data/cephfs-1/work/projects/cubit/current/static_data/precomputed/BWA/0.7.17/GRCh37/g1k_phase1/human_g1k_v37.fasta
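
The BWA path above is an index prefix rather than a single file: bwa index writes five companion files with the extensions .amb, .ann, .bwt, .pac and .sa next to that prefix. As a sketch, the following hypothetical helper (check_bwa_index is not part of BWA) lists any that are missing.

```shell
# Hypothetical helper: list BWA index files missing for an index prefix.
# "bwa index" produces <prefix>.amb, .ann, .bwt, .pac and .sa.
check_bwa_index() {
    prefix=$1
    missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -e "${prefix}.${ext}" ]; then
            echo "missing: ${prefix}.${ext}"
            missing=1
        fi
    done
    return "$missing"
}

idx=/data/cephfs-1/work/projects/cubit/current/static_data/precomputed/BWA/0.7.17/GRCh37/g1k_phase1/human_g1k_v37.fasta
if check_bwa_index "$idx"; then
    echo "BWA index is complete"
fi
```

The same check works for an index you build yourself for a different reference.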

Aligning the Reads

Let's align our reads with BWA-MEM, convert the output to BAM, sort it by coordinate, and then index the result:

(first-steps) $ bwa mem -t 8 \
    /data/cephfs-1/work/projects/cubit/current/static_data/precomputed/BWA/0.7.17/GRCh37/g1k_phase1/human_g1k_v37.fasta \
    /data/cephfs-1/work/projects/cubit/tutorial/input/test_R1.fq.gz \
    /data/cephfs-1/work/projects/cubit/tutorial/input/test_R2.fq.gz \
| samtools view -b \
| samtools sort -O BAM -T $TMPDIR -o aln.bam

(first-steps) $ samtools index aln.bam
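
Before moving on to variant calling, it may be worth checking that the BAM file is intact and that reads were actually mapped. The sketch below uses the samtools subcommands quickcheck and flagstat; the wrapper name check_bam is hypothetical.

```shell
# Hypothetical helper: basic sanity checks on a BAM file.
check_bam() {
    bam=$1
    if ! command -v samtools >/dev/null 2>&1; then
        echo "samtools not found" >&2
        return 1
    fi
    # quickcheck verifies the header and the end-of-file marker.
    if ! samtools quickcheck "$bam"; then
        echo "$bam failed quickcheck" >&2
        return 1
    fi
    # flagstat prints counts such as total, mapped and properly paired reads.
    samtools flagstat "$bam"
}

if [ -e aln.bam ]; then
    check_bam aln.bam
fi
```

A truncated file, for example from an interrupted pipeline, would be reported by quickcheck.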

Perform Structural Variant Calling

Next, call structural variants on the alignment with Delly:

(first-steps) $ delly call \
    -g /data/cephfs-1/work/projects/cubit/current/static_data/reference/GRCh37/g1k_phase1/human_g1k_v37.fasta \
    aln.bam

Note that delly will not find any variants.

Small Variant Calling (SNV, indel)

And now for the small variant calling (this step will take about 20 minutes):

(first-steps) $ gatk HaplotypeCaller \
    -R /data/cephfs-1/work/projects/cubit/current/static_data/reference/GRCh37/g1k_phase1/human_g1k_v37.fasta \
    -I aln.bam \
    -ploidy 2 \
    -O test.GATK.vcf
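
Once the caller has finished, a quick look at the number of records tells you whether anything was called. In an uncompressed VCF file, header lines start with "#" and every other line is one variant record, so a small sketch (with the hypothetical helper name count_variants) is:

```shell
# Hypothetical helper: count variant records in an uncompressed VCF file.
# Header lines start with '#'; all other lines are variant records.
count_variants() {
    awk '!/^#/ { n++ } END { print n + 0 }' "$1"
}

if [ -e test.GATK.vcf ]; then
    echo "$(count_variants test.GATK.vcf) variant records in test.GATK.vcf"
fi
```

This counts records, not alleles, so a multi-allelic site is counted once.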

Outlook: More Programs and Static Data

So this is it! We used the tools that we installed previously, accessed the reference data, and ran a simple alignment and variant calling pipeline. A list of all static data is available in this wiki on the Static Data page. You can also have a peek via:

(first-steps) $ tree -L 3 /data/cephfs-1/work/projects/cubit/current/static_data | less