Retrotransposon:Gene Junction Reads Quantification Pipeline

This pipeline is designed to assemble transcripts starting from initial RNA-Seq reads. The reads are aligned using a two-pass alignment strategy to increase the sensitivity of identifying spliced reads. StringTie2 is then used to find novel exon structures and assemble the aligned reads into transcripts. Split RNA-seq reads are collected from the second pass alignment files. Exons are overlapped with retrotransposons (RTs), to find the if the edges of an exon are overlapping with an RT. The final step is to match the junctions previously identified with annotated or assembled exons. The final output can be used for further analysis.

The Code for this pipeline can be downloaded here here or from the Download page

This pipeline was developed by Wanqing Shao

Steps to Run Pipeline

The current pipeline starts from raw fastq files, generates bigwig files, and then assembled transcripts and junction files. The assembled transcripts and junction files can be used for later down stream analysis

This page provides a general overview of the pipeline, many of the scripts may require slight modifications to run properly (correction for tool path, library installation etc.)

Step 1: Adapter Trimming and First Pass Alignment

The script “first_alignment_pe_truseq.sh“ was used for adapter trimming and the first round of alignment for paired-end samples that use TruSeq adapters (if a different adapter is used, the script may need to be modified accordingly). The script uses a STAR index that is stored in the memory to reduce memory usage when multiple files are processed simultaneously.

To load STAR index in the memory and run first_alignment_PE_TruSeq.sh for multiple samples, run the following command in the terminal.

    STAR --genomeDir /path/to/star/index --genomeLoad LoadAndExit 
    parallel -uj 10 ./first_alignment_PE_TruSeq.sh {} /path/to/star/index 5 ::: sample1 sample2 sample3

STAR Version: STAR-2.7.0c

Cutadapt Version: 1.9

Step 2: Select Junctions and Updating STAR Index

This step only needs to be run once for each species' reference genome

For the CDK2AP1 project, we created updated STAR indices for the second pass alignment after the initial alignments where completed (This approach was selected due to computational efficiency). Alternatively, two-pass alignment can also be done with each sample "on the fly", see STAR manual for more details.

The following commands should be run in the terminal. The first command filters for canonical splicing junctions with at least 3 reads in at least 1 sample. Parameters may be adjusted to achieve the best performance.

    cat *SJ.out.tab | awk '{FS="\t";OFS="\t"} {if ($5 > 0 && $7 > 2 && $6==0 && $4==1) print $1, $2, $3, "+"; else if ($5 > 0 && $7 > 2 && $6==0 && $4==2) print $1, $2, $3, "-";}' | sort | uniq >  SJ.filtered.tab

    STAR --runThreadN 10 \
    --runMode genomeGenerate \
    --genomeDir /path/to/updated/STAR/index \
    --genomeFastaFiles /path/to/genome/fasta/file \
    --sjdbGTFfile /path/to/gene/gtf/file \
    --sjdbFileChrStartEnd /path/to/filtered/SJfile/SJ.filtered.tab

STAR Version: STAR-2.7.1a

Step 3: Second Pass Alignment

The script "second_alignment_PE.sh" is used for the second pass alignment.

To load the updated STAR index in into memory and run second_alignment_PE.sh for multiple samples, run the following two commands:

    STAR --genomeDir /path/to/updated/STAR/index --genomeLoad LoadAndExit
    parallel -uj 10 ./second_alignment_PE.sh {} /path/to/updated/STAR/index 5 ::: sample1 sample2 sample3

STAR Version: STAR-2.7.1a

Step 4: Indexing BAM Files

Index the BAM files using samtools

    parallel -uj 10 samtools index {} ::: *_2nd_pass*bam

Samtools Version: 1.9

Step 5: Generating Normalized Bigwig Files

Normalized bigwig files are generated using bamCoverage from deeptools

For the CDK2AP1 project, we used uniquely mapped reads for bigwig generation (--minMappingQuality 255) and normalized the tracks to Count Per Million (CPM) for ease of comparisons between samples.

To generate normalized bigwig files, run the following commands in the terminal. The resulting .bw files can be viewed in IGV.

    parallel -uj 10 bamCoverage --bam {} -o \`basename {} Aligned.sortedByCoord.out.bam\`uniq_reads_cpm.bw -of bigwig --minMappingQuality 255 --normalizeUsing CPM -p 3 ::: *_2nd_pass*bam

Deeptools Version: 2.2

Step 6: Transcript Assembly

StringTie2 was used for transcript assembly for this pipeline.

Notes about StringTie2:

StringTie2 considers multiply mapped reads (even-distribution)
If a sample is stranded, --rf or --fr could be specified to improve the accuracy of strand determination at repetitive regions
A guided assembly (with -G gtf) may generates better results.

Here, StringTie2 is run for individual samples, the parameter used are:

Require 2 reads for junction assembly
5 reads for single exon transcript assembly
2 reads for multi-exon transcript assembly

Parameters can be tweaked to adjust the sensitivity & specificity.

The StringTie output is then merged using taco. Taco will create a folder called merged_gtf, there will be a gtf file named assembly.gtf, this file contains the assembled transcript and can be viewed in IGV

To generate merged transcripts for each study, run the following commands in the terminal:

    parallel -uj 10 stringtie {} -p 2 -j 2 -s 5 -f 0.05 -c 2 -o \`basename {} Aligned.sortedByCoord.out.bam\`_stringtie_output.gtf ::: *_2nd_pass*bam  
    ls *gtf > gtf_to_merge.txt
    taco_run -o merged_gtf -p 10 ./gtf_to_merge.txt

Stringtie Version: 2.1.4

taco Version: v0.7.3

Step 7: Collecting Splicing Junctions

The script "bam2junc_uniq.sh" is used to collect splicing juctions, it calls the followign scripts: "filter_cs.py", "sam_to_bed.py", "bed2junc.pl".

"filter_cs.py", "sam_to_bed.py", "bed2junc.pl", "bam2junc_uniq.sh" and "merge_junction_files.r" were constructed for collecting splicing junctions detected in RNAseq datasets. Some of the scripts were adapted from leafcutter due to the original leafcutter script not handeling soft-clipping properly.

For CDK2AP1 we used uniquely aligned reads for junction signal quantification (samtools view -q 255, this only works for STAR output).

    parallel -uj 10 bam2junc_uniq.sh {} \`basename {} _Aligned.sortedByCoord.out.bam\`_uniq.junc ::: *_2nd_pass*bam 
    ls *_uniq.junc > junc_file_list.txt

The script "merge_junction_files.r" is then used to merge junctions. This R script merges the junction files from multiple samples and generates the junction-by-sample matrix (value for each cell is the number of reads across each junction). Junctions that are within 10bp away for both splicing donor site and acceptor site were merged.

The R librarys: parallel, methods, data.table, magrittr, optparse, reshape and GenomicRanges are required to run this script

Notes: This script is not that memory efficient for combining a large number of samples. For datasets with more than 100 samples/cells, -p as 1 is recommended.

    Rscript merge_junction_files.r -j junc_file_list.txt -p 10 -n all_samples

Step 8: Clean Repeatmakser File

Subset the repeat masker files from UCSC for future steps, saving the following columns: chr, start, end, strand and repClass. The code example below process the hg38.rmsk.txt file:

    echo -e "chr\tstart\tend\tstrand\trepName\trepClass\trepFamily" > hg38.rmsk.cleaned.txt
    cut -f6-8,10,11,12,13 hg38.rmsk.txt >> hg38.rmsk.cleaned.txt

Step 9: Overlap Exons with RTs

The script name "overlap_exon_with_te.r" process the gene and assembled transcript gtf files, collects the exons, calculates the percentage of RT sequence for each exon, and returns information about if the edges of each exon are overlapping with RT

At the moment, this script only works for cleaned repeatmasker files with the following columns: chr, start, end, strand and repClass.

For CDK2ap1, we focused on retrotransposons (LTR, SINE and LINE), but this script can be modified to include DNA elements

    Rscript overlap_exon_with_te.r -g hg38.ncbiRefSeq.gtf -t taco_assembly.gtf -p 5 -n hg38 -r hg38.rmsk.cleaned.txt

Step 10: Identifying RT-Gene Junctions

The script "overlap_junction_with_exon.r" matches the junctions with annotated or assembled exons. It generates a table with the following information:

info: junction information with chr:start:end format.
gene_name: which annotated gene that exon is matched to
junction_type: types of junctions including promoter_tss (RT provides the splicing donor site of the first intron, the 5' of the annotated or assembled transcript overlaps with an RT), promoter (RT provides the splicing donor site of the first intron, the 5' of the annotated or assembled transcript does not overlap with an RT), internal (RT provides splicing sites for internal exons), terminator (RT provides splicing acceptor sites for the last intron). If a splicing junction is involved in multiple transcripts and could be classified as different types, all types were listed and separated by "/"
max_te_perc: for exons that are adjacent to RT overlapping splicing sites, what percentage of their sequences are derived from TE.
start_te_repName, start_te_repFamily, start_te_repClass, start_te_info: The subfamily name, family name, class name and coordinates for RTs that overlap with the 5' of a splicing junction (splicing junctions were all considered as non-stranded). NA is used when the splicing site does not overlap with RT.
end_te_repName, end_te_repFamily, end_te_repClass, end_te_info: The subfamily name, family name, class name and coordinates for RTs that overlap with the 3' of a splicing junction (splicing junctions were all considered as non-stranded). NA is used when the splicing site does not overlap with RT.

R libraries: parallel, methods, data.table, magrittr, optparse, reshape and GenomicRanges are required for this script.

    Rscript overlap_junction_with_exon.r -e hg38_exon_collection.txt -j all_samples_junc_combined.txt -r hg38.rmsk.cleaned.txt -p 5 -n all_samples