Retrotransposon:Gene Junction Reads Quantification Pipeline
This pipeline is designed to assemble transcripts starting from initial RNA-Seq reads. The reads are aligned using a two-pass alignment strategy to increase the sensitivity of identifying spliced reads. StringTie2 is then used to find novel exon structures and assemble the aligned reads into transcripts. Split RNA-seq reads are collected from the second pass alignment files. Exons are overlapped with retrotransposons (RTs), to find the if the edges of an exon are overlapping with an RT. The final step is to match the junctions previously identified with annotated or assembled exons. The final output can be used for further analysis.
The Code for this pipeline can be downloaded here here or from the Download page
This pipeline was developed by Wanqing Shao
Steps to Run Pipeline
The current pipeline starts from raw fastq files, generates bigwig files, and then assembled transcripts and junction files. The assembled transcripts and junction files can be used for later down stream analysis
This page provides a general overview of the pipeline, many of the scripts may require slight modifications to run properly (correction for tool path, library installation etc.)
Step 1: Adapter Trimming and First Pass Alignment
The script “first_alignment_pe_truseq.sh“ was used for adapter trimming and the first round of alignment for paired-end samples that use TruSeq adapters (if a different adapter is used, the script may need to be modified accordingly). The script uses a STAR index that is stored in the memory to reduce memory usage when multiple files are processed simultaneously.
To load STAR index in the memory and run first_alignment_PE_TruSeq.sh for multiple samples, run the following command in the terminal.
STAR --genomeDir /path/to/star/index --genomeLoad LoadAndExit parallel -uj 10 ./first_alignment_PE_TruSeq.sh {} /path/to/star/index 5 ::: sample1 sample2 sample3
STAR Version: STAR-2.7.0c
Cutadapt Version: 1.9
Step 2: Select Junctions and Updating STAR Index
This step only needs to be run once for each species' reference genome
For the CDK2AP1 project, we created updated STAR indices for the second pass alignment after the initial alignments where completed (This approach was selected due to computational efficiency). Alternatively, two-pass alignment can also be done with each sample "on the fly", see STAR manual for more details.
The following commands should be run in the terminal. The first command filters for canonical splicing junctions with at least 3 reads in at least 1 sample. Parameters may be adjusted to achieve the best performance.
cat *SJ.out.tab | awk '{FS="\t";OFS="\t"} {if ($5 > 0 && $7 > 2 && $6==0 && $4==1) print $1, $2, $3, "+"; else if ($5 > 0 && $7 > 2 && $6==0 && $4==2) print $1, $2, $3, "-";}' | sort | uniq > SJ.filtered.tab
STAR --runThreadN 10 \ --runMode genomeGenerate \ --genomeDir /path/to/updated/STAR/index \ --genomeFastaFiles /path/to/genome/fasta/file \ --sjdbGTFfile /path/to/gene/gtf/file \ --sjdbFileChrStartEnd /path/to/filtered/SJfile/SJ.filtered.tab
STAR Version: STAR-2.7.1a
Step 3: Second Pass Alignment
The script "second_alignment_PE.sh" is used for the second pass alignment.
To load the updated STAR index in into memory and run second_alignment_PE.sh for multiple samples, run the following two commands:
STAR --genomeDir /path/to/updated/STAR/index --genomeLoad LoadAndExit parallel -uj 10 ./second_alignment_PE.sh {} /path/to/updated/STAR/index 5 ::: sample1 sample2 sample3
STAR Version: STAR-2.7.1a
Step 4: Indexing BAM Files
Index the BAM files using samtools
parallel -uj 10 samtools index {} ::: *_2nd_pass*bam
Samtools Version: 1.9
Step 5: Generating Normalized Bigwig Files
Normalized bigwig files are generated using bamCoverage from deeptools
For the CDK2AP1 project, we used uniquely mapped reads for bigwig generation (--minMappingQuality 255) and normalized the tracks to Count Per Million (CPM) for ease of comparisons between samples.
To generate normalized bigwig files, run the following commands in the terminal. The resulting .bw files can be viewed in IGV.
parallel -uj 10 bamCoverage --bam {} -o \`basename {} Aligned.sortedByCoord.out.bam\`uniq_reads_cpm.bw -of bigwig --minMappingQuality 255 --normalizeUsing CPM -p 3 ::: *_2nd_pass*bam
Deeptools Version: 2.2
Step 6: Transcript Assembly
StringTie2 was used for transcript assembly for this pipeline.
Notes about StringTie2:
- StringTie2 considers multiply mapped reads (even-distribution)
- If a sample is stranded, --rf or --fr could be specified to improve the accuracy of strand determination at repetitive regions
- A guided assembly (with -G gtf) may generates better results.
Here, StringTie2 is run for individual samples, the parameter used are:
- Require 2 reads for junction assembly
- 5 reads for single exon transcript assembly
- 2 reads for multi-exon transcript assembly
Parameters can be tweaked to adjust the sensitivity & specificity.
The StringTie output is then merged using taco. Taco will create a folder called merged_gtf, there will be a gtf file named assembly.gtf, this file contains the assembled transcript and can be viewed in IGV
To generate merged transcripts for each study, run the following commands in the terminal:
parallel -uj 10 stringtie {} -p 2 -j 2 -s 5 -f 0.05 -c 2 -o \`basename {} Aligned.sortedByCoord.out.bam\`_stringtie_output.gtf ::: *_2nd_pass*bam ls *gtf > gtf_to_merge.txt taco_run -o merged_gtf -p 10 ./gtf_to_merge.txt
Stringtie Version: 2.1.4
taco Version: v0.7.3
Step 7: Collecting Splicing Junctions
The script "bam2junc_uniq.sh" is used to collect splicing juctions, it calls the followign scripts: "filter_cs.py", "sam_to_bed.py", "bed2junc.pl".
"filter_cs.py", "sam_to_bed.py", "bed2junc.pl", "bam2junc_uniq.sh" and "merge_junction_files.r" were constructed for collecting splicing junctions detected in RNAseq datasets. Some of the scripts were adapted from leafcutter due to the original leafcutter script not handeling soft-clipping properly.
For CDK2AP1 we used uniquely aligned reads for junction signal quantification (samtools view -q 255, this only works for STAR output).
parallel -uj 10 bam2junc_uniq.sh {} \`basename {} _Aligned.sortedByCoord.out.bam\`_uniq.junc ::: *_2nd_pass*bam ls *_uniq.junc > junc_file_list.txt
The script "merge_junction_files.r" is then used to merge junctions. This R script merges the junction files from multiple samples and generates the junction-by-sample matrix (value for each cell is the number of reads across each junction). Junctions that are within 10bp away for both splicing donor site and acceptor site were merged.
The R librarys: parallel, methods, data.table, magrittr, optparse, reshape and GenomicRanges are required to run this script
Notes: This script is not that memory efficient for combining a large number of samples. For datasets with more than 100 samples/cells, -p as 1 is recommended.
Rscript merge_junction_files.r -j junc_file_list.txt -p 10 -n all_samples
Step 8: Clean Repeatmakser File
Subset the repeat masker files from UCSC for future steps, saving the following columns: chr, start, end, strand and repClass. The code example below process the hg38.rmsk.txt file:
echo -e "chr\tstart\tend\tstrand\trepName\trepClass\trepFamily" > hg38.rmsk.cleaned.txt cut -f6-8,10,11,12,13 hg38.rmsk.txt >> hg38.rmsk.cleaned.txt
Step 9: Overlap Exons with RTs
The script name "overlap_exon_with_te.r" process the gene and assembled transcript gtf files, collects the exons, calculates the percentage of RT sequence for each exon, and returns information about if the edges of each exon are overlapping with RT
At the moment, this script only works for cleaned repeatmasker files with the following columns: chr, start, end, strand and repClass.
For CDK2ap1, we focused on retrotransposons (LTR, SINE and LINE), but this script can be modified to include DNA elements
Rscript overlap_exon_with_te.r -g hg38.ncbiRefSeq.gtf -t taco_assembly.gtf -p 5 -n hg38 -r hg38.rmsk.cleaned.txt
Step 10: Identifying RT-Gene Junctions
The script "overlap_junction_with_exon.r" matches the junctions with annotated or assembled exons. It generates a table with the following information:
- info: junction information with chr:start:end format.
- gene_name: which annotated gene that exon is matched to
- junction_type: types of junctions including promoter_tss (RT provides the splicing donor site of the first intron, the 5' of the annotated or assembled transcript overlaps with an RT), promoter (RT provides the splicing donor site of the first intron, the 5' of the annotated or assembled transcript does not overlap with an RT), internal (RT provides splicing sites for internal exons), terminator (RT provides splicing acceptor sites for the last intron). If a splicing junction is involved in multiple transcripts and could be classified as different types, all types were listed and separated by "/"
- max_te_perc: for exons that are adjacent to RT overlapping splicing sites, what percentage of their sequences are derived from TE.
- start_te_repName, start_te_repFamily, start_te_repClass, start_te_info: The subfamily name, family name, class name and coordinates for RTs that overlap with the 5' of a splicing junction (splicing junctions were all considered as non-stranded). NA is used when the splicing site does not overlap with RT.
- end_te_repName, end_te_repFamily, end_te_repClass, end_te_info: The subfamily name, family name, class name and coordinates for RTs that overlap with the 3' of a splicing junction (splicing junctions were all considered as non-stranded). NA is used when the splicing site does not overlap with RT.
R libraries: parallel, methods, data.table, magrittr, optparse, reshape and GenomicRanges are required for this script.
Rscript overlap_junction_with_exon.r -e hg38_exon_collection.txt -j all_samples_junc_combined.txt -r hg38.rmsk.cleaned.txt -p 5 -n all_samples