SARS-CoV-2 Results

ScanFold results for SARS-CoV-2, the virus that causes COVID-19 (NC_045512.2), can now be found on the RNAStructuromeDB: browse the results in IGV below or on JBrowse, or download all ScanFold output files here.

The SARS-CoV-2 genome is a single-stranded RNA molecule approximately 30,000 nucleotides long. The ScanFold program has been used to characterize its RNA folding landscape, highlighting regions of likely structure and function that serve as ideal targets for further analysis. Read about the analysis and our findings here: An in silico map of the SARS-CoV-2 RNA Structurome, bioRxiv 2020.

ScanFold results (using a 120 nt window and 1 nt step size) are shown in the IGV genome browser below. Scanning window metrics are reported beneath the ScanFold-predicted structures, which are depicted as arc diagrams. Arcs depict base pairs and are colored according to how unusually stable the depicted structure is: yellow indicates the structure is slightly more stable, green indicates one standard deviation more stable, and blue indicates two standard deviations more stable than expected based on sequence composition alone. When a structure is significantly more stable than expected, this may indicate a potential for function. You can read more about this analysis and interpreting results in the papers linked below.
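The color scheme described above can be sketched as a simple threshold function. This is only an illustration: the function name `arc_color` is hypothetical, and the exact cutoffs (in particular, treating any negative z-score above -1 as "slightly more stable") are assumptions, since the text does not state them precisely.

```python
def arc_color(z_score: float) -> str:
    """Return the arc color for a base pair's z-score.

    Assumed cutoffs (not stated exactly in the text):
      z <= -2 -> blue, z <= -1 -> green, z < 0 -> yellow, otherwise uncolored.
    """
    if z_score <= -2.0:
        return "blue"      # two standard deviations more stable than expected
    if z_score <= -1.0:
        return "green"     # one standard deviation more stable than expected
    if z_score < 0.0:
        return "yellow"    # slightly more stable than expected
    return "uncolored"     # not more stable than expected


# Toy z-scores chosen for illustration only.
for z in (-2.4, -1.3, -0.2, 0.5):
    print(f"z = {z:+.1f} -> {arc_color(z)}")
```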

Read about the ScanFold method and how to interpret results in the following papers:

Andrews RJ, Baber L, Moss WN: Mapping the RNA structural landscape of viral genomes. Methods 2019.
Andrews RJ, Roche J, Moss WN: ScanFold: an approach for genome-wide discovery of local RNA structural elements—applications to Zika virus and HIV. PeerJ 2018.

The files listed below have been formatted for JBrowse (e.g., BigWig tracks and indexed bgzipped GFF3 files) and will be updated and made available for download as they are added to the structurome.
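Because bgzip output is gzip-compatible, a bgzipped GFF3 file can also be read outside a genome browser with Python's standard `gzip` module. The following is a minimal sketch, not part of ScanFold: the function name `read_gff3` is hypothetical, and it assumes only the nine standard tab-separated GFF3 columns. The attribute keys actually used in the ScanFold files are not listed here, so the sketch returns whatever key=value pairs it finds.

```python
import gzip
from typing import Dict, Iterator
from urllib.parse import unquote

# The nine standard GFF3 columns, in order.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def read_gff3(path: str) -> Iterator[Dict[str, object]]:
    """Yield one dict per feature line of a gzip/bgzip-compressed GFF3 file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines, directives ("##...") and comments ("#...").
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            if len(fields) != 9:
                continue  # not a feature line (e.g. embedded FASTA)
            record: Dict[str, object] = dict(zip(GFF3_COLUMNS, fields))
            record["start"] = int(fields[3])
            record["end"] = int(fields[4])
            # Column 9 holds ';'-separated key=value pairs, percent-encoded.
            attributes = {}
            for pair in fields[8].split(";"):
                if "=" in pair:
                    key, value = pair.split("=", 1)
                    attributes[unquote(key)] = unquote(value)
            record["attributes"] = attributes
            yield record
```

For example, after downloading the scanning window file, `for rec in read_gff3("nc_045512.2.strand1.gff3.gz"): print(rec["start"], rec["end"], rec["attributes"])` would list each window with its coordinates and reported metrics.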

SARS-CoV-2 Files:

Materials & Methods (Description and/or Program Settings) File
Dataset S4 - Alignment Files MAFFT alignment of SARS-related genomes, generated with the FFT-NS-i method. Binary Data sars-related.fullgenome.mafft_.fasta
Dataset S3 - Motif Conservation Analyses Each of the motifs from Dataset S2 has been queried (using the Incarnato lab's cm-builder script) against ~25K coronavirus genomes using the Infernal program in order to generate covariation models of the given motif. The resulting covariation model Stockholm alignments were then tested using the R-scape program for evidence of statistically significant covariation. Here, base pairs with significant covariation (GTp test; E < 0.05) are highlighted in green. Additionally, each motif Stockholm alignment (Dataset S4) was used to generate a de novo structure model using the CaCoFold algorithm (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008387). Package icon dataset_s3_motifconservation.zip
Dataset S2 - Motif Extract PDFs The ScanFold approach defines specific motifs of likely function. All predicted motifs containing at least one base pair with an average Zavg < -1 have been modeled, annotated, and compiled into individual PDF files for easy viewing. Package icon dataset_s2_motifextracts.zip
Dataset S1 - ScanFold and Reactivity analyses Supplemental Data Set 1 for "A map of the SARS-CoV-2 functional RNA Structurome". Supplemental Data Set 1 contains two directories, "ScanFold Scans" and "Reactivity and Scan Analyses". The "ScanFold Scans" directory contains several internal directories with the in silico and constrained ScanFold output files for each model condition. The internal directories also contain the constraint files used for each associated run. The "Reactivity and Scan Analyses" directory contains four Excel documents used to analyze the reactivity values/hard constraints and resulting ScanFold outputs. (1) ScanFold_Metric_analysis.xlsx: contains analyses and comparisons of in silico and top 10% constrained scan output. (2) ScanFold_Constraint_Conflict.xlsx: compares the top 10% of hard constraints from each dataset with in silico ScanFold predictions and reports positions where hard constraints and predicted base pairs conflict. (3) SHAPE_and_DMS_Constraint_Comparison.xlsx: has the top 10% (and 20%) of hard constraints used from each experimental data set and compares their location/distribution in the genome; it also contains reactivity ranges for the top 10% of SHAPE values. (4) ScanFold_bp_Comparisons.xlsx: contains comparisons of in silico and constrained .bp files and reports similarity of structure (both paired and unpaired) between the files; sensitivity and PPV are also calculated here. All Excel files contain additional information and descriptions of data. Package icon datasets1_scanfold_reactivity_analysis.zip
SARS-CoV-2 Motif Conservation Results This download contains a zip file with the full results of an Infernal/R-scape/CaCoFold analysis of ScanFold-predicted structural motifs. The ScanFold approach defines specific motifs of likely function. All predicted motifs containing at least one base pair with an average Zavg < -1 are included. The first set of motifs is from a purely in silico ScanFold analysis (unconstrained) and the other is from an experimentally informed ScanFold analysis (AllTop10). The AllTop10 results contain motifs predicted with 3 RNA structure probing datasets considered during folding. Here, the top 10% of reactive nucleotides were set to be unpaired during the ScanFold-Scan process (which informs the ultimate model building by disallowing highly reactive nucleotides from being paired). Each of these predicted motifs has been queried (using the Incarnato lab's cm-builder script) against ~25K coronavirus genomes using the Eddy lab's Infernal program in order to generate covariation models of each and iteratively search the coronavirus genomes for homologs. If a covariation model with homologs was successfully built, the resulting covariation model Stockholm alignment was then tested using the R-scape program for evidence of statistically significant covariation. Here, base pairs with significant covariation (GTp test; E < 0.05) are highlighted in green. Additionally, all Stockholm alignments were used to build models using the CaCoFold algorithm (which uses positive and negative covariation signals to identify base pairs). Package icon motifconservation.zip
High Resolution Figures Zipped folder containing all high resolution figures (TIFF format) from https://www.biorxiv.org/content/10.1101/2020.04.17.045161v1 Package icon hiresfigures.zip
Supplemental Tables (10.1101/2020.04.17.045161) Supplemental Tables described in: Andrews et al. "An in silico map of the SARS-CoV-2 RNA Structurome" bioRxiv 2020 https://www.biorxiv.org/content/10.1101/2020.04.17.045161v1 Package icon supplemental_tables.zip
Scanning Window Results (SARS-CoV-2) This GFF3 file comprises all scanning window results (MFE, z-score, p-value, ED, native sequence, dot-bracket MFE structure, and centroid structure). It has been bgzipped for viewing in JBrowse or other genome viewers. Binary Data nc_045512.2.strand1.gff3.gz
SARS-CoV-2 Results The full results of the ScanFold analysis of SARS-CoV-2 have been zipped into this folder (including Supplementary Dataset 1 from https://www.biorxiv.org/content/10.1101/2020.04.17.045161v1). Package icon nc_045512_scanfoldresults.zip
SARS-CoV-2-ExtractedStructures This file contains all structures that contain at least one base pair with a Zavg of < -1. The sequences comprising these structures were refolded individually, and z-scores and ensemble diversity were recalculated for each motif. Plain text icon extractedstructures.txt
Thermodynamic z-score BigWig track format for JBrowse or other genome browsers. The z-score is calculated for each window of the input sequence. For each window there are two sets of sequences: the native sequence and 100 randomized sequences with the same nucleotide content. MFE values are calculated for each. If the native sequence has a lower MFE than the average of the randomized versions, the z-score is negative; if the native MFE is more positive (i.e., less stable), the z-score is positive. The difference is normalized by dividing by the standard deviation across all MFEs, so the magnitude of the z-score gives the number of standard deviations separating the native (window) MFE from the mean of the random MFEs. Binary Data nc_045512.2.strand1_zscore.bw
p-value BigWig track format for JBrowse or other genome browsers. These values report the fraction of randomized-sequence MFE values that were more stable than the native MFE during calculation of the z-score (100 randomizations). Binary Data nc_045512.2.strand1_pvalue.bw
MFE (kcal/mol) BigWig track format for JBrowse or other genome browsers. MFE values calculated for 120 nt sequences using RNAfold (v2.4.14). Binary Data nc_045512.2.strand1_mfe.bw
Ensemble Diversity Value BigWig track format for JBrowse or other genome browsers. High numbers indicate that diverse structures can form; low numbers indicate that a single dominant structure may be forming. Binary Data nc_045512.2.strand1_ed.bw
SARS-CoV-2 Gene Features The SARS-CoV-2 gene features reported at NCBI in GFF3 bgzip format (accession NC_045512.2). Binary Data sequence.sorted.gff3.gz
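
The z-score and p-value tracks described in the table above can be illustrated with a short sketch. It is not the ScanFold source code: the function name `window_z_and_p` is hypothetical, the MFE values are made-up toy numbers (a real window uses 100 randomizations with MFEs from RNAfold), and two choices are assumptions based on the wording above: the standard deviation is taken over all MFEs (native plus randomized), and it is the population standard deviation.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def window_z_and_p(native_mfe: float,
                   random_mfes: Sequence[float]) -> Tuple[float, float]:
    """Return (z-score, p-value) for one scanning window.

    z = (native MFE - mean of randomized MFEs) / SD of all MFEs
    p = fraction of randomized MFEs more stable (lower) than the native MFE
    """
    # Assumption: "all MFEs" means native plus randomized values.
    all_mfes = [native_mfe, *random_mfes]
    sd = pstdev(all_mfes)  # assumption: population standard deviation
    z = 0.0 if sd == 0 else (native_mfe - mean(random_mfes)) / sd
    p = sum(1 for mfe in random_mfes if mfe < native_mfe) / len(random_mfes)
    return z, p


# Toy example: a native window much more stable than its shuffled versions.
z, p = window_z_and_p(-30.0, [-20.0, -22.0, -18.0, -20.0])
print(f"z-score = {z:.2f}, p-value = {p:.2f}")
```

A strongly negative z-score together with a p-value near zero corresponds to a window whose native sequence folds more stably than almost all of its shuffled counterparts.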