Introduction

This talk covers the Wolfram Data Repository resource containing genetic sequences of the SARS-CoV-2 virus.
In[]:=
ResourceSearch["Genetic Sequences for the SARS-CoV-2 Coronavirus"]
Out[]=
Name: Genetic Sequences for the SARS-CoV-2 Coronavirus
ResourceType: DataResource
ResourceObject: ResourceObject["Genetic Sequences for the SARS-CoV-2 Coronavirus"]
Description: Released nucleotide sequences of the SARS-CoV-2 virus (the virus associated with ⋱
RepositoryLocation: https://www.wolframcloud.com/objects/resourcesyste...
DocumentationLink: https://datarepository.wolframcloud.com/resources/...

Where is this data from?

This data is provided by the National Center for Biotechnology Information's Severe acute respiratory syndrome coronavirus 2 data hub.

What data is contained in this resource?

This resource provides a Dataset of genetic sequences of the virus, along with metadata that gives context for those sequences:
In[]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"]
Out[]=
In all 20 rows shown, Species is "Severe acute respiratory syndrome-related coronavirus", Genus is "Betacoronavirus", Family is "Coronaviridae", Host is "human", and BioSample is empty. The remaining columns are:

Accession | Length | Sequence | CollectionDate | ReleaseDate | GeographicLocation | NucleotideStatus | GenBankTitle | IsolationSource | Country
MN908947 | 29903 | ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAA ⋱ | Dec 2019 | Sun 12 Jan 2020 | China | complete | Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome | | China
NC_045512 | 29903 | ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAA ⋱ | Dec 2019 | Mon 13 Jan 2020 | China | refseq, complete | Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome | | China
MN970003 | 290 | TAAACACCTCATACCACTTATGTACAAAGGACTTCCTTGGAATGTAGTGCGTATAAAGATTGTACAAATGTTAAGTGACA ⋱ | Wed 8 Jan 2020 | Thu 23 Jan 2020 | Thailand | | Severe acute respiratory syndrome coronavirus 2 isolate SI200040-SP orf1ab polyp ⋱ | lung, oronasopharynx | Thailand
MN970004 | 290 | TAAACACCTCATACCACTTATGTACAAAGGACTTCCTTGGAATGTAGTGCGTATAAAGATTGTACAAATGTTAAGTGACA ⋱ | Mon 13 Jan 2020 | Thu 23 Jan 2020 | Thailand | | Severe acute respiratory syndrome coronavirus 2 isolate SI200121-SP orf1ab polyp ⋱ | lung, oronasopharynx | Thailand
MN938384 | 29838 | CAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTT ⋱ | Fri 10 Jan 2020 | Fri 24 Jan 2020 | China | complete | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-002a_2020, complete genome | oronasopharynx | China
MN938385 | 287 | TGAGTTATGAGGATCAAGATGCACTTTTCGCATATACAAAACGTAATGTCATCCCTACTATAACTCAAATGAATCTTAAG ⋱ | Jan 2020 | Fri 24 Jan 2020 | China | | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-001_202 ⋱ | oronasopharynx | China
MN938386 | 287 | TGAGTTATGAGGATCAAGATGCACTTTTCGCATATACAAAACGTAATGTCATCCCTACTATAACTCAAATGAATCTTAAG ⋱ | Jan 2020 | Fri 24 Jan 2020 | China | | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-004_202 ⋱ | oronasopharynx | China
MN938387 | 107 | AATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGGGCAAACTGGAAAGATTGCTGA ⋱ | Jan 2020 | Fri 24 Jan 2020 | China | | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-001_202 ⋱ | oronasopharynx | China
MN938388 | 107 | AATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGGGCAAACTGGAAAGATTGCTGA ⋱ | Jan 2020 | Fri 24 Jan 2020 | China | | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-002b_20 ⋱ | blood | China
MN938389 | 107 | AATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGGGCAAACTGGAAAGATTGCTGA ⋱ | Jan 2020 | Fri 24 Jan 2020 | China | | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-004_202 ⋱ | oronasopharynx | China
MN938390 | 107 | AATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGGGCAAACTGGAAAGATTGCTGA ⋱ | Jan 2020 | Fri 24 Jan 2020 | China | | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-005_202 ⋱ | oronasopharynx | China
MN975262 | 29891 | ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAA ⋱ | Sat 11 Jan 2020 | Fri 24 Jan 2020 | China | complete | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-005b_2020, complete genome | lung, oronasopharynx | China
MN975263 | 287 | TGAGTTATGAGGATCAAGATGCACTTTTCGCATATACAAAACGTAATGTCATCCCTACTATAACTCAAATGAATCTTAAG ⋱ | Jan 2020 | Fri 24 Jan 2020 | China | | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-007a_20 ⋱ | oronasopharynx | China
MN975264 | 287 | TGAGTTATGAGGATCAAGATGCACTTTTCGCATATACAAAACGTAATGTCATCCCTACTATAACTCAAATGAATCTTAAG ⋱ | Jan 2020 | Fri 24 Jan 2020 | China | | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-007b_20 ⋱ | oronasopharynx | China
MN975265 | 287 | TGAGTTATGAGGATCAAGATGCACTTTTCGCATATACAAAACGTAATGTCATCCCTACTATAACTCAAATGAATCTTAAG ⋱ | Jan 2020 | Fri 24 Jan 2020 | China | | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-007c_20 ⋱ | lung, oronasopharynx | China
MN975266 | 107 | AATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGGGCAAACTGGAAAGATTGCTGA ⋱ | Jan 2020 | Fri 24 Jan 2020 | China | | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-007a_20 ⋱ | oronasopharynx | China
MN975267 | 107 | AATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGGGCAAACTGGAAAGATTGCTGA ⋱ | Jan 2020 | Fri 24 Jan 2020 | China | | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-007b_20 ⋱ | oronasopharynx | China
MN975268 | 107 | AATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGGGCAAACTGGAAAGATTGCTGA ⋱ | Jan 2020 | Fri 24 Jan 2020 | China | | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV_HKU-SZ-007c_20 ⋱ | lung, oronasopharynx | China
MN985325 | 29882 | ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAA ⋱ | Sun 19 Jan 2020 | Fri 24 Jan 2020 | United States | complete | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV/USA-WA1/2020, complete genome | oronasopharynx | United States
MN988713 | 29882 | ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAA ⋱ | Tue 21 Jan 2020 | Sat 25 Jan 2020 | United States | complete | Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV/USA-IL1/2020, complete genome | lung, oronasopharynx | United States
showing 1–20 of 82
This metadata includes the location where the genetic sample was acquired, the collection and release dates, biological descriptions of the sequence itself, and other fields. Note that the “Country” column may be phased out in future releases:
In[]:=
Keys@*First@*Normal@ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"]
Out[]=
{Accession,Species,Genus,Family,Length,Host,BioSample,Sequence,CollectionDate,ReleaseDate,GeographicLocation,NucleotideStatus,GenBankTitle,IsolationSource,Country}
These sequences include both complete genomes and particular regions of the virus:
In[]:=
(* Strip the common title prefix only where it is actually present *)
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][All,{"GenBankTitle","Length"}][All,{StringDelete[#["GenBankTitle"],StartOfString~~"Severe acute respiratory syndrome coronavirus 2 "]&,"Length"}]
Out[]=
isolate Wuhan-Hu-1, complete genome | 29903
Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome | 29903
isolate SI200040-SP orf1ab polyprotein, RdRP region, (orf1ab) gene, partial cds | 290
isolate SI200121-SP orf1ab polyprotein, RdRP region, (orf1ab) gene, partial cds | 290
isolate 2019-nCoV_HKU-SZ-002a_2020, complete genome | 29838
isolate 2019-nCoV_HKU-SZ-001_2020 ORF1ab polyprotein, RdRp region, (orf1ab) gene, partial cds | 287
isolate 2019-nCoV_HKU-SZ-004_2020 ORF1ab polyprotein, RdRp region, (orf1ab) gene, partial cds | 287
isolate 2019-nCoV_HKU-SZ-001_2020 surface glycoprotein (S) gene, partial cds | 107
isolate 2019-nCoV_HKU-SZ-002b_2020 surface glycoprotein (S) gene, partial cds | 107
isolate 2019-nCoV_HKU-SZ-004_2020 surface glycoprotein (S) gene, partial cds | 107
isolate 2019-nCoV_HKU-SZ-005_2020 surface glycoprotein (S) gene, partial cds | 107
isolate 2019-nCoV_HKU-SZ-005b_2020, complete genome | 29891
isolate 2019-nCoV_HKU-SZ-007a_2020 ORF1ab polyprotein, RdRp region, (orf1ab) gene, partial cds | 287
isolate 2019-nCoV_HKU-SZ-007b_2020 ORF1ab polyprotein, RdRp region, (orf1ab) gene, partial cds | 287
isolate 2019-nCoV_HKU-SZ-007c_2020 ORF1ab polyprotein, RdRp region, (orf1ab) gene, partial cds | 287
isolate 2019-nCoV_HKU-SZ-007a_2020 surface glycoprotein (S) gene, partial cds | 107
isolate 2019-nCoV_HKU-SZ-007b_2020 surface glycoprotein (S) gene, partial cds | 107
isolate 2019-nCoV_HKU-SZ-007c_2020 surface glycoprotein (S) gene, partial cds | 107
isolate 2019-nCoV/USA-WA1/2020, complete genome | 29882
isolate 2019-nCoV/USA-IL1/2020, complete genome | 29882
showing 1–20 of 82

What analysis is provided as part of this resource?

Additional Elements

Get a timeline plot of collection dates:
In[]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus","CollectionTimeline"]
Out[]=
See a timeline plot of release dates:
In[]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus","ReleaseTimeline"]
Out[]=
In[]:=
ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus","AffectedLocations"]
Out[]=

Visualization

A phylogenetic tree comparison of complete genomes suggests that while clusters of samples from China, the United States, and Japan are very similar, later samples diverge as the virus spreads and mutates. Dropping the trailing run of adenine bases avoids arbitrary differences caused by varying poly(A) tail lengths, which may be sequencing artifacts and shouldn’t affect viral adaptivity:
In[]:=
dropTrailingA[seq_] := StringReplace[seq, StartOfString ~~ Shortest[a__] ~~ ("A" ..) ~~ EndOfString :> a];
Apply[
 ResourceFunction["PhylogeneticTreePlot"],
 Transpose[
  {dropTrailingA@First[#], Row@(Rest@#)} & /@
   (Values /@ Normal[
      ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][
       Select[StringContainsQ[#GenBankTitle, "complete genome"] &],
       {"Sequence", "GeographicLocation", "CollectionDate"}]])
  ]
 ]
Out[]=

Analysis

Comparing the mutation content of samples by geographic location suggests that China has seen the most viral evolution, but that each location has its own unique strains.

How else might I analyze the data in this resource?

Apply an analysis similar to the above to virus fragments of the same type.
Use sequence alignment to uncover specific mutations and compare them over time to verify the viral lineages produced from the phylogenetic trees provided.
Calculate the difference between collection and release date by geographic location.
Classify which terms will be in the GenBank label based upon the sequence.
Compare sample and mutation counts to location populations.
...
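
As a rough sketch of the third suggestion above, the lag between collection and release can be grouped by location. The six records below are copied from rows of the dataset preview that have day-level collection dates. The names `records`, `lagInDays`, and `meanLag` are illustrative and not part of the resource. Rows whose collection date is only known to the month (for example "Jan 2020") would need separate handling.

```wolfram
(* Six rows from the dataset preview, restricted to day-level collection dates *)
records = {
   <|"GeographicLocation" -> "Thailand", "CollectionDate" -> DateObject[{2020, 1, 8}], "ReleaseDate" -> DateObject[{2020, 1, 23}]|>,
   <|"GeographicLocation" -> "Thailand", "CollectionDate" -> DateObject[{2020, 1, 13}], "ReleaseDate" -> DateObject[{2020, 1, 23}]|>,
   <|"GeographicLocation" -> "China", "CollectionDate" -> DateObject[{2020, 1, 10}], "ReleaseDate" -> DateObject[{2020, 1, 24}]|>,
   <|"GeographicLocation" -> "China", "CollectionDate" -> DateObject[{2020, 1, 11}], "ReleaseDate" -> DateObject[{2020, 1, 24}]|>,
   <|"GeographicLocation" -> "United States", "CollectionDate" -> DateObject[{2020, 1, 19}], "ReleaseDate" -> DateObject[{2020, 1, 24}]|>,
   <|"GeographicLocation" -> "United States", "CollectionDate" -> DateObject[{2020, 1, 21}], "ReleaseDate" -> DateObject[{2020, 1, 25}]|>
   };

(* Number of days between collection and release for one record *)
lagInDays[r_] := QuantityMagnitude[
   DateDifference[r["CollectionDate"], r["ReleaseDate"], "Day"]];

(* Mean lag in days for each location *)
meanLag = GroupBy[records, (#GeographicLocation &) -> lagInDays, Mean]
```

The same `lagInDays` function could be mapped over the rows of the full Dataset once the month-only dates have been filtered out or given an assumed day.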

How might I use this data in conjunction with other data I might have?

Cross Virus Comparison

If you have the sequences of other viruses, you might compare how similar they are to the viral samples here.

Gathering Data

Here is an example using the original reference sequence from above:
We will also use another complete sequence from the provided dataset, which I have chosen arbitrarily:
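
One possible way to pull individual sequences out of the dataset is to look them up by accession. The helper `sequenceByAccession` below is hypothetical, and the `toy` Dataset is a stand-in that holds only the first 16 bases of two sequences shown in the preview, so that the sketch runs without a network connection. MN938384 is used purely as an example of a second complete genome and is not necessarily the one chosen in the talk.

```wolfram
(* Hypothetical helper: "Sequence" field of the first row with the given accession *)
sequenceByAccession[data_Dataset, accession_String] :=
  SelectFirst[Normal[data], #Accession === accession &]["Sequence"];

(* Toy stand-in for the resource, using prefixes of sequences from the preview *)
toy = Dataset[{
    <|"Accession" -> "NC_045512", "Sequence" -> "ATTAAAGGTTTATACC"|>,
    <|"Accession" -> "MN938384", "Sequence" -> "CAACCAACTTTCGATC"|>
    }];

referencePrefix = sequenceByAccession[toy, "NC_045512"]
otherPrefix = sequenceByAccession[toy, "MN938384"]
```

With the real resource, the first argument would be `ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"]` instead of `toy`.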
Access a sequence for a SARS-like virus that occurs in bats:
Similarly, access the reference SARS sequence for humans:

Similarity Comparison

First, we’ll define a normalized similarity function so that similarities of sequences of different lengths are treated identically:
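
The original definition is not reproduced here, so the following is only a minimal sketch of one way to normalize by length. It assumes an edit-distance-based measure: `normalizedSimilarity` is a name chosen for illustration, and the talk may well have used a different scoring function.

```wolfram
(* Assumed definition: 1 for identical strings, 0 when every position must change *)
normalizedSimilarity[a_String, b_String] :=
  1 - EditDistance[a, b]/Max[StringLength[a], StringLength[b]];

(* One substitution in eight bases *)
normalizedSimilarity["ACGTACGT", "ACGTACGA"]
```

Dividing by the longer length keeps the value between 0 and 1 regardless of sequence size, so a 107-base fragment and a complete genome can be scored on the same scale. Computing an edit distance on complete genomes of roughly 30,000 bases can be slow.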

Alignment Visualization

First, we’ll obtain the sequence alignments for each sequence with the original reference SARS-CoV-2 sample:
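
As a small illustration of what an alignment looks like, the built-in `SequenceAlignment` can be applied to two short strings. Here `reference` is the first 20 bases of the reference genome shown earlier, and `variant` is the same string with one base changed by hand; neither is a real comparison from the talk.

```wolfram
(* Toy inputs: a 20-base prefix of the reference and a hand-edited copy *)
reference = "ATTAAAGGTTTATACCTTCC";
variant = "ATTAAAGGTTCATACCTTCC";

(* Matching runs come back as strings, differing runs as {fromReference, fromVariant} pairs *)
alignment = SequenceAlignment[reference, variant]

(* Taking the first entry of every pair recovers the reference *)
StringJoin[Replace[alignment, {a_, _} :> a, {1}]] === reference
```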
We can see that there are far fewer differences among the SARS-CoV-2 sequences than between SARS-CoV-2 and the bat SARS-like virus:
If an alignment element is a match (a string), we’ll add its length; otherwise we’ll subtract the length of the mismatch:
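
A minimal sketch of that scoring rule follows. The name `alignmentStep` is hypothetical, and taking the longer side of a mismatched pair as "the length of the mismatch" is an assumption. The input is a hand-written list in the form that `SequenceAlignment` returns.

```wolfram
(* Matched run: contributes its length *)
alignmentStep[match_String] := StringLength[match];

(* Mismatched pair: contributes minus the length of its longer side (assumed convention) *)
alignmentStep[{a_String, b_String}] := -Max[StringLength[a], StringLength[b]];

steps = alignmentStep /@ {"ATTAAAGGTT", {"T", "C"}, "ATACCTTCC"}
```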
We can create a set of coordinates with the slope determined by alignment similarity:
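
One way to read this step, offered as an interpretation rather than the original code, is to walk along the alignment and accumulate the signed lengths: the horizontal coordinate advances by the length of each element, and the vertical coordinate rises on matches and falls on mismatches. A perfect alignment then has slope 1, and every mismatch pulls the slope down. `alignmentPath` is a hypothetical name.

```wolfram
alignmentPath[alignment_List] := Module[{steps},
   (* Signed length of each element: positive for matches, negative for mismatches *)
   steps = Replace[alignment, {
      s_String :> StringLength[s],
      {a_String, b_String} :> -Max[StringLength[a], StringLength[b]]
      }, {1}];
   (* x: position along the alignment; y: running score; start at the origin *)
   Prepend[Transpose[{Accumulate[Abs[steps]], Accumulate[steps]}], {0, 0}]
   ];

path = alignmentPath[{"ATTAAAGGTT", {"T", "C"}, "ATACCTTCC"}]
```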
Using these functions, we can plot the similarity of different alignments in comparison with each other:
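
The comparison plot itself might look something like the sketch below. The two coordinate lists are written by hand to stand in for a close alignment and a more divergent one; they are illustrative values, not results computed from the resource.

```wolfram
(* Hand-written example paths: {position along alignment, running score} *)
closePath = {{0, 0}, {10, 10}, {11, 9}, {20, 18}};
divergentPath = {{0, 0}, {4, 4}, {7, 1}, {12, 6}, {14, 4}, {20, 10}};

plot = ListLinePlot[{closePath, divergentPath},
  PlotLegends -> {"close match (illustrative)", "divergent (illustrative)"},
  AxesLabel -> {"alignment position", "running score"}]
```

A steeper line means the sequence stays closer to the reference over its whole length.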

...