Geometrical Analysis of Genome for COVID-19 vs. SARS-Like Viruses

Mads Bahrami

Wolfram Research, Inc.

Z-curve theory provides a very unique geometrical approach for analyzing a genetic sequence, while preserving all the genome information. In this computational essay, I have developed a very efficient and fast function (in the Wolfram Language) to generate the Z-curve of any genetic sequence, regardless of its length. Then I generate Z-curves of 39 COVID-19 viruses together with 40 SARS-like viruses, using their complete genomes. Numerical analysis of corresponding Z-curves and their clustering shows a very close phylogenetic relationship between the family of COVID-19 viruses and the bat coronavirus isolate RaTG13 (MN996532), therefore supporting the hypothesis of a bat origin for COVID-19.

Collecting Data

Here is the accession list of SARS-like viruses:

viruses="AY274119.3","AY278488.2",

SequenceIcon

,"NC_004718","NC_009019";

One can use the function ImportFASTA, from the Wolfram Function Repository, to import FASTA data from the NCBI using an NCBI Reference Sequence. To save time, I have done this and given it an iconized form.

In[]:=

virusNameSeq=

VirusNameSequence

(*Map[ResourceFunction["ImportFASTA"],viruses]*);

Take the first element from all sublists of virusNameSeq, flatten the result and return the substrings from the first space to ",":

In[]:=

names=Flatten[StringCases[virusNameSeq[[All,1]]//Flatten," "~~x__~~","x]];Short[names,3]

Out[]//Short=

{Severe acute respiratory syndrome-related coronavirus isolate Tor2,SARS coronavirus BJ01,SARS coronavirus HKU-39849,SARS coronavirus CUHK-W1,SARS coronavirus Urbani,47,Rousettus bat coronavirus isolate GCCDC1 356,Bat coronavirus HKU9-1,Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1,SARS coronavirus,Bat coronavirus HKU4-1}

Here is a replacement rule to change U to T and keep only A, C, G and T:

In[]:=

srules={"U""T",Except[Characters["ACGT"]]""};

Give an association with the viruses’ accessions as keys and sequences as values, then replace the sequences using srules:

In[]:=

dataSARSCoV=StringReplace[srules]/@AssociationThread[viruses,Flatten[virusNameSeq[[All,2]]]];

Take COVID-19 data from the Wolfram Data Repository, then select those whose whole GenBank title contains "complete genome":

In[]:=

dataWDR=ResourceData["Genetic Sequences for Novel Coronavirus 2019-nCoV from Wuhan, China"][Select[StringContainsQ[#GenBankTitle,"complete genome"]&]];

From the previous imported data, take only the Accession and Sequence columns, make them an association with accessions as keys and sequences as values, then replace the sequences using srules:

In[]:=

dataCOVID19=StringReplace[srules]/@<|Rule@@@Normal[Values[dataWDR[All,{"Accession","Sequence"}]]]|>;

Generate an association from SARS-like and COVID-19 viruses with accessions as keys and corresponding names as values:

In[]:=

accessionNameAssoc=Join[AssociationThread[viruses(Style[#,Bold,Italic]&/@names)],AssociationThread[Keys[dataCOVID19]StringSplit[StringSplit[Normal[dataWDR[All,"GenBankTitle"]],","][[All,1]],"isolate"|" 2 "|"strain"][[All,-1]]]];

Z-Curve

For a given genetic sequence of length N, let’s denote the number of occurrences of A, C, G and T bases at the

subsequence with

and

. Then the corresponding 3D coordinate for the

subsequence is obtained by the following transformation:

Out[]//TraditionalForm=



1	-1	1	-1
1	1	-1	-1
1	-1	-1	1

For example, for a sequence such as AACGT, we have

{

=1,

=0,

=0}

{

=2,

=0,

=0}

{

=2,

=1,

=0,

=0}

{

=2,

=1,

=0}

and

{

=2,

=1,

=1}

. The corresponding Z-curve would be

{{0,0,0},{1,1,1},{2,2,2},{1,3,1},{2,2,0},{1,1,1}}

Note

)-(

)

is the difference between the purine bases (R) and the pyrimidine bases (Y). Therefore, it is a measure of RY disparity. Likewise,

quantifies the amino (M) and keto (K) difference, called MK disparity, with

the WS disparity (W for weak and S for strong hydrogen bonds, respectively). More about RY, MK and WS disparities can be found here.

Generate a Z-curve by cumulative sums of four matrices assigned to A, C, G and T:

In[]:=

ZcurveGenerator[st_]:={{1,-1,1,-1},{1,1,-1,-1},{1,-1,-1,1}}.#&/@FoldList[#1+Switch[#2,"A",{1,0,0,0},"C",{0,1,0,0},"G",{0,0,1,0},"T",{0,0,0,1}]&,{0,0,0,0},Characters[st]]

Generate the Z-curves of the COVID-19 viruses (the result is an association in the form of accession→Z-curve):

In[]:=

zCurveCOVID19=ZcurveGenerator/@dataCOVID19;

Generate the Z-curves of the SARS-like viruses (the result is an association in the form of accession→Z-curve):

In[]:=

zCurveSARS=ZcurveGenerator/@dataSARSCoV;

Represent a 3D image of the Z-curve in which red belongs to SARS-like viruses and blue to COVID-19 viruses:

In[]:=

Legended[Graphics3D[{Blue,Values[Line/@zCurveCOVID19],Red,Values[Line/@zCurveSARS]},AxesTrue,ImageSizeLarge,BoxRatios{2,4,4},AxesLabel{"RY disparity","MK disparity","WS disparity"}],LineLegend[{Blue,Red},{"Z-curves of "<>ToString@Length@zCurveCOVID19<>" COVID19 viruses","Z-curves of "<>ToString@Length@zCurveSARS<>" SARS-like viruses"}]]

Out[]=

Find the shortest sequence and denote its length with min:

In[]:=

min=Min[Min[Length/@zCurveCOVID19],Min[Length/@zCurveSARS]]

Out[]=

21480

Trim the Z-curves up to the length of the shortest sequence:

In[]:=

trimmedZCurveCOVID19=Take[#,min]&/@zCurveCOVID19;trimmedzCurveSARS=Take[#,min]&/@zCurveSARS;

In[]:=

totalZcurves=Join[trimmedZCurveCOVID19,trimmedzCurveSARS];

Clustering and Classification Based on Z-Curves

Create a numeric array of the type double-precision real of each Z-curve:

In[]:=

nCOVID=NumericArray[#,"Real64"]&/@trimmedZCurveCOVID19;nSARS=NumericArray[#,"Real64"]&/@trimmedzCurveSARS;

Now let us find the three nearest Z-curves of SARS-like viruses to the Z-curve of each COVID-19 virus. Without loss of fidelity and to compute faster, we will first project the Z-curves into 30-D.

Reduce the dimension to 30-D:

Find the three nearest Z-curves of SARS-like viruses to the Z-curve of each COVID-19 virus:

Flatten out the closest and obtain tallies:

As suggested by the Z-curve visualization, the three nearest SARS-like viruses are the same for all COVID-19 viruses.

Delete duplicates from the previous result and return their GenBank titles:

Our analysis shows that among the 57 SARS-like viruses that we study here, bat coronavirus isolate RaTG13, bat SARS-like coronavirus isolate bat-SL-CoVZXC21 and bat SARS-like coronavirus isolate bat-SL-CoVZC45 have the closest genome similarity to the COVID-19 viruses, providing evidence for a bat origin of COVID-19.

Let us do some hierarchical clustering.

Cluster the viruses based on their Z-curves using a single-linkage clustering algorithm:

Assign a different color to each cluster, and represent 3D images of corresponding Z-curves:

Project clusters into 3D using the t-distributed stochastic neighbor embedding algorithm:

Construct a dendrogram from the hierarchical clustering of the Z-curves (note the coloring is the same as the one for clusters; those that are bold and italic are SARS-like viruses):

Concluding Remarks

I generated the Z-curves of 39 available genetic sequences of COVID-19 viruses together with 40 SARS-like viruses. An eyeballing of the 3D visualization of the Z-curves suggests that there are three SARS-like viruses very close to the virus family of COVID-19. By qualifying the similarity/dissimilarity between Z-curves, I identify those three viruses as bat coronavirus isolate RaTG13 (MN996532), bat SARS-like coronavirus isolate bat-SL-CoVZXC21 (MG772934 .1) and bat SARS-like coronavirus isolate bat-SL-CoVZC45 (MG772933 .1). The hierarchical clustering of viruses based on the Z-curves also confirms the visualization, and identifies bat coronavirus isolate RaTG13 as the most likely culprit of COVID-19. My results strongly support the hypothesis of a bat origin for COVID-19.

Acknowledgments

I would like to thank Dr. Robert Nachbar and Dr. Daniel Lichtblau for their valuable feedback.

References

Zhang, Ren and Chun-Ting Zhang. 2014. “A Brief Review: The Z-Curve Theory and Its Application in Genome Analysis.” Current Genomics 15(2) (April): 78–94. https://doi.org/10.2174/1389202915999140328162433.

Zheng, W. X., L. L. Chen, H. Y. Ou, F. Gau and C. T. Zhang. 2005. “Coronavirus Phylogeny Based on a Geometric Approach.” Molecular Phylogenetics and Evolution 36(2) (August): 224–32. https://doi.org/10.1016/j.ympev.2005.03.030.