# Geometrical Analysis of Genome for COVID-19 vs. SARS-Like Viruses

Geometrical Analysis of Genome for COVID-19 vs. SARS-Like Viruses

Geometrical Analysis of Genome for COVID-19 vs. SARS-Like Viruses

Mads Bahrami

Wolfram Research, Inc.

Z-curve theory provides a very unique geometrical approach for analyzing a genetic sequence, while preserving all the genome information. In this computational essay, I have developed a very efficient and fast function (in the Wolfram Language) to generate the Z-curve of any genetic sequence, regardless of its length. Then I generate Z-curves of 39 COVID-19 viruses together with 40 SARS-like viruses, using their complete genomes. Numerical analysis of corresponding Z-curves and their clustering shows a very close phylogenetic relationship between the family of COVID-19 viruses and the bat coronavirus isolate RaTG13 (MN996532), therefore supporting the hypothesis of a bat origin for COVID-19.

Collecting Data

Here is the accession list of SARS-like viruses:

viruses="AY274119.3","AY278488.2",

,"NC_004718","NC_009019";

One can use the function ImportFASTA, from the Wolfram Function Repository, to import FASTA data from the NCBI using an NCBI Reference Sequence. To save time, I have done this and given it an iconized form.

In[]:=

virusNameSeq=

(*Map[ResourceFunction["ImportFASTA"],viruses]*);

VirusNameSequence |

Take the first element from all sublists of virusNameSeq, flatten the result and return the substrings from the first space to ",":

In[]:=

names=Flatten[StringCases[virusNameSeq[[All,1]]//Flatten," "~~x__~~","x]];Short[names,3]

Out[]//Short=

{Severe acute respiratory syndrome-related coronavirus isolate Tor2,SARS coronavirus BJ01,SARS coronavirus HKU-39849,SARS coronavirus CUHK-W1,SARS coronavirus Urbani,47,Rousettus bat coronavirus isolate GCCDC1 356,Bat coronavirus HKU9-1,Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1,SARS coronavirus,Bat coronavirus HKU4-1}

Here is a replacement rule to change U to T and keep only A, C, G and T:

In[]:=

srules={"U""T",Except[Characters["ACGT"]]""};

Give an association with the viruses’ accessions as keys and sequences as values, then replace the sequences using srules:

In[]:=

dataSARSCoV=StringReplace[srules]/@AssociationThread[viruses,Flatten[virusNameSeq[[All,2]]]];

Take COVID-19 data from the Wolfram Data Repository, then select those whose whole GenBank title contains "complete genome":

In[]:=

dataWDR=ResourceData["Genetic Sequences for Novel Coronavirus 2019-nCoV from Wuhan, China"][Select[StringContainsQ[#GenBankTitle,"complete genome"]&]];

From the previous imported data, take only the Accession and Sequence columns, make them an association with accessions as keys and sequences as values, then replace the sequences using srules:

In[]:=

dataCOVID19=StringReplace[srules]/@<|Rule@@@Normal[Values[dataWDR[All,{"Accession","Sequence"}]]]|>;

Generate an association from SARS-like and COVID-19 viruses with accessions as keys and corresponding names as values:

In[]:=

accessionNameAssoc=Join[AssociationThread[viruses(Style[#,Bold,Italic]&/@names)],AssociationThread[Keys[dataCOVID19]StringSplit[StringSplit[Normal[dataWDR[All,"GenBankTitle"]],","][[All,1]],"isolate"|" 2 "|"strain"][[All,-1]]]];

Z-Curve

For a given genetic sequence of length N, let’s denote the number of occurrences of A, C, G and T bases at the subsequence with , , and . Then the corresponding 3D coordinate for the subsequence is obtained by the following transformation:

th

n

A

n

C

n

G

n

T

n

th

n

Out[]//TraditionalForm=

x n |

y n |

z n |

1 | -1 | 1 | -1 |

1 | 1 | -1 | -1 |

1 | -1 | -1 | 1 |

A n |

C n |

G n |

T n |

For example, for a sequence such as AACGT, we have , , , and . The corresponding Z-curve would be .

{=1,=0,=0,=0}

A

1

C

1

G

1

T

1

{=2,=0,=0,=0}

A

2

C

2

G

2

T

2

{=2,=1,=0,=0}

A

3

C

3

G

3

T

3

{=2,=1,=1,=0}

A

4

C

4

G

4

T

1

{=2,=1,=1,=1}

A

5

C

5

G

5

T

5

{{0,0,0},{1,1,1},{2,2,2},{1,3,1},{2,2,0},{1,1,1}}

Note =(+)-(+) is the difference between the purine bases (R) and the pyrimidine bases (Y). Therefore, it is a measure of RY disparity. Likewise, quantifies the amino (M) and keto (K) difference, called MK disparity, with the WS disparity (W for weak and S for strong hydrogen bonds, respectively). More about RY, MK and WS disparities can be found here.

x

n

A

n

G

n

C

n

T

n

y

n

z

n

Generate a Z-curve by cumulative sums of four matrices assigned to A, C, G and T:

In[]:=

ZcurveGenerator[st_]:={{1,-1,1,-1},{1,1,-1,-1},{1,-1,-1,1}}.#&/@FoldList[#1+Switch[#2,"A",{1,0,0,0},"C",{0,1,0,0},"G",{0,0,1,0},"T",{0,0,0,1}]&,{0,0,0,0},Characters[st]]

Generate the Z-curves of the COVID-19 viruses (the result is an association in the form of accession→Z-curve):

In[]:=

zCurveCOVID19=ZcurveGenerator/@dataCOVID19;

Generate the Z-curves of the SARS-like viruses (the result is an association in the form of accession→Z-curve):

In[]:=

zCurveSARS=ZcurveGenerator/@dataSARSCoV;

Represent a 3D image of the Z-curve in which red belongs to SARS-like viruses and blue to COVID-19 viruses:

In[]:=

Legended[Graphics3D[{Blue,Values[Line/@zCurveCOVID19],Red,Values[Line/@zCurveSARS]},AxesTrue,ImageSizeLarge,BoxRatios{2,4,4},AxesLabel{"RY disparity","MK disparity","WS disparity"}],LineLegend[{Blue,Red},{"Z-curves of "<>ToString@Length@zCurveCOVID19<>" COVID19 viruses","Z-curves of "<>ToString@Length@zCurveSARS<>" SARS-like viruses"}]]

Out[]=

Find the shortest sequence and denote its length with min:

In[]:=

min=Min[Min[Length/@zCurveCOVID19],Min[Length/@zCurveSARS]]

Out[]=

21480

Trim the Z-curves up to the length of the shortest sequence:

In[]:=

trimmedZCurveCOVID19=Take[#,min]&/@zCurveCOVID19;trimmedzCurveSARS=Take[#,min]&/@zCurveSARS;

In[]:=

totalZcurves=Join[trimmedZCurveCOVID19,trimmedzCurveSARS];

Clustering and Classification Based on Z-Curves

Create a numeric array of the type double-precision real of each Z-curve:

In[]:=

nCOVID=NumericArray[#,"Real64"]&/@trimmedZCurveCOVID19;nSARS=NumericArray[#,"Real64"]&/@trimmedzCurveSARS;

Now let us find the three nearest Z-curves of SARS-like viruses to the Z-curve of each COVID-19 virus. Without loss of fidelity and to compute faster, we will first project the Z-curves into 30-D.

Reduce the dimension to 30-D:

In[]:=

allReduced=DimensionReduce[Join[Values[nCOVID],Values[nSARS]],30,Method"LatentSemanticAnalysis"];nCOVIDreduced=Take[allReduced,Length[Values[nCOVID]]];nSARSreduced=AssociationThread[Keys[nSARS]Drop[allReduced,Length[Values[nCOVID]]]];

Find the three nearest Z-curves of SARS-like viruses to the Z-curve of each COVID-19 virus:

In[]:=

closest=Nearest[nSARSreduced,nCOVIDreduced,3]//Flatten//Tally

Out[]=

{{NC_045512,324},{MN996532,324},{MG772934.1,324}}

Flatten out the closest and obtain tallies:

In[]:=

closest//Flatten//Tally

Out[]=

{{NC_045512,1},{324,3},{MN996532,1},{MG772934.1,1}}

As suggested by the Z-curve visualization, the three nearest SARS-like viruses are the same for all COVID-19 viruses.

Delete duplicates from the previous result and return their GenBank titles:

In[]:=

Map[accessionNameAssoc[#]&,DeleteDuplicates[closest],{2}]//Flatten

Out[]=

{ Wuhan-Hu-1,Missing[KeyAbsent,324],Bat coronavirus RaTG13,Missing[KeyAbsent,324],Bat SARS-like coronavirus isolate bat-SL-CoVZXC21,Missing[KeyAbsent,324]}

Our analysis shows that among the 57 SARS-like viruses that we study here, bat coronavirus isolate RaTG13, bat SARS-like coronavirus isolate bat-SL-CoVZXC21 and bat SARS-like coronavirus isolate bat-SL-CoVZC45 have the closest genome similarity to the COVID-19 viruses, providing evidence for a bat origin of COVID-19.

Let us do some hierarchical clustering.

In[]:=

numericArray=Join[nSARS,nCOVID];

Cluster the viruses based on their Z-curves using a single-linkage clustering algorithm:

In[]:=

clusters=FindClusters[numericArray];

Assign a different color to each cluster, and represent 3D images of corresponding Z-curves:

In[]:=

Legended[Graphics3D[Riffle[colors=Take[ColorData[ColorData["Indexed"][[-5]],"ColorList"],UpTo[Length@clusters]],(Line/@Lookup[#][totalZcurves])&/@clusters],AxesTrue,ImageSizeLarge,BoxRatios{2,4,4},AxesLabel{"RY disparity","MK disparity","WS disparity"}],LineLegend[colors,Table["Cluster#"<>ToString[i]<>" ("<>ToString[Length[clusters[[i]]]]<>" viruses)",{i,Length[clusters]}]]]

Project clusters into 3D using the t-distributed stochastic neighbor embedding algorithm:

Construct a dendrogram from the hierarchical clustering of the Z-curves (note the coloring is the same as the one for clusters; those that are bold and italic are SARS-like viruses):

Concluding Remarks

I generated the Z-curves of 39 available genetic sequences of COVID-19 viruses together with 40 SARS-like viruses. An eyeballing of the 3D visualization of the Z-curves suggests that there are three SARS-like viruses very close to the virus family of COVID-19. By qualifying the similarity/dissimilarity between Z-curves, I identify those three viruses as bat coronavirus isolate RaTG13 (MN996532), bat SARS-like coronavirus isolate bat-SL-CoVZXC21 (MG772934 .1) and bat SARS-like coronavirus isolate bat-SL-CoVZC45 (MG772933 .1). The hierarchical clustering of viruses based on the Z-curves also confirms the visualization, and identifies bat coronavirus isolate RaTG13 as the most likely culprit of COVID-19. My results strongly support the hypothesis of a bat origin for COVID-19.

Acknowledgments

I would like to thank Dr. Robert Nachbar and Dr. Daniel Lichtblau for their valuable feedback.

References

Zhang, Ren and Chun-Ting Zhang. 2014. “A Brief Review: The Z-Curve Theory and Its Application in Genome Analysis.” Current Genomics 15(2) (April): 78–94. https://doi.org/10.2174/1389202915999140328162433.

Zheng, W. X., L. L. Chen, H. Y. Ou, F. Gau and C. T. Zhang. 2005. “Coronavirus Phylogeny Based on a Geometric Approach.” Molecular Phylogenetics and Evolution 36(2) (August): 224–32. https://doi.org/10.1016/j.ympev.2005.03.030.