# Creating Dendrograms from Sequenced SARS-CoV-2 Genomes

Creating Dendrograms from Sequenced SARS-CoV-2 Genomes

Creating Dendrograms from Sequenced SARS-CoV-2 Genomes

DanielLichtblau

Wolfram Research, Inc.

The Wolfram Data Repository has a curated and continually updated collection of SARS-CoV-2 genome sequences. These sequences are comprised of strings containing the standard nucleotide letters ACTG. I will show how to create dendrograms from these strings in order to visualize relative genetic proximities both by place and date of collection.

Obtaining and Processing the Sequences

Our first order of business is to obtain the sequences from the Wolfram Data Repository. They are in the form of a Dataset and can be obtained via this URL. A convenient examples notebook can be found that has exactly the code I need in order to get lists of sequences along with their collection geographic region and date. Moreover, there is a simple way to filter so that we only use fully sequenced examples. This code and the data resource are the work of my colleague John Cassel.

In[]:=

fullGenes=ResourceData["Genetic Sequences for the SARS-CoV-2 Coronavirus"][Select[StringContainsQ[#GenBankTitle,"complete genome"]&],{"Sequence","GeographicLocation","CollectionDate"}];geneLists=Normal[fullGenes];Length[geneLists]

Out[]=

141

As it happens, I already know that many have a bunch of trailing adenine nucleotides, and the presence or absence thereof seems to be an artifact of the sequencing. The resource notebook has convenient code for removing these:

In[]:=

dropTrailingA[seq_]:=StringReplace[seq,StartOfString~~Shortest[a__]~~("A"..)~~EndOfStringa];

It is useful to check the size range of the sequence strings:

In[]:=

ReverseSortBy[Tally[Map[StringLength[dropTrailingA[#]]&,"Sequence"/.geneLists]],First]

Out[]=

{{29923,1},{29912,1},{29906,1},{29902,1},{29901,1},{29890,1},{29876,1},{29870,65},{29869,4},{29868,1},{29867,6},{29865,1},{29864,2},{29863,1},{29862,1},{29860,2},{29858,1},{29857,3},{29855,1},{29854,3},{29852,2},{29850,2},{29848,1},{29846,2},{29843,1},{29841,1},{29839,3},{29838,4},{29834,1},{29833,1},{29831,1},{29830,1},{29829,2},{29825,2},{29818,1},{29817,1},{29816,1},{29814,1},{29813,1},{29811,2},{29810,1},{29801,1},{29797,1},{29796,1},{29793,1},{29787,1},{29782,4},{29745,1},{29736,1}}

Removing outliers

Of 141 genome sequences, 65 have the same length of 29870 (adjusted for dropped trailing A sequences). It is easy to see that most are in a fairly tight range. We will restrict to these similar-length sequences on the grounds that the sequencing is not known to be perfect and we expect the lesser-quality sequenced examples to fall outside the size range.

Obtain lists with entries of the form (sequence, location, date):

In[]:=

normalGL=Map[Values,geneLists];

Prune off those with sizes outside the range where most are located and see how many remain:

In[]:=

midsize=Select[normalGL,29850<=StringLength[dropTrailingA[#[[1]]]]≤29880&];Length[midsize]

Out[]=

96

Obtain the sequences, locations and dates as three separate lists:

In[]:=

{fullseqs0,places,dates}=Transpose[midsize];fullseqs=Map[dropTrailingA,fullseqs0];

Tally the sequence lengths for the retained genomes:

In[]:=

ReverseSortBy[Tally[Map[StringLength,fullseqs]],First]

Out[]=

{{29876,1},{29870,65},{29869,4},{29868,1},{29867,6},{29865,1},{29864,2},{29863,1},{29862,1},{29860,2},{29858,1},{29857,3},{29855,1},{29854,3},{29852,2},{29850,2}}

Creating a Color List to Correspond to Collection Dates

We will create ordinary numbers from the collection dates. Since this is only a crude indicator of when a virus was acquired, we do nothing fancy here. The collection times are given to the hour but we only use the day. Since these first appear in 2019, we subtract that from the year and multiply by 365, then multiply the month number by the average number of days (I use 30.4 for this), then add the day of the month.

First see how a date is formed:

In[]:=

dates[[1]]//InputForm

Out[]//InputForm=

DateObject[{2020, 1, 19}, Day, Gregorian, -5.]

So we take the first element from each of these, perform the above-mentioned arithmetic, and then rescale so that the numbers range from 0 to 1:

In[]:=

dates1=Map[First,dates];ndates1=Map[(#[[1]]-2019)*365+#[[2]]*30.4+If[Length[#]3,#[[3]],15]&,dates1];ndates=Rescale[ndates1];

The rescaling was so that we could conveniently assign colors to dates:

In[]:=

colorDateLabels=Thread[Style[dates1,Map[Darker[Hue[#]]&,ndates+.05]]]

Out[]=

{{2020,1,19},{2020,3,15},{2020,3,13},{2020,3,14},{2020,3,13},{2020,3,14},{2020,3,13},{2020,3,14},{2020,3,13},{2020,3,13},{2020,3,13},{2020,3,12},{2020,3,13},{2020,3,13},{2020,1,22},{2020,1,22},{2020,2,24},{2020,2,17},{2020,2,17},{2020,2,18},{2020,2,21},{2020,2,21},{2020,2,18},{2020,1,29},{2020,1},{2020,3,1},{2020,2,29},{2020,2,28},{2020,2,27},{2020,2,21},{2020,2,21},{2020,2,21},{2020,2,18},{2020,2,18},{2020,2,17},{2020,2,24},{2020,2,24},{2020,2,18},{2020,2,18},{2020,2,25},{2020,2,20},{2020,2,17},{2020,2,20},{2020,2,17},{2020,2,17},{2020,2,17},{2020,2,17},{2020,2,2},{2020,1,30},{2020,1,31},{2020,1,27},{2020,2,24},{2020,1,28},{2020,1,28},{2020,1,28},{2020,1,26},{2020,2,28},{2020,1,29},{2020,2,23},{2020,2,11},{2020,2,10},{2020,2,6},{2020,1,8},{2020,2,7},{2020,2,5},{2020,1,31},{2020,1,17},{2020,1,28},{2020,1},{2020,1,29},{2020,1,31},{2020,1,29},{2020,1,29},{2020,1,29},{2020,1,25},{2020,1,25},{2020,1,1},{2019,12,30},{2019,12,30},{2019,12,30},{2019,12,23},{2020,1,25},{2019,12,30},{2019,12,30},{2019,12,30},{2019,12,30},{2020,1,22},{2020,1,22},{2020,1,23},{2020,1,2},{2020,1,2},{2020,1,21},{2020,1,19},{2020,1,11},{2019,12},{2019,12}}

Creating a Dendrogram with Labels Given by Collection Date

A phylogenetic tree is a type of dendrogram one can use to visualize proximities and branching of gene sequences. In order to create such a graphic, there must be an underlying way to obtain "distances" between pairs of sequences. One approach would be to use string distance measures. This can be quite slow when the strings have close to 30,000 letters. I developed a surrogate based on a well-known method for turning gene sequences into numeric arrays along with some standard methods for dimension reduction. I will defer on the details until a later section, since we will actually need to make alterations. For now I simply use the phylogenetic tree–plotting Wolfram Resource Function "out of the box." The tree does seem to show clumps grouped by collection date.

Plot the tree as labeled by collection date:

In[]:=

ResourceFunction["PhylogeneticTreePlot"][fullseqs,colorDateLabels,AspectRatio1.5,ImageSize700]

Out[]=

Very likely they also have some geographic affinity. We tackle that next.

Grouping by collection location

From inspection I know that every collection location is either a geographic Entity or else of the form Missing["Unavailable"]. So it is easy to extract the location names to use as dendrogram node labels. Of course many infection samples collected in one location may have originated elsewhere. And some geographic locations are not well "localized," e.g. China or the United States. So we hardly expect perfect correlation between genome proximity and collection location. But we do expect that there will be some visible evidence of a relation between genetic and geographic similarity:

Again we form a color scheme:

Now associate colors to geographic locations:

Create a dendrogram labeled by collection locations:

Labeling by date and location

It is of course straightforward to show both location and date in the plots:

A Digression to Explain the Method

When I started this notebook I did not expect to be describing the method behind PhylogeneticTreePlot. But I was not thrilled with the results above and played with variants. A happenstance failure pointed me in a direction that necessitates a modification to the code in that Wolfram Function Repository entry (which at some point I will address therein). To get to the modification, I need to give a brief tour of the inner workings.

Given a genetic sequence, a method to convert to an image was presented by Joel Jeffrey in the late 1980s. A modest refinement, which I use, is called the Frequency Chaos Game Representation (FCGR). We have a convenient Wolfram Function Repository item for this, ResourceFunction["FCGRImage"]. I will illustrate it on the first and last of the sequences in our set of SARS-CoV-2 genomes:

A next step is to reduce the dimensions using the Fourier cosine transform (FCT). The images above have dimensions 128×128 and I reduce them to 30×30. The idea is to retain only the lower-frequency components, in effect blurring the images by removing finer details.

Compute the FCTs:

In order to visualize the effect of this reduction, we must invert these (dimension-reduced) transforms:

One will notice the coarse similarities between the original FGCR images and those that are dimension-reduced by FCT. The point though is not to visualize these but rather to utilize them in a further dimension reduction. We now process all the genome sequences as above and flatten each into a vector. This gives an array of dimensions 96×900 (since 30×30 is 900). We now use the singular values decomposition (SVD) to reduce, this time to dimension 41.

Process all genome sequences to obtain those flattened vectors:

Here is a helper function I grabbed and altered (to be explained) from ResourceFunction["PhylogeneticTreePlot"]:

Compute the next step in dimension reduction:

Recall that we began with genome strings containing nearly 30K nucleotides (I know, the correct notation is 30000 bpi). We then made them into 128×128 pixel images. Which, if you work it out properly, is not a dimension reduction. We next reduced to 30×30 and then to 40 machine doubles (each taking 8 bytes). This is a substantial reduction. The point of the SVD is to concentrate most of the "energy" (properly defined) of the input into a smaller dimension. With a bit of work, we can visualize these reduced vectors.

Here are the pictures; the code details can be ignored for now:

The important takeaway is that these are visually quite similar to those FCT images. Actually, there is an important difference I did not yet mention. What I stumbled across earlier is that the SVD packs so much oomph (apologies for the technical term) into the first component that that one swamps the rest. To see this, I retained the singular values.

Here are the initial (largest) of the singular values from the SVD:

That first one is more than 100 times larger than the next. What this means for our purposes is this: If we retain it (as is done in ResourceFunction[“PhylogeneticTreePlot”]), then those dimension-40 vectors would all have large first components and piddly (another technical term, again sorry) remaining values. Since the dendrogram code works by clustering based on distances, we really do not get a reasonable sense of separation because the first components are all about the same and all swamp the rest. So this, finally, explains why I felt a need to modify the inner workings of that resource function. Going forward I will add an option to PhylogeneticTreePlot, with an Automatic default setting, to allow removing first components when the first singular value is much larger than the next. I should mention that this issue of a very large first singular value does not tend to arise when working with a set of genomes from different species. Which was the case for most of the experiments I did in a "spare time" project from a year ago and hence why I did not consider it before.

These Plots Could Be Better…

We now have dimension-reduced vectors that are not dwarfed by their initial components. We can use them to create dendrograms. It so happens that the Wolfram Language, rather conveniently, has a Dendrogram function. It also happens that PhylogeneticTreePlot does not use it. Indeed, it uses a related function from context (that really is a technical term, in the Wolfram Language) that is documented but not very visible; in a word, or actually two, it is a second-class function. Why does it not use the first-class Dendrogram directly? Because the author was only recently made aware that we have such a function. Okay, it is a big language, and it does not trouble me not to know everything there is in it. In any case, below we will use the built-in function Dendrogram.

A dimension-reduced point plot

First I want to show a different view of these dimension-reduced vectors. We do this by, well, even more dimension reduction. Specifically we convert to 3D so we can obtain a visualization. There are many ways to do this. I could, for example, do another SVD (and this is pretty much the engine behind one method in the built-in function DimensionReduce). I happen to like a method known as MultidimensionalScaling, and so I put it into the Wolfram Function Repository. That is the method I will use below.

First I show the picture for the case where we do not clip first components from the SVD step (some code is omitted here but will be shown in the next image):

Now we show the same point plot but done on the vectors formed without that large singular value:

One will observe that this second plot does not show such distinct clusters (though rotating the graphic reveals that there is indeed some degree of clustering). But it does tend to show most genome vectors having similar collection dates (as based on color scheme) to many other neighbors. This is hardly the case for the three clusters in that first plot.

The new dendrogram

Last I will show a dendrogram plot based on these modified dimension-reduced vectors. To my admittedly untrained eyes, it seems to show more contiguity with respect to collection dates and locations:

Of note is that the South Korea sample, and a few others, are no longer widely separated from the rest, as was previously evidenced by unbalanced branch lengths. Some analysis by John Cassel (see the Wolfram Data Repository page) shows that the actual string differences for that South Korea sample were modest. This gives some confidence that removing the vector components corresponding to the largest singular value was in fact a sensible thing to do.

Summary

Two months ago I began playing with genome sequences that were being made available for SARS-CoV-2 (which at that time was then still informally known as 2019-nCoV). I took the few sequences then available in the GenBank repository at NCBI, also signed up on GISAID and found several there, and moreover used many other coronavirus genomes from GenBank, for purposes of performing comparisons and various other computations. Much of what resulted can be found in a Wolfram Community post from early February. At that time, one interest of mine was to show how a phylogenetic tree could be used to indicate which coronaviruses were most closely related to the novel SARS-CoV-2. While such analysis had already been done elsewhere, it was interesting to see that methods available in the Wolfram Language were suitable for independently (and speedily) obtaining similar results. And such results can give useful clues to possible origins of this virus. For example, I recall reading a post in late January that mentioned "bats, not snakes" (a phrase that has since been repeated elsewhere). A phylogenetic tree containing coronavirus genomes from bats, snakes and several other species showed this quite clearly. In this notebook, I tackle a problem related to, but more difficult than, the phylogenetic tree analysis in the prior work. The goal herein is to more accurately compute dendrograms for genome samples from the same virus, specifically the SARS-CoV-2 full sequences now available in the Wolfram Data Repository.

This has been an interesting investigation. I am not sure I learned anything new about SARS-CoV-2, but as a layperson I did not really expect to. I did learn something about PhylogeneticTreePlot (an area more familiar to me) that was not really obvious in experiments on genome sets with greater variability than the one we looked at here. This led us into a brief tour through one approach to converting genome sequences into vectors. It has applicability to genome lookup, proximity measures and other aspects of alignment-free genomic computation. And it is an approach to which I am quite partial. I should be, since I spent significant time to develop it over the past year or two. While I would rather we did not have a test set of quite this significance, it seems to have use for gauging distances in those SARS-CoV-2 sequences available to date.

Related Links