Phylogeographic Mapping of Newly Discovered Coronaviruses Pinpoints the Direct Progenitor of SARS-CoV-2 as Originating from Mojiang, China

by Jonathan Latham, PhD and Allison Wilson, PhD

Back in March, the World Health Organisation’s report on the origin of the COVID-19 pandemic coronavirus confirmed something that had long been widely presumed.

Since the pandemic began, there has been an enormous virus hunt in China.

The purpose of this hunt has been to find the viruses intermediate between SARS-CoV-2 and its coronavirus relatives found in bats (Luk et al., 2019).

The closest known wild relative of SARS-CoV-2 was found by Zheng-li Shi of the Wuhan Institute of Virology (WIV) in a bat in central Yunnan province, China. This virus, called RaTG13, is 96.1% similar to SARS-CoV-2. This genetic difference (3.9%) corresponds to about 1150 nucleotide differences between the two viruses; i.e. it is quite a large gap. Finding intermediate viruses would solve two puzzles. One is geographical: By what means or in what host animal(s) did the virus get to Wuhan? The second is genetic: what viruses were the evolutionary intermediates between RaTG13 and SARS-CoV-2?
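As a rough check on the figure of about 1150 differences, the percentage gap can be converted back into a nucleotide count. The sketch below is illustrative only: it assumes a genome length of 29,903 nucleotides (the length of the SARS-CoV-2 reference genome, a value not given in this article) and ignores alignment gaps, so the result is an approximation rather than an exact count.

```python
# Assumption: length of the SARS-CoV-2 reference genome in nucleotides.
# This value is not stated in the article itself.
GENOME_LENGTH_NT = 29_903

# Whole-genome similarity of RaTG13 to SARS-CoV-2, as quoted in the text.
SIMILARITY_PCT = 96.1


def estimate_differences(genome_length_nt, similarity_pct):
    """Convert a percent similarity into an approximate count of differing sites."""
    divergence_fraction = (100 - similarity_pct) / 100
    return round(genome_length_nt * divergence_fraction)


estimated_differences = estimate_differences(GENOME_LENGTH_NT, SIMILARITY_PCT)
print(estimated_differences)  # 1166
```

Under these assumptions the estimate lands within a couple of percent of the figure quoted above, which is as close as a back-of-envelope conversion can be expected to get.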

The targets of this hunt have therefore been bats but also potential intermediate host animals, such as civets or mink, either one of which might have been the vector that brought COVID-19 to Wuhan. Even partial evidence for such a trail of viral intermediates would support a likely zoonotic origin of SARS-CoV-2.

To this end, according to that WHO report, scientists across China have sampled and tested over 80,000 animals, including 1,100 bats just in Hubei province, of which Wuhan is the capital. Yet beyond a few tantalising discoveries, which are discussed below, the search has been unsuccessful.

The broad failure of this enormous research effort has been scantly reported by the media, and sometimes its significance has been dismissed entirely. The editor of the journal Nature, for example, recently told the Times Higher Education Supplement that there was an “absence of new evidence” on the COVID-19 origin question. Only a handful of mass media articles, and none in the scientific literature, have done proper justice to the negative results of the sampling in China. Exceptions are “No one can find the animal that gave people covid-19” in the MIT Technology Review and an excellent article by Rowan Jacobsen in Newsweek that expertly articulated the essential points.

Parallel to the hunt inside China, a broader international one has taken place across neighboring Asian countries. This hunt has mainly focussed on testing bats, which are the reservoir hosts of most coronaviruses. Unlike most Chinese searches, its results have been reported in the scientific literature (e.g. Lee et al., 2020). As a consequence, in 2021 alone, a series of very near relatives of SARS-CoV-2 have been published. These derive from Japan (Murakami et al., 2021), Cambodia (Hul et al., 2021), Thailand (Wacharapluesadee et al., 2021), and Yunnan province, China (Zhou et al., 2021; Li L. et al., 2021).

The findings of this international search have likewise been poorly covered by the media; either ignored or, much more rarely, misrepresented (Lytras et al., 2021).

The purpose of this article is therefore to set the record straight. It shows that the positive and negative results of these unprecedented searches are of profound importance for understanding the origin of SARS-CoV-2.

Since the consequences of the Chinese search are fairly simple and better known, this article focuses mainly on analyzing and interpreting the published results of the international virus search.

In this article, we reveal that the new coronavirus genomes from Asia contain sufficient information to narrow down the geographical source of the direct bat progenitor of SARS-CoV-2 to a quite small region, the south-central part of the Chinese province of Yunnan. In other words, this analysis identifies with good confidence and quite precisely the location where a bat virus that ultimately became SARS-CoV-2 left its bat reservoir host, initiating the chain of events that led to the COVID-19 pandemic.

The analysis does not specify the precise nature of this initiation event. The jump out of bats may have been into an intermediate host (that later went on to infect a human), or it may have been a jump directly into a human, or the virus may even have been procured as part of a research project.

Nevertheless, such a very substantial narrowing of the location of the jump from bats represents a major step forward. Its implications for understanding the origin of SARS-CoV-2 are profound because the requirement for a Yunnan connection markedly constrains origin theories. For example, advocates of the imported frozen food theory favored in China now have to explain how imported food came to Wuhan carrying a virus from Yunnan (Zhou and Shi, 2021). Likewise, ideas that have circulated about possible European origins of the virus must now explain how a European patient zero could have acquired that virus from Yunnan. Also importantly, the bioweapon theory of Dr. Li-Meng Yan is ruled out by the newly discovered viruses discussed here.

But perhaps the greatest significance of this finding will turn out to be that the region of Yunnan indicated as the likely geographic origin is centered on a place called the Mojiang mine. This mine is already well-known to COVID-19 origins investigators.

The Mojiang mine was the site, in April 2012, of an apparent coronavirus outbreak. This outbreak affected six miners and killed three of them (Rahalkar and Bahulikar, 2020). The miners who became ill had been shoveling bat guano, which implicates a bat virus as the likely cause of infection. The Mojiang mine is also where RaTG13, the closest known natural relative of SARS-CoV-2, was found by Zheng-li Shi of the WIV. RaTG13 was collected during sampling efforts to determine the cause of the mine outbreak. For these and other reasons, the mine is already the focus of lab origin theories. It is highly suggestive, to say the least, for this new evidence to point so precisely to this location as the source of the SARS-CoV-2 bat progenitor.

The finding is thus rich with irony as well as importance. The Chinese and international searches for SARS-CoV-2-related coronaviruses were supposed to reveal a zoonotic origin and refute a lab leak (Anderson et al., 2020). Instead, they have achieved almost exactly the opposite.

Our assessment of the widespread mischaracterization of all this new evidence, both in the media and in the scientific literature, is therefore that most scientists and most media still resist evidence when it challenges a zoonotic origin or supports a lab leak. These new results do both.

Conclusion one: Intensive search in China yields no evidence for intermediate hosts

Based on the examples of the previous coronavirus outbreaks, the first SARS (hereafter, SARS One) and MERS, an outbreak trail leading to SARS-CoV-2 ought to begin with a reservoir host, in this case presumably bats (Wang et al., 2006; Corman et al., 2014; Hu et al., 2017; Luk et al., 2019). The virus would then have reached humans because an intermediate animal capable of amplifying the virus (presumably without sickening or dying itself) acquired it from bats. This intermediate animal host, with its intermediate viruses, should be a species found close to humans at or near the outbreak site.

Thus, a pool of viruses very highly related (≈99.9% similar) to SARS-CoV-2 should be findable in whatever animal species transmitted the virus to humans. Most likely, these intermediates will be domesticated, farmed, or smuggled animals (Opriessnig and Huang, 2020). In the case of SARS One, Himalayan palm civets used in the restaurant trade were the likely amplifying species; in the case of MERS, domesticated dromedaries were certainly the source (Guan et al., 2003; Azhar et al., 2014).

However, for SARS-CoV-2, no comparable pool of viruses in intermediate hosts has yet been found.

While the pandemic was still young, this absence was unremarkable. But, given the extent of sampling in China, the lack of evidence for any part of a transmission chain from bats in Yunnan to humans in Wuhan now represents a major data point against a zoonotic origin.

This lack is frequently dismissed by comparing how long it took to find the origins of SARS One (2002-4) and MERS (2011-2012). But since those outbreaks a lot of resources have been devoted, in China and elsewhere, to sampling and identifying viruses, particularly coronaviruses (e.g. Latinne et al., 2020). There have consequently been vast improvements in our understanding of virus ecology (for example, we now know about bat reservoirs). At the same time, there have been huge cost reductions and major leaps in genome sequencing (especially Next Generation Sequencing), database technology, virus taxonomy, and virus isolation methods.

Consequently, the current failure to find a zoonotic proximal origin profoundly challenges the notion that SARS-CoV-2 has a natural animal source. It is no credit to the media or the scientific community that this finding has received so little attention.

Conclusion two: The international search discovers a SARS-CoV-2 lineage with a pronounced geographical distribution

The second major finding is even more compelling but so far all but completely ignored. It derives primarily from the fruits of the international search for bats infected with coronaviruses.

This international search has yielded viral genome sequences that are close relatives of SARS-CoV-2. All are from various parts of Asia (Hu et al., 2018; Zhou P. et al., 2020; Zhou H. et al., 2020; Hul et al., 2021; Wacharapluesadee et al., 2021; Murakami et al., 2021; Zhou et al., 2021; Li L. et al., 2021). These genomes, found mostly in bats (with a few from pangolins), represent the closest relatives of SARS-CoV-2 known from nature. All are between 79% and 96.1% similar to SARS-CoV-2.

Virtually all of these viruses were unknown before the pandemic began and some are even now published only as scientific preprints. Some are from newly sampled bat populations (e.g. Wacharapluesadee et al., 2021; Zhou et al., 2021). Others come from freezer searches for old untested samples (e.g. Murakami et al., 2021). One is even derived from a reanalysis of previously ignored sequence information from historical samples (Li L. et al., 2021).

These twelve known closest relatives of SARS-CoV-2 are listed in Table 1 below. In date order of publication, Table 1 specifies their viral names, their country or province of origin, the genetic similarity of their whole genomes to SARS-CoV-2 (in %), the distance of their sampling location from the Mojiang mine, and the species they were sampled from.

The Mojiang mine, which is in central Yunnan, was selected as the center for this analysis because it is the location where the nearest naturally occurring relative of SARS-CoV-2, RaTG13, was found in 2013 by Zheng-li Shi (Zhou P. et al., 2020). The coordinates for the Mojiang mine used here (N 23°10’36″, E 101°21’28″) are from Canping Huang’s 2016 Ph.D. thesis, since those supplied by Zheng-li Shi (N 23°3’27073″, E 101°37’16074″) in Table S1 of Guo et al., 2021 are clearly incorrect.
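For readers who wish to reproduce straight-line distances from the mine, the following is a minimal sketch of one common approach, the haversine great-circle formula. The article does not state which method was used for its own distances, so this is an assumption, as are the mean Earth radius of 6,371 km and the function names. Only the mine coordinates come from the text.

```python
import math

# Assumption: mean Earth radius; treating the Earth as a sphere is an approximation.
EARTH_RADIUS_KM = 6371.0


def dms_to_decimal(degrees, minutes, seconds):
    """Convert degrees/minutes/seconds (north or east) to decimal degrees."""
    return degrees + minutes / 60 + seconds / 3600


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Mojiang mine coordinates as given in the text: N 23°10'36", E 101°21'28".
MOJIANG_LAT = dms_to_decimal(23, 10, 36)
MOJIANG_LON = dms_to_decimal(101, 21, 28)

# Hypothetical sampling site exactly one degree of latitude north of the mine,
# used here only to sanity-check the function (one degree is roughly 111 km).
distance_km = haversine_km(MOJIANG_LAT, MOJIANG_LON, MOJIANG_LAT + 1, MOJIANG_LON)
print(f"{distance_km:.1f} km")
```

To use it on a real sampling site, pass that site's decimal latitude and longitude in place of the hypothetical point.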

It should also be noted that, for this analysis, the viruses called YN04/05/08 are treated as one single virus. This consolidation is merited because they are virtually identical in genome sequence and were found at the same location (Zhou et al., 2021). The same applies to the viruses RshSTT200 and RshSTT182, which are referred to here just as RshSTT200 (Hul et al., 2021).

Thanks mainly to these newfound genome sequences, it is now evident that SARS-CoV-2, the pandemic-associated human virus, is just one member of a larger evolutionary lineage. This is seen in the phylogenetic tree shown in Figure 1 below. This lineage has been called the SARS-CoV-2-related lineage (Guo et al., 2021) and, independently, the ‘nCoV’ lineage (Lytras et al., 2021).

Figure 1 Phylogeny of the SARS-related coronaviruses (taken from Guo et al., 2021). The three lineages are highlighted in different colours. Zhejiang2013, at the bottom, is a reference outlier.

Thus, as shown in Figure 1, there are three lineages within the Sarbecoviruses. SARS One and its near relatives are at the top (highlighted in pink). At the bottom is a novel lineage (containing RaTG15) very recently reported in a preprint by Guo et al., 2021. In the middle, highlighted in blue, is the SARS-CoV-2 lineage that is the focus of this analysis.

The existence of such distinct phylogenetic lineages implies that the viruses within each one have (for unknown reasons) recombined more or less readily with each other, but mostly not with viruses from other lineages (Boni et al., 2020). Otherwise, the lineages would have merged. (We write ‘mostly’ because PrC31, ZXC21 and ZC45 are partial exceptions to this rule, having segments derived from other lineages.) Thus, members of the SARS-CoV-2 lineage are reproductively (i.e., genetically) isolated from the other two lineages. This understanding is key to the analysis below because it means the SARS-CoV-2 lineage can be treated as a distinct group whose members are evolving independently of the other lineages.

By treating this lineage separately, the sampling location and sequence of each virus can be analysed to answer a question that is crucial to the origin mystery. Where in the world did SARS-CoV-2 come from?

In an interview given just after the WHO origins investigation team returned from its famous trip to Wuhan, the team’s leader, Peter Ben Embarek, expressed the following thought:

“[H]aving found other relatively close virus strains to SARS-CoV-2 in the region also in South East Asia where these bats live is a strong indication that’s where the source is”

South East Asia is a big place. But Ben Embarek’s statement suggests how one can logically narrow down the possible origins of SARS-CoV-2.

In fact, a more precise analysis than this had already been published. A collaboration between the Wuhan Institute of Virology (WIV) and the EcoHealth Alliance used hundreds of partial viral sequences from China, most of them new to science, to map the geographical origin of SARS-CoV-2 more precisely (Latinne et al., 2020). The authors concluded:

“[W]e found that SARS-CoV-2 is likely derived from a clade of viruses originating in horseshoe bats (Rhinolophus spp.). The geographic location of this origin appears to be Yunnan province” (Latinne et al., 2020) [note: a clade equates here to a lineage].

Relatively little attention was paid at the time to this conclusion. This is largely because the authors provided two substantial caveats. The first was that viruses from outside China were not included in their study. The second caveat was that their analysis used only a small fragment (440 nucleotides) of the virus genome (for most of their samples this was the only sequence information available). A complete coronavirus genome is approximately 30,000 nucleotides. Because recombination between coronaviruses is generally frequent, analysis of complete genomes might reasonably be expected to give different results.

However, due to the new virus discoveries (listed in Table 1), these caveats no longer apply. For the SARS-CoV-2 lineage one can therefore re-do the analysis using complete genomes for all currently identified viruses in the SARS-CoV-2 lineage for which precise geographic location data is available.

None of the researchers who published the novel SARS-CoV-2 lineage viruses in Table 1 performed such an analysis (nor did Lytras et al., 2021, who recently reviewed the evolutionary relationships of the lineage).

However, such an analysis is simple to do. First, though, it requires excluding viruses whose sampling location is uncertain. Hence, those virus sequences extracted from smuggled pangolins (P4L and MP789) are not included in this geographic analysis. This is because a virus found in a pangolin smuggled into China might have originated from almost anywhere in SE Asia.

The sampling location of PrC31 is also uncertain. According to the NGDC genome database, the accession called PrC31 is from Pu’er City, which matches the initials in its name (these are not explained in the article). Pu’er City is a town 56 km (in a straight line) from the Mojiang mine. Pu’er City, however, is also the name of an administrative district that encompasses the mine. The furthest boundary of this district from the Mojiang mine is 250 km. Thus 250 km marks the maximum and 0 km the minimum presumed distance to the sampling site of PrC31. Given this uncertainty, we decided to omit PrC31 from the distance plot (Figure 2 below). However, PrC31 is important since, over certain parts of its genome, it is the closest known virus to SARS-CoV-2. It will therefore be discussed below, where appropriate, as will the pangolin genomes.

Zeroing in

After excluding these viruses, the results are simple to interpret. Table 1 allows a comparison of the degree of relatedness of each virus to SARS-CoV-2 and the sampling location for each virus. The closest relative of SARS-CoV-2 (RaTG13, 96.1% similar at the nucleotide level) was found at the Mojiang mine in Yunnan Province. The next closest genetic relatives of SARS-CoV-2 are RpYN06 (94.48% similar) and RmYN02 (93.2% similar). Both of these viruses were also found in Yunnan, just 150 km away (in a straight line) from RaTG13. The next two closest relatives, almost equally close to SARS-CoV-2, are RshSTT200 (92.70%) and RacCS203 (91.15%). These two viruses were discovered 1,180 km and 1,070 km away, respectively. The next closest (after PrC31, which cannot be pinpointed) are ZC45 (87.63%) and ZXC21 (87.39%), both found 2,195 km away. Last comes C_o319 (79.06%) from Iwate, Japan, 4,140 km away.

There is an obvious pattern here, which is even more evident when Table 1 (minus PrC31 and the pangolin viruses) is plotted out, as in Figure 2.
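One simple way to put a number on this pattern is a correlation coefficient between distance and similarity. The sketch below uses only the eight distance and similarity values quoted in the preceding paragraph. The choice of Pearson's r is ours for illustration; the analysis in this article rests on the table and plot, not on this statistic. With so few points, several of them sharing a sampling site, the result should be read as descriptive rather than as a formal test.

```python
import math

# (virus, distance from the Mojiang mine in km, % genome similarity to SARS-CoV-2),
# copied from the figures quoted in the text above.
SAMPLES = [
    ("RaTG13", 0, 96.1),
    ("RpYN06", 150, 94.48),
    ("RmYN02", 150, 93.2),
    ("RacCS203", 1070, 91.15),
    ("RshSTT200", 1180, 92.70),
    ("ZC45", 2195, 87.63),
    ("ZXC21", 2195, 87.39),
    ("C_o319", 4140, 79.06),
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


distances_km = [distance for _, distance, _ in SAMPLES]
similarities_pct = [similarity for _, _, similarity in SAMPLES]

r = pearson_r(distances_km, similarities_pct)
print(f"Pearson r (distance vs similarity) = {r:.2f}")
```

On these eight values the coefficient comes out strongly negative, meaning that similarity to SARS-CoV-2 falls steadily as distance from the mine increases.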