Chinese researchers appear to have deleted important data from a global database operated by the National Institutes of Health that could provide key insights into the origins of the COVID-19 pandemic, a preprint study claims.
An American scientist recovered the deleted data from cloud storage and published his analysis Tuesday. The paper, "Recovery of deleted deep sequencing data sheds more light on the early Wuhan SARS-CoV-2 epidemic," suggests that early virus samples from the Wuhan seafood market that until now have been the focus of most studies on the origins of the pandemic "are not fully representative of the viruses actually present in Wuhan at that time."
The paper is not yet peer-reviewed, and its findings should not yet be considered conclusive. The recovered virus samples do not support either the "lab leak" hypothesis or the "natural origins" hypothesis of the origins of SARS-CoV-2, according to scientists who have examined the paper. But these scientists say it does suggest the virus was spreading in Wuhan earlier than the Chinese government claimed, and the paper's author, Dr. Jesse Bloom, says his findings should reinforce skepticism that China has fully shared all relevant data on COVID-19.
Bloom, an influenza virus expert at the Fred Hutchinson Cancer Research Center, also says his study should be a cause for hope that scientists can recover additional information about the early spread of SARS-CoV-2 without an international investigation.
In the course of his research into SARS-CoV-2, Bloom read a paper that analyzed data from a project by Wuhan University that sequenced 45 positive coronavirus cases from January and early February 2020. The Chinese study, which developed an improved technique to test for and diagnose COVID-19 cases, was peer-reviewed and published in June 2020.
The SARS-CoV-2 sequences obtained by the Chinese researchers were uploaded to the NIH's Sequence Read Archive (SRA), a database for storing what are essentially maps of how viruses are built. These sequences can help scientists study how a virus originated and evolved over time, and such study may lead to knowledge that can prevent the next pandemic.
But when Bloom went to the SRA to examine the Chinese sequences, he found the data had been deleted. He explained in his paper that the SRA "is designed as a permanent archive of deep sequencing data." The only circumstance under which data can be removed is when the original researchers make an email request to have it deleted, provide reasons for doing so, and have that request approved by SRA staff.
A spokesperson for the NIH told the Telegraph that the NIH had "reviewed the submitting investigator's request to withdraw the data" in June 2020 and subsequently removed it.
"The requestor indicated the sequence information had been updated, was being submitted to another database, and wanted the data removed from SRA to avoid version control issues," the spokesperson said. "Submitting investigators hold the rights to their data and can request withdrawal of the data."
Bloom attempted to contact the Wuhan University researchers to ask why they requested that the data be deleted, but he did not receive a response. He noted in his paper that "there is no plausible scientific reason for the deletion" and suggested "it therefore seems likely the sequences were deleted to obscure their existence."
Fortunately, he was able to recover some of the data from Google Cloud, obtaining 34 early positive COVID-19 samples, and he was able to reconstruct partial viral sequences from 13 of them.
In a Twitter thread about his paper, Bloom explained why these sequences are crucial for understanding the origins of the virus.
"Although events that led to emergence of #SARSCoV2 in Wuhan are unclear (zoonosis vs lab accident), everyone agrees deep ancestors are coronaviruses from bats," Bloom said.
"Therefore, we'd expect the first #SARSCoV2 sequences would be more similar to bat coronaviruses, and as #SARSCoV2 continued to evolve it would become more divergent from these ancestors. But that is *not* the case!" he continued.
"Instead, early Huanan Seafood Market #SARSCoV2 viruses are more different from bat coronaviruses than #SARSCoV2 viruses collected later in China and even other countries."
These findings suggest that the first virus samples from the Huanan Seafood Market, originally suspected by scientists to be the source of the outbreak, were not the earliest versions of the virus. That would mean SARS-CoV-2 was circulating before China reported its first confirmed COVID-19 case on Dec. 8, 2019, and did not necessarily originate in the wet market.
Reacting to this new information, University of California, Berkeley, Professor Rasmus Nielsen, a genomics expert, said the findings "are the most important data that we have received regarding the origins of Covid-19 for more than a year."
Bloom said his work has several important implications.
"First, [the] fact this dataset was deleted should make us skeptical that all other relevant early Wuhan sequences have been shared," he tweeted, noting that China ordered many labs to destroy early samples of the virus.
"Sequence sharing could be further limited by fact that scientists in China are under an order from the State Council requiring central approval of all publications," he added.
The second major implication of this work is that "it may be possible to obtain additional information about early spread of #SARSCoV2 in Wuhan even if efforts for more on-the-ground investigations are stymied."
Bloom explained in his paper that "it should be immediately possible for the NIH to determine the date and purported reason for deletion of the data set analyzed here, since the only way sequences can be deleted from the SRA is by an e-mail request to SRA staff." He also suggested that SRA email records should be reviewed to determine if there were any more requests to delete early SARS-CoV-2 sequences from the database.
"Importantly, SRA deletions do not imply any malfeasance: there are legitimate reasons for removing sequencing runs, and the SRA houses >13-million runs making it infeasible for its staff to validate the rationale for all requests," Bloom said. "However, the current study suggests that at least in one case, the trusting structures of science have been abused to obscure sequences relevant to the early spread ofSARS-CoV-2 in Wuhan.
"A careful re-evaluation of other archived forms of scientific communication, reporting, and data could shed additional light on the early emergence of the virus."