Scientist finds early virus sequences that had been mysteriously deleted
By Carl Zimmer
About a year ago, genetic sequences from more than 200 virus samples from early cases of COVID-19 in Wuhan, China, disappeared from an online scientific database.
Now, by rooting through files stored on Google Cloud, a researcher in Seattle reports that he has recovered 13 of those original sequences — intriguing new information for discerning when and how the virus may have spilled over from a bat or another animal into humans.
The new analysis, released last week, bolsters earlier suggestions that a variety of coronaviruses may have been circulating in Wuhan before the initial outbreaks linked to animal and seafood markets in December 2019.
As the Biden administration investigates the contested origins of the virus, known as SARS-CoV-2, the study neither strengthens nor discounts the hypothesis that the pathogen leaked out of a famous Wuhan lab. But it does raise questions about why original sequences were deleted, and suggests that there may be more revelations to recover from the far corners of the internet.
“This is a great piece of sleuth work for sure, and it significantly advances efforts to understand the origin of SARS-CoV-2,” said Michael Worobey, an evolutionary biologist at the University of Arizona who was not involved in the study.
Jesse Bloom, a virus expert at the Fred Hutchinson Cancer Research Center who wrote the new report, called the deletion of these sequences suspicious. It “seems likely that the sequences were deleted to obscure their existence,” he wrote in the paper, which has not yet been peer-reviewed or published in a scientific journal.
Bloom and Worobey belong to an outspoken group of scientists who have called for more research into how the pandemic began. In a letter published in May, they complained that there was not enough information to determine whether it was more likely that a lab leak spread the coronavirus, or that it leapt to humans from contact with an infected animal outside of a lab.
The genetic sequences of viral samples hold crucial clues about how SARS-CoV-2 shifted to our species from another animal, most likely a bat. Most precious of all are sequences from early in the pandemic, because they take scientists closer to the original spillover event.
As Bloom was reviewing what genetic data had been published by various research groups, he came across a March 2020 study with a spreadsheet that included information on 241 genetic sequences collected by scientists at Wuhan University. The spreadsheet indicated that the scientists had uploaded the sequences to an online database called the Sequence Read Archive, managed by the U.S. government’s National Library of Medicine.
But when Bloom looked for the Wuhan sequences in the database earlier this month, his only result was “no item found.”
Puzzled, he went back to the spreadsheet for any further clues. It indicated that the 241 sequences had been collected by a scientist named Aisi Fu at Renmin Hospital in Wuhan.
Searching medical literature, Bloom eventually found another study posted online in March 2020 by Fu and colleagues, describing a new experimental test for SARS-CoV-2. The Chinese scientists published it in a scientific journal three months later.
In that study, the scientists wrote that they had looked at 45 samples from nasal swabs taken “from outpatients with suspected COVID-19 early in the epidemic.” They then searched for a portion of SARS-CoV-2’s genetic material in the swabs. The researchers did not publish the actual sequences of the genes they fished out of the samples. Instead, they only published some mutations in the viruses.
But a number of clues indicated to Bloom that the samples were the source of the 241 missing sequences. The papers included no explanation as to why the sequences had been uploaded to the Sequence Read Archive, only to disappear later.
Perusing the archive, Bloom figured out that many of the sequences were stored as files on Google Cloud. Each sequence was contained in a file in the cloud, and the names of the files all shared the same basic format, he reported.
Bloom swapped in the code for a missing sequence from Wuhan. Suddenly, he had the sequence. All told, he managed to recover 13 sequences from the cloud this way.
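The recovery step described above amounts to filling a shared file-name pattern with a different accession code. A minimal sketch in Python follows. The URL template and the accession codes are hypothetical placeholders, since this article gives neither the real storage paths nor the run identifiers. It is not Bloom's actual script.

```python
# Hypothetical template: the real cloud storage paths are not given in the article.
URL_TEMPLATE = "https://storage.example.com/sra-runs/{accession}/{accession}.1"


def candidate_url(accession: str) -> str:
    """Fill the shared file-name pattern with one accession code."""
    return URL_TEMPLATE.format(accession=accession)


def candidate_urls(accessions):
    """Build one candidate URL per accession code, preserving input order."""
    return [candidate_url(a) for a in accessions]


# Made-up accession codes, for illustration only.
missing = ["RUN0000001", "RUN0000002"]
for url in candidate_urls(missing):
    print(url)
```

Each resulting address could then be requested to see whether a file still exists there. The sketch only builds the addresses and does not download anything.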
With this new data, Bloom looked back once more at the early stages of the pandemic. He combined the 13 sequences with other published sequences of early coronaviruses, hoping to make progress on building the family tree of SARS-CoV-2.
Working out all the steps by which SARS-CoV-2 evolved from a bat virus has been a challenge because scientists still have a limited number of samples to study. Some of the earliest samples come from the Huanan Seafood Wholesale Market in Wuhan, where an outbreak occurred in December 2019.
But those market viruses actually have three extra mutations that are missing from SARS-CoV-2 samples collected weeks later. In other words, those later viruses look more like coronaviruses found in bats, supporting the idea that there was some early lineage of the virus that did not pass through the seafood market.
Bloom found that the deleted sequences he recovered from the cloud also lack those extra mutations. “They’re three steps more similar to the bat coronaviruses than the viruses from the Huanan fish market,” Bloom said.
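The idea of one virus being "three steps" closer to a reference than another can be illustrated by counting the positions at which aligned sequences differ. The sketch below uses short invented strings, not real genome data, and the names `bat_like`, `recovered` and `market` are placeholders chosen for this example. Real analyses work on aligned genomes of roughly 30,000 letters and use more careful methods.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Invented toy sequences, for illustration only.
bat_like = "ACGTACGTACGT"
recovered = "ACGTACGTACGA"  # differs from bat_like at one position
market = "TCGTACCTAGGA"     # carries that difference plus three more

steps_recovered = count_differences(bat_like, recovered)
steps_market = count_differences(bat_like, market)
print(steps_market - steps_recovered)  # extra differences in the market sequence
```

In this toy case the market sequence sits three differences further from the reference than the recovered one, mirroring the relationship Bloom described.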
This suggests, he said, that by the time SARS-CoV-2 reached the market, it had been circulating for a while in Wuhan or beyond. The market viruses, he argued, aren’t representative of the full diversity of coronaviruses already loose in late 2019.
“Maybe our picture of what was present early in Wuhan from what has been sequenced might be somewhat biased,” he said.
It’s not clear why this valuable information went missing in the first place. Scientists can request that files be deleted by sending an email to the managers of the Sequence Read Archive. The National Library of Medicine, which manages the archive, said that the 13 sequences were removed last summer.
“These SARS-CoV-2 sequences were submitted for posting in SRA in March 2020 and subsequently requested to be withdrawn by the submitting investigator in June 2020,” said Renata Myles, a spokeswoman for the National Institutes of Health.
She said that the investigator, whom she did not name, told the archive managers that the sequences were being updated and would be added to a different database. But Bloom has searched every database he knows of, and has yet to find them.
“Obviously I can’t rule out that the sequences are on some other database or webpage somewhere, but I have not been able to find them [in] any of the obvious places I’ve looked,” he said.
Three of the co-authors of the 2020 testing study that produced the 13 sequences did not immediately respond to emails inquiring about Bloom’s finding. That study did not give contact information for another co-author, Fu, who was also named on the spreadsheet from the other study.
Some scientists are skeptical that there is anything sinister behind the removal of the sequences. “I don’t really understand how this points to a cover-up,” said Stephen Goldstein, a virus expert at the University of Utah.
Goldstein noted that the testing paper listed the individual mutations the Wuhan researchers found in their tests. Although the full sequences are no longer in the archive, the key information has been public for over a year, he said. It was just tucked away in a format that is hard for researchers to find.
“We all missed this relatively obscure paper,” Goldstein said.
Regardless of what happened to these 13 sequences, Bloom now wonders what other clues might be discovered online. In order to reconstruct the origin of COVID-19, all those clues potentially matter.
“Ideally, we need to try to find as many other early sequences as possible,” he said. “And I think this study suggests that we should look everywhere.”