In March 2020, when the World Health Organization declared a pandemic, the public sequence database GISAID held 524 sequences of the virus. Over the next month, scientists uploaded 6,000 more. By the end of May, the total was more than 35,000. (In contrast, scientists worldwide added 40,000 influenza sequences to GISAID in the whole of 2019.)
“Without a name, forget it. We can’t understand what other people are saying,” says Anderson Brito, a postdoctoral researcher in genomic epidemiology at the Yale School of Public Health who contributes to the Pango effort.
As the number of coronavirus sequences mounted, researchers trying to study them had to create entirely new infrastructure and standards on the fly. One of the most important components of this effort has been a universal naming system: without it, scientists would struggle to talk to each other about how the virus’s descendants are spreading and changing, whether to flag a question or, more importantly, to sound the alarm.
Where did Pango come from?
In April 2020, a handful of prominent virologists in the UK and Australia proposed a system of letters and numbers for naming new lineages, or branches, of the covid family. It had a logic and a hierarchy, even though the names it generated, such as B.1.1.7, were a bit of a mouthful.
One of the paper’s authors was Áine O’Toole, a PhD student at the University of Edinburgh. She quickly became the primary person actually doing the sorting and classifying, eventually combing through hundreds of thousands of sequences by hand.
She says, “Very early on, it was just whoever was available to take care of the sequences. It ended up being my job for a good while. I guess I never understood how far we were going to go.”
She soon set out to build a program to assign new genomes to the correct lineages. Shortly thereafter, another researcher, postdoc Emily Scher, built a machine-learning algorithm to speed things up even further.
They named the program Pangolin, a tongue-in-cheek reference to the debate over the virus’s animal origin. (The entire system is now simply known as Pango.)
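To make the idea of "assigning a genome to a lineage" concrete, here is a deliberately simplified sketch. It is not how Pangolin works internally, which the article does not describe. It only illustrates the general notion of matching a genome's mutations against per-lineage profiles. The lineage names (`X.1`, `X.2`), the mutation labels (`m1` to `m4`), and the function names are all hypothetical.

```python
from typing import Dict, Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of mutations two sets have in common (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def assign_lineage(mutations: Set[str], profiles: Dict[str, Set[str]]) -> str:
    """Return the lineage whose (hypothetical) mutation profile overlaps most."""
    # sorted() makes ties resolve to the alphabetically first lineage name
    return max(sorted(profiles), key=lambda name: jaccard(mutations, profiles[name]))

# Made-up profiles for illustration only; real lineages are defined by curators.
PROFILES = {"X.1": {"m1", "m2"}, "X.2": {"m1", "m3", "m4"}}
print(assign_lineage({"m1", "m3"}, PROFILES))
```

A set-overlap score like this is easy to reason about but quickly runs into the problem the article describes later: when many lineages share the same advantageous mutations, their profiles start to look alike.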
The naming system, along with the software to implement it, quickly became essential worldwide. Although the World Health Organization has recently begun to use Greek letters for variants that seem particularly worrisome, such as delta, these nicknames are intended for the public and the media. Delta actually refers to a growing group of variants, which scientists refer to by their more precise Pango names: B.1.617.2, AY.1, AY.2, and AY.3.
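The hierarchy in these names can be read mechanically: each dot-separated number marks one more step down a branch, and a short prefix such as AY stands in for a longer path (AY is shorthand for B.1.617.2, so AY.1 is B.1.617.2.1). The sketch below is a minimal illustration of that reading, not the Pango team's code; the function names are made up for this example and the alias table holds only the one entry needed here.

```python
from typing import Optional

# Illustrative alias table with a single entry: AY is shorthand for B.1.617.2.
ALIASES = {"AY": "B.1.617.2"}

def expand(name: str) -> str:
    """Replace a leading alias with the full lineage path it stands for."""
    head, _, rest = name.partition(".")
    full_head = ALIASES.get(head, head)
    return f"{full_head}.{rest}" if rest else full_head

def parent(name: str) -> Optional[str]:
    """Return the lineage one level up, or None for a root such as 'B'."""
    full = expand(name)
    return full.rsplit(".", 1)[0] if "." in full else None

def is_descendant(child: str, ancestor: str) -> bool:
    """True if `child` sits strictly below `ancestor` in the naming hierarchy."""
    c, a = expand(child).split("."), expand(ancestor).split(".")
    return len(c) > len(a) and c[: len(a)] == a

print(expand("AY.1"), parent("B.1.1.7"), is_descendant("AY.3", "B.1.617.2"))
```

Read this way, the three AY names in the paragraph above are all children of B.1.617.2, which is why they are grouped under the single nickname delta.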
“When alpha appeared in the UK, Pango made it easier for us to look for those mutations in our genomes to see if we had that lineage in our country as well,” Jolly says. “Since then, Pango has been used as a baseline for variant reporting and monitoring in India.”
Because Pango offers a logical and organized approach to what could otherwise be chaos, it may forever change the way scientists name viral strains, allowing experts from around the world to work together using a common vocabulary. “Most likely, this will be a format that we will use to track any other new virus,” Brito says.
Many of the basic tools for tracking and maintaining coronavirus genomes have been developed over the past year and a half by early-career scientists such as O’Toole and Scher. As the need for worldwide cooperation in the fight against the coronavirus grew, scientists scrambled to support it with dedicated infrastructure like Pango. Much of this work fell to young, tech-savvy researchers in their twenties and thirties. They relied on informal networks and open-source tools, which are free to use and which anyone can volunteer to tweak and improve.
“People at the forefront of new technologies tend to be graduate students and postdocs,” says Angie Hinrichs, a bioinformatics scientist at the University of California, Santa Cruz, who joined the Pangolin project earlier this year. For example, O’Toole and Scher work in the lab of Andrew Rambaut, a genomic epidemiologist who posted the first public virus sequences online after receiving them from Chinese scientists. “They were in an ideal position to provide these tools that have become so critical,” Hinrichs says.
It wasn’t easy. For most of 2020, O’Toole took on the bulk of the responsibility for identifying and naming new lineages herself. The university was closed, but she and another doctoral student in Rambaut’s lab, Verity Hill, got permission to come into the office. Her 40-minute commute to school from the apartment where she lived alone gave her some sense of normalcy.
Every few weeks, O’Toole would download the entire covid repository from the GISAID database, which had grown exponentially each time. Then she would look for groups of genomes with similar-looking mutations, or for things that looked odd and might have been mislabeled.
When she got stuck, Hill, Rambaut, and other members of the lab would step in to discuss the designations. But the hard work fell on her.
Determining when the descendants of the virus deserve a new lineage name can be as much an art as a science. It was an arduous process, sifting through an unprecedented number of genomes and asking again and again: Is this a new variant of covid or not?
“It was very hard,” she says. “But it was always really humbling. Imagine going through 20,000 sequences from 100 different places in the world. I saw sequences from places I had never heard of before.”
Over time, O’Toole struggled to keep up with the volume of new genomes to sort and name.
In June 2020, there were more than 57,000 sequences stored in the GISAID database, and O’Toole had sorted them into 39 variants. By November 2020, a month after she was supposed to present her thesis, O’Toole made her last solo run through the data. It took her 10 days to go through all the sequences, which by then numbered 200,000. (Although covid has overshadowed her research on other viruses, she is including a chapter on Pango in her thesis.)
Fortunately, Pango was designed to be collaborative, and others have stepped up. The online community, the one that Jolly turned to when she noticed the variant sweeping India, has grown and grown. This year, O’Toole’s work has been much more hands-off. New lineages are now mostly identified when epidemiologists around the world contact O’Toole and the rest of the team via Twitter, email, or GitHub, her preferred method.
“Now it’s more reactive,” O’Toole says. “If a group of researchers somewhere in the world is working on some data and they think they’ve identified a new lineage, they can submit a request.”
The flood of data has continued. This past spring, the team held a “pangothon,” a sort of hackathon in which they sorted 800,000 sequences into about 1,200 lineages.
“We gave ourselves three solid days,” O’Toole says. “It took two weeks.”
Since then, the Pango team has recruited a handful of volunteers, such as UCSC researcher Hinrichs and Yale researcher Brito, who both first got involved by weighing in on Twitter and the project’s GitHub page. Chris Ruis, a postdoctoral researcher at the University of Cambridge, has turned his attention to helping O’Toole clear the backlog of requests on GitHub.
O’Toole recently asked them to officially join the organization as part of the newly created Pango Network lineage designation committee, which discusses and makes decisions about variant names. Another committee, which includes lab leader Rambaut, makes higher-level decisions.
“We have a website, and an email that’s not just my email,” O’Toole says. “It’s getting more formal, and I think that’s really going to help it expand.”
Some cracks are starting to appear around the edges as the data grows. As of today, there are approximately 2.5 million covid sequences in GISAID, which the Pango team has divided into 1,300 branches. Each branch corresponds to a variant. Of those, eight should be monitored, according to the World Health Organization.
With so much to process, the software has started to buckle. Sequences are getting mislabeled. Many strains look alike, because the virus evolves the most beneficial mutations over and over again.
As a temporary measure, the team has built new software that uses a different sorting method and can catch things that Pango might miss.