Most complete human genome sequence

Who: T2T Consortium
What: 3,055,000,000 total number
Where: Not Applicable
When: 27 May 2021

The most complete human genome sequence was published as a preprint on 27 May 2021 by an international research group called the Telomere-to-Telomere (T2T) Consortium. The T2T Consortium's sequence, called T2T-CHM13, encompasses all 3.055 billion base pairs of a sample human genome.

Despite what contemporary reports implied, the landmark Human Genome Project (which ran from 1990 to 2003) did not result in a complete genome sequence. Due to limitations in the technology used, it was missing around 15 percent of the human genome. Later research improved this figure, but until the T2T Consortium published their results, the best available reference sequence was still missing around 8 percent of the information encoded in our chromosomes.

The missing sections were a consequence of how complex genomes have to be processed for sequencing. No technology exists that can read the whole human genome from start to finish, so the sequence of roughly 3 billion base pairs has to be cut up into millions of smaller fragments, each a few hundred base pairs in length. These short fragments are then cloned and analysed individually, before being stitched back together, by matching the places where they overlap, to create the full sequence.
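The cut-and-stitch idea described above can be illustrated with a toy example. This is a minimal sketch, not the method used by the Human Genome Project or the T2T Consortium: the 20-letter sequence, the read length of 8, the step of 4 and the function names are all assumptions chosen for illustration, and the sequence was picked so that every overlap is unambiguous.

```python
def fragment(genome, read_len, step):
    """Cut the genome into overlapping reads (toy stand-in for shotgun fragmentation)."""
    return [genome[i:i + read_len]
            for i in range(0, len(genome) - read_len + 1, step)]


def overlap(a, b, min_len):
    """Length of the longest suffix of a that is also a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def assemble(reads, min_len):
    """Greedily merge the pair of reads with the largest overlap until none remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: the assembly stays in several pieces
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical toy "genome" in which every 4-letter stretch is unique.
genome = "ATGCGTACCTGAAGTCCATG"
reads = fragment(genome, read_len=8, step=4)
contigs = assemble(reads, min_len=4)
print(contigs)
```

Because every overlap in this toy sequence is unique, the four reads merge back into a single piece identical to the original. Real genomes do not offer that guarantee.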

The missing areas in previous sequences were mostly from sections where the base pairs are arranged in long repeating patterns: short phrases of genetic information that repeat over and over again for thousands of pairs. Given the uniformity of these sections, and the difficulty in telling one section from another, they were considered impossible to sequence at the time of the original Human Genome Project.

Furthermore, at the time it was assumed that these repeating sections were "junk" DNA -- meaningless information left there by a quirk of evolution. Since the publication of the Human Genome Project's research, however, it has become apparent that these sequences do play a role in conditions such as autism and cancer.

The T2T Consortium made use of long-read sequencing, a state-of-the-art technique that allows the analysis of sequences that are thousands, rather than hundreds, of base pairs in length. With these larger sections, it became possible to piece together the full genome, in the correct order.
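Why read length matters for repetitive regions can be shown with a small sketch. The two sequences below are hypothetical: they differ only in how many times the phrase "CAG" repeats (5 copies versus 8). The read lengths of 6 and 20 are likewise illustrative assumptions, standing in for "short" and "long" reads.

```python
def all_reads(genome, read_len):
    """Return the set of every possible read of the given length from the genome."""
    return {genome[i:i + read_len]
            for i in range(len(genome) - read_len + 1)}


# Two hypothetical sequences that differ only in the number of repeat copies.
genome_a = "ACGT" + "CAG" * 5 + "TTGA"
genome_b = "ACGT" + "CAG" * 8 + "TTGA"

# Short reads never span the whole repeat, so both sequences yield the same reads.
short_a = all_reads(genome_a, 6)
short_b = all_reads(genome_b, 6)
print("short reads identical:", short_a == short_b)

# Long reads can span the repeat plus its flanks, so the copy number becomes visible.
long_a = all_reads(genome_a, 20)
long_b = all_reads(genome_b, 20)
print("long reads identical:", long_a == long_b)
```

With short reads the two sequences are indistinguishable, so no assembler could tell how many repeat copies the original contained. Reads long enough to reach from one side of the repeat to the other remove that ambiguity.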

The next step for the T2T Consortium is the creation of what is being called the Reference Pangenome. The CHM13 sequence, though complete, represents just one individual. Further sequences will need to be created to capture the information encoded in the Y chromosome (missing from the CHM13 sequence) as well as the variation between individuals across the human population.