Thursday, September 13, 2012

Junk No More: ENCODE Project Nature Paper Finds "Biochemical Functions for 80% of the Genome"

A groundbreaking paper in Nature reports the results of the Encyclopedia of DNA Elements (ENCODE) project, which has detected evidence of function for the "vast majority" of the human genome. Titled "An integrated encyclopedia of DNA elements in the human genome," the paper finds an "unprecedented number of functional elements," where "a surprisingly large amount of the human genome" appears functional. Based upon current knowledge, the paper concludes that at least 80% of the human genome is now known to be functional:
The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation.
(The ENCODE Project Consortium, "An integrated encyclopedia of DNA elements in the human genome," Nature, Vol. 489:57-74 (September 6, 2012) (emphasis added))
In the past we've frequently read about studies reporting function for many thousands of base pairs (see here or here for a few of many examples), but it's often hard to get a sense of just how much of the genome has had function detected for it. Through the collaboration of hundreds of researchers, the ENCODE project determined that "The vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type." As discussed further below, Tom Gingeras, a senior scientist with the ENCODE project, contends in an interview that "[a]lmost every nucleotide is associated with a function."
"Surprisingly Large" Amount of the Human Genome is Functional
The ENCODE paper divides up functional genomic elements into major categories: RNA transcribed regions, protein-coding regions, transcription-factor-binding sites, chromatin structure, and DNA methylation sites. After analyzing all of these different kinds of genomic elements, the project found:
Accounting for all these elements, a surprisingly large amount of the human genome, 80.4%, is covered by at least one ENCODE-identified element. The broadest element class represents the different RNA types, covering 62% of the genome (although the majority is inside of introns or near genes). Regions highly enriched for histone modifications form the next largest class (56.1%). Excluding RNA elements and broad histone elements, 44.2% of the genome is covered. Smaller proportions of the genome are occupied by regions of open chromatin (15.2%) or sites of transcription factor binding (8.1%), with 19.4% covered by at least one DHS or transcription factor ChIP-seq peak across all cell lines. (internal citations removed)
In addition to finding 863 pseudogenes that are "transcribed and associated with active chromatin," the paper reports that nearly all of the genome is found near a functional DNA element: "A total of 99% of the known bases in the genome are within 1.7 kb of any ENCODE element."
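A quick note on how to read these percentages: because each figure counts bases "covered by at least one" element of a given class, and the classes overlap, the individual numbers (62%, 56.1%, and so on) are not meant to be summed. The total is the union of all covered regions. The following is only a minimal sketch of that arithmetic on an invented 100-base toy genome. The coordinates, class names, and function names are hypothetical illustrations, not ENCODE data or the consortium's actual pipeline.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_fraction(intervals, genome_length):
    """Fraction of the genome covered by at least one interval."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


# Hypothetical toy genome of 100 bases; all coordinates are invented.
GENOME_LENGTH = 100
element_classes = {
    "rna": [(0, 40), (50, 72)],
    "histone": [(10, 50), (60, 76)],
    "tf_binding": [(20, 28)],
}

# Coverage of each class on its own.
per_class = {
    name: covered_fraction(ivs, GENOME_LENGTH)
    for name, ivs in element_classes.items()
}

# Coverage by at least one element of any class (the union).
all_intervals = [iv for ivs in element_classes.values() for iv in ivs]
union = covered_fraction(all_intervals, GENOME_LENGTH)

for name, frac in per_class.items():
    print(f"{name}: {frac:.1%}")
print(f"any element: {union:.1%}")
```

In this toy case the per-class fractions add up to more than 100%, while the union stays well below it, which is the same reason the paper's class-by-class figures can each be large and still combine to 80.4%.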
"Non-Conserved" No Longer Implies "Non-Functional"
As we've discussed here on ENV before, molecular biologists often infer function for non-coding DNA by finding the sequence is "conserved" or "constrained" (i.e. similar) across diverse species, implying there is some kind of selectable function preventing it from accumulating mutations. But if a sequence is not conserved or constrained (i.e. it's different) across different species, does that imply it's not functional? The ENCODE paper asked this question, and found the answer is "no":
Primate-specific elements as well as elements without detectable mammalian constraint show, in aggregate, evidence of negative selection; thus, some of them are expected to be functional.
Later the paper found that within primates, an appreciable proportion of unconserved sequences may be required for organismal function:
There are also a large number of elements without mammalian constraint, between 17% and 90% for transcription-factor binding regions as well as DHSs and FAIRE regions. Previous studies could not determine whether these sequences are either biochemically active, but with little overall impact on the organism, or under lineage specific selection. By isolating sequences preferentially inserted into the primate lineage, which is only feasible given the genome-wide scale of this data, we are able to examine this issue specifically. ... [A]n appreciable proportion of the unconstrained elements are lineage-specific elements required for organismal function, consistent with long-standing views of recent evolution, and the remainder are probably "neutral" elements that are not currently under selection but may still affect cellular or larger scale phenotypes without an effect on fitness. (internal citations omitted)
And of course, if genetic elements affect "cellular or larger scale phenotypes," then clearly those elements have function as well.
Findings are "Unprecedented"
The paper concludes that researchers have uncovered an "unprecedented number of functional elements":
The unprecedented number of functional elements identified in this study provides a valuable resource to the scientific community as well as significantly enhances our understanding of the human genome.
They also draw the obvious conclusion that much more of the genome appears to be involved in regulatory processes than in producing biochemically active proteins:
Interestingly, even using the most conservative estimates, the fraction of bases likely to be involved in direct gene regulation, even though incomplete, is significantly higher than that ascribed to protein coding exons (1.2%), raising the possibility that more information in the human genome may be important for gene regulation than for biochemical function.
And of course, the implications of this study for fighting disease are profound:
The broad coverage of ENCODE annotations enhances our understanding of common diseases with a genetic component, rare genetic diseases, and cancer, as shown by our ability to link otherwise anonymous associations to a functional element.
Junk DNA Will Be "Consigned to the History Books"
The news media have picked up on this story, with headlines like "Breakthrough study overturns theory of 'junk DNA' in genome" (UK Guardian) or "Bits of Mystery DNA, Far From 'Junk,' Play Crucial Role" (NY Times). These articles frankly acknowledge the implications for the old "junk DNA" notion:
  • "Long stretches of DNA previously dismissed as 'junk' are in fact crucial to the way our genome works, an international team of scientists said on Wednesday. ... For years, the vast stretches of DNA between our 20,000 or so protein-coding genes -- more than 98% of the genetic sequence inside each of our cells -- was written off as 'junk' DNA. Already falling out of favor in recent years, this concept will now, with Encode's work, be consigned to the history books." (Alok Jha, "Breakthrough study overturns theory of 'junk DNA' in genome," UK Guardian (September 5, 2012))
  • "The human genome is packed with at least four million gene switches that reside in bits of DNA that once were dismissed as 'junk' but that turn out to play critical roles in controlling how cells, organs and other tissues behave. The discovery, considered a major medical and scientific breakthrough, has enormous implications for human health because many complex diseases appear to be caused by tiny changes in hundreds of gene switches. ... Human DNA is 'a lot more active than we expected, and there are a lot more things happening than we expected,' said Ewan Birney of the European Molecular Biology Laboratory-European Bioinformatics Institute, a lead researcher on the project." (Gina Kolata, "Bits of Mystery DNA, Far From 'Junk,' Play Crucial Role," New York Times (September 5, 2012))
The NY Times further commented on the complexity of what we're finding:
There also is a sort of DNA wiring system that is almost inconceivably intricate.
"It is like opening a wiring closet and seeing a hairball of wires," said Mark Gerstein, an Encode researcher from Yale. "We tried to unravel this hairball and make it interpretable."
There is another sort of hairball as well: the complex three-dimensional structure of DNA. Human DNA is such a long strand -- about 10 feet of DNA stuffed into a microscopic nucleus of a cell -- that it fits only because it is tightly wound and coiled around itself. When they looked at the three-dimensional structure -- the hairball -- Encode researchers discovered that small segments of dark-matter DNA are often quite close to genes they control. In the past, when they analyzed only the uncoiled length of DNA, those controlling regions appeared to be far from the genes they affect.
Over at Discover Magazine, Tom Gingeras, a senior scientist affiliated with ENCODE, states that "Almost every nucleotide is associated with a function":
According to ENCODE's analysis, 80 percent of the genome has a "biochemical function". More on exactly what this means later, but the key point is: It's not "junk". Scientists have long recognised that some non-coding DNA probably has a function, and many solid examples have recently come to light. But, many maintained that much of these sequences were, indeed, junk. ENCODE says otherwise. "Almost every nucleotide is associated with a function of some sort or another, and we now know where they are, what binds to them, what their associations are, and more," says Tom Gingeras, one of the study's many senior scientists.
The Discover Magazine article further explains that the remaining 20% of the genome is likely to have function as well:
And what's in the remaining 20 percent? Possibly not junk either, according to Ewan Birney, the project's Lead Analysis Coordinator and self-described "cat-herder-in-chief". He explains that ENCODE only (!) looked at 147 types of cells, and the human body has a few thousand. A given part of the genome might control a gene in one cell type, but not others. If every cell is included, functions may emerge for the phantom proportion. "It's likely that 80 percent will go to 100 percent," says Birney. "We don't really have any large chunks of redundant DNA. This metaphor of junk isn't that useful."
We will have more to say about this blockbuster paper from ENCODE researchers in coming days, but for now, let's simply observe that it provides a stunning vindication of the prediction of intelligent design that the genome will turn out to have mass functionality for so-called "junk" DNA. ENCODE researchers use words like "surprising" or "unprecedented." They talk about how human DNA is "a lot more active than we expected." But under an intelligent design paradigm, none of this is surprising. In fact, it is exactly what ID predicted.
This important paper also represents a stunning vindication of Jonathan Wells's book The Myth of Junk DNA. He wrote there:
Far from consisting mainly of junk that provides evidence against intelligent design, our genome is increasingly revealing itself to be a multidimensional, integrated system in which non-protein-coding DNA performs a wide variety of functions. If anything, it provides evidence for intelligent design. Even apart from possible implications for intelligent design, however, the demise of the myth of junk DNA promises to stimulate more research into the mysteries of the genome. These are exciting times for scientists willing to follow the evidence wherever it leads.
(Jonathan Wells, The Myth of Junk DNA, pp. 9-10 (Discovery Institute Press, 2011).)
While undoubtedly a few holdouts will continue to defend "junk DNA" thinking for philosophical or theological reasons, this paper should put most arguments in favor of junk DNA to rest.