Feature Story: Revealing the Secrets of the Genome
by Ross Hardison

We humans have been fascinated, from our earliest history, with questions about what makes us distinctively human and different from other species. We also wonder about how humans differ from one another; in particular, what could account for disorders that are heritable. These issues initially were analyzed in terms of easily observable distinctions among major groups of animals, such as the number of legs, presence of hair or feathers, or having offspring live-borne or hatching from eggs. The framework for discussing these issues has changed over the past century, as biochemists and molecular biologists have identified molecules that are responsible for living processes, including the molecules that comprise an organism’s genome—the blueprint for its growth and development.
In the mid-1980s, a concerted effort began to determine the sequence (the order and identity) of all the base pairs in the human (Homo sapiens) genome, along with the sequences of the genomes of several model organisms. These additional organisms included the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the worm Caenorhabditis elegans, the fly Drosophila melanogaster, and the mouse Mus musculus. This effort was controversial from the start. It also led to intense competition between a privately funded, for-profit effort at Celera Genomics and a publicly funded international effort dedicated to free data release and coordinated by the U.S. Department of Energy and the U.S. National Institutes of Health. Much has been written about these exciting and ground-breaking efforts (see list of references). This article picks up the story mostly after the dust had settled from those efforts, when the attention of the scientific community turned from the major task of sequencing the human genome to trying to find important, functional regions within the genome.
One of the early proponents of sequencing the human genome was Paul Berg, a distinguished alumnus of Penn State’s Eberly College of Science, a professor at Stanford University, and a Nobel Laureate. Critics of his proposals were concerned that resources would be wasted by sequencing genomic DNA that did not code for proteins, which some people felt was “junk.” His prediction that “this so-called ‘junk’ could contain the crown jewels of the genome” is being validated in many ways in current research. Some of the comparative genomics that supports Dr. Berg’s contention has been fueled by research at Penn State. These and other contributions of Penn State researchers will be featured in this article. In particular, this article will review the goals of genomics as well as some recent advances, starting with genome sequencing and moving on to comparative studies aimed at tracing the ancestry of mammalian genomes, predicting functional elements, and testing those predictions.
As our understanding of the molecular basis of living processes improves, we can examine the fundamental features of organisms and compare them in quantitative and objective terms. For example, we would like to know what cellular structures and metabolic processes are common to all mammals, which ones differ in various species, and which ones are altered in some humans to make them susceptible to certain diseases. Work over the past half century has suggested some general trends, but a full accounting of the similarities and differences, either between species or among individuals within a species, has been lacking. Recent progress in genomics, the study of genomes, has revolutionized our approach to answering these enduring questions.
Genomes carry the information and instructions needed to make all the macromolecules of an organism, whether it is a bacterium, a fungus, a plant, or an animal. They are also the vehicle to pass along this information to the offspring in such a way that the progeny are very similar, but not identical, to the parents. Thus, they determine continuity of features within a species, but they also are the playground on which new variations on those features arise. As new species arise, the record of common ancestry and the paths to new features are recorded in genomes. Genomics is now central to all aspects of the life sciences, and analysis of genomes is fueling innovative studies in almost all the disciplines represented in the Eberly College of Science, including statistics, mathematics, physics, and chemistry, in addition to the obvious life sciences. In addition, notable advances in anthropology, medicine, agriculture, and other fields are based on contemporary genomics research.
Genes are particular strings of nucleotides (the core subunits of DNA or RNA) that convey the information to make a protein or accomplish some other function. Genes connected in a series constitute chromosomes. Genomes consist of all the chromosomes in a cell. Most bacteria have a single major chromosome, which is circular, whereas humans have 22 pairs of non-sex chromosomes (autosomes) plus two sex chromosomes (XX for females and XY for males), all of which are linear. Chromosome size, and thus genome size, varies enormously among species, from as little as a million base pairs (a megabase) in bacteria to hundreds of megabases for many human chromosomes. The human genome is almost 3000 megabases in size, but it is not the largest genome known. Some plant genomes are ten times larger.

The Goals of Genomics
Because the genome is all the genetic material in an organism, the quest for completeness is a hallmark of most genomic studies. Thus, instead of studying a single gene or protein, genome scientists study all the genes in a species, or all the proteins in a certain cell type. Having a complete, or almost complete, sequence of the genome of a species revolutionizes research on that species. If investigators do not have an essentially complete genome sequence and protein set, they have to rely on a comparatively small subset of all the known sequences and proteins to formulate testable hypotheses. It is like trying to understand the mechanics of a complex engine while knowing only a small number of its parts and realizing that many other parts, which you can only imagine, must be present. With an essentially complete genome sequence and protein set, investigators have something close to a complete parts list, and thus all hypotheses must invoke those parts and only those parts. In effect, prior to genomics, there was no practical limit on the possibilities that could be investigated. With essentially complete genome sequences, biology has become a finite science, albeit with a very large number of “parts” (proteins and genes) to try to understand (see reference list for Lander, Science, 1996).
The goal of genomics is not just to list the sequences of the genomes of a large number of organisms. The goal is to achieve functional understanding, such as how the genes and other components of genomes are read and used to enable living processes. Thus, once a sequence is determined, the first orders of business are to find as accurately as possible all the protein-coding genes, to establish all the DNA sequences that play any role in the organism, and to ascertain how these proteins and DNA sequences generate cellular and organismal function. The production of a protein is the “expression” of a gene. The DNA sequences that control gene expression are called gene regulatory elements. In addition to protein-coding genes, a critical class of genomic elements being sought is the set of DNA sequences that determine when in development and where in an organism the protein encoded by a gene should be produced.
These goals overlap with the classical disciplines of biochemistry, chemistry, and physiology, and they use the techniques of genetics and molecular biology for functional tests. Moreover, the massive amounts of data to be processed have brought in expertise from computer science, statistics, mathematics, and physics to store and analyze the information. Even astronomy is involved in genomics via Penn State’s astrobiology programs. Thus, genomics has become a melting pot of input from all the major life sciences and physical sciences. This characteristic certainly is evident in Penn State’s Eberly College of Science, in which investigators from many disciplines work together in the Institute of Genomics, Proteomics, and Bioinformatics in the Huck Institutes of Life Sciences.

Genome Sequences
The most basic effort in genomics is to determine the sequences of genomes. For large, complex genomes, this effort occurs in combination with the building of genetic and physical maps, which show the positions of notable genes (such as a gene that, when mutated, causes cystic fibrosis in humans) or anonymous segments of DNA (which simply serve a role equivalent to mile markers along a highway).
For smaller genomes, such as those of bacteria, a random or “shotgun” approach can be successful. Researchers at Penn State have played major roles in determining the sequences of many species of bacteria. Donald Bryant, the Ernest C. Pollard Professor of Biotechnology and professor of biochemistry and molecular biology, has specialized in sequencing the genomes of twenty phototrophic bacteria, which harvest the energy of sunlight to fuel their metabolism and build macromolecules. Greg Ferry, the Stanley Person Professor of Molecular Biology, has collaborated on the sequencing of several archaea that produce methane. Stephan Schuster, associate professor of biochemistry and molecular biology, has sequenced many bacterial genomes and has pioneered a comparative approach to understanding physiological differences that lead one species of bacteria to cause ulcers in humans but allow a closely related species to be benign.
In addition to determining genome sequences, other investigators at Penn State are pursuing high-throughput sequencing approaches to understand particular biological processes. One good example is the Floral Genome Project, a collaborative project involving several universities. At Penn State, the investigators are two professors in the Department of Biology, Drs. Hong Ma and Claude dePamphilis, with John Carlson, Director of the Schatz Center for Tree Molecular Genetics in the School of Forestry. These investigators are finding genes that are critical for flower formation and are comparing them across a wide range of species. This research will deepen our understanding of the evolutionary history and variety of mechanisms in this critical aspect of plant reproduction.
Recently, investigators in the Penn State Center for Comparative Genomics and Bioinformatics (CCGB) determined DNA sequences from a woolly mammoth, in collaboration with investigators at McMaster University, Oxford University, the Russian Academy of Sciences, and elsewhere. Woolly mammoths have been extinct since the Ice Ages, but the research team used a state-of-the-art nanotechnology approach to determine DNA sequences from a 28,000-year-old specimen recovered from the permafrost of Siberia. Stephan Schuster has pioneered application of this technology at Penn State, currently one of the few institutions in the world with this capacity. Samples such as the woolly mammoth are preserved in permafrost, but they also contain other organisms, including bacteria, fungi, and viruses. Thus, the DNA sequences determined are actually of many species, comprising a metagenome. To find woolly mammoth sequences in this mixture, Webb Miller, professor of computer science and biology, compared the new sequences to the genome sequences of elephants and other mammals. This combination of nanotechnology sequencing and comparative analysis opens many possibilities for investigating ancient and complex systems.
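The idea of picking mammoth sequences out of a metagenome by comparison with a related genome can be illustrated with a toy sketch. It assumes a crude stand-in for real sequence comparison: a read is kept when at least half of its short subsequences (k-mers) also occur in a reference. The sequences, the function names, the k-mer length, and the threshold are all hypothetical choices made for illustration, not the method used by the research team.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(read, reference_kmers, k):
    """Fraction of the read's distinct k-mers that also occur in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & reference_kmers) / len(read_kmers)


def select_reads(reads, reference, k=4, threshold=0.5):
    """Keep reads that share at least `threshold` of their k-mers with the reference."""
    reference_kmers = kmers(reference, k)
    return [r for r in reads if shared_fraction(r, reference_kmers, k) >= threshold]


# Hypothetical toy data: a short "elephant" reference and three reads.
elephant_reference = "ATGGCGTACGTTAGCCTAGGCTTAACGGATCC"
reads = [
    "GCGTACGTTAGC",  # identical to part of the reference
    "GCGTACCTTAGC",  # one substitution relative to the reference
    "TTTTTTTTTTTT",  # unrelated, standing in for a contaminating microbe
]

mammoth_like = select_reads(reads, elephant_reference)
print(mammoth_like)
```

In this toy case the first two reads are kept and the unrelated read is discarded. Real analyses rely on alignment programs and statistical criteria rather than a fixed k-mer threshold.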

Finding Functional Elements within Genomic Sequences
Once a genome sequence has been determined, the portions of it that play a role in the biology of the organism need to be identified. Most efforts to date have gone into gene identification. Although much progress has been made, finding all the genes remains an area of intense investigation. Part of the problem comes from the structure of the protein-coding genes in complex organisms. The mRNAs, which carry the message for the proteins, are encoded in pieces in the genome. Each portion that codes for mRNA is called an exon, and the segments between them are called introns. Finding the exons and sewing them together accurately remains a major challenge because genes frequently are more than ten times larger than their mRNA-coding portions and introns are almost always much larger than exons. In addition, the fact that a gene can consist of tens to hundreds of exons has been exploited by organisms to increase the diversity of proteins produced from genes. For a substantial majority of genes, distinctive sets of exons are spliced together to produce RNAs for different tissues. Sometimes the alternative splicing patterns are distinctive to different tissues, and many times the different proteins can have varying or opposite properties.
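The exon and intron arrangement described above can be sketched in a few lines. The gene string, the exon coordinates, and the function name below are invented for illustration; introns are written in lowercase only to make them easy to see.

```python
def splice(gene, exons):
    """Join exon segments (0-based, end-exclusive coordinates) into one message."""
    return "".join(gene[start:end] for start, end in exons)


# Hypothetical gene: three exons (uppercase) separated by two introns (lowercase).
gene = "AAACCC" + "gtgtgt" + "GGGTTT" + "gtatat" + "CCCAAA"
exon_coords = [(0, 6), (12, 18), (24, 30)]

# One splicing pattern uses every exon.
full_message = splice(gene, exon_coords)

# An alternative pattern skips the middle exon, giving a different message
# (and therefore potentially a different protein) from the same gene.
skipped_message = splice(gene, [exon_coords[0], exon_coords[2]])

print(full_message)
print(skipped_message)
```

The hard part in practice is the reverse problem: the exon coordinates are not known in advance and must be inferred from the genomic sequence.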
Comparative genomics is a major approach to finding functional elements in genomes. This approach is grounded in research insights from molecular evolutionary genetics over the past 25 years, many emanating from the Penn State Institute of Molecular Evolutionary Genetics in the Eberly College of Science, headed by Masatoshi Nei, Evan Pugh Professor of Biology and director of the institute. Many of the changes in DNA and protein sequences between species have little to no functional consequence; these are called neutral changes. Some genomic segments or portions of proteins are critical for a function that is common to the two species being compared. In those cases, because changes in sequence are detrimental, those changes are cleared rapidly from a population. This process of purifying, or negative, selection is a strong constraint on the sequences. One consequence is that comparisons of functional DNA between species show significantly more similarity than do comparisons of neutral DNA. A third class of sequence changes is beneficial to a species because the changes help it adapt to changes in the environment. Such changes will be favored when they occur in a population and will be rapidly fixed in the genome sequence of a species; this is referred to as positive, or Darwinian, selection. Thus, comparisons of sequences in which changes are adaptive in one or the other species will show significantly less similarity than expected for neutral DNA. A major goal of comparative genomics is to identify all the segments of DNA under constraint (negative selection) and all the segments that confer an adaptive advantage (positive selection).
The first step in comparative genomics is to align the DNA sequences of two or more species or of multiple individuals from the same species in a population. From the alignments of the portions of genomes thought to have no function, we can estimate the rate at which neutral DNA has changed. We then can ascertain whether a sequence of interest has changed more slowly than neutral DNA, which is taken as evidence of evolutionary constraint, or has changed faster than neutral DNA, which suggests that the DNA differences are adaptive. Much progress has been made in this area in the past few years, and investigators now have access to alignments and indicators of functionality genome-wide. However, none of the steps in the process are perfected, and much of the research in the Huck Institutes’ Center for Comparative Genomics and Bioinformatics at Penn State is devoted to improving all the steps.
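The comparison against a neutral rate can be sketched as follows. This is a deliberately simplified illustration: it measures change as the fraction of aligned positions that differ, and it calls a region constrained or possibly adaptive when its rate falls outside an arbitrary band around the neutral rate. The sequences, the function names, and the `tolerance` parameter are assumptions made for this example; real studies use explicit models of sequence change and statistical tests.

```python
def fraction_different(seq_a, seq_b):
    """Fraction of aligned, ungapped columns at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("alignment has no ungapped columns")
    return sum(a != b for a, b in pairs) / len(pairs)


def classify(region_rate, neutral_rate, tolerance=0.5):
    """Label a region by comparing its rate of change with the neutral rate."""
    if region_rate < neutral_rate * (1 - tolerance):
        return "constrained"
    if region_rate > neutral_rate * (1 + tolerance):
        return "possibly adaptive"
    return "consistent with neutral"


# Hypothetical aligned sequences from two species.
neutral_rate = fraction_different("ACGTACGTAC", "ACGTTCGTAA")      # presumed non-functional DNA
conserved_rate = fraction_different("GATAAGGATA", "GATAAGGATA")    # unchanged between species
fast_rate = fraction_different("ACGTACGTAC", "TCGAACCTAG")         # many differences

print(classify(conserved_rate, neutral_rate))
print(classify(fast_rate, neutral_rate))
print(classify(neutral_rate, neutral_rate))
```

The key design point is that no rate is interpreted on its own; each is judged only relative to the neutral estimate taken from the same pair of species.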
Among the major contributions of the Center for Comparative Genomics and Bioinformatics to comparative genomics are the algorithms for aligning genomic DNA sequences that are very long. Webb Miller has pioneered the formulation of algorithms that can handle the very long sequences that now need to be aligned, with sizes over a hundred megabases per chromosome. The DNA sequences being compared also have changed by large-scale rearrangements, insertions of multiple copies of DNA of similar sequence (repeated DNA elements), and duplications of genes. These complications increase the challenge of computing reliable alignments, and Dr. Miller’s software has been engineered to overcome these difficulties. Scientists in the Center for Comparative Genomics and Bioinformatics collaborate with the groups of David Haussler and Jim Kent at the University of California at Santa Cruz (UCSC) to compute alignments of whole genomes as they are assembled, beginning with human-mouse alignments in 2002 (see list of references for the mouse-genome paper, Nature 2002). These are massive computations, in which effectively every string of nucleotides in the 3000-megabase genome of humans is given the opportunity to align with any string in a second species. The project now extends to alignments among many mammals (human, chimpanzee, mouse, rat, dog), a bird (chicken), and several fish. Whole-genome alignments are now available on the UCSC Genome Browser, not only for these vertebrate comparisons but also for multiple species of insects, worms, and yeast. These alignments and other annotations are invaluable resources that researchers access freely, thousands of times daily, from sites around the world.
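To give a feel for what "aligning" means computationally, here is a minimal sketch of a textbook local-alignment score computed by dynamic programming (the Smith-Waterman recurrence). It is not Dr. Miller's software and not the method used for the whole-genome alignments; the scoring values and the function name are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gap cost)."""
    best = 0
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell may restart at 0, extend diagonally, or open a gap in either sequence.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```

The running time grows with the product of the two sequence lengths, so applying this recurrence directly to chromosomes of a hundred megabases or more is impractical. That scale is one reason specialized alignment algorithms are needed.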
Alignments are the foundation of comparative genomics, and many investigators are analyzing them to find the signatures of constraint and adaptive evolution discussed above. One innovative approach pioneered in Penn State’s Center for Comparative Genomics and Bioinformatics uses patterns in alignments to predict the function of constrained elements. The basic idea is to identify alignment patterns that characterize certain classes of functional elements, such as those that regulate the expression of genes. Alignments in sets of known functional elements are used to train statistical models, which can then be applied to any alignment to estimate its likelihood of having a particular function. Francesca Chiaromonte, associate professor of statistics, is tackling the technical challenges involved in the development of these predictions in collaboration with other Center for Comparative Genomics and Bioinformatics faculty and young researchers such as James Taylor, a current Ph.D. candidate in computer science and engineering, and Diana Kolbe, a former honors student in the Eberly College of Science.
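The train-then-score idea can be sketched with a very simple stand-in model. The article does not specify the statistical models used, so the example below makes its own assumptions: each alignment column is reduced to one of three symbols (match, mismatch, gap), symbol frequencies are estimated separately from known functional and presumed neutral training alignments, and a new alignment is scored by summing log-odds over its columns. All sequences and names are hypothetical.

```python
import math
from collections import Counter

SYMBOLS = ("match", "mismatch", "gap")


def column_symbols(seq_a, seq_b):
    """Reduce each aligned column to 'match', 'mismatch', or 'gap'."""
    symbols = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            symbols.append("gap")
        elif a == b:
            symbols.append("match")
        else:
            symbols.append("mismatch")
    return symbols


def train(alignments, pseudocount=1.0):
    """Estimate symbol frequencies from training alignments, with pseudocounts."""
    counts = Counter()
    for seq_a, seq_b in alignments:
        counts.update(column_symbols(seq_a, seq_b))
    total = sum(counts[s] for s in SYMBOLS) + pseudocount * len(SYMBOLS)
    return {s: (counts[s] + pseudocount) / total for s in SYMBOLS}


def log_odds(alignment, functional_model, neutral_model):
    """Sum of per-column log-odds; positive values favor the functional model."""
    seq_a, seq_b = alignment
    return sum(
        math.log(functional_model[s] / neutral_model[s])
        for s in column_symbols(seq_a, seq_b)
    )


# Hypothetical training sets of pairwise alignments.
functional_training = [("GATAAG", "GATAAG"), ("CACGTG", "CACGTG")]
neutral_training = [("ACGTAC", "AT-TCC"), ("GGCATA", "GACA-A")]

functional_model = train(functional_training)
neutral_model = train(neutral_training)

print(log_odds(("GATAAG", "GATAAG"), functional_model, neutral_model))
print(log_odds(("ACGTAC", "TT-AGC"), functional_model, neutral_model))
```

Treating columns as independent is the main simplification here; a model that looks at runs of adjacent columns can capture patterns that single-column frequencies miss.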
One area of intense study, which illustrates the utility of the aligned sequences, currently probes the sequence comparisons between human and chimp to find candidates for the differences that make us distinctively human. This research combines the abundant information on human polymorphisms with human-chimp sequence alignments to find more reliable signatures of adaptive evolution, such as sequence changes that are beneficial to a human rather than to a chimp. For example, some studies have suggested that genes expressed in neural tissues are subject to stronger adaptive evolution than those expressed in other tissues. This result may be intuitively satisfying because it is consistent with a greater sophistication for humans than for chimps in neural processes, perhaps including cognition. More discoveries are needed, however, to provide greater insights into what makes us uniquely human. Ongoing research that is studying these and related issues is taking place in the Center for Comparative Genomics and Bioinformatics, including studies in the laboratory of Kateryna Makova, assistant professor of biology, and collaborations with Ken Weiss, Evan Pugh Professor of Biological Anthropology and Genetics.
Other investigators in the Center for Comparative Genomics and Bioinformatics are applying comparative approaches to improve our knowledge of how genes encode proteins. For example, one mechanism for increasing the diversity of proteins from a given gene is for the translational machinery of the cell to read more than one protein sequence from a single exon. This process was once thought to be rare, but Anton Nekrutenko, assistant professor of biochemistry and molecular biology, has used comparative genomics to find evidence of a much larger number of these multiple encodings than was previously appreciated. In addition, Wojciech Makalowski, associate professor of biology, has shown that a surprisingly large number of exons have evolved from transposable elements, which are segments of DNA that move around in the genome.

Mechanisms of Genome Function
The bioinformatic comparative analyses of DNA and protein sequences lead to insights into molecular evolution and predict candidates for important DNA sequences. However, such predictions must be tested experimentally before they can be truly valuable. Indeed, genomics research fits well in the traditional investigative mold familiar in physics, chemistry, and astronomy, in which computational or theoretical predictions first are tested in the laboratory, then the results are employed to refine and improve the predictions.
The DNA segments predicted to be needed for regulating the timing and abundance of gene expression are being tested in the laboratory of Ross Hardison, T. Ming Chu Professor of Biochemistry and Molecular Biology, in the Center for Comparative Genomics and Bioinformatics. Using the techniques of molecular and cell biology, members of his laboratory have shown that a large fraction of the predicted regions are functional in cultured mammalian cells, and that improvements in the statistical models also result in a higher validation rate. The goal of the planned improvements is to allow reliable predictions of gene regulatory elements to be made throughout the genome for a variety of developmental systems (such as determination of cell lineage or maturation of red blood cells).
The information in the genome is expressed as proteins, which carry out most of the actions in cells. Advances in knowledge of protein sequences have been dramatic, but accurately inferring the molecular or physiological activity of most proteins remains problematic. Arthur Lesk, professor of biochemistry and molecular biology in the Center for Comparative Genomics and Bioinformatics, is a pioneer in the correlation of three-dimensional protein structures with functions, and his laboratory is developing novel methods of comparative protein-structure analysis to predict the function of proteins.
Several other faculty members in the Eberly College of Science are making dramatic advances using genome-wide analyses to understand important biological problems. Among these are Nina Fedoroff, Evan Pugh Professor of Life Sciences and the Verne M. Willaman Chair in Life Sciences, who is studying stress-induced changes in gene expression in plants, and Frank Pugh, professor of biochemistry and molecular biology, who is investigating molecular mechanisms of gene regulation in yeast. Deciphering trends and connections in the large data sets resulting from genomic analysis is difficult, and novel methods are being developed both at Penn State and elsewhere. One promising approach applies network theory to these data, and Reka Albert, assistant professor of physics, who is a world leader in this field, is working with several investigators at Penn State to use this approach to find important insights.
Investigators in the Center for Comparative Genomics and Bioinformatics are part of an international consortium whose goal is to determine all the functional DNA sequences in the human genome. The project, sponsored by the National Human Genome Research Institute in the National Institutes of Health, will produce an Encyclopedia of DNA Elements (known as ENCODE) through a combination of multiple genome sequences, bioinformatic analyses, and high-throughput biochemical assays. This exciting project will make freely available comprehensive knowledge of transcribed sequences and the biochemical changes that occur on chromosomes to allow gene expression. These results should make important strides toward interpreting genome sequences in terms of the function and mechanism of action of the proteins they encode.

Databases and Servers
One key to the rapid growth and success of genomics is the widespread commitment to the public release of complete data, whether it is genome sequences, gene-expression results, or data from the ENCODE project. These are wonderful resources, but keeping track of all this information is a formidable challenge. Researchers at Penn State have been collaborating with other investigators at organizations that are responsible for the primary repositories of genome sequence data, such as the University of California at Santa Cruz, the National Center for Biotechnology Information, and the Wellcome Trust Sanger Institute in the United Kingdom, to develop methods and resources for the integration and analysis of these data. One project, headed at Penn State by Anton Nekrutenko in the Center for Comparative Genomics and Bioinformatics and Istvan Albert in the Bioinformatics Consulting Center, has produced a public, Internet-based server, called Galaxy, that allows users to gather data from a variety of sources and to perform sophisticated analyses, all on one site. This resource makes the results of research readily available to students and researchers, plus it enables students in many different disciplines to carry out data analyses that only a short time ago would have required highly specialized knowledge.

Translational Aspects: Implications for Human Health and Well-being
Since the early part of the 20th century, we have known that some disorders are inherited. More recently, we have learned that genetic differences among humans play important roles in susceptibility to certain diseases, such as cancer and diabetes. Thus, information in our genome has a large impact on our health. Clearly, lifestyle and environment can be major determinants of health, but these factors affect each of us differently, depending on our genetic makeup. Knowledge about individual variations in genome sequences and how they correlate with likelihoods of particular health problems or efficacy of certain therapies could improve human health. Advances in sequencing technology may make it feasible to determine genome sequences of individuals in the near future.
These advances bring the promise of individualized medicine, with preventative and therapeutic regimens prescribed according to the best match with a person’s genetic makeup. This exciting prospect is clouded by fears that the genetic information can be used to the disadvantage of an individual. One common fear is that an insurance company will raise the premiums or refuse to insure persons carrying alleles associated with problematic traits. Legislation to prevent such actions has been submitted to the United States Congress, but as of this writing it has not been acted upon. Protecting the rights of individuals is a critical first step before the wonderful insights from genomic studies can be translated into benefits for individuals.
