Pál Venetianer
The information explosion in biology
The title expresses a commonplace: in the last few decades we have been witnessing a tremendous explosion of information in biology, with far-reaching consequences for science itself, for its applications in human medicine and agriculture, for society in general, and - without exaggeration - for the future of the human race. In this lecture I am going to discuss three main subjects: the quality and quantity of this information, the problems of handling, storing and using it, and the social implications of the information explosion.
Let us begin with a very superficial historical outline. Traditional biology, until the end of the 19th century, was more or less a descriptive science. It dealt with an enormous variety of phenomena not amenable to quantification, or to the formulation of laws that could be expressed in mathematical formulas. Even the theory of Darwin, despite its enormous effect on the further development of biology, and despite its explanatory power, did not fulfill the conditions that Popper's philosophy requires of a "law" of science. This situation changed dramatically with the discovery of Mendel's laws of genetics, or rather with the rediscovery of Mendel's work in 1900. The newly born science of genetics was unique, not only in biology but in the entire realm of science, because (to use an inaccurate, broad generalization) it is a "digital", not an "analog" science. The geneticist does not deal with effects determined by statistical laws governing the interactions of large numbers of atoms and molecules; he or she deals with events occurring in individual molecules that determine the form and fate of whole organisms. The geneticist does not measure, he or she counts. This was true even in the early stages, when the word "gene" was only an abstract concept, without any knowledge of its real nature, still less of its structure. Around the middle of the century, however, with the discovery of the Watson-Crick model of DNA, with the deciphering of the genetic code, and with the birth of molecular biology, it became clear that the whole set of genes (the genome), which determines all the heritable properties of any living organism, can be conceived as a linear sequence of information consisting of a large number of digits (that is, nucleotides), each digit corresponding to two bits, since each position holds one of four possible nucleotides. This was indeed a revolutionary development, although it offered only a framework: the actual information content was not yet known in any case.
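The remark that each nucleotide corresponds to two bits can be made concrete with a minimal Python sketch. The particular assignment of bit patterns to the four bases and the function names `pack`, `unpack` and `bits_needed` are illustrative assumptions; any fixed assignment would serve equally well.

```python
# Hypothetical mapping: any fixed assignment of the four bases
# to the four possible 2-bit patterns would work.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = {bits: base for base, bits in CODE.items()}


def pack(seq: str) -> int:
    """Pack a DNA string into one integer, two bits per nucleotide."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def unpack(value: int, length: int) -> str:
    """Recover a DNA string of known length from its packed integer."""
    bases = []
    for _ in range(length):
        bases.append(BASE[value & 0b11])  # read the lowest two bits
        value >>= 2
    return "".join(reversed(bases))


def bits_needed(length: int) -> int:
    """Number of bits required to store `length` nucleotides."""
    return 2 * length


packed = pack("GATTACA")
print(bin(packed))          # 0b10001111000100
print(unpack(packed, 7))    # GATTACA
print(bits_needed(7))       # 14
```

The length has to be stored separately, because leading "A" nucleotides are encoded as zero bits and would otherwise be lost.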
Twenty years later, however, the Nobel Prize-winning discoveries of Sanger and Gilbert enabled scientists to decipher the genetic information stored in the genetic material as the linear sequence of nucleotides. The following figures illustrate our progress in DNA sequence determination and the advances in the efficiency and speed of this process.
The progress of DNA sequence determination
| 1977 | phiX174 bacteriophage | 5386 bp |
| 1981 | human mitochondrion | 16569 bp |
| 1982 | lambda bacteriophage | 48509 bp |
| 1986 | tobacco chloroplast | 155844 bp |
| 1990 | cytomegalovirus | 230 kbp |
| 1992 | yeast, chromosome III | 315 kbp |
| 1995 | Haemophilus influenzae (bacterium) | 1830 kbp |
| 1996 | Saccharomyces cerevisiae (yeast) | 13478 kbp |
| 1997 | Escherichia coli (bacterium) | 4639 kbp |
| 1998 | Caenorhabditis elegans (worm) | 97 Mbp |
| 1999 | Drosophila melanogaster (fly) | 180 Mbp |
| 2000 | Arabidopsis thaliana (plant) | 115 Mbp |
| 2001 | Homo sapiens | 3150 Mbp |

Advances in the efficiency and speed of DNA sequencing
| 1977 | Sanger: about 300 bases/man-year |
| 1990 | 20 kb (kilobases)/man-year |
| 2000 | Celera: 1 kb/second |

Number of completely sequenced genomes
| 599 | viruses and viroids |
| 205 | plasmids |
| 185 | organelles (plastid, mitochondrion) |
| 33 | eubacteria |
| 7 | archaebacteria |
| 1 | fungus (Saccharomyces) |
| 1 | plant (Arabidopsis) |
| 2 | animals (Caenorhabditis, Drosophila) |
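
To give a feeling for the sequencing rates listed above, the following Python sketch estimates how long a single pass over the human genome (3150 Mbp, from the first table) would take at each rate. It is rough arithmetic under stated assumptions: every base is read exactly once, with no repeated reads or assembly work, and the function names are illustrative.

```python
GENOME_BP = 3_150_000_000  # human genome size from the table (3150 Mbp)
SECONDS_PER_DAY = 24 * 3600


def man_years(genome_bp: int, bases_per_man_year: float) -> float:
    """Man-years needed to read every base once at the given rate."""
    return genome_bp / bases_per_man_year


def days_at_rate(genome_bp: int, bases_per_second: float) -> float:
    """Days needed to read every base once at the given rate."""
    return genome_bp / bases_per_second / SECONDS_PER_DAY


print(man_years(GENOME_BP, 300))        # 1977 rate: 10500000.0 man-years
print(man_years(GENOME_BP, 20_000))     # 1990 rate: 157500.0 man-years
print(days_at_rate(GENOME_BP, 1_000))   # 2000 rate: about 36.5 days
```

Under these simplifying assumptions the task shrinks from roughly ten million man-years to a matter of weeks.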
This spectacular progress culminated in the announcement on 26 June 2000, and the publication in February 2001, of the essentially complete determination of the 3.15-billion-nucleotide sequence of human DNA. This work is not finished yet. The expression "essentially complete" is a euphemism: the sequence published by the two teams covers only 84% of the genome, it contains more than 100,000 gaps, and only about one third of it can be considered error-free. There is still a long way to go until the real completion. From a practical (medical) point of view, an even more important mass of information is generated by those teams that collect information on individual differences occurring in this consensus sequence with a frequency higher than 1%. In the jargon of molecular biologists these are called "snips", that is, single nucleotide polymorphisms (SNPs). As of the beginning of this year, the number of such "snips" in the data banks totalled 1.4 million, and this number has been rapidly increasing ever since. These SNPs might serve as the basis of the "personalized medicine" of the future, when drugs will be developed and prescribed to fit the individual genotype of the patient.
The announcement of the completion of the mouse DNA sequence followed in less than three months. The amount of DNA sequence information available in data banks exceeded 10^10 nucleotides (2.5 gigabytes at two bits per nucleotide) by 2000, and its doubling time is now less than one year.
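The storage figure follows directly from the two-bits-per-nucleotide encoding, as the short Python sketch below shows for ten billion (10^10) nucleotides. The projection assumes a doubling time of exactly one year, which the text gives only as an upper limit, so the projected numbers are a lower bound; the function names are illustrative.

```python
def storage_bytes(nucleotides: int, bits_per_nucleotide: int = 2) -> float:
    """Bytes needed to store the sequence at the given bits per nucleotide."""
    return nucleotides * bits_per_nucleotide / 8


def projected_nucleotides(start: float, years: float,
                          doubling_time_years: float = 1.0) -> float:
    """Amount of sequence after `years` of steady exponential growth."""
    return start * 2 ** (years / doubling_time_years)


print(storage_bytes(10**10))              # 2500000000.0 bytes = 2.5 gigabytes
print(projected_nucleotides(10**10, 10))  # 10240000000000.0 after ten doublings
```

Ten years of such growth multiplies the stored sequence by about a thousand.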
Although the DNA sequences represent by far the largest and most important mass of biological information obtained and stored in the last several years, they are not the only one. The linear nucleotide sequence of the genes strictly determines the linear amino acid sequence of the encoded proteins, and this in turn determines the folding and thereby the final, biologically functional three-dimensional structure of the proteins of living beings. However, this latter determination is not so strict, and even if it were, we are still unable to calculate and accurately predict the spatial structure from the linear sequence. Therefore the experimental determination of three-dimensional structures has also been progressing rapidly of late; for hemoglobin, the first protein for which this problem was solved, it took Max Perutz 32 years of painstaking work. The quantification and digitization of such irregular, complicated structures is much more difficult than in the case of the linear DNA sequence, and their storage requires much more computer memory, but it can be done, and at present the data banks contain the parameters of approximately 13,000 different protein structures.
Another new source of information is DNA-array, or DNA-chip, technology. These DNA chips contain thousands, or rather tens of thousands, of short pieces of DNA (bits of genes), and they can be used to assess the intensity of expression (activity) of these genes in a given type of cell, under any particular external condition, or at any stage of development. By this means one can compare quantitatively the expression of the same genes in, for example, two different types of breast cancer cells, several hundred genes in a single experiment. The rapidly expanding experimental use of these arrays generates an incredible number of data points.
Another type (or rather, several other types) of new information is generated by the further analysis of organisms whose DNA has been completely sequenced. For example, a huge experiment attempted to determine and catalogue all the possible pairwise interactions between the several thousand proteins encoded by the yeast genome. Another collective, multilateral project tries to determine the function of all 6000 identified yeast genes under a set of collectively agreed-upon, well-defined experimental conditions.
The approximately 5 million different living species on earth offer enough raw material for information-hungry scientists for decades, if not centuries, but this is not all. A startling new discovery revealed that the presently known diversity of microorganisms represents only a small fraction (less than 1%) of the number of species that really exist. Most of this universe has remained hidden so far, because nobody could cultivate these microbes in the laboratory. Our new methods, however, enable us to determine the DNA sequences of these unknown microbes without ever seeing them under the microscope or in the culture flask, and this information in turn can serve as the basis for the production of new proteins with amazing properties.
Of course my brief overview of the various kinds of new information overflowing our publications, computers and minds shows only the tip of the iceberg, but perhaps it can help to explain the words of Edward Ueberbacher: "There's a big data wave coming, anyone with any sanity should get out of its way".
If we cannot get out of its way, how can we manage to handle it?
I am not an expert in informatics and computers, therefore I shall refer only briefly to some of the problems generated by this vast treasure trove of information. First of all, the ways of publishing it have changed thoroughly. The normal scientific publication can no longer include all the information it intends to convey. Although both papers on the human DNA sequence were unusually long (about 40 pages each), they did not contain any part of the actual sequence, the printing of which would have required a library of more than 1000 large volumes. Even smaller genome sequences, "snips", coordinates of protein structures, or detailed results of DNA-array experiments are no longer printed in the publications; they are only stored in the data banks. The general free availability of these data (versus patented, privately owned information, or limited access for a fee) is a very contentious issue and I am not going to discuss it here.
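The estimate of more than 1000 large volumes for a printed human sequence can be checked with back-of-envelope arithmetic. In the Python sketch below, the page density (3000 letters per page) and the volume size (1000 pages) are assumptions chosen for illustration, not figures from the lecture; only the genome length of 3.15 billion nucleotides comes from the text.

```python
GENOME_LETTERS = 3_150_000_000  # one printed letter per nucleotide

# Assumed layout of a "large volume" (hypothetical values).
LETTERS_PER_PAGE = 3_000
PAGES_PER_VOLUME = 1_000


def volumes_needed(letters: int, letters_per_page: int,
                   pages_per_volume: int) -> int:
    """Whole volumes needed to print `letters`, rounding up."""
    letters_per_volume = letters_per_page * pages_per_volume
    return -(-letters // letters_per_volume)  # integer ceiling division


print(volumes_needed(GENOME_LETTERS, LETTERS_PER_PAGE, PAGES_PER_VOLUME))  # 1050
```

With a denser page or thicker volumes the count falls, but it stays in the range of hundreds to thousands of books.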
Another problem, essentially technical but very important, is the format of this vast amount of information. It would obviously be advantageous if anybody could analyze the information provided by anybody else and compare it with data obtained from other sources. This would require a uniformly agreed and accepted format for all sorts of data, annotated and stored identically everywhere. In this way data mining programs, similarity searches and large-scale comparisons could be applied uniformly to all data banks, using the same software tools. At present we are very far from this ideal. The software tools are progressing very rapidly, but there is ample room to develop better and more powerful programs. That is one of the reasons why the experts of an entirely new field, the bioinformaticians, are presently the hottest commodities on the international job market.
Finally, one of the most pressing problems of handling and using this wealth of information is its quality. In other words, how reliable and correct is the information stored in the data banks? Unfortunately the situation is not very reassuring. Especially in the earlier stages of the Human Genome Project, yeast and bacterial DNA sequences were frequently found among the allegedly human ones. Published and seemingly established sequences are frequently found to be incorrect. This could be very dangerous, because the omission or insertion of a single nucleotide (which could be the result of a typing error) might change the protein structure (the "meaning") of an entire gene. Therefore stored data require constant efforts in curation, revision and double-checking.
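Why a single omitted nucleotide is so dangerous can be shown with a small Python sketch. It builds the standard genetic code table and translates a short, made-up gene before and after one nucleotide is deleted; the example sequence and the function names are invented for illustration.

```python
# Standard genetic code, codons enumerated in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))  # "*" marks a stop codon


def translate(dna: str) -> str:
    """Translate every complete codon, reading from the first base."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def delete_base(dna: str, index: int) -> str:
    """Return the sequence with the nucleotide at `index` omitted."""
    return dna[:index] + dna[index + 1:]


original = "ATGGCCAAGTTTGGATAA"      # invented example sequence
mutant = delete_base(original, 4)    # a single nucleotide is lost

print(translate(original))  # MAKFG*
print(translate(mutant))    # MASLD
```

Every codon downstream of the deletion is read in a shifted frame, so all the following amino acids change and even the stop signal is lost.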
In the last part of my lecture, a few words must be said about the social implications of an important subset of the biological data. Obviously the DNA sequence (or part of the DNA sequence) of any human individual represents a very personal form of information, the privacy of which is at least as important as that of any other personal data, and probably more so. Most experts agree that this information should be treated accordingly: that is, it should be everybody's sacred right to allow or deny the obtaining, reproducing and storing of this information, and to allow or deny anybody else access to it or use of it. This, however, is easier said than done. Firstly, DNA can be obtained from anybody without his or her consent or knowledge (a drop of saliva, a morsel of stool, a lipstick smear on a cigarette butt, or a fingerprint suffices). In several countries the possibility of archiving the DNA of the entire population is seriously entertained. In fact the overwhelming majority of the people of Iceland have already accepted such a plan, and a similar project is being organized in Estonia. The aim in both cases is scientific and medical application, but it has obvious significance for criminal justice as well. In fact such DNA collection for a subset of the population (i.e., those with any criminal record) has been going on in Great Britain and in several states of the USA.
Even if the DNA information is obtained only for medical purposes, with the consent of the subject, its strict personal confidentiality can sometimes be disputed. A hotly debated issue is whether insurance companies or potential employers should have access to the DNA data of policyholders or employees. Most scientists (and the majority of public opinion) seem to agree that they should not, but this is not a simple problem, and several fairly convincing arguments can be marshalled in favor of the opposite view. In Great Britain, where individual human rights and privacy are sacrosanct, a recently accepted law confirmed the right of government authorities to use individual DNA data without the previous consent of the subjects, if matters of public health and welfare are at stake. Clearly, the legal and moral problems of obtaining and using individual human DNA data are not simple at all: several important issues should be clarified, and a social consensus must be reached. In this area we still have a long way to go.
Limits of time do not allow me to elaborate on all the rosy perspectives that the applications of this information explosion in biology offer in the various fields of human medicine, industry, protection of the environment and agriculture. The popular press and media frequently and successfully communicate these promises and possibilities. Unfortunately the same media also excel in scare-mongering, in exposing the alleged (in most cases nonexistent) dangers of some of these applications. Instead of dealing with these practical (real or imaginary) consequences, I should like to say a few words about the dangers of the theoretical, philosophical misconceptions about this information explosion.
Some proponents (eminent scientists) of the Human Genome Project, while lobbying for it during its preparatory stage, claimed that the determination of the human DNA sequence is equivalent to the perfect realization of the ancient Greeks' "Gnothi seauton" (know thyself), that is, to acquiring the ultimate and all-inclusive knowledge about Man. This is obviously untrue. Even if Laplace's demon existed, it would not be able to construct, or even to describe, a man using the DNA sequence information alone. Man, or for that matter any living organism, is not fully determined by the DNA sequence alone. The sequence represents a potential. Environmental effects play a strong role in influencing how much of this potential will be realized. In recent years we have become aware of the increasingly important contribution of so-called epigenetic factors (i.e., factors not "written" in the DNA) in determining the fate and shape of living creatures. It would also be misleading to believe that the measurements, the effects and the descriptions of the older biology will ever disappear and be replaced by bits of linear information. Life is, and will always be, more complex than a computer program.
Professor Pál Venetianer, Ungarische Akademie der Wissenschaften, Biologisches Forschungszentrum Szeged, Institut für Biochemie, H-6724 Szeged, Temesvári krt. 62