This idea was stolen blatantly from the Laboratory Exercises in Evolution at the
Biology Department, University of Virginia (Janis Antonovics, Joanna Vondrasek,
Doug Taylor), where it is set as a class exercise for learning phylogenetic
analysis. In turn, these people credit a similar idea to Barbrook et al. (1998.
The phylogeny of the Canterbury Tales. Nature 394: 839), although the
originators appear to be Robinson and O'Hara (1996. Cladistic analysis of an Old
Norse manuscript tradition. Research in Humanities Computing 4: 115-137). It is
an exercise in stemmatology, which can be a lot more tricky than you might think.
Stemmatology is the discipline that attempts to
reconstruct the transmission history of a written text on the basis of
relationships between the various extant versions (eg. manuscripts or
printings). These relationships can be revealed using phylogenetic networks,
which is the approach that I present here. A network is more appropriate than a
phylogenetic tree, for reasons that will become obvious — the evolution of books
is not a simple thing.
The original text of the Christian Bible was written mostly in
Hebrew and Aramaic for the Old Testament, and in Greek for the New Testament. It
was later translated into Latin, which was then standardized as the "Vulgate",
and this was then almost the only version used in churches for the best part of
a millennium. The only texts in Old English consisted usually of either the
Gospels or the Psalms only.
This situation was challenged in the late
14th century, when the first Middle English translations of the whole Bible
appeared. There was active resistance to this by the formal Church, and so the
idea of an English translation was dropped until the mid 16th century, when the
Reformation inspired attempts to translate the books into Modern English as part
of a new Protestant religion. These moves were sanctioned by the government,
with first the Great Bible (1539) and then the King James
Version (1611). Various revisions of the latter have appeared, especially
since the late 19th century. These days, there is a veritable cottage industry
producing new versions of the Bible for various purposes, usually based on the
original texts rather than on earlier translations, with various translation
principles being employed (eg. Formal Equivalence, Dynamic Equivalence, Closest
Natural Equivalence, etc).
You can consult the various versions of the
English-language Bible at one or more of several online sites:
The data used below were all
obtained from these sites. These sites suggest that the most famous
English-language versions of the Bible are: the Geneva Bible (1560), as
used throughout the Reformation, and by William Shakespeare as well as by the
"Pilgrim Fathers" in America, and the King James Version (1611), which
was the standard English text for a quarter of a millennium. The most widespread
current Bible is apparently the New International Version, which has been
updated several times since its first appearance in
1973.
Stemmatology
The text that I use is the third
sentence of the Bible — Genesis 1:3. (The biblical text was first numbered in
the Geneva Bible of 1560.) Here is a dated listing of that sentence in
all of the early English translations, plus most of the revisions up to the
mid-20th century, and a sample of the many recent versions:
1382 Wycliffe Bible: And God seide, Be maad li3t; and maad is li3t.
1395 Later Wycliffe: And God seide, li3t be maad; and li3t was maad.
1530 Tyndale Bible: Then God sayd: let there be lyghte and there was lyghte.
1535 Coverdale Bible: Than God sayd: let there be light: & there was lyght.
1537 Matthew Bible: And God sayde: let there be light, and there was light.
1539 Great Bible: And God sayde: let there be made lyght, and there was light made.
1560 Geneva Bible: Then God saide, Let there be light: And there was light.
1568 Bishop's Bible: And God sayde, let there be light: and there was light.
1609 Douay-Rheims Bible: And God said: Be light made And light was made.
1611 King James Version: And God said, Let there be light: and there was light.
1750 Challoner Revision: And God said: Be light made. And light was made.
1769 Blayney Revision: And God said, Let there be light: and there was light.
1833 Webster's Bible: And God said, Let there be light: and there was light.
1862 Young's Literal Translation: and God saith, 'Let light be;' and light is.
1885 English Revised Version: And God said, Let there be light: and there was light.
1890 Darby Bible: And God said, Let there be light. And there was light.
1901 American Standard Version: And God said, Let there be light: and there was light.
1950 Knox Bible: Then God said, Let there be light; and the light began.
1952 Revised Standard Version: And God said, "Let there be light"; and there was light.
1971 New American Standard Bible: Then God said, "Let there be light"; and there was light.
1973 New International Version: And God said, "Let there be light," and there was light.
1976 Good News Bible: Then God commanded, "Let there be light" — and light appeared.
1982 New King James Version: Then God said, "Let there be light"; and there was light.
1995 God's Word Translation: Then God said, "Let there be light!" So there was light.
1996 New Living Version: Then God said, "Let there be light," and there was light.
2011 Common English Bible: God said, "Let there be light." And so light appeared.
The first thing we need to do is align the text of these 26
versions, including both words and punctuation. This allows us to directly
compare each of the elements of the sentence, comparing like with like as far as
their features are concerned.
This is not as easy as it sounds. In this
alignment I have separated words when they seem to have a different intent — for
example, "was made" is not equivalent to "appeared". I can see endless arguments
about the alignment of any text; and, indeed, disagreements about the intent of
the original text are what have led to so many different versions of the Bible
being created in English.
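As a rough illustration of the first step, splitting each sentence into word and punctuation tokens might be sketched in Python as follows. The `tokenize` helper and its regular expression are assumptions made for this sketch, not the procedure actually used for the post; the harder task of deciding which tokens are equivalent across versions (and where gaps belong) still needs human judgement.

```python
import re

# A token is either a run of word characters (this keeps archaic
# spellings such as "li3t" in one piece) or a single punctuation mark.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens, in order."""
    return TOKEN.findall(sentence)


# Two sentences taken from the listing above.
kjv = tokenize("And God said, Let there be light: and there was light.")
coverdale = tokenize("Than God sayd: let there be light: & there was lyght.")

# These two happen to have the same number of tokens, so a naive
# position-by-position comparison is possible without inserting gaps.
differences = [(a, b) for a, b in zip(kjv, coverdale) if a != b]
for a, b in differences:
    print(f"{a!r} versus {b!r}")
```

Most pairs of versions will not line up this conveniently, which is where the manual alignment decisions described above come in.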
This alignment then needs to be coded as a
set of characters, which define the hypothesized homology between the various
elements of the text. In this case I ended up with 50 additive binary characters
for analysis. In general, I used Young's Literal Translation to determine
the ancestral state for each character, as this translation was an explicit
attempt to emulate the Hebrew original. A nexus-formatted version of the dataset
is available here.
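To give a feel for what coding an alignment as binary characters involves, here is a minimal Python sketch. It uses plain presence/absence coding of each variant in each aligned slot, which is a simplification: it does not reproduce the additive binary coding or the choice of ancestral states used for the real dataset. The three slots chosen (opening conjunction, punctuation after "said", final verb) and the `binary_code` helper are assumptions for illustration only, although the readings themselves come from the listing above.

```python
GAP = "-"  # marks a slot where a version has no corresponding token

# Hand-aligned slots for four of the versions (an illustrative subset):
# opening conjunction, punctuation after "said", and the final verb.
aligned = {
    "King James Version": ["And", ",", "was"],
    "Geneva Bible": ["Then", ",", "was"],
    "Tyndale Bible": ["Then", ":", "was"],
    "Common English Bible": [GAP, ",", "appeared"],
}


def binary_code(aligned):
    """Turn aligned slots into presence/absence characters.

    Every distinct non-gap variant in a slot becomes one binary
    character: 1 if the version has that variant, otherwise 0.
    """
    names = list(aligned)
    n_slots = len(next(iter(aligned.values())))
    labels = []
    matrix = {name: [] for name in names}
    for slot in range(n_slots):
        variants = sorted({aligned[n][slot] for n in names} - {GAP})
        for variant in variants:
            labels.append((slot, variant))
            for name in names:
                matrix[name].append(1 if aligned[name][slot] == variant else 0)
    return labels, matrix


labels, matrix = binary_code(aligned)
for name, row in matrix.items():
    print(name, "".join(str(bit) for bit in row))
```

With this coding, a gap simply scores 0 for every variant in its slot; other treatments of missing elements are equally defensible.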
Various network methods could be used to
summarize the character data. First, I have used a NeighborNet based on Hamming
distances, as I usually do (see my earlier
analyses). As you can see from the graph, there are no simple
tree-like relationships among these texts, which calls into question any
simplistic attempt at stemmatology. (Note that in two cases there are multiple
texts that have identical sentences, and thus they appear at the same location
in the graph.)
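For readers unfamiliar with Hamming distances, the following Python sketch shows how a pairwise distance matrix could be computed from binary characters. The four coded texts are made-up placeholders (not the real 50-character dataset), and the NeighborNet construction itself is not attempted here; this only covers the distances that such a method takes as input.

```python
from itertools import combinations


def hamming(a, b):
    """Number of characters at which two equal-length codings differ."""
    if len(a) != len(b):
        raise ValueError("codings must have the same number of characters")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(coded):
    """Pairwise Hamming distances, keyed by (name, name) pairs."""
    dist = {(name, name): 0 for name in coded}
    for a, b in combinations(coded, 2):
        dist[(a, b)] = dist[(b, a)] = hamming(coded[a], coded[b])
    return dist


# Hypothetical binary codings; text_A and text_C are deliberately identical.
coded = {
    "text_A": [1, 0, 1, 0, 0, 1],
    "text_B": [0, 1, 1, 0, 0, 1],
    "text_C": [1, 0, 1, 0, 0, 1],
    "text_D": [0, 0, 1, 0, 1, 0],
}

dist = distance_matrix(coded)

# Texts at distance zero cannot be separated by any distance-based
# method, which is why identical sentences share one point in the graph.
identical = [(a, b) for a, b in combinations(coded, 2) if dist[(a, b)] == 0]
print(identical)
```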
It is worth pointing out here that Barbrook
et al. (1998) produced a bush-like graph from their data for the Canterbury
Tales, but only after deleting 14 of their 58 manuscripts, "as they were
likely to have been copied from more than one exemplar, either by deliberate
conflation of readings or by changing the exemplar during the course of
copying." A similar explanation is likely to apply for some of the texts for
Genesis 1:3, although many of them were translated directly from the original
Hebrew rather than from later translations (eg. the Latin
"Vulgate").
Nevertheless, there is a general separation of the older
Genesis texts on the right of the graph and the more recent texts on the left.
This might be easier to assess if we simplify the graph.
As a simpler
summary of the same relationships, I have used a Reduced Median Network, based
on r = 2 (the program default). Note that the time order is reversed in this
graph, with the older texts on the left and the more recent texts on the
right. The only major discrepancy between the two graphs is the relative
placement of the Bishop's Bible. (Also, I have not labelled the two cases
where there are several texts that have identical sentences.)
Historically, we would expect the Tyndale
Bible, Coverdale Bible, Matthew Bible and Great
Bible texts to be closely related, but the Great Bible seems not to
fit this expectation. Similarly, we would expect a similarity between
the Geneva Bible and the Bishop's Bible, which is also not
reflected in the study sentence; nor is the acknowledged debt of the King
James Version to the Tyndale Bible.
However, the fact that
the Wycliffe Bible and Later Wycliffe are written in Middle
English rather than Modern English is clear from their distant relationship to
the other texts; and the close historical relationship of the Challoner
Revision and the Douay-Rheims Bible is also clear.
Several
texts show isolated relationships. The Knox Bible, for example, is unique
among the modern texts in being taken from the Latin rather than the original
Hebrew, while the Common English Bible is unusual in trying to balance
two translation principles (Dynamic Equivalence and Formal Equivalence) rather
than using only one.
On the other hand, the New International
Version is clearly a very traditional version of the text, given its
relationships as shown in the two graphs, which perhaps explains its
popularity.
The close association of the Good News Bible with
Young's Literal Translation is interesting, given that the
former is an (often criticized) free paraphrase of the original Hebrew text
while the latter is a literal translation of that same text — you can't get more
different translation principles.
Conclusion
The lack of
any simple tree-like relationship among these biblical texts makes any attempt
to study their phylogeny difficult. My own look at the business
of stemmatology suggests that
the results here are quite typical of any study of
written texts. Part of the problem seems to be that ideas developed in one
historical lineage can be transferred to other lineages, and even transferred to
earlier parts of those lineages (see my previous post:
Time inconsistency in evolutionary networks). So, even though
there is a general historical trend through time, that trend is not consistent
enough for a tree-based historical analysis to be effective.
* * * * * * * * * *
Time inconsistency in evolutionary networks
http://phylonetworks.blogspot.com/2012/07/time-inconsistency-in-evolutionary.html
The
temporal ordering of the nodes (and branches) is usually treated as an important
feature in an evolutionary network of biological organisms, because the order
must be time consistent (Baroni et al. 2004, 2006; Moret et al. 2004). That is,
for reticulation events the "horizontal" gene flow can only occur between
species that are contemporaries. So, speciation events occur successively but
reticulation events occur instantaneously (Sang and Zhong 2000).
For
example, it would be unrealistic to hypothesize either a hybridization or a
horizontal gene transfer event between a species and one of its own ancestors.
Furthermore, each reticulation event must not only be consistent on its own but
must be consistent in relation to all of the other
events.
Mathematically, inconsistency creates directed pseudo-cycles in
the network graph, so that it is not acyclic, as required for an evolutionary
history (see previous
blog post). Time consistency is thus seen as a useful means of
validating a network as a potential biological history, and can even be used as
a criterion for choosing among otherwise equally optimal
networks.
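One way to make the idea concrete is the following Python sketch of a time-consistency check. It assumes a particular reading of the requirement: times can be assigned to the nodes so that every tree arc runs strictly forward in time, while every reticulation arc joins nodes that exist at the same time. Under that assumption, the nodes joined by reticulation arcs can be merged into groups, and the network is time consistent exactly when the tree arcs between groups contain no directed cycle. The function name and the arc-list input format are choices made for this sketch, not taken from the cited papers.

```python
from collections import defaultdict, deque


def is_time_consistent(tree_arcs, reticulation_arcs):
    """True if tree arcs can go strictly forward in time while the two
    ends of every reticulation arc are contemporaneous."""
    nodes = {n for arc in list(tree_arcs) + list(reticulation_arcs) for n in arc}
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Nodes linked by a reticulation arc must share one time point.
    for u, v in reticulation_arcs:
        parent[find(u)] = find(v)

    groups = {find(n) for n in nodes}
    successors = defaultdict(set)
    indegree = defaultdict(int)
    for u, v in tree_arcs:
        gu, gv = find(u), find(v)
        if gu == gv:
            return False  # an ancestor forced to be its descendant's contemporary
        if gv not in successors[gu]:
            successors[gu].add(gv)
            indegree[gv] += 1

    # Kahn's algorithm: every group is ordered only if there is no cycle.
    queue = deque(g for g in groups if indegree[g] == 0)
    ordered = 0
    while queue:
        g = queue.popleft()
        ordered += 1
        for h in successors[g]:
            indegree[h] -= 1
            if indegree[h] == 0:
                queue.append(h)
    return ordered == len(groups)


tree = [("root", "A"), ("A", "B"), ("root", "C")]
# H is a hybrid of the contemporaries A and C: consistent.
ok = is_time_consistent(tree, [("A", "H"), ("C", "H")])
# H is a hybrid of B and B's own ancestor A: not consistent.
bad = is_time_consistent(tree, [("A", "H"), ("B", "H")])

tree2 = [("r", "a"), ("a", "b"), ("r", "c"), ("c", "d")]
# Each hybrid alone is fine, but together they contradict one another.
h1_only = is_time_consistent(tree2, [("b", "h1"), ("c", "h1")])
both = is_time_consistent(tree2, [("b", "h1"), ("c", "h1"), ("d", "h2"), ("a", "h2")])
print(ok, bad, h1_only, both)
```

The last pair of calls illustrates the point that each reticulation event must be consistent not only on its own but also in relation to all of the others.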
However, evolutionary analysis is not applied only to
biological organisms. It has also been applied to the study of languages (Atkinson & Gray 2005) and to cultural objects (Collard
et al. 2006). Indeed, Darwin himself recognized early on that it would be
important to show that language (a characteristic solely of humans) had a
natural origin and that it develops in a genealogical fashion (ie. it has a
pedigree).
Thus, both language and cultural objects have an historical
component that can be studied, and both can fit into an evolutionary framework
of variation + transmission + selection (Dagg 2011). Moreover, the evolutionary
history also consists of both vertical and horizontal transmission. This means
that the same data-analysis techniques can potentially be applied to biology,
language and culture (Heggarty et al. 2010; Gray et al. 2010).
The issue
that I wish to raise here is that time consistency is not a requirement of the
evolution of either language or cultural objects, the way that it is for
biological organisms. Organisms store the information (that is vertically and
horizontally transmitted) in genes that they carry with them, which is what
restricts reticulation to occurring only between contemporaries. However,
language and culture store their "information" externally, either in the minds
of people or in permanent or semi-permanent records (either written or
pictorial). Thus, the information available for horizontal transmission can come
from the distant past, as well as from the present.*
It is important to
note that for language and culture the biological ideas of vertical and
horizontal transmission of genetic information need modification (Cavalli-Sforza
and Feldman 1981). Vertical (or descending) transmission still involves faithful
copying of the information (with perhaps some losses or minor modifications).
Lateral transfer, however, can be either horizontal transmission (between
contemporary generations) or oblique transmission (between different
generations), and it is the latter that allows time-travel of
information.
Lateral transfer in this context may be a form of
hybridization, in which new concepts are added from elsewhere (eg. synonymous
words), but is likely to be a form of horizontal gene transfer (HGT) in which
concepts are simply replaced
with something from elsewhere (eg. a new word effectively replaces an old word).
Recombination, in which concepts are mutually exchanged, may be rather
rare.
As an illustration, Dagg (2011) provides some interesting examples
of lateral transfer in the parts of mouse traps. For example, he notes that:
"Torsion power may have been transmitted laterally from Egyptian torsion traps
to prefabricated dead-fall traps." These traps need not be contemporaneous,
because the ideas being transferred may be from pictures or descriptions of old
traps rather than from concurrently existing traps. (Joachim Dagg also has a
couple of blog posts where he further discusses the evolution of mouse traps:
post 1 —
post 2.)
As an alternative example, Johnson et al.
(1989) provide an evolutionary network showing the history of the various
software (mostly) and hardware components of the revolutionary Xerox 8010 "Star"
Information System (ie. computer), introduced in April 1981. Note that almost
all of the lateral transfer events (single arrows; mostly hybridization) are
time inconsistent. To quote the authors: "Although Star was conceived as a
product in 1975 and was released in 1981, many of the ideas that went into it
were born in projects dating back over three decades."
Fig. 8 – How systems influenced later systems.
This graph summarizes how various systems related to Star have
influenced one another over the years. Time progresses downwards. Double arrows
indicate direct successors (i.e., follow-on versions). Many "influence arrows"
are due to key designers changing jobs or applying concepts from their graduate
research to products.
The implications of
time-travelling laterally transferred information for network construction
methods may be unfortunate, in the sense that evolutionary networks in biology
may be quite different from those for language and culture, with the latter pair
requiring somewhat different methods. At a minimum, the requirements for
choosing among alternative networks will be different.
A quick look at
the current literature involving network analysis of languages and cultural
artifacts shows an almost universal use of unrooted graphs, most often a
Neighbor-Net, Reduced-Median or Median-Joining network. Such networks cannot
directly represent evolutionary history because there is no time direction in
the graph. This type of analysis thus neatly side-steps the issue of
representing time-travelling information in an evolutionary diagram; and it
suggests that social scientists have not yet considered the consequences of the
potential lack of time consistency in their data.
*Footnote: I
suppose that I should be precise, and note that a modern gene bank does allow
genetic information to time travel, as
well.
References
Atkinson QD, Gray RD (2005) Curious
parallels and curious connections — phylogenetic thinking in biology and
historical linguistics. Systematic Biology 54: 513-526.
Baroni M, Semple C, Steel M (2004) A framework for representing reticulate evolution. Annals of Combinatorics 8: 391-408.
Baroni M, Semple C, Steel M (2006) Hybrids in real time. Systematic Biology 55: 46-56.
Cavalli-Sforza LL, Feldman MW (1981) Cultural Transmission and Evolution. Princeton University Press, Princeton.
Collard M, Shennan SJ, Tehrani JJ (2006) Branching, blending, and the evolution of cultural similarities and differences among human populations. Evolution and Human Behavior 27: 169-184.
Dagg JL (2011) Exploring mouse trap history. Evolution: Education and Outreach 4: 397-414.
Gray RD, Bryant D, Greenhill SJ (2010) On the shape and fabric of human history. Philosophical Transactions of the Royal Society of London series B 365: 3923-3933.
Heggarty P, Maguire W, McMahon A (2010) Splits or waves? Trees or webs? How divergence measures and network analysis can unravel language histories. Philosophical Transactions of the Royal Society of London series B 365: 3829-3843.
Johnson J, Roberts TL, Verplank W, Smith DC, Irby C, Beard M, Mackey K (1989) The Xerox "Star": a retrospective. IEEE Computer 22: 11-29.
Moret BME, Nakhleh L, Warnow T, Linder CR, Tholse A, Padolina A, Sun J, Timme R (2004) Phylogenetic networks: modeling, reconstructibility, and accuracy. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1: 13-23.
Sang T, Zhong Y (2000) Testing hybridization hypotheses based on incongruent gene trees. Systematic Biology 49: 422-434.
Comments 1 - Nice post. I'd add one comment: It is
important to distinguish between the true evolutionary history, which has to be
time consistent, and the reconstructed one, which may have a reticulation edge
from an ancestor to descendant, simply due to incomplete taxon sampling (or
extinction). In other words, except for ensuring acyclicity, I don't think one
needs to impose time consistency constraints during inference.
Reply by Author - Thanks. You are right that it is
important to make the distinction; and it is always possible to add "ghost"
lineages to account for apparent time inconsistency in a reconstructed network.
However, consistency can be a valuable criterion for choosing among
reconstructions, as Leo van Iersel and I discussed in an earlier post (May 8,
2012).