Tuesday, July 12, 2011

more on authorship algorithm

Attribution and Misattribution: On Computational Linguistics, Heresy and Journalism
Moshe Koppel
A few days ago, newspaper readers from New Jersey to New Zealand read about new computer software that "sheds light on the authorship of the Bible" [1]. By the time the news circled back to Israel, farteitcht and farbessert, readers of Haaretz were (rather gleefully) informed that the head of the project had announced that it had been proved that the Torah was written by multiple human authors [2], just as the Bible critics had been saying all along.
I'm always skeptical about that kind of grandiose claim and this is no exception, even though the person who allegedly made the claim in this particular case happens to be me. The news reports in question refer to a recently published paper [3] in computational linguistics involving decomposition of a document into authorial components. A brief reference to application of the method to the Torah (Pentateuch) is responsible for most of the noise.
In what follows, I’ll briefly provide some background about authorship attribution research, sketch the method used in the research, outline the main results and say a few words about what they mean. My main purpose is to explain what has actually been proved and, more crucially in this case, what has not been proved.
Authorship Attribution
One of my areas of research for over a decade has been authorship attribution, the use of automated statistical methods to identify or profile the author of a given text. For example, we can determine, with varying degrees of accuracy, the age, gender and native language of the author of a text [4]. Under certain conditions, we can determine, with varying degrees of certainty, if two texts were written by the same person [5]. Some of this work has been applied to topics of particular interest to students of Jewish texts, such as strong evidence that the collection of responsa Torah Lishmah was written by Ben Ish Chai [6] (although he often quoted the work as if it were written by someone else) and that all of the letters in Genizat Harson are forgeries [7].
Whenever I have lectured on this topic, the first question has been: have you ever analyzed the Bible? The honest truth is that I never really understood the question and I suspect that in most cases the questioner didn't have any very well-formed question in mind, beyond the vague thought that the Bible is of mysterious provenance and ought to be amenable to some sort of statistical analysis. I would always mumble something about the question being poorly defined, Bible books being too short to permit reliable statistical analysis, etc. But, while all those excuses were quite true, I also had a vague thought of my own, which was that whatever well-formed research question I could come up with regarding Tanach, it would probably land me in hot water.
One research question that I have been working on with my graduate student, Navot Akiva, involves decomposition of a document into distinct stylistic components. For example, if a document was written by multiple authors, each of whom presumably writes in some distinct style, we'd like to be able to identify the parts written by each author. (Bear in mind this is what is known in the jargon as an unsupervised problem: we don't get known examples of each author's writing to analyze. All we have is the composite text itself, from which we need to tease apart distinctive looking chunks of text.) The object is straightforward: given a text, split it up into families of chunks in the best possible way, where by "best" we mean that the chunks that are assigned to the same family are as similar to each other as possible.
Even I could see that this could have some bearing on Tanach. So when Prof. Nachum Dershowitz, a colleague with whom I share a number of research interests, introduced me to his son, Idan, a graduate student in the Tanach program at Hebrew University, we agreed to consider how to apply this work to Tanach (sort of fudging the question of whether this meant Torah or Nach). It happens that, apart from being the most studied and revered set of books ever written, Tanach offers another advantage as an object of linguistic analysis: precisely because it has been the subject of so much study, there are many available automated tools that we could exploit in our research.
The Method
Here's how our computerized method works. Divide a text into chunks in some reasonable way. These chunks might be chapters or some fixed number of sentences or whatever; the details aren't critical and need not concern us at this stage. I'm going to call these chunks "chapters" (only because it is a less technical sounding word), but bear in mind that we are not assuming that a chapter is stylistically homogeneous; that is, the split between authors might take place in the middle of a chapter.
Our object is to split our collection of chapters into families of stylistically similar chapters. (The chapters in a family need not be contiguous.) All the chapters that look a certain way, please step to the left; all others, please step to the right.
As a first step, for any pair of chapters, we're going to have to measure the similarity between them. The trick is to measure this similarity in a way that captures style rather than content.
The way we do it is as follows: we begin by generating a list of synonym sets. For example, for the case of Tanach, we would consider synonym sets such as {betoch, bekerev}, {sar, nasi}, {mateh, makel}, and so on. There are about 200 such sets of Biblical synonyms. We generate this list automatically by identifying Hebrew roots that are translated by the same English root in the KJV. Note that not every occurrence of, for example, shevet (which can mean either “staff” or “tribe”) is a synonym for makel (which is always “staff”). We use online concordances to disambiguate, that is, to determine the intended sense of a word in a particular context. (In this respect, Tanach is especially convenient to work with.)
For every chapter and every such set of synonyms, we record which synonym (if any) that chapter uses. The similarity of a pair of chapters reflects the extent to which they make similar choices from among synonym sets. The idea is that if one chapter uses – for example – betoch, sar and mateh and the other uses bekerev, nasi and makel, the two chapters have low similarity. If a chapter doesn’t use any of the synonyms in a particular synonym set, that set plays no role in measuring the similarity between that chapter and any other chapter.
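The similarity idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three synonym sets are the transliterated examples from the text (the real list has about 200), a chapter is reduced to a bag of already-disambiguated words, and the per-set overlap score and the names `SYNONYM_SETS`, `synonym_choices` and `similarity` are hypothetical choices, not the measure used in the paper.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for the real list of roughly 200 synonym sets.
SYNONYM_SETS: List[Set[str]] = [
    {"betoch", "bekerev"},
    {"sar", "nasi"},
    {"mateh", "makel"},
]


def synonym_choices(chapter_words: Set[str]) -> Dict[int, Set[str]]:
    """For each synonym set the chapter uses, record which synonyms it chose."""
    choices: Dict[int, Set[str]] = {}
    for index, synonym_set in enumerate(SYNONYM_SETS):
        used = synonym_set & chapter_words
        if used:  # sets the chapter never draws on are simply left out
            choices[index] = used
    return choices


def similarity(chapter_a: Set[str], chapter_b: Set[str]) -> float:
    """Average agreement over the synonym sets that BOTH chapters use."""
    a, b = synonym_choices(chapter_a), synonym_choices(chapter_b)
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0  # assumption: no shared sets means no evidence of similarity
    overlap = sum(len(a[i] & b[i]) / len(a[i] | b[i]) for i in shared)
    return overlap / len(shared)


opposite = similarity({"betoch", "sar", "mateh"}, {"bekerev", "nasi", "makel"})
same = similarity({"betoch", "sar"}, {"betoch", "sar", "makel"})
```

In the first pair the chapters make opposite choices in every set, so the score is as low as it can be. In the second pair the staff set is used by only one chapter and therefore plays no role.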
Once we know the similarity between every pair of chapters, we use formal methods to create optimal families. Ideally, we want all the chapters in the same family to be very similar to each other and to be very different from the chapters in other families. In fact, such clean divisions are unusual, but the formal methods will generally find a near-optimal clustering into families. (What we call families are called “clusters” in the jargon, and the process of finding them is called “clustering”. The particular clustering method we used is a spectral approximation method called n-cut.)
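The objective behind n-cut can be illustrated on a toy example. The sketch below does not use the spectral approximation mentioned above; it enumerates every two-way split of a small, made-up similarity matrix and keeps the one with the lowest normalized-cut value, which is feasible only for a handful of chapters. The matrix values and the function names are assumptions for illustration.

```python
from itertools import combinations
from typing import List, Set, Tuple

# Made-up pairwise similarities for four chapters (symmetric, zero diagonal).
TOY_SIM: List[List[float]] = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.8],
    [0.1, 0.1, 0.8, 0.0],
]


def ncut_value(sim: List[List[float]], group_a: Set[int], group_b: Set[int]) -> float:
    """Normalized cut: the similarity severed by the split, relative to the
    total similarity each side has with all chapters."""
    everyone = range(len(sim))
    cut = sum(sim[i][j] for i in group_a for j in group_b)
    assoc_a = sum(sim[i][j] for i in group_a for j in everyone)
    assoc_b = sum(sim[i][j] for i in group_b for j in everyone)
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")
    return cut / assoc_a + cut / assoc_b


def best_two_way_split(sim: List[List[float]]) -> Tuple[Set[int], Set[int]]:
    """Exhaustively search all two-family splits (toy sizes only)."""
    n = len(sim)
    everyone = set(range(n))
    best_value, best_split = float("inf"), (everyone, set())
    for size in range(1, n // 2 + 1):
        for members in combinations(range(n), size):
            group_a = set(members)
            group_b = everyone - group_a
            value = ncut_value(sim, group_a, group_b)
            if value < best_value:
                best_value, best_split = value, (group_a, group_b)
    return best_split


family_a, family_b = best_two_way_split(TOY_SIM)
```

Note that the number of families (two) is fixed by the caller here, not discovered by the search.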
A key question you should ask at this point is: how many families will we get? You might imagine that the clustering method will somehow figure out the right number of families. Indeed, there are clustering methods that can do that. But – note this carefully – the number of families we obtain is not determined by the clustering method we use. Rather it is given by us as an input. That is, we decide in advance how many families we want to get and the method is forced to give us exactly what we asked for. This is a crucial point and we'll come back to it when we get to the meaning of all these results below.
In any case, at this stage, we have a tentative division of chapters into however many families we asked for. (For simplicity, let's assume that we have split the chapters into exactly two families.) This is not the final result, for the simple reason that we have no guarantee that the chapters themselves are homogeneous. The next step is to identify those chapters that are at the core of each family; these are the chapters we are most confident we have assigned correctly and are consequently the ones most likely to be homogeneous. (Note that when I say "we are confident" I don't mean anything subjective and wishy-washy; all this is done automatically according to formal criteria a bit too technical to get into here.)
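The formal criteria for picking core chapters are not given here, so the following is purely a hypothetical stand-in to make the idea concrete: a chapter counts as core if it is, on average, clearly more similar to its own family than to the other one. The margin value, the toy matrix and the name `core_chapters` are all assumptions.

```python
from typing import List

# Made-up similarities for five chapters; chapter 2 sits between the families.
TOY_SIM: List[List[float]] = [
    [0.0, 0.9, 0.5, 0.1, 0.1],
    [0.9, 0.0, 0.5, 0.1, 0.1],
    [0.5, 0.5, 0.0, 0.5, 0.4],
    [0.1, 0.1, 0.5, 0.0, 0.8],
    [0.1, 0.1, 0.4, 0.8, 0.0],
]


def core_chapters(sim: List[List[float]], family: List[int],
                  other: List[int], margin: float = 0.2) -> List[int]:
    """Keep chapters whose mean similarity to their own family exceeds their
    mean similarity to the other family by at least `margin` (hypothetical)."""
    core = []
    for i in family:
        own = [sim[i][j] for j in family if j != i]
        if not own or not other:
            continue  # nothing to compare against
        mean_own = sum(own) / len(own)
        mean_other = sum(sim[i][j] for j in other) / len(other)
        if mean_own - mean_other >= margin:
            core.append(i)
    return core


core = core_chapters(TOY_SIM, family=[0, 1, 2], other=[3, 4])
```

Chapter 2 is left out of the core because it resembles both families about equally, which is the kind of chapter that might be a mixture.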
Now that we have a selection of chapters that are assigned to respective families with high confidence, we use them as seeds for building a "model" that distinguishes between the two families. Very roughly speaking, we look for common words (ones not tied to any specific topic) that appear more in one family than in the other and we use formal methods (for those interested, we use SVM) to find just the right weight to give to each such word as an indicator of one family or the other. We now use this model to classify individual sentences as being in one family or the other.
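A rough sketch of this step, assuming a tiny hand-rolled linear SVM trained by stochastic subgradient descent on the hinge loss (not the SVM implementation used in the research). The four common words, the seed texts and the hyperparameters are hypothetical, chosen only to keep the example small.

```python
from typing import List, Optional, Tuple

# Hypothetical topic-neutral common words (transliterated); not from the paper.
VOCAB = ["asher", "ki", "el", "al"]


def featurize(text: str) -> List[float]:
    """Count how often each common word appears in the text."""
    words = text.split()
    return [float(words.count(v)) for v in VOCAB]


def train_linear_svm(X: List[List[float]], y: List[int], epochs: int = 100,
                     lr: float = 0.1, reg: float = 0.01) -> Tuple[List[float], float]:
    """Minimize L2-regularized hinge loss with plain subgradient steps."""
    w = [0.0] * len(VOCAB)
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = sum(wk * xk for wk, xk in zip(w, x)) + b
            violated = label * score < 1.0  # inside the margin or misclassified
            w = [wk - lr * (reg * wk - (label * xk if violated else 0.0))
                 for wk, xk in zip(w, x)]
            if violated:
                b += lr * label
    return w, b


def classify(sentence: str, w: List[float], b: float) -> Optional[str]:
    """Assign a sentence to family A or B, or None if it has no indicator words."""
    x = featurize(sentence)
    if not any(x):
        return None
    score = sum(wk * xk for wk, xk in zip(w, x)) + b
    return "A" if score >= 0 else "B"


# Made-up seed texts: +1 for core chapters of family A, -1 for family B.
SEEDS = [("asher el asher", 1), ("asher el", 1), ("ki al ki", -1), ("al ki el", -1)]
weights, bias = train_linear_svm([featurize(t) for t, _ in SEEDS],
                                 [label for _, label in SEEDS])
```

The learned weights play the role described above: a positive weight marks a word as an indicator of one family, a negative weight as an indicator of the other. Sentences containing none of the words are left unclassified.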
Results
Wonderful, so we did all sorts of geeky hocus-pocus. Why should you believe that this works? Maybe the whole synonym idea is wrong because we ignore subtle differences in meaning between "synonyms". Maybe the same author deliberately switches from one synonym to the other for literary reasons. Maybe we are biased because we believe something wicked and we subtly manipulated the method to obtain particular results.
These are legitimate concerns. That's why we test the method on data for which we know the right answer to see if the method gives that right answer. In this case, our test works as follows. We take two books, each of which we can assume is written by a single distinct author, mix them up in some random fashion, and check if our method correctly unmixes them. In particular, we took as our main test set random mishmashes of Yirmiyahu and Yechezkel.
We found that the method works extremely well. About 17% of the psukim could not be classified (no differentiating words appeared in these psukim or their near neighbors). Of the approximately 2200 psukim that were classified into two families, all the Yirmiyahu psukim went into one family and all the Yechezkel psukim went into the other, with a total of 26 (1.2%) exceptions. We obtained similar results on a variety of other book pairs.
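The scoring behind such a test can be sketched as follows. Because the two families have no names of their own, the sketch tries both ways of matching families to books and keeps the better one. The toy labels and the function name `unmixing_report` are assumptions; only the figures 26 and 2200 come from the text.

```python
from typing import List, Optional, Tuple


def unmixing_report(true_books: List[int],
                    predicted: List[Optional[int]]) -> Tuple[float, float]:
    """Return (share of verses left unclassified, error rate among the rest).

    true_books holds 0 or 1 per verse; predicted holds 0, 1 or None.
    """
    classified = [(t, p) for t, p in zip(true_books, predicted) if p is not None]
    unclassified_rate = 1 - len(classified) / len(true_books)
    if not classified:
        return unclassified_rate, 0.0
    mismatches = sum(1 for t, p in classified if t != p)
    # Family labels are arbitrary, so also consider the swapped matching.
    errors = min(mismatches, len(classified) - mismatches)
    return unclassified_rate, errors / len(classified)


# Toy run: eight verses, one unclassified, one assigned to the wrong book.
toy_true = [0, 0, 0, 1, 1, 1, 0, 1]
toy_pred = [1, 1, None, 0, 0, 0, 1, 1]
unclassified_rate, error_rate = unmixing_report(toy_true, toy_pred)

# The reported figure: 26 exceptions among roughly 2200 classified psukim.
reported_error_percent = round(100 * 26 / 2200, 1)
```

With the numbers quoted above, 26 out of about 2200 works out to roughly 1.2%, matching the stated rate.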
So maybe we should have left well enough alone. But with a power tool like this in hand, how could you not want to see how it would split the chumash? Shoot me, but for me, like Rav Kahana hiding under his rebbe's bed, Torah hee velilmod any tzarich. We did the experiment. I should hasten to mention, though, that the chumash experiment is only briefly mentioned in the published paper, which focuses on proving the efficacy of the method (it’s a computational linguistics paper, not a Bible paper).
Now, I should point out that until I got involved in this, I was a complete am haaretz in Bible Criticism, a perfectly agreeable state of affairs, as far as I was concerned. However, Idan Dershowitz immediately observed that our split was very similar to the split between what critics refer to as the Priestly (P) and non-Priestly portions of the Torah. Bear in mind that there are ongoing disagreements among the critics about precisely which psukim should be regarded as P and which not. We took two standard such splits, that of Driver and that of Friedman, and refer to the set of psukim for which they agree as “consensus” psukim. (They agree just over 90% of the time.)
Here’s the result. Our split of the Torah into two families corresponds with their split for about 90% of all consensus psukim.
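The comparison can be made concrete with a short sketch. It keeps only the psukim on which the two scholarly splits agree and then measures how often the automatic split matches them, trying both ways of pairing the nameless families with P and non-P. The ten toy labels and the function name `consensus_agreement` are made up for illustration and are not data from the study.

```python
from typing import List, Tuple


def consensus_agreement(driver: List[str], friedman: List[str],
                        ours: List[str]) -> Tuple[float, float]:
    """Return (share of verses where the two scholars agree,
    share of those consensus verses on which our split agrees)."""
    consensus = [(d, o) for d, f, o in zip(driver, friedman, ours) if d == f]
    consensus_rate = len(consensus) / len(driver)
    if not consensus:
        return consensus_rate, 0.0
    # Our families are just "A" and "B"; try A=P and A=non-P, keep the better fit.
    a_is_p = sum(1 for d, o in consensus if (d == "P") == (o == "A"))
    best = max(a_is_p, len(consensus) - a_is_p)
    return consensus_rate, best / len(consensus)


# Made-up labels for ten verses ("N" stands for non-P).
toy_driver = ["P", "P", "P", "N", "N", "N", "N", "P", "N", "N"]
toy_friedman = ["P", "P", "N", "N", "N", "N", "N", "P", "N", "N"]
toy_ours = ["A", "A", "A", "B", "B", "B", "A", "A", "B", "B"]
consensus_rate, agreement = consensus_agreement(toy_driver, toy_friedman, toy_ours)
```

In this toy case the scholars agree on nine of ten verses and the automatic split matches eight of those nine.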
Let me say a few words about the main areas of disagreement. To a significant extent, our split runs along lines of genre. One family is mostly – not completely – legal material and the other is mostly narrative. Since what the critics call the Priestly sections include pretty much all of Vayikra (which is mostly laws), as well as selected portions of Bereishis, Shemos and Bemidbar, their split also corresponds somewhat to the legal/narrative split. Most of the cases where our split differs from theirs involve narrative sections that they assign to P and our method assigns to the family that corresponds to non-P, for example, the first chapter of Bereishis. (The rest of the disagreements involve P sections that scholars now refer to as H and consider some sort of quasi-P, but I don’t want to get into all that, mostly because I’m still pretty clueless about it.)
Before you dismiss all this by saying that all we did was discover that stories don’t look like laws, let me point out there are plenty of narrative sections that the computerized analysis assigned to the P family (or, more precisely, to the nameless family that turns out to be very similar to what the critics call the P family). Two prominent examples are the story of Shimon and Levi in Shechem and the story of Pinchas and Zimri.
One more point: when we split the Torah into three or more families, our results do not coincide with those of the critics. In the case of three families, Devarim does seem to split off as its own family, as the critics claim, but there are a fair number of exceptions. And even with four or more families, no hint of the critics' E/J split shows up at all.
Interpreting the Results
So does all this mean that we have proved that the Torah was written by at least two human authors, as the breathless reports claim? No.
First of all, as I noted above, our method does not determine the optimal number of families. That is, it does not make a claim regarding the number of authors. Rather, you decide in advance how many families you want and the method finds the optimal (or a near-optimal) split of the text into that number. If you ask it to split Moby Dick into two (or four or thirteen) parts, it will do so. Thus the fact that we split the Torah into two tells us exactly nothing about the actual number of authors.
Having said that, I want to temper any religious enthusiasm such a disclaimer might engender. First of all, with a few improvements to the method we could probably identify some optimal number of families for a given text. We simply haven’t done so. Second, the fact that – for the case of two families – the results of our method coincide (to some extent) with those of the critics would seem to suggest that the split the method suggests is not merely coincidental.
But the deeper reason that our work is irrelevant to the question of divine authorship is simply that it does not – indeed, it could not – have a thing to say on that question. If you were to have some theory about what properties divine writing ought to have and close analysis revealed that a certain text probably did not have those properties, then you might have to change your prior belief about the divine provenance of that text. But does anyone really have some theory about what divine texts are supposed to look like? Several press reports about this work referenced the idea that “God could write in multiple voices”. I find that formulation a bit simplistic, but it captures the fact that any attempt to map from multiple writing styles to multiple authorship must be rooted in assumptions about human cognition and human performance that are simply not relevant to the question of divine action [8].
In short, our results seem to support some findings of higher Bible criticism regarding possible boundaries between distinct stylistic threads in the Torah. These results might have some relevance regarding literary analysis of the Torah. Taken on their own, however, they are not proof of multiple authorship. Furthermore, there is nothing in these results that should cause those of us committed to the traditional belief in divine authorship of the Torah to doubt that belief.
1 http://news.yahoo.com/israeli-algorithm-sheds-light-bible-163128454.html
2 http://www.haaretz.co.il/captain/spages/1233355.html
3 M. Koppel, N. Akiva, I. Dershowitz and N. Dershowitz (2011), Unsupervised Decomposition of a Document Into Authorial Components, Proceedings of ACL, pp. 1356-1364.
4 S. Argamon, M. Koppel, J. Pennebaker and J. Schler (2009), Automatically Profiling the Author of an Anonymous Text, Communications of the ACM, 52 (2): pp. 119-123 (virtual extension).
5 M. Koppel, J. Schler and E. Bonchek-Dokow (2007), Measuring Differentiability: Unmasking Pseudonymous Authors, JMLR 8, July 2007, pp. 1261-1276.
6 M. Koppel, D. Mughaz and N. Akiva (2006), New Methods for Attribution of Rabbinic Literature, Hebrew Linguistics: A Journal for Hebrew Descriptive, Computational and Applied Linguistics, 57, pp. 5-18.
7 M. Koppel, Author Identification by Computerized Methods: "Genizat Harson" (in Hebrew), Yeshurun 23 (Elul 5770), pp. 559-566.
8 I realize that this argument comes close to asserting that the claim of divine authorship is unfalsifiable, which for some might cast doubt on the meaningfulness of that claim. A proper response to that concern would involve a discussion of the nature and content of religious belief, a discussion that is well beyond the scope of this brief peroration.