Pierre MARANDA, Computers in the Bush: Tools for the Automatic Analysis of Myths. An article published in the volume edited by J. Helm, ESSAYS ON THE VERBAL AND VISUAL ARTS. Proceedings of the 1966 Annual Spring Meeting of the American Ethnological Society

[77]

Pierre Maranda

Anthropologist, Harvard University

“Computers in the Bush :
Tools for the Automatic Analysis of Myths.” [1]

An article published in the volume edited by J. Helm, ESSAYS ON THE VERBAL AND VISUAL ARTS. Proceedings of the 1966 Annual Spring Meeting of the American Ethnological Society, pp. 77-83. University of Washington Press, 1967.

There is a term in computer jargon which should be prefaced as a slogan to all automatic analyses. Social scientists working with computers should be especially aware of its message. It reads GIGO, shorthand for "garbage in, garbage out," and means that the output of the fanciest and most sophisticated machine is at the entire mercy of the input fed into it by the analyst.

Unfortunately, much too much sociological garbage is processed in the hope that the magic of electronics will change it into scientific truth. And a sad side of the fad is that the mention of a computer number or that of the esoteric name of a program, modestly relegated to a footnote, lends indeed unwarranted authority to conclusions that even a nineteenth-century positivist would have swept aside as fallacious.

It must be stressed at the outset that no formula exists which can produce computerized results without painstaking labor. If automatic analyses may yield better outputs than paper and pencil techniques, they also require more work. The purpose of this paper is to indicate prerequisites to the computerized analysis of folkloric data by describing summarily a few basic procedures which cannot be ignored without invalidating the whole product. Computers are beginning to be brought in to clear mythological bushes, but they are mostly handled like bulldozers — which are not the best equipment to harvest "La pensée sauvage"— and one forgets that they can be much more sensitive tools.

Two sound methodological warnings to folklorists can be found in the still young literature, and they should be taken into account not only by the analysts themselves but also by those who read the latter's works (Dundes 1965 ; Greimas 1966). These remain general, however, and represent the viewpoint of outsiders since they come from scholars who are not themselves engaged in computerized analyses. In contrast, the following remarks are inspired by inside experience and propose positive guidelines for the preliminary editing of the documents on the one hand, and for the use of available programs on the other.

The folkloric data with which I am concerned here are myths. My sample consists of 135 narratives (about 33,000 words) collected among four Gê tribes of Central Brazil (Matta 1962 ; Maybury-Lewis 1958 ; Melatti 1962 ; Métraux 1960 ; Nimuendaju 1939, 1942, 1944, 1946 ; Schultz 1950). The steps I will discuss are the outcome of 26 months of work with the corpus and the computers. Approximately six more months will still be necessary before the last conclusions are drawn.

"Canned" analytic grids were tried out at first in the hope that they would provide short-cuts : they yielded immediate results, which [78] turned out to have a reliability of less than 30 percent. It was then decided to start from the beginning and to take all the necessary measures to insure a reliable analysis.

1. Editing the Texts for the Computer

My work was done on English translations. Computers can handle non-Indo-European languages, but at great cost. A careful control of the versions must be exerted, needless to say ; this is a problem common to all folklore studies conducted in languages other than the original and need not detain us here. In this case, the texts used were checked by competent ethnographers ; whenever necessary, original Gê words were kept and inserted in the dictionary. The interest of the enterprise as a whole resides in the fact that the results of the formal analysis are being assessed against the ethnographic knowledge which the anthropologists of the Harvard Central Brazil Project bring back from the field.

A second translation problem is that of the conversion of a natural language into an analytic code. Folklorists have long devised categories to rewrite their data for comparisons and other research purposes (Cf. Aarne-Thompson 1961 ; Lévi-Strauss 1955 ; Propp 1958 ; Thompson 1958). Such categories as motifs (Aarne-Thompson, etc. : cf. Dundes 1962) are too ill-defined to meet the strict requirements imposed by the use of computers. Function indices (Propp) are not satisfactory either, for if they behave superficially like syntagms, they are not isomorphic to each other (cf. Lévi-Strauss 1960 ; Bremond 1964 ; Greimas 1964). Only analytic propositions provide a valid operational framework (cf. Lévi-Strauss 1955 ; Mathiot 1966). The texts must therefore be recast in those terms, and this is a long, demanding, and tedious operation which the machines cannot perform. It implies that all passive sentences be rewritten in the active voice, and that the ambiguity inherent in natural languages be completely resolved. Thus, pronouns are replaced by their antecedents, homonyms are eliminated, connectives and other conjunctions are standardized, etc. Each natural sentence is then recoded into analytic or elementary propositions by assigning a special syntax-marker to each word (Dunphy et al. 1965 ; Stone 1966). Three syntactic categories were deemed sufficient for this purpose : subject+modifier, verb+modifier, and object+modifier. At the end of this phase, the text is broken down into minimal units which cannot be divided further without altogether losing their dynamic meaning in the narrative. Periods, and + signs within sentences, mark the borders of each analytic proposition. For example, "the jaguar captured the (two) (young) araras and ate them" is rewritten as "the jaguar/S captured/V the two/O young/O araras/O+ and the jaguar/S ate/V the two/O young/O araras/O."
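The marked-up format of this example lends itself to mechanical reading. The following Python sketch is an illustration only, not the procedure used with the General Inquirer. It assumes that every content word carries a trailing /S, /V, or /O marker, that a + suffix or a period closes a proposition, and that unmarked words (articles, connectives) can be skipped ; the name parse_propositions is hypothetical.

```python
import re

# A word followed by a syntax marker (S, V, or O) and an optional "+"
# that closes the proposition, e.g. "araras/O+". Assumed format.
TOKEN = re.compile(r"^(?P<word>[^/\s]+)/(?P<marker>[SVO])(?P<end>\+?)$")

def parse_propositions(text):
    """Split marked-up text into propositions, each a list of (word, marker)."""
    propositions, current = [], []
    for raw in text.split():
        closes = raw.endswith(".")           # a period closes a proposition
        match = TOKEN.match(raw.rstrip("."))
        if match:
            current.append((match["word"], match["marker"]))
            closes = closes or match["end"] == "+"
        if closes and current:
            propositions.append(current)
            current = []
    if current:                               # text without a final period
        propositions.append(current)
    return propositions

edited = ("the jaguar/S captured/V the two/O young/O araras/O+ "
          "and the jaguar/S ate/V the two/O young/O araras/O.")
for number, proposition in enumerate(parse_propositions(edited), start=1):
    print(number, proposition)
```

Run on the edited sentence, the sketch returns two propositions, one for the capture and one for the eating, each with its subject, verb, and object words.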

To some extent, guidelines for recoding are found in the corpus itself. In effect, because of their nature of traditional messages, myths maintain a remarkable stability through time. The rodage of narratives through years and years of transmission produces an optimal formulation, as it were, where redundancy, saturation, and economy have reached an equilibrium which insures a "negentropy" somewhat like that of well formulated propositions. The changes an item undergoes are mostly of a paradigmatic order, its syntagmatic or [79] sequential aspect remaining constant as long as the message is to be aligned on a persistent sociological function. (In this respect, folkloristics might perhaps contribute to information theory, for it deals with highly stable messages whose coding can barely be improved.)

Editing and recoding are followed by keypunching. Each punch card carries an identification number. Four digits were used : the first one designating the tribe and the source, the second standing for a general categorization corresponding to the title of the myth, and the last two digits giving further specifications by referring to prominent aspects of content. For example, 3616 stands for Sherente, Maybury-Lewis 1958, culture-hero, and hostile animals. Then comes the edited text, which reads continuously from one card to another.
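The structure of the identification number can be sketched as a simple decoding routine. The Python below is an illustration under stated assumptions : the lookup tables contain only the values of the one example given above (3616), since the full code books are not reproduced in this paper, and the names MythID and decode_id are hypothetical.

```python
from typing import NamedTuple

# Only the codes from the example "3616" are filled in; the complete code
# books are not given in the text, so other codes are reported as unknown.
TRIBE_AND_SOURCE = {"3": "Sherente, Maybury-Lewis 1958"}
CATEGORY = {"6": "culture-hero"}
SPECIFICATION = {"16": "hostile animals"}
UNKNOWN = "(not in code book)"

class MythID(NamedTuple):
    tribe_and_source: str
    category: str
    specification: str

def decode_id(card_id: str) -> MythID:
    """Split a four-digit card ID into its three coded fields."""
    if len(card_id) != 4 or not card_id.isdigit():
        raise ValueError(f"expected four digits, got {card_id!r}")
    return MythID(
        TRIBE_AND_SOURCE.get(card_id[0], UNKNOWN),   # first digit
        CATEGORY.get(card_id[1], UNKNOWN),           # second digit
        SPECIFICATION.get(card_id[2:], UNKNOWN),     # last two digits
    )

print(decode_id("3616"))
```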

2. Automatic Analysis

2.1. The KWIC

The first operation of the computer is to transfer the contents of the cards onto a tape. The myths are then tagged in the form of a string of numerical symbols coding analytic categories, i.e., operating a normalization of the analytic language. This second operation can be done only with the help of a high-speed and large computer like the IBM 7090. But normalization implies that an analytic grid is used, which I must define before going any further. A normalization is essentially a recoding. Paradigmatic sets are established, i.e., slots are tagged for groups of terms either in free variation or in complementary distribution according to whether they are related to other or the same terms through different or identical actions. The fuller meaning of normalization will become clearer as I proceed.

A most powerful, yet simple, tool to deal with verbal data is the program called KWIC ("Key Words In Context"). It is actually a way to make a concordance mechanically, and it is described in connection with its first use in the IBM Reference Manual Index Organization for Information Retrieval (IBM 1961 : 34-36 ; also IBM 1962: 5-34). A KWIC output as used in the first step of my analysis presents the format of an alphabetical list of all the words contained in a large sample (3/5) of the corpus. These appear in the middle of the printout, and six words of context are found on each side, with the ID number at the right‑hand end of the line.
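The layout just described can be approximated in a few lines of Python. This is a minimal sketch and not the IBM program : it assumes texts already edited into lower-case words separated by spaces, and the function names kwic and format_kwic, the column widths, and the card ID "0000" are all made up for the illustration.

```python
def kwic(cards, width=6):
    """Build concordance entries sorted alphabetically by keyword.

    cards: list of (id_number, text) pairs.
    width: number of context words kept on each side of the keyword.
    """
    entries = []
    for card_id, text in cards:
        words = text.lower().split()
        for i, keyword in enumerate(words):
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            entries.append((keyword, left, right, card_id))
    # Alphabetical by keyword, then by ID; the sort is stable, so entries
    # from the same card keep their order of occurrence.
    entries.sort(key=lambda entry: (entry[0], entry[3]))
    return entries

def format_kwic(entries, column=40):
    """Keyword in the middle, context on each side, ID at the right-hand end."""
    return [
        f"{left:>{column}}  {keyword.upper():<12}  {right:<{column}}  {card_id}"
        for keyword, left, right, card_id in entries
    ]

sample = [("0000", "the jaguar captured the two young araras "
                   "and the jaguar ate the two young araras")]
for line in format_kwic(kwic(sample)):
    print(line)
```

Looking up JAGUAR in such a listing shows at once the two actions, capturing and eating, that the jaguar performs in the sample.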

I would like to mention here that independently of all that it makes possible afterwards, a KWIC output is already most valuable in itself. In effect, it can be used as a concordance and enables one to define, for instance, any dramatis persona in terms of the actions it performs in the narrative. By looking up jaguar or sun or any other entry of the same order, one can delimit very rapidly the semantic fields immediately broached by the entry.

The main purpose of KWIC, however, is twofold. First, it serves as an exact tool to check whether all ambiguities of the natural language have been resolved. Overlooked homonyms can be quickly spotted ; pronouns which have not been replaced by their antecedents will appear under "it," "they," etc. Second, KWIC leads to the [80] construction of an "emic" analytic code, i.e., to the description of the data in terms of tags or semantic categories.
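The first of these checks can also be approximated directly on the edited text. The Python sketch below is a shortcut not described in the paper : it assumes a fixed list of English pronouns (the list itself is an assumption and is not exhaustive) and reports every one that survived editing, with the card ID and word position. The name unresolved_pronouns and the card ID "0000" are hypothetical.

```python
# Assumed, non-exhaustive list of pronouns that editing should have replaced.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them",
            "his", "its", "their"}

def unresolved_pronouns(cards):
    """Return (id_number, word_position, pronoun) for each pronoun left over."""
    hits = []
    for card_id, text in cards:
        for position, token in enumerate(text.lower().split(), start=1):
            # Drop a syntax marker such as "/O" and trailing punctuation.
            word = token.split("/")[0].strip(".,;:+")
            if word in PRONOUNS:
                hits.append((card_id, position, word))
    return hits

cards = [("0000", "the jaguar captured the araras and ate them.")]
print(unresolved_pronouns(cards))
```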

The KWIC step is often ignored or neglected by hasty analysts and by those who prefer to project their own "etic" categories onto the data. Of course, it is long and tedious to reread one's materials word by word in alphabetical order ; it remains the only way, nonetheless, to build adequate grids.

2.2. The General Inquirer

The program used in my analysis, the General Inquirer devised by Philip Stone at Harvard (Dunphy et al. 1965 ; Stone 1966), [2] offers a battery of 99 positions for categories or descriptive tags to normalize the data (more can be used, but at a higher cost). The words of the corpus were thus distributed into 99 categories according to the following paradigmatic sets suggested by the study of KWIC : cosmological features (4 tags), motion (6 tags), space, time (3 tags each), quantity (2 tags), plants (2 tags), animals (6 tags), body parts (6 tags), physiological processes (10 tags), social categories (6 tags), social behavior (8 tags), societal units (8 tags), ritual (3 tags), supernatural beings (4 tags), general psychological features (2 tags), cognitive psychology (7 tags), relation of information or of accomplishment of task (1 tag each), logical operations, i.e., normalized connectives and relational modifiers (10 tags), and, finally, types of transformations undergone by dramatis personae (2 tags).

The paradigmatic sets are therefore only nominal categories at this early stage. A set is in effect defined so far exclusively and exhaustively by the entries which describe it. The left-over list and a sample of scores, as will be shown below, will lead to a revision of the dictionary.

Some of these categories contain only "nouns" or dramatis personae (e.g., Tag 44, Affines), and some contain only "verbs", i.e., are essentially relational in nature, like the tags for transformation and those for motion. The former can be considered as metonymic sets, the latter as metaphorical elements. Accordingly, syntax-marking and category membership overlap to some extent. This, along with the tags normalizing connectives and logical operations, makes automatic structural analysis possible in that it provides a normalization of the sequences constituting plots or, in other words, it forms a framework for syntagmatic analysis (cf. Maranda 1966b, c).

It would be simple‑minded to believe that the 99 categories forming the dictionary of paradigmatic sets and syntagmatic frames exhaust the documents. In fact, my first dictionary was revised after a first run. It was also completely revamped from a different angle, all categories being then subsumed under only three major syntagmatic headings. The flexibility of the program is such indeed that different theoretical viewpoints can be tested and entirely new approaches adopted without great expense. And KWIC is what warrants the validity of these experiments.

Once a first draft of the dictionary is completed, the next step is to assess whether the dictionary compiles — but this is a technical matter which can be left aside here. Of more direct interest is the left-over list which contains all the words not assigned to a tag by [81] the computer, and which is produced as a complementary output. The left-over list must then be examined and its contents either sent to the N list, i.e., discarded as irrelevant (articles, pronouns, etc.) or integrated into the dictionary.
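The interplay of dictionary, N list, and left-over list can be sketched as follows. This Python illustration is not the General Inquirer's own lookup routine : the function name tag_words is hypothetical, the five dictionary entries use the tag names of the jaguar example discussed in this paper, and the two-word N list is an assumption made for the example.

```python
# Tag names follow the jaguar example in this paper; a real dictionary
# would be far larger.
DICTIONARY = {
    "jaguar": "Animal, wild",
    "captured": "Dominance",
    "two": "Quantity",
    "young": "Age",
    "araras": "Animal, bird",
}
N_LIST = {"the", "and"}   # words discarded as irrelevant (assumed contents)

def tag_words(words, dictionary, n_list):
    """Return (tagged, left_over) for a sequence of words."""
    tagged, left_over = [], []
    for word in words:
        if word in n_list:
            continue                      # discarded, never reported
        tag = dictionary.get(word)
        if tag is None:
            if word not in left_over:     # list each unknown word once
                left_over.append(word)
        else:
            tagged.append((word, tag))
    return tagged, left_over

text = "the jaguar captured the two young araras and the jaguar ate the two young araras"
tagged, left_over = tag_words(text.split(), DICTIONARY, N_LIST)
print(tagged)
print("left-over list:", left_over)
```

Here "ate" has no entry and lands on the left-over list, where the analyst must decide whether to assign it to a tag or to send it to the N list.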

The dictionary is tested further with the help of the bilingual output, or tag-list. This consists of a printout divided into two parts : on the left-hand side appears the text of the myths, line by line, with ID number and number of words ; on the right-hand side is the normalization of the text and its syntactic analysis. "Bilingual" in "bilingual output" refers to this translation of the data into normalized coding. A sentence like "The jaguar captured the (two) (young) araras" is recoded as "Animal, wild S(ubject) Dominance V(erb) Quantity O(bject) Age O(bject) Animal, bird O(bject)." A comparison of the normalization with the original text reveals to what degree the dictionary is reliable and which corrections have to be made in order to insure adequate recoding on the higher level of normalized description. It may be mentioned at this point in connection with the example that if the analyst wishes to test the significance of numbers or of age in the corpus, retrievals either of the relevant tags or of individual entries will immediately produce each sentence unit where they appear along with the rank order of the sentence in each myth. The dictionary is then ready to be used.
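A two-column line of this kind can be imitated with a short Python sketch. It is an approximation of the printout described above, not a reproduction of the General Inquirer's actual format : the names normalize and bilingual_line, the "LEFT-OVER" placeholder, the column separator, and the card ID "0000" are assumptions, and the dictionary holds only the five entries of the example.

```python
DICTIONARY = {
    "jaguar": "Animal, wild",
    "captured": "Dominance",
    "two": "Quantity",
    "young": "Age",
    "araras": "Animal, bird",
}

def split_token(token):
    """Return (word, marker); marker is None for unmarked words."""
    word, slash, marker = token.partition("/")
    return word, (marker.rstrip("+.") if slash else None)

def normalize(marked_sentence, dictionary):
    """Replace each marked word by its tag, keeping the syntax marker."""
    parts = []
    for token in marked_sentence.split():
        word, marker = split_token(token)
        if marker is None:
            continue                      # articles etc. carry no tag
        tag = dictionary.get(word.lower(), "LEFT-OVER")
        parts.append(f"{tag} {marker}")
    return " ".join(parts)

def bilingual_line(card_id, marked_sentence, dictionary):
    """Left: ID, word count, text. Right: normalized coding."""
    words = [split_token(token)[0] for token in marked_sentence.split()]
    left = f"{card_id} {len(words):3d}  {' '.join(words)}"
    return f"{left}  |  {normalize(marked_sentence, dictionary)}"

sentence = "The jaguar/S captured/V the two/O young/O araras/O."
print(bilingual_line("0000", sentence, DICTIONARY))
```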

The tag-tallies and the graphs are the next output. Tag-tallies supply word and sentence counts, and, at the same time, these scores are punched on a special deck of cards for subsequent statistical analysis. Document length is taken into account so that the scores indicate the respective ratio of components which describe the data, both on the basis of words and/or of analytic propositions as operational units. The graphs provide the same information in visual terms (for illustrations, see Dunphy et al. 1965 : 478, Fig. 5 ; Maranda 1966b). Low scores may suggest revisions of a paradigmatic set or of another, so that a preliminary run of a sample of the corpus is most useful at this stage.
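The principle of a length-adjusted tally can be shown in a few lines of Python. The exact scoring formula of the General Inquirer is not given here, so this sketch simply assumes that a score is the raw count of a tag expressed as a percentage of the operational units (words or analytic propositions) in the document ; the name tag_tally and the sample tags are hypothetical.

```python
from collections import Counter

def tag_tally(tags, unit_count):
    """Return {tag: (raw count, percent of operational units)}.

    tags: the tags assigned in one document, one item per occurrence.
    unit_count: document length in words or in analytic propositions.
    """
    if unit_count <= 0:
        raise ValueError("unit_count must be positive")
    counts = Counter(tags)
    return {tag: (n, 100.0 * n / unit_count) for tag, n in counts.items()}

# Made-up document of 10 units with three tagged occurrences.
tally = tag_tally(["Animal, wild", "Animal, wild", "Dominance"], 10)
for tag, (count, percent) in sorted(tally.items()):
    print(f"{tag:<15} {count:3d} {percent:6.1f}%")
```

Dividing by document length is what allows a short myth and a long one to be compared on the same scale.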

Thus far, the results obtained are all quantitative. In order to control their validity, I fed into the computer variants of the same narratives that differed only slightly : the quantitative descriptions of the data turned out to be accurate enough to discriminate between them (for a discussion, see Maranda 1966c). Scores were then compared across tribes and, in the case of the Sherente where we have a time dimension (late 1930's to late 1950's), from time 1 to time 2. Clear-cut lines of demarcation emerged. For instance, the Sherente, Eastern Timbira, and Apinayé emphasize supernatural beings, which the Cayapó underrate ; specific retrievals reveal that the function of supernatural beings is taken over by culture-heroes in Cayapó mythology. Likewise, the Cayapó are highly concerned with wild animals (6.2 percent of the dramatis personae of their myths), i.e., more than twice as much as the Sherente and Eastern Timbira, while the Apinayé score is the lowest of all (1.6 percent).

These very general references to results of my computerized analyses will have to mark the end of this paper. I decided to emphasize the preliminary conditions of such investigations because they are basic, and I would have had to mention them at any rate in order to justify any conclusion which I might have presented. Space [82] limitations also keep me from touching on the most important functions of individual retrievals, viz., to revise paradigmatic sets constituted earlier on the basis of nominal definitions, to establish semantic constellations, to do contingency analysis, to decide on levels of significance and to define syntagms. Likewise, I cannot report on my test of Propp's Morphology (1958) where results of a different order were obtained, especially the substantiation of Lévi-Strauss's (1960) and Bremond's (1964) criticisms and, on the other hand, the heuristic power of Propp's model to point out structural shifts in some narratives. Structural analyses could not be discussed either. My objective, as stated at the beginning, was to stress methodological aspects. Concrete cases, demonstrations, and discussions will be presented elsewhere and should eventually be fully elaborated in a book.

REFERENCES

Aarne, A. and S. Thompson

1961 The Types of the Folktale. Helsinki.

Bremond, C.

1964 Le Message Narratif. Communications, 4 : 4-32.

Colby, B. N.

1965 Cultural Patterns in Narratives. Science, 151 : 793-798.

Dundes, A.

1962 From Etic to Emic Units in the Structural Study of the Folktale. Journal of American Folklore, 75 : 95-105.

1965 On Computers and Folktales. Western Folklore, 24 : 185-190.

Dunphy, D., P. Stone and M. Smith

1965 The General Inquirer : Further Developments in a Computer System for Content Analysis of Verbal Data in the Social Sciences. Behavioral Science, 10 : 468-480.

Greimas, J.

1964 La Structure Élémentaire de la Signification en Linguistique. L'Homme, 4 : 5-17.

1966 Interpretation of Myths : Theory and Practice. In The Structural Analysis of Myths, P. and E. Maranda, eds. (forthcoming).

International Business Machines

1961 Index Organization for Information Retrieval. New York, IBM.

[83]

1962 Catalogue of Programs for IBM Data Processing Systems KWIC Index. New York, IBM.

Lévi-Strauss, C.

1955 The Structural Study of Myth. In Myth : A Symposium, T. A. Sebeok, ed. Bloomington, Ind. (Revised as Ch. XI of Anthropologie Structurale. Paris, 1958.)

1960 La Structure et la Forme. Cahiers de l'Institut de Science Economique Appliquée, 99 : 3-36.

Maranda, P.

1966a Quantitative and Qualitative Analysis of Myths : A Computerized Investigation of Gê Data. Proceedings of the International Symposium on the Use of Mathematical and Computational Methods in the Social Sciences (forthcoming).

1966b Sur l'Analyse Automatique de la Mythologie. L'Homme, 6 (forthcoming).

1966c Formal Analysis and Intra-Cultural Studies. In International Symposium on Cross-Cultural Research. Paris. (forthcoming).

Mathiot, M.

1966 Cognitive Analysis of a Myth — An Exercise in Method. In The Structural Analysis of Myths, P. and E. Maranda, eds. (forthcoming).

Matta, R. da

1962 Field Notes.

Maybury-Lewis, D.

1958 Sherente Myths. Ms.

Melatti, C.

1962 Field Notes.

Métraux, A.

1960 Mythes et Contes des Indiens Cayapó. Revista do Museu Paulista, 12.

Nimuendaju, C.

1939 The Apinayé. The Catholic University of America, Anthropological Series, No. 8. Washington, D. C.

1942 The Serente. Publications of the Frederick Webb Hodge Anniversary Publication Fund 4. Los Angeles.

1944 Serente Tales. Journal of American Folklore, 57.

1946 The Eastern Timbira. University of California Publications in Anthropology and Ethnology 41. Berkeley-Los Angeles.

Propp, V.

1958 Morphology of the Folktale. Publication 10, Indiana University Research Center in Anthropology, Folklore, and Linguistics. Bloomington.

Schultz, H.

1950 Lendas dos Indios Krahó. Revista do Museu Paulista 4.

Stone, P., ed.

1966 The General Inquirer. Cambridge, Mass., Massachusetts Institute of Technology (forthcoming).

Thompson, S.

1958 Motif-Index of Folk Literature. 6 vols. Bloomington, Indiana University Press.

[1] My work was made possible thanks to the support of David Maybury-Lewis, director of the Harvard Central Brazil Project, that of the Laboratory of Social Relations of Harvard University (Fourth Pilot Grant ; free computer time as subsidized by the National Science Foundation Grant GP-2723), and with the most valuable help of the staff of the Office of the General Inquirer at Harvard, especially Philip Stone and Dexter Dunphy. Professor John Whiting also contributed helpful suggestions on the cutting of analytic units and discussed procedures with my wife and me. I wish to express my gratitude to all of them.

[2] For another use of the same program, see Colby 1965.