Modern Wendat Lexicography: Using XML to Reflect the Grammar and Lexicon of an Iroquoian Language1
Building dictionaries with tools and methods emerging from Eurocentric traditions has proved problematic for Indigenous languages. We are building a dictionary for Wendat, an Iroquoian language formerly known as Huron that is being reawakened in Wendake, Québec. There are twelve manuscript dictionaries and lexicons for Wendat, created by missionaries during the seventeenth and eighteenth centuries. We are encoding the manuscripts using a standard Text Encoding Initiative (TEI) schema. However, when we came to create and encode a modern reconstructed Wendat dictionary, we were overly constrained by Eurocentric structures and assumptions inherent to TEI. Building our own custom XML schema allows us to better reflect Wendat grammar, responding to community needs and our evolving understandings of the language. This article describes the development of this schema, based on analysis of the archival documentation and related languages. Through this discussion, we will exemplify the schema we built and address the points of friction between TEI and Wendat grammatical structures. Our custom schema enables us to elegantly and economically represent exactly what our analysis of the language reveals, capturing elements of the language such as event-verb consequentiality, conjugation class, and stems, while avoiding incompatible elements and assumptions.
Indigenous lexicography, Iroquoian linguistics, Huron, text encoding, XML, TEI
INTRODUCTION
Building dictionaries with tools and methods emerging from Eurocentric traditions has proved problematic for Indigenous languages (Sear and Turin 2021), including the technologies used to encode and structure this information. We are building a dictionary for Wendat, an Iroquoian language that is being reawakened from dormancy in Wendake, Québec. There are twelve manuscript dictionaries and lexicons for Wendat, created by missionaries during the seventeenth and eighteenth centuries. We are encoding the manuscripts using a standard TEI schema, although our encoding is basic.2 However, when we came to create a reconstructed Wendat dictionary, using Extensible Markup Language (XML) for reasons of sustainability and interoperability with the primary sources, we were overly constrained by Eurocentric structures and assumptions inherent to TEI. Building our own custom XML schema allows us to better reflect Wendat grammar, responding to community reclamation needs and our evolving understandings of the language. In other words, while the TEI schema is more than adequate for encoding the historical manuscripts, it has proved quite unsuitable for the reconstructed dictionary, so we are building our own schema for that work.
We come to this work with shared goals, but from different perspectives and with different skill sets. Megan Lukaniec is a member of the Huron-Wendat Nation of Wendake, Québec and an Assistant Professor of Indigenous Language Revitalization in the Indigenous Studies Program at the University of Victoria on unceded WSÁNEĆ territories. She has been working with and for her community to reawaken and reclaim the Wendat language since 2006 and is responsible for compiling and editing this dictionary and managing the project. Martin Holmes is a programmer in the University of Victoria Humanities Computing and Media Centre. He is the lead programmer on several large digital edition projects and is part of the Project Endings team. He served on the TEI Technical Council from 2010 to 2015 and was managing editor of the Journal of the Text Encoding Initiative from 2013 to 2015. He is responsible for providing essential technical support for the project and maintaining the documentation.
This paper describes the iterative development of this schema for the modern reconstructed Wendat dictionary, which is based on analysis of the Wendat archival materials and the documentation of related languages. Our custom schema enables us to economically represent exactly what our analysis of the language reveals, adding components to capture elements of the language such as event-verb consequentiality, conjugation class, and stems, while avoiding incompatible elements and assumptions. This paper is organized as follows. The next section provides background information about the Wendat language, the legacy dictionaries, and this project. In the following sections, we discuss recent developments in Indigenous language lexicography, briefly describe the dictionary chapter and module of the TEI P5 standard, and highlight some of the most important mismatches between TEI and Wendat grammar, along with how we chose to remedy them. Finally, we offer some conclusions.
WENDAT LANGUAGE AND LEXICOGRAPHY
The Wendat language is the ancestral language of the Wendat people, also known as the Huron-Wendat Nation of Wendake, Québec. Although it was formerly a lingua franca for the Great Lakes region (Trigger 1976/1987), the Wendat language, called la langue huronne by the French, fell dormant between the second half of the nineteenth century and the early twentieth century. Since 2007, it has gradually been reawakened in Wendake through the detailed analysis and interpretation of archival documentation. The Wendat language is a member of the Iroquoian language family, and its closest relative is Wandat (or Wyandot), which is being revitalized in Wyandot(te) communities in Oklahoma, Kansas, and Michigan.
There is a long history of Wendat lexicography, starting with the publication of the Récollet Frère Gabriel Sagard's French–Wendat dictionary in Paris in 1632. Aside from this early dictionary, there are eleven other legacy dictionaries of the Wendat language which span from the early seventeenth century to the year 1800. Ten of these eleven dictionaries were created by Jesuit missionaries who were attempting to convert Wendat people to Christianity and who were using the dictionaries and accompanying grammars and translations of Christian texts as field guides for their own study of the language. These legacy dictionaries, which total close to 3,500 pages of documentation, include both French–Wendat and Wendat–French dictionaries. They contain a vast amount of information about the lexicon and grammar of the language, since the (Jesuit) missionaries had what was then considered to be advanced scholarly training with knowledge of Latin and Greek (Hanzeli 1969). In this excerpt from a seventeenth-century French–Wendat dictionary attributed to the Jesuit Père Chaumonot in Figure 1, we can see detailed information about meaning, grammar, and usage.
Although these legacy dictionaries contain a wealth of information, the primarily Francophone missionaries were not always able to discern the distinctive sounds of the Wendat language. Therefore, the Wendat forms in these dictionaries are sometimes missing phonemic distinctions in the language (e.g., the presence of aspiration h and glottal stops ʔ); in other contexts, some sequences of symbols are ambiguous. These errors are identified and repaired through a historical-comparative process called reclamation-driven reconstruction (Lukaniec, in press) which relies on cognates from other Northern Iroquoian languages and knowledge of sound changes in these sister languages.
Aside from the errors introduced by the compilers of these dictionaries, these manuscripts are not well suited for direct use in language reclamation work. In their current forms, they cannot be searched, edited, or easily repurposed. Furthermore, they contain numerous references to evangelization efforts and represent a particularly different period of history for the Wendat people. For all of these reasons, we decided to focus our efforts on creating a born-digital dictionary of modern, reconstructed Wendat that was built for, with, and by Wendat people.
Driven by community needs for a flexible, accessible, and sustainable dictionary, we were awarded an Insight Development Grant from the Social Sciences and Humanities Research Council of Canada in 2020 for Project Kwakwendahchondiahk 'We are preparing our words'. This lexicography project, which also includes the transcription and encoding of the legacy dictionaries, will lead to more reconstructed content for language learners. The project is supported by the Wendat Language Committee, a body of the Council of the Huron-Wendat Nation that deliberates on all matters related to language, including standardizing orthography and adopting lexical innovations. The committee members act as project consultants to shape the dictionary structure.
INDIGENOUS LANGUAGE LEXICOGRAPHY
The creation of Indigenous language dictionaries, which were "essential to the fabric of the imperial project" (Anderson 2020, 33), began as soon as the colonizers landed on the shores of North America. Lexicography for Indigenous languages continues to this day, evolving from these earlier, colonial counterparts to resources that are designed explicitly for the use of Indigenous communities and with the direct involvement of Indigenous language speakers and learners. Discussing the tensions between variation and standardization, Rice and Saxon (2002, 153) call for the "true involvement of a community" in shaping a dictionary. Similarly, Anderson (2020, 7) describes the development of a dictionary of Tunica in a framework they describe as "revitalization lexicography" in which dictionary compilation "is only ethical and effective if it is undertaken as part of a larger community-driven revitalization movement." Hinton and Weigel (2002, 155–56) highlight the differences between the needs of communities and the desires of academic linguists, describing the dictionary as a "critical reference" and a "repository of tribal identity."
While any lexicographic project inevitably faces challenges, the particular circumstances of Indigenous language lexicography—pertaining to the language, its structure, and the needs of the community to whom it belongs—cannot be easily remedied by Western lexicographic traditions. For example, one of the often-cited divergences between Western and Indigenous language lexicography relates to the choice of headword form (Munro 2002). In a dictionary of the Iroquoian language Cherokee, in which all verb bases are bound forms, Pulte and Feeling (2002) chose to use a conjugated form of the verb as the entry headword, specifically a form with the third-person singular pronominal prefix and a present-tense suffix. For another Iroquoian language, Oneida, Abbott (1998, 129) explains that since there is "no simple basic form such as an infinitive" and "no consensus among native speaker intuitions and opinions," they chose to use the bound verb base as the headword of entries in the Abbott, Christjohn, and Hinton (1996) dictionary.
The choice of technology for such Indigenous lexicographic projects is not always straightforward either. Many of these lexicographic tools are "designed to be all things to all people" (Rehg 2018, 312) and, like other technologies, they may have "assumptions baked into [them that] limit their functionality for different languages" (Carpenter et al. 2017, 3). Furthermore, while it is important to have structured data (Thieberger 2011), not all database software makes use of Unicode or allows for adequate customization (Anderson 2020). Some Indigenous language dictionary projects choose FieldWorks Language Explorer (FLEx), created by the missionary organization SIL (e.g., the Mutsun dictionary described in Warner, Butler, and Luna-Costillas 2006 and Butler and van Volkinburg 2007, and the Tunica dictionary described in Anderson 2020). However, there can be drawbacks to FLEx when building a dictionary from archival documentation (see Lukaniec 2022). Others use Miromaa (McKenny, Genetti, and Chacon 2013), which was created by Miromaa Aboriginal Language & Technology Centre (MALTC), an Indigenous organization of Australia, for creating resources for community-based linguistic and cultural knowledge. XML is also increasingly being used to encode Indigenous language dictionaries (Garrett 2011, Spence 2021). Some projects choose TEI XML specifically for this purpose, since it is a "mature, reliable, flexible standard […] for the production of lexical resources" (Czaykowska-Higgins, Holmes, and Kell 2014, 1); these include the Lushootseed dictionary (Bates and Lonsdale 2010), the Nxaʔamxcín dictionary (Czaykowska-Higgins, Holmes, and Kell 2014), and the Mixtepec-Mixtec dictionary (Bowers and Romary 2018).
THE TEI DICTIONARY CHAPTER AND MODULE
The Text Encoding Initiative (TEI) is "a nonprofit membership organization composed of academic institutions, research projects, and individual scholars from around the world" which "collectively develops and maintains a standard for the representation of texts in digital form" (https://tei-c.org/). Its aim is to cover "every type of data created and used by researchers in the Humanities, such as source texts, manuscripts, archival documents, ancient inscriptions, and many others" (Burnard 2014, 1). As such, dictionaries are included in the TEI standard and are described in a separate chapter of the guidelines.
Indeed, from its earliest days, the Text Encoding Initiative has attempted to provide encoding support for lexical resources. The first schema proposal, P1 (1991), says:
Within the community of computing lexicologists, two broad paradigms have emerged: the scriptural and the prophetic […]. For those guided by the scriptures, the electronic dictionary is only a rather more flexible version of a printed dictionary. Their concern is with the management of strings of characters, whether derived directly from publishers' typesetting tapes or scanned or retyped from printed originals. For those guided by the prophets, what count are the underlying linguistic phenomena, of which the printed dictionary is one possible reflection among many. Their concern is with the management of linguistic concepts as actualised in electronic lexica, perhaps never intended for direct human consumption. At present, we propose Guidelines for the former school only, since it is here that most effort has already been expended […]
(P175.SCR, Text Encoding Initiative Consortium 1990)
This bias has never been fully addressed since. In P2 (1993), the Dictionaries chapter is absent, being shown in the Table of Contents as "in preparation." In both P3 (1994) and P4 (2002), the chapter is unambiguously headed "Print Dictionaries." In P5 (first released in 2007, and steadily updated since), the chapter is again titled "Dictionaries." There is now an explicit acknowledgment that born-digital resources are supported:
Dictionaries are most familiar in their printed form; however, increasing numbers of dictionaries exist also in electronic forms which are independent of any particular printed form, but from which various displays can be produced.
Nevertheless, the Guidelines accept that there is an inherent problem in balancing desirable structural constraints with the realities of both print and born-digital lexicography:
First, because the structure of dictionary entries varies widely both among and within dictionaries, the simplest way for an encoding scheme to accommodate the entire range of structures actually encountered is to allow virtually any element to appear virtually anywhere in a dictionary entry. It is clear, however, that strong and consistent structural principles do govern the vast majority of conventional dictionaries, as well as many or most entries even in more 'exotic' dictionaries; encoding guidelines should include these structural principles. We therefore define two distinct elements for dictionary entries, one (entry) which captures the regularities of many conventional dictionary entries, and a second (entryFree) which uses the same elements, but allows them to combine much more freely. It is however recommended that entry be used in preference to entryFree wherever possible.
Acknowledging the Eurocentric origins of these standards in their framing of "conventional dictionaries" (i.e., presumably those of European languages) versus "exotic" ones, the Guidelines recognize that the structural expectations arising from Western print dictionaries may be problematic. For this reason, we use <entryFree> when encoding the Wendat legacy dictionaries. But the sequence or structure of elements is not the main issue. Among the many dictionary-related elements now available for encoding are <case>, <colloc>, <def>, <etym>, <form>, <gen>, <gram>, <gramGrp>, <hom>, <hyph>, <iType>, <lang>, <lbl>, <mood>, <number>, <oRef>, <orth>, <pRef>, <per>, <pos>, <pron>, <re>, <sense>, <subc>, <superEntry>, <syll>, <tns>, <usg>, and <xr>. Many of these elements display their Western, Indo-European-focused linguistic origins, which are ill suited for an Iroquoian language such as Wendat. Grammatical categories such as <case>, <gen> ('gender', specifically for morphological gender on nouns), and <tns> ('tense') do not map easily onto Wendat. The elements that are absent from TEI, such as aspect, noun incorporation, or voice, are also notable, in that these absences point to the Eurocentric assumptions inherent to this schema.
ENCODING WENDAT GRAMMAR IN XML
Like other Iroquoian languages, Wendat is a polysynthetic language, meaning that core arguments are represented directly on the verb through pronominal prefixes, and verbs are the equivalents of clauses in languages like English and French. Since Wendat verbs are complex and pose the greatest challenges for the XML encoding, we limit our discussion here to verbal entries and the most salient mismatches between TEI and Wendat grammatical structure.
Modern Northern Iroquoian dictionaries use bases as headwords. Bases are defined as "any combination of components that has properties that are unpredictable, whether these unpredictable properties are structural (i.e., have to do with form) or semantic (i.e., have to do with meaning)" (Michelson and Doxtator 2002, 15). Because of the polysynthetic nature of verbs, verb bases are bound units and require the addition of at least a pronominal prefix and an aspect-mood suffix. With over fifty pronominal prefixes to choose from in any given Northern Iroquoian language and no infinitive-like grammatical structure, it would be difficult to choose one inflected form as the "basic" or "unmarked" form (Abbott 1998), leading to the current practice of using a bound verb base as the headword for verbal dictionary entries, as is adopted for the Wendat dictionary.
The verb base for Wendat, as shown in Figure 2, includes all derivational elements of the verb, including the verb root and any voice prefixes, incorporated noun base, and derivational suffixes. While this figure suggests that the definition of a base relies on the position of an element in the verbal template, it actually relies on the distinction between derivation and inflection, where derivational processes (such as the addition of any of the morphemes in the slots named above) typically result in lexicalized forms that are non-compositional.
Following the Oneida dictionary guidelines of Michelson and Doxtator (2002), it is also possible for a Wendat verb base to be discontinuous if the verb requires the use of one of the seven (non-modal) prepronominal prefixes or if the addition of a prepronominal prefix results in a non-compositional meaning. An example of a discontinuous Wendat verb base, -rihwahkwahnd-/-rihwahkwahn- + duplicative 'to go and sing', appears in example (1).
(1) Tahchrihwahkwahndah!
tachrihȣa'kȣa'nda
t-a-hs-rihw-a-hkw-a-hn-ah
dupl-cis-2sg.agt.imp-matter-join-pick.up-join-disloc-imp
« viens ici chanter »
'Come here and sing!'
(Ms 60 n.d., 55, adapted from Lukaniec 2018, 166)4
Parallel to Pulte and Feeling's (2002) description of the compilation of the Cherokee dictionary, we know from discussions with the Wendat Language Committee that it is important to provide enough information "to enable a dictionary user to predict a complete paradigm for any […] verb" (64). Following that criterion, we designed the schema for encoding verbs to represent all the necessary and relevant grammatical and semantic characteristics of a verb for language learners. For Northern Iroquoian languages like Wendat, the necessary grammatical information includes which initial segments the verb base or stem begins with (or, the conjugation class), which paradigms of pronominal prefixes the verb takes with which aspect-moods (or, the verb class), the forms of the aspect-mood suffixes, and which derivational prepronominal prefixes are required, if any, with the verb (Abbott 1998). These grammatical specifications, which show a dictionary user how to conjugate a verb, as demonstrated below, are not easy to encode in TEI, and out of these specifications, the forms of aspect-mood suffixes diverge the most from TEI affordances. To further contextualize this discussion of the aspect-mood suffixes, we provide a rough sketch of the overall structure of entries for verb bases.
Brief overview of Wendat verb entries
As shown in Figure 3, each verb is wrapped in the element <entry>.5 Each entry has a <form> element that provides information about the headword, followed by a <gram> element that contains all the grammatical specifications that are true for the entry as a whole (as opposed to grammatical information that can be specific to a particular form, stem, or meaning). The <sense> element wraps the information related to the meaning of the headword, including glosses and definitions in English and French, along with any grammatical information particular to that sense. This is followed by the element <stems>, which groups the verb stems that are built off the headword in combination with aspect-mood suffixes and, possibly, expanded aspect-mood suffixes. Two <xrs> elements contain collections of <xr> ('cross references'), including one for manuscripts (type="mss") that links to attestations of that verb base in the Wendat manuscripts and one for cognates (type="cogs") that links to references of cognate forms in other Northern Iroquoian languages as part of the reconstruction process. Finally, a <note> element captures relevant information for the dictionary user (type="composition" or type="usage") or lexicographer (type="internal") in English or French.
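The entry layout described above can be sketched as a bare skeleton. This is only an illustration under stated assumptions, not the project's actual schema: the element names and the type values "mss", "cogs", and "usage" come from the description above, while the empty elements, the single <sense>, and the helper name `child_sequence` are hypothetical simplifications.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton of a verb entry. The real encoding (Figure 3,
# Appendix A) is richer and may differ in detail.
ENTRY_SKELETON = """
<entry>
  <form/>             <!-- headword (verb base) -->
  <gram/>             <!-- grammar true of the entry as a whole -->
  <sense/>            <!-- glosses, definitions, sense-level grammar -->
  <stems/>            <!-- verb stems built off the headword -->
  <xrs type="mss"/>   <!-- attestations in the manuscripts -->
  <xrs type="cogs"/>  <!-- cognates in other Northern Iroquoian languages -->
  <note type="usage"/>
</entry>
"""


def child_sequence(entry_xml: str) -> list:
    """Return the tag names of an entry's direct children, in document order."""
    root = ET.fromstring(entry_xml)
    return [child.tag for child in root]


print(child_sequence(ENTRY_SKELETON))
```

Listing the children in order makes the intended sequence of components easy to check against any one encoded entry.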
A fully encoded entry for the Wendat verb -arahta-/-arahtat- + duplicative 'to run; to flee from someone' appears in Appendix A. This entry follows this basic structure but is much more complex since it includes all the descendant elements of the <entry> and the typical attributes used in verb entries.
Encoding aspect-mood information in Wendat: The <stems> element
As mentioned earlier, a Wendat verb base is bound and, at the minimum, must combine with a pronominal prefix and an aspect-mood suffix to become a fully inflected word, as shown in the conjugated example (1) above. There are five aspect-moods in Wendat: the habitual, stative, punctual, imperative, and purposive.6 The habitual aspect-mood refers to a regularly occurring event, generic activity, or, in the case of a subclass of event verbs (consequential verbs), ongoing actions. The stative encodes states of being, continuous action, or again, with consequential verbs, the present consequences of a past event (a present perfect). The punctual, equivalent to the perfective cross-linguistically, is "used to describe an event or action in its totality" (Lukaniec 2018, 108) and must co-occur with one of three modal prefixes (factual, future, or optative) specifying "how the speaker perceives the event's certainty or probability of occurring" (Lukaniec 2018, 109). The last two, more mood-like, are the imperative, which is used in commands, and the purposive, which refers to "the fact that the participant in question intends to accomplish a certain action, or that the participant's purpose is to do said action" (Lukaniec 2018, 119–20).
The expanded aspect-mood suffixes in Wendat include the progressive, the continuative, the past, and the stative distributive, all of which combine with the verb stem to contribute to and build upon the aspect, mood, tense, or distributive meanings conveyed by any given verb.
Not all aspect-moods co-occur with any given verb, but these constraints are described through verb classes. There are three major verb classes in Wendat: 1) stative verbs, which take only the stative aspect-mood; 2) (non-motion) event verbs, which co-occur with the habitual, stative, punctual, and imperative aspect-moods; and 3) motion verbs, which are technically a subtype of event verbs that occur with those same four aspect-moods in addition to the purposive. Aside from needing to identify which aspect-moods occur with a verb base, it is not possible to predict which phonological forms of the aspect-mood suffixes co-occur with any given verb base, as they are (mostly) lexically conditioned.7
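The co-occurrence constraints just described can be written down as a small lookup table. The sketch below is illustrative only: the dictionary keys and the function name `allows` are hypothetical labels rather than values from the project's taxonomy, and the table says nothing about the phonological forms of the suffixes, which are (mostly) lexically conditioned and must be recorded per verb.

```python
# Aspect-moods that co-occur with each major verb class, as described above.
EVENT_ASPECT_MOODS = ("habitual", "stative", "punctual", "imperative")

ASPECT_MOODS_BY_CLASS = {
    "stative": ("stative",),                         # stative verbs
    "event": EVENT_ASPECT_MOODS,                     # (non-motion) event verbs
    "motion": EVENT_ASPECT_MOODS + ("purposive",),   # event set plus purposive
}


def allows(verb_class: str, aspect_mood: str) -> bool:
    """Return True if the verb class co-occurs with the aspect-mood."""
    return aspect_mood in ASPECT_MOODS_BY_CLASS[verb_class]


print(allows("motion", "purposive"))
print(allows("event", "purposive"))
```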
Northern Iroquoian dictionaries have handled these aspect-mood forms in various ways, either through using ordered sequences, supplying references to suffix classes, or providing conjugated examples. For example, Chafe's (1967) dictionary of Seneca lists the forms of the aspect-mood suffixes in parentheses in a predetermined order specified in the front matter. The Oneida dictionary of Michelson and Doxtator (2002) provides the name of the aspect conjugation classes near the end of each verb entry (yet each class also contains a lot of combinatorial possibilities, since these aspect-mood suffixes do not often occur in neat sets). The Onondaga dictionary by Woodbury (2003) makes use of both strategies. For example, the Onondaga cognate of this Wendat verb base is .aæhdat-/.aæhdad- + dualic 'run', and its aspect-mood suffixes are specified through a class name ("Aspect Class E3"), referring to a table in the front matter of the dictionary, as well as through a sequence in brackets ("[-ffis, -ih, -ffiØ]"8) listing the habitual, stative, and punctual suffixes in order (the form of the punctual can be used to predict the form of the imperative) (Woodbury 2003, 75). Yet another possibility is to simply provide at least one example of the verb base conjugated in each of the aspect-moods, which became the preferred lexicographic strategy employed by Abbott, Christjohn, and Hinton (1996) in their Oneida dictionary, as described in Abbott 1998. Examples of each verb base in each attested aspect-mood are also found in Michelson and Doxtator 2002 and Woodbury 2003.
While these three strategies to treat the aspect-mood suffixes are well established in the modern Northern Iroquoian dictionaries, there are two important differences between the development of these dictionaries and the current development of the Wendat one. First, unlike other Northern Iroquoian languages which still have speakers, Wendat is being reawakened exclusively using archival documentation. The linguistic ecology of the Wendat community is such that we are creating the dictionary solely for language learners, and not for fluent speakers. Given that there is a small (but growing) level of language proficiency and metalinguistic knowledge that varies extensively among individuals, we need to provide an abundance of information (and multiple ways to present the same information) to best support individuals at varying levels of language knowledge. Second, the dictionary we are designing is a digital dictionary. Digital dictionaries do not have the same space limitations as print dictionaries, can be queried through a search engine (Dyck and Kumar 2012), can be easily reversed so that the user can look up entries using a different language, and can make use of hyperlinks to encode relationships between different parts of the content (Garrett 2018). The Oneida dictionary of Abbott, Christjohn, and Hinton (1996) was updated and expanded in 2006 to become the first online Northern Iroquoian dictionary, and it has features that the other dictionaries cannot have as print references. For example, it is possible to query the database of entries, and each verb entry has sound files accompanying the conjugated examples. Overall, digital interfaces offer innumerable novel and helpful ways to retrieve, filter, group, list, and sort dictionary entries.
Given these circumstances, we developed a different way to encode aspect-mood information for verbs that presents dictionary users with the verb stems for each aspect-mood and any expanded aspect-moods. As illustrated in Figure 4, which displays the encoding of aspect-mood for the verb entry -arahta-/-arahtat- + duplicative 'to run; to flee from someone', this information is all contained by our custom element <stems>, which contains individual <stem> elements. Each <stem> carries the attributes @attested, indicating whether the verb stem is attested in the manuscripts, and @xml:id, through which we give the stem a unique identifier so that we can link to it.
The first <stem> of this entry is the habitual stem, or in other words, the continuous verb base combined with the -s habitual suffix. There are two variants of the verb stem that are attested in the manuscripts, -arahtas and -arahtats, showing that a final ts cluster can optionally reduce to s. These variants are encoded in <form> elements with an @orth attribute ('orthography') set to the standardized Wendat orthography ("stan"). The <hyph> ('hyphenation') element, adapted from TEI, contains the morphological analysis of the verb stem. Here, this means that we have two <m> ('morpheme') elements that are linked directly to entries for those morphemes: the first to the two <form> elements of the current verb entry (-arahta- and -arahtat-) and the second to the dictionary entry for the -s habitual suffix. This linking mechanism allows us to establish relationships between entries that can then become hyperlinks in the online dictionary. The remaining two elements of the <stem> are <gram> ('grammatical information') and <xr> ('cross reference'). In our <gram> element, we developed another custom attribute @stemType whose values are constrained by our taxonomy of all possible stem types in Wendat. Here, we have selected the habitual stem ("sttHab" for 'stem type habitual'). Finally, the <xr> element points to a reconstructed example that illustrates the use and meaning of this verb stem in a fully conjugated word (i.e., with the addition of at least a pronominal prefix).
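To make the parts of a single <stem> concrete, the following sketch parses a hypothetical habitual stem and pulls out the pieces named above. The element names, @attested, @xml:id, @orth="stan", and @stemType="sttHab" are taken from the description; everything else is an assumption made for illustration, including the id values, the "true" value of @attested, the pointer attributes `corresp` and `target`, and the empty <m> elements. The encoding in Figure 4 may differ.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical encoding of the habitual stem; pointer attributes and ids
# are invented for this sketch.
STEM_XML = """
<stem attested="true" xml:id="arahta_hab">
  <form orth="stan">-arahtas</form>
  <form orth="stan">-arahtats</form>
  <hyph>
    <m corresp="#arahta_f1 #arahta_f2"/>
    <m corresp="#s_hab"/>
  </hyph>
  <gram stemType="sttHab"/>
  <xr target="#ex_arahta_hab"/>
</stem>
"""


def summarize_stem(stem_xml: str) -> dict:
    """Collect the id, forms, stem type, and morpheme links of one stem."""
    stem = ET.fromstring(stem_xml)
    return {
        "id": stem.get(XML_ID),
        "attested": stem.get("attested"),
        "forms": [f.text for f in stem.findall("form") if f.get("orth") == "stan"],
        "stem_type": stem.find("gram").get("stemType"),
        "morpheme_links": [m.get("corresp").split() for m in stem.findall("hyph/m")],
    }


print(summarize_stem(STEM_XML))
```

Because each morpheme carries pointers rather than copied text, a rendering layer can turn them into hyperlinks to the corresponding entries.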
As in other modern Iroquoian dictionaries, the <stem> elements appear in a predictable order for any (non-motion) event verb: the habitual, stative, punctual, and imperative. Furthermore, each stem points to one or more curated examples to illustrate the stems in fully conjugated words. The major difference between our encoding and most other Northern Iroquoian language dictionaries (aside from Abbott, Christjohn, and Hinton 1996) is that we provide the full verb stem for users rather than sequences of suffixes or references to tables with such suffixes. In that sense, we remove a potential barrier for language learners, who would otherwise have to understand how the disparate pieces connect in order to use the forms correctly. We hope that our encoding thereby leaves less ambiguity for learners.
Our rendering of the <stems> element is still in progress, but a mock-up of such a rendering for the Wendat–English portion appears in Figure 5. Each underlined item in this figure is linked to other content in the dictionary. In the case of information specific to this entry (such as the individual morphemes in examples and the manuscript sources), the hyperlink will bring the user to the entry for those individual morphemes or to the manuscript transcriptions. In the case of category names or grammatical terms (such as Stem Type, Habitual, Punctual, etc.), the hyperlinks will direct users toward our glossary, which contains definitions and examples of these terms. Since the target users of the dictionary, Wendat language learners, will differ in their knowledge of the language as well as their familiarity with linguistic terminology, we will provide alternate views9 (Basic, Intermediate, and Full) of stems, as we do with the overall entries, which can either display or filter out information depending on the individual needs of the user. Figure 5 is a representation of the Full view, but in the Basic view of the <stems> rendering, examples will be streamlined, hiding the interlinear analysis and any manuscript source information and leaving only the most essential information: the form in Wendat along with its translation.
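The difference between the Basic and Full views of an example can be sketched as a simple field filter. This is a hypothetical illustration, not the project's rendering code: the field names and the function `render_example` are invented here, the sample data reuses example (1), and the Intermediate view is left out because its contents are not specified in the text.

```python
# Sample data taken from example (1); field names are hypothetical.
EXAMPLE = {
    "wendat": "Tahchrihwahkwahndah!",
    "translation": "Come here and sing!",
    "interlinear": "t-a-hs-rihw-a-hkw-a-hn-ah",
    "source": "Ms 60 n.d., 55",
}

# Fields shown in each view. Basic hides the interlinear analysis and the
# manuscript source; Full shows everything.
VIEW_FIELDS = {
    "basic": ("wendat", "translation"),
    "full": ("wendat", "translation", "interlinear", "source"),
}


def render_example(example: dict, view: str) -> dict:
    """Return only the fields of an example that the chosen view displays."""
    return {field: example[field] for field in VIEW_FIELDS[view]}


print(render_example(EXAMPLE, "basic"))
```

Keeping a single underlying record and filtering at display time means that one encoding can serve learners at different levels.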
Careful readers will notice that the rendering of this portion of the dictionary entry contains information about pronominal prefix paradigm selection that is not encoded in the <stems> element of our XML schema. This information is pulled from the @vClass ('verb class') attribute in the <gram> element, which is a child of the <sense> element. The verb class specifies the pattern of pronominal prefix marking for all aspect-moods. In the stems portion of the dictionary entry, this pattern is broken down by individual aspect-mood, so that dictionary users are not expected to deduce the pronominal prefix marking for each aspect-mood from the name of the verb class alone (although we expect that some dictionary users will be able to do so). Since there are two senses of this particular verb base and each sense has a different verb class (it is a shift event verb with the meaning 'to run' and a transitive event verb with the meaning 'to flee from someone'), the pronominal marking patterns for both senses are listed. The structuring and rendering of this information, which is essential to knowing how to conjugate a Wendat verb, takes advantage of two affordances of online dictionaries. First, there are no limitations in terms of page length, although the spatial organization of a web page still needs to be carefully considered for screens and devices of different sizes. Second, hyperlinking can help users understand how to use the dictionary (with links to the glossary) and notice the relationships between different elements of the language.
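The lookup described in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's rendering code: the element and attribute names <entry>, <sense>, <gram>, and @vClass come from the discussion above, while the <def> element, the @n attribute, the identifier, and the verb class labels are hypothetical placeholders. The custom namespace mentioned in note 5 is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry. Only <entry>, <sense>, <gram>, and @vClass are taken
# from the article; <def>, @n, the id, and the class labels are placeholders.
ENTRY_XML = """
<entry xml:id="demo-entry">
  <sense n="1">
    <gram vClass="shift-event"/>
    <def>to run</def>
  </sense>
  <sense n="2">
    <gram vClass="transitive-event"/>
    <def>to flee from someone</def>
  </sense>
</entry>
"""


def verb_classes_by_sense(entry):
    """Return (sense number, verb class) pairs for one entry."""
    pairs = []
    for sense in entry.findall("sense"):
        gram = sense.find("gram")  # <gram> is a direct child of <sense>
        if gram is not None and gram.get("vClass"):
            pairs.append((sense.get("n"), gram.get("vClass")))
    return pairs


entry = ET.fromstring(ENTRY_XML)
classes = verb_classes_by_sense(entry)
```

A renderer working along these lines could then expand each verb class into its pronominal marking pattern per aspect-mood when it displays the stems of the entry.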
The <stem> elements also allow us to capture the fact that expanded aspect-moods are built off existing verb stems. For example, the past suffix, one of the four expanded aspect-moods, attaches to either a habitual, stative, or purposive stem in Wendat, whereas the continuative suffix attaches to only a habitual or stative stem, but appears with a following punctual or imperative suffix and either an optative or future modal prefix (Lukaniec 2018). The stative verb base -ienter- 'to know something or someone' is attested with both the -nen' past suffix and the -k continuative suffix, and the encoding of the stative, stative past, and stative continuative stems appears in Figure 6. The stative verb stem is encoded the same way as the other <stem> elements in Figure 4, but the stative past and stative continuative stems are encoded as nested stems, or in other words, as children of the stative <stem> element to capture their relationships to this stem.
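To make the nesting concrete, the following Python sketch parses a schematic fragment and recovers, for each stem, the stem it is built on. It does not reproduce Figure 6: the identifiers and the @stemType values are hypothetical placeholders (in the actual schema, @stemType points into a taxonomy), the Wendat forms are left out entirely, and the custom namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Schematic fragment: the stative past and stative continuative stems are
# children of the stative stem. Ids and stemType values are placeholders.
STEMS_XML = """
<stems>
  <stem xml:id="st-stative" attested="true">
    <gram stemType="stative"/>
    <stem xml:id="st-stative-past" attested="true">
      <gram stemType="stativePast"/>
    </stem>
    <stem xml:id="st-stative-cont" attested="true">
      <gram stemType="stativeContinuative"/>
    </stem>
  </stem>
</stems>
"""


def stem_parents(stems):
    """Map each stem type to the type of the stem it is nested in (or None)."""
    result = {}

    def walk(stem, parent_type):
        stem_type = stem.find("gram").get("stemType")
        result[stem_type] = parent_type
        for child in stem.findall("stem"):  # direct children only
            walk(child, stem_type)

    for top in stems.findall("stem"):
        walk(top, None)
    return result


parents = stem_parents(ET.fromstring(STEMS_XML))
```

Because the relationship is carried by the element hierarchy itself, no separate pointer from the expanded stem back to its base stem is needed in this sketch.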
In some ways, it is possible to wrangle TEI to produce the equivalent of the <stem> elements (but not the <stems> wrapper element). Figure 7 contains a possible TEI-compliant encoding of the <stem> element for the habitual stem of -arahta-/-arahtat- + duplicative 'to run; to flee from someone' using the <re> ('related entry') element. The <re> element is intended to encode a "lexical item related to the headword, such as a compound phrase or derived form" with a @type attribute that can delineate between different varieties of related entries (Text Encoding Initiative Consortium 2023c). Our @attested attribute in <stem> could be re-encoded through the presence or absence of a @source attribute, suggesting attested versus unattested, respectively, although this is less explicit than our encoding. The @xml:id of our <stem> element, however, would work as well here with TEI's <re>. To distinguish this use of <re> from any others, the value of the @type attribute would be set to "stem," although this stem is part of the entry itself, and not technically a related entry.
The <orth> elements contain the two alternants of the verb stems followed by the <hyph> element, which is used exactly as it is in our encoding. Instead of using <gram> with a custom @stemType attribute that points to a taxonomy of stems, in this TEI encoding, we use <subc> ('subcategorization') in a <gramGrp> ('grammatical group') element to specify the stem type. While <subc> is a fine choice for including any categorization needed, it is vague and perhaps too general for our purposes. In fact, <subc> would be the only option in TEI to encode much of the grammatical information in Wendat. We estimate that in a strict TEI encoding, <subc> would be used to collapse five different attributes that we use to track grammatical information in entries,10 namely @eventVbConseq ('event verb consequentiality'), @nounIncorp ('noun incorporation status'), @vbPrepron ('non-modal prepronominal prefix'), @pronRestriction ('pronominal prefix restriction'), and @stemType ('stem type'). Of course, collapsing distinctions and using the same element for different functions are practices that are less than ideal and leave more room for human error.
An alternative to <subc> would be to use TEI's <gram>, which allows only a single value for @type. Multiple values could only be provided by wrapping several <gram>s in a <gramGrp>, but even in that case, all the values for @type would have to be shared across the multiple taxonomies that we need. Creating custom attributes in our schema allows us to separate values from different taxonomies and point directly to a single value in a single taxonomy from a single attribute.
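The practical difference between dedicated attributes and a shared <subc> element can be illustrated with a small Python sketch. Both fragments are hypothetical: the attribute names come from the list above, but the values (and the choice of which three attributes to show) are placeholders rather than actual entries, and namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# Custom-schema style: one attribute per grammatical category.
# The values are illustrative placeholders, not real taxonomy pointers.
CUSTOM_XML = (
    '<gram stemType="habitual" nounIncorp="optional" '
    'vbPrepron="duplicative"/>'
)

# TEI-style alternative: every category collapsed into generic <subc>.
TEI_XML = """
<gramGrp>
  <subc>habitual</subc>
  <subc>optional</subc>
  <subc>duplicative</subc>
</gramGrp>
"""

# With dedicated attributes, each value arrives labelled with its category.
custom_values = dict(ET.fromstring(CUSTOM_XML).attrib)

# With <subc>, a processor gets an undifferentiated list and must infer
# which taxonomy each value belongs to.
tei_values = [subc.text for subc in ET.fromstring(TEI_XML).findall("subc")]
```

In the first case a query such as "all entries whose @vbPrepron is the duplicative" is a direct attribute test; in the second, the same query depends on conventions that the markup itself does not record.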
Although this TEI wrangling of our <stem> element is possible in this instance, this encoding ceases to function when handling the expanded aspect-moods. In our schema, we nest these stems to more accurately reflect the Wendat grammar, but this is impossible to do with the <re> element, as it is not permitted to have nested <re> elements.11 If we were to encode these verb stems in TEI, we would have simply a sequence of sibling <re> elements. Another possible option would be to encode Wendat stems in <form> elements, but these stems are not simply a different way of representing "all the information on the written and spoken forms of one headword" (Text Encoding Initiative Consortium 2023b), since they are larger than the headword itself and reflect a subset of the inflectional information needed to form a full Wendat verb.
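The loss of structure described here can be sketched in Python as a transformation from nested <stem> elements to a flat sequence of sibling <re> elements. This is a minimal illustration under stated assumptions, not the project's conversion code: the input fragment, its identifiers, and its @stemType values are hypothetical, and the output omits the TEI namespace and everything else a valid TEI entry would require.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree uses internally for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical nested input; ids and stemType values are placeholders.
NESTED_XML = """
<stems>
  <stem xml:id="s1">
    <gram stemType="stative"/>
    <stem xml:id="s2">
      <gram stemType="stativePast"/>
    </stem>
  </stem>
</stems>
"""


def flatten_to_re(stems):
    """Turn every <stem>, at any depth, into a sibling <re type="stem">."""
    entry = ET.Element("entry")
    # iter() visits stems in document order, so parents precede children.
    for stem in stems.iter("stem"):
        re_el = ET.SubElement(
            entry, "re", {"type": "stem", XML_ID: stem.get(XML_ID)}
        )
        gram_grp = ET.SubElement(re_el, "gramGrp")
        subc = ET.SubElement(gram_grp, "subc")
        subc.text = stem.find("gram").get("stemType")
    return entry


flat = flatten_to_re(ET.fromstring(NESTED_XML))
subc_values = [re_el.find("gramGrp/subc").text for re_el in flat]
```

After flattening, nothing in the output records that the second stem is built on the first; that relationship would have to be reintroduced through naming conventions or pointers.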
Given these constraints, it would be possible, but not necessarily practical or accurate, to encode this information about aspect-mood and expanded aspect-mood in TEI. Doing so would require making numerous concessions for representing Wendat grammatical structure, including the lack of nested stems, the collapsing of multiple grammatical categorizations into only two elements (<subc> and <iType>), and misrepresenting some structures (stems as related entries or forms). Unfortunately, it seems that TEI is unable to represent (expanded) aspect-mood in Wendat, and in fact, there is no mention of aspect whatsoever in TEI's Dictionary chapter. Both options in TEI, using <re> or <form>, would seemingly distort and underserve this part of Wendat grammar.
CONCLUSIONS
If there is any joy to be had from encoding text and data—and there surely is—it arises from the sense that the encoder has elegantly and precisely encapsulated all of their scholarly expertise and intuition in the encoded text, in a manner that reflects the "truth" insofar as it is understood, and that is easily parsed, queried, processed, and rendered. Where the encoding schema is a barrier to this elegance and precision, as we have shown standard TEI dictionary encoding to be, it is surely preferable to use a custom schema which fits more precisely with one's understanding of the language. This is not to say that TEI has no use at all in this scenario; for one thing, we have appropriated the local names of numerous TEI elements, where their definition is close enough to what we need (e.g., <entry>, <form>, <hyph>, and <gram>), and we have continued to use standard TEI to encode ancillary content such as bibliographical data, taxonomies, and so on. Furthermore, we will be producing a conversion layer which renders our custom encoding into an equivalent TEI form, for the purposes of TEI's other main function: interchange (Holmes 2016).
While TEI is routinely customized to meet the specific needs of a project (Czaykowska-Higgins, Holmes, and Kell 2014), our experiences of attempting to do just that resulted in imprecise, opaque, and vague structures that felt akin to how the Jesuits described Wendat grammar using a Latin model. Echoing Rice and Saxon's (2002, 153) statement that "dictionaries should represent the fullness of what a language is rather than be a straightjacket, turning it into something less than it is," we broke free from the TEI mold and chose to represent our current understanding of the language as fully and as faithfully as we can. Furthermore, we acknowledge that this modern Wendat dictionary is a living work that is being created with a specific understanding of the language, in the hopes that it serves the current and diverse needs of the Wendat community and can continuously evolve to do so. After all, like other works, "all dictionaries of all languages are embodiments of the creators' language ideology" (Anderson 2020, 11), and our values and beliefs about the Wendat language are the driving force behind each decision we make in the iterative development of the XML schema.
mlukaniec@uvic.ca
mholmes@uvic.ca
Megan Lukaniec is a member of the Huron-Wendat Nation of Wendake, Québec and an Assistant Professor of Indigenous Language Revitalization in the Indigenous Studies Program at the University of Victoria on unceded WSÁNEĆ territories. Since 2006, she has been working with and for her community to reawaken and reclaim the Wendat language, which was dormant for close to 150 years. She is responsible for compiling and editing the Wendat dictionary and managing the project. She continues to work for the Wendat Language Sector (Conseil de la Nation huronne-wendat), where she reconstructs the language from legacy documentation, creates reference materials, and contributes to the development of curricula and pedagogical materials.
Martin Holmes is a programmer in the University of Victoria Humanities Computing and Media Centre. He is the lead programmer on several large digital edition projects, including the Map of Early Modern London (MoEML) and Le mariage sous l'Ancien Régime, and is part of the Project Endings team. He served on the TEI Technical Council from 2010 to 2015 and was managing editor of the Journal of the Text Encoding Initiative from 2013 to 2015. He is responsible for providing essential technical support for the Wendat dictionary project and maintaining the documentation.
ACKNOWLEDGMENTS
We would like to thank the members of Onywawenda' tehatirihoretha' (the Wendat Language Committee) for their time, patience, and valuable feedback on this dictionary project, Project Kwakwendahchondiahk. Tiawenhk inenh! We would also like to thank both Ewa Czaykowska-Higgins and Karin Michelson, a collaborator on this project, for the many insightful and thought-provoking discussions about Indigenous lexicography over the past years. In addition, we are grateful for the comments provided by two anonymous reviewers on an earlier draft of this article. Finally, we would like to thank the University of Victoria's Humanities Computing and Media Centre along with the Social Sciences and Humanities Research Council of Canada for their generous support of this work.
REFERENCES
Footnotes
1. This article appears in Indigenous Lexicography, a special issue of Dictionaries: The Journal of the Dictionary Society of North America 44(2) (2023), edited by Christine Schreyer and Mark Turin. It is open access under a Creative Commons CC BY-NC-ND license (https://creativecommons.org/).
In the print version, all illustrations are rendered in grayscale. Any color illustrations can be found in the open-access online version at Project Muse: http://muse.jhu.edu/resolve/213
2. We currently identify individual entries and sub-entries, languages used (Wendat, French, and Latin), abbreviations, and some layout features. We use only 63 elements out of a possible 585 TEI elements.
3. Verb elements that are underlined are obligatory in every conjugated verb. The plural is used in the name of positions in the Wendat verb template if more than one of those units can be used in the same verb, whereas the singular implies that there can only be one of those elements present in any given conjugated verb.
4. Line 1 of this example represents the reconstructed Wendat form in the standardized orthography. Line 2 represents the manuscript transcription of this form. Lines 3 and 4 are the morphological analysis and glossing. Line 5 is the original French translation of this example, and line 6 is the authors' English translation.
5. Our elements often share the local names of TEI elements but are in a custom XML namespace.
6. Examples of most of the Wendat aspect-moods appear as part of Figure 4. Examples of the purposive in use can be found in Lukaniec 2018.
7. There are, however, certain derivational suffixes that only occur with certain sets of aspect-mood suffixes, so in these cases, the suffix forms can be predicted. For example, the distributive suffix, which designates that "the action or event is distributed in some manner, whether the action is distributed through space, performed by a variety of agents, or pertains to various kinds or types of agents" (Lukaniec 2018, 293), always occurs with the same forms of aspect-mood suffixes: -hk habitual, -' stative, -' punctual, and -h imperative.
8. The acute accents appearing before the habitual and punctual suffixes indicate that the primary stress shifts to the antepenultimate syllable in these aspect-moods (Woodbury 2003, 75).
9. The alternate views of the Wendat dictionary for different types of users were inspired by the multiple interfaces designed for the Nxaʔamxcín dictionary (Czaykowska-Higgins, Holmes, and Kell 2014).
10. This count does not include the three inflectional class types that we could encode in TEI's <iType> ('inflectional class') element. Using <iType> would involve collapsing another three distinct attributes—verb class (@vClass), conjugation class (@cnjClass), and paradigm (@paradigm)—into a single attribute used in different ways.
11. The TEI Dictionary chapter currently states that <re> "may not contain any nested re elements" (Text Encoding Initiative Consortium 2023a), but the element specification allows nesting. When the TEI Guidelines and the element specifications differ from one another, the Guidelines are understood to prevail.