The discourse basis of ergativity revisited: Online Appendices
- Language
- Linguistic Society of America
- Volume 92, Number 3, September 2016
- pp. s1-s14
- 10.1353/lan.2016.0044
Language 92.3, September 2016

THE DISCOURSE BASIS OF ERGATIVITY REVISITED: ONLINE SUPPLEMENTARY MATERIAL

GEOFFREY HAIG, University of Bamberg
STEFAN SCHNELL, University of Melbourne

This document contains illustration and justification of the methodology used in our article ‘The discourse basis of ergativity revisited’ (henceforth H&S). We begin by outlining the Multi-CAST database (‘Multi-Language Corpora of Annotated Spoken Texts’), from which large portions of the data stem. Multi-CAST is a project designed for crosslinguistic, corpus-based studies into argument realization in spoken discourse, covering topics such as REFERENTIAL DENSITY (Bickel 2003, Noonan 2003), referentiality (Kibrik 2011), or PREFERRED ARGUMENT STRUCTURE (Du Bois 1987, 2003). The focus of H&S is preferred argument structure. It is worth pointing out that text-based, or corpus-based, typology is a field in its infancy (Schnell 2012, Cysouw & Wälchli 2007, Wälchli 2006, 2009), and researchers are still engaging in exploratory studies to gauge the validity of different data types. Our study and this document are also intended as a contribution to the ongoing methodological discussion.

Contents of this document:
§1: Corpus size and composition: The Multi-CAST online database
§2: Corpus mark-up: How the Multi-CAST data are annotated
§3: Interpreting quantitative data on preferred argument structure: contrasting two approaches
Appendix A: Languages and sources for Table 2 in H&S
Appendix B: Raw data from the Multi-CAST data set
References

1. CORPUS SIZE AND COMPOSITION (see H&S §3).
The Multi-CAST corpus currently contains recordings of spontaneous spoken narrative texts from five languages, together with transcription, translation, and further linguistic annotation, available online at https://lac.uni-koeln.de/en/multicast/. Obviously, for quantitative approaches to cross-corpus comparison, corpora should aim at maximal size. However, manual annotation of natural spoken language data, in particular of lesser-described languages, is so time- and resource-costly that in published research, many of the available corpora (half of those listed in Table 2 below) do not exceed 1,000 clause units. The Multi-CAST corpora all contain a minimum of 1,000 clause units, making them broadly comparable to available data sets. Table 1 gives an overview of the Multi-CAST corpora, while Table 2 provides the relevant data from previously published sources that have been included in the analysis. Our findings are based on the total number of clauses in Tables 1 and 2, thus a total of 25,618 clauses.

LANGUAGE        CLAUSES   GENRE                                                       SOURCE
Vera’a          3,789     traditional narratives (11 texts)                           Schnell 2016
Teop            1,328     traditional narratives (4 texts)                            Mosel & Schnell 2016
N. Kurdish      1,205     traditional narratives (2 texts)                            Haig & Thiele 2016
English         2,360     monologic oral history (Kent dialect of English, 1 text)    Schiborr 2016
Cypriot Greek   1,078     traditional narratives (3 texts)                            Hadjidas & Vollmer 2016
TOTAL CLAUSES   9,670

TABLE 1. The Multi-CAST corpus.
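The clause totals reported above can be cross-checked with a few lines of Python. The following is a minimal sketch and not part of the original study: the names `table1`, `check_total`, and the `STATED_*` constants are illustrative assumptions, and the figures are simply those printed in Table 1, together with the stated Table 2 total of 15,948 clauses. As printed here, the five Table 1 rows sum to 9,760 rather than the stated 9,670; the sketch reports that difference without deciding which figure is in error. The stated grand total of 25,618 is consistent with the stated table totals.

```python
# Clause counts copied from Table 1 as printed in this document.
table1 = {
    "Vera'a": 3789,
    "Teop": 1328,
    "N. Kurdish": 1205,
    "English": 2360,
    "Cypriot Greek": 1078,
}

# Totals as stated in the text and table footers (not recomputed).
STATED_TABLE1_TOTAL = 9670
STATED_TABLE2_TOTAL = 15948
STATED_GRAND_TOTAL = 25618


def check_total(rows, stated_total):
    """Return the sum of the row values and its difference from the stated total."""
    row_sum = sum(rows.values())
    return row_sum, row_sum - stated_total


row_sum, difference = check_total(table1, STATED_TABLE1_TOTAL)
grand_total = STATED_TABLE1_TOTAL + STATED_TABLE2_TOTAL

print(f"Table 1 row sum: {row_sum} (stated: {STATED_TABLE1_TOTAL}, difference: {difference})")
print(f"Stated table totals combined: {grand_total} (stated grand total: {STATED_GRAND_TOTAL})")
```

The same function could be applied to the Table 2 rows once they are entered as a dictionary in the same way.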
LANGUAGE        CLAUSES   GENRE                                                       SOURCE
Sakapultek      456       Pear story retellings (18 texts)                            Du Bois 1987, table 2
English         704       informal conversation                                       Kärkkäinen 1996
English         484       classroom interactions, teachers’ contributions only        Kumpf 2003
English         1,313     televised interviews, interviewees’ contributions only      Everett 2009
English         1,654     Pear story retellings (20 texts)                            Kumagai 2006
Portuguese      412       televised interviews, interviewees’ contributions only      Everett 2009
Roviana         339       monologic texts, variety of topics and genres               Corston-Oliver 2003
Korean          4,363     children’s speech (1 yr. 8 months to 2 yrs. 10 months)      Clancy 2003
To’aba’ita      1,278     six traditional narratives, third person only               Lichtenberk 1996
Mapudungun      700       transcriptions of spoken narratives, third person only      Arnold 2003
Yagua           1,156     traditional folkloric narrative                             Payne 1993
Gorani          483       traditional folkloric narrative                             Mahmoudveysi et al. 2012
French          1,056     structured interviews, mostly monologic                     Ashby & Bentivoglio 1993
Spanish         1,550     structured interviews, mostly monologic                     Ashby & Bentivoglio 1993
TOTAL CLAUSES   15,948

TABLE 2. Previously published data included in H&S.

An issue that needs to be considered for cross-corpus comparability is that of variation in genre and discourse type. Some researchers advocate the use of standardized stimuli to elicit texts of broadly comparable content, for example the Frog story (e.g. Noonan 2003) or Pear film retellings (see Chafe 1980 for details on the stimulus and the elicitation...