POLLEX-Online: The Polynesian Lexicon Project Online
The Polynesian lexicon project, POLLEX, was initiated in 1965 by Bruce Biggs in order to provide a large-scale comparative dictionary of Polynesian languages. Since then, POLLEX has grown to include over 55,000 reflexes of more than 4,700 reconstructed forms in 68 languages. These data have enabled many fundamental advances in Polynesian linguistics and prehistory. At almost half a century old, POLLEX is one of the longest-standing databases of linguistic information, and has moved through various incarnations, from typewriter and edge-punched cards, through microfiche to mainframe computer. In the last few years, online databases of linguistic information have become increasingly prevalent, representing a major shift in the way linguistics is conducted. Online databases provide many advantages over the older forms of data distribution, including high availability, more robust data storage, and easy data manipulation and searching, and they also facilitate the replication of previous studies. This paper announces the latest reincarnation of the POLLEX database as an online resource, POLLEX-Online (http://pollex.org.nz), and describes the technical implementation details.
But whence came the inhabitants of Polynesia? How did they come, or get possession of so many islands scattered over such a vast extent of ocean? When did they come? And why did they come? are questions that cannot now be answered without much conjecture. Yet, no doubt a careful and thorough examination of the several dialects, and a comparison of one with the other with a view to ascertain the groundwork of the general language, and a comparison with the languages of the neighboring continents, would not only be a subject of inquiry full of interest, but would go far to indicate the probable origin of this people. (Andrews 1836:12–13)
The Polynesians have long occupied a central position in European awareness of Oceania and its peoples: even today, scientific news that actually bears on the entire expansion of Austronesian languages into the Pacific is often presented as an answer to the perennial question of the “origin of the Polynesians.”2 The manifest lexical affinity of these languages confirmed to early European visitors the unity of a people spread over much of the world’s largest ocean. No wonder, then, that Lorrin Andrews expected comparative investigation to be particularly fruitful.
Lexical comparison of Polynesian languages, in fact, could be said to have begun with Hadrianus Relandus’s recognition that the two brief vocabularies collected a century earlier by Jacob Le Maire at Futuna and Niuatoputapu both represented forms of the “Malay” language (Relandus 1708; see also Rensch 2000: 313–28). More substantial comparative word-lists followed in the late eighteenth century. J. R. Forster’s table compared 46 words in a dozen languages, including Tahitian, Tongan, Māori, Easter Island, and Marquesan (Forster 1778). A quantitative and qualitative leap forward was made by Horatio Hale (1846), who presented about 1,000 cognate sets from ten Polynesian languages, each set headed by a “primitive or radical form” that amounts to a Proto-Polynesian reconstruction, following regular sound correspondences described in Hale’s comparative grammar. Later in the century, Edward Tregear’s comparative dictionary (1891) deserves mention for the large amount of material it brought together, but (in addition to being frankly Māori-centered) it made no effort to follow regular correspondences, and thus included a fair amount of spurious material.
Such was the background to the Polynesian Lexicon project (henceforth POLLEX), initiated by Bruce Biggs at the University of Auckland in 1965, using funding obtained through a New Zealand Government Lottery Grant, and a National Science Foundation Grant to the Bernice P. Bishop Museum (Pawley 2001). First fruits of the project were presented in a 1966 monograph, with D. S. Walsh as coauthor, and contributions from a large number of colleagues and students (Walsh and Biggs 1966). They presented over 900 reconstructed forms with supporting evidence from 11 Polynesian languages. While from a quantitative standpoint this might seem a small advance from what Hale had achieved, POLLEX had the advantage of both the superior data that had become available during the intervening century, and an improved reconstruction of Proto-Polynesian (PPn) that followed Elbert (1953) in recognizing PPn *h, *ʔ, and *r for the first time. The inevitable continued growth from this beginning was marked by the “interim” issue of a listing of headwords alone (Biggs, Walsh, and Waqa 1970), and a third version incorporating microfiches to handle the ever expanding corpus (Biggs 1979). By this time, POLLEX had assumed electronic form, and as paper printouts became prohibitively bulky, copies were distributed on a person-to-person basis as text files—first on floppy disks, and later as email attachments. Biggs continued to work on the project until his death in 2000, when Ross Clark assumed overall direction.
As readers of this journal well know, POLLEX has already proved to be very useful for enlightening the “intellectual darkness” surrounding Polynesian prehistory. The POLLEX database has been used to help elucidate the terms for Proto-Oceanic meteorological phenomena (Ross 1995), to understand the development of the verb ‘swallow’ in Oceanic languages (Lynch 2001), to reevaluate the evidence for Proto-Polynesian *h (Rutter 2001), to support new observations on the Proto-Oceanic labiovelars (Lynch 2002a), and to uncover the origins of kava (Lynch 2002b). Jeff Marck has made use of POLLEX to investigate questions ranging from whether there was an early Polynesian “Sky Father” (Marck 1996) to a full reassessment of Polynesian subgrouping (Marck 2000). However, perhaps the most intensive use was by Kirch and Green (2001), who integrated the information in POLLEX (a “monumental achievement” [p.46]) with archaeological findings to produce a rich and detailed picture of Proto-Polynesian society.
2. Language Databases
POLLEX has followed ever-advancing technology—from typewriter and edge-punched cards, through microfiche and mainframe computer, to wide dispersal on personal computers. Progression to an online database is the next natural step. Recent years have seen a major growth in online databases in a range of fields (Ellis and Attwood 2001; Greenhill, Blust, and Gray 2008b). By providing vast collections in easily accessible formats, these databases are now “as important to scientific progress today as is access to a laboratory or library” (Ellis and Attwood 2001:509). Linguistics has recently begun to follow suit, with a number of large databases going online, including the Austronesian basic vocabulary database (http://language.psy.auckland.ac.nz; Greenhill, Blust, and Gray 2008a), Blust and Trussel’s Austronesian comparative dictionary (http://www.trussel2.com/acd/), the world atlas of language structures (http://wals.info; Dryer and Haspelmath 2011), and the world loanword database (http://wold.livingsources.org/; Haspelmath and Tadmor 2009).
There are at least four major benefits of these online databases (Greenhill, Blust, and Gray 2008b). First, while there is a vast amount of published linguistic data, this information is often scattered between books, journal articles, manuscripts, filing cabinets, and shoeboxes. Having it in one easily accessible location reduces the reliance on happenstance library holdings or access to unpublished material. Second, much of this unpublished information is highly fragile, often stored in field notebooks, or obsolete data-storage media. This fragility is all the more concerning, as the primary sources—the languages themselves—are often heavily endangered: it is estimated that one of the world’s languages goes extinct every week (Nettle and Romaine 2000). Digitizing this material and placing it online will protect it (Greenhill, Blust, and Gray 2008b). Third, online databases make easy manipulation and filtering of data possible. This capability greatly enhances the ability of researchers to test hypotheses using the data, enables the discovery of new hypotheses, and allows the robust testing of new methods (Ellis and Attwood 2001; Greenhill, Blust, and Gray 2008b; Rausher et al. 2010). Finally, online databases facilitate the citation of data. A crucial component of science is the replication and confirmation of results. Replication is all but impossible if the raw data are not accessible. This issue has recently risen to major importance in the life sciences, and many journals now require that authors use citable online data-sources, or deposit their data in such a database (for example, Rausher et al. 2010, Fairbairn 2011). Furthermore, when the original data are made available, the original publications tend to be much more useful—and get cited more often themselves (Piwowar, Day, and Fridsma 2007).
Making POLLEX available online provides these benefits: this large collection of data is in one location, is easily accessible through the internet, is safely stored in a digital format, is searchable, and is citable. This paper is thus an announcement of the availability of POLLEX-Online at http://www.pollex.org.nz. In what follows, we describe the technical implementation details of POLLEX-Online, beginning with the data layer, before moving on to describe the website and its contents.
3. Technical Implementation
All lexical data are stored as Unicode text (UTF-8) in the relational database system MySQL. The core of the database schema comprises four interlinked tables (figure 1).
1. The Language table stores information about the languages in POLLEX. This information includes the language name, the original POLLEX three-letter identification code, and the language’s ISO 639 identification code, where available.
2. The Protoform table stores information about POLLEX’s reconstructed protoforms, the level/subgroup to which they can be reconstructed, and any notes/clarifications about the reconstructions.
3. The Source table contains information about the data sources incorporated into POLLEX. This includes the legacy three-letter token used to identify the source in the original POLLEX (for example, “Bge” referred to Beaglehole 1991), along with a short description of the source and a full reference.
4. The Entry table contains the words themselves. Each entry in this table is linked to a language in the Language table, a protoform in the Protoform table, and a source in the Source table. Entries also have two flags: the first marks loanwords, and the second marks problematic entries (for example, those showing phonological irregularity or a dubious semantic connection). Changes to every record in these tables are tracked, so that any change to the data can be logged, reviewed, and reverted (undone) if necessary.
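The four-table design described above can be sketched as follows. This is a minimal illustration only: the paper names the tables but does not publish their columns, so every column name here is an assumption, SQLite (from the Python standard library) stands in for MySQL, and the example language row and the reflex are placeholders rather than records copied from POLLEX. Only the protoform *aute and the source token “Bge” are taken from the text.

```python
import sqlite3

# Assumed column names: the paper names the four tables but not their columns.
SCHEMA = """
CREATE TABLE language (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    pollex_code TEXT NOT NULL UNIQUE,  -- legacy three-letter code
    iso639 TEXT                        -- NULL where no ISO code is available
);
CREATE TABLE protoform (
    id INTEGER PRIMARY KEY,
    form TEXT NOT NULL,
    level TEXT NOT NULL,               -- subgroup of the reconstruction
    notes TEXT
);
CREATE TABLE source (
    id INTEGER PRIMARY KEY,
    token TEXT NOT NULL UNIQUE,        -- legacy three-letter token
    description TEXT,
    reference TEXT
);
CREATE TABLE entry (
    id INTEGER PRIMARY KEY,
    language_id INTEGER NOT NULL REFERENCES language(id),
    protoform_id INTEGER NOT NULL REFERENCES protoform(id),
    source_id INTEGER NOT NULL REFERENCES source(id),
    item TEXT NOT NULL,
    is_loan INTEGER NOT NULL DEFAULT 0,
    is_problematic INTEGER NOT NULL DEFAULT 0
);
"""


def build_demo_db():
    """Create the four tables in memory and insert one linked example row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces links only on request
    conn.executescript(SCHEMA)
    # Placeholder rows for illustration; not actual POLLEX records.
    conn.execute("INSERT INTO language VALUES (1, 'Example language', 'EXL', NULL)")
    conn.execute(
        "INSERT INTO protoform VALUES (1, '*aute', 'Central-Eastern Polynesian', NULL)"
    )
    conn.execute("INSERT INTO source VALUES (1, 'Bge', 'Beaglehole 1991', NULL)")
    conn.execute(
        "INSERT INTO entry (language_id, protoform_id, source_id, item) "
        "VALUES (1, 1, 1, 'aute')"
    )
    conn.commit()
    return conn


def entries_for_protoform(conn, form):
    """Return (language, item, source token, loan flag, problem flag) rows."""
    rows = conn.execute(
        """
        SELECT l.name, e.item, s.token, e.is_loan, e.is_problematic
        FROM entry AS e
        JOIN language AS l ON l.id = e.language_id
        JOIN protoform AS p ON p.id = e.protoform_id
        JOIN source AS s ON s.id = e.source_id
        WHERE p.form = ?
        ORDER BY l.name
        """,
        (form,),
    )
    return rows.fetchall()
```

Calling `entries_for_protoform(build_demo_db(), "*aute")` returns the single linked row. Because each entry must point at an existing language, protoform, and source, an insert that references a missing row is rejected by the database rather than silently stored.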
The POLLEX-Online website is implemented in the programming language Python, using the open-source web development framework Django (http://www.djangoproject.com). The user interface allows users to view the data either by language (for example, show all words from Māori), or by protoform. For example, figure 2 shows one term for the paper mulberry (Broussonetia papyrifera) in both the original POLLEX form (2a) and the POLLEX-Online variant (2b). The reconstruction, Proto–Central-Eastern Polynesian *aute, has eight supporting entries in the database. These entries can be downloaded in the original POLLEX format and in a dialect of XML (suitable for computational retrieval).
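The two views and the XML download can be illustrated with a short sketch in plain Python. It does not reproduce the site's Django code or its actual XML dialect, neither of which is specified here: the element and attribute names (`protoform`, `entry`, `item`, `language`, `source`) are hypothetical, and the language names, source tokens, and items are placeholders. Only the protoforms *aute and *refu come from the text.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_entries(entries, key):
    """Group entry dicts by 'language' or 'protoform', mirroring the two views."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry[key]].append(entry)
    return dict(grouped)


def protoform_to_xml(protoform, entries):
    """Serialize one protoform and its entries (hypothetical element names)."""
    root = ET.Element("protoform", {"form": protoform})
    for entry in entries:
        node = ET.SubElement(
            root, "entry", {"language": entry["language"], "source": entry["source"]}
        )
        ET.SubElement(node, "item").text = entry["item"]
    return ET.tostring(root, encoding="unicode")


# Placeholder records for illustration; not actual POLLEX data.
ENTRIES = [
    {"protoform": "*aute", "language": "Language A", "source": "Src1", "item": "form-a"},
    {"protoform": "*aute", "language": "Language B", "source": "Src2", "item": "form-b"},
    {"protoform": "*refu", "language": "Language A", "source": "Src1", "item": "form-c"},
]

by_protoform = group_entries(ENTRIES, "protoform")
by_language = group_entries(ENTRIES, "language")
xml_text = protoform_to_xml("*aute", by_protoform["*aute"])
print(xml_text)
```

A structured export of this kind can be read back with any standard XML parser, which is what makes it more convenient for computational retrieval than the legacy text layout.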
4. Current Statistics
Currently there are 55,238 entries, from 68 languages and dialects, listed under 4,753 protoforms. The 68 languages include 44 varieties of Polynesian,3 as well as collateral evidence from Polynesian’s closest relatives in the Central Pacific subgroup (Eastern Fijian, Western Fijian, Rotuman), and some 21 other Oceanic languages representing North and Central Vanuatu (6), Southeast Solomonic (10), Northwest Solomonic (2), and one each from Micronesian, Papuan Tip, and North New Guinea.
Table 1 shows the number of reflexes in POLLEX-Online for each language, along with the number of identified loanwords and problematic entries.4 New Zealand Māori is the best attested language with 3,209 entries (plus 5 identified loanwords and 245 problematic entries), followed by Tongan (2,591 entries, 4 loanwords, and 133 problematic entries) and Samoan (2,517 entries, 5 loanwords, and 143 problematic entries). The most widely attested protoform is *refu ‘ashes’ with 63 entries, followed by *futi ‘pluck hair or feathers, pull up weeds, pull on a line or rope’ with 51 entries, and *muri ‘behind, after, to follow, be last’ with 50 entries. Of the 55,238 entries in POLLEX-Online, 167 are marked as known loanwords, and 4,434 (about 8 percent) are flagged as problematic.
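Per-language tallies of this kind are straightforward to derive from entry records that carry the two flags. The sketch below is one way to do it, under stated assumptions: the record layout and field names are hypothetical, the sample records are placeholders rather than POLLEX data, and an entry carrying both flags is counted as a loanword, a choice the paper does not specify.

```python
from collections import Counter


def tally_by_language(entries):
    """Return {language: Counter} with 'plain', 'loan', and 'problematic' counts.

    Assumption: an entry with both flags set is counted once, as a loanword.
    """
    table = {}
    for entry in entries:
        counts = table.setdefault(entry["language"], Counter())
        if entry["loan"]:
            counts["loan"] += 1
        elif entry["problematic"]:
            counts["problematic"] += 1
        else:
            counts["plain"] += 1
    return table


def percent_flagged(entries, flag):
    """Percentage of entries whose given flag ('loan' or 'problematic') is set."""
    if not entries:
        return 0.0
    return 100.0 * sum(1 for entry in entries if entry[flag]) / len(entries)


# Placeholder records for illustration; not actual POLLEX data.
SAMPLE = [
    {"language": "Language A", "loan": False, "problematic": False},
    {"language": "Language A", "loan": False, "problematic": True},
    {"language": "Language A", "loan": True, "problematic": False},
    {"language": "Language B", "loan": False, "problematic": False},
]

for language, counts in sorted(tally_by_language(SAMPLE).items()):
    print(language, counts["plain"], counts["loan"], counts["problematic"])
print(round(percent_flagged(SAMPLE, "problematic"), 2))
```

Keeping the flags as separate fields on each entry, rather than as separate lists, means the same records can feed both the per-language table and the database-wide percentages.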
Each reconstruction in POLLEX-Online is linked to a specific language subgroup. Table 2 shows the ten best-attested subgroups by number of reconstructed protoforms and reflexes. Unsurprisingly, the Polynesian subgroup has the most reconstructions, with 1,538 identified protoforms containing 19,789 entries. The next best-attested subgroups are Central-Eastern Polynesian, with 553 protoforms from 3,339 entries, and Nuclear Polynesian, with 469 protoforms identified from 5,022 entries.5
The data in POLLEX-Online are sourced from 199 different resources, ranging from dictionaries, through manuscripts, to personal communications. The most prevalent sources in POLLEX-Online are Williams (1971) with 2,691 entries, Churchward (1959) with 2,469 entries, Pratt (1911) with 2,438 entries, Lemaître (1973) with 2,163 entries, and Stimson (1964) with 2,069 entries.
At almost half a century old, POLLEX is one of the longest-standing databases of linguistic information. POLLEX has moved through various incarnations, from typewriter and edge-punched cards, through microfiche, to mainframe computer. This latest incarnation places POLLEX online as a publicly available, highly accessible online database of lexical information. It is our hope that POLLEX-Online continues the strong tradition of its predecessors in helping to generate new insights into all aspects of Polynesian linguistics and prehistory. Additions and improvements to the database will continue as before. We invite readers of Oceanic Linguistics to sample POLLEX-Online, and provide suggestions as to how it can be made a more useful research tool.
1. We would like to thank Liz Pascal and Annik van Toledo for comments on the manuscript.
2. See, for example, Soares et al. (2011). This paper is reported under the headline “Genetic study uncovers new path to Polynesia,” at http://www.sciencedaily.com/releases/2011/02/110203124726.htm.
3. POLLEX-Online includes some data from all known Polynesian languages, including Moriori of the Chatham Islands and Niuatoputapu of northern Tonga, known only from historical documents. A few of the varieties presently distinguished in the database are better considered dialects than distinct languages. Aitutaki, Atiu, Mangaia, and Rarotongan are variants of what is commonly called Cook Islands Māori; and the Austral, Ra’ivavae, Rurutu, and Tupuaki entries probably likewise represent a single language, though the language situation in this area is much less clear. These exclusions and mergers would give a total in the mid-30s for the number of living Polynesian languages.
4. “Problematic” is a provisional cover term for words considered to be possible reflexes of a reconstructed form, but exhibiting unexplained semantic or phonological deviations. Some of the latter may be the result of borrowing. Further improvements to the database will include more precise annotation of these problems. In table 1, the column headed “??” consists of problematic entries.
5. A small percentage of reconstructions (perhaps eight percent of the total) have reflexes whose distribution does not clearly correspond to any established subgroup. Some of these are labeled with ad hoc distribution codes (such as XO where reflexes are found only in Outlier languages); others are frankly recognized as mysteries (??).