Evaluating the Availability of Resources, Research Hubs, and Financial Supports for Nigerian Languages Natural Language Processing Research / Évaluation de la disponibilité des ressources, des centres de recherche et du soutien financier pour la recherche sur le traitement du langage naturel des langues nigérianes
Despite the unavailability of national government support and perpetual scarcity of resources, there are indications that the number of natural language processing (NLP) studies on Nigerian languages is growing. This study collates information about the resources, institutional structures, and financial sources that support the research of Nigerian languages from published scientific articles. Relevant publications were systematically retrieved from Google, Web of Science, and Scopus. Information on research data availability, authors’ institutional affiliation, and funding acknowledgement was collected from the full texts of the NLP publications for analysis. Results reveal that only 14.28% of the published papers shared 18 different resources (speech and textual corpora, electronic dictionary, and source codes) online, and 27.95% of the articles were funded. Support for the NLP of Nigerian languages was significantly higher from outside Nigeria—for instance, most of the funding sources were from the United States, France, and the United Kingdom. Secondly, more papers were written by authors that were affiliated with institutions outside Nigeria than authors from within Nigeria. Thirdly, most of the available resources on online repositories were shared by authors that were affiliated with institutions outside Nigeria. The top five research hubs on the NLP of Nigerian languages are Obafemi Awolowo University, Nigeria; the University of Sheffield, United Kingdom; the University of Uyo, Nigeria; African Languages Technology Initiative, Nigeria; and the University of Ibadan, Nigeria. Instances of research collaboration between Nigerian and foreign universities for NLP capacity building were identified. This study provides an insight into the existing structure and resources for the NLP of Nigerian languages, which could be harnessed by stakeholders for the development of Nigerian languages.
Résumé: Malgré le manque de soutien du gouvernement national et la rareté des ressources disponibles, le nombre d’études de traitement du langage naturel (TLN) sur les langues nigérianes semble être en augmentation. Cette étude rassemble les ressources, les structures institutionnelles et les sources financières qui soutiennent la recherche sur les langues nigérianes à partir d’articles scientifiques publiés. Les publications pertinentes ont été systématiquement extraites des bases de données Google, Web of Science et Scopus. Les données sur la disponibilité des données de recherche, l’affiliation institutionnelle des auteurs et les mentions des sources de financement ont été recueillies à partir du texte intégral des publications. Seulement 14,28% des articles publiés partageaient 18 ressources différentes en ligne et 27,95% des articles ont été financés. Le soutien au TLN des langues nigérianes était significativement plus élevé en dehors du Nigéria; par exemple, la plupart des sources de financement provenaient de l’extérieur du pays. Dans un second temps, plus d’articles ont été rédigés par des auteurs affiliés à des institutions en dehors du Nigéria que des auteurs affiliés au Nigéria. Enfin, la plupart des ressources disponibles sur les dépôts d’autoarchivage numérique étaient partagées par des auteurs affiliés à des institutions en dehors du Nigéria. Les cinq principaux centres de recherche sur le TLN des langues nigérianes sont l’Université Obafemi Awolowo (Nigéria); L’Université de Sheffield (Royaume-Uni); Université d’Uyo (Nigéria); l’Africa Languages Technology Initiative (Nigéria), et Université d’Ibadan (Nigéria). Deux exemples de collaboration de recherche entre les universités nigérianes et étrangères pour le renforcement des capacités en TLN ont été identifiés. 
Cette étude donne un aperçu de la structure et des ressources existantes pour le TLN des langues nigérianes, qui pourraient être exploitées par les parties prenantes pour le développement des langues nigérianes.
open scholarship, Nigerian languages, digital archiving, data sharing, natural language processing
Science ouverte, Langues nigérianes, Archivage numérique, Partage de données, Traitement du langage naturel
Introduction
With 527 different languages, Nigeria has the third highest number of spoken languages of any country (Eberhard, Simons, and Fennig 2019). However, most of the languages are spoken by a very small proportion of the country’s population, and four of these languages, Hausa/Fulani (29%), Yoruba (21%), Igbo (18%), and Ijaw (10%), are spoken by about 78% of Nigeria’s population (Central Intelligence Agency 2019). Most Nigerians are “bilingual by English”; that is, in addition to their Indigenous languages, native Nigerians try to learn English, which was the colonial language, is the lingua franca, and is perceived as the language of the formally educated (Adedun and Shodipe 2011; Rehbein 2015). The development of Nigeria’s native languages is not supported by any government policy, and there is no structure to fund or support their language technology development. Many natural language processing (NLP) studies rank Nigerian languages as under-resourced (Adegbola and Odilinye 2012; Asubiaro et al. 2018; Ekpenyong and Inyang 2016; Ekpenyong, Urua, and Gibbon 2008; Ezeani, Hepple, and Onyenwe 2017; Ezeani et al. 2018; Iheanetu et al. 2017; Lu et al. 2016; Tucker and Shalonova 2005). In addition, basic resources such as textual data are non-existent for most Nigerian languages, such as Ibibio, Ijaw, and Kanuri, that are spoken by fewer than 10 million people, while the major languages (Hausa, Igbo, and Yoruba) are only marginally better off, as they lack quality corpora.
This is unlike South Africa, another African country, where the development of local languages is supported by the government’s national language policy, and 11 of its 25 languages are recognized as official languages (Alberts 2011; Department of Arts and Culture 2003). As part of its effort to develop its native languages, the South African government funds human language technology (HLT) projects through the Department of Arts and Culture, the Department of Science and Technology, and the National Research Foundation under the umbrella of the National HLT Network (Grover, van Huyssteen, and Pretorius 2011). In South Africa, the government’s support for native languages, including the textual documentation of information in the local languages, has resulted in a large library collection of documents. This collection, in turn, is a source of textual corpora for NLP studies; the corpora are also stored and processed for the HLT of South African languages in a database maintained by the National Centre for Human Language Technology, which is called the Centre for Text Technology (Barnard et al. 2014; Duvenhage, Ntini, and Ramonyai 2017; Eiselen and Puttkammer 2014).
On the other hand, documentation of information in native Nigerian languages was and is driven by religious organizations, especially the Christian churches (Igboanusi 2006; Law 1976; Ogunbiyi 2003; Philips 2004). The Christian Missionary Society strongly influenced the Romanized writing system of the major Nigerian languages such as Igbo, Yoruba, and Hausa in the precolonial era. The Nigeria Bible Translation Trust (NBTT) and the Nigeria Bible Society have both continued the trend by creating writing systems for minority Nigerian languages (Abimaje 2019; Alao 2014; Lalo 2011; Obinna 2017). Textual corpora from the Holy Bible and other Christian websites in Nigerian languages have been useful in many NLP studies (Alabi et al. 2019; Asubiaro 2011; Onyenwe 2017; Onyenwe et al. 2018; Onyenwe et al. 2015; Orife 2018; Semiyou, Aoga, and Igue 2013). The NBTT has intensified its effort to translate the Holy Bible into other Nigerian languages (Nigeria Bible Translation Trust 2015); the government has made no comparable effort to translate other forms of information and resources into Nigerian languages. For instance, the government has not translated the current Nigerian Constitution into any local Nigerian language; the only translation, into the Yoruba language, was done by an individual (Ogunshola 2015). Websites of the federal, state, and local government ministries, agencies, and parastatals are all written in English and are not translated into the native Nigerian languages.
Existing research on the NLP of Nigerian languages has somehow sourced the resources it needed. Identifying the resources used in these different NLP studies and pooling them together could therefore alleviate the problem of resource scarcity. This is important to the research community that is interested in the NLP of Nigerian languages, as it can also give researchers an insight into resources that have been created and that differ from those in their possession. Besides, the open scholarship movement has spurred changes in research communication and has affected how publications and research data are shared on the Internet, so it is imperative to find out how researchers of the NLP of resource-scarce languages (such as the Nigerian languages) have harnessed these developments to alleviate resource scarcity. With the Nigerian government’s inaction towards supporting the HLT of Indigenous Nigerian languages, this study also sets out to find out if there are notable research hubs and researchers that have contributed significantly to the HLT of the Nigerian languages. Similarly, this study investigates the funding sources for NLP studies of Nigerian languages. Such an inventory of research hubs and funding sources is potentially useful for stakeholders in the HLT development of resource-scarce languages as well as for the planning and establishment of centres of excellence. There are perhaps existing informal, but significant, research structures on which a solid research development agenda for the HLT of Nigerian languages could be built.
Methodology
The methodology discusses two major tasks: the literature search and the analyses. The literature search involves choosing sources to search, drawing search queries, and identifying the relevant publications. The analyses, on the other hand, focus on details of the bibliometric and content analysis performed on the text and bibliographic data obtained.
Literature search
The search was limited to the top 10 languages spoken in Nigeria: Hausa, Yoruba, Igbo, Kanuri, Ijaw, Ibibio, Ebira, Tiv, Fulfulde, and Efik. Relevant articles on the NLP of Nigerian languages were retrieved from two sources: three online search engines and the references of the relevant articles that were retrieved from these search engines (Asubiaro and Igwe 2020). The three search engines used for this search were Google, Web of Science (WoS), and Elsevier’s Scopus. The search strategy employed on Google differed from the one used on WoS and Scopus. General and more specific keywords were used to create the search queries. The first set of keywords, which were general, comprised “natural language processing,” “digital,” “information retrieval,” “information processing,” “information extraction,” “information storage,” “computer,” “corpus,” “electronic,” “automatic,” “internet,” and “web”; on the Google search engine, each keyword was combined, one at a time, with the language name using a Boolean operator. For instance, to retrieve documents on the Ibibio language, “natural language processing” and “Ibibio” were entered into the Google search engine, and this was repeated for all of the keywords. The search query was then modified to the specifications of WoS and Scopus (see Table 1).
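The keyword-by-language pairing described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_queries`, the quoting style, and the choice of `AND` as the Boolean operator are assumptions, since the text does not name the operator or give the exact query syntax.

```python
from itertools import product

# General keywords exactly as listed in the text.
GENERAL_KEYWORDS = [
    "natural language processing", "digital", "information retrieval",
    "information processing", "information extraction", "information storage",
    "computer", "corpus", "electronic", "automatic", "internet", "web",
]


def build_queries(languages, keywords=GENERAL_KEYWORDS, operator="AND"):
    """Pair every keyword with every language name, one query per pair.

    The operator defaults to AND, which is an assumption; the source only
    says that "the Boolean operator" was used.
    """
    return [
        f'"{keyword}" {operator} "{language}"'
        for language, keyword in product(languages, keywords)
    ]


# Two languages are used here purely as an example.
queries = build_queries(["Ibibio", "Hausa"])
print(len(queries))  # 2 languages x 12 keywords
print(queries[0])
```

Issuing one query per keyword and language pair keeps each result set small enough to screen by hand, at the cost of many overlapping hits that must be de-duplicated afterwards.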
A total of 617 articles were retrieved after the removal of duplicates from the search results from Google, WoS, and Scopus. The first round of screening was conducted by reading the title, the abstract, and/or the full text of the retrieved articles to identify the relevant articles. In total, 209 articles were identified to be relevant to NLP of the Nigerian languages after the first screening. The full text of each of the 209 articles was carefully read for the second screening, where inclusion criteria were carefully applied. The inclusion criteria included:
• the article must either be a conference proceeding or a journal article;
• the full text of the articles must be available and accessible online;
• the article must have NLP relevance to at least one of the Nigerian languages, which must be reflected in the methodology; and
• the methodology must be clear and detailed: it must state the methods/theories used and establish the relevance to NLP.
After the second screening, 127 articles passed the inclusion criteria. To complement the literature search from the online sources, the references of the 127 articles were also searched for relevant articles, and this search yielded an additional 15 relevant articles.
The second stage of searching for relevant articles was initiated because the first set of queries was considered too general and may not have captured all of the relevant articles. However, the first queries retrieved articles that gave a direction on the areas covered in the NLP of Nigerian languages. Therefore, the following keywords were extracted from the retrieved articles to create a second set of queries: morphological analysis, cross-lingual entity knowledge transfer, phonological analysis, speech synthesis, computational grammar, text-to-speech synthesis, speech analysis, speech corpus, text corpus, part-of-speech tagging, machine translation, electronic dictionary, text normalization, diacritics restoration, electronic keyboard, computing sign language, entity recognition, entity typing, folktale narrative computation, sentiment analysis, text summarization, document similarity, language identification, lexical analysis, stop words, optical character recognition, information and communications technology localization, spell corrector, computational modeling, human language technology, capacity building, automatic speech recognition, and word identification. The effectiveness of the new queries was tested on WoS and Scopus, and the results showed a lot of overlap with the original queries. Therefore, keywords that overlapped with the keywords from the old queries were removed. When the remaining keywords were posed as queries, they retrieved the same number of new relevant articles as the full second set of queries, with fewer irrelevant articles and overlaps. The new queries are listed in Appendix 1. Queries were also created for the Google search engine using the method that was described earlier in the first stage of the literature search. The second search returned 31 articles from Google, 30 from WoS, and 60 from Scopus; 19 articles were relevant after the removal of duplicates.
Overall, 161 articles were included in the study, in addition to a research data publication. Similarly, 14 master’s and doctoral theses were found during the two search processes, and the universities that hosted the master’s or doctoral degree students that produced these theses were identified as part of research hubs on the NLP of Nigerian languages. The bibliographic details of the relevant articles are available on GitHub.1
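The screening flow reported in this section can be tallied as a quick consistency check. The sketch below uses only counts stated in the text; the variable names and stage labels are illustrative.

```python
# Counts reported for the first search and its two screening rounds.
stages = [
    ("retrieved after removing duplicates", 617),
    ("relevant after first screening", 209),
    ("passed inclusion criteria", 127),
]

# Show how many articles survived each screening step.
for (_, previous), (label, current) in zip(stages, stages[1:]):
    print(f"{label}: {current} of {previous} ({100 * current / previous:.1f}% kept)")

# Articles that make up the final set, by origin.
included = {
    "first search, after second screening": 127,
    "references of the included articles": 15,
    "second search, after duplicate removal": 19,
}
total_included = sum(included.values())
print(total_included)  # 161
```

The three sources add up to the 161 articles reported as the final set.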
Content and bibliometric analyses
All of the sections of the full text, including the abstract, the acknowledgements, the footnotes, and the endnotes, were searched for research data availability/sharing information, while the acknowledgement section was searched for funding information. The following information was collected:
• the uniform resource locator (URL) of research data—that is, the textual or speech corpora, used for the study;
• the URL of computer codes used for the study; and
• the name and country of the funding agencies acknowledged in the research articles.
Bibliometric data such as the author’s name, the institution, and the countries of affiliation were also collected for analysis.
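The paper does not say whether this extraction was done by hand or with tooling. If one wanted to pre-screen full texts before reading them, a minimal sketch could look like the following; the regular expressions, the list of funding cue phrases, and the sample strings are all assumptions made for illustration, not the authors' procedure.

```python
import re

# Match http(s) URLs up to whitespace or a closing bracket/quote.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

# Hypothetical cue phrases that often introduce a funding statement.
FUNDING_CUES = re.compile(
    r"\b(funded by|supported by|sponsored by|grant)\b", re.IGNORECASE
)


def extract_urls(text):
    """Return URLs found in text, with trailing sentence punctuation removed."""
    return [match.rstrip(".,;:") for match in URL_PATTERN.findall(text)]


def funding_sentences(acknowledgement):
    """Return the sentences of an acknowledgement that contain a funding cue."""
    sentences = re.split(r"(?<=[.!?])\s+", acknowledgement.strip())
    return [sentence for sentence in sentences if FUNDING_CUES.search(sentence)]


# Made-up sample strings, not taken from any reviewed article.
sample_text = (
    "The corpus is available at https://example.org/corpus. "
    "Code: https://example.org/code."
)
sample_ack = "We thank the reviewers. This work was supported by a hypothetical agency."

print(extract_urls(sample_text))
print(funding_sentences(sample_ack))
```

Any hit from such a pre-screen would still need manual verification, for example by visiting the URL to confirm that the resource is actually there.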
Operationalizing openness
Openness refers to compliance with the open scholarship models of research communication; this study covers open data and open access. Openness was operationalized using the following strategies. Research data (all forms of data, including text or speech corpora and source codes) and publications had to be available on websites that are open to the public. Green open access was investigated by searching for the full text of the publications in open repositories. Gold open access publications were those available on the publishers’ websites without a subscription paywall, whether in gold, bronze, or platinum open access journals or in hybrid journals. The validity and openness of research data and publications were further investigated by visiting the repositories in which they were deposited.
Results and discussion
Of the 161 reviewed studies, the Yoruba language was included as a language of interest in the highest number of studies (91), followed by the Hausa (42), Igbo (36), Ibibio (21), Tiv (3), Kanuri (3), Fulfulde (3), and Efik (1) languages, while the Ebira and Ijaw languages were not part of any NLP studies. Only four of these languages (Hausa, Ibibio, Igbo, and Yoruba) have been considered in more than four NLP studies, while the other Nigerian languages are largely unstudied in NLP research. This is an indication of the level of inclusion of Nigerian languages in the information age. Further analysis reveals that more papers were written by authors that were affiliated with institutions from outside Nigeria (43.75%) than by authors that were affiliated with institutions from Nigeria (42.23%). Given that there is no Nigerian government structure to support the development of local languages, this finding suggests that contributions to the computational development of Nigerian languages from outside Nigeria could be significantly higher. This finding is surprising, and possibly indicates that researchers outside the country are more motivated to work on foreign languages because of the availability of funds and tools. Another conjecture could be that researchers from Nigeria pursue their research careers in other countries and thus contribute to the development of Nigerian languages. It is also noteworthy that researchers that are affiliated with Nigerian institutions worked on the Nigerian languages despite the dearth of resources and absence of government support.
The retrieved publications covered NLP topics such as speech technology (for example, text-to-speech and speech synthesis) (n = 48); computational morphology (n = 24); alternatives to a language-dependent keyboard (automatic diacritics restorations, optical character recognition) (n = 29); machine translation (n = 12); document modelling (n = 10); lexical analysis (n = 4); computational grammar (n = 5); entity typing (n = 6); narrative computation (n = 3); language identification (n = 2); resource development (n = 7); sentiment analysis (n = 3); and information retrieval (n = 2). This suggests that the NLP of Nigerian languages is strong in three major areas of research (speech technology, alternatives to a language-dependent keyboard, and computational morphology).
Resource availability and openness
Considering the paucity of funding for journal access subscription in Nigeria, the openness of resources such as research articles, data, and computer codes is important to the NLP of Nigerian languages. Not only was the availability of resources explored, but their openness was also evaluated. The proportion of NLP articles that were shared on different online repositories is displayed in Table 1. Most of the articles (63.75%) were published in gold open access journals, more than in open repositories. Researchgate.net (47.4%) was the most popular green open access repository, followed by institutional repositories (34.34%).
Details of the resources that were published in the articles are presented in Table 2, which shows that only 18 different resources were shared, in 23 (14.28%) of the 161 articles, with an additional resource that was shared in a research data publication. At face value, this suggests low compliance with the open data initiative in research communication when compared with the results of Zenk-Möltgen et al. (2018), which reveal that about half of the empirical studies published in selected social science journals shared research data. Furthermore, 65.22% of the articles that shared their research data online were written by authors affiliated with institutions outside Nigeria. The resources that were recorded in Ogwu, Talib, and Odejobi (2006),2 Akinwale et al. (2015),3 and Finkel and Odejobi (2009)4 were not included because the URLs provided in their data-sharing statements were inactive at the time of writing this report. The resource reported in Ojumah, Misra, and Adewumi (2018)5 was also excluded because the URL provided in its data-sharing statement was empty at the time this report was compiled.
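The sharing percentages can be cross-checked with simple arithmetic. The sketch below assumes 23 data-sharing articles, a count inferred from the 14.28% figure given in the abstract and conclusion, and 15 of those with authors outside Nigeria, a count inferred from the 65.22% figure; neither inference is confirmed by the authors.

```python
TOTAL_ARTICLES = 161

# Assumption: 23 is the number of articles that shared resources
# (14.28% of 161 is about 23).
sharing_articles = 23

# Assumption: 15 of those 23 had authors affiliated outside Nigeria
# (65.22% of 23 is about 15).
sharing_outside_nigeria = 15


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


print(f"{pct(sharing_articles, TOTAL_ARTICLES):.2f}%")
print(f"{pct(sharing_outside_nigeria, sharing_articles):.2f}%")
```

Under these assumptions the first value is 14.2857...%, which matches the reported 14.28% if the figure was truncated rather than rounded, and the second value rounds to the reported 65.22%.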
More resources were published on the Yoruba language (12), followed by the Hausa (9), Igbo (4), and Ibibio (2) languages, while one dataset was published on the Tiv language. GitHub (10) is the most popular repository, followed by sourceforge.net (2). Five of the articles made their datasets available on institutional repositories. A probe into the quality and quantity of the data published by the studies shows that the corpora from Orimaye, Alhashmi, and Eu-gene (2012) are not very useful for NLP studies on the Yoruba language because they are heavily code-mixed and code-switched with English text, even though they were collected from online comments about Yoruba movies. Corpora from Asubiaro et al. (2018) and Orife (2018), with more than 100,000 words, were fully diacritically marked, and they represent the biggest datasets that were published. The corpora were sourced from religious texts, published books, and news websites, and the research data from these two studies are the most simply packaged corpora. Other studies published smaller corpora. Two studies, Ekpenyong, Urua, and Gibbon (2008) and Tucker and Shalonova (2005), shared the same datasets, which were packaged in res (Windows 32 binary resource file that is executable in Microsoft Visual C++) and lpc (linear predictive coding) file formats. Gauthier, Besacier, and Voisin (2016) published the biggest speech corpus for the Hausa language (see Tables 2 and 3).
Research hubs
Five major research hubs on the NLP of Nigerian languages were identified. The research hubs are Obafemi Awolowo University, Nigeria; the University of Sheffield, United Kingdom; the University of Uyo, Nigeria; the African Languages Technology Initiative (ALT-I), Nigeria; and the University of Ibadan, Nigeria. Surprisingly, one non-Nigerian university made the cut, though there are indications that the research hub outside Nigeria (the University of Sheffield) had strong collaborations with the University of Nigeria. Persons of reference were identified as researchers that have contributed significantly to the NLP of Nigerian languages; in this case, a researcher must have contributed more than four research articles on the NLP of Nigerian languages before he or she was considered to be a person of reference.
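The "person of reference" rule, more than four articles per researcher, amounts to a count with a threshold. A minimal sketch follows; the function name, the input shape (one author list per article), and the placeholder author names are assumptions for illustration only.

```python
from collections import Counter


def persons_of_reference(author_lists, min_articles=5):
    """Return authors with at least min_articles articles, sorted by name.

    "More than four" in the text corresponds to min_articles=5.
    Each article counts once per author, even if a name is repeated
    within the same author list.
    """
    counts = Counter(
        author for authors in author_lists for author in set(authors)
    )
    return sorted(name for name, n in counts.items() if n >= min_articles)


# Placeholder data: five articles by A and B together, one by B and C.
papers = [["Author A", "Author B"]] * 5 + [["Author B", "Author C"]]
print(persons_of_reference(papers))  # ['Author A', 'Author B']
```

In practice the hard part is name disambiguation (initials, spelling variants, changed affiliations), which this sketch does not attempt.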
The biggest research hub that was identified is Obafemi Awolowo University in Nigeria. Twenty-one researchers were affiliated with Obafemi Awolowo University, and they produced at least 25 publications on the NLP of Nigerian languages. The researchers at Obafemi Awolowo University contributed primarily to the NLP of the Yoruba language and to NLP topics such as diacritics restoration, text-to-speech synthesis, machine translation, and computation of narratives. Odetunji A. Odejobi was the most prolific Nigerian languages NLP researcher and a person of reference at Obafemi Awolowo University. Other persons of reference at Obafemi Awolowo University are Safiriyu I. Eludiora, Franklin O. Asahiah, and Olufemi D. Ninan. Obafemi Awolowo University is located in the south-western part of Nigeria, where Yoruba is the dominant local language.
The University of Sheffield in the United Kingdom was the second most important research hub for the NLP of Nigerian languages. The University of Sheffield houses the Igbo NLP research group, where three doctoral candidates (Ikechukwu E. Onyenwe, Ignatius Majesty Ezeani, and Chioma Enemuo) were working on part-of-speech tagging, diacritics restoration, and resource development for the Igbo language. Many of the publications from these doctoral students were co-funded by the Tertiary Education Trust Fund (TETFund) of the federal government of Nigeria (FGN) and the University of Nigeria. Mark Hepple of the University of Sheffield and Chinedu Uchechukwu of the University of Nigeria also co-authored major articles with the Igbo NLP group. From the analysis, it appears that the research activities of the Igbo NLP group are an instance of capacity-building collaboration between a Nigerian and a foreign university. The University of Sheffield research hub for Igbo NLP has produced at least 13 publications on the NLP of the Igbo language, authored by eight researchers.
The University of Uyo in Nigeria, another major research hub, houses researchers that work mainly on the text-to-speech synthesis of the Ibibio language. The persons of reference at the research hub are Moses Ekpenyong, EmemObong Udoh, Udoinyang Inyang, and Eno-Abasi Urua. The Ibibio language is featured as a language of interest in the Local Language Speech Technology Initiative project, which is a product of a collaboration between the University of Uyo and the University of Bielefeld in Germany.7 The University of Uyo is situated in the southern part of Nigeria, where the Ibibio language is spoken as a local language.
Another major research hub is ALT-I, a research institute that collaborates with international organizations like Tiwa Systems, Bait-al-Hikma, the Open Society Initiative for West Africa, and the International Development Research Center (Adegbola 2009). ALT-I also collaborates with universities in Nigeria for capacity building in language technology for African languages; such collaborations exist with the University of Ibadan, Obafemi Awolowo University, the University of Benin, the University of Lagos, the University of Ilorin, and the University of Abuja. It was observed that ALT-I had one major contributor and collaborated with more local institutions and researchers than other research hubs. ALT-I is in the south-western part of Nigeria, where Yoruba is the native language.
The University of Ibadan in Nigeria is one of the five major Nigerian languages NLP research hubs. While other research hubs focused on the major languages that are spoken in their regions (the University of Ibadan is located in the south-western part of Nigeria where Yoruba is the major local language), researchers at the University of Ibadan produced research publications mainly on the NLP of the Yoruba and Igbo languages. While persons of reference were identified in other research hubs, none were identified at the University of Ibadan.
Although no research hub with a concentration of researchers on the Hausa language was identified, it was part of the following NLP projects: Broad Operational Language Translation,8 African Languages in the Field Speech Fundamentals and Automation,9 Dictionnaires mis en ligne par l’Université de Nantes,10 and Langage, Langues et Cultures d’Afrique.11
Apart from the major research hubs, universities that have hosted master’s and doctoral degree students who have produced theses on the NLP of Nigerian languages were deemed fit to be recognized as places of interest on the NLP of Nigerian languages. In total, 7 of the 14 theses that were identified were submitted to Nigerian universities, which were also the major research hubs that were mentioned earlier; three each were submitted to the University of Ibadan and Obafemi Awolowo University, and one was submitted to the University of Uyo. Other universities to which at least one thesis on the NLP of Nigerian languages had been submitted include the University of Sheffield (n = 2), Charles University in Prague in the Czech Republic (n = 2), the Vaal Triangle Campus of the North-West University in South Africa (n = 1), Simon Fraser University in Canada (n = 1), and Aston University in the United Kingdom (n = 1). This is another indication that the support for the NLP of Nigerian languages is strong from outside Nigeria.
Statistics showing the countries and major institutions of affiliation of the authors that contributed to the NLP of Nigerian languages are presented in Appendix 2. The statistics show that 47.01% of the authors were affiliated with Nigerian institutions, 15.13% were affiliated with US institutions, and 9.96% were affiliated with Malaysian institutions. Authors that were affiliated with institutions in Botswana, Benin, Belgium, Brazil, South Africa, and India worked primarily on the Yoruba language. On the other hand, authors that were affiliated with institutions in Niger, Ethiopia, Senegal, and Saudi Arabia worked primarily on the Hausa language. Apart from Malaysia, the top five foreign countries are the usual world powers and countries that collaborate with African countries. Contributors to the NLP of Nigerian languages in African countries were affiliated with institutions in Benin, South Africa, Botswana, Kenya, Senegal, and Niger.
The importance of Yoruba in Benin, and of Fulfulde and Hausa in Niger and Senegal, as major languages could explain the level of contribution of these three African countries to NLP research on the languages under review in this study. However, the strong participation of Malaysia in research on the NLP of Nigerian languages departs from the norm, as Malaysia is not one of the top five countries that have collaborated with Nigeria in previous studies (Boshoff 2009; Adams et al. 2014; Confraria and Godinho 2015; Asubiaro 2018, 2019). The Yoruba language attracted interest from most of the countries, followed by Hausa, Igbo, and Ibibio. A notable trend in the results is that authors from France and Niger were interested in Hausa, Kanuri, and Fulfulde. All three languages are spoken in Niger, a former colony of France, which could explain why authors from France contributed to NLP research on the local languages of its former colony. French is the official language of Niger, and previous studies, such as Asubiaro and Badmus (2020), have shown that the language of former colonial powers is a significant factor in determining how countries in Africa collaborate with each other and with their former colonial powers. All of the authors from Benin, Belgium, and Botswana contributed to the NLP of the Yoruba language.
Financial support
The result of the analysis of funding statements in the published articles is presented in Table 3. It shows that 27.95% of all of the articles were funded and that foreign funding agencies in the United States, France, the United Kingdom, Canada, Belgium, and Malaysia (n = 28) funded about twice as many studies as Nigerian funding agencies (n = 13). The most funded language overall is Yoruba; the Igbo language attracted more local funding, while Hausa and Yoruba attracted more foreign funding. This is another indication of the strong influence of other countries on the development of Nigerian languages. The local funding agencies are TETFund, ALT-I, the FGN’s Step-B Program, and the University of Nigeria. TETFund is the only government agency in Nigeria that is charged with the responsibility of funding research in higher institutions. This contrasts with South Africa, where the HLT of Indigenous languages is specially funded under the umbrella of the National HLT Network (Grover, van Huyssteen, and Pretorius 2011). In practice, financial support for the NLP of Nigerian languages is about twice as likely to come from organizations outside Nigeria.
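The funding figures above reduce to two small calculations, sketched here with the counts from the text. The variable names are illustrative, and the derived number of funded articles is an inference from the reported percentage, not a figure stated by the authors.

```python
TOTAL_ARTICLES = 161
FUNDED_SHARE = 0.2795  # 27.95% of the articles acknowledged funding

foreign_funded = 28  # studies funded by agencies outside Nigeria
local_funded = 13    # studies funded by Nigerian agencies

# Inferred count of funded articles; rounding assumes the percentage
# was computed from a whole number of articles.
funded_articles = round(FUNDED_SHARE * TOTAL_ARTICLES)

# How many foreign-funded studies there are per locally funded study.
ratio = foreign_funded / local_funded

print(funded_articles)
print(f"{ratio:.2f}")
```

The ratio comes out slightly above two, which is consistent with describing foreign funding as roughly twice as frequent as local funding.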
Conclusion and recommendations
This study has analysed publications on the NLP of Nigerian languages that were systematically retrieved using the WoS, Scopus, and Google search engines. Data sharing, authors’ affiliation, and funding information were extracted from the full texts of the publications for analysis. Information about the Nigerian languages that were studied in the publications and the areas of NLP research they covered was also included in the analysis. Major contributors at the leading research hubs were also identified. Only four Nigerian languages have featured significantly in the NLP studies; Yoruba was featured in the highest number of studies, followed by Hausa, Igbo, and Ibibio. The major research areas that were covered in the publications are speech analysis, computational morphology, the development of alternative technologies to a language-dependent keyboard, and machine translation. In total, 18 research datasets and computer codes were published in 14.28% of the analysed NLP publications. The resources that were published in the publications include corpora, electronic dictionaries, and computer codes.
There are indications that contributions to the development of Nigerian languages from outside the country are significant. First, it was noted that the number of papers written by authors affiliated with institutions outside Nigeria is greater than the number of papers written by authors affiliated with institutions in Nigeria. Second, most of the funding sources for the NLP studies were from outside Nigeria. Third, most of the shared research data were made available in publications written by authors affiliated with institutions outside Nigeria. Lastly, a significant number of universities outside Nigeria were identified as having hosted master’s and doctoral degree students who submitted theses on different aspects of the NLP of Nigerian languages.
One of the major practical implications of this study is that open scholarship is not popular in the data-intensive, yet resource-scarce, research context that was studied. Though an inventory of resources was created in this study, the sparseness of the data that were published speaks volumes about the persistent problem of resource scarcity. For instance, six of the ten languages that were included in the study were not featured on the list of languages with resources. On a positive note, the identified research hubs are potentially strong institutional structures for the development of Nigerian languages that need to be officially strengthened by stakeholders. The external research hubs and institutional supports for doctoral and master’s degree theses from outside Nigeria are also a positive development, and, if correctly harnessed, they could spur capacity building for the NLP of Nigerian languages. Underlying all of this, support from the governments at the state and federal levels is indispensable, especially for legal and financial purposes.
By implication, the absence of an ad hoc legal framework and the limited financial support for the NLP of Nigerian languages indicate that drastic improvement in the development of the Nigerian languages depends on efforts outside the research community, namely, from the governments in Nigeria. Therefore, beyond this research, there is a need for advocacy that will target the creation of a legal framework that will provide legal and financial support and consolidate the existing institutional structures. As a recommendation, there is a need for the provision of an ad hoc research-funding program within Nigeria, especially from the national and state governments. It is also recommended that NLP researchers from Nigeria should share their research outputs, including publications, research data, and source codes, on open repositories. It is also recommended that Nigerian universities should develop digital and open scholarship frameworks that will facilitate the adoption of open scholarship in the universities.
Limitations of the study
It is noteworthy that the search for publications was not exhaustive because offline information sources such as relevant departmental and institutional library theses catalogues were not searched; this was a limitation of this study. The search was therefore patently biased against publications and theses that were not available online. Furthermore, this study did not repeat the search with the language names with diacritics. There is a possibility that some studies that used the language names with their standard orthography that included diacritics (for example, Yoruba written as Yorùbá) were not retrieved. Research has shown that popular search engines do not collocate words in languages that are written with diacritics, and when search queries in these languages are posed in Romanized alphabets, the popular search engines do not have the capabilities to search effectively (Hammo 2009; Alpkocak and Ceylan 2012; Asubiaro 2014).
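One way a future search could reduce the diacritics limitation described above is to include both the standard-orthography form and the diacritic-free form of each language name in the query. The following is a minimal sketch of that idea, not the procedure used in this study. The function names are hypothetical, and the approach assumes that the diacritics are combining marks that Unicode NFD decomposition can separate from their base letters. Hooked letters such as Hausa ɓ have no such decomposition and are left unchanged.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks, underdots) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def or_clause(names: list[str]) -> str:
    """Build an OR clause holding each name as written and without diacritics."""
    terms: list[str] = []
    for name in names:
        for variant in (name, strip_diacritics(name)):
            if variant not in terms:  # skip duplicates such as plain "Hausa"
                terms.append(variant)
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


# Hypothetical usage: language names in standard orthography.
print(or_clause(["Yorùbá", "Ìgbò", "Hausa"]))
```

Whether a given database treats the two spellings as distinct terms would still need to be checked against that database's own documentation.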
E. Latunde Odeku Medical Library, College of Medicine, University of Ibadan
tasubiar@uwo.ca
Notes
1. GitHub, https://github.com/Toluwase/NLP-publications-on-Nigerian-languages (accessed December 14, 2020).
2. Aston University, http://www.cs.aston.ac.uk/intranet/~odejoboa/sytts (this website could not be accessed, which is why I made reference to it in this article).
3. Naijatranslate, http://www.naijatranslate.com (this website could not be accessed, which is why I made reference to it in this article).
4. University of Kentucky, http://www.cs.uky.edu/%CB%9Craphael/KATR.html (this website could not be accessed, which is why I made reference to it in this article).
5. GitHub, https://github.com/samuelcesc/yohcrdb (accessed December 14, 2020).
6. A research data publication.
7. Local Language Speech Technology Initiative project, http://llsti.org/ (accessed December 14, 2020).
8. Broad Operational Language Translation, https://www.darpa.mil/program/broad-operational-language-translation (accessed December 14, 2020).
9. African Languages in the Field Speech Fundamentals and Automation, http://alffa.imag.fr (accessed December 14, 2020).
10. Dictionnaires mis en ligne par l’Université de Nantes, http://dilaf.org; http://page-sperso.ls2n.fr/~enguehard-c/DiLAF/Dilaf_projet.php (accessed December 14, 2020).
11. Langage, Langues et Cultures d’Afrique, http://llacan.vjf.cnrs.fr/ (accessed December 14, 2020).
References
Appendix 1. Queries posed to Scopus and Web of Science
Database | Query | Number of results |
---|---|---|
Scopus | TITLE-ABS-KEY (((“natural language processing”) OR (“information” AND (“retrieval” OR “storage” OR “processing” OR “extraction”)) OR “comput*” OR “corpus” OR “machine” OR “automatic*” OR “electronic*” OR “internet” OR “web” OR “digit*”) AND (“yoruba” OR “hausa” OR “igbo” OR “fulfulde” OR “ijaw” OR “Ibibio” OR “ebira” OR “tiv” OR “kanuri” OR “efik”)) | 538 |
Web of Science | TS= (((“natural language processing”) OR (“information” AND (“retrieval” OR “storage” OR “processing” OR “extraction”)) OR “comput*” OR “corpus” OR “machine” OR “automatic*” OR “electronic*” OR “internet” OR “web” OR “digit*”) AND (yoruba OR hausa OR igbo OR fulfulde OR ijaw OR Ibibio OR ebira OR tiv OR kanuri OR efik)) | 245 |
Web of Science | TS= ((“morpholog* analys*” OR “phonolog* analys*” OR “text-to-speech” OR “speech analysis” OR “speech synthesis” OR “part-of-speech tag*” OR “text normali*” OR “diacritics restoration” OR “entity typing” OR “sentiment analysis” OR “text summar*” OR “document similarity” OR “text similarity” OR “language identification” OR “lexical analys*” OR “stop words” OR “optical character recognition” OR “word identification”) AND (yoruba OR hausa OR igbo OR fulfulde OR ijaw OR Ibibio OR ebira OR tiv OR kanuri OR efik)) | 22 |
Scopus | TITLE-ABS-KEY ((“morpholog* analys*” OR “phonolog* analys*” OR “speech synthesis” OR “text-to-speech” OR “speech analysis” OR “speech synthesis” OR “part-of-speech tag*” OR “text normali*” OR “diacritics restoration” OR “entity typing” OR “sentiment analysis” OR “text summar*” OR “document similarity” OR “text similarity” OR “language identification” OR “lexical analys*” OR “stop words” OR “optical character recognition” OR “word identification” OR “automatic speech recognition”) AND (yoruba OR hausa OR igbo OR fulfulde OR ijaw OR Ibibio OR ebira OR tiv OR kanuri OR efik)) | 60 |

Hausa-22 Ibibio-16 Igbo-43 Yoruba-85 Fula-2
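The queries in Appendix 1 share one structure: a group of topic terms joined with OR, combined through AND with a group of language names, and wrapped in a database-specific field prefix (TITLE-ABS-KEY for Scopus, TS= for Web of Science). The sketch below shows how such a string could be assembled. The function names and the shortened topic list are illustrative assumptions, straight quotation marks are used in place of the typeset ones, and the code only builds text; it does not contact either database.

```python
def or_group(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(field_prefix, topic_terms, language_terms):
    """Combine topic terms and language names under a field prefix."""
    return f"{field_prefix}({or_group(topic_terms)} AND {or_group(language_terms)})"


# Language names as listed in the appendix queries.
LANGUAGES = ["yoruba", "hausa", "igbo", "fulfulde", "ijaw",
             "Ibibio", "ebira", "tiv", "kanuri", "efik"]

# Illustrative subset of the topic terms, not the full list.
TOPICS = ["text-to-speech", "speech synthesis", "diacritics restoration"]

print(build_query("TITLE-ABS-KEY ", TOPICS, LANGUAGES))  # Scopus-style prefix
print(build_query("TS= ", TOPICS, LANGUAGES))            # Web of Science-style prefix
```

Keeping the term lists separate from the field prefix makes it easier to pose the same search to both databases and to record exactly which terms were used.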
Appendix 2. Countries and major institutions of affiliation and languages of interest
Countries | Number of authors | Languages | Major institutions |
---|---|---|---|
Nigeria | 118 | Ibibio (6.78%), Igbo (16.95%), Hausa (17.80%), Yoruba (63.56%) | Obafemi Awolowo University, Ile-Ife; University of Nigeria, Nsukka; African Languages Technology Initiative, Ibadan; University of Ibadan, Ibadan; University of Uyo, Uyo |
Malaysia | 25 | Hausa (52%), Igbo (8%), Yoruba (52%), Tiv (8%) | University of Malaya, Kuala Lumpur; Universiti Sultan Zainal Abidin, Besut, Terengganu |
USA | 38 | Ibibio (2.63%), Igbo (13.16%), Hausa (34.21%), Yoruba (71.05%), Tiv (2.63%), Efik (2.63%) | Carnegie Mellon University, Pittsburgh; Johns Hopkins University, Baltimore; University of Southern California, Los Angeles; George Washington University, Washington, DC; University of Pennsylvania, Philadelphia |
UK | 15 | Ibibio (20.00%), Igbo (33.33%), Yoruba (46.67%) | University of Sheffield, Sheffield; University of Edinburgh, Edinburgh; Aston University, Birmingham; Lancaster University, Lancashire; University College Cork, Cork |
France | 6 | Hausa (100%), Fulfulde (16.67%), Kanuri (16.67%) | Centre national de la recherche scientifique (CNRS), Université Aix Marseille, France; Laboratoire d’Informatique de Nantes–Atlantique, Université de Nantes, Nantes Cedex; Laboratoire de linguistique formelle, CNRS, Paris cedex; Université Grenoble Alpes, Grenoble |
Germany | 11 | Hausa (45.45%), Ibibio (18.18%), Yoruba (36.36%) | Universität Bielefeld, Bielefeld; Karlsruhe Institute of Technology, Karlsruhe; Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Saarbrücken; Saarland University |
Benin | 5 | Yoruba (100%) | University of Abomey-Calavi, Cotonou |
South Africa | 5 | Yoruba (80%), Igbo (20%), Fulfulde (40%) | University of the Western Cape, Cape Town; Council for Scientific and Industrial Research, Pretoria; North-West University, Vanderbijlpark |
Canada | 3 | Yoruba (100%), Igbo (66.6%), Hausa (66.6%) | Simon Fraser University, Burnaby, British Columbia; University of Western Ontario, London, Ontario |
Botswana | 2 | Yoruba (100%) | University of Botswana, Gaborone |
Kenya | 2 | Igbo (50%), Yoruba (50%) | University of Nairobi, Nairobi |
Niger | 2 | Hausa (100%), Kanuri (50%), Fulfulde (50%) | Université Abdou Moumouni, Niamey |
Belgium | 2 | Yoruba (100%) | Ghent University, Ghent; University of Antwerp, Antwerpen |
Australia | 1 | Hausa (100%), Igbo (100%), Yoruba (100%) | University of Sydney, Sydney |
Brazil | 3 | Yoruba (100%) | Federal University of Rio Grande do Sul, Porto Alegre |
Czech Republic | 2 | Yoruba (100%) | Charles University, Prague |
Ethiopia | 1 | Hausa (100%) | Addis Ababa University, Addis Ababa |
India | 4 | Hausa (75%), Yoruba (100%) | Maharaja Ranjit Singh Punjab Technical University, Bathinda |
Saudi Arabia | 1 | Hausa (100%) | King Khalid University, Abha |
Senegal | 1 | Hausa (100%) | Pleumeur-Bodou, France and Dakar, Senegal |
Switzerland | 2 | Fulfulde (100%) | École Polytechnique Fédérale de Lausanne |