Abstract

The digital world is marked by large asymmetries in the volume of content available in different languages. As a direct corollary, this inequality also exists, in amplified form, in the number of resources (labeled and unlabeled datasets, pretrained models, academic research) available for the computational analysis of these languages, generally called natural language processing (NLP). The NLP literature divides languages into high- and low-resource languages. Thanks to early private and public investment in the field, Korean is generally considered a high-resource language. Yet the good fortunes of Korean in the age of machine learning obscure the divided state of the language: surveys of available resources and research focus solely on the standard language of South Korea, making it the sole representative of an otherwise diverse linguistic family that includes the Northern standard language as well as regional and diasporic dialects. This paper shows that the resources developed for the South Korean language do not necessarily transfer to the North Korean language. However, it also argues that this does not make North Korean a low-resource language. On the one hand, South Korean resources can be augmented with North Korean data to achieve better performance. On the other, North Korean has more resources than commonly assumed. Retracing the long history of NLP research in North Korea, the paper shows that a large number of datasets and studies exist for the North Korean language, even if they are not easily available. The paper concludes by exploring the possibility of "unified" language models and underscoring the need for active NLP research collaboration across the Korean peninsula.
