
3
Statistical and Symbolic Paradigms in Arabic Computational Linguistics

ALI FARGHALY
DataFlux Corporation

With the inception of the digital age and, in particular, the widespread adoption of the Internet as a communication tool and a medium for information exchange, the amount of information available to the public has grown exponentially, while the tools for processing and extracting meaning from this enormous body of information have grown only linearly. To address these pressing needs, computational linguists have developed three main approaches to natural language processing (NLP): the statistical approach; the symbolic approach; and the hybrid approach, which combines features of both. In this chapter I present the history and progress of NLP, beginning with the introduction of the digital computer in the late 1940s and continuing to the rise of the Internet, which resulted in a massive explosion of information and the dominance of the digital format of communication. The abundance of information stored in electronic format requires computational tools for information processing, retrieval, and extraction. Because Arabic is a major world language, both Arabic speakers and international entities with an interest in the Arab world want to develop tools for the analysis and processing of Arabic data. Here I first present a brief description of the properties of the Arabic language that are crucial to Arabic Natural Language Processing (ANLP). Then I focus on the development of symbolic and statistical paradigms for the processing of natural language. I discuss these paradigms in the context of theoretical and practical considerations for developing Arabic machine translation systems. My conclusion is that whereas statistical approaches to ANLP seem to be more successful from a native Arabic perspective, NLP approaches that promote rigorous analysis of the Arabic language could better meet the need for Arabic information processing and also satisfy other important sociocultural needs.

NLP in the Digital Age

From its earliest development in the 1940s, the computer was hailed as an innovation that would facilitate and promote the development and dissemination of knowledge. However, the widespread adoption anticipated by the nascent information technology industry was constrained by the technology of the time: the expensive mainframe computers then available were affordable only to governments, academic centers, and the largest corporations. The advent of the personal computer in the 1980s created a paradigm shift in the information technology industry, because the tools for both creating and processing information in digital formats became available to small and midsize entities and even to individuals. Computers were becoming smaller, cheaper, and more powerful in both processing speed and data storage capacity (Gazdar and Pullum 1985). However, the utility of the newly affordable computer was not obvious to the nonspecialist. Only in the 1990s did the development and rapid worldwide adoption of the Internet create a second paradigm shift, one that enabled individuals to create and distribute information to and from the most remote corners of the world and ushered in what is now called the Digital Age. The quantity of documents created and stored in digital format every minute has grown from kilobytes to megabytes to gigabytes and is now estimated at several terabytes.
However, the glut of information enabled by the Internet has simultaneously created a pressing need for NLP tools to process, classify, and extract meaning from the huge volume of unstructured data on the Internet. A study undertaken by the University of California, Berkeley (Lyman and Varian 2003) estimated that as recently as 2002, 92 percent of the world's information was available and/or stored on magnetic tapes. It is now estimated that millions of documents are created daily, in sizes ranging from kilobytes to terabytes. E-mails alone account for 400,000 terabytes per year, and new social networking applications such as instant messaging create 5 terabytes daily. It is also estimated that 40 percent of the world's newly stored information is created in the United States. Much of this new information is therefore created in English, and multilingual applications are badly needed to make it accessible to speakers of other languages. Although information is considered a path to prosperity and a means to obtain power, the glut of information could instead lead to a poverty of knowledge when governments, academia, and industry simply lack the means to process this information efficiently and in a timely manner. The challenge is that information is encoded in natural language, yet the necessary human expertise is neither sufficient nor available to process...
