[Nasun-urt, Yinhua Hai] - Article - (English) - The Development and Preliminary Applications of Semantic Information Knowledge-Base of Mongolian
SAI Intelligent Systems Conference 2016, September 21-22, 2016 | London, UK
978-1-5090-1121-6/16/$31.00 ©2016 IEEE

The Development and Preliminary Applications of Semantic Information Knowledge-Base of Mongolian*

Yinhua Hai
School of Mongolian Studies, Inner Mongolia University, Hohhot, China
haiyh2008@163.com

Nasun-urt
School of Mongolian Studies, Inner Mongolia University, Hohhot, China
qingjirvm@126.com

Abstract: “The Semantic Information Knowledge-base of Mongolian” (SIKM) is a lexical semantic knowledge base oriented to natural language processing. The first phase of the project ran from 2009 to 2012, during which the three sub-bases for nouns, adjectives, and verbs were developed and completed. Considerably more effort has gone into Phase II, which began in 2013. The scale and quality of SIKM have been significantly improved, which will provide stronger support for semantic analysis. This paper presents the latest progress of SIKM from the perspective of general base development, expansion of scale, improvement of semantic classification, perfection of attribute description, and some preliminary applications in the system.

Keywords: Mongolian; Semantic Information Knowledge-Base; New Progress

I. INTRODUCTION

Since the late 1980s, Mongolian linguistics has drawn on the Semantic Field Theory and the Sememe Analysis Method of structural linguistics, which have provided a new methodology and a new direction for the development of Mongolian lexical semantics and opened a new era of “Contemporary Semantics of Mongolian”. Several works have introduced and illustrated these theories and methods to varying degrees, such as “The Contemporary Mongolian” (1996) [1], “The Semantic Foundation of Contemporary Mongolian” (Ulaanbaatar, 1997) [2], and “The Mongolian Semantics” (2001) [3]. Compared with the study of grammar, lexical semantics has been a weak link in Mongolian studies. Computer-based processing of and research on lexical semantics started only recently.
The linguist Xingguang Lin argues that the fundamental task of computational lexicology is to study the representation of lexical meanings and the computer applications of vocabulary, such as how to construct a machine dictionary and how to build a corpus (1999: 146) [4]. The machine dictionary is the backbone and basis of computer processing of modern languages, and its construction is an infrastructure for the modernization of a national language (Tianshun Yao, et al., 1995: 215-216) [5]. Since the 1980s, Mongolian information processing has made considerable achievements in the engineering of machine dictionaries. Examples include special electronic dictionaries such as the “Darihan Electronic Dictionary of Chinese-Mongolian”, “The Electronic Dictionary of Traditional Mongolian, Cyrillic-Mongolian and Chinese”, “The Grammatical Information Dictionary of Mongolian” (GIDM), and “The Big Dictionary of Chinese, Mongolian, English, Japanese”, as well as other lexical semantic knowledge bases built for particular application systems. A previous project funded by the National Social Science Foundation of China, “Information Processing-Oriented Research of Mongolian Semantics” [6], began studying the lexical semantics of Mongolian from the perspective of natural language processing in 2002 and became the cornerstone of the computational semantics of Mongolian. From 2009 to 2012, the research group undertook the project “Design and Implementing the Semantic Information Knowledge-base of Mongolian” (funded by the National Natural Science Foundation of China), and the development of SIKM began in those years. Since then, scholars have begun to focus on the construction of lexical semantic resources for Mongolian, especially the design and construction of semantic dictionaries.
Currently, the “Dictionary of Mongolian Polysemy”, the “Dictionary of Mongolian Homograph”, the “Dictionary of Mongolian Collocation” and the “Knowledge-base of Mongolian Idioms” have been put into use. The development of SIKM is based on GIDM (developed by the School of Mongolian Studies, Inner Mongolia University) and aims to provide more detailed and in-depth syntactic and semantic knowledge for the computer when it automatically analyses Mongolian sentences and generates sentences in foreign languages. As a result of the first phase, the semantic classification and the description of collocation were completed for three categories: noun, adjective and verb. However, the research basis of traditional Mongolian semantics has been comparatively weak, which makes the development of a semantic knowledge base a long-term language project. Phase II of SIKM, supported by “The Specific Project of Informationization of Mongolian Language and Script” of the Autonomous Region of Inner Mongolia, began in 2013. It brought a more substantial expansion of the dictionary and a comprehensive revision of the semantic classifications and attribute descriptions of all the words. Development of the general base of SIKM has also begun.

*This research is sponsored by the NSF of China project “The Design and Implementation of Semantic Information Knowledge-Base in Mongolian” (60873084), the NSSF project “Development and Research of Knowledge-Base of Mongolian Idioms” (12CYY062), and the Mongolian Language and Character of Inner Mongolia project “The Related Standard of Mongolian Idioms for Education” (MW-YB-2015014).
The research group has carried out the SIKM project using the method of “Semantic Classification + Attribute Description” [7], in accordance with the ontological features of Mongolian. The method draws on the basic design concept of the “Semantic Knowledge-Base of Contemporary Chinese” (SKCC) [8] and “the Chinese Concept Dictionary” (CCD) [9] [10], and on the methods of semantic analysis and description used in WordNet. The project is also grounded in the theory of lexicography and lexicology, Case Grammar and Valency Grammar [11]. At present, SIKM can provide comprehensive semantic knowledge for research on differentiating phrase structures, reducing lexical ambiguity, corpus tagging and other relevant issues, and new progress has been made. The following sections describe the latest progress of SIKM from the perspective of the expansion of scale, the development of the general base, the improvement of semantic classification, the perfection of attribute description and some preliminary applications in the system.

II. NEW PROGRESS OF KNOWLEDGE-BASE

A. Expansion of Scale

The original SIKM had 38,883 entries, all of which came from the GIDM. Since 2009, the research group has extended the scale of SIKM from 38,883 to 60,000 entries. Phase II of SIKM has incorporated the latest results of the GIDM. The research group has checked all the original attribute fields in SIKM and revised some of them, such as “Mongolian words” (MONGOL), “phonetics” (GALIG) and “part of speech” (UGSAYIMAG), which were taken directly from the GIDM. The research group has expanded SIKM in the following two respects.

1) Frequently used words have been added to the SIKM database. When SIKM was being constructed, the research group found that certain common words, such as “ ” (consonant), were not included, because GIDM, the original source of entries, did not cover some commonly used words.
To solve this problem, the research group added these frequently used words to the SIKM database.

2) More meanings have been added for polysemous words. Polysemy is a common phenomenon in lexical semantics. Owing to a deficiency of the original GIDM, some polysemous words did not appear in the database with more than one meaning, but only with a single sense as one entry. The different senses of a word and its homonyms are important content for a dictionary. Therefore, to differentiate the senses of polysemous words, each sense has been added as a separate entry in the SIKM database. For example, “ ” is a polysemous word with three different senses, “tribe”, “type” and “prefecture”, so three sense entries have been added for this word.

At present, there are 60,000 entries in SIKM, organized in four databases: a “general base” containing all the words, and three sub-bases for nouns, adjectives and verbs. Each database file describes words and their semantic attributes in detail as a two-dimensional relation. For each base, the number of entries, the number of attribute fields, the amount of information (amount of information = number of entries × number of fields) and the total amount of information are given in Table I. The total amount of information is about 1.1 million. The general base and all of the sub-bases are interconnected by three key attributes: “words”, “phonetics”, and “parts of speech”.

TABLE I. COLLECTION OF RELATED INFORMATION OF SIKM

Name of base    Entries   Attributes   Amount of information   Total amount of information
General base    60,000    16           960,000                 1,099,540
Noun            16,000    26           416,000
Adjective       11,025    21           231,525
Verb            32,365    11           356,015

B. Development of General Base

SIKM is a knowledge base that describes the lexical semantic features of the commonly used words of contemporary Mongolian from multiple perspectives and at multiple levels.
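The interconnection of the general base and the sub-bases through the three key attributes can be pictured with a small sketch. The Python code below is only an illustration under stated assumptions: the row layout, the names `LinkKey`, `index_by_key` and `linked_rows`, the placeholder word string and the class labels are all hypothetical and are not taken from the SIKM implementation.

```python
from typing import Dict, List, NamedTuple, Tuple


class LinkKey(NamedTuple):
    """The three attributes that connect the general base to the sub-bases."""
    word: str       # "Mongolian words" (MONGOL); a placeholder string in this sketch
    phonetics: str  # "phonetics" (GALIG)
    pos: str        # "part of speech" (UGSAYIMAG)


# Toy rows of a general base (illustrative values only).
GENERAL_BASE: List[dict] = [
    {"key": LinkKey("<AMA>", "AMA", "Ne1"), "sense": 1, "gloss": "mouth"},
    {"key": LinkKey("<AMA>", "AMA", "Ne2"), "sense": 2, "gloss": "side"},
    {"key": LinkKey("<AMA>", "AMA", "Ne1"), "sense": 4, "gloss": "population"},
]

# Toy rows of a noun sub-base; the class labels are invented placeholders.
NOUN_BASE: List[dict] = [
    {"key": LinkKey("<AMA>", "AMA", "Ne1"), "sense": 1, "semantic_class": "CLASS_X"},
    {"key": LinkKey("<AMA>", "AMA", "Ne2"), "sense": 2, "semantic_class": "CLASS_Y"},
]


def index_by_key(rows: List[dict]) -> Dict[LinkKey, List[dict]]:
    """Group rows by link key; one key may map to several sense entries."""
    index: Dict[LinkKey, List[dict]] = {}
    for row in rows:
        index.setdefault(row["key"], []).append(row)
    return index


def linked_rows(
    key: LinkKey, general_rows: List[dict], sub_rows: List[dict]
) -> Tuple[List[dict], List[dict]]:
    """Return the general-base rows and sub-base rows that share one link key."""
    general = index_by_key(general_rows).get(key, [])
    sub = index_by_key(sub_rows).get(key, [])
    return general, sub


if __name__ == "__main__":
    g, s = linked_rows(LinkKey("<AMA>", "AMA", "Ne1"), GENERAL_BASE, NOUN_BASE)
    print([row["sense"] for row in g], [row["semantic_class"] for row in s])
```

Because each sense of a polysemous word is stored as its own entry, the three key attributes alone may match several rows, so the sketch returns lists rather than a single record.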
In the early stage of its development, owing to the theoretical and technical limitations of traditional Mongolian semantics and the immature techniques of Mongolian information processing, the research group had difficulty producing an overall description of the semantic features of Mongolian words. The group was therefore only able to develop SIKM as three sub-bases (noun, adjective and verb) and made hardly any attempt to design and develop a general base. Only in 2013, with Phase II, did development of the “general base” begin (as shown in Fig. 1), building on the technique and experience gained from developing the three sub-bases. As mentioned above, the “general base” is composed of 60,000 words and 16 attribute fields, as shown in Table 4.

Fig. 1. Sample of General Base of SIKM

C. Improvement of Semantic Classification

The research group made a rough classification of the meanings of Mongolian words while undertaking the previous project “Information Processing-oriented Research of Mongolian Semantics”. However, because of theoretical and technical limitations, that classification could not cover all the common Mongolian words in SIKM and failed to meet the deeper needs of Mongolian information processing. The semantic classification of SIKM is based on all the entries included in the knowledge base, and its further refinement depends mainly on the depth and breadth required by semantic analysis. Preliminary application tests showed that the classification has great practical value for Mongolian information processing. During Phase II of SIKM, the research group has regarded the following factors as essential when dealing with the semantic classification. First, the classification should meet the actual requirements of corpus tagging and language analysis in Chinese-Mongolian machine translation.
Second, the classification should be easily compatible with Mongolian WordNet and able to share semantic resources with “the Dictionary of Mongolian Polysemy”, “the Dictionary of Mongolian Homograph” and other semantic knowledge bases in the near future. With reference to the semantic classification materials currently available, the research group made some adjustments and additions to the former classification of SIKM.

1) The semantic classification of verbal nouns was modified according to the semantic classification of verbs.

2) The adjective classification was made more detailed: its categories were refined from the former six into seven and correlated with the noun classification, so that the collocation between adjectives and nouns can be described in much more detail.

The semantic classification of SIKM now covers nouns, adjectives and verbs. Words are classified according to their basic lexical meaning, and corresponding marks have been developed. The noun semantic classification has 7 categories and 191 sub-categories (as shown in Fig. 2); the adjective semantic classification has 6 categories and 217 sub-categories; and the verb semantic classification has 5 categories and 121 sub-categories.

Fig. 2. Sample of Semantic Classification of Noun

D. Perfection of the Attribute Description

The original focus of the semantic knowledge base was on classifying nouns, adjectives and verbs and on describing their semantic collocation restrictions on the basis of valency theory.
However, because of the imperfect semantic classification and the limitations of the technology, the first phase was unable to provide some of the information in SIKM, such as “semantic classification” and other fields of the noun base, and “semantic classification”, “collocation”, “quantity of valency”, “quality of valency” and other fields of the adjective base. In Phase II, these deficiencies have been revised or filled in, and the quality of the attributes has been greatly enhanced. The attributes described in SIKM can be broadly divided into the following five categories.

1) Connection information: The three key attributes, “Mongolian words”, “phonetics”, and “parts of speech”, are carried over directly from the GIDM. Linking to these fields not only ensures the standardization of the words included in the semantic knowledge base and the accuracy of part-of-speech tagging, but also extends the system by connecting it with syntactic information, making the semantic knowledge more comprehensive.

2) Basic semantic information: Words themselves possess certain basic semantic features, such as sense, homograph, interpretation, synonym, antonym and idioms. For example, the word “AMA” (mouth) has five senses, which are described as shown in Table II. Basic semantic information can provide rich knowledge for word sense disambiguation and semantic study.

3) Semantic classification information: This records the semantic class to which the word belongs according to the three major parts of speech, including the semantic category and sub-category.

TABLE II.
SAMPLE OF DESCRIPTION OF SENSES IN SIKM

PART1
word   phonetics   part of speech   sense   Chinese
       AMA         Ne1              1       嘴 (mouth)
       AMA         Ne2              2       边 (side)
       AMA         Ne2              3       闲话 (chat)
       AMA         Ne1              4       人口 (population)
       AMA         Ne2              5       接近 (nearly)

PART2
word interpretation homograph synonym antonym idiom
eating, speech organ AMA ALDAHV open part SEGUL AMA SEGER chat HELE AMA NIGETEI statist