[Nasun-urt, Hasi] - Article - (English) The Automatic Construction Method of Mongolian WordNet Noun Sets of Synonyms
The Automatic Construction Method of Mongolian WordNet Noun Sets of Synonyms

Hasi
Academy of Mongolian studies, Inner Mongolia University, Huhhot, China
e-mail: hasi@imnu.edu.cn

Nasun-urt
Academy of Mongolian studies, Inner Mongolia University, Huhhot, China
e-mail: mgnasun@imu.edu.cn

Abstract: Automatic construction of Mongolian noun sets of synonyms is the fundamental work to be accomplished first when developing the noun subnet of the Mongolian WordNet. This article proposes an approach for transforming the Chinese or English WordNet into Mongolian WordNet noun sets of synonyms, and describes the design and implementation of a maintenance system for these sets. Some application prospects, such as extension principles for noun sets of synonyms and a disambiguation algorithm, are discussed.

Keywords: WordNet; Mongolian WordNet; noun; sets of synonyms

I. INTRODUCTION

In recent years, researchers have explored many aspects of semantics in Mongolian information processing, including the semantic classification of nouns, verbs, and adjectives, case frames, and coordination valence theory, and these studies have achieved some results. However, machine translation systems and other natural language processing systems usually need an electronic dictionary that contains semantic knowledge, so that automatic analysis by computer can draw on more comprehensive and thorough semantic information. Although the "Mongolian information grammatical dictionary" covers some semantic information, it is not a semantic dictionary, so there is an urgent need to construct a Mongolian semantic information dictionary. In addition, current information-processing-oriented semantic research has not yet addressed the inner semantic relationships among Mongolian verbs, nouns, and adjectives: between verbs and nouns, between nouns and nouns, and between nouns and adjectives.
Therefore, the core of research on Mongolian semantic information processing is to use semantic network theory and concept interdependence theory to analyze the semantic relationships between words and phrases and to establish a semantic relationship network. The most representative resource for semantic relationships is the English WordNet, an online dictionary database system developed by the Cognitive Science Laboratory at Princeton University. It is a lexical semantic network for English that organizes English nouns, verbs, adjectives, and adverbs into sets of synonyms (synsets), each of which represents a basic lexical concept. It also establishes the lexical semantic relationships between these concepts, including synonymy, antonymy, hypernymy (the upper relationship), hyponymy (the lower relationship), meronymy (the part relationship), and holonymy (the whole relationship). At present, WordNet has been successfully used in word sense disambiguation, automatic language processing, bilingual and multilingual machine translation, retrieval systems, and so on, and it is widely considered the most important resource for computational linguistics, text analysis, and many other related fields. There is also research in China on a Chinese WordNet based on the English WordNet.

Absorbing the existing achievements of grammatical knowledge bases in order to construct a semantic knowledge base is considered an effective approach to Mongolian semantic research. The task of lexical semantic study in Mongolian information processing is to present the various semantic relationships comprehensively and thoroughly, in a form that meets the requirements of computer processing and is convenient for computation. Relationships are the soul of lexical semantics. The semantic relationships considered here hold between concepts and between properties.
By using theories and methods from computational linguistics and computational semantics, we will automatically construct an information-processing-oriented semantic network and provide a powerful, practical semantic resource for solving urgent problems in Mongolian information processing. Nouns are one of the three open word classes of Mongolian, so constructing a noun semantic relationship network is the basic step in exploring a Mongolian WordNet. To provide comprehensive and in-depth semantic knowledge for automatic analysis and generation by computer, we should establish a noun semantic network that is combined with language engineering applications and oriented toward a semantic knowledge base; this is basic research for the Mongolian WordNet. One important part of it is the construction of noun synonym sets, which is the first branch of the work that we have completed.

2011 Fourth International Conference on Intelligent Networks and Intelligent Systems. 978-0-7695-4543-1/11 $25.00 © 2011 IEEE. DOI 10.1109/ICINIS.2011.56

II. THE CONSTRUCTION OF MONGOLIAN NOUN SYNONYM SETS

A. The realization method of Mongolian noun synonym sets

Based on the noun concepts in the Mongolian information grammatical dictionary, we establish information-processing-oriented Mongolian noun synonym sets semi-automatically, using the relationships between WordNet concepts. First, we extract the noun concepts from the Mongolian information grammatical dictionary. We then look up the corresponding semantic frame and the relationships between concepts in WordNet and transplant them as the frame of the Mongolian WordNet, into which the Mongolian words are placed. The result is the noun subnet of the Mongolian WordNet. Some noun concepts in the Mongolian information grammatical dictionary do not exist in WordNet, so we add these lexical concepts to the semantic network manually, usually as leaf nodes at the lower levels. This ensures that the concepts at higher levels remain compatible with WordNet, while the concepts at lower levels retain maximum flexibility.

B.
The building process of Mongolian noun synset

WordNet's framework is organized by synset, so establishing Mongolian synsets and constructing the semantic network with the synset as the unit is the prerequisite for making the Mongolian WordNet compatible with the WordNets of other languages. To make full use of earlier research results, we labeled the nouns with synset IDs on the basis of the Mongolian grammatical information dictionary. The main steps are as follows:

(1) Based on the Mongolian synonym dictionary, we constructed a Mongolian synonym table. Each record is a group of synonyms, and the table contains more than 2,000 records.
(2) For each noun in the Mongolian grammatical information dictionary, we checked whether it has synonyms. If the noun is found in the Mongolian synonym dictionary, it is marked as a synonym.
(3) From the Mongolian grammatical information dictionary, we selected a noun not yet labeled with a synset ID and read its English equivalent.
(4) We looked up the equivalent in the Chinese-English WordNet. If a synset ID was found, we labeled the Mongolian noun with that synset ID. English WordNet synset IDs range from 100000000 to 400000000, and Chinese WordNet synset IDs range from 600000000 to 900000000. A Chinese synset ID is the corresponding English synset ID with 5 added to its first digit. For example, the synset ID of the English word "thing" is 100002056 and the synset ID of the Chinese word "物" is 600002056, so the synset ID of the Mongolian word "BVDAS" (thing) is 100002056.
(5) If the Mongolian grammatical information dictionary still contained nouns not labeled with a synset ID, the labeling process returned to step (3) and continued.

Because the Chinese and English equivalents given for Mongolian nouns in the Mongolian grammatical information dictionary are sometimes incomplete or inaccurate, we also looked up such nouns in other tools, such as the Darhan dictionary, and then searched for the corresponding words in the Chinese WordNet.
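The labeling steps and the ID arithmetic above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the small lookup tables are hypothetical stand-ins for the dictionary and the two WordNets, each equivalent is assumed to map to a single synset, and the ID ranges are read as "first digit 1 to 4" for English and "first digit 6 to 9" for Chinese.

```python
# Illustrative sketch only. The tables passed in below are tiny hypothetical
# stand-ins for the Mongolian dictionary and the English/Chinese WordNets.

ID_OFFSET = 500_000_000  # adding 5 to the first digit of a 9-digit synset ID


def english_to_chinese_id(english_id: int) -> int:
    """Chinese synset ID = English synset ID with 5 added to its first digit."""
    if not 100_000_000 <= english_id < 500_000_000:  # assumed: first digit 1 to 4
        raise ValueError(f"not an English synset ID: {english_id}")
    return english_id + ID_OFFSET


def chinese_to_english_id(chinese_id: int) -> int:
    """Inverse mapping: subtract 5 from the first digit."""
    if not 600_000_000 <= chinese_id < 1_000_000_000:  # assumed: first digit 6 to 9
        raise ValueError(f"not a Chinese synset ID: {chinese_id}")
    return chinese_id - ID_OFFSET


def label_nouns(nouns, english_equiv, chinese_equiv, english_index, chinese_index):
    """Label each noun with an English-range synset ID where one can be found.

    The English equivalent is tried first, then the Chinese equivalent.
    Nouns with no match are returned separately for manual handling.
    """
    labeled, unlabeled = {}, []
    for noun in nouns:
        synset_id = english_index.get(english_equiv.get(noun))
        if synset_id is None:
            chinese_id = chinese_index.get(chinese_equiv.get(noun))
            if chinese_id is not None:
                synset_id = chinese_to_english_id(chinese_id)
        if synset_id is None:
            unlabeled.append(noun)
        else:
            labeled[noun] = synset_id
    return labeled, unlabeled


labeled, unlabeled = label_nouns(
    nouns=["BVDAS", "BISILIG"],
    english_equiv={"BVDAS": "thing"},
    chinese_equiv={"BVDAS": "物"},
    english_index={"thing": 100002056},
    chinese_index={"物": 600002056},
)
print(labeled)    # nouns that received a synset ID
print(unlabeled)  # nouns left for manual assignment
```

Returning the unmatched nouns as a separate list keeps the automatic pass and the manual pass apart, which mirrors the loop between steps (3) and (5).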
Through the two methods above, we completed synset ID labeling for more than 8,000 words. The Chinese equivalents of some remaining words, such as "SIYANBeI" (鲜卑), do not exist in the Chinese WordNet, so we set their synset IDs manually, with values numbered automatically starting from m00000001. The prefix "m" marks vocabulary unique to Mongolian. For example, BISILIG (a kind of milk food) does not exist in the Chinese or English WordNet.

III. FURTHER PERFECTION OF MONGOLIAN NOUN SETS OF SYNONYMS

Sets of synonyms are the cornerstone of the WordNet lexicon and the foundation for building a thesaurus. Therefore, the first task in building the noun subnet is to assign a synset ID to every noun in the Mongolian grammatical information dictionary and to add an explanatory gloss to each synset. This is similar to a traditional dictionary, but a synset is not the same as a dictionary entry. In particular, a dictionary entry may be a polysemous word with several senses, whereas a synset carries only one gloss.

The best-known and most important psycholinguistic fact about the mental lexicon is that people are more familiar with some words than with others. Familiarity with a word shows in many ways: reading speed, speed of understanding, ease of recall, probability of use, and so on. These effects are so widespread that even researchers who hope to study other properties of words must take great pains to equate the words they use for familiarity. In other words, since the original purpose of the lexicon is to reflect psycholinguistic principles, ignoring the effects of word familiarity described above would be unthinkable. To reflect word familiarity in WordNet, a familiarity index was added for each word form.

Frequency of use is generally considered to reflect a word's familiarity. Closed-class words, which play an important syntactic role, are used with high frequency; however, even among open-class words, frequency of use varies greatly.
Frequency is usually assumed to correlate with familiarity, or is simply used to explain it. Frequency data can be found in the technical literature, but the available data are not sufficient for WordNet. Thorndike and Lorge (1944) published word frequency tables based on five million words of text, but they reported only the 30,000 most common words. In addition, they defined a word as a string of letters between two spaces, so their statistics for homographs are unreliable; for example, their results cannot show how often the word "lead" occurs as a noun and how often as a verb. Francis and Kučera (1982) marked each word's part of speech with their own syntactic class markers, but their published results (50,400 word forms, including many proper nouns) come from a text of only 1,014,000 words, which is not enough to reflect the frequency of uncommon words. (At a speaking rate of 120 words per minute, 1 million words corresponds to about 140 hours, or about two weeks of one person's speech.)

WordNet's developers use another method to express familiarity. Zipf (1945) showed that the frequency of occurrence of a word is related to its polysemy: on average, the more frequently a word is used, the more meanings it has in dictionaries. An interesting finding in psycholinguistics (Jastrzembski, 1981) is that polysemy seems to predict lexical access time as well as frequency does. WordNet therefore uses polysemy, rather than frequency, as its index of familiarity. This measure can be determined from a machine-readable dictionary. If an index value of 0 is assigned to words that do not appear in the dictionary, and values of 1 or more are assigned according to the number of senses a word has, then an index value is available for every word in every syntactic category.
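The polysemy-based index described above amounts to a sense count per word form and part of speech. The following is a minimal sketch, assuming a machine-readable dictionary has already been loaded into a mapping from (word form, part of speech) to a list of senses; the mapping `SENSES`, its contents, and the function name are hypothetical.

```python
# Hypothetical sense inventory: (word form, part of speech) -> list of senses.
# The glosses are invented for illustration and come from no real dictionary.
SENSES = {
    ("lead", "noun"): ["a heavy metal", "a leash"],
    ("lead", "verb"): ["to guide", "to be ahead", "to result in"],
}


def familiarity_index(sense_inventory, word_form, pos):
    """Return the number of senses of word_form as the given part of speech.

    A word form that is absent from the dictionary gets index 0;
    otherwise the index is its sense count (1 or more).
    """
    return len(sense_inventory.get((word_form, pos), ()))


for key in [("lead", "noun"), ("lead", "verb"), ("lead", "adverb")]:
    print(key, familiarity_index(SENSES, *key))
```

Keying the inventory on both the word form and the part of speech keeps homographs such as the noun and the verb "lead" apart, which a purely string-based frequency count cannot do.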
Associated with every word form in WordNet, therefore, is an integer that represents the number of senses that word form has when it is used as a noun, verb, adjective, or adverb. In this research, Mongolian polysemous nouns were marked for familiarity using the "Mongolian Polysemous Word Information Dictionary". In addition, a polysemous word can be assigned to the corresponding synonym set according to its basic sense, and to further sets according to its other senses, which enriches the Mongolian noun synonym sets.

IV. APPLICATION OF NOUN SYNONYM SETS

A noun usually has a single hypernym, and lexicographers include it in the definition; since a noun can have many hyponyms, lexicographers seldom list them. Generalization over nouns in WordNet relies on the noun hierarchy, which is an important and valuable basis of the WordNet noun subnet. The selectional restrictions that verbs place on the words they combine with also show the importance of the noun hierarchy. For example, the direct object of the verb "drink" can be any hyponym of "beverage". This implies that knowledge of the noun hierarchy should be stored in a way that allows it to be accessed and searched quickly.

After the Mongolian synsets are formed, the hyponymy, antonymy, and meronymy relations of the Chinese WordNet will be added to the Mongolian noun subnet by automatic transformation, forming the framework of the Mongolian noun subnet. Although languages differ, people's understanding of the world is similar at the conceptual level. Using the Chinese WordNet to construct the framework of the Mongolian noun subnet is therefore an effective path.

The WordNet noun subnet is a linguistic knowledge base whose aim is to serve the understanding and processing of natural language. In this research, besides building a search application for the WordNet noun subnet, the other main goal is to solve the problem of ambiguity.
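One simple way to store a noun hierarchy so that it can be searched quickly is a mapping from each concept to its single hypernym. The sketch below is an assumption for illustration, not the system described in this paper: the toy hierarchy `HYPERNYM_OF` and all function names are hypothetical, each concept is assumed to have at most one hypernym, and the hierarchy is assumed to contain no cycles. It shows the "drink"/"beverage" style check and a path-length measure between two concepts.

```python
# Hypothetical toy hierarchy: concept -> its single hypernym.
HYPERNYM_OF = {
    "tea": "beverage",
    "milk": "beverage",
    "beverage": "liquid",
    "liquid": "substance",
    "stone": "substance",
}


def hypernym_chain(concept, hypernym_of):
    """Return [concept, its hypernym, ...] up to the root of its tree."""
    chain = [concept]
    while chain[-1] in hypernym_of:
        chain.append(hypernym_of[chain[-1]])
    return chain


def is_hyponym_of(concept, ancestor, hypernym_of):
    """True if ancestor appears strictly above concept in the hierarchy."""
    return ancestor in hypernym_chain(concept, hypernym_of)[1:]


def path_distance(a, b, hypernym_of):
    """Number of edges between a and b through their lowest shared ancestor.

    Returns None when the two concepts lie on different trees.
    """
    depth_from_a = {c: i for i, c in enumerate(hypernym_chain(a, hypernym_of))}
    for steps_from_b, c in enumerate(hypernym_chain(b, hypernym_of)):
        if c in depth_from_a:
            return depth_from_a[c] + steps_from_b
    return None


# A selectional check: is "tea" an acceptable direct object of "drink"?
print(is_hyponym_of("tea", "beverage", HYPERNYM_OF))
print(path_distance("tea", "milk", HYPERNYM_OF))
```

With one hypernym per concept, both lookups only walk up a chain, so their cost is bounded by the depth of the hierarchy rather than its size.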
The noun subnet will become the main dictionary resource for word sense disambiguation and may be used in semantic analysis, through its formal description of noun concepts and the simple structure of sense relationships between concepts. For example, consider the Mongolian sentence:

H0GVLAN GER-TU GAJAR UGEI B0LHVR BI GVWANJAN-DU BVDAG_A IDEBE
(As there was no place in the cafeteria, I ate at a restaurant.)

"GAJAR" in this sentence has six different meanings:

(1) HOMON TOROLHITEN-U AMIDVRA/N 0R0SI/JV BAYI/G_A DELEHEI-YIN BOMBORCEG (the planet earth)
(2) HOROSO,SIR0I,TARIYA/N GAJAR (farmland)
(3) 0R0N NVTVG,0R0N BAYIRI (place)
(4) VRTV-YIN NIGECI.NIGE GAJAR NI JAGV TABI/N H00S ALDA(ARBA/N TABV/N YIN)BVYV HAGAS KIL0MetR-TEI TENGCE/N_E.BAYIGVLG_A BVYV TEGUN-U D0T0RAHI NIGE NIGECI (unit of length)
(5) $ALA HUSER (floor)
(6) SILTAGAN-V DAYIBVRI UGE (adverb referring to reasons)

These senses lie on different noun semantic trees. Simply by calculating the distance between "H0GVLAN GER" (cafeteria) and "GAJAR" (place), it can be determined that sense (3) is the closest in meaning.

V. CONCLUSION

At present, the framework of Mongolian noun synsets has largely been completed, and a few unique words still need to be assigned synset IDs manually. Semantic relation marking has started for the words that have been assigned synset IDs, using the synset as the organizational unit. Research on the Mongolian noun subnet is the beginning of Mongolian WordNet construction, which will be a long-term, dynamic, and large program involving various fields and technologies. As it started not long ago, much remains to be improved. Synonymy gathering information will be perfected by next steps of