333 research outputs found

    Introduction (to Special Issue on Tibetan Natural Language Processing)

    Get PDF
    This introduction surveys research on Tibetan NLP, both in China and in the West, and contextualizes the articles contained in the special issue.

    Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

    Full text link
    Cross-language information retrieval (CLIR), where queries and documents are in different languages, has of late become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system in which we combine query translation and retrieval modules. We currently target the retrieval of technical documents, so the performance of our system is highly dependent on the quality of the translation of technical terms. However, technical term translation remains problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords in its phonetic script. Consequently, existing dictionaries struggle to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary of base words and translate compound words on a word-by-word basis, using a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which maps words not listed in the base-word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound-word translation and transliteration methods improve system performance.
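The word-by-word compound translation the abstract describes can be sketched roughly as follows. The dictionary entries, probabilities, and romanized example terms below are invented for illustration and are not taken from the paper; a real system would also transliterate unknown katakana loanwords rather than pass them through unchanged.

```python
# Sketch of dictionary-based compound-word translation with probabilistic
# ambiguity resolution: each base word is looked up independently and the
# most probable target candidate is kept.

BASE_DICT = {
    "jouhou": [("information", 0.9), ("intelligence", 0.1)],
    "kensaku": [("retrieval", 0.8), ("search", 0.2)],
}

def translate_compound(base_words):
    """Translate a compound term word by word, picking the most
    probable target-language candidate for each base word."""
    out = []
    for w in base_words:
        candidates = BASE_DICT.get(w)
        if candidates is None:
            # Unlisted word: a real system would transliterate here.
            out.append(w)
        else:
            out.append(max(candidates, key=lambda c: c[1])[0])
    return " ".join(out)

print(translate_compound(["jouhou", "kensaku"]))  # information retrieval
```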

    A real time Named Entity Recognition system for Arabic text mining

    Get PDF
    Arabic is the most widely spoken language in the Arab World. Most people of the Islamic World understand Classical Arabic because it is the language of the Qur'an. Although the number of Arabic Internet users (in the Middle East and North and East Africa) has increased considerably in the last decade, systems to analyze Arabic digital resources automatically are not as readily available as they are for English. Therefore, in this work, an attempt is made to build a real-time Named Entity Recognition system that can be used in web applications to detect the appearance of specific named entities and events in news written in Arabic. Arabic is a highly inflectional language, so we try to minimize the impact of Arabic affixes on the quality of the pattern recognition model applied to identify named entities. These patterns are built by processing and integrating different gazetteers, from DBPedia (http://dbpedia.org/About, 2009) to GATE (A general architecture for text engineering, 2009) and ANERGazet (http://users.dsic.upv.es/grupos/nle/?file=kop4.php). This work has been partially supported by the Spanish Center for Industry Technological Development (CDTI, Ministry of Industry, Tourism and Trade), through the BUSCAMEDIA Project (CEN-20091026), and also by the Spanish research projects MA2VICMR: Improving the access, analysis and visibility of multilingual and multimedia information on the web for the Region of Madrid (S2009/TIC-1542), and MULTIMEDICA: Multilingual Information Extraction in the Health domain and application to scientific and informative documents (TIN2010-20644-C03-01). The authors would also like to thank the IPSC of the European Commission's Joint Research Centre for allowing us to include the EMM search engine in our system.
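The combination of gazetteer lookup with affix handling that the abstract mentions can be illustrated with a minimal sketch. Everything here is a placeholder: the gazetteer entries, the romanized clitic list, and the example token are invented, and the real system works on Arabic script with far richer patterns.

```python
# Minimal gazetteer-lookup NER with naive prefix stripping: a token is
# normalized by removing a common clitic prefix, then matched against a
# gazetteer of known entity names.

GAZETTEER = {"madrid": "LOCATION", "dbpedia": "ORGANIZATION"}
PREFIXES = ("al", "wa", "bi")  # common Arabic clitics, romanized

def strip_prefix(token):
    """Remove one known clitic prefix, keeping at least 3 characters."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) > len(p) + 2:
            return token[len(p):]
    return token

def tag(tokens):
    """Label each token with its gazetteer class, or 'O' for none."""
    return [(t, GAZETTEER.get(strip_prefix(t.lower()), "O")) for t in tokens]

print(tag(["almadrid", "news"]))
```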

    Biomedical Term Extraction: NLP Techniques in Computational Medicine

    Get PDF
    Artificial Intelligence (AI), and its branch Natural Language Processing (NLP) in particular, are main contributors to recent advances in classifying documentation and extracting information from assorted fields, Medicine being one that has gathered much attention due to the amount of information generated in professional journals and other means of communication within the medical profession. The typical information extraction task from technical texts is performed via an automatic term recognition extractor. Automatic Term Recognition (ATR) from technical texts is applied to identify key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and the lexical patterns of a given language, in our case Spanish, Arabic and Japanese. In this article, we present the methods and techniques for creating a biomedical corpus of validated terms, together with several tools for optimal exploitation of the information contained in the corpus. This paper also shows how these techniques and tools have been used in a prototype.
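The abstract does not spell out its ATR algorithm, but a common baseline for lexical-pattern term recognition is to collect adjective/noun word sequences as candidate terms and rank them by frequency. The sketch below shows that baseline on a fabricated, POS-tagged mini-corpus; it is not the paper's method.

```python
from collections import Counter

# Toy automatic term recognition: collect adjective-noun and noun-noun
# bigrams as candidate terms and rank them by corpus frequency.

tagged = [
    [("chronic", "ADJ"), ("pain", "NOUN"), ("affects", "VERB"),
     ("bone", "NOUN"), ("marrow", "NOUN")],
    [("chronic", "ADJ"), ("pain", "NOUN"), ("persists", "VERB")],
]

def candidate_terms(sentences):
    counts = Counter()
    for sent in sentences:
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            if t1 in ("ADJ", "NOUN") and t2 == "NOUN":
                counts[f"{w1} {w2}"] += 1
    return counts.most_common()

print(candidate_terms(tagged))  # [('chronic pain', 2), ('bone marrow', 1)]
```

Real ATR systems refine this with statistical termhood measures (e.g. C-value) and domain-specific filtering, but the pattern-then-rank structure is the same.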

    Genetic Algorithm (GA) in Feature Selection for CRF Based Manipuri Multiword Expression (MWE) Identification

    Full text link
    This paper deals with the identification of Multiword Expressions (MWEs) in Manipuri, a highly agglutinative Indian language listed in the Eighth Schedule of the Indian Constitution. MWEs play an important role in Natural Language Processing (NLP) applications such as Machine Translation, Part-of-Speech tagging, Information Retrieval and Question Answering. Feature selection is an important factor in the recognition of Manipuri MWEs using Conditional Random Fields (CRF). The disadvantage of manually selecting appropriate features for running the CRF motivates us to apply a Genetic Algorithm (GA). Using the GA, we are able to find the optimal features for the CRF. We ran fifty generations of feature selection, with three-fold cross-validation as the fitness function. This model demonstrated a Recall (R) of 64.08%, Precision (P) of 86.84% and F-measure (F) of 73.74%, an improvement over CRF-based Manipuri MWE identification without the GA. Comment: 14 pages, 6 figures, see http://airccse.org/journal/jcsit/1011csit05.pd
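A GA over binary feature masks, as described, can be sketched as below. The fitness function here is a stand-in: in the paper's setup it would train a CRF with the selected features and return the three-fold cross-validated F-measure, which is far too heavy for a sketch. The "useful" feature set and all parameters are invented.

```python
import random

# Toy genetic algorithm for binary feature selection: each individual is
# a 0/1 mask over the feature set; selection keeps the fittest half, and
# children are produced by one-point crossover plus occasional mutation.

random.seed(0)
N_FEATURES = 8
USEFUL = {0, 2, 5}  # pretend only these features help the model

def fitness(mask):
    """Stand-in for cross-validated F-measure of a CRF trained
    with the features selected by `mask`."""
    hits = sum(1 for i in USEFUL if mask[i])
    noise = sum(mask) - hits
    return hits - 0.1 * noise  # reward useful features, penalize extras

def evolve(pop_size=20, generations=50):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # elitist selection
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_FEATURES)  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:              # mutation
                i = random.randrange(N_FEATURES)
                child[i] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print([i for i, bit in enumerate(best) if bit])
```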

    A Rule-based Methodology and Feature-based Methodology for Effect Relation Extraction in Chinese Unstructured Text

    Get PDF
    The Chinese language differs significantly from English, both in lexical representation and in grammatical structure. These differences lead to problems specific to Chinese NLP, such as word segmentation and flexible syntactic structure. Many conventional methods and approaches in Natural Language Processing (NLP) developed for English text prove ineffective when confronting these language-specific problems in the later-developing field of Chinese NLP. Relation Extraction (RE) is an area of NLP that seeks to identify semantic relationships between entities in text. The term "Effect Relation" (ER) is introduced in this research to refer to a specific type of relationship between two entities, where one entity has a certain "effect" on the other. In this research project, a case study on Chinese text from Traditional Chinese Medicine (TCM) journal publications is built to closely examine the forms of Effect Relation in this text domain. This case study targets the effect of a prescription or herb in the treatment of a disease, symptom or body part. A rule-based methodology is introduced in this thesis. It utilises predetermined rules and templates derived from the characteristics and patterns observed in the dataset. This methodology achieves an F-score of 0.85 in its Named Entity Recognition (NER) module, 0.79 in its Semantic Relationship Extraction (SRE) module, and an overall performance of 0.46. A second, feature-based methodology is also introduced. It views the RE task as a classification problem and utilises a mathematical classification model with features consisting of contextual information and rules. It achieves F-scores of 0.73 (NER) and 0.88 (SRE), and an overall performance of 0.41. The role of functional words in the contemporary Chinese language, and in relation to the ERs in this research, is also explored: functional words prove effective as rules in the rule-based methodology for detecting ER entities with complex structure.
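The rule/template idea can be illustrated with a minimal extractor. The template mirrors the "herb treats disease" pattern type the thesis targets, but the entity lists, the relation label, and the English example sentence are invented stand-ins; the actual system operates on Chinese TCM text with much richer templates.

```python
import re

# Sketch of a rule/template-based effect-relation extractor: a surface
# pattern proposes (head, tail) pairs, and gazetteer-style entity lists
# act as type constraints on the two slots.

HERBS = {"ginseng", "liquorice"}
DISEASES = {"cough", "fatigue"}
PATTERN = re.compile(r"(\w+) (?:treats|relieves) (\w+)")

def extract_effect_relations(text):
    relations = []
    for head, tail in PATTERN.findall(text.lower()):
        if head in HERBS and tail in DISEASES:
            relations.append((head, "EFFECT_ON", tail))
    return relations

print(extract_effect_relations("Ginseng relieves fatigue."))
```

The feature-based alternative the thesis describes would instead feed such pattern matches, together with contextual features, into a trained classifier.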

    Hybrid tag-set for natural language processing.

    Get PDF
    Leung Wai Kwong. Thesis (M.Phil.)--Chinese University of Hong Kong, 1999. Includes bibliographical references (leaves 90-95). Abstracts in English and Chinese. Contents:
    Chapter 1, Introduction: Motivation; Objective; Organization of thesis.
    Chapter 2, Background: Chinese Noun Phrases Parsing; Chinese Noun Phrases; Problems with Syntactic Parsing (Conjunctive Noun Phrases, De-de Noun Phrases, Compound Noun Phrases); Observations (Inadequacy in Part-of-Speech Categorization for Chinese NLP, The Need of Semantic in Noun Phrase Parsing); Summary.
    Chapter 3, Hybrid Tag-set: Objectives (Resolving Parsing Ambiguities, Investigation of Nominal Compound Noun Phrases); Definition of Hybrid Tag-set; Introduction to Cilin; Problems with Cilin (Unknown Words, Multiple Semantic Classes); Introduction to Chinese Word Formation (Disyllabic Word Formation, Polysyllabic Word Formation, Observation); Automatic Assignment of Hybrid Tag to Chinese Word; Summary.
    Chapter 4, Automatic Semantic Assignment: Previous Researches on Semantic Tagging; SAUW - Automatic Semantic Assignment of Unknown Words (POS-to-SC Association, Morphology-based Deduction, Di-syllabic Word Analysis, Poly-syllabic Word Analysis); Illustrative Examples; Evaluation and Analysis (Experiments, Error Analysis); Summary.
    Chapter 5, Word Sense Disambiguation: Introduction to Word Sense Disambiguation; Previous Works on Word Sense Disambiguation (Linguistic-based Approaches, Corpus-based Approaches); Our Approach (Bi-gram Co-occurrence Probabilities, Tri-gram Co-occurrence Probabilities, Design Consideration, Error Analysis); Summary.
    Chapter 6, Hybrid Tag-set for Chinese Noun Phrase Parsing: Resolving Ambiguous Noun Phrases (Experiment, Results); Summary.
    Chapter 7, Conclusion: Summary; Difficulties Encountered (Lack of Training Corpus, Features of Chinese Word Formation, Problems with Linguistic Sources); Contributions (Enrichment to the Cilin, Enhancement in Syntactic Parsing); Further Researches (Investigation into Words that Undergo Semantic Changes, Incorporation of More Information into the Hybrid Tag-set).
    Appendices: A, POS Tag-set by Tsinghua University (清華大學); B, Morphological Rules; C, Syntactic Rules for Di-syllabic Words Formation.
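The thesis's bi-gram co-occurrence approach to word sense disambiguation can be sketched in miniature: when a word has several possible semantic classes, pick the class that co-occurs most often after the class of the neighbouring word. The class labels and counts below are fabricated placeholders, not Cilin classes or the thesis's statistics.

```python
# Toy word-sense disambiguation by bigram co-occurrence counts over
# semantic classes: the ambiguous word takes whichever of its candidate
# classes follows the previous word's class most frequently in training.

BIGRAM_COUNTS = {
    ("Bn", "Dn"): 12,  # class Bn followed by class Dn seen 12 times
    ("Bn", "Aa"): 3,
}

def disambiguate(prev_class, candidate_classes):
    """Choose the candidate semantic class with the highest
    co-occurrence count after prev_class (unseen pairs count 0)."""
    return max(candidate_classes,
               key=lambda c: BIGRAM_COUNTS.get((prev_class, c), 0))

print(disambiguate("Bn", ["Dn", "Aa"]))  # Dn
```

The tri-gram variant the thesis also describes conditions on two preceding classes instead of one, trading sparser counts for more context.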

    UCSY-SC1: A Myanmar speech corpus for automatic speech recognition

    Get PDF
    This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research is conducted by researchers around the world to improve their language technologies. Speech corpora are essential for developing ASR, and their creation is especially necessary for low-resourced languages. Myanmar can be regarded as a low-resourced language because of the lack of pre-existing resources for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus 1) is created for Myanmar ASR research. The corpus covers two domains: news and daily conversations. Its total size is over 42 hours: 25 hours of web news and 17 hours of recorded conversational data. The corpus was collected from 177 females and 84 males for the news data, and 42 females and 4 males for the conversational domain. This corpus was used as training data for developing Myanmar ASR. Three types of acoustic models, Gaussian Mixture Model - Hidden Markov Model (GMM-HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN), were built and their results compared. Experiments were conducted on different data sizes, and evaluation was done on two test sets: TestSet1, web news, and TestSet2, recorded conversational data. The performance of Myanmar ASRs trained on this corpus was satisfactory on both test sets, with word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.
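The word error rates quoted above follow the standard WER definition: the Levenshtein edit distance between the reference and hypothesis word sequences, divided by the reference length. A minimal implementation (the example sentences are invented):

```python
# Word error rate: (substitutions + deletions + insertions) / reference
# length, computed by dynamic-programming edit distance over words.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the corpus has news data", "the corpus has data"))  # 0.2
```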

    ANNOTATED DISJUNCT FOR MACHINE TRANSLATION

    Get PDF
    Most information found on the Internet is available in English. However, most people in the world are non-English speakers, so a reliable Machine Translation tool would be of great advantage to them. There are many approaches to developing Machine Translation (MT) systems, among them direct, rule-based/transfer, interlingua, and statistical approaches. This thesis focuses on developing MT for less-resourced languages, i.e. languages that lack an available grammar formalism, parser, and corpus, such as some languages in South East Asia. The nonexistence of bilingual corpora motivates us to use direct or transfer approaches. Moreover, the unavailability of a grammar formalism and parser in the target languages motivates us to develop a hybrid between the direct and transfer approaches, referred to as the hybrid transfer approach. This approach uses the Annotated Disjunct (ADJ) method, which, based on the Link Grammar (LG) formalism, can theoretically handle one-to-one, many-to-one, and many-to-many word translations. The method consists of a transfer rules module which maps source words in a source sentence (SS) into target words, in the correct position, in a target sentence (TS). The developed transfer rules are demonstrated on English → Indonesian translation tasks. An experimental evaluation measures the performance of the developed system against available English-Indonesian MT systems. The developed ADJ-based MT system translated simple, compound, and complex English sentences in the present, present continuous, present perfect, past, past perfect, and future tenses with better precision than other systems, with an accuracy of 71.17% on the Subjective Sentence Error Rate metric.
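The transfer-rules idea of mapping source words onto target words can be sketched in its simplest form. The English→Indonesian dictionary entries below are common textbook pairs, not the thesis's actual rule set, and this sketch ignores the positional reordering that full ADJ transfer rules also encode.

```python
# Minimal word-mapping transfer step: each rule maps a source word to a
# target word; an empty target means the source word is dropped, and
# unknown words pass through unchanged.

RULES = {"i": "saya", "read": "membaca", "book": "buku", "a": ""}

def transfer(sentence):
    out = []
    for word in sentence.lower().split():
        target = RULES.get(word, word)
        if target:
            out.append(target)
    return " ".join(out)

print(transfer("I read a book"))  # saya membaca buku
```

A full transfer system attaches position and agreement information to each rule, which is where the Link Grammar disjunct annotations come in.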