    Multi-feature Based Chinese-English Named Entity Extraction from Comparable Corpora

    PACLIC 20 / Wuhan, China / 1-3 November, 200

    Integrated Use of Internal and External Evidence in the Alignment of Multi-Word Named Entities

    This paper proposes a method for extracting English multi-word named entities and their Japanese equivalents from a parallel corpus. The aim is to extract multi-word named entities that are not listed in the dictionary of an English-to-Japanese MT system and that appear infrequently in a parallel corpus. The method aligns candidates on the basis of two kinds of external evidence, provided by the context in which a bilingual pair appears, and two kinds of internal evidence within the pair. Each piece of evidence is assigned a score, and the aggregate score is computed as a weighted sum of these scores. The weights are estimated by logistic regression analysis. In an experiment using a parallel corpus of the Yomiuri Shimbun and The Daily Yomiuri, 86.36% of the extracted bilingual pairs with the highest scores were judged to be correct.
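
    The scoring scheme described in this abstract (a weighted sum of evidence scores, with weights estimated by logistic regression) might be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the toy data, the learning rate and the gradient-descent fitting loop are all assumptions made for the example.

```python
import math


def aggregate_score(scores, weights, bias=0.0):
    """Aggregate score: weighted sum of the evidence scores plus a bias."""
    return bias + sum(w * s for w, s in zip(weights, scores))


def fit_weights(samples, labels, lr=0.5, epochs=500):
    """Estimate weights with a plain stochastic-gradient logistic regression.

    Hypothetical helper: the abstract only says the weights are estimated
    by logistic regression, not how the fit is carried out.
    """
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = 1.0 / (1.0 + math.exp(-aggregate_score(x, weights, bias)))
            err = p - y  # gradient of the log-loss w.r.t. the aggregate score
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# Made-up training data: each row is
# [external_1, external_2, internal_1, internal_2] for one candidate pair;
# label 1 means the pair was judged a correct translation.
samples = [
    [0.9, 0.8, 0.7, 0.9],
    [0.8, 0.9, 0.6, 0.7],
    [0.2, 0.1, 0.3, 0.2],
    [0.1, 0.3, 0.2, 0.1],
]
labels = [1, 1, 0, 0]

weights, bias = fit_weights(samples, labels)
good = aggregate_score([0.85, 0.80, 0.75, 0.80], weights, bias)
bad = aggregate_score([0.15, 0.20, 0.10, 0.20], weights, bias)
print(good > bad)
```

    Candidate pairs would then be ranked by their aggregate score, so that the highest-scoring pairs are the ones proposed as translations.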

    Named entity translation matching and learning with mining from multilingual news.

    Cheung Pik Shan. Thesis (M.Phil.)--Chinese University of Hong Kong, 2004. Includes bibliographical references (leaves 79-82). Abstracts in English and Chinese.
    Chapter 1 --- Introduction
    Chapter 1.1 --- Named Entity Translation Matching
    Chapter 1.2 --- Mining New Translations from News
    Chapter 1.3 --- Thesis Organization
    Chapter 2 --- Related Work
    Chapter 3 --- Named Entity Matching Model
    Chapter 3.1 --- Problem Nature
    Chapter 3.2 --- Matching Model Investigation
    Chapter 3.3 --- Tokenization
    Chapter 3.4 --- Hybrid Semantic and Phonetic Matching Algorithm
    Chapter 4 --- Phonetic Matching Model
    Chapter 4.1 --- Generating Phonetic Representation for English
    Chapter 4.1.1 --- Phoneme Generation
    Chapter 4.1.2 --- Training the Tagging Lexicon and Transformation Rules
    Chapter 4.2 --- Generating Phonetic Representation for Chinese
    Chapter 4.3 --- Phonetic Matching Algorithm
    Chapter 5 --- Learning Phonetic Similarity
    Chapter 5.1 --- The Widrow-Hoff Algorithm
    Chapter 5.2 --- The Exponentiated-Gradient Algorithm
    Chapter 5.3 --- The Genetic Algorithm
    Chapter 6 --- Experiments on Named Entity Matching Model
    Chapter 6.1 --- Results for Learning Phonetic Similarity
    Chapter 6.2 --- Results for Named Entity Matching
    Chapter 7 --- Mining New Entity Translations from News
    Chapter 7.1 --- Metadata Generation
    Chapter 7.2 --- Discovering Comparable News Cluster
    Chapter 7.2.1 --- News Preprocessing
    Chapter 7.2.2 --- Gloss Translation
    Chapter 7.2.3 --- Comparable News Cluster Discovery
    Chapter 7.3 --- Named Entity Cognate Generation
    Chapter 7.4 --- Entity Matching
    Chapter 7.4.1 --- Matching Algorithm
    Chapter 7.4.2 --- Matching Result Production
    Chapter 8 --- Experiments on Mining New Translations
    Chapter 9 --- Experiments on Context-based Gloss Translation
    Chapter 9.1 --- Results on Chinese News Translation
    Chapter 9.2 --- Results on Arabic News Translation
    Chapter 10 --- Conclusions and Future Work
    Bibliography
    Appendices A-G

    Semantic Similarity of Texts

    This thesis addresses the determination of the semantic similarity of texts, with a focus on scalability. It includes a theoretical overview of the tools used to implement the system on test data. The test corpus consists of scholarly articles in English. The aim is to analyze these articles and modify them so that their semantic similarity can be analyzed more easily. One of the most important tools used is the representation of data in a vector space.

    Unsupervised learning of relation detection patterns

    Information extraction is the area of natural language processing whose goal is to obtain structured data from the relevant information contained in textual fragments. Information extraction requires a significant amount of linguistic knowledge. The specificity of this knowledge is a drawback for the portability of systems, since a change of language, domain or style demands costly human effort. Machine learning techniques have been applied for decades to overcome this portability bottleneck, progressively reducing the amount of human supervision involved. However, as the availability of large document collections increases, completely unsupervised approaches become necessary in order to mine the knowledge contained in them. The proposal of this thesis is to incorporate clustering techniques into pattern learning for information extraction, in order to further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation detection. Achieving this goal required, first, considering the different strategies by which this combination could be carried out; second, developing or adapting clustering algorithms suited to the task; and third, devising pattern learning procedures that incorporate clustering information. By the end of the thesis, the author had developed and implemented an approach for learning relation detection patterns which, using clustering techniques and minimal human supervision, is competitive with and even outperforms comparable state-of-the-art approaches. Postprint (published version)

    Automatic extraction of named entity translingual equivalence based on multi-feature cost minimization

    Meaning refinement to improve cross-lingual information retrieval

    Magdeburg, Univ., Faculty of Computer Science, dissertation, 2012, by Farag Ahme

    Suspect Until Proven Guilty, a Problematization of State Dossier Systems via Two Case Studies: The United States and China

    This dissertation problematizes the state dossier system (SDS): the production and accumulation of personal information on citizen subjects exceeding the reasonable bounds of risk management. SDS - comprising interconnecting subsystems of records and identification - damage individual autonomy and self-determination, impacting not only human rights, but also the viability of the social system. The research, a hybrid of case-study and cross-national comparison, was guided in part by a theoretical model of four primary SDS driving forces: technology, political economy, law and public sentiment. Data sources included government documents, academic texts, investigative journalism, NGO reports and industry white papers. The primary analytical instrument was the juxtaposition of two individual cases: the U.S. and China. Research found that constraints on the extent of the U.S. SDS today may not be significantly different from China's, a system undergoing significant change amidst growing public interest in privacy and anonymity. Much activity within the U.S., such as the practice of suspicious activity reporting, is taking place outside the domain of federal privacy laws, while ID systems appear to advance and expand despite clear public opposition. Momentum for increasingly comprehensive SDS appears to be growing, in part because the harms may not be immediately evident to the data subjects. The future of SDS globally will depend on an informed and active public; law and policy will need to adjust to better regulate the production and storage of personal information. To that end, the dissertation offers a general model and linguistic toolkit for the further analysis of SDS.

    IV International Scientific Congress "Society of Ambient Intelligence - 2021" (ISCSAI 2021). Kryvyi Rih, Ukraine, April 12-16, 2021

    IV International Scientific Congress "Society of Ambient Intelligence - 2021" (ISCSAI 2021). Kryvyi Rih, Ukraine, April 12-16, 2021: proceedings.