1,266 research outputs found

    Automatic Construction of Cross-lingual Networks of Concepts from the Hong Kong SAR Police Department

    Abstract. The tragic events of September 11 have prompted rapidly growing attention to national security and criminal analysis. In the national security world, very large volumes of data and information are generated and gathered. Much of this data and information, written in different languages and stored in different locations, may be seemingly unconnected. Cross-lingual semantic interoperability is therefore a major challenge in generating an overview of this disparate data and information so that it can be analysed and searched. Traditional information retrieval (IR) approaches normally require a document to share some keywords with the query. In reality, users may use keywords that differ from those used in the documents. There are then two different term spaces, one for the users and another for the documents. The problem can be viewed as the creation of a thesaurus: creating relationships between the two term spaces would allow the system to match queries with relevant documents even though they contain different terms. Apart from this, terrorists and criminals may communicate through letters, e-mails and faxes in languages other than English, and translation ambiguity significantly exacerbates the retrieval problem. To facilitate cross-lingual information retrieval, a corpus-based approach uses term co-occurrence statistics in parallel or comparable corpora to construct a statistical translation model that crosses the language boundary. However, collecting parallel corpora between European and Oriental languages is not an easy task because of the unique linguistic and grammatical structures of Oriental languages. In this paper, a text-based approach to aligning English/Chinese Hong Kong Police press release documents from the Web is first presented. The article then reports an algorithmic approach to generating a robust knowledge base, based on statistical correlation analysis of the semantics (knowledge) embedded in the bilingual press release corpus. The research output is a thesaurus-like semantic network knowledge base, which can aid semantics-based cross-lingual information management and retrieval.
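The corpus-based idea in this abstract, linking terms across languages through their co-occurrence in aligned documents, can be illustrated with a minimal sketch. The abstract does not say which association measure the authors use, so the Dice coefficient below is a swapped-in assumption, and the function name, threshold, and toy press-release terms are all hypothetical.

```python
from collections import Counter
from itertools import product


def cross_lingual_associations(aligned_pairs, min_score=0.6):
    """Score English/Chinese term pairs by document-level co-occurrence.

    aligned_pairs: iterable of (english_terms, chinese_terms), one entry per
    aligned document pair. The Dice coefficient is an assumed measure, not
    necessarily the one used in the paper.
    """
    en_df = Counter()    # number of aligned pairs containing each English term
    zh_df = Counter()    # number of aligned pairs containing each Chinese term
    pair_df = Counter()  # number of aligned pairs containing both terms

    for en_terms, zh_terms in aligned_pairs:
        en_set, zh_set = set(en_terms), set(zh_terms)
        en_df.update(en_set)
        zh_df.update(zh_set)
        pair_df.update(product(en_set, zh_set))

    scores = {}
    for (en, zh), both in pair_df.items():
        dice = 2 * both / (en_df[en] + zh_df[zh])
        if dice >= min_score:
            scores[(en, zh)] = dice
    return scores


# Hypothetical toy corpus: three aligned press releases, already tokenised.
toy_pairs = [
    (["police", "robbery"], ["警方", "劫案"]),
    (["police", "arrest"], ["警方", "拘捕"]),
    (["robbery", "arrest"], ["劫案", "拘捕"]),
]
associations = cross_lingual_associations(toy_pairs)
```

With this toy data, only term pairs that always appear together clear the threshold. A real system would also need the document alignment and term extraction steps that the abstract mentions but this sketch takes as given.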

    Category tree integration by exploiting hierarchical structure.

    Lin, Jianfeng. Thesis (M.Phil.)--Chinese University of Hong Kong, 2007. Includes bibliographical references (leaves 79-83). Abstracts in English and Chinese.
    Contents:
    Abstract
    Abstract (in Chinese)
    Acknowledgement
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1. Introduction
    Chapter 2. Related Work
        2.1. Ontology Integration
        2.2. Schema Matching
        2.3. Taxonomy Integration as Text Categorization
        2.4. Cross-lingual Text Categorization & Cross-lingual Information Retrieval
    Chapter 3. Problem Definition
        3.1. Mono-lingual Category Tree Integration
        3.2. Integration Operators
        3.3. Cross-lingual Category Tree Integration
    Chapter 4. Mono-lingual Category Tree Integration Techniques
        4.1. Category Relationships
        4.2. Decision Rules
        4.3. Mapping Algorithm
    Chapter 5. Experiment of Mono-lingual Category Tree Integration
        5.1. Dataset
        5.2. Automated Text Classifier
        5.3. Evaluation Metrics
            5.3.1. Integration Accuracy
            5.3.2. Precision and Recall and F1 value of the Three Operators
            5.3.3. Precision and Recalls of "Split"
        5.4. Parameter Tuning
        5.5. Experiments Results
    Chapter 6. Cross-lingual Category Tree Integration
        6.1. Parallel Corpus
        6.2. Cross-lingual Concept Space Construction
            6.2.1. Phase Extraction
            6.2.2. Co-occurrence analysis
            6.2.3. Associate Constraint Network for Concept Generation
        6.3. Document Translation
        6.4. Experiment Setting
        6.5. Experiment Results
    Chapter 7. Conclusion and Future Work
    Reference

    Clustering by compression

    We present a new method for clustering based on compression. The method doesn't use subject-specific features or background knowledge, and works as follows: First, we determine a universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is universal in that it is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal but uses the non-computable notion of Kolmogorov complexity. We propose precise notions of similarity metric and normal compressor, and show that the NCD based on a normal compressor is a similarity metric that approximates universality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics we present new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
    Comment: LaTeX, 27 pages, 20 figures
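The first step of the method, computing the NCD from compressed lengths of files taken singly and in pairwise concatenation, can be sketched in a few lines. This uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of `zlib` as the compressor and the sample byte strings are assumptions for illustration. The quartet-based clustering step is not shown.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # compressed length of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items):
    """Pairwise NCD matrix, the input a hierarchical clustering step would need."""
    return [[ncd(a, b) for b in items] for a in items]


# Illustrative inputs: two near-identical texts and one unrelated byte pattern.
english = b"the quick brown fox jumps over the lazy dog " * 50
english_variant = b"the quick brown fox leaps over the lazy dog " * 50
byte_ramp = bytes(range(256)) * 10

matrix = distance_matrix([english, english_variant, byte_ramp])
```

The two English strings should come out much closer to each other than either is to the byte pattern. With a real compressor the distance from an object to itself is small but usually not exactly zero.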

    A history and theory of textual event detection and recognition


    Unsupervised Transfer Learning for Human Activity Classification (人の行動分類のための教師なし転移学習)

    筑波大学 (University of Tsukuba)201

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012 Palanci

    Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision

    Contemporary approaches to natural language processing are predominantly based on statistical machine learning from large amounts of text, which has been manually annotated with the linguistic structure of interest. However, such complete supervision is currently only available for the world's major languages, in a limited number of domains and for a limited range of tasks. As an alternative, this dissertation considers methods for linguistic structure prediction that can make use of incomplete and cross-lingual supervision, with the prospect of making linguistic processing tools more widely available at a lower cost. An overarching theme of this work is the use of structured discriminative latent variable models for learning with indirect and ambiguous supervision; as instantiated, these models admit rich model features while retaining efficient learning and inference properties. The first contribution to this end is a latent-variable model for fine-grained sentiment analysis with coarse-grained indirect supervision. The second is a model for cross-lingual word-cluster induction and the application thereof to cross-lingual model transfer. The third is a method for adapting multi-source discriminative cross-lingual transfer models to target languages, by means of typologically informed selective parameter sharing. The fourth is an ambiguity-aware self- and ensemble-training algorithm, which is applied to target language adaptation and relexicalization of delexicalized cross-lingual transfer parsers. The fifth is a set of sequence-labeling models that combine constraints at the level of tokens and types, and an instantiation of these models for part-of-speech tagging with incomplete cross-lingual and crowdsourced supervision. In addition to these contributions, comprehensive overviews are provided of structured prediction with no or incomplete supervision, as well as of learning in the multilingual and cross-lingual settings. 
Through careful empirical evaluation, it is established that the proposed methods can be used to create substantially more accurate tools for linguistic processing, compared both to unsupervised methods and to recently proposed cross-lingual methods. The empirical support for this claim is particularly strong in the latter case; our models for syntactic dependency parsing and part-of-speech tagging achieve the hitherto best published results for a large number of target languages, in the setting where no annotated training data is available in the target language.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on its parallel data, and phrases that are not present in the training data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translation, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
    JRC.G.2-Global security and crisis management

    Principles and Applications of Data Science

    Data science is an emerging multidisciplinary field that lies at the intersection of computer science, statistics, and mathematics, with a range of applications, and is related to data mining, deep learning, and big data. This Special Issue on “Principles and Applications of Data Science” focuses on the latest developments in the theories, techniques, and applications of data science. The topics include data cleansing, data mining, machine learning, deep learning, and applications in medicine and healthcare as well as social media.