1,309 research outputs found

    Pruning False Unknown Words to Improve Chinese Word Segmentation

    Get PDF
    During the process of unknown word detection in Chinese word segmentation, many detected word candidates are invalid. These false unknown word candidates deteriorate the overall segmentation accuracy, as it will affect the segmentation accuracy of known words. Therefore, we propose to eliminate as many invalid word candidates as possible by a pruning process. Our experiments show that by cutting down the invalid unknown word candidates, we improve the segmentation accuracy of known words and hence that of the overall segmentation accuracy

    The Construction of a Dictionary for a Two-layer Chinese Morphological Analyzer

    Get PDF
    PACLIC 20 / Wuhan, China / 1-3 November, 200

    Adaptive text mining: Inferring structure from sequences

    Get PDF
    Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tacked adaptively

    Spoken content retrieval: A survey of techniques and technologies

    Get PDF
    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR

    Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval

    Get PDF
    Although more and more language pairs are covered by machine translation services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application which needs translation functionality of a relatively low level of sophistication since current models for information retrieval (IR) are still based on a bag-of-words. The Web provides a vast resource for the automatic construction of parallel corpora which can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this paper, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.Comment: 37 page

    Text Augmentation: Inserting markup into natural language text with PPM Models

    Get PDF
    This thesis describes a new optimisation and new heuristics for automatically marking up XML documents. These are implemented in CEM, using PPMmodels. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags, using n-gram models for large n and a variety of escape methods. Four corpora are discussed, including the bibliography corpus of 14682 bibliographies laid out in seven standard styles using the BIBTEX system and markedup in XML with every field from the original BIBTEX. Other corpora include the ROCLING Chinese text segmentation corpus, the Computists’ Communique corpus and the Reuters’ corpus. A detailed examination is presented of the methods of evaluating mark up algorithms, including computation complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory. A new taxonomy of markup complexities is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora

    Image Annotation and Topic Extraction Using Super-Word Latent Dirichlet

    Get PDF
    This research presents a multi-domain solution that uses text and images to iteratively improve automated information extraction. Stage I uses local text surrounding an embedded image to provide clues that help rank-order possible image annotations. These annotations are forwarded to Stage II, where the image annotations from Stage I are used as highly-relevant super-words to improve extraction of topics. The model probabilities from the super-words in Stage II are forwarded to Stage III where they are used to refine the automated image annotation developed in Stage I. All stages demonstrate improvement over existing equivalent algorithms in the literature
    corecore