
    2kenize: Tying Subword Sequences for Chinese Script Conversion

    Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities. Comment: Accepted to ACL 202
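
The one-to-many ambiguity described in this abstract can be illustrated with a minimal sketch. This is not the 2kenize model, which works on subword sequences with two language models; it is a character-level toy in which the candidate table `CANDIDATES` and the bigram counts `BIGRAM_COUNTS` are hypothetical stand-ins for a real mapping table and a real traditional-script language model.

```python
from itertools import product

# Hypothetical one-to-many table: simplified character -> traditional candidates.
CANDIDATES = {"发": ["發", "髮"], "头": ["頭"], "展": ["展"]}

# Hypothetical bigram counts standing in for a traditional-script language model.
BIGRAM_COUNTS = {("頭", "髮"): 12, ("發", "展"): 30, ("頭", "發"): 1}


def score(chars):
    """Sum the toy bigram counts over adjacent character pairs."""
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(chars, chars[1:]))


def convert(simplified):
    """Pick the highest-scoring traditional candidate sequence."""
    # Characters missing from the table are passed through unchanged.
    options = [CANDIDATES.get(ch, [ch]) for ch in simplified]
    best = max(product(*options), key=score)
    return "".join(best)


print(convert("头发"))  # 发 resolves to 髮 (hair)
print(convert("发展"))  # 发 resolves to 發 (develop)
```

Enumerating every candidate sequence is exponential in the number of ambiguous characters, so a practical system would presumably use a beam or dynamic-programming search instead.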

    Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences

    Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and syntactic analysis or on pre-segmented data; but these are labor-intensive, and the lexico-syntactic techniques are vulnerable to the unknown word problem. In contrast, we introduce a novel, more robust statistical method utilizing unsegmented training data. Despite its simplicity, the algorithm yields performance on long kanji sequences comparable to and sometimes surpassing that of state-of-the-art morphological analyzers over a variety of error metrics. The algorithm also outperforms another mostly-unsupervised statistical algorithm previously proposed for Chinese. Additionally, we present a two-level annotation scheme for Japanese to incorporate multiple segmentation granularities, and introduce two novel evaluation metrics, both based on the notion of a compatible bracket, that can account for multiple granularities simultaneously. Comment: 22 pages. To appear in Natural Language Engineering

    Natural Language Processing Using Neighbour Entropy-based Segmentation

    In natural language processing (NLP) of Chinese hazard text collected during hazard identification, Chinese word segmentation (CWS) is the first step in extracting meaningful information from such semi-structured Chinese texts. This paper proposes a new neighbor entropy-based segmentation (NES) model for CWS. The model considers the segmentation benefits of neighbor entropies, adopting the concept of "neighbor" from optimization research. It is defined by the benefit ratio of text segmentation, which includes the benefits and losses of combining segmentation units, and it uses more information than other popular statistical models. In the experiments performed, together with the maximum-based segmentation algorithm, the NES model achieves 99.3% precision, 98.7% recall, and a 99.0% F-measure for text segmentation; these results are higher than those of existing tools based on seven other popular statistical models. Results show that the NES model is a valid CWS method, especially for text segmentation requirements necessitating longer-sized characters. The text corpus used comes from the Beijing Municipal Administration of Work Safety and was recorded in the fourth quarter of 2018.
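
The abstract does not give the NES benefit ratio, so the sketch below covers only the underlying quantity, under the assumption that "neighbor entropy" means the Shannon entropy of the characters found immediately to the left and right of a candidate unit in a corpus. The function names `neighbor_entropy` and `_entropy` are illustrative, not taken from the paper.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighbor characters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def neighbor_entropy(corpus, unit):
    """Return (left, right) entropy of the characters adjacent to `unit`."""
    left, right = Counter(), Counter()
    start = corpus.find(unit)
    while start != -1:
        end = start + len(unit)
        if start > 0:
            left[corpus[start - 1]] += 1
        if end < len(corpus):
            right[corpus[end]] += 1
        # Continue from the next position so overlapping matches are counted.
        start = corpus.find(unit, start + 1)
    return _entropy(left), _entropy(right)


left_h, right_h = neighbor_entropy("abxabyabz", "ab")
print(left_h, right_h)
```

A unit whose neighbors vary widely (high entropy on both sides) behaves like a free-standing word, while low entropy on one side suggests the unit is a fragment of a longer word.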

    New Light Shed on Chinese Word Segmentation in MT by a Language Investigation

    The Chinese language, unlike some Western languages, is written without spaces between words, which poses a unique problem for Machine Translation: how should words in Chinese be segmented? Current word-segmentation systems in Machine Translation are either linguistically oriented or statistically oriented. Both types, however, have innate defects that cannot be overcome because of the pragmatically oriented nature of the Chinese language. This research addresses the problem of Chinese word segmentation in Machine Translation in light of a language investigation consisting of two surveys and eight interviews.

    Research on Reasoning and Modeling of Solving Mathematics Situation Word Problems of Primary Schools

    This research developed a web-based reasoning system for mathematical situation word problems using natural language processing technology. Our system performs morphological analysis, syntax analysis, semantic analysis and rule judgment to infer the semantic structure and operational structure of situation word problems. It also uses MathML and SVG to provide a web-based illustration of the solving procedure for mathematical situation word problems. Keywords: situation word problem; natural language processing; MathML; SVG

    Statistical Augmentation of a Chinese Machine-Readable Dictionary

    We describe a method of using statistically collected Chinese character groups from a corpus to augment a Chinese dictionary. The method is particularly useful for extracting domain-specific and regional words not readily available in machine-readable dictionaries. Output was evaluated both by human evaluators and against a previously available dictionary. We also evaluated performance improvement in automatic Chinese tokenization. Results show that our method outputs legitimate words, acronymic constructions, idioms, names and titles, as well as technical compounds, many of which were lacking from the original dictionary. Comment: 17 pages, uuencoded compressed PostScript

    Description of the Chinese-to-Spanish rule-based machine translation system developed with a hybrid combination of human annotation and statistical techniques

    Two of the most popular Machine Translation (MT) paradigms are rule-based (RBMT) and corpus-based, the latter including statistical systems (SMT). When parallel corpora are scarce, RBMT becomes particularly attractive. This is the case for the Chinese-Spanish language pair. This article presents the first RBMT system for Chinese to Spanish. We describe a hybrid method for constructing this system that takes advantage of available resources such as parallel corpora, which are used to extract dictionaries and lexical and structural transfer rules. The final system is freely available online and open source. Although performance lags behind standard SMT systems on an in-domain test set, the results show that the RBMT system's coverage is competitive and that it outperforms the SMT system on an out-of-domain test set. This RBMT system is available to the general public, can be further enhanced, and opens up the possibility of creating future hybrid MT systems. Peer Reviewed. Postprint (author's final draft)

    Relating Dependent Terms in Information Retrieval

    Search engines have become an integral part of our lives. More than one-third of the world's population uses the Internet. Most users turn to a search engine as a quick way to find the information or products they want. Information retrieval (IR) is the foundation of modern search engines. Traditional IR approaches assume that indexing terms are independent. However, terms occurring in the same context are often dependent. Failing to recognize the dependencies between terms leads to noise (irrelevant documents) in the results. Some studies have proposed integrating term dependencies of different types, such as proximity, co-occurrence, adjacency and grammatical dependency. In most cases, dependency models are constructed separately and then combined with the traditional word-based (unigram) model at a fixed importance proportion. Consequently, they cannot properly capture variable term dependency and its strength. For example, the dependency between the adjacent words “black Friday” is more important to consider than that between “road constructions”. In this thesis, we study different approaches to capturing term relationships and their dependency strengths. We propose the following methods for monolingual IR and cross-language IR (CLIR). First, we re-examine the combination approach by using different indexing units for Chinese monolingual IR, then propose a similar method for CLIR between English and Chinese. In addition to the traditional method based on words, we investigate the possibility of using Chinese bigrams and unigrams as translation units. Several translation models from English words to Chinese unigrams, bigrams and words are built from a parallel corpus. An English query is then translated in several ways, each producing a ranking score. The final ranking score combines all these types of translation. Second, we incorporate dependencies between terms in our model using the Dempster-Shafer theory of evidence. Every occurrence of a multi-word text fragment in a document is represented as a set that includes all its constituent terms. Probability is assigned to such a set of terms instead of to individual terms. During query evaluation, the probability of the set can be transferred to the related query terms, allowing us to integrate language-dependent relations into IR. Third, we propose a discriminative language model that integrates different term dependencies according to their strength and usefulness to IR. We consider the dependency of adjacency and of co-occurrence within different distances, i.e. bigrams and pairs of terms within a text window of size 2, 4, 8 and 16. The weight of a bigram or a pair of dependent terms in the final model is learnt from a set of features using SVM regression. All the proposed methods are evaluated on several English and/or Chinese collections, and experimental results show that these methods achieve substantial improvements over state-of-the-art baselines.
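
The adjacency and windowed co-occurrence dependencies named in this abstract can be enumerated as in the sketch below. It covers only the extraction of candidate dependencies; the learned weighting of each bigram or pair is not shown. The function name `dependency_features`, the feature labels, and the choice to treat window pairs as unordered are all assumptions made for illustration.

```python
from collections import Counter

WINDOW_SIZES = (2, 4, 8, 16)  # window sizes named in the abstract


def dependency_features(tokens, window_sizes=WINDOW_SIZES):
    """Count adjacent bigrams and unordered term pairs co-occurring in each window."""
    features = Counter()
    for i, left in enumerate(tokens):
        # Ordered bigram: the token immediately following position i.
        if i + 1 < len(tokens):
            features[("bigram", left, tokens[i + 1])] += 1
        for size in window_sizes:
            # A window of `size` tokens starting at i covers positions i .. i+size-1.
            for right in tokens[i + 1 : i + size]:
                pair = tuple(sorted((left, right)))
                features[(f"window{size}",) + pair] += 1
    return features


feats = dependency_features(["black", "friday", "road", "constructions"])
print(feats[("bigram", "black", "friday")])
print(feats[("window4", "black", "road")])
```

Each resulting count could then serve as one input to whatever model assigns the dependency its weight.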