92 research outputs found

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on its parallel training data, and phrases that are not present in the training data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translation, and the results showed improved automatic scores (BLEU and Meteor) and fewer out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model. JRC.G.2-Global security and crisis managemen

    Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

    In Automatic Text Summarization, preprocessing is an important phase that reduces the space of textual representation. Classically, stemming and lemmatization have been widely used for normalizing words. However, even with normalization, the curse of dimensionality can degrade the performance of summarizers on large texts. This paper describes a new method for normalizing words that further reduces the space of representation. We propose to reduce each word to its initial letters, as a form of Ultra-stemming. The results show that Ultra-stemming not only preserves the content of summaries produced with this representation, but can often dramatically improve the performance of the systems. Summaries on trilingual corpora were evaluated automatically with Fresa. Results confirm an increase in performance, regardless of the summarizer system used. Comment: 22 pages, 12 figures, 9 table
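    The idea of reducing each word to its initial letters can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the prefix length parameter `n`, and the naive whitespace tokenizer are all assumptions made for the example.

```python
def ultra_stem(word: str, n: int = 1) -> str:
    """Reduce a word to its first n letters, lowercased.

    n is a hypothetical parameter; n=1 keeps only the initial letter.
    """
    return word.lower()[:n]


def ultra_stem_text(text: str, n: int = 1) -> list[str]:
    """Ultra-stem every alphabetic token of a text.

    Tokenization here is a naive whitespace split with surrounding
    punctuation stripped; a real summarizer would use a proper tokenizer.
    """
    tokens = [t.strip(".,;:!?()\"'") for t in text.split()]
    return [ultra_stem(t, n) for t in tokens if t.isalpha()]


if __name__ == "__main__":
    # Different surface forms collapse onto the same short prefix,
    # which shrinks the vocabulary the summarizer has to represent.
    print(ultra_stem_text("Stemming and stemmed words.", n=4))
```

    Shorter prefixes shrink the vocabulary more aggressively, at the cost of merging unrelated words that happen to share their first letters.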

    Content Extraction based on Hierarchical Relations in DOM Structures

    This article introduces a new approach to content extraction that exploits the hierarchical inter-relations of the elements in a webpage. Content extraction is a technique for extracting the main textual content from a webpage. It is useful for filtering out advertisements and any additional information that is not part of the main content. The main idea behind our approach is to use the DOM tree as an explicit representation of the inter-relations of the elements in a webpage. Using the information contained in the DOM tree, we can identify blocks of content and easily determine which of the blocks contains the most text. Thanks to this information, the technique achieves considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words and tags), but it also gives us very precise information about the related components in a block, thus producing very cohesive blocks. López Romero, S.; Silva Galiana, JF.; Insa Cabrera, D. (2012). Content Extraction based on Hierarchical Relations in DOM Structures. Research and Development in Computer Science and Engineering. 45:5-12. http://hdl.handle.net/10251/47738S5124
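    A rough sketch of the "which block contains the most text" step, using only Python's standard `html.parser`. It is a crude heuristic written for illustration and is not the algorithm from the article: the set of tags treated as blocks, the rule that text is credited to its nearest enclosing block, and all names are assumptions.

```python
from html.parser import HTMLParser

# Assumed for this sketch: which tags count as content blocks.
BLOCK_TAGS = {"body", "div", "article", "section", "main", "td"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIP_TAGS = {"script", "style"}


class BlockTextCounter(HTMLParser):
    """Credits every text node to its nearest enclosing block element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as (tag, block index or None)
        self.blocks = []  # one record per block element, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        block_id = None
        if tag in BLOCK_TAGS:
            block_id = len(self.blocks)
            self.blocks.append({"tag": tag, "attrs": dict(attrs), "text": []})
        self.stack.append((tag, block_id))

    def handle_endtag(self, tag):
        # Pop up to and including the nearest matching open tag, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in SKIP_TAGS for t, _ in self.stack):
            return
        for _, block_id in reversed(self.stack):
            if block_id is not None:
                self.blocks[block_id]["text"].append(text)
                break


def main_block(html):
    """Return (attributes, text) of the block holding the most text."""
    parser = BlockTextCounter()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return None
    best = max(parser.blocks, key=lambda b: sum(len(t) for t in b["text"]))
    return best["attrs"], " ".join(best["text"])
```

    For a page with a navigation `div`, an article `div`, and an advertisement `div`, `main_block` returns the article block, because the paragraphs inside it contribute the most characters.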

    Identification of Central Points in Road Networks using Betweenness Centrality Combined with Traffic Demand

    This paper aims to identify central points in road networks while taking traffic demand into account. It does so with a variation of betweenness centrality in which the graph corresponding to the road network is weighted according to the number of routes generated by the traffic demand. To test the proposed approach, three networks were created: the cities of Porto Alegre and Sioux Falls, and a regular 10 × 10 grid. Trips were then microscopically simulated, and the simulation results were compared with those of the proposed method.
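    One way to read "betweenness weighted by traffic demand" is to count, for every node, how many demanded trips route through it. The sketch below follows that reading under stated assumptions: each origin-destination pair uses a single shortest path, edge weights are travel costs, and the function and variable names are invented for the example. The paper's exact formulation may differ.

```python
import heapq
from collections import defaultdict


def shortest_path(graph, source, target):
    """Dijkstra on a dict-of-dicts graph; returns the node list or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def demand_centrality(graph, demand):
    """Sum, per intermediate node, the trips whose route passes through it.

    demand maps (origin, destination) pairs to a number of trips.
    """
    score = defaultdict(float)
    for (origin, dest), trips in demand.items():
        path = shortest_path(graph, origin, dest)
        if path is None:
            continue  # unreachable pair contributes nothing
        for node in path[1:-1]:  # endpoints are not "between" anything
            score[node] += trips
    return dict(score)
```

    Unlike classic betweenness, which treats every node pair equally, this score is driven by where trips actually start and end, so a junction on a heavily demanded corridor ranks above a topologically central but rarely used one.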

    CHATBOT FOR KNOWLEDGE-BASED MUSEUM RECOMMENDER SYSTEM (CASE STUDY: MUSEUM IN JAKARTA)

    The recommender systems commonly used to recommend museums are based on content-based filtering and collaborative filtering. However, these recommender systems suffer from problems such as cold start and data sparsity, because some museums still have low ratings and little feedback. To address these problems, a knowledge-based recommender system can be used to recommend museums based on user preferences, so that the system does not need ratings or feedback. User preferences can be obtained with a conversational recommender system that makes use of two-way conversation between the user and the system. A chatbot is one commonly used form of conversational recommender system. This study develops a chatbot that recommends museums in Jakarta using a knowledge-based recommender system. The system uses the Rasa framework to build a chatbot that is able to hold a conversation with users. A knowledge graph and k-nearest neighbor are used to recommend museums based on user preferences. According to the evaluation carried out, the system can understand user messages and provide museum recommendations based on user preferences. However, the system's performance can still be improved so that it can be relied on in real-world scenarios.
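    The k-nearest-neighbor step of such a recommender can be sketched as a similarity ranking between a user preference vector and per-museum attribute vectors. Everything below is hypothetical: the attribute encoding, the cosine similarity measure, the placeholder museum names, and the function names are assumptions, and the study's knowledge graph and Rasa integration are not modelled.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(items, preference, k=2):
    """Return the k item names whose vectors are closest to the preference."""
    ranked = sorted(
        items,
        key=lambda name: cosine(items[name], preference),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Placeholder data: columns might stand for (history, art, child-friendly).
    museums = {
        "museum_a": [1, 0, 1],
        "museum_b": [0, 1, 0],
        "museum_c": [1, 1, 0],
    }
    print(recommend(museums, preference=[1, 0, 1], k=2))
```

    Because the ranking depends only on stated preferences and item attributes, it needs no ratings history, which is the property that sidesteps the cold-start problem described above.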

    What is SemEval evaluating?: A Systematic Analysis of Evaluation Campaigns in NLP

    SemEval is the primary venue in the NLP community for the proposal of new challenges and for the systematic empirical evaluation of NLP systems. This paper provides a systematic quantitative analysis of SemEval, aiming to reveal the patterns of the contributions behind it. By understanding the distribution of task types, metrics, architectures, participation and citations over time, we aim to answer the question of what is being evaluated by SemEval. Comment: 12 pages, 6 figure

    Cross-Lingual Zero Pronoun Resolution

    In languages like Arabic, Chinese, Italian, Japanese, Korean, Portuguese, Spanish, and many others, predicate arguments in certain syntactic positions are not realized instead of being realized as overt pronouns, and are thus called zero- or null-pronouns. Identifying and resolving such omitted arguments is crucial to machine translation, information extraction and other NLP tasks, but depends heavily on semantic coherence and lexical relationships. We propose a BERT-based cross-lingual model for zero pronoun resolution, and evaluate it on the Arabic and Chinese portions of OntoNotes 5.0. As far as we know, ours is the first neural model of zero-pronoun resolution for Arabic; and our model also outperforms the state-of-the-art for Chinese. In the paper we also evaluate BERT feature extraction and fine-tune models on the task, and compare them with our model. We also report on an investigation of BERT layers indicating which layer encodes the most suitable representation for the task. Our code is available at https://github.com/amaloraini/cross-lingual-Z