92 research outputs found
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
JRC.G.2-Global security and crisis management
Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization
In Automatic Text Summarization, preprocessing is an important phase to
reduce the space of textual representation. Classically, stemming and
lemmatization have been widely used for normalizing words. However, even using
normalization on large texts, the curse of dimensionality can disturb the
performance of summarizers. This paper describes a new method for normalization
of words to further reduce the space of representation. We propose to reduce
each word to its initial letters, as a form of Ultra-stemming. The results show
that Ultra-stemming not only preserves the content of summaries produced with this
representation, but can often dramatically improve the performance of the
systems. Summaries on trilingual corpora were evaluated automatically with
Fresa. Results confirm an increase in performance, regardless of the summarizer
system used.
Comment: 22 pages, 12 figures, 9 tables
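The normalization described in this abstract, reducing each word to its initial letters, can be illustrated with a minimal sketch. The function names, the tokenization rule, and the prefix length `n` are assumptions made for illustration; the abstract does not specify them.

```python
import re

def ultra_stem(word: str, n: int = 1) -> str:
    """Keep only the first n letters of the lowercased word."""
    return word.lower()[:n]

def ultra_stem_text(text: str, n: int = 1) -> list[str]:
    """Split text into alphabetic tokens and ultra-stem each one."""
    # [^\W\d_]+ matches runs of letters only (no digits, no underscores).
    tokens = re.findall(r"[^\W\d_]+", text)
    return [ultra_stem(token, n) for token in tokens]

# Hypothetical example: three related word forms collapse to one prefix.
stems = ultra_stem_text("Summarization summaries summarize", n=4)
vocabulary = set(stems)
```

With a short prefix, many distinct word forms map to the same symbol, which is how the representation space shrinks.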
Content Extraction based on Hierarchical Relations in DOM Structures
This article introduces a new approach for content
extraction that exploits the hierarchical inter-relations of the
elements in a webpage. Content extraction is a technique used
to extract from a webpage the main textual content. This is
useful in order to filter out the advertisements and all the
additional information that is not part of the main content. The
main idea behind our approach is to use the DOM tree as an
explicit representation of the inter-relations of the elements in a
webpage. Using the information contained in the DOM tree, we
can identify blocks of content and easily determine which
of the blocks contains more text. Thanks to this information, the
technique achieves considerable recall and precision. Using the
DOM structure for content extraction gives us the benefits of
other approaches based on the syntax of the webpage (such as
characters, words and tags), but it also gives us very precise
information regarding the related components in a block, thus
producing very cohesive blocks.
López Romero, S.; Silva Galiana, JF.; Insa Cabrera, D. (2012). Content Extraction based on Hierarchical Relations in DOM Structures. Research and Development in Computer Science and Engineering. 45:5-12. http://hdl.handle.net/10251/47738
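As a rough illustration of using a DOM tree to locate the block that holds most of the text, the sketch below builds a small tree with Python's standard `html.parser` and descends toward the child that contains the majority of the characters. The descent rule, the 50% `share` threshold, and the `Node`, `TreeBuilder`, and `main_block` names are assumptions for illustration, not the authors' algorithm.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}

class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children, self.own_text = [], 0

    def text_len(self):
        """Number of text characters in this node's whole subtree."""
        return self.own_text + sum(c.text_len() for c in self.children)

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        if tag not in VOID and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.own_text += len(data.strip())

def main_block(root, share=0.5):
    """Descend while a single child holds more than `share` of the text."""
    node = root
    while True:
        total = node.text_len()
        heavy = [c for c in node.children if total and c.text_len() > share * total]
        if not heavy:
            return node
        node = heavy[0]

html = ('<html><body><div id="nav"><a>Home</a><a>About</a></div>'
        '<div id="main"><p>First paragraph of the article body.</p>'
        '<p>Second paragraph of the article body.</p>'
        '<p>Third paragraph of the article body.</p></div>'
        '<div id="ad">Buy now</div></body></html>')
builder = TreeBuilder()
builder.feed(html)
block = main_block(builder.root)
```

The descent stops at the `div` whose text is spread across several paragraphs, so navigation links and the advertisement fall outside the selected block. This sketch assumes well-formed markup and does not skip `script` or `style` text.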
Identification of Central Points in Road Networks using Betweenness Centrality Combined with Traffic Demand
This paper aims to identify central points in road networks while taking traffic demand into account. This is done with a variation of betweenness centrality, in which the graph that corresponds to the road network is weighted according to the number of routes generated by the traffic demand. To test the proposed approach, three networks were created: the cities of Porto Alegre and Sioux Falls, and a regular 10 × 10 grid. Trips were then microscopically simulated, and the results were compared with the proposed method.
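The idea of scoring road-network nodes by the routes that traffic demand generates can be sketched in simplified form. The version below routes each origin-destination pair along one breadth-first shortest path and counts the trips passing through every intermediate node. It is a simplification under stated assumptions (unweighted edges, a single route per pair, hypothetical `graph` and `demand` data) and not the exact measure used in the paper.

```python
from collections import Counter, deque

def shortest_path(graph, src, dst):
    """Breadth-first search; returns one shortest path as a node list, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

def demand_centrality(graph, demand):
    """For each node, sum the trips whose route passes through it (endpoints excluded)."""
    score = Counter()
    for (src, dst), trips in demand.items():
        path = shortest_path(graph, src, dst)
        if path is None:
            continue  # unreachable pair contributes nothing
        for node in path[1:-1]:
            score[node] += trips
    return score

# Hypothetical star-shaped network: B is the only junction.
graph = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
# Hypothetical demand: number of trips per (origin, destination) pair.
demand = {("A", "C"): 10, ("C", "D"): 5, ("A", "B"): 3}
scores = demand_centrality(graph, demand)
```

Unlike plain betweenness centrality, which treats every node pair equally, this count lets heavily travelled origin-destination pairs contribute more to a node's score.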
CHATBOT FOR KNOWLEDGE-BASED MUSEUM RECOMMENDER SYSTEM (CASE STUDY: MUSEUM IN JAKARTA)
The recommender systems commonly used to recommend museums are content-based filtering and collaborative filtering. However, these systems suffer from problems such as cold start and data sparsity, because some museums still have few ratings and little feedback. To address these problems, a knowledge-based recommender system can be used to recommend museums based on user preferences, so that the system does not need ratings or feedback. User preferences can be obtained with a conversational recommender system that relies on two-way conversation between the user and the system. A chatbot is one commonly used form of conversational recommender system. This study develops a chatbot that recommends museums in Jakarta using a knowledge-based recommender system. The system uses the Rasa framework to build a chatbot capable of holding a conversation with the user. A knowledge graph and k-nearest neighbor are used to recommend museums based on user preferences. According to the evaluation carried out, the system can understand user messages and provide museum recommendations based on user preferences. However, its performance can still be improved so that the system can be relied on in real-world scenarios.
What is SemEval evaluating?: A Systematic Analysis of Evaluation Campaigns in NLP
SemEval is the primary venue in the NLP community for the proposal of new
challenges and for the systematic empirical evaluation of NLP systems. This
paper provides a systematic quantitative analysis of SemEval, aiming to reveal
the patterns of the contributions behind SemEval. By understanding the
distribution of task types, metrics, architectures, participation and citations
over time, we aim to answer the question of what is being evaluated by SemEval.
Comment: 12 pages, 6 figures
Cross-Lingual Zero Pronoun Resolution
In languages like Arabic, Chinese, Italian, Japanese, Korean, Portuguese, Spanish, and many others, predicate arguments in certain syntactic positions are not realized instead of being realized as overt pronouns, and are thus called zero- or null-pronouns. Identifying and resolving such omitted arguments is crucial to machine translation, information extraction and other NLP tasks, but depends heavily on semantic coherence and lexical relationships. We propose a BERT-based cross-lingual model for zero pronoun resolution, and evaluate it on the Arabic and Chinese portions of OntoNotes 5.0. As far as we know, ours is the first neural model of zero-pronoun resolution for Arabic; and our model also outperforms the state-of-the-art for Chinese. In the paper we also evaluate BERT feature extraction and fine-tune models on the task, and compare them with our model. We also report on an investigation of BERT layers indicating which layer encodes the most suitable representation for the task. Our code is available at https://github.com/amaloraini/cross-lingual-Z