    EveTAR: Building a Large-Scale Multi-Task Test Collection over Arabic Tweets

    This article introduces a new language-independent approach for creating a large-scale, high-quality test collection of tweets that supports multiple information retrieval (IR) tasks without running a shared-task campaign. The adopted approach (demonstrated over Arabic tweets) designs the collection around significant (i.e., popular) events, which enables the development of topics that represent frequent information needs of Twitter users for which rich content exists. That inherently facilitates the support of multiple tasks that generally revolve around events, namely event detection, ad-hoc search, timeline generation, and real-time summarization. The key highlights of the approach include diversifying the judgment pool via interactive search and multiple manually-crafted queries per topic, collecting high-quality annotations via crowd-workers for relevance and in-house annotators for novelty, filtering out low-agreement topics and inaccessible tweets, and providing multiple subsets of the collection for better availability. Applying our methodology to Arabic tweets resulted in EveTAR, the first freely-available tweet test collection for multiple IR tasks. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating existing algorithms in the respective tasks. Results indicate that the new collection can support reliable ranking of IR systems that is comparable to similar TREC collections, while providing strong baseline results for future studies over Arabic tweets.
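
The abstract reports agreement as a Kappa value but does not say which variant was used. As an illustration only, the sketch below computes Cohen's kappa for two annotators, which is an assumption on our part; crowd-sourced settings with more than two judges per item often use Fleiss' kappa instead. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy relevance judgments (hypothetical data, not taken from EveTAR).
annotator_1 = ["rel", "rel", "non", "non", "rel", "non", "rel", "non", "rel", "non"]
annotator_2 = ["rel", "rel", "non", "non", "rel", "non", "rel", "rel", "rel", "non"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

On the common Landis and Koch reading, values between 0.61 and 0.80 count as "substantial" agreement, which matches the wording used for the reported 0.71.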

    Construction of an ontology for intelligent Arabic QA systems leveraging the Conceptual Graphs representation

    The last decade has seen great interest in Arabic Natural Language Processing (NLP) applications. This interest reflects the importance of Arabic, the sixth most widespread language in the world, with more than 350 million native speakers. Currently, some basic Arabic language challenges related to high inflection and derivation, Part-of-Speech (PoS) tagging, and diacritical ambiguity of Arabic text have been addressed to a great extent. However, the development of high-level, intelligent applications such as Question Answering (QA) systems is still obstructed by the lack of ontologies and other semantic resources. In this paper, we present the construction of a new Arabic ontology leveraging the contents of Arabic WordNet (AWN) and Arabic VerbNet (AVN). This new resource combines the high lexical coverage and semantic relations between words existing in AWN with the formal representation of syntactic and semantic frames corresponding to verbs in AVN. The Conceptual Graphs representation was adopted in the framework of a multi-layer platform dedicated to the development of intelligent and multi-agent systems. The built ontology is used to represent key concepts in questions and documents for further semantic comparison. Experiments conducted in the context of the QA task show a promising coverage with respect to the processed questions and passages. The obtained results also highlight an improvement in the performance of Arabic QA regarding the c@1 measure. The work of the last author was carried out in the framework of the WIQ-EI IRSES project (Grant No. 269180) within the FP 7 Marie Curie, the DIANA APPLICATIONS - Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project, and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
    Abouenour, L.; Nasri, M.; Bouzoubaa, K.; Kabbaj, A.; Rosso, P. (2014). Construction of an ontology for intelligent Arabic QA systems leveraging the Conceptual Graphs representation. Journal of Intelligent and Fuzzy Systems. 27(6):2869-2881. https://doi.org/10.3233/IFS-141248
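
For readers unfamiliar with the c@1 measure mentioned in the abstract, the following is a minimal sketch of its usual definition: unanswered questions are credited at the system's accuracy over all questions, so abstaining is rewarded over answering wrongly. The function name and the example counts are hypothetical and do not come from the paper.

```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1 = (n_R + n_U * n_R / n) / n.

    n_R: questions answered correctly
    n_U: questions deliberately left unanswered
    n:   total number of questions
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_correct < 0 or n_unanswered < 0 or n_correct + n_unanswered > n_total:
        raise ValueError("inconsistent question counts")
    accuracy = n_correct / n_total
    return (n_correct + n_unanswered * accuracy) / n_total


# Hypothetical run: 100 questions, 50 answered correctly, 20 left unanswered.
score = c_at_1(n_correct=50, n_unanswered=20, n_total=100)
```

When no question is left unanswered, c@1 reduces to plain accuracy.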

    PersoNER: Persian named-entity recognition

    © 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent difficulty of training an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people worldwide. We first present and provide ArmanPersoNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.

    Arabic Cooperative Answer Generation via Wikipedia Article Infoboxes

    The typical question-answering system faces many challenges related to the processing of questions and information resources in the extraction and generation of adequate answers. These challenges increase when the requested answer is cooperative and its language is Arabic. In this paper, we propose an original approach to generate cooperative answers for user-definitional questions, designed to be integrated in a question-answering system. This approach is mainly based on the exploitation of semi-structured Web knowledge: it uses features derived from Wikipedia article infoboxes to generate cooperative answers. It is largely independent of any particular language, which allows it to be integrated in any definitional question-answering system. We have chosen to integrate and evaluate it in DefArabicQA, a definitional question-answering system for the Arabic language. The results showed that this system has a significant impact on the approach efficiency regarding the improvement of the quality of the answer. The work of the third author was partially funded by the Spanish Ministry of Economy, Industry and Competitiveness (MINECO) under the SomEMBED research project (TIN2015-71147-C2-1-P) and by the Generalitat Valenciana under the grant ALMAMATER (PrometeoII/2014/030).
    Trigui, O.; Belguith, L.; Rosso, P. (2017). Arabic Cooperative Answer Generation via Wikipedia Article Infoboxes. Research in Computing Science. 132:129-153. http://hdl.handle.net/10251/103731

    Learning Explicit and Implicit Arabic Discourse Relations.

    We propose in this paper a supervised learning approach to identify discourse relations in Arabic texts. To our knowledge, this work represents the first attempt to focus on both explicit and implicit relations that link adjacent as well as non-adjacent Elementary Discourse Units (EDUs) within the Segmented Discourse Representation Theory (SDRT). We use the Discourse Arabic Treebank corpus (D-ATB), which is composed of newspaper documents extracted from the syntactically annotated Arabic Treebank v3.2 part3, where each document is associated with a complete discourse graph according to the cognitive principles of SDRT. Our list of discourse relations is composed of a three-level hierarchy of 24 relations grouped into 4 top-level classes. To automatically learn them, we use state-of-the-art features whose efficiency has been empirically proved. We investigate how each feature contributes to the learning process. We report our experiments on identifying fine-grained discourse relations, mid-level classes and also top-level classes. We compare our approach with three baselines that are based on the most frequent relation, discourse connectives and the features used by Al-Saif and Markert (2011). Our results are very encouraging and outperform all the baselines with an F-score of 78.1% and an accuracy of 80.6%.

    Can human association norm evaluate latent semantic analysis?

    This paper compares a word association norm created by a psycholinguistic experiment with association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm with lists generated by the LSA algorithm. An argument is presented on how those automatically generated lists reflect real semantic relations.
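
To make the Church and Hanks side of the comparison concrete, here is a simplified sketch of their association ratio, a pointwise mutual information score, log2(P(x, y) / (P(x) P(y))), estimated from co-occurrence counts within a forward window. The window size, the function names, and the toy token list are assumptions for illustration; the sketch omits refinements such as count thresholds and is not the paper's implementation.

```python
import math
from collections import Counter


def association_ratios(tokens, window=5):
    """Association ratio for every ordered pair (x, y) where y occurs
    within `window` tokens after x.

    Probabilities are estimated as counts divided by the corpus size n,
    so the score is log2(f(x, y) * n / (f(x) * f(y))).
    """
    n = len(tokens)
    unigram = Counter(tokens)
    pair = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            pair[(x, y)] += 1
    return {
        (x, y): math.log2(f_xy * n / (unigram[x] * unigram[y]))
        for (x, y), f_xy in pair.items()
    }


def top_associates(tokens, cue, k=3, window=5):
    """The k words most strongly associated with `cue`, best first."""
    scores = association_ratios(tokens, window)
    ranked = sorted(
        ((y, s) for (x, y), s in scores.items() if x == cue),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:k]


# Toy corpus (hypothetical); a real experiment would use a large text corpus.
toy_tokens = "a b a b c d".split()
toy_scores = association_ratios(toy_tokens, window=1)
```

A ranked list such as `top_associates(tokens, "doctor")` is the kind of automatically generated association list that could then be compared against responses from a human association norm.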

    Splitting Arabic Texts into Elementary Discourse Units

    In this article, we propose the first work that investigates the feasibility of Arabic discourse segmentation into elementary discourse units within the segmented discourse representation theory framework. We first describe our annotation scheme, which defines a set of principles to guide the segmentation process. Two corpora have been annotated according to this scheme: elementary school textbooks and newspaper documents extracted from the syntactically annotated Arabic Treebank. Then, we propose a multiclass supervised learning approach that predicts nested units. Our approach uses a combination of punctuation, morphological, lexical, and shallow syntactic features. We investigate how each feature contributes to the learning process. We show that an extensive morphological analysis is crucial to achieve good results in both corpora. In addition, we show that adding chunks does not boost the performance of our system.