160 research outputs found

    T2{}^2K2{}^2: The Twitter Top-K Keywords Benchmark

    Full text link
    Information retrieval from textual data focuses on the construction of vocabularies that contain weighted term tuples. Such vocabularies can then be exploited by various text analysis algorithms to extract new knowledge, e.g., top-k keywords, top-k documents, etc. Top-k keywords are casually used for various purposes, are often computed on-the-fly, and thus must be efficiently computed. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present a top-k keywords benchmark, T2{}^2K2{}^2, which features a real tweet dataset and queries with various complexities and selectivities. T2{}^2K2{}^2 helps evaluate weighting schemes and database implementations in terms of computing performance. To illustrate T2{}^2K2{}^2's relevance and genericity, we successfully performed tests on the TF-IDF and Okapi BM25 weighting schemes, on one hand, and on different relational (Oracle, PostgreSQL) and document-oriented (MongoDB) database implementations, on the other hand

    The scholarly impact of TRECVid (2003-2009)

    Get PDF
    This paper reports on an investigation into the scholarly impact of the TRECVid (TREC Video Retrieval Evaluation) benchmarking conferences between 2003 and 2009. The contribution of TRECVid to research in video retrieval is assessed by analyzing publication content to show the development of techniques and approaches over time and by analyzing publication impact through publication numbers and citation analysis. Popular conference and journal venues for TRECVid publications are identified in terms of number of citations received. For a selection of participants at different career stages, the relative importance of TRECVid publications in terms of citations vis a vis their other publications is investigated. TRECVid, as an evaluation conference, provides data on which research teams ‘scored’ highly against the evaluation criteria and the relationship between ‘top scoring’ teams at TRECVid and the ‘top scoring’ papers in terms of citations is analysed. A strong relationship was found between ‘success’ at TRECVid and ‘success’ at citations both for high scoring and low scoring teams. The implications of the study in terms of the value of TRECVid as a research activity, and the value of bibliometric analysis as a research evaluation tool, are discussed

    Automatic identification methods on a corpus of twenty five fine-grained Arabic dialects

    Get PDF
    International audienceThis research deals with Arabic dialect identification, a challenging issue related to Arabic NLP. Indeed, the increasing use of Arabic dialects in a written form especially in social media generates new needs in the area of Arabic dialect processing. For discriminating between dialects in a multi-dialect context, we use different approaches based on machine learning techniques. To this end, we explored several methods. We used a classification method based on symmetric Kullback-Leibler, and we experimented classical classification methods such as Naive Bayes Classifiers and more sophisticated methods like Word2Vec and Long Short-Term Memory neural network. We tested our approaches on a large database of 25 Arabic dialects in addition to MSA

    Evaluation of a Bayesian inference network for ligand-based virtual screening

    Get PDF
    Background Bayesian inference networks enable the computation of the probability that an event will occur. They have been used previously to rank textual documents in order of decreasing relevance to a user-defined query. Here, we modify the approach to enable a Bayesian inference network to be used for chemical similarity searching, where a database is ranked in order of decreasing probability of bioactivity. Results Bayesian inference networks were implemented using two different types of network and four different types of belief function. Experiments with the MDDR and WOMBAT databases show that a Bayesian inference network can be used to provide effective ligand-based screening, especially when the active molecules being sought have a high degree of structural homogeneity; in such cases, the network substantially out-performs a conventional, Tanimoto-based similarity searching system. However, the effectiveness of the network is much less when structurally heterogeneous sets of actives are being sought. Conclusion A Bayesian inference network provides an interesting alternative to existing tools for ligand-based virtual screening

    An Arabic Corpus of Fake News: Collection, Analysis and Classification

    Get PDF
    International audienceOver the last years, with the explosive growth of social media, huge amounts of rumors have been rapidly spread on the internet. Indeed, the proliferation of malicious misinformation and nasty rumors in social media can have harmful effects on individuals and society. In this paper, we investigate the content of the fake news in the Arabic world through the information posted on YouTube. Our contribution is threefold. First, we introduce a novel Arab corpus for the task of fake news analysis, covering the topics most concerned by rumors. We describe the corpus and the data collection process in detail. Second, we present several exploratory analysis on the harvested data in order to retrieve some useful knowledge about the transmission of rumors for the studied topics. Third, we test the possibility of discrimination between rumor and no rumor comments using three machine learning classifiers namely, Support Vector Machine (SVM), Decision Tree (DT) and Multinomial Naïve Bayes (MNB)

    Bias-variance analysis in estimating true query model for information retrieval

    Get PDF
    The estimation of query model is an important task in language modeling (LM) approaches to information retrieval (IR). The ideal estimation is expected to be not only effective in terms of high mean retrieval performance over all queries, but also stable in terms of low variance of retrieval performance across different queries. In practice, however, improving effectiveness can sacrifice stability, and vice versa. In this paper, we propose to study this tradeoff from a new perspective, i.e., the bias-variance tradeoff, which is a fundamental theory in statistics. We formulate the notion of bias-variance regarding retrieval performance and estimation quality of query models. We then investigate several estimated query models, by analyzing when and why the bias-variance tradeoff will occur, and how the bias and variance can be reduced simultaneously. A series of experiments on four TREC collections have been conducted to systematically evaluate our bias-variance analysis. Our approach and results will potentially form an analysis framework and a novel evaluation strategy for query language modeling

    Automatic summarisation: 25 years On

    Get PDF
    This is an accepted manuscript of an article published by Cambridge University Press (CUP) in Natural Language Engineering on 19/09/2019, available online: https://doi.org/10.1017/S1351324919000524 The accepted version of the publication may differ from the final published version.Automatic text summarisation is a topic that has been receiving attention from the research community from the early days of computational linguistics, but it really took off around 25 years ago. This article presents the main developments from the last 25 years. It starts by defining what a summary is and how its definition changed over time as a result of the interest in processing new types of documents. The article continues with a brief history of the field and highlights the main challenges posed by the evaluation of summaries. The article finishes with some thoughts about the future of the field.Published onlin
    corecore