
    LTRo: Learning to Route Queries in Clustered P2P IR

    Query routing is a critical step in P2P Information Retrieval. In this paper, we consider learning-to-rank approaches for query routing in the clustered P2P IR architecture. Our formulation, LTRo, scores resources based on the number of relevant documents for each training query, and uses that information to build a model that then ranks promising peers for a new query. Our empirical analysis over a variety of P2P IR testbeds illustrates the superiority of our method over state-of-the-art methods for query routing.
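
    The abstract does not spell out LTRo's features or learner, so the following is only a minimal pointwise learning-to-rank sketch of the general idea: each (query, peer) pair gets a feature vector, the label is the number of relevant documents the peer holds for the training query, and a regressor trained on past queries ranks peers for new ones. All names, features and data below are illustrative assumptions, not the paper's method.

    ```python
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def peer_features(query_terms, peer):
        """Hypothetical features for a (query, peer) pair: query-term overlap
        with the peer's vocabulary, and log of its collection size."""
        overlap = len(query_terms & peer["vocab"]) / max(len(query_terms), 1)
        return [overlap, np.log1p(peer["num_docs"])]

    # Toy peers and training queries (all data illustrative).
    peers = [
        {"name": "p1", "vocab": {"p2p", "routing", "ir"}, "num_docs": 5000},
        {"name": "p2", "vocab": {"twitter", "graphs"},    "num_docs": 800},
    ]
    train = [  # (query terms, peer, number of relevant docs that peer holds)
        ({"p2p", "routing"}, peers[0], 40),
        ({"p2p", "routing"}, peers[1], 1),
        ({"twitter"},        peers[0], 0),
        ({"twitter"},        peers[1], 25),
    ]

    X = np.array([peer_features(q, p) for q, p, _ in train])
    y = np.array([rel for _, _, rel in train])
    model = GradientBoostingRegressor().fit(X, y)

    def route(query_terms, k=2):
        """Rank peers for a new query by predicted relevant-document count."""
        scores = model.predict(np.array([peer_features(query_terms, p) for p in peers]))
        return sorted(zip((p["name"] for p in peers), scores), key=lambda t: -t[1])[:k]

    print(route({"p2p", "ir"}))
    ```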

    Influential users in Twitter: detection and evolution analysis

    In this paper, we study how to detect the most influential users in the microblogging platform Twitter and how they evolve over time. To this aim, we consider the Dynamic Retweet Graph (DRG) proposed in Amati et al. (2016) and partially analyzed in Amati et al. (IADIS Int J Comput Sci Inform Syst, 11(2), 2016) and Amati et al. (2016). The model of the evolution of the Twitter social network is based here on the retweet relationship. In a DRG, once a tweet has been retweeted for the last time, we delete all the edges representing that tweet; in this way we model the decay of tweet life on the social platform. To detect influential users, we consider the central nodes in the network with respect to the following centrality measures: degree, closeness, betweenness and PageRank centrality. These measures have been widely studied in the static case, and we analyze them on the sequence of DRG temporal graphs, with special regard to the distribution of the 75% most central nodes. We derive the following results: (a) in all cases, the closeness measure yields many nodes with high centrality, so it is of little use for detecting influential users; (b) for all other measures, almost all nodes have null or very low centrality; (c) the number of vertices with significant centrality is often the same across measures; (d) the above observations also hold for the cumulative retweet graph; and (e) central nodes in the sequence of DRG temporal graphs also have high centrality in the cumulative graph.
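
    The DRG construction itself (deleting a tweet's edges after its last retweet) is not reproduced here, but a small sketch shows how the four centrality measures can be computed on a single retweet-graph snapshot; the toy graph and the edge orientation are assumptions.

    ```python
    import networkx as nx

    # Toy retweet graph: an edge u -> v means user u retweeted user v
    # (the exact edge orientation used in the paper's DRG is an assumption).
    G = nx.DiGraph([("a", "b"), ("c", "b"), ("d", "b"), ("b", "e")])

    centralities = {
        "degree":      nx.in_degree_centrality(G),
        "closeness":   nx.closeness_centrality(G),
        "betweenness": nx.betweenness_centrality(G),
        "pagerank":    nx.pagerank(G),
    }

    for name, scores in centralities.items():
        top = max(scores, key=scores.get)
        print(f"{name:12s} top node: {top} ({scores[top]:.3f})")
    ```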

    Fisher's exact test explains a popular metric in information retrieval

    Term frequency-inverse document frequency, or tf-idf for short, is a numerical measure that is widely used in information retrieval to quantify the importance of a term of interest in one out of many documents. While tf-idf was originally proposed as a heuristic, much work has been devoted over the years to placing it on a solid theoretical foundation. Following in this tradition, we here advance the first justification for tf-idf that is grounded in statistical hypothesis testing. More precisely, we first show that the one-tailed version of Fisher's exact test, also known as the hypergeometric test, corresponds well with a common tf-idf variant on selected real-data information retrieval tasks. We then set forth a mathematical argument that suggests the tf-idf variant approximates the negative logarithm of the one-tailed Fisher's exact test P-value (i.e., a hypergeometric distribution tail probability). The Fisher's exact test interpretation of this common tf-idf variant furnishes the working statistician with a ready explanation of tf-idf's long-established effectiveness.
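
    As a rough illustration of the correspondence, the sketch below computes the one-tailed Fisher's exact (hypergeometric) P-value for a term's within-document count alongside a simple tf-idf-style score on the same counts. The numbers and the particular tf-idf form are assumptions; the paper's exact variant and normalization are not given in the abstract.

    ```python
    import math
    from scipy.stats import hypergeom

    # Hypothetical corpus statistics (illustrative numbers, not from the paper):
    N_total = 1_000_000   # total tokens in the corpus
    K_term  = 500         # occurrences of the term across the corpus
    n_doc   = 1_000       # tokens in the document of interest
    k_in    = 12          # occurrences of the term in that document

    # One-tailed Fisher's exact test (hypergeometric tail probability):
    # P(X >= k_in) when drawing n_doc tokens from a corpus containing
    # K_term occurrences of the term.
    p_value = hypergeom.sf(k_in - 1, N_total, K_term, n_doc)

    # A simple tf-idf-style score on the same counts (one plausible variant;
    # the paper's variant may differ in weighting and normalization).
    tf_idf = k_in * math.log(N_total / K_term)

    print(f"-log(p) = {-math.log(p_value):.2f}")
    print(f"tf-idf  = {tf_idf:.2f}")
    ```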

    Combining compound and single terms under language model framework

    Most existing Information Retrieval models, including probabilistic and vector space models, are based on the term independence hypothesis. To go beyond this assumption, and thereby capture the semantics of documents and queries more accurately, several works have incorporated phrases or other syntactic information in IR; such attempts have shown slight benefit at best. In language modeling approaches in particular, this extension is achieved through the use of bigram or n-gram models. However, in these models all bigrams/n-grams are considered and weighted uniformly. In this paper we introduce a new approach to select and weight the relevant n-grams associated with a document. Experimental results on three TREC test collections show an improvement over three strong state-of-the-art baselines: the original unigram language model, the Markov Random Field model, and the positional language model.
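
    For context, the kind of baseline this work extends can be sketched as a query-likelihood scorer that interpolates bigram and unigram document-model estimates. The paper's contribution, selecting and weighting only the relevant n-grams, is not reproduced here, and the smoothing constants below are assumptions.

    ```python
    import math
    from collections import Counter

    def bigrams(tokens):
        return list(zip(tokens, tokens[1:]))

    def score(query, doc, lam=0.5, mu=0.7):
        """Query log-likelihood under a document language model that linearly
        interpolates bigram and unigram estimates (a standard baseline; the
        paper's selective n-gram weighting is not reproduced here)."""
        uni, bi = Counter(doc), Counter(bigrams(doc))
        n = len(doc)
        logp = 0.0
        for i, term in enumerate(query):
            p_uni = uni[term] / n
            if i == 0:
                p = p_uni
            else:
                prev = query[i - 1]
                p_bi = bi[(prev, term)] / max(uni[prev], 1)
                p = mu * p_bi + (1 - mu) * p_uni  # bigram backed off to unigram
            logp += math.log(lam * p + (1 - lam) * 1e-4)  # crude collection smoothing
        return logp

    doc = "language model for information retrieval with compound terms".split()
    print(score("information retrieval".split(), doc))
    ```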

    Question answering systems for health professionals at the point of care -- a systematic review

    Objective: Question answering (QA) systems have the potential to improve the quality of clinical care by providing health professionals with the latest and most relevant evidence. However, QA systems have not been widely adopted. This systematic review aims to characterize current medical QA systems, assess their suitability for healthcare, and identify areas of improvement. Materials and methods: We searched PubMed, IEEE Xplore, ACM Digital Library, ACL Anthology and forward and backward citations on 7th February 2023. We included peer-reviewed journal and conference papers describing the design and evaluation of biomedical QA systems. Two reviewers screened titles, abstracts, and full-text articles. We conducted a narrative synthesis and risk of bias assessment for each study. We assessed the utility of biomedical QA systems. Results: We included 79 studies and identified themes, including question realism, answer reliability, answer utility, clinical specialism, systems, usability, and evaluation methods. Clinicians' questions used to train and evaluate QA systems were restricted to certain sources, types and complexity levels. No system communicated confidence levels in the answers or sources. Many studies suffered from high risks of bias and applicability concerns. Only 8 studies completely satisfied any criterion for clinical utility, and only 7 reported user evaluations. Most systems were built with limited input from clinicians. Discussion: While machine learning methods have led to increased accuracy, most studies imperfectly reflected real-world healthcare information needs. Key research priorities include developing more realistic healthcare QA datasets and considering the reliability of answer sources, rather than merely focusing on accuracy.

    Sidra5: a search system with geographic signatures

    Master's thesis in Informatics Engineering, presented to the Universidade de Lisboa through the Faculdade de CiĂȘncias, 2007. This dissertation presents the development of a geographic information search system that implements geographic signatures, a novel approach to modeling the geographic information present in documents. The goal of the project was to determine whether the information with geographic semantics present in documents, captured as geographic signatures, contributes to the improvement of search results. Several strategies for computing the similarity between the geographic signatures of queries and documents are proposed and evaluated experimentally. The obtained results show that, in some circumstances, geographic signatures can indeed improve the search quality of geographic queries.
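
    The dissertation's similarity strategies are not detailed in the abstract; as one plausible instance, a geographic signature can be modeled as a mapping from place names to weights and compared with cosine similarity, as in the hypothetical sketch below.

    ```python
    import math

    def cosine(sig_q, sig_d):
        """Cosine similarity between two geographic signatures, each modeled
        as a mapping from place name to weight (one plausible strategy; the
        dissertation evaluates several)."""
        dot = sum(w * sig_d.get(place, 0.0) for place, w in sig_q.items())
        nq = math.sqrt(sum(w * w for w in sig_q.values()))
        nd = math.sqrt(sum(w * w for w in sig_d.values()))
        return dot / (nq * nd) if nq and nd else 0.0

    query_sig = {"Lisboa": 1.0}
    doc_sig   = {"Lisboa": 0.8, "Porto": 0.2}
    print(cosine(query_sig, doc_sig))
    ```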

    Smart Search Engine For Information Retrieval

    This project addresses a core research problem in information retrieval and semantic search. It proposes smart search theory, a new theory based on the hypothesis that the semantic meaning of a document can be described by a set of keywords. Two experiments designed and carried out in this project provide positive evidence supporting the theory. In the proposed theory, smart search aims to determine, for any web document, a set of keywords by which the semantic meaning of the document can be uniquely identified. At the same time, the set of keywords is expected to be small enough to be easily managed. This is the fundamental assumption for creating the smart semantic search engine. The project discusses the rationale for this assumption and the theory built on it, as well as how the theory can be applied to keyword allocation and to the data model to be generated. The design of the smart search engine is then proposed, as a solution to the efficiency problem of searching the huge and growing amount of information published on the web. To achieve high efficiency in web search, statistical methods prove effective and can be interpreted at the semantic level. Based on the frequency of co-occurring (joint) keywords, a keyword list can be generated and its entries linked to one another to form a meaning structure. A data model is built once a proper keyword list is obtained, and the model is applied to the design of the smart search engine.
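
    As a hypothetical illustration of the joint-keyword idea, the sketch below counts keyword co-occurrences across documents and links frequently co-occurring keywords into a simple adjacency structure; the data and threshold are assumptions.

    ```python
    from collections import Counter
    from itertools import combinations

    # Toy keyword sets extracted from documents (illustrative data).
    doc_keywords = [
        {"search", "engine", "semantics"},
        {"search", "semantics", "keywords"},
        {"engine", "keywords"},
    ]

    # Count joint (co-occurring) keyword pairs across documents.
    joint = Counter()
    for kws in doc_keywords:
        joint.update(combinations(sorted(kws), 2))

    # Link keywords whose co-occurrence frequency passes a threshold,
    # forming a simple "meaning structure" as an adjacency list.
    threshold = 2
    structure = {}
    for (a, b), freq in joint.items():
        if freq >= threshold:
            structure.setdefault(a, set()).add(b)
            structure.setdefault(b, set()).add(a)

    print(structure)  # e.g., {'search': {'semantics'}, 'semantics': {'search'}}
    ```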

    Biomedical Question Answering: A Survey of Approaches and Challenges

    Automatic Question Answering (QA) has been successfully applied in various domains such as search engines and chatbots. Biomedical QA (BQA), as an emerging QA task, enables innovative applications to effectively perceive, access and understand complex biomedical knowledge. There have been tremendous developments in BQA over the past two decades, which we classify into five distinct approaches: classic, information retrieval, machine reading comprehension, knowledge base and question entailment approaches. In this survey, we introduce the available datasets and representative methods of each BQA approach in detail. Despite these developments, BQA systems are still immature and rarely used in real-life settings. We identify and characterize several key challenges in BQA that might lead to this issue, and discuss some potential future directions to explore.