LTRo: Learning to Route Queries in Clustered P2P IR
Query routing is a critical step in P2P information retrieval. In this paper, we consider learning-to-rank approaches for query routing in the clustered P2P IR architecture. Our formulation, LTRo, scores resources based on the number of relevant documents for each training query, and uses that information to build a model that then ranks promising peers for a new query. Our empirical analysis over a variety of P2P IR testbeds illustrates the superiority of our method over state-of-the-art query routing methods.
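The abstract describes a pointwise setup: train on (query, peer) pairs labeled with relevant-document counts, then rank peers for new queries by predicted score. A minimal sketch of that idea follows; the features, the tiny gradient-descent fit, and all names are illustrative assumptions, not the paper's actual LTRo model.

```python
# Hypothetical sketch of pointwise learning-to-rank query routing:
# each peer is scored by a model trained on (query, peer) features,
# with the label being the number of relevant documents the peer
# holds for that training query.

def peer_features(query_terms, peer_term_stats, peer_size):
    """Illustrative features: query-term overlap and collection size."""
    overlap = sum(peer_term_stats.get(t, 0) for t in query_terms)
    return [overlap, peer_size]

def train_pointwise(samples):
    """Fit w for score = w . x by stochastic gradient descent on a
    squared-error loss. samples: list of (features, relevant_doc_count).
    A real system would use an off-the-shelf LTR library instead."""
    w = [0.0, 0.0]
    for _ in range(200):
        for x, y in samples:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            w = [wi - 0.01 * err * xi for wi, xi in zip(w, x)]
    return w

def route(query_terms, peers, w):
    """Rank peers by predicted number of relevant documents."""
    scored = []
    for name, (stats, size) in peers.items():
        x = peer_features(query_terms, stats, size)
        scored.append((sum(wi * xi for wi, xi in zip(w, x)), name))
    return [name for _, name in sorted(scored, reverse=True)]

# Toy demo: training labels roughly equal the term overlap.
peers = {"A": ({"ir": 2, "p2p": 1}, 1.0), "B": ({"web": 3}, 1.0)}
w = train_pointwise([([3, 1], 3), ([1, 1], 1), ([0, 1], 0)])
print(route(["ir", "p2p"], peers, w))  # peer A holds more matching terms
```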
Influential users in Twitter: detection and evolution analysis
In this paper, we study how to detect the most influential users in the microblogging platform Twitter and how they evolve over time. To this aim, we consider the Dynamic Retweet Graph (DRG) proposed in Amati et al. (2016) and partially analyzed in Amati et al. (IADIS Int J Comput Sci Inform Syst, 11(2), 2016) and Amati et al. (2016). The model of the evolution of the Twitter social network is based here on the retweet relationship. In a DRG, once a tweet has been retweeted for the last time, we delete all the edges representing that tweet; in this way we model the decay of a tweet's life in the platform. To detect influential users, we consider the central nodes in the network with respect to the following centrality measures: degree, closeness, betweenness, and PageRank centrality. These measures have been widely studied in the static case, and we analyze them on the sequence of DRG temporal graphs, with special regard to the distribution of the 75% most central nodes. We derive the following results: (a) in all cases, the closeness measure yields many nodes with high centrality, so it is of little use for detecting influential users; (b) for all other measures, almost all nodes have null or very low centrality; (c) the vertices with significant centrality are often the same; (d) the above observations also hold for the cumulative retweet graph; and (e) central nodes in the sequence of DRG temporal graphs also have high centrality in the cumulative graph.
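Two of the centrality measures named above can be sketched on a single retweet-graph snapshot with only the standard library; the edge list is made up for illustration and is not the paper's data, and a real analysis would run this per temporal snapshot.

```python
# Minimal stdlib sketch (illustrative data): degree centrality and
# PageRank on one directed retweet-graph snapshot, where an edge a -> b
# means user a retweeted user b.
from collections import defaultdict

edges = [("u1", "hub"), ("u2", "hub"), ("u3", "hub"), ("u2", "u3"), ("u4", "u1")]
nodes = sorted({n for e in edges for n in e})

# Degree centrality: (in + out) degree, normalized by n - 1.
deg = defaultdict(int)
for a, b in edges:
    deg[a] += 1
    deg[b] += 1
degree_centrality = {n: deg[n] / (len(nodes) - 1) for n in nodes}

# PageRank by power iteration, damping d = 0.85.
out = defaultdict(list)
for a, b in edges:
    out[a].append(b)
pr = {n: 1 / len(nodes) for n in nodes}
for _ in range(50):
    new = {n: (1 - 0.85) / len(nodes) for n in nodes}
    for a in nodes:
        if out[a]:
            share = 0.85 * pr[a] / len(out[a])
            for b in out[a]:
                new[b] += share
        else:  # dangling node: spread its mass uniformly
            for b in nodes:
                new[b] += 0.85 * pr[a] / len(nodes)
    pr = new

print(max(pr, key=pr.get))  # the much-retweeted "hub" dominates both measures
```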
Fisher's exact test explains a popular metric in information retrieval
Term frequency-inverse document frequency, or tf-idf for short, is a numerical measure that is widely used in information retrieval to quantify the importance of a term of interest in one out of many documents. While tf-idf was originally proposed as a heuristic, much work has been devoted over the years to placing it on a solid theoretical foundation. Following in this tradition, we here advance the first justification for tf-idf that is grounded in statistical hypothesis testing. More precisely, we first show that the one-tailed version of Fisher's exact test, also known as the hypergeometric test, corresponds well with a common tf-idf variant on selected real-data information retrieval tasks. We then set forth a mathematical argument suggesting that this tf-idf variant approximates the negative logarithm of the one-tailed Fisher's exact test P-value (i.e., a hypergeometric distribution tail probability). The Fisher's exact test interpretation of this common tf-idf variant furnishes the working statistician with a ready explanation of tf-idf's long-established effectiveness.
Comment: 26 pages, 4 figures, 1 table; minor revision.
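The correspondence the abstract describes can be computed directly: a one-tailed hypergeometric tail probability for the observed term count, placed next to a common tf·idf variant. The sketch below uses made-up corpus counts and one plausible tf·idf form (raw count times log inverse document frequency); it illustrates the two quantities being compared, not the paper's exact experimental setup.

```python
# Hedged sketch: a common tf-idf variant alongside the negative log of
# the one-tailed hypergeometric (Fisher's exact) tail probability.
from math import comb, log

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): n draws without
    replacement from a population of N tokens containing K successes."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    ) / comb(N, n)

# Toy corpus statistics (illustrative): the term occurs k times in a
# document of n tokens; K times in a corpus of N tokens over D documents.
N, K, n, k = 10_000, 50, 200, 6
D, df = 100, 10                      # documents in total / containing the term

tfidf = k * log(D / df)              # one common tf * idf variant
neg_log_p = -log(hypergeom_sf(k, N, K, n))

print(round(tfidf, 2), round(neg_log_p, 2))
```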
Combining compound and single terms under language model framework
Most existing information retrieval models, including probabilistic and vector space models, are based on the term independence hypothesis. To go beyond this assumption and thereby capture the semantics of documents and queries more accurately, several works have incorporated phrases or other syntactic information in IR; such attempts have shown slight benefit, at best. In language modeling approaches in particular, this extension is achieved through the use of bigram or n-gram models. However, in these models all bigrams/n-grams are considered and weighted uniformly. In this paper we introduce a new approach to select and weight relevant n-grams associated with a document. Experimental results on three TREC test collections show an improvement over three strong state-of-the-art baselines: the original unigram language model, the Markov Random Field model, and the positional language model.
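The contrast the abstract draws, uniform n-gram weighting versus selecting and weighting a relevant subset, can be sketched as an interpolated language model. The selection rule (keep bigrams occurring more than once), the interpolation weight, and the smoothing below are all assumptions for illustration, not the paper's exact model.

```python
# Illustrative sketch: score a document for a query by interpolating a
# smoothed unigram language model with a model over a *selected*,
# count-weighted subset of the document's bigrams.
from collections import Counter
from math import log

def score(query, doc_tokens, lam=0.7):
    uni = Counter(doc_tokens)
    bi = Counter(zip(doc_tokens, doc_tokens[1:]))
    n = len(doc_tokens)
    # "Relevant" bigrams (assumed rule): those occurring more than once.
    selected = {b: c for b, c in bi.items() if c > 1}
    total_bi = sum(selected.values()) or 1
    s = 0.0
    q = query.split()
    for t in q:                       # add-one smoothed unigram component
        s += lam * log((uni[t] + 1) / (n + len(uni)))
    for b in zip(q, q[1:]):           # selected-bigram component
        s += (1 - lam) * log((selected.get(b, 0) + 1) / (total_bi + len(bi) + 1))
    return s

doc = "the language model combines compound and single terms the language model".split()
print(score("language model", doc) > score("compound single", doc))  # → True
```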
Question answering systems for health professionals at the point of care -- a systematic review
Objective: Question answering (QA) systems have the potential to improve the quality of clinical care by providing health professionals with the latest and most relevant evidence. However, QA systems have not been widely adopted. This systematic review aims to characterize current medical QA systems, assess their suitability for healthcare, and identify areas of improvement.
Materials and methods: We searched PubMed, IEEE Xplore, ACM Digital Library, ACL Anthology, and forward and backward citations on 7 February 2023. We included peer-reviewed journal and conference papers describing the design and evaluation of biomedical QA systems. Two reviewers screened titles, abstracts, and full-text articles. We conducted a narrative synthesis and risk-of-bias assessment for each study, and assessed the utility of biomedical QA systems.
Results: We included 79 studies and identified themes including question realism, answer reliability, answer utility, clinical specialism, systems usability, and evaluation methods. Clinicians' questions used to train and evaluate QA systems were restricted to certain sources, types, and complexity levels. No system communicated confidence levels in the answers or sources. Many studies suffered from high risks of bias and applicability concerns. Only 8 studies completely satisfied any criterion for clinical utility, and only 7 reported user evaluations. Most systems were built with limited input from clinicians.
Discussion: While machine learning methods have led to increased accuracy, most studies imperfectly reflected real-world healthcare information needs. Key research priorities include developing more realistic healthcare QA datasets and considering the reliability of answer sources, rather than merely focusing on accuracy.
Comment: Accepted to the Journal of the American Medical Informatics Association (JAMIA).
Sidra5: a search system with geographic signatures
Master's thesis in Informatics Engineering, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2007. The dissertation presents the development of a geographic information search system which implements geographic signatures, a novel approach for modeling the geographic information present in documents. The goal of the project was to determine whether the geographically meaningful information present in documents, captured as geographic signatures, contributes to improved search results. Several strategies for computing the similarity between the geographic signatures of queries and documents are proposed and evaluated experimentally. The results show that, in some circumstances, geographic signatures can indeed improve the quality of geographic queries.
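One plausible form such a query-document signature similarity could take is a weighted vector over place names compared by cosine similarity. This representation and its weights are assumptions for illustration; the thesis's actual similarity strategies are not specified in the abstract.

```python
# Hypothetical sketch: a geographic signature as a {place: weight} vector,
# with query-document similarity computed as cosine similarity.
from math import sqrt

def cosine(sig_a, sig_b):
    places = set(sig_a) | set(sig_b)
    dot = sum(sig_a.get(p, 0.0) * sig_b.get(p, 0.0) for p in places)
    na = sqrt(sum(v * v for v in sig_a.values()))
    nb = sqrt(sum(v * v for v in sig_b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy signatures: a document mostly about Lisboa, a query about Lisboa.
doc_sig = {"Lisboa": 0.8, "Porto": 0.2}
query_sig = {"Lisboa": 1.0}
print(round(cosine(query_sig, doc_sig), 3))
```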
Smart Search Engine For Information Retrieval
This project addresses a central research problem in information retrieval and semantic search. It proposes the smart search theory, a new theory based on the hypothesis that the semantic meaning of a document can be described by a set of keywords. Two experiments designed and carried out in this project provide positive evidence supporting the smart search theory.
In the proposed theory, smart search aims to determine, for any web document, a set of keywords by which the semantic meaning of the document can be uniquely identified. At the same time, the set of keywords is assumed to be small enough to be easily managed. This is the fundamental assumption for creating the smart semantic search engine. The project discusses the rationale for this assumption and the theory built on it, as well as how the theory can be applied to keyword allocation and to the data model to be generated. It then proposes the design of the smart search engine, as a solution to the efficiency problem of searching the huge and growing amount of information published on the web.
To achieve high efficiency in web searching, statistical methods prove effective and can be interpreted at the semantic level. Based on the frequency of joint keywords, a keyword list can be generated and its entries linked to each other to form a meaning structure. A data model is built once a proper keyword list is obtained, and the model is applied to the design of the smart search engine.
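The "frequency of joint keywords" step above can be sketched as counting keyword co-occurrence across documents and linking pairs whose joint frequency clears a threshold. The documents and the threshold are assumptions for illustration, not the project's data or exact rule.

```python
# Illustrative sketch: build a keyword linking structure from joint
# keyword frequencies across a toy document collection.
from collections import Counter
from itertools import combinations

docs = [
    {"search", "engine", "semantic"},
    {"search", "engine", "index"},
    {"semantic", "keyword", "search"},
]

joint = Counter()
for kws in docs:
    for a, b in combinations(sorted(kws), 2):
        joint[(a, b)] += 1

# Link keyword pairs whose joint frequency exceeds a threshold (assumed = 1).
links = {pair for pair, c in joint.items() if c > 1}
print(sorted(links))  # → [('engine', 'search'), ('search', 'semantic')]
```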
Biomedical Question Answering: A Survey of Approaches and Challenges
Automatic Question Answering (QA) has been successfully applied in various domains such as search engines and chatbots. Biomedical QA (BQA), as an emerging QA task, enables innovative applications to effectively perceive, access, and understand complex biomedical knowledge. There have been tremendous developments in BQA over the past two decades, which we classify into five distinctive approaches: classic, information retrieval, machine reading comprehension, knowledge base, and question entailment approaches. In this survey, we introduce the available datasets and representative methods of each BQA approach in detail. Despite these developments, BQA systems are still immature and rarely used in real-life settings. We identify and characterize several key challenges in BQA that might lead to this issue, and discuss some potential future directions to explore.
Comment: In submission to ACM Computing Surveys.