Machine Reading at Scale: A Search Engine for Scientific and Academic Research
The Internet, much like our universe, is ever-expanding. Information, in the most varied formats, is continuously added to the point of information overload. Consequently, the ability to navigate this ocean of data is crucial in our day-to-day lives, with familiar tools such as search engines carving a path through the unknown. In the research world, articles on a myriad of topics and with distinct levels of complexity are published daily, requiring specialized tools that facilitate access to, and assessment of, the information they contain. Recent endeavors in artificial intelligence, and in natural language processing in particular, are potential solutions for breaking information overload and providing enhanced search mechanisms by means of advanced algorithms. While the advent of transformer-based language models has enabled a more comprehensive analysis of both text-encoded intents and true document semantics, it has also increased the demand for computational resources. Information retrieval methods can act as low-complexity, yet reliable, filters that feed these heavier algorithms, reducing computational requirements substantially. In this work, a new search engine is proposed, addressing machine reading at scale in the context of scientific and academic research. It combines state-of-the-art algorithms for information retrieval and reading comprehension to extract meaningful answers from a corpus of scientific documents. The solution is tested on two current and relevant topics, cybersecurity and energy, showing that the system performs under distinct knowledge domains while achieving competent performance. This work has received funding from the following projects: UIDB/00760/2020 and UIDP/00760/2020.
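The retrieve-then-read architecture the abstract describes can be illustrated with a minimal sketch: a cheap lexical filter (here plain TF-IDF) narrows the corpus, and a second, "heavier" stage extracts an answer from the surviving documents. The function names, the toy corpus, and the overlap-based "reader" are illustrative stand-ins, not the system described in the abstract (which uses transformer-based reading comprehension for the second stage).

```python
import math
from collections import Counter

def tfidf_retrieve(query, docs, k=2):
    """Stage 1: low-complexity lexical filter (TF-IDF) over the whole corpus."""
    tokens = [doc.lower().split() for doc in docs]  # naive whitespace tokenization
    n = len(docs)
    df = Counter()  # document frequency of each term
    for toks in tokens:
        for term in set(toks):
            df[term] += 1

    def score(q_terms, toks):
        tf = Counter(toks)
        # sum of tf * idf over query terms present in the document
        return sum(tf[t] * math.log(n / df[t]) for t in q_terms if t in tf)

    q = query.lower().split()
    ranked = sorted(range(n), key=lambda i: score(q, tokens[i]), reverse=True)
    return ranked[:k]  # indices of the top-k candidate documents

def overlap_reader(query, doc):
    """Stage 2 stand-in: return the sentence sharing most terms with the query.
    A real system would run a reading-comprehension model here."""
    q = set(query.lower().split())
    sentences = [s.strip() for s in doc.split('.') if s.strip()]
    return max(sentences, key=lambda s: len(q & set(s.lower().split())))
```

Because stage 1 only touches term statistics, it scales to large corpora, and the expensive stage 2 runs on a handful of candidates, which is the computational trade-off the abstract argues for.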
Adaptive distributional extensions to DFR ranking
Divergence From Randomness (DFR) ranking models assume that informative terms
are distributed in a corpus differently than non-informative terms. Different
statistical models (e.g. Poisson, geometric) are used to model the distribution
of non-informative terms, producing different DFR models. An informative term
is then detected by measuring the divergence of its distribution from the
distribution of non-informative terms. However, there is little empirical
evidence that the distributions of non-informative terms used in DFR actually
fit current datasets. Practically this risks providing a poor separation
between informative and non-informative terms, thus compromising the
discriminative power of the ranking model. We present a novel extension to DFR,
which first detects the best-fitting distribution of non-informative terms in a
collection, and then adapts the ranking computation to this best-fitting
distribution. We call this model Adaptive Distributional Ranking (ADR) because
it adapts the ranking to the statistics of the specific dataset being processed
each time. Experiments on TREC data show ADR to outperform DFR models (and
their extensions) and be comparable in performance to a query likelihood
language model (LM).
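The adaptive step can be sketched with a toy example (not the paper's implementation): given a term's per-document frequency counts across a collection, fit each candidate distribution by maximum likelihood and keep the better-fitting one. The function names and the restriction to Poisson vs. geometric (on support 0, 1, 2, ...) are simplifying assumptions for illustration.

```python
import math

def log_likelihood_poisson(counts):
    """Log-likelihood of counts under a Poisson MLE fit (lambda = sample mean)."""
    lam = sum(counts) / len(counts)
    if lam == 0:
        return 0.0  # degenerate case: all counts are zero
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

def log_likelihood_geometric(counts):
    """Log-likelihood under a geometric MLE fit on support 0, 1, 2, ...
    MLE for this parameterization: p = 1 / (1 + sample mean)."""
    mean = sum(counts) / len(counts)
    p = 1.0 / (1.0 + mean)
    if p == 1.0:
        return 0.0  # degenerate case: all counts are zero
    return sum(math.log(p) + k * math.log(1.0 - p) for k in counts)

def best_fit(counts):
    """Pick the candidate distribution maximizing the likelihood of the data."""
    fits = {
        "poisson": log_likelihood_poisson(counts),
        "geometric": log_likelihood_geometric(counts),
    }
    return max(fits, key=fits.get)
```

The ranking model would then use the selected distribution as its model of non-informative terms, so the divergence measure is computed against statistics that actually fit the collection at hand, which is the core idea behind ADR.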