8 research outputs found
ExpFinder: An Ensemble Expert Finding Model Integrating N-gram Vector Space Model and CO-HITS
Finding an expert plays a crucial role in driving successful collaborations
and speeding up high-quality research development and innovations. However, the
rapid growth of scientific publications and digital expertise data makes
identifying the right experts a challenging problem. Existing approaches for
finding experts given a topic can be categorised into information retrieval
techniques based on vector space models, document language models, and
graph-based models. In this paper, we propose ExpFinder, a new
ensemble model for expert finding that integrates a novel N-gram vector
space model, denoted as nVSM, and a graph-based model, denoted as
μCO-HITS, a proposed variation of the CO-HITS algorithm.
The key of nVSM is to exploit a recent inverse document frequency weighting
method for N-gram words, and ExpFinder incorporates nVSM into
μCO-HITS to achieve expert finding. We comprehensively evaluate
ExpFinder on four different datasets from the academic domains in
comparison with six different expert finding models. The evaluation results
show that ExpFinder is a highly effective model for expert finding,
substantially outperforming all the compared models by 19% to 160.2%.
Comment: 15 pages, 18 figures; for source code on GitHub, see
https://github.com/Yongbinkang/ExpFinder. Submitted to IEEE Transactions on
Knowledge and Data Engineering.
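The N-gram vector-space half of such an ensemble can be illustrated with a toy sketch. This is not the authors' implementation: the corpus, the expert-to-document mapping, and the simple sublinear TF-IDF weighting below are all illustrative stand-ins.

```python
# Toy sketch: score experts for a topic via an N-gram vector space model.
# Documents are TF-IDF vectors over word N-grams; an expert's score sums
# the topic N-gram weights of the documents they authored.
import math
from collections import Counter, defaultdict

def ngrams(tokens, n=2):
    """All word N-grams of length 1..n (unigrams and bigrams by default)."""
    out = []
    for size in range(1, n + 1):
        out += [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    return out

def tfidf_vectors(docs, n=2):
    """One TF-IDF vector per document over N-gram terms."""
    grams = [Counter(ngrams(d.lower().split(), n)) for d in docs]
    df = Counter(g for c in grams for g in c)
    total = len(docs)
    return [{g: (1 + math.log(tf)) * math.log(total / df[g]) for g, tf in c.items()}
            for c in grams]

def expert_scores(docs, authorship, topic, n=2):
    """Aggregate each expert's document weights over the topic's N-grams."""
    vecs = tfidf_vectors(docs, n)
    topic_grams = set(ngrams(topic.lower().split(), n))
    scores = defaultdict(float)
    for doc_id, experts in authorship.items():
        weight = sum(vecs[doc_id].get(g, 0.0) for g in topic_grams)
        for expert in experts:
            scores[expert] += weight
    return dict(scores)
```

In the full model these topic-expert weights would then seed the graph-based μCO-HITS propagation rather than being used directly.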
Biomedical term extraction: overview and a new methodology
Terminology extraction is an essential task in domain knowledge acquisition, as well as for Information Retrieval (IR). It is also a mandatory first step in building or enriching terminologies and ontologies. As often proposed in the literature, existing terminology extraction methods feature linguistic and statistical aspects and solve some, but not all, of the problems related to term extraction, e.g. noise, silence, low frequency, large corpora, and the complexity of the multi-word term extraction process. In contrast, we propose a cutting-edge methodology to extract and rank biomedical terms that covers all the aforementioned problems. This methodology offers several measures based on linguistic, statistical, graph, and web aspects. These measures extract and rank candidate terms with excellent precision: we demonstrate that they outperform previously reported precision results for automatic term extraction, and work with different languages (English, French, and Spanish). We also demonstrate how using graphs and the web to assess the significance of a candidate term enables us to further improve precision. We evaluated our methodology on the biomedical GENIA and LabTestsOnline corpora and compared it with previously reported measures.
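One classic statistical ingredient of such term-extraction pipelines can be sketched with the C-value measure, which promotes frequent multi-word candidates that are not mere substrings of longer frequent candidates. The candidate generation below (plain word N-grams) is a naive stand-in for the linguistic filters the methodology describes, and the function names are illustrative.

```python
# Sketch: rank multi-word term candidates with the C-value measure.
import math
from collections import Counter

def candidate_ngrams(sentences, max_len=3):
    """Frequency of word N-grams (length 2..max_len) as naive term candidates."""
    freq = Counter()
    for sent in sentences:
        toks = sent.lower().split()
        for n in range(2, max_len + 1):
            for i in range(len(toks) - n + 1):
                freq[tuple(toks[i:i + n])] += 1
    return freq

def c_value(freq):
    """C-value: a candidate's frequency, discounted by the average frequency
    of the longer candidates that contain it, scaled by log2 of its length."""
    scores = {}
    for term, f in freq.items():
        k = len(term)
        nests = [g for g in freq if len(g) > k and
                 any(g[i:i + k] == term for i in range(len(g) - k + 1))]
        if nests:
            scores[term] = math.log2(k) * (f - sum(freq[g] for g in nests) / len(nests))
        else:
            scores[term] = math.log2(k) * f
    return scores
```

On a corpus where "gene expression" recurs inside several longer phrases, it still outranks each individual longer phrase, which is the behavior the discount is designed to produce.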
Retrieval for Extremely Long Queries and Documents with RPRS: a Highly Efficient and Effective Transformer-based Re-Ranker
Retrieval with extremely long queries and documents is a well-known and
challenging task in information retrieval and is commonly known as
Query-by-Document (QBD) retrieval. Specifically designed Transformer models
that can handle long input sequences have not shown high effectiveness in QBD
tasks in previous work. We propose a Re-Ranker based on the novel Proportional
Relevance Score (RPRS) to compute the relevance score between a query and the
top-k candidate documents. Our extensive evaluation shows RPRS obtains
significantly better results than the state-of-the-art models on five different
datasets. Furthermore, RPRS is highly efficient since all documents can be
pre-processed, embedded, and indexed before query time which gives our
re-ranker the advantage of having a complexity of O(N) where N is the total
number of sentences in the query and candidate documents. Furthermore, our
method addresses low-resource training in QBD retrieval tasks: it does not
need large amounts of training data and has only three parameters with a
limited range, which can be optimized with a grid search even when only a
small amount of labeled data is available. Our detailed analysis shows that
RPRS benefits from covering the full length of candidate documents and
queries.
Comment: Accepted at ACM Transactions on Information Systems (ACM TOIS journal).
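The gist of such sentence-level re-ranking can be shown in a rough sketch: every query sentence votes for the documents containing its nearest candidate sentences, and votes are normalized by document length so long documents are not favored. Bag-of-words cosine stands in for the precomputed transformer sentence embeddings, and the rank-proportional voting is only an approximation of the paper's scoring; the function names and the top_k parameter are illustrative.

```python
# Rough sketch of sentence-level Query-by-Document re-ranking.
import math
from collections import Counter

def bow(sentence):
    """Bag-of-words vector; a stand-in for a precomputed sentence embedding."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    num = sum(w * b[t] for t, w in a.items() if t in b)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def rerank(query_sents, docs, top_k=3):
    """docs: {doc_id: [sentences]}. Each query sentence votes for the documents
    holding its nearest sentences; votes are length-normalized per document."""
    pool = [(doc_id, bow(s)) for doc_id, sents in docs.items() for s in sents]
    votes = Counter()
    for qs in query_sents:
        qv = bow(qs)
        ranked = sorted(pool, key=lambda p: cosine(qv, p[1]), reverse=True)[:top_k]
        for rank, (doc_id, _) in enumerate(ranked):
            votes[doc_id] += (top_k - rank) / top_k  # higher rank, bigger vote
    scores = {doc_id: votes[doc_id] / len(sents) for doc_id, sents in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because the sentence vectors for all candidate documents can be computed once offline, query-time work is linear in the number of sentences involved, mirroring the O(N) property claimed above.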
A systematic approach to normalization in probabilistic models
Open access funding provided by Austrian Science Fund (FWF). This research was partly supported by the Austrian Science Fund (FWF) Project Number P25905-N23 (ADmIRE). This work has been supported by the Self-Optimizer project (FFG 852624) in the EUROSTARS programme, funded by EUREKA, the BMWFW and the European Union.
Lifelog: moments retrieval algorithm
The increase in the variety and quantity of wearable devices brought a parallel
growth of the diversity and amount of data produced. Nowadays, any individual
using a personal smartphone produces a large number of daily records of moments.
This data typology results from everyday scenarios recorded in images and often
detailed with biometric data as well as activity, location, and time records.
When storing this diversity and amount of data, a question arises: how can we
identify and retrieve an exact moment in large data archives? Retrieving a
moment can serve the simple purpose of revisiting a distant episode, but it can
also support people with memory disorders. The application of computer systems
for this purpose is the main answer. In addition to identifying and retrieving
a moment, they are applied with the main objective of improving the quality of
human life.
These facts require these systems to reduce the communicational distance between
natural language and computer language. To that end, they consist of text
processing and analysis algorithms that aim to establish an interactive link
between users and the system.
In this sense, the solution proposed in this dissertation is based on an
algorithm that receives and understands the moment the user describes and
tries to return that moment in the form of images, taken from the user's
database, in which that moment may be represented. Its development involves
applying methodologies described in the state of the art together with new
approaches in the results-ranking system. The algorithm incorporates NLP tools
that are fundamental to the communication between both parties. Moreover, it
encompasses the TF-IDF mathematical function, with vectorization supported by
cosine similarity, which is responsible for selecting the moments that best
match the user's description. The BM25 function was also introduced into the
algorithm to reinforce the analysis of similarities between question and
answers. The combination of both techniques gives the algorithm a greater
probability of returning the correct moment.
The developed mechanism shows very satisfactory and interesting results, given
that in several interactions it returns the correct moment or at least
identifies episodes similar to the user's description.
The knowledge acquired throughout this dissertation allows me to conclude that
the algorithm would gain further value from a stronger emphasis on the textual
moment description introduced by the user. Automatic identification of key
fields would allow the filtering system applied in the algorithm to become
fully automated.
Master's in Electronics and Telecommunications Engineering (Mestrado em Engenharia Eletrónica e Telecomunicações)
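The two retrieval signals the dissertation combines can be sketched in a small blended ranker. The tokenization, the BM25 variant, and the blending weight alpha below are illustrative choices, not the dissertation's exact pipeline.

```python
# Sketch: rank textual moment descriptions by blending TF-IDF cosine
# similarity with a (max-normalized) BM25 score.
import math
from collections import Counter

def bm25_score(query_toks, doc_toks, docs_toks, k1=1.5, b=0.75):
    N = len(docs_toks)
    avgdl = sum(len(d) for d in docs_toks) / N
    tf = Counter(doc_toks)
    score = 0.0
    for t in query_toks:
        df = sum(1 for d in docs_toks if t in d)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc_toks) / avgdl))
    return score

def tfidf_cosine(query_toks, doc_toks, docs_toks):
    N = len(docs_toks)
    def idf(t):
        df = sum(1 for d in docs_toks if t in d)
        return math.log((N + 1) / (df + 1))
    qv = {t: c * idf(t) for t, c in Counter(query_toks).items()}
    dv = {t: c * idf(t) for t, c in Counter(doc_toks).items()}
    num = sum(w * dv.get(t, 0.0) for t, w in qv.items())
    den = (math.sqrt(sum(v * v for v in qv.values())) *
           math.sqrt(sum(v * v for v in dv.values())))
    return num / den if den else 0.0

def retrieve_moment(query, moments, alpha=0.5):
    """Return the index of the moment description that best matches the query."""
    docs_toks = [m.lower().split() for m in moments]
    q = query.lower().split()
    bm = [bm25_score(q, d, docs_toks) for d in docs_toks]
    top = max(bm) or 1.0  # normalize BM25 so both signals share a scale
    blended = [alpha * tfidf_cosine(q, d, docs_toks) + (1 - alpha) * bm[i] / top
               for i, d in enumerate(docs_toks)]
    return max(range(len(moments)), key=lambda i: blended[i])
```

In the dissertation's setting, the returned index would map back to the images stored for that moment in the user's database.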
Model-based feature construction and text representation for social media analysis
Text representation is at the foundation of most text-based applications. Surface features are insufficient for many tasks and therefore constructing powerful discriminative features in a general way is an open challenge. Current approaches use deep neural networks to bypass feature construction. While deep learning can learn sophisticated representations from the text, it requires a lot of training data, which might not be readily available, and the derived features are not necessarily interpretable.
In this work, we explore a novel paradigm, model-based feature construction (MBFC), that allows us to construct semantic features that can potentially improve many applications. In brief, MBFC uses human knowledge and expertise as well as big data to guide the design of models that enhance predictive modeling and support the data mining process by extracting useful knowledge, which in turn can be used as features for downstream prediction tasks. In this dissertation, we show how this paradigm can be applied to several tasks of social media analysis. We explore how MBFC can be used to solve the problem of target misalignment for prediction, where the output variable and the data may be at different levels of resolution and the goal is to construct features that can bridge this gap. The MBFC method allows us to use additional related data, e.g. associated context, to facilitate semantic analysis and feature construction.
In this dissertation, we focus on a subset of problems in which social media data, in particular text data, can be leveraged to construct useful representations for prediction. We explore several kinds of user-generated content in social media data, such as review data for useful review prediction, micro-blogging data for urgent health-based prediction tasks, and discussion forum data for expert prediction. First, we propose a background mixture model to capture incongruity features in text and use these features for humor detection in restaurant reviews. Second, we propose a source reliability feature representation method for trustworthy comment identification that incorporates user aspect expertise when modeling fine-grained reliabilities in an online discussion forum. And finally, we propose multi-view attribute features that adapt MBFC to handle the target misalignment problem for topic-based features and apply this to tweets in order to forecast new diagnosis rates for sexually transmitted infections.
Adaptive term frequency normalization for BM25
A key component of BM25 contributing to its success is its sub-linear term frequency (TF) normalization formula. The scale and shape of this TF normalization component are controlled by a parameter k1, which is generally set to a term-independent constant. We hypothesize, and show empirically, that in order to optimize retrieval performance this parameter should be set in a term-specific way. Following this intuition, we propose an information gain measure to directly estimate the contributions of repeated term occurrences, which is then exploited to fit the BM25 function to predict a term-specific k1. Our experimental results show that the proposed approach, without needing any training data, can efficiently and automatically estimate a term-specific k1, and is more effective and robust than the standard BM25.
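The term-specific k1 idea can be sketched by threading a per-term k1 into the standard BM25 formula. How each k1 value is fitted (the paper derives it from an information-gain measure) is outside this sketch; here the dictionary is simply supplied by the caller, and the function names are illustrative.

```python
# Sketch: BM25 scoring with a term-specific k1 instead of a global constant.
import math
from collections import Counter

def bm25_term_specific(query, doc, corpus, k1_per_term, default_k1=1.2, b=0.75):
    """Score one tokenized document; k1_per_term maps a term to its own k1."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for t in query:
        df = sum(1 for d in corpus if t in d)
        if df == 0 or tf[t] == 0:
            continue
        k1 = k1_per_term.get(t, default_k1)  # term-specific TF saturation
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

The effect of the parameter is easy to see at the extremes: with k1 = 0 for a term, repeated occurrences add nothing beyond the first (the TF component saturates immediately), while larger k1 values let repetitions keep increasing the score.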