8 research outputs found

    ExpFinder: An Ensemble Expert Finding Model Integrating N-gram Vector Space Model and μCO-HITS

    Full text link
    Finding an expert plays a crucial role in driving successful collaborations and speeding up high-quality research development and innovation. However, the rapid growth of scientific publications and digital expertise data makes identifying the right experts a challenging problem. Existing approaches for finding experts on a given topic can be categorised into information retrieval techniques based on vector space models, document language models, and graph-based models. In this paper, we propose ExpFinder, a new ensemble model for expert finding that integrates a novel N-gram vector space model, denoted as nVSM, and a graph-based model, denoted as μCO-HITS, a proposed variation of the CO-HITS algorithm. The key idea of nVSM is to exploit a recent inverse document frequency weighting method for N-gram words, and ExpFinder incorporates nVSM into μCO-HITS to achieve expert finding. We comprehensively evaluate ExpFinder on four different datasets from academic domains against six other expert finding models. The evaluation results show that ExpFinder is a highly effective model for expert finding, substantially outperforming all the compared models by 19% to 160.2%.
    Comment: 15 pages, 18 figures; for source code on GitHub, see https://github.com/Yongbinkang/ExpFinder; submitted to IEEE Transactions on Knowledge and Data Engineering
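    The graph-based half of the ensemble can be illustrated with a minimal CO-HITS-style propagation on an expert-document bipartite graph, seeded with document relevance scores such as those a vector space model would produce. This is a simplified sketch, not the paper's μCO-HITS formulation; the function name, the damping weight `lam`, and the data layout are illustrative.

    ```python
    def co_hits(doc_scores, authorship, lam=0.8, iters=20):
        """doc_scores: {doc: seed relevance from the VSM};
        authorship: {doc: set of expert ids who wrote it}."""
        experts = {e for es in authorship.values() for e in es}
        y = dict(doc_scores)  # document scores, seeded by the VSM
        for _ in range(iters):
            # expert scores: sum of their documents' current scores
            x = {e: sum(y[d] for d, es in authorship.items() if e in es)
                 for e in experts}
            z = sum(x.values()) or 1.0
            x = {e: v / z for e, v in x.items()}  # normalise to a distribution
            # document scores: mix the VSM seed with the authors' scores
            y = {d: (1 - lam) * doc_scores[d] + lam * sum(x[e] for e in authorship[d])
                 for d in doc_scores}
        return x
    ```

    The mutual reinforcement means an expert's rank depends both on how relevant their documents are to the topic and on how those documents' scores evolve under propagation, which is the intuition behind combining nVSM with a CO-HITS variant.
    
    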

    Biomedical term extraction: overview and a new methodology

    Get PDF
    Terminology extraction is an essential task in domain knowledge acquisition, as well as in Information Retrieval (IR). It is also a mandatory first step for building or enriching terminologies and ontologies. As often noted in the literature, existing terminology extraction methods combine linguistic and statistical aspects and solve some, but not all, of the problems related to term extraction, e.g. noise, silence, low frequency, large corpora, and the complexity of the multi-word term extraction process. In contrast, we propose a cutting-edge methodology to extract and rank biomedical terms that covers all of the aforementioned problems. This methodology offers several measures based on linguistic, statistical, graph, and web aspects. These measures extract and rank candidate terms with excellent precision: we demonstrate that they outperform previously reported precision results for automatic term extraction and work across different languages (English, French, and Spanish). We also demonstrate how using graphs and the web to assess the significance of a candidate term further improves precision. We evaluated our methodology on the biomedical GENIA and LabTestsOnline corpora and compared it with previously reported measures.
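    A representative statistical measure in this family is the C-value, which rewards frequent multi-word candidates while discounting those that mostly occur nested inside longer terms. The sketch below is a standard C-value implementation for illustration, not the paper's own measures; the input data and the small-constant weight for single-word terms are illustrative choices.

    ```python
    import math

    def c_value(freq):
        """freq: {candidate term (tuple of words): corpus frequency}."""
        scores = {}
        for a, f_a in freq.items():
            # frequencies of longer candidates containing `a` as a sub-span
            nests = [f for b, f in freq.items()
                     if len(b) > len(a)
                     and any(b[i:i + len(a)] == a
                             for i in range(len(b) - len(a) + 1))]
            # log2 of term length; small constant for single words (log2(1)=0)
            length_w = math.log2(len(a)) if len(a) > 1 else 0.1
            if nests:
                # discount by the average frequency of the nesting terms
                scores[a] = length_w * (f_a - sum(nests) / len(nests))
            else:
                scores[a] = length_w * f_a
        return scores
    ```

    On a toy corpus where "T cell" occurs 5 times but 3 of those are inside "T cell receptor", the longer term ranks higher, which matches the intuition that the nested occurrences are not independent evidence for the shorter candidate.
    
    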

    Retrieval for Extremely Long Queries and Documents with RPRS: a Highly Efficient and Effective Transformer-based Re-Ranker

    Full text link
    Retrieval with extremely long queries and documents is a well-known and challenging task in information retrieval, commonly known as Query-by-Document (QBD) retrieval. Transformer models specifically designed to handle long input sequences have not shown high effectiveness in QBD tasks in previous work. We propose a Re-Ranker based on the novel Proportional Relevance Score (RPRS) to compute the relevance score between a query and the top-k candidate documents. Our extensive evaluation shows that RPRS obtains significantly better results than state-of-the-art models on five different datasets. Furthermore, RPRS is highly efficient, since all documents can be pre-processed, embedded, and indexed before query time, which gives our re-ranker a complexity of O(N), where N is the total number of sentences in the query and candidate documents. Our method also addresses low-resource training in QBD retrieval tasks: it does not need large amounts of training data and has only three parameters with a limited range, which can be optimized with a grid search even when only a small amount of labeled data is available. Our detailed analysis shows that RPRS benefits from covering the full length of candidate documents and queries.
    Comment: Accepted at ACM Transactions on Information Systems (ACM TOIS)
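    The efficiency claim rests on sentence embeddings being computable offline: at query time only similarity lookups remain. A much-simplified sketch of a sentence-level, proportional relevance score in this spirit (not the paper's exact RPRS formula): each query sentence votes for the candidate document holding its most similar sentence, and a document's score is its share of the votes. All names and the voting rule are illustrative.

    ```python
    def proportional_scores(query_embs, doc_embs):
        """query_embs: list of unit vectors, one per query sentence;
        doc_embs: {doc_id: list of unit vectors, one per document sentence}.
        For unit vectors, the dot product equals cosine similarity."""
        dot = lambda a, b: sum(x * y for x, y in zip(a, b))
        votes = {doc_id: 0 for doc_id in doc_embs}
        for q in query_embs:
            best_doc, best_sim = None, float("-inf")
            for doc_id, sents in doc_embs.items():
                # best-matching sentence of this candidate for query sentence q
                sim = max(dot(s, q) for s in sents)
                if sim > best_sim:
                    best_doc, best_sim = doc_id, sim
            votes[best_doc] += 1
        n = len(query_embs) or 1
        return {d: v / n for d, v in votes.items()}
    ```

    Because the per-sentence embeddings are fixed, the query-time cost grows linearly with the number of sentences involved, which mirrors the O(N) argument in the abstract.
    
    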

    A systematic approach to normalization in probabilistic models

    Get PDF
    Open access funding provided by Austrian Science Fund (FWF). This research was partly supported by the Austrian Science Fund (FWF) Project Number P25905-N23 (ADmIRE). This work has been supported by the Self-Optimizer project (FFG 852624) in the EUROSTARS programme, funded by EUREKA, the BMWFW and the European Union

    Lifelog: moments retrieval algorithm

    Get PDF
    The increasing variety and number of wearable sensing devices has brought a parallel growth in the diversity and amount of data produced. Nowadays, any individual with a personal smartphone produces a large number of daily records of moments. This type of data results from everyday scenarios recorded as images and frequently detailed with biometric data as well as activity, location, and time records. When storing this diversity and amount of data, a question arises: how can we identify and retrieve an exact moment in large data archives? Retrieving a moment can serve the simple purpose of revisiting a distant episode, but it can also support people with memory disorders. Computer systems are the main answer for this purpose: beyond identifying and retrieving a moment, they are applied with the overall goal of improving quality of life. This requires such systems to reduce the communicational distance between natural language and computer language, so they consist of text processing and analysis algorithms that establish an interactive link between users and the system.
    In this sense, the solution proposed in this dissertation is based on an algorithm that receives and understands the moment described by the user and tries to return that moment in the form of images, taken from the user's database, in which that moment may be represented. Its development applies methodologies described in the state of the art together with new approaches to the results ranking system. The algorithm incorporates NLP tools that are fundamental for the communication between both parties. Moreover, it encompasses the TF-IDF function with vectorization supported by cosine similarity, responsible for selecting the moments that best match the user's description. The BM25 function was also introduced into the algorithm to reinforce the analysis of similarities between the question and the answers. The combination of both techniques gives the algorithm a greater probability of returning the correct moment. The developed mechanism shows very satisfactory and interesting results, since in several interactions it returns the correct moment or at least identifies episodes similar to the user's description. The knowledge acquired throughout this dissertation allows me to conclude that the algorithm would gain further value from a stronger emphasis on the textual description of a moment introduced by the user: the automatic identification of key fields would allow the filtering system applied in the algorithm to become fully automated.
    Mestrado em Engenharia Eletrónica e Telecomunicações
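    The retrieval core described above, TF-IDF cosine similarity fused with BM25, can be sketched in a few lines. The toy moment descriptions, the fusion weight `alpha`, and the exact combination rule are illustrative assumptions; the dissertation's actual fusion may differ.

    ```python
    import math
    from collections import Counter

    def rank(query, docs, k1=1.5, b=0.75, alpha=0.5):
        """Score each doc by alpha * tfidf_cosine + (1 - alpha) * bm25."""
        q = query.lower().split()
        toks = [d.lower().split() for d in docs]
        N = len(docs)
        avgdl = sum(len(t) for t in toks) / N
        df = Counter(w for t in toks for w in set(t))
        idf = {w: math.log(1 + (N - n + 0.5) / (n + 0.5)) for w, n in df.items()}
        scores = []
        for t in toks:
            tf = Counter(t)
            # TF-IDF cosine similarity between query and document vectors
            dvec = {w: tf[w] * idf.get(w, 0.0) for w in tf}
            qvec = {w: idf.get(w, 0.0) for w in set(q)}
            dot = sum(dvec.get(w, 0.0) * qv for w, qv in qvec.items())
            norm = (math.sqrt(sum(v * v for v in dvec.values())) *
                    math.sqrt(sum(v * v for v in qvec.values())) or 1.0)
            cos = dot / norm
            # BM25 score of the same document for the same query
            bm25 = sum(idf.get(w, 0.0) * tf[w] * (k1 + 1) /
                       (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
                       for w in set(q) if w in tf)
            scores.append(alpha * cos + (1 - alpha) * bm25)
        return scores
    ```

    For a query like "sunset at the beach" against moment captions such as "beach sunset with friends", the fused score rewards both vector-space similarity and BM25's length-normalised term weighting, which is the coalition of techniques the abstract describes.
    
    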

    Model-based feature construction and text representation for social media analysis

    Get PDF
    Text representation is at the foundation of most text-based applications. Surface features are insufficient for many tasks and therefore constructing powerful discriminative features in a general way is an open challenge. Current approaches use deep neural networks to bypass feature construction. While deep learning can learn sophisticated representations from the text, it requires a lot of training data, which might not be readily available, and the derived features are not necessarily interpretable. In this work, we explore a novel paradigm, model-based feature construction (MBFC), that allows us to construct semantic features that can potentially improve many applications. In brief, MBFC uses human knowledge and expertise as well as big data to guide the design of models that enhance predictive modeling and support the data mining process by extracting useful knowledge, which in turn can be used as features for downstream prediction tasks. In this dissertation, we show how this paradigm can be applied to several tasks of social media analysis. We explore how MBFC can be used to solve the problem of target misalignment for prediction, where the output variable and the data may be at different levels of resolution and the goal is to construct features that can bridge this gap. The MBFC method allows us to use additional related data, e.g. associated context, to facilitate semantic analysis and feature construction. In this dissertation, we focus on a subset of problems in which social media data, in particular text data, can be leveraged to construct useful representations for prediction. We explore several kinds of user-generated content in social media data such as review data for useful review prediction, micro-blogging data for urgent health-based prediction tasks, and discussion forum data for expert prediction. 
    First, we propose a background mixture model to capture incongruity features in text and use these features for humor detection in restaurant reviews. Second, we propose a source-reliability feature representation method for trustworthy comment identification that incorporates user aspect expertise when modeling fine-grained reliabilities in an online discussion forum. Finally, we propose multi-view attribute features that adapt MBFC to handle the target misalignment problem for topic-based features, and apply this to tweets in order to forecast new diagnosis rates for sexually transmitted infections.

    Adaptive term frequency normalization for BM25

    No full text
    A key component of BM25 contributing to its success is its sub-linear term frequency (TF) normalization formula. The scale and shape of this TF normalization component is controlled by a parameter k1, which is generally set to a term-independent constant. We hypothesize, and show empirically, that in order to optimize retrieval performance this parameter should be set in a term-specific way. Following this intuition, we propose an information gain measure to directly estimate the contributions of repeated term occurrences, which is then exploited to fit the BM25 function and predict a term-specific k1. Our experimental results show that the proposed approach, without needing any training data, can efficiently and automatically estimate a term-specific k1, and is more effective and robust than the standard BM25.
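    The TF component in question is standard BM25 machinery; the sketch below shows how k1 controls saturation (the paper's information-gain estimation of a term-specific k1 is its contribution and is not reproduced here). The default document-length values are illustrative.

    ```python
    def bm25_tf(tf, k1, b=0.75, dl=100, avgdl=100):
        """BM25 term-frequency component with length normalisation:
        tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))."""
        return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    ```

    With dl = avgdl, a single occurrence always scores 1.0, and repeated occurrences add diminishing amounts: a small k1 saturates quickly (extra occurrences of a term barely help), while a large k1 lets them keep contributing. Setting k1 per term, as the paper proposes, lets terms whose repetition is genuinely informative saturate more slowly than terms whose repetition carries little extra information.
    
    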