65 research outputs found
Temporal models for mining, ranking and recommendation in the Web
Owing to their first-hand, diverse and evolution-aware reflection of nearly all areas of life, heterogeneous temporal datasets such as the Web, collaborative knowledge bases and social networks have emerged as gold mines for content analytics of many sorts. In those collections, time plays an essential role in many crucial information retrieval and data mining tasks, ranging from user intent understanding and document ranking to advanced recommendation. There are two semantically close and important constituents when modeling along the time dimension: entities and events. Time serves as the context for changes driven by happenings and phenomena (events) that relate to people, organizations or places (so-called entities) in our social lives. Thus, determining what users expect, or in other words resolving the uncertainty confounded by temporal changes, is a compelling task in support of consistent user satisfaction.
In this thesis, we address the aforementioned issues and propose temporal models that capture the temporal dynamics of such entities and events in service of the end tasks. Specifically, we make the following contributions:
(1) Query recommendation and document ranking in the Web - we address the issues for suggesting entity-centric queries and ranking effectiveness surrounding the happening time period of an associated event. In particular, we propose a multi-criteria optimization framework that facilitates the combination of multiple temporal models to smooth out the abrupt changes when transitioning between event phases for the former and a probabilistic approach for search result diversification of temporally ambiguous queries for the latter.
(2) Entity relatedness in Wikipedia - we study the long-term dynamics of Wikipedia as a global memory place for high-impact events, specifically the reviving memories of past events. Additionally, we propose a neural network-based approach to measure the temporal relatedness of entities and events. The model engages different latent representations of an entity (i.e., from time, link-based graph and content) and uses the collective attention from user navigation as the supervision.
(3) Graph-based ranking and temporal anchor-text mining in Web Archives - we tackle the problem of discovering important documents along the time-span of Web Archives, leveraging the link graph. Specifically, we combine the problems of relevance, temporal authority, diversity and time in a unified framework. The model accounts for the incomplete link structure and natural time lag in Web Archives when mining the temporal authority.
(4) Methods for enhancing predictive models at an early stage in the social media and clinical domains - we investigate several methods to control model instability and enrich the contexts of predictive models during the “cold-start” period. We demonstrate their effectiveness for the rumor detection and blood glucose prediction cases respectively.
Overall, the findings presented in this thesis demonstrate the importance of tracking the temporal dynamics surrounding salient events and entities for IR applications. We show that determining such changes in time-based patterns and trends in prevalent temporal collections can better satisfy user expectations, and boost ranking and recommendation effectiveness over time.
A Survey on Automatically Mining Facets for Web Queries
In this paper, we present a detailed survey of facet mining techniques, together with their advantages and disadvantages. A facet is a word or phrase that summarizes an important aspect of a web query. Researchers have proposed a range of efficient techniques that substantially improve the user's web search experience. Users are satisfied when they find information relevant to their query among the top results. The objectives of the surveyed research are: (1) to present automated solutions that derive query facets by analyzing the text of the query; (2) to create a taxonomy of query refinement strategies for efficient results; and (3) to personalize search according to user interest.
Diversified query expansion
Search Result Diversification (SRD) aims to select diverse documents from the search results in order to cover as many search intents as possible. For the existing approaches, a prerequisite is that the initial retrieval results contain diverse documents and ensure a good coverage of the query aspects. In practice, however, the initial results often fail to cover some aspects.
In this thesis, we investigate a new approach to SRD by diversifying the query, namely diversified query expansion (DQE). Expansion terms are selected either from a single resource or from multiple resources following the Maximal Marginal Relevance principle. In the first contribution, we propose a new term-level DQE method in which word similarity is determined at the surface (term) level based on the resources.
When different resources are used for the purpose of DQE, they are combined in a uniform way, thus totally ignoring the contribution differences among resources. In practice the usefulness of a resource greatly changes depending on the query. In the second contribution, we propose a new method of query level resource weighting for DQE. Our method is based on a set of features which are integrated into a linear regression model and generates for a resource a number of expansion candidates that is proportional to the weight of that resource.
Existing DQE methods focus on removing the redundancy among selected expansion terms, and no attention has been paid to how well the selected expansion terms actually cover the query aspects. Consequently, it is not clear how we can cope with the semantic relations between terms. To overcome this drawback, our third contribution in this thesis introduces a novel method for aspect-level DQE which relies on an explicit modeling of query aspects based on embedding. Our method (called latent semantic aspect embedding) is trained in a supervised manner according to the principle that related terms should correspond to the same aspects. This method allows us to select expansion terms at a latent semantic level in order to cover as many aspects of a given query as possible. In addition, this method also incorporates several different external resources to suggest potential expansion terms, and supports several constraints, such as the sparsity constraint.
We evaluate our methods using the ClueWeb09B dataset and three query sets from the TREC Web tracks, and show the usefulness of our proposed approaches compared to the state-of-the-art approaches.
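The Maximal Marginal Relevance principle mentioned in the abstract above can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `mmr_select`, the trade-off parameter `lam`, the toy relevance scores and the sense-based similarity are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List


def mmr_select(
    candidates: List[str],
    relevance: Dict[str, float],
    similarity: Callable[[str, str], float],
    k: int,
    lam: float = 0.5,
) -> List[str]:
    """Greedily pick k expansion terms, trading relevance against redundancy."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(term: str) -> float:
            # Redundancy is the highest similarity to any term already chosen.
            redundancy = max((similarity(term, s) for s in selected), default=0.0)
            return lam * relevance[term] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical candidates for the ambiguous query "java".
SENSE = {"programming": "software", "coding": "software",
         "island": "place", "coffee": "drink"}
RELEVANCE = {"programming": 0.9, "coding": 0.85, "island": 0.6, "coffee": 0.5}


def toy_similarity(a: str, b: str) -> float:
    # Assumed similarity: high when two terms share a sense, low otherwise.
    return 0.9 if SENSE[a] == SENSE[b] else 0.1


expansion = mmr_select(list(RELEVANCE), RELEVANCE, toy_similarity, k=3)
print(expansion)  # ['programming', 'island', 'coffee']
```

With `lam=1.0` the selection degenerates to a pure relevance ranking and picks both software-related terms first; lowering `lam` pushes the selection towards terms from different senses.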
A user-centred approach to information retrieval
A user model is a fundamental component in user-centred information retrieval systems. It enables personalization of a user's search experience. The development of such a model involves three phases: collecting information about each user, representing such information, and integrating the model into a retrieval application. Progress in this area is typically met with privacy and scalability challenges that hinder the ability to synthesize collective knowledge from each user's search behaviour. In this thesis, I propose a framework that addresses each of these three phases. The proposed framework is based on social role theory from the social science literature and at the centre of this theory is the concept of a social position. A social position is a label for a group of users with similar behavioural patterns. Examples of such positions are traveller, patient, movie fan, and computer scientist. In this thesis, a social position acts as a label for users who are expected to have similar interests. The proposed framework does not require real users' data; rather it uses the web as a resource to model users.
The proposed framework offers a data-driven and modular design for each of the three phases of building a user model. First, I present an approach to identify social positions from natural language sentences. I formulate this task as a binary classification task and develop a method to enumerate candidate social positions. The proposed classifier achieves an accuracy score of 85.8%, which indicates that social positions can be identified with good accuracy. Through an inter-annotator agreement study, I further show a reasonable level of agreement between users when identifying social positions.
Second, I introduce a novel topic modelling-based approach to represent each social position as a multinomial distribution over words. This approach estimates a topic from a document collection for each position. To construct such a collection for a particular position, I propose a seeding algorithm that extracts a set of terms relevant to the social position. Coherence-based evaluation shows that the proposed approach learns significantly more coherent representations when compared with a relevance modelling baseline.
Third, I present a diversification approach based on the proposed framework. Diversification algorithms aim to return a result list for a search query that would potentially satisfy users with diverse information needs. I propose to identify social positions that are relevant to a search query. These positions act as an implicit representation of the many possible interpretations of the search query. Then, relevant positions are provided to a diversification technique that proportionally diversifies results based on each social position's importance. I evaluate my approach using four test collections provided by the diversity task of the Text REtrieval Conference (TREC) web tracks for 2009, 2010, 2011, and 2012. Results demonstrate that my proposed diversification approach is effective and provides statistically significant improvements over various implicit diversification approaches.
Fourth, I introduce a session-based search system under the framework of learning to rank. Such a system aims to improve the retrieval performance for a search query using previous user interactions during the search session. I present a method to match a search session to its most relevant social positions based on the session's interaction data. I then suggest identifying related sessions from query logs that are likely to be issued by users with similar information needs. Novel learning features are then estimated from the session's social positions, related sessions, and interaction data. I evaluate the proposed system using four test collections from the TREC session track. This approach achieves state-of-the-art results compared with effective session-based search systems. I demonstrate that such a strong performance is mainly attributed to features that are derived from social positions' data.
Re-ranking the Search Results for Users with Time-periodic Intents
This paper investigates the time of search as a feature for improving the personalization of information retrieval systems. In general, users issue short, ambiguous queries that can refer to different topics of interest. Although personalized information retrieval systems account for a user's topics of interest, they do not consider whether those topics are time-periodic. The same ranked list cannot satisfy a user's search intent every time. This paper proposes a solution for re-ranking the search results of time-sensitive ambiguous queries. An algorithm, "HighTime", is presented that disambiguates time-sensitive ambiguous queries and re-ranks the default Google results using a time-sensitive user profile. The algorithm is evaluated using two comparative measures, MAP and NDCG.
Results from user experiments showed that re-ranking search results with HighTime is effective in presenting relevant results to users.
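For readers unfamiliar with the two measures named above, the following sketch shows how MAP and NDCG are commonly computed for ranked lists. It says nothing about how the paper itself computed them: the function names are illustrative, and average precision here assumes that every relevant document for a query appears somewhere in the ranked list.

```python
import math
from typing import List, Sequence


def average_precision(rels: Sequence[int]) -> float:
    """AP for one ranked list of binary relevance labels (1 = relevant)."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    # Assumption: all relevant documents are present in the list.
    return total / hits if hits else 0.0


def mean_average_precision(runs: List[Sequence[int]]) -> float:
    """MAP is the mean of AP over all queries."""
    return sum(average_precision(r) for r in runs) / len(runs) if runs else 0.0


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains: Sequence[float]) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of a default ranking and a re-ranked list.
default_ranking = [1, 3, 0, 2]
reranked = [3, 2, 1, 0]
print(ndcg(default_ranking), ndcg(reranked))
```

A re-ranking that moves highly relevant results towards the top raises NDCG, which is the effect a time-sensitive user profile is meant to produce.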
Recuperação multimodal e interativa de informação orientada por diversidade
Advisor: Ricardo da Silva Torres. Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Computação.
Abstract: Information retrieval methods, especially considering multimedia data, have evolved towards the integration of multiple sources of evidence in the analysis of the relevance of items considering a given user search task.
In this context, to attenuate the semantic gap between low-level features extracted from the content of the digital objects and high-level semantic concepts (objects, categories, etc.) and to make the systems adaptive to different user needs, interactive models have brought the user closer to the retrieval loop, allowing user-system interaction mainly through implicit or explicit relevance feedback. Analogously, diversity promotion has emerged as an alternative for tackling ambiguous or underspecified queries. Additionally, several works have addressed the issue of minimizing the user effort required to provide relevance assessments while keeping an acceptable overall effectiveness. This thesis discusses, proposes, and experimentally analyzes multimodal and interactive diversity-oriented information retrieval methods. This work comprehensively covers the interactive information retrieval literature and also discusses recent advances, the major research challenges, and promising research opportunities. We have proposed and evaluated two relevance-diversity trade-off enhancement workflows, which integrate multiple kinds of information from images, such as visual features, textual metadata, geographic information, and user credibility descriptors. In turn, as an integration of interactive retrieval and diversity promotion techniques, for maximizing the coverage of multiple query interpretations/aspects and speeding up the information transfer between the user and the system, we have proposed and evaluated a multimodal learning-to-rank method trained with relevance feedback over diversified results. Our experimental analysis shows that the joint usage of multiple information sources positively impacted the relevance-diversity balancing algorithms. Our results also suggest that the integration of multimodal-relevance-based filtering and reranking was effective in improving result relevance and also boosted diversity promotion methods.
Beyond that, through a thorough experimental analysis, we have investigated several research questions related to the possibility of improving result diversity while keeping or even improving relevance in interactive search sessions. Moreover, we analyze how much the diversification effort affects overall search session results and how different diversification approaches behave for the different data modalities. By analyzing the overall and per-feedback-iteration effectiveness, we show that introducing diversity may harm initial results, whereas it significantly enhances the overall session effectiveness, considering not only relevance and diversity but also how early the user is exposed to the same amount of relevant items and diversity.
Degree: Doutorado em Ciência da Computação (Doctorate in Computer Science). Identifiers and funders as listed: P-4388/2010, 140977/2012-0, CAPES, CNP
Unveiling Black-boxes: Explainable Deep Learning Models for Patent Classification
Recent technological advancements have led to a large number of patents in a
diverse range of domains, making it challenging for human experts to analyze
and manage. State-of-the-art methods for multi-label patent classification rely
on deep neural networks (DNNs), which are complex and often considered
black-boxes due to their opaque decision-making processes. In this paper, we
propose a novel deep explainable patent classification framework by introducing
layer-wise relevance propagation (LRP) to provide human-understandable
explanations for predictions. We train several DNN models, including Bi-LSTM,
CNN, and CNN-BiLSTM, and propagate the predictions backward from the output
layer up to the input layer of the model to identify the relevance of words for
individual predictions. Considering the relevance score, we then generate
explanations by visualizing relevant words for the predicted patent class.
Experimental results on two datasets comprising two-million patent texts
demonstrate high performance in terms of various evaluation measures. The
explanations generated for each prediction highlight important relevant words
that align with the predicted class, making the prediction more understandable.
Explainable systems have the potential to facilitate the adoption of complex
AI-enabled methods for patent classification in real-world applications.
Comment: This is the pre-print of the manuscript submitted to the World
Conference on eXplainable Artificial Intelligence (xAI2023), Lisbon,
Portugal. The published manuscript can be found at
https://doi.org/10.1007/978-3-031-44067-0_2
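The backward propagation of relevance described in the abstract above can be illustrated on a toy example. The sketch below applies the commonly used epsilon rule of layer-wise relevance propagation to a single bias-free dense layer; it is not the paper's Bi-LSTM or CNN pipeline, and the function name `lrp_epsilon`, the weights, and the three "word" activations are hypothetical.

```python
from typing import List


def lrp_epsilon(
    activations: List[float],
    weights: List[List[float]],  # weights[i][j]: input i -> output j
    relevance_out: List[float],
    eps: float = 1e-6,
) -> List[float]:
    """Redistribute output relevance to the inputs of one dense layer (no bias)."""
    n_in, n_out = len(activations), len(relevance_out)
    # Forward pre-activations z_j = sum_i a_i * w_ij.
    z = [sum(activations[i] * weights[i][j] for i in range(n_in))
         for j in range(n_out)]
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        # The epsilon stabiliser keeps the denominator away from zero.
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n_in):
            share = activations[i] * weights[i][j] / denom
            relevance_in[i] += share * relevance_out[j]
    return relevance_in


# Three hypothetical word activations feeding one class score.
word_activations = [1.0, 2.0, 0.0]
layer_weights = [[0.5], [1.0], [3.0]]
class_relevance = [2.5]  # relevance assigned to the predicted class

word_relevance = lrp_epsilon(word_activations, layer_weights, class_relevance)
print(word_relevance)
```

Each input receives relevance in proportion to its contribution to the output, so for small `eps` the input relevances sum to approximately the output relevance. In a multi-layer network the same step would be repeated layer by layer down to the input words.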
Automatic methods for low-cost evaluation and position-aware models for neural information retrieval
An information retrieval (IR) system assists people in consuming huge amounts of data, which makes the evaluation and construction of such systems important. However, there exist two difficulties: the overwhelmingly large number of query-document pairs to judge, making IR evaluation a manually laborious task; and the complicated patterns to model due to the non-symmetric, heterogeneous relationships between a query-document pair, where different interaction patterns such as term dependency and proximity have been demonstrated to be useful, yet are non-trivial for a single IR model to encode. In this thesis we attempt to address both difficulties from the perspectives of IR evaluation and of the retrieval model respectively, by reducing the manual cost with automatic methods, by investigating the usage of crowdsourcing in collecting preference judgments, and by proposing novel neural retrieval models. In particular, to address the large number of query-document pairs in IR evaluation, a low-cost selective labeling method is proposed to pick out a small subset of representative documents for manual judgments in favor of the follow-up prediction for the remaining query-document pairs; furthermore, a language-model based cascade measure framework is developed to evaluate novelty and diversity, utilizing the content of the labeled documents to mitigate incomplete labels. In addition, we also attempt to make preference judgments practically usable by empirically investigating different properties of the judgments when collected via crowdsourcing, and by proposing a novel judgment mechanism that makes a compromise between the judgment quality and the number of judgments. Finally, to model different complicated patterns in a single retrieval model, inspired by the recent advances in deep learning, we develop novel neural IR models to incorporate different patterns like term dependency, query proximity, density of relevance, and query coverage in a single model.
We demonstrate their superior performance through evaluations on different datasets.
A Survey on Intent-based Diversification for Fuzzy Keyword Search
Keyword search is the process of finding important and relevant information in various data repositories. Structured and semi-structured data can be stored precisely, while fully unstructured documents can be annotated and stored in the form of metadata. About half of all web searches are part of an information exploration process. In this paper, earlier works on the semantic meaning of keywords, based on their context in the specified documents, are thoroughly analyzed. In a tree data representation, the nodes are objects and may hold some intention. These nodes act as anchors for a Smallest Lowest Common Ancestor (SLCA) based pruning process. Nodes are clustered based on their features, where a feature is a distinctive attribute: a quality, property or trait of something. Automatic text classification algorithms are the modern way to extract features. Summarization and segmentation produce n-grams of consecutive tokens from various forms of documents. The set of items that describes and summarizes one important aspect of a query is known as a facet. Instead of exact string matching, the new trend is a fuzzy mapping based on semantic correlation, where the correlation is quantified by cosine similarity. Once outliers are detected, the nearest neighbors of the selected points are mapped, with high probability, to the same hash code as the intent nodes. Collectively, these methods retrieve the relevant data and prune out the unnecessary data, and at the same time create a hash signature for the nearest neighbor search. This survey emphasizes the need for a framework for fuzzy keyword search.
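As a small illustration of quantifying correlation by cosine similarity instead of exact string matching, the sketch below compares bag-of-words vectors. The plain token counts are only a stand-in for whatever semantic representation a surveyed method uses, and the names `context_vector`, `best_match` and the threshold value are assumptions.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def context_vector(text: str) -> Counter:
    """Bag-of-words term counts for a piece of text (assumed representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(
    query: str, candidates: List[str], threshold: float = 0.3
) -> Optional[Tuple[str, float]]:
    """Return the most similar candidate, or None if nothing passes the threshold."""
    q = context_vector(query)
    scored = [(c, cosine_similarity(q, context_vector(c))) for c in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


# Hypothetical node labels to match a keyword query against.
nodes = ["laptop battery life", "car battery", "garden hose"]
print(best_match("laptop battery", nodes))
```

The query matches "laptop battery life" even though the strings differ, whereas a query sharing no terms with any node yields no match.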