142 research outputs found
Cross-view Embeddings for Information Retrieval
In this dissertation, we deal with the cross-view tasks related to information retrieval
using embedding methods. We study existing methodologies and propose new methods to overcome their limitations. We formally introduce the concept of mixed-script
IR, which deals with the challenges faced by an IR system when a language is written
in different scripts because of various technological and sociological factors. Mixed-script terms are represented by a small and finite feature space comprised of character
n-grams. We propose the cross-view autoencoder (CAE) to model such terms in an
abstract space and CAE provides the state-of-the-art performance.
We study a wide variety of models for cross-language information retrieval (CLIR)
and propose a model based on compositional neural networks (XCNN) which overcomes the limitations of the existing methods and achieves the best results for many
CLIR tasks such as ad-hoc retrieval, parallel sentence retrieval and cross-language
plagiarism detection. We empirically test the proposed models for these tasks on
publicly available datasets and present the results with analyses.
In this dissertation, we also explore an effective method to incorporate contextual
similarity for lexical selection in machine translation. Concretely, we investigate a
feature based on context available in source sentence calculated using deep autoencoders. The proposed feature exhibits statistically significant improvements over the
strong baselines for English-to-Spanish and English-to-Hindi translation tasks.
Finally, we explore the the methods to evaluate the quality of autoencoder generated representations of text data and analyse its architectural properties. For this,
we propose two metrics based on reconstruction capabilities of the autoencoders:
structure preservation index (SPI) and similarity accumulation index (SAI). We also
introduce a concept of critical bottleneck dimensionality (CBD) below which the
structural information is lost and present analyses linking CBD and language perplexity.En esta disertación estudiamos problemas de vistas-múltiples relacionados con la recuperación de información utilizando técnicas de representación en espacios de baja dimensionalidad. Estudiamos las técnicas existentes y proponemos nuevas técnicas para solventar algunas de las limitaciones existentes. Presentamos formalmente el concepto de recuperación de información con escritura mixta, el cual trata las dificultades de los sistemas de recuperación de información cuando los textos contienen escrituras en distintos alfabetos debido a razones tecnológicas y socioculturales. Las palabras en escritura mixta son representadas en un espacio de caracterÃsticas finito y reducido, compuesto por n-gramas de caracteres. Proponemos los auto-codificadores de vistas-múltiples (CAE, por sus siglas en inglés) para modelar dichas palabras en un espacio abstracto, y esta técnica produce resultados de vanguardia.
En este sentido, estudiamos varios modelos para la recuperación de información entre lenguas diferentes (CLIR, por sus siglas en inglés) y proponemos un modelo basado en redes neuronales composicionales (XCNN, por sus siglas en inglés), el cual supera las limitaciones de los métodos existentes. El método de XCNN propuesto produce mejores resultados en diferentes tareas de CLIR tales como la recuperación de información ad-hoc, la identificación de oraciones equivalentes en lenguas distintas y la detección de plagio entre lenguas diferentes. Para tal efecto, realizamos pruebas experimentales para dichas tareas sobre conjuntos de datos disponibles públicamente, presentando los resultados y análisis correspondientes.
En esta disertación, también exploramos un método eficiente para utilizar similitud semántica de contextos en el proceso de selección léxica en traducción automática. EspecÃficamente, proponemos caracterÃsticas extraÃdas de los contextos disponibles en las oraciones fuentes mediante el uso de auto-codificadores. El uso de las caracterÃsticas propuestas demuestra mejoras estadÃsticamente significativas sobre sistemas de traducción robustos para las tareas de traducción entre inglés y español, e inglés e hindú.
Finalmente, exploramos métodos para evaluar la calidad de las representaciones de datos de texto generadas por los auto-codificadores, a la vez que analizamos las propiedades de sus arquitecturas. Como resultado, proponemos dos nuevas métricas para cuantificar la calidad de las reconstrucciones generadas por los auto-codificadores: el Ãndice de preservación de estructura (SPI, por sus siglas en inglés) y el Ãndice de acumulación de similitud (SAI, por sus siglas en inglés). También presentamos el concepto de dimensión crÃtica de cuello de botella (CBD, por sus siglas en inglés), por debajo de la cual la información estructural se deteriora. Mostramos que, interesantemente, la CBD está relacionada con la perplejidad de la lengua.En aquesta dissertació estudiem els problemes de vistes-múltiples relacionats amb la recuperació d'informació utilitzant tècniques de representació en espais de baixa dimensionalitat. Estudiem les tècniques existents i en proposem unes de noves per solucionar algunes de les limitacions existents. Presentem formalment el concepte de recuperació d'informació amb escriptura mixta, el qual tracta les dificultats dels sistemes de recuperació d'informació quan els textos contenen escriptures en diferents alfabets per motius tecnològics i socioculturals. Les paraules en escriptura mixta són representades en un espai de caracterÃstiques finit i reduït, composat per n-grames de carà cters. Proposem els auto-codificadors de vistes-múltiples (CAE, per les seves sigles en anglès) per modelar aquestes paraules en un espai abstracte, i aquesta tècnica produeix resultats d'avantguarda.
En aquest sentit, estudiem diversos models per a la recuperació d'informació entre llengües diferents (CLIR , per les sevas sigles en anglès) i proposem un model basat en xarxes neuronals composicionals (XCNN, per les sevas sigles en anglès), el qual supera les limitacions dels mètodes existents. El mètode de XCNN proposat produeix millors resultats en diferents tasques de CLIR com ara la recuperació d'informació ad-hoc, la identificació d'oracions equivalents en llengües diferents, i la detecció de plagi entre llengües diferents. Per a tal efecte, realitzem proves experimentals per aquestes tasques sobre conjunts de dades disponibles públicament, presentant els resultats i anà lisis corresponents.
En aquesta dissertació, també explorem un mètode eficient per utilitzar similitud semà ntica de contextos en el procés de selecció lèxica en traducció automà tica. EspecÃficament, proposem caracterÃstiques extretes dels contextos disponibles a les oracions fonts mitjançant l'ús d'auto-codificadors. L'ús de les caracterÃstiques proposades demostra millores estadÃsticament significatives sobre sistemes de traducció robustos per a les tasques de traducció entre anglès i espanyol, i anglès i hindú.
Finalment, explorem mètodes per avaluar la qualitat de les representacions de dades de text generades pels auto-codificadors, alhora que analitzem les propietats de les seves arquitectures. Com a resultat, proposem dues noves mètriques per quantificar la qualitat de les reconstruccions generades pels auto-codificadors: l'Ãndex de preservació d'estructura (SCI, per les seves sigles en anglès) i l'Ãndex d'acumulació de similitud (SAI, per les seves sigles en anglès). També presentem el concepte de dimensió crÃtica de coll d'ampolla (CBD, per les seves sigles en anglès), per sota de la qual la informació estructural es deteriora. Mostrem que, de manera interessant, la CBD està relacionada amb la perplexitat de la llengua.Gupta, PA. (2017). Cross-view Embeddings for Information Retrieval [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/78457TESI
Improving Distributed Representations of Tweets - Present and Future
Unsupervised representation learning for tweets is an important research
field which helps in solving several business applications such as sentiment
analysis, hashtag prediction, paraphrase detection and microblog ranking. A
good tweet representation learning model must handle the idiosyncratic nature
of tweets which poses several challenges such as short length, informal words,
unusual grammar and misspellings. However, there is a lack of prior work which
surveys the representation learning models with a focus on tweets. In this
work, we organize the models based on its objective function which aids the
understanding of the literature. We also provide interesting future directions,
which we believe are fruitful in advancing this field by building high-quality
tweet representation learning models.Comment: To be presented in Student Research Workshop (SRW) at ACL 201
Improving Distributed Representations of Tweets - Present and Future
Unsupervised representation learning for tweets is an important research
field which helps in solving several business applications such as sentiment
analysis, hashtag prediction, paraphrase detection and microblog ranking. A
good tweet representation learning model must handle the idiosyncratic nature
of tweets which poses several challenges such as short length, informal words,
unusual grammar and misspellings. However, there is a lack of prior work which
surveys the representation learning models with a focus on tweets. In this
work, we organize the models based on its objective function which aids the
understanding of the literature. We also provide interesting future directions,
which we believe are fruitful in advancing this field by building high-quality
tweet representation learning models.Comment: To be presented in Student Research Workshop (SRW) at ACL 201
The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives
The Archive Query Log (AQL) is a previously unused, comprehensive query log
collected at the Internet Archive over the last 25 years. Its first version
includes 356 million queries, 166 million search result pages, and 1.7 billion
search results across 550 search providers. Although many query logs have been
studied in the literature, the search providers that own them generally do not
publish their logs to protect user privacy and vital business data. Of the few
query logs publicly available, none combines size, scope, and diversity. The
AQL is the first to do so, enabling research on new retrieval models and
(diachronic) search engine analyses. Provided in a privacy-preserving manner,
it promotes open research as well as more transparency and accountability in
the search industry.Comment: SIGIR 2023 resource paper, 13 page
Recommended from our members
A user-centred approach to information retrieval
A user model is a fundamental component in user-centred information retrieval systems. It enables personalization of a user's search experience. The development of such a model involves three phases: collecting information about each user, representing such information, and integrating the model into a retrieval application. Progress in this area is typically met with privacy and scalability challenges that hinder the ability to synthesize collective knowledge from each user's search behaviour. In this thesis, I propose a framework that addresses each of these three phases. The proposed framework is based on social role theory from the social science literature and at the centre of this theory is the concept of a social position. A social position is a label for a group of users with similar behavioural patterns. Examples of such positions are traveller, patient, movie fan, and computer scientist. In this thesis, a social position acts as a label for users who are expected to have similar interests. The proposed framework does not require real users' data; rather it uses the web as a resource to model users.
The proposed framework offers a data-driven and modular design for each of the three phases of building a user model. First, I present an approach to identify social positions from natural language sentences. I formulate this task as a binary classification task and develop a method to enumerate candidate social positions. The proposed classifier achieves an accuracy score of 85.8%, which indicates that social positions can be identified with good accuracy. Through an inter-annotator agreement study, I further show a reasonable level of agreement between users when identifying social positions.
Second, I introduce a novel topic modelling-based approach to represent each social position as a multinomial distribution over words. This approach estimates a topic from a document collection for each position. To construct such a collection for a particular position, I propose a seeding algorithm that extracts a set of terms relevant to the social position. Coherence-based evaluation shows that the proposed approach learns significantly more coherent representations when compared with a relevance modelling baseline.
Third, I present a diversification approach based on the proposed framework. Diversification algorithms aim to return a result list for a search query that would potentially satisfy users with diverse information needs. I propose to identify social positions that are relevant to a search query. These positions act as an implicit representation of the many possible interpretations of the search query. Then, relevant positions are provided to a diversification technique that proportionally diversifies results based on each social position's importance. I evaluate my approach using four test collections provided by the diversity task of the Text REtrieval Conference (TREC) web tracks for 2009, 2010, 2011, and 2012. Results demonstrate that my proposed diversification approach is effective and provides statistically significant improvements over various implicit diversification approaches.
Fourth, I introduce a session-based search system under the framework of learning to rank. Such a system aims to improve the retrieval performance for a search query using previous user interactions during the search session. I present a method to match a search session to its most relevant social positions based on the session's interaction data. I then suggest identifying related sessions from query logs that are likely to be issued by users with similar information needs. Novel learning features are then estimated from the session's social positions, related sessions, and interaction data. I evaluate the proposed system using four test collections from the TREC session track. This approach achieves state-of-the-art results compared with effective session-based search systems. I demonstrate that such a strong performance is mainly attributed to features that are derived from social positions' data
- …