Text-based document geolocation and its application to the digital humanities
This dissertation investigates automatic geolocation of documents (i.e. identification of their location, expressed as latitude/longitude coordinates), based on the text of those documents rather than metadata. I assert that such geolocation can be performed using text alone, with sufficient accuracy for use in real-world applications. Although in some corpora metadata is found in abundance (e.g. home location, time zone, friends, followers, etc. in Twitter), it is lacking in others, such as many corpora of primary-source documents in the digital humanities, an area to which document geolocation has hardly been applied. To this end, I first develop methods for accurate text-based geolocation and then apply them to newly-annotated corpora in the digital humanities. The geolocation methods I develop use both uniform and adaptive (k-d tree) grids over the Earth’s surface, culminating in a hierarchical logistic-regression-based technique that achieves state-of-the-art results on well-known corpora (Twitter user feeds, Wikipedia articles and Flickr image tags). In the second part of the dissertation I develop a new NLP task, text-based geolocation of historical corpora. Because there are no existing corpora to test on, I create and annotate two new corpora of significantly different natures (a 19th-century travel log and a large set of Civil War archives). I show how my methods produce good geolocation accuracy even given the relatively small amount of annotated data available, which can be further improved using domain adaptation. I then use the predictions on the much larger unannotated portion of the Civil War archives to generate and analyze geographic topic models, showing how they can be mined to produce interesting revelations concerning various Civil War-related subjects.
Finally, I develop a new geolocation technique for text-only corpora involving co-training between document-geolocation and toponym-resolution models, using a gazetteer to inject additional information into the training process. To evaluate this technique I develop a new metric, the closest toponym error distance, on which I show improvements compared with a baseline geolocator.
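The flat uniform-grid variant of this family of methods can be illustrated with a naive-Bayes-style language model per grid cell: assign each training document to the cell containing its coordinates, build an add-alpha-smoothed unigram model per cell, and geolocate a new document by picking the cell whose model best explains its words. This is a simplified sketch with invented toy data, not the hierarchical logistic-regression technique of the dissertation:

```python
import math
from collections import Counter, defaultdict

CELL_DEG = 5.0  # uniform grid cell size in degrees (illustrative choice)

def cell(lat, lon):
    """Map latitude/longitude to a uniform grid cell identifier."""
    return (int(lat // CELL_DEG), int(lon // CELL_DEG))

def train(docs):
    """docs: iterable of (lat, lon, tokens). Returns per-cell word counts."""
    counts = defaultdict(Counter)
    for lat, lon, tokens in docs:
        counts[cell(lat, lon)].update(tokens)
    return counts

def geolocate(tokens, counts, alpha=0.01):
    """Return the cell whose add-alpha unigram model gives the tokens
    the highest log-likelihood."""
    vocab = {w for c in counts.values() for w in c}
    best, best_lp = None, -math.inf
    for c, wc in counts.items():
        total = sum(wc.values()) + alpha * len(vocab)
        lp = sum(math.log((wc[w] + alpha) / total) for w in tokens)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Two toy "documents" with hypothetical coordinates and tokens:
docs = [(30.3, -97.7, ["taco", "breakfast", "austin"]),
        (40.7, -74.0, ["subway", "bagel", "brooklyn"])]
counts = train(docs)
print(geolocate(["austin", "taco"], counts))  # cell containing (30.3, -97.7)
```

The adaptive k-d tree variant replaces the fixed `CELL_DEG` grid with cells that split wherever training documents are dense, so sparsely covered regions get coarse cells and dense regions get fine ones.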
User Modeling and User Profiling: A Comprehensive Survey
The integration of artificial intelligence (AI) into daily life, particularly
through information retrieval and recommender systems, has necessitated
advanced user modeling and profiling techniques to deliver personalized
experiences. These techniques aim to construct accurate user representations
based on the rich amounts of data generated through interactions with these
systems. This paper presents a comprehensive survey of the current state,
evolution, and future directions of user modeling and profiling research. We
provide a historical overview, tracing the development from early stereotype
models to the latest deep learning techniques, and propose a novel taxonomy
that encompasses all active topics in this research area, including recent
trends. Our survey highlights the paradigm shifts towards more sophisticated
user profiling methods, emphasizing implicit data collection, multi-behavior
modeling, and the integration of graph data structures. We also address the
critical need for privacy-preserving techniques and the push towards
explainability and fairness in user modeling approaches. By examining the
definitions of core terminology, we aim to clarify ambiguities and foster a
clearer understanding of the field by proposing two novel encyclopedic
definitions of the main terms. Furthermore, we explore the application of user
modeling in various domains, such as fake news detection, cybersecurity, and
personalized education. This survey serves as a comprehensive resource for
researchers and practitioners, offering insights into the evolution of user
modeling and profiling and guiding the development of more personalized,
ethical, and effective AI systems.
Evaluating Information Retrieval and Access Tasks
This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today’s smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book. They show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students: anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one.
Multi-Dimensional Joins
We present three novel algorithms for performing multi-dimensional joins and an in-depth survey and analysis of a low-dimensional spatial join. The first algorithm, the Iterative Spatial Join, performs a spatial join on low-dimensional data and is based on a plane-sweep technique. As we show analytically and experimentally, the Iterative Spatial Join performs well when internal memory is limited, compared to competing methods. This suggests that the Iterative Spatial Join would be useful for very large data sets or in situations where internal memory is a shared resource and is therefore limited, as in today's database engines, which share internal memory among several queries. Furthermore, the performance of the Iterative Spatial Join is predictable, and it has no parameters that need to be tuned, unlike other algorithms. The second algorithm, the Quickjoin algorithm, performs a higher-dimensional similarity join in which pairs of objects that lie within a certain distance epsilon of each other are reported. The Quickjoin algorithm overcomes drawbacks of competing methods, such as requiring embedding methods on the data first or using multi-dimensional indices, which limit the ability to discriminate between objects in each dimension, thereby degrading performance. A formal analysis is provided of the Quickjoin method, and experiments show that it significantly outperforms competing methods. The third algorithm adapts incremental join techniques to improve the speed of calculating the Hausdorff distance, which is used in applications such as image matching, image analysis, and surface approximation. The nearest neighbor incremental join technique for indices based on hierarchical containment uses a priority queue of index node pairs and bounds on the distance values between pairs, both of which need to be modified in order to calculate the Hausdorff distance. Results of experiments are described that confirm the performance improvement. Finally, a survey is provided which, instead of just summarizing the literature and presenting each technique in its entirety, describes distinct components of the different techniques and decomposes each technique into an overall framework for performing a spatial join.
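The two problems the second and third algorithms accelerate can each be stated naively in a few lines. The brute-force sketch below defines the epsilon similarity join and the Hausdorff distance; it is a problem statement, not the Quickjoin or incremental-join solutions described above:

```python
import math

def epsilon_join(a, b, eps):
    """Report all pairs (p, q) with d(p, q) <= eps. Brute force: O(|A| * |B|)."""
    return [(p, q) for p in a for q in b if math.dist(p, q) <= eps]

def directed_hausdorff(a, b):
    """h(A, B): the largest distance from any point of A to its nearest
    neighbor in B."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Undirected Hausdorff distance: H(A, B) = max(h(A, B), h(B, A))."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 1.0), (1.0, 1.0)]
print(epsilon_join(A, B, 1.0))  # [((0.0, 0.0), (0.0, 1.0)), ((1.0, 0.0), (1.0, 1.0))]
print(hausdorff(A, B))          # 1.0
```

Quickjoin reports the same pair set as `epsilon_join` while avoiding the quadratic scan, and the incremental-join adaptation computes `hausdorff` without evaluating every point-to-set distance, by pruning index node pairs via the priority queue and distance bounds mentioned above.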
Joint Discourse-aware Concept Disambiguation and Clustering
This thesis addresses the tasks of concept disambiguation and clustering. Concept disambiguation is the task of linking common nouns and proper names in a text – henceforth called mentions – to their corresponding concepts in a predefined inventory. Concept clustering is the task of clustering mentions, so that all mentions in one cluster denote the same concept. In this thesis, we investigate concept disambiguation and clustering from a discourse perspective and propose a discourse-aware approach for joint concept disambiguation and clustering in the framework of Markov logic. The contributions of this thesis are fourfold:
Joint Concept Disambiguation and Clustering. In previous approaches, concept disambiguation and concept clustering have been considered as two separate tasks (Schütze, 1998; Ji & Grishman, 2011). We analyze the relationship between concept disambiguation and concept clustering and argue that these two tasks can mutually support each other. We propose the – to our knowledge – first joint approach for concept disambiguation and clustering.
Discourse-Aware Concept Disambiguation. One of the determining factors for concept disambiguation and clustering is the context definition. Most previous approaches use the same context definition for all mentions (Milne & Witten, 2008b; Kulkarni et al., 2009; Ratinov et al., 2011, inter alia). We approach the question of which context is relevant for disambiguating a mention from a discourse perspective and argue that different mentions require different notions of context. The context that is relevant to disambiguate a mention depends on its embedding into the discourse; however, how a mention is embedded into the discourse depends on its denoted concept. Hence, the identification of the denoted concept and of the relevant context mutually depend on each other. We propose a binwise approach with three different context definitions and model the selection of the context definition and the disambiguation jointly.
Modeling Interdependencies with Markov Logic. To model the interdependencies between concept disambiguation and concept clustering as well as the interdependencies between the context definition and the disambiguation, we use Markov logic (Domingos & Lowd, 2009). Markov logic combines first order logic with probabilities and allows us to concisely formalize these interdependencies. We investigate how we can balance between linguistic appropriateness and time efficiency and propose a hybrid approach that combines joint inference with aggregation techniques.
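The core idea of Markov logic, that a possible world's probability is proportional to the exponentiated sum of the weights of the ground formulas it satisfies, can be illustrated with a toy two-atom example. The atoms, formulas, and weights below are invented for illustration and are not the thesis's actual model:

```python
import itertools
import math

# Toy Markov logic network over two ground atoms (names are illustrative):
#   a = SameConcept(m1, m2), b = SameContext(m1, m2)
# P(world) is proportional to exp(sum of weights of satisfied formulas).
formulas = [
    (1.5, lambda a, b: (not b) or a),  # SameContext(m1,m2) => SameConcept(m1,m2)
    (0.8, lambda a, b: a),             # weak prior in favor of SameConcept(m1,m2)
]

worlds = list(itertools.product([False, True], repeat=2))
scores = {w: math.exp(sum(wt for wt, f in formulas if f(*w))) for w in worlds}
z = sum(scores.values())  # partition function

for (a, b), s in scores.items():
    print(f"SameConcept={a!s:<5} SameContext={b!s:<5} P={s / z:.3f}")
```

Exhaustive enumeration of worlds is only feasible for toy networks; real Markov logic engines ground the first-order formulas lazily and use approximate inference, which is where the balance between linguistic appropriateness and time efficiency discussed above comes in.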
Concept Disambiguation and Clustering beyond English: Multi- and Cross-linguality. Given the vast amount of text written in different languages, the capability to extend an approach to languages other than English is essential. We thus analyze how our approach copes with languages other than English and show that it largely scales across languages, even without retraining.
Our approach is evaluated on multiple data sets originating from different sources (e.g. news, web) and across multiple languages. As an inventory, we use Wikipedia. We compare our approach to other approaches and show that it achieves state-of-the-art results. Furthermore, we show that joint concept disambiguation and clustering, as well as joint context selection and disambiguation, leads to significant improvements ceteris paribus.
Interim research assessment 2003-2005 - Computer Science
This report primarily serves as a source of information for the 2007 Interim Research Assessment Committee for Computer Science at the three technical universities in the Netherlands. The report also provides information for others interested in our research activities
Spatial and temporal resolution of sensor observations
Observation is a core concept of geoinformatics. Observations serve to monitor, model, and simulate phenomena such as climate change, mass movements (e.g., landslides), and demographic change. Resolution is a central property of observations. Using observations of different resolutions leads to (potentially) different decisions, since the resolution of the observations influences which structures can be recognized during data analysis. The main contribution of this work is a theory of the spatial and temporal resolution of observations that is applicable to technical sensors (e.g., a camera) as well as to human sensors. The consistency of the theory was evaluated using the language Haskell, and its practical applicability was illustrated using observations from the web portal Flickr.
Geographic information extraction from texts
A large volume of unstructured texts, containing valuable geographic information, is available online. This information, provided implicitly or explicitly, is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although substantial progress has been achieved in geographic information extraction from texts, unsolved challenges and issues remain, ranging from methods, systems, and data to applications and privacy. Therefore, this workshop provides a timely opportunity to discuss recent advances, new ideas, and concepts, and to identify research gaps in geographic information extraction.