283 research outputs found

    Doctor of Philosophy

    The explosion of structured Web data (e.g., online databases, Wikipedia infoboxes) creates many opportunities for integrating and querying these data that go far beyond the simple search capabilities provided by search engines. Although much work has been devoted to data integration in the database community, the Web brings new challenges: the Web scale (e.g., the large and growing volume of data) and the heterogeneity of Web data. Because there is so much data, scalable techniques are needed that require little or no manual intervention and that are robust to noisy data. In this dissertation, we propose a new and effective approach for matching Web-form interfaces and for matching multilingual Wikipedia infoboxes. As a further step toward solving these problems, we propose a general prudent schema-matching framework that matches a large number of schemas effectively. Our comprehensive experiments on Web-form interfaces and Wikipedia infoboxes show that it can enable on-the-fly, automatic integration of large collections of structured Web data. Another problem we address in this dissertation is schema discovery. While existing integration approaches assume that the relevant data sources and their schemas have been identified in advance, schemas are not always available for structured Web data. Approaches exist that exploit information in Wikipedia to discover entity types and their associated schemas. However, due to inconsistencies, sparseness, and noise from community contributions, these approaches are error-prone and require substantial human intervention. Given the schema heterogeneity in Wikipedia infoboxes, we developed a new approach that uses the structured information available in infoboxes to cluster similar infoboxes and infer the schemata for entity types. Our approach is unsupervised and resilient to the unpredictable skew in the entity class distribution. Our experiments, using over one hundred thousand infoboxes extracted from Wikipedia, indicate that our approach is effective and produces accurate schemata for Wikipedia entities.
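    The idea of clustering infoboxes by their structured content can be illustrated with a small sketch. This is not the dissertation's algorithm: it assumes, purely for illustration, that each infobox is reduced to its set of attribute names, that similarity is plain Jaccard overlap, and that clusters are formed greedily in a single pass. The function names, the 0.5 threshold, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two attribute-name sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_infoboxes(
    infoboxes: Dict[str, Set[str]], threshold: float = 0.5
) -> List[List[str]]:
    """Greedy single-pass clustering (an assumed, simplified scheme).

    Each infobox joins the first cluster whose accumulated attribute set
    is similar enough; otherwise it starts a new cluster.
    """
    clusters: List[List[str]] = []
    schemas: List[Set[str]] = []  # union of attributes seen in each cluster
    for name, attrs in infoboxes.items():
        for i, schema in enumerate(schemas):
            if jaccard(attrs, schema) >= threshold:
                clusters[i].append(name)
                schema |= attrs  # grow the inferred schema in place
                break
        else:
            clusters.append([name])
            schemas.append(set(attrs))
    return clusters


# Hypothetical toy infoboxes, keyed by article title.
TOY_INFOBOXES = {
    "Inception": {"director", "starring", "runtime", "budget"},
    "Heat": {"director", "starring", "runtime"},
    "Paris": {"country", "population", "mayor"},
    "Lyon": {"country", "population", "area"},
}

if __name__ == "__main__":
    print(cluster_infoboxes(TOY_INFOBOXES))
```

    In this toy setting the union of attribute names in each cluster plays the role of an inferred schema. A greedy pass is order-dependent, so it should be read only as a way to make the notion of "similar infoboxes" concrete.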

    An Unsupervised Approach to Modelling Visual Data

    For very large visual datasets, producing expert ground-truth data for training supervised algorithms can represent a substantial human effort. In these situations there is scope for the use of unsupervised approaches that can model collections of images and automatically summarise their content. The primary motivation for this thesis comes from the problem of labelling large visual datasets of the seafloor obtained by an Autonomous Underwater Vehicle (AUV) for ecological analysis. It is expensive to label this data, as taxonomical experts for the specific region are required, whereas automatically generated summaries can be used to focus the efforts of experts, and inform decisions on additional sampling. The contributions in this thesis arise from modelling this visual data in entirely unsupervised ways to obtain comprehensive visual summaries. Firstly, popular unsupervised image feature learning approaches are adapted to work with large datasets and unsupervised clustering algorithms. Next, using Bayesian models the performance of rudimentary scene clustering is boosted by sharing clusters between multiple related datasets, such as regular photo albums or AUV surveys. These Bayesian scene clustering models are extended to simultaneously cluster sub-image segments to form unsupervised notions of “objects” within scenes. The frequency distribution of these objects within scenes is used as the scene descriptor for simultaneous scene clustering. Finally, this simultaneous clustering model is extended to make use of whole image descriptors, which encode rudimentary spatial information, as well as object frequency distributions to describe scenes. This is achieved by unifying the previously presented Bayesian clustering models, and in so doing rectifies some of their weaknesses and limitations. 
Hence, the final contribution of this thesis is a practical unsupervised algorithm for modelling images from the super-pixel to the album level that is applicable to large datasets.

    Multi-modal modelling with multi-module mechanics: Autonomy in a computational model of language learning

    Semantic Segmentation with Neural Networks in Environment Monitoring

    The Finnish Environment Institute (SYKE) has at least two missions that require surveying large land areas: finding invasive alien species and monitoring the state of Finnish lakes. Various methods to accomplish these tasks exist, but they traditionally rely on manual labor by experts or citizen activism and, as such, do not scale well. This thesis explores the use of computer vision to dramatically improve the scaling of these tasks. Specifically, the aim is to fly a drone over selected areas and use a convolutional neural network architecture (U-net) to create segmentations of the images. The method performs well on select biomass estimation task classes due to sufficiently large datasets and easy-to-distinguish core features of the classes. Furthermore, a qualitative study of the datasets was performed, yielding an estimate for a lower bound on the number of examples needed for a useful dataset. ACM Computing Classification System (CCS): CCS → Computing methodologies → Machine learning → Machine learning approaches → Neural networks.

    Effective Instance Matching for Heterogeneous Structured Data

    One main obstacle to the effective use of structured data is instance matching, where the goal is to find instance representations that refer to the same real-world thing. In this book we investigate how to effectively match heterogeneous structured data. We evaluate our approaches against the latest baselines. The results show advances beyond the state of the art.

    Topic and link detection from multilingual news.

    Huang Ruizhang. Thesis (M.Phil.), Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 110-114). Abstracts in English and Chinese.
    Contents:
    Abstract
    Acknowledgement
    Chapter 1: Introduction
        1.1 The Definition of Topic and Event
        1.2 Event and Topic Discovery
            1.2.1 Problem Definition
            1.2.2 Characteristics of the Discovery Problems
            1.2.3 Our Contributions
        1.3 Story Link Detection
            1.3.1 Problem Definition
            1.3.2 Our Contributions
        1.4 Thesis Organization
    Chapter 2: Literature Review
        2.1 University of Massachusetts (UMass)
            2.1.1 Topic Detection Approach
            2.1.2 Story Link Detection Approach
        2.2 BBN Technologies
        2.3 IBM Research Center
        2.4 Carnegie Mellon University (CMU)
            2.4.1 Topic Detection Approach
            2.4.2 Story Link Detection Approach
        2.5 National Taiwan University (NTU)
            2.5.1 Topic Detection Approach
            2.5.2 Story Link Detection Approach
    Chapter 3: System Overview
        3.1 News Sources
        3.2 Story Preprocessing
        3.3 Information Extraction
        3.4 Gloss Translation
        3.5 Term Weight Calculation
        3.6 Event and Topic Discovery
        3.7 Story Link Detection
    Chapter 4: Event and Topic Discovery
        4.1 Overview of Event and Topic Discovery
        4.2 Event Discovery Component
            4.2.1 Overview of Event Discovery Algorithm
            4.2.2 Similarity Calculation
            4.2.3 Story and Event Combination
            4.2.4 Event Discovery Output
        4.3 Topic Discovery Component
            4.3.1 Overview of Topic Discovery Algorithm
            4.3.2 Relevance Model
            4.3.3 Event and Topic Combination
            4.3.4 Topic Discovery Output
    Chapter 5: Event and Topic Discovery Experimental Results
        5.1 Testing Corpus
        5.2 Evaluation Methodology
        5.3 Experimental Results on Event Discovery
            5.3.1 Parameter Tuning
            5.3.2 Event Discovery Result
        5.4 Experimental Results on Topic Discovery
            5.4.1 Parameter Tuning
            5.4.2 Topic Discovery Results
    Chapter 6: Story Link Detection
        6.1 Topic Types
        6.2 Overview of Link Detection Component
        6.3 Automatic Topic Type Categorization
            6.3.1 Training Data Preparation
            6.3.2 Feature Selection
            6.3.3 Training and Tuning Categorization Model
        6.4 Link Detection Algorithm
            6.4.1 Story Component Weight
            6.4.2 Story Link Similarity Calculation
        6.5 Story Link Detection Output
    Chapter 7: Link Detection Experimental Results
        7.1 Testing Corpus
        7.2 Topic Type Categorization Result
        7.3 Link Detection Evaluation Methodology
        7.4 Experimental Results on Link Detection
            7.4.1 Language Normalization Factor Tuning
            7.4.2 Link Detection Performance
            7.4.3 Link Detection Performance Breakdown
    Chapter 8: Conclusions and Future Work
        8.1 Conclusions
        8.2 Future Work
    Chapter A: List of Topic Title Annotated for TDT3 Corpus by LDC
    Chapter B: List of Manually Annotated Events for TDT3 Corpus
    Bibliography

    Knowledge aggregation in people recommender systems : matching skills to tasks

    People recommender systems (PRS) are a special type of recommender system (RS). They are often adopted to identify people capable of performing a task. Recommending people poses several challenges not exhibited in traditional RS. Elements such as availability, overload, unresponsiveness, and bad recommendations can have adverse effects. This thesis explores how people’s preferences can be elicited for single-event matchmaking under uncertainty and how to align them with appropriate tasks. Different methodologies are introduced to profile people, each based on the nature of the information from which the profile was obtained. These methodologies are developed into three use cases to illustrate the challenges of PRS and the steps taken to address them. Each one emphasizes the priorities of the matching process and the constraints under which these recommendations are made. First, multi-criteria profiles are derived entirely from heterogeneous sources in an implicit manner, characterizing users from multiple perspectives and multi-dimensional points of view without influence from the user. The profiles are applied to the conference reviewer assignment problem. Attention is given to distributing people across items in order to reduce the potential overloading of a person and the neglect or rejection of a task. Second, people’s areas of interest are inferred from their resumes and expressed in terms of their uncertainty, avoiding explicit elicitation from an individual or outsider. The profile is applied to a personnel selection problem where emphasis is placed on the preferences of the candidate, leading to an asymmetric matching process. Third, profiles are created by integrating implicit information and explicitly stated attributes. A model is developed to classify citizens according to their lifestyles, which maintains the original information in the data set throughout the cluster formation. These use cases serve as pilot tests for generalization to real-life implementations. 
Areas for future application are discussed from new perspectives.

    Improved Coreference Resolution Using Cognitive Insights

    Coreference resolution is the task of extracting referential expressions, or mentions, in text and clustering these by the entity or concept they refer to. The sustained research interest in the task reflects the richness of reference expression usage in natural language and the difficulty of encoding insights from linguistic and cognitive theories effectively. In this thesis, we design and implement LIMERIC, a state-of-the-art coreference resolution engine. LIMERIC naturally incorporates both non-local decoding and entity-level modelling to achieve the highly competitive benchmark performance of 64.22% and 59.99% on the CoNLL-2012 benchmark with a simple model and a baseline feature set. As well as strong performance, a key contribution of this work is a reconceptualisation of the coreference task. We draw an analogy between shift-reduce parsing and coreference resolution to develop an algorithm which naturally mimics cognitive models of human discourse processing. In our feature development work, we leverage insights from cognitive theories to improve our modelling. Each contribution achieves statistically significant improvements, and together they sum to gains of 1.65% and 1.66% on the CoNLL-2012 benchmark, yielding performance values of 65.76% and 61.27%. For each novel feature we propose, we contribute an accompanying analysis so as to better understand how cognitive theories apply to real language data. LIMERIC is at once a platform for exploring cognitive insights into coreference and a viable alternative to current systems. We are excited by the promise of incorporating our and further cognitive insights into more complex frameworks, since this has the potential to improve both the performance of computational models and our understanding of the mechanisms underpinning human reference resolution.

    Unsupervised Recognition of Motion Verbs Metaphoricity in Atypical Political Dialogues

    This thesis deals with the unsupervised recognition of the novel metaphorical use of lexical items in dialogical, naturally occurring political texts without recourse to task-specific hand-crafted knowledge. The focus of metaphorical analysis is the class of verbs of motion identified by Beth Levin. These lexical items are investigated in the atypical political genre of the White House Press Briefings due to their role in the communication strategies deployed in public and political discourse. The Computational White House Press Briefings (CompWHoB) corpus, a large resource developed as one of the main objectives of the present work, is used for the extraction of the press briefings including the lexical items under analysis. The metaphor recognition of the motion verbs is addressed by employing unsupervised techniques whose theoretical foundations primarily lie in the Distributional Hypothesis, i.e. word embeddings and topic models. Three algorithms are developed for the task, combining the Word2Vec and the Latent Dirichlet Allocation models, and based on two approaches representing their foundational theoretical framework. The first one is defined as "local" and leverages the syntactic relation of the verb of motion with its direct object for the detection of metaphoricity. The second one, termed "global", drifts away from the use of syntactic knowledge as a feature of the system, hence only using the information inferred from the discourse context. The three systems and their corresponding approaches are evaluated against 1220 instances of verbs of motion annotated by human judges according to their metaphoricity. Results show that the global approach performs poorly compared to the other two models, which also implement the local approach, leading to the conclusion that a syntax-agnostic system is still far from reaching a significant performance. The evaluation of the local approach instead yields promising results, proving the importance of endowing the machine with syntactic knowledge, as also confirmed by a qualitative analysis of the influence of the linguistic properties of metaphorical utterances.
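    One way a "local" signal of this kind might be computed is sketched below. It is an illustration only, not one of the thesis's three algorithms: it assumes that a low cosine similarity between the embedding of a motion verb and the embedding of its direct object hints at a metaphorical use. The three-dimensional vectors, the 0.3 threshold, and the function names are hypothetical. In practice the vectors would come from a trained embedding model such as Word2Vec.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def looks_metaphorical(
    verb: str,
    direct_object: str,
    vectors: Dict[str, Sequence[float]],
    threshold: float = 0.3,  # hypothetical cut-off, would need tuning
) -> bool:
    """Flag a verb/object pair as possibly metaphorical when the two
    embeddings are dissimilar (an assumed heuristic, not the thesis's)."""
    return cosine(vectors[verb], vectors[direct_object]) < threshold


# Hand-made toy embeddings, chosen only to make the example readable.
TOY_VECTORS = {
    "drive": [0.9, 0.1, 0.0],
    "car": [0.8, 0.2, 0.1],
    "debate": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    print(looks_metaphorical("drive", "car", TOY_VECTORS))
    print(looks_metaphorical("drive", "debate", TOY_VECTORS))
```

    With these toy vectors, "drive the car" falls above the threshold and "drive the debate" falls below it. The sketch relies on the verb and its direct object already being identified, which is the syntactic knowledge the local approach depends on.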