22 research outputs found

    ARCHITECTURE, MODELS, AND ALGORITHMS FOR TEXTUAL SIMILARITY

    Identifying similar pieces of text remains one of the fundamental problems in computational linguistics. This dissertation focuses on the problem of measuring and identifying textual similarity by studying a variety of major tasks that share common properties. It presents our efforts to address seven closely related similarity tasks on over 20 public benchmarks, including paraphrase identification, answer selection for question answering, pairwise learning to rank, monolingual and cross-lingual semantic textual similarity measurement, insight extraction from biomedical literature, and high-performance cross-lingual pattern matching for machine translation on GPUs. We investigate how to make textual similarity measurement more accurate with deep neural networks. Traditional approaches are based either on feature engineering, which leads to disconnected solutions, or on the Siamese architecture, which treats inputs independently and relies on a single representation view and a straightforward similarity comparison. In contrast, we focus on modeling stronger interactions between inputs and develop interaction-based neural modeling that explicitly encodes the alignments of input words or aggregated sentence representations into our models. As a result, our deep neural networks show highly competitive performance on many of the public textual similarity benchmarks we evaluated. Our multi-perspective convolutional neural network (MPCNN) processes input sentences from multiple perspectives with multiple parallel convolutional neural networks and automatically extracts salient sentence-level features at multiple granularities with different types of pooling. Our novel structured similarity layer encourages stronger input interactions by comparing local regions of both sentence representations. This model is the first example of our interaction-based neural modeling. We also provide an attention-based input interaction layer on top of the MPCNN model.
The input interaction layer models a closer relationship between input words by converting two separate sentences into an inter-related sentence pair. This layer uses the attention mechanism in a straightforward way and is another example of our interaction-based neural modeling. We then present our pairwise word interaction (PWI) model with very deep neural networks. This model directly encodes input word interactions with novel pairwise word interaction modeling and a novel similarity focus layer. Its use of a very deep architecture is the first such example in the NLP domain for better textual similarity modeling. Our PWI model outperforms the Siamese architecture and feature engineering approaches on multiple tasks and is another example of our interaction-based neural modeling. We also address the question answering task with a pairwise ranking approach. Unlike the traditional pointwise approach to the task, our pairwise ranking approach uses negative sampling to model interactions between two pairs of question and answer inputs, and then learns a relative order of the pairs to predict which answer is more relevant to the question. We demonstrate its effectiveness against competitive previous pointwise baselines. For the task of insight extraction from biomedical literature, we convert the extraction task into a similarity measurement task and develop neural networks with similarity modeling for better causality/correlation relation extraction. Our approach is novel in that it explicitly models the interactions among named entities, entity relations, and contexts, measures both relational and contextual similarity among them, and finally integrates both similarity evaluations for insight extraction. We also build an end-to-end system to extract insights; human evaluations show that the system extracts insights with high human acceptance accuracy.
Lastly, we explore how to exploit the massive parallelism offered by modern GPUs for high-efficiency pattern matching. We take advantage of GPU hardware advances and develop a massively parallel approach. We first work on phrase-based SMT, where we enable phrase lookup and extraction on suffix arrays to be massively parallelized so that very many queries can be carried out in parallel. We then work on the computationally expensive hierarchical SMT model, which requires matching grammar patterns that contain "gaps". To achieve high efficiency for the similarity identification task on GPUs, we show that developing massively parallel algorithms is the most important way to fully utilize the GPU's raw processing power, and that developing compact data structures helps to lower the GPU's memory latency. Compared to a highly optimized, state-of-the-art multi-threaded CPU implementation, our techniques achieve orders-of-magnitude improvements in throughput.
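
The phrase lookup on suffix arrays mentioned in this abstract can be illustrated with a minimal CPU-only sketch. It is not the dissertation's GPU implementation; the function names `build_suffix_array` and `find_phrase` and the toy corpus are assumptions made for illustration. Each query is an independent pair of binary searches, which is the property that makes large batches of queries suitable for parallel execution.

```python
from typing import List, Sequence


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction (hypothetical helper); adequate for a toy corpus only.
    """
    return sorted(range(len(tokens)), key=lambda i: list(tokens[i:]))


def find_phrase(tokens: Sequence[str], sa: List[int], phrase: Sequence[str]) -> List[int]:
    """Return the sorted corpus positions at which `phrase` occurs."""
    target = list(phrase)
    m = len(target)

    def prefix(i: int) -> List[str]:
        # First m tokens of the i-th smallest suffix.
        return list(tokens[sa[i]:sa[i] + m])

    # Lower bound: first suffix whose m-token prefix is >= the phrase.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < target:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-token prefix is > the phrase.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= target:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


# Toy corpus and a batch of independent queries (illustrative data only).
corpus = "the cat sat on the mat and the cat slept".split()
sa = build_suffix_array(corpus)
queries = [["the", "cat"], ["the"], ["dog"]]
results = [find_phrase(corpus, sa, q) for q in queries]
```

Because no query reads or writes state shared with another query, the list comprehension over `queries` could in principle be replaced by one thread per query.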

    Modelling plant trait variability in changing arid environments

    Communities in arid environments are especially vulnerable to global change because they experience highly unpredictable environmental conditions. The fate of communities in an uncertain future may be elucidated by understanding the drivers of these communities.
The interplay between community drivers may be unravelled by using approaches based on functional traits because traits describe plant strategies and the responses of communities to environmental changes. Furthermore, inter- and intraspecific trait variability provides the necessary cues to identify survival strategies of desert plants under fluctuating environmental conditions. However, studying desert plant communities is challenging due to the spatial and temporal heterogeneity of arid environments. Modelling approaches support and complement empirical trait-based approaches in exploring desert plant communities and their drivers and dynamics in changing arid environments. The overarching aim of this thesis was to explore intra- and inter-specific variability of functional traits in arid environments and to investigate how this variability affects the ability of plants to tolerate aridity stress and succeed in competition with their neighbours. To address this aim, I developed, implemented and analysed a spatially explicit individual- and trait-based simulation model, conducted a simulation experiment, analysed data from model simulations and empirical experiments and synthesized the literature on trait-based models and metamodelling approaches. My research was focused on annual plant communities dominated by the True Rose of Jericho (Anastatica hierochuntica L.) in the Negev desert in Israel. According to the review in chapter 1, trait-based models are a suitable method to predict changes in community patterns under global change and to understand the underlying mechanisms of community assembly and dynamics. Combining modelling and trait-based approaches overcomes technical challenges, scaling problems, and data scarcity. Specifically, a combination of trait-based approaches and individual-based modelling was recommended to simplify the parameterization of models and to capture plant-plant interactions at the individual level, and to explain community dynamics. 
In chapter 2, following the recommendation of chapter 1, the spatially explicit trait- and individual-based ATID-model was developed, implemented and analysed to explore how community dynamics arise from plant traits and the interactions among plants and with their environment. The sensitivity analysis of the model highlighted plant functional traits as key drivers of community dynamics and indicated that environmental factors were less important in the model. The most influential traits included both those involved in plant-plant interactions, such as relative growth rate and maximum biomass, and those that promote tolerance to abiotic stress, such as dormancy and germination probability. Among the environmental factors, the most influential were soil water availability and precipitation. The special role of functional traits in the community dynamics of desert annual plants indicates the importance of trait-based strategies as an adaptation to the stressful arid environment. Chapter 3 addresses the results of a simulation experiment that was conducted with the ATID-model. This experiment explored the influence on community dynamics of the functional traits involved in two survival strategies, defined in the study as ‘protective-competition’ and ‘escape-colonization’ strategies. These strategies differed not only in seed size and the number of seeds, but also in the plant functional traits related to competition and survival, which were highlighted in the sensitivity analysis of the model in chapter 2. Merging the colonization-competition trade-off with escape in time and space into one strategy set provided a more realistic representation of species because the merged strategies relate to the entire plant life cycle. To better understand empirical trait distributions, data on intraspecific trait variability and trait spaces of the desert annual plant A. hierochuntica from a nethouse experiment were analysed in chapter 4.
High salinity had significant effects on the average values of plant functional traits. Additionally, salinity stress affected the intraspecific trait spaces differentially with respect to the environmental conditions of the site of origin. Trait spaces of the populations originating from the same site but exposed to different salt stress levels became more dissimilar with increasing environmental aridity. Thus, intraspecific trait variability and salinity effects turned out to be essential in revealing population- and community-level processes in deserts and should be considered in future versions of the ATID-model. In support of the future development of the ATID-model developed in chapter 2, common metamodel types and the purposes of their usage for individual-based models were reviewed and evaluated in chapter 5. The review considered 40 metamodels applied for sensitivity analysis, calibration, prediction and scaling-up of individual-based models and can be used as a guide for the implementation and validation of metamodels. Overall, this thesis, and particularly the ATID-model analyses, highlights how trait-based modelling approaches can contribute to understanding the interplay between key drivers of desert plant communities in arid environments. The accompanying analysis of the nethouse experiment and critical literature reviews outline future extensions of the model and the ways to overcome the technical challenges and data scarcity identified in this thesis. Moreover, this thesis advocates for more intensive studies of the strategies of desert annual plants to survive in temporally and spatially heterogeneous environments with a focus on plant functional traits. Thus, the modelling framework presented in this thesis provides the basis for future research on the fate of communities in arid environments under global change

    Tune your brown clustering, please

    Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration, and the appropriateness of this configuration has gone largely unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the interplay between the input corpus size, the chosen number of classes, and the quality of the resulting clusters, which has implications for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.
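
As a rough illustration of the mutual-information quantity that Brown clustering is built on, the sketch below computes the average mutual information between adjacent word classes for a fixed word-to-class assignment. It is not the paper's utility model and does not perform the clustering itself; the function name `average_mutual_information` and the toy data are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def average_mutual_information(tokens: Sequence[str], word_to_class: Dict[str, int]) -> float:
    """Average mutual information (in bits) between the classes of adjacent tokens."""
    classes = [word_to_class[w] for w in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())

    # Marginal counts of a class appearing as the left or right element of a bigram.
    left: Counter = Counter()
    right: Counter = Counter()
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n

    ami = 0.0
    for (c1, c2), n in bigrams.items():
        p_joint = n / total
        # p(c1, c2) / (p(c1) * p(c2)) expressed with raw counts.
        ratio = n * total / (left[c1] * right[c2])
        ami += p_joint * math.log2(ratio)
    return ami


# Toy corpus (illustrative only): alternating determiners and nouns.
tokens = "the cat a dog the dog a cat the cat".split()
one_class = {w: 0 for w in set(tokens)}
two_classes = {"the": 0, "a": 0, "cat": 1, "dog": 1}

ami_one = average_mutual_information(tokens, one_class)
ami_two = average_mutual_information(tokens, two_classes)
```

With a single class the measure is zero, since the class sequence carries no information; separating determiners from nouns raises it. Evaluating a quantity like this for different numbers of classes and corpus sizes is one simple way to probe the trade-off the abstract describes.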

    Probabilistic Inference for Phrase-based Machine Translation: A Sampling Approach

    Recent advances in statistical machine translation (SMT) have used dynamic programming (DP) based beam search methods for approximate inference within probabilistic translation models. Despite their success, these methods compromise the probabilistic interpretation of the underlying model, thus limiting the application of probabilistically defined decision rules during training and decoding. As an alternative, in this thesis we propose a novel Monte Carlo sampling approach for theoretically sound approximate probabilistic inference within these models. The distribution we are interested in is the conditional distribution of a log-linear translation model; however, there is often no tractable way of computing the normalisation term of the model. Instead, we develop a Gibbs sampling approach for phrase-based machine translation models that obviates the need to compute this term yet produces samples from the required distribution. We establish that the sampler effectively explores the distribution defined by a phrase-based model by showing that it converges in a reasonable amount of time to the desired distribution, irrespective of initialisation. Empirical evidence is provided to confirm that the sampler can provide accurate estimates of expectations of functions of interest. The mix of high-probability and low-probability derivations obtained through sampling is shown to provide a more accurate estimate of expectations than merely using the n most probable derivations. Subsequently, we show that the sampler provides a tractable solution for finding the maximum probability translation in the model. We also present a unified approach to approximating two additional intractable problems: minimum risk training and minimum Bayes risk decoding. Key to our approach is the use of the sampler, which allows us to explore the entire probability distribution and maintain a strict probabilistic formulation throughout the translation pipeline.
For these tasks, sampling combines the simplicity of n-best list approaches with the extended view of the distribution that lattice-based approaches benefit from, while avoiding the biases associated with beam search. Our approach is theoretically well-motivated and can give better and more stable results than current state-of-the-art methods.
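
The central idea of this abstract, drawing samples from a log-linear model without ever computing its normalisation term, can be sketched on a toy problem. The example below is not a phrase-based translation sampler: it uses three binary variables, and the weights, `score`, `gibbs_expectation` and `exact_expectation` are hypothetical names and values chosen for illustration. Each Gibbs update needs only the ratio of two unnormalised scores, so the normaliser cancels.

```python
import itertools
import math
import random
from typing import Callable, Tuple

State = Tuple[int, ...]

WEIGHTS = (0.8, -0.5, 0.3)  # hypothetical per-variable feature weights
COUPLING = 0.6              # hypothetical weight for agreeing neighbours


def score(x: State) -> float:
    """Unnormalised log-linear score exp(w . f(x))."""
    unary = sum(w * xi for w, xi in zip(WEIGHTS, x))
    pair = COUPLING * sum(1 for a, b in zip(x, x[1:]) if a == b)
    return math.exp(unary + pair)


def gibbs_expectation(f: Callable[[State], float], n_sweeps: int, burn_in: int, seed: int) -> float:
    """Estimate E[f(x)] by Gibbs sampling; the normaliser is never computed."""
    rng = random.Random(seed)
    x = [0] * len(WEIGHTS)
    total, kept = 0.0, 0
    for sweep in range(n_sweeps):
        for i in range(len(x)):
            # Resample x[i] given the rest: only a ratio of scores is needed.
            x[i] = 1
            s1 = score(tuple(x))
            x[i] = 0
            s0 = score(tuple(x))
            x[i] = 1 if rng.random() < s1 / (s0 + s1) else 0
        if sweep >= burn_in:
            total += f(tuple(x))
            kept += 1
    return total / kept


def exact_expectation(f: Callable[[State], float]) -> float:
    """Brute-force reference; feasible only because the toy state space is tiny."""
    states = list(itertools.product((0, 1), repeat=len(WEIGHTS)))
    z = sum(score(s) for s in states)
    return sum(score(s) * f(s) for s in states) / z


estimate = gibbs_expectation(sum, n_sweeps=20000, burn_in=1000, seed=0)
exact = exact_expectation(sum)
```

In the toy setting the brute-force value is available for comparison; in a translation model the state space is far too large for enumeration, which is the situation the sampling approach is meant for.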

    Hierarchical Back-off Modeling of Hiero Grammar based on Non-parametric Bayesian Model

    In hierarchical phrase-based machine translation, a rule table is automatically learned by heuristically extracting synchronous rules from a parallel corpus. As a result, spuriously many rules are extracted, which may include various incorrect rules. The larger rule table incurs more run time for decoding and may result in lower translation quality. To resolve these problems, we propose a hierarchical back-off model for Hiero grammar, an instance of a synchronous context-free grammar (SCFG), on the basis of the hierarchical Pitman-Yor process. The model can extract a compact rule and phrase table without resorting to any heuristics by hierarchically backing off to smaller phrases under SCFG. Inference is efficiently carried out using the two-step synchronous parsing of Xiao et al. (2012) combined with slice sampling. In our experiments, the proposed model achieved higher or at least comparable translation quality against a previous Bayesian model on various language pairs: German/French/Spanish/Japanese-English. When compared against heuristic models, our model achieved comparable translation quality on a full-size German-English language pair in the Europarl v7 corpus with a significantly smaller grammar size: less than 10% of that for the heuristic model.

    2016-2017 University of Dallas Bulletin

    2015-2016 University of Dallas Bulletin

    2017-2018 University of Dallas Bulletin
