Learning the Structure of Variable-Order CRFs: a finite-state perspective
The computational complexity of linear-chain Conditional Random Fields (CRFs) makes it difficult to deal with very large label sets and long-range dependencies. Such situations are not rare: they arise when dealing with morphologically rich languages or joint labelling tasks. We extend recent proposals to consider variable-order CRFs. Using an effective finite-state representation of variable-length dependencies, we propose new ways to perform feature selection at large scale and report experimental results in which we outperform strong baselines on a tagging task.
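For context, the quadratic cost the abstract refers to comes from the forward recursion of a linear-chain CRF. Below is a minimal NumPy sketch (function name and shapes are illustrative, not from the paper) showing why computing the log-partition function scales as O(n * K^2) in the label-set size K, which is what becomes prohibitive for large label sets:

```python
import numpy as np

def crf_log_partition(emissions, transitions):
    """Log-partition function of a linear-chain CRF via the forward
    algorithm. emissions: (n, K) unary scores; transitions: (K, K)
    pairwise scores. Runs in O(n * K^2), the bottleneck targeted by
    variable-order and feature-selection approaches when K is large."""
    n, K = emissions.shape
    alpha = emissions[0]                      # log-potentials of length-1 prefixes
    for t in range(1, n):
        # logsumexp over the previous label for every current label: O(K^2)
        scores = alpha[:, None] + transitions + emissions[t][None, :]
        alpha = np.logaddexp.reduce(scores, axis=0)
    return np.logaddexp.reduce(alpha)
```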
Computing a partition function of a generalized pattern-based energy over a semiring
Valued constraint satisfaction problems with ordered variables (VCSPO) are a special case of Valued CSPs in which variables are totally ordered and soft constraints are imposed on tuples of variables that do not violate the order. We study a restriction of VCSPO in which soft constraints are imposed on segments of adjacent variables and the constraint language $\Gamma$ consists of $\{0,1\}$-valued characteristic functions of predicates. This kind of potential generalizes the so-called pattern-based potentials, which have been applied in many structured prediction tasks.
For a constraint language $\Gamma$ we introduce a closure operator, $\overline{\Gamma^{\cap}} \supseteq \Gamma$, and give examples of constraint languages for which $|\overline{\Gamma^{\cap}}|$ is small. If all predicates in $\Gamma$ are Cartesian products, we show that the minimization of a generalized pattern-based potential (or the computation of its partition function) can be carried out in $\mathcal{O}(|V| \cdot |\overline{\Gamma^{\cap}}|^2 \cdot |D|^2)$ time, where $V$ is the set of variables and $D$ is the domain set. If, additionally, only non-positive weights of constraints are allowed, the complexity of the minimization task drops to $\mathcal{O}(|V| \cdot |\overline{\Gamma^{\cap}}| \cdot |D| \cdot \max_{\rho \in \Gamma} \|\rho\|)$, where $\|\rho\|$ is the arity of $\rho$. For a general language $\Gamma$ and non-positive weights, the minimization task can be carried out in $\mathcal{O}(|V| \cdot |\overline{\Gamma^{\cap}}|^2 \cdot |D|)$ time.
We argue that in many natural cases $\overline{\Gamma^{\cap}}$ is of moderate size, though in the worst case $|\overline{\Gamma^{\cap}}|$ can blow up and depend exponentially on $\max_{\rho \in \Gamma} \|\rho\|$.
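The semiring view in the title can be illustrated on a plain chain model: one dynamic program computes a minimum energy under (min, +) and a partition function under (+, *). The sketch below shows only that generic idea under assumed interfaces; it does not implement the paper's pattern-based closure construction:

```python
import math

def chain_pathsum(unary, pairwise, plus, times, zero):
    """Generic chain dynamic program over a semiring. unary[t][y] and
    pairwise[y1][y2] must already live in the semiring (e.g. exp(-energy)
    for sum-product). The same recursion serves both minimization and
    partition-function computation, in O(|V| * |D|^2) for a chain."""
    alpha = list(unary[0])
    K = len(alpha)
    for t in range(1, len(unary)):
        new = []
        for y2 in range(K):
            acc = zero
            for y1 in range(K):
                acc = plus(acc, times(alpha[y1], pairwise[y1][y2]))
            new.append(times(acc, unary[t][y2]))
        alpha = new
    total = zero
    for v in alpha:
        total = plus(total, v)
    return total

# (min, +) computes the minimum energy; (+, *) computes the partition function.
min_energy = lambda u, p: chain_pathsum(u, p, min, lambda a, b: a + b, math.inf)
partition = lambda u, p: chain_pathsum(u, p, lambda a, b: a + b,
                                       lambda a, b: a * b, 0.0)
```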
Algorithms for Acyclic Weighted Finite-State Automata with Failure Arcs
Weighted finite-state automata (WFSAs) are commonly used in NLP. Failure transitions are a useful extension for compactly representing backoffs or interpolation in $n$-gram models and CRFs, which are special cases of WFSAs. The pathsum in ordinary acyclic WFSAs is efficiently computed by the backward algorithm in time $\mathcal{O}(|E|)$, where $E$ is the set of transitions. However, this does not allow failure transitions, and preprocessing the WFSA to eliminate failure transitions could greatly increase $|E|$. We extend the backward algorithm to handle failure transitions directly. Our approach is efficient when the average state has outgoing arcs for only a small fraction $s \ll 1$ of the alphabet $\Sigma$. We propose an algorithm for general acyclic WFSAs which runs in $\mathcal{O}(|E| + s|\Sigma||Q|T_{\max}\log|\Sigma|)$, where $Q$ is the set of states and $T_{\max}$ is the size of the largest connected component of failure transitions. When the failure transition topology satisfies a condition exemplified by CRFs, the $T_{\max}$ factor can be dropped, and when the weight semiring is a ring, the $\log|\Sigma|$ factor can be dropped. In the latter case (ring-weighted acyclic WFSAs), we also give an alternative algorithm with complexity $\mathcal{O}(|E| + |\Sigma||Q|\pi_{\max})$, where $\pi_{\max}$ is the size of the longest failure path.
Comment: 9 pages, Proceedings of EMNLP 2022
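As background, here is the standard O(|E|) backward algorithm for an acyclic WFSA without failure arcs, i.e., the baseline the paper extends. All names are illustrative, and the paper's actual contribution (handling failure transitions directly) is not shown:

```python
from collections import defaultdict

def backward_pathsum(states_topo, arcs, final_weight, plus, times, zero):
    """Backward algorithm for an acyclic WFSA without failure arcs:
    beta[q] is the total semiring weight of all paths from state q to a
    final state. Runs in O(|E|) given states_topo in topological order.
    arcs: iterable of (src, dst, weight); final_weight: dict q -> weight."""
    out = defaultdict(list)
    for src, dst, w in arcs:
        out[src].append((dst, w))
    beta = {}
    for q in reversed(states_topo):            # visit sinks first
        b = final_weight.get(q, zero)
        for dst, w in out[q]:
            b = plus(b, times(w, beta[dst]))   # beta[dst] already computed
        beta[q] = b
    return beta
```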
Machine Learning Models for Efficient and Robust Natural Language Processing
Natural language processing (NLP) has come of age. For example, semantic role labeling (SRL), which automatically annotates sentences with a labeled graph representing who did what to whom, has in the past ten years seen a nearly 40% reduction in error, bringing it to useful accuracy. As a result, a myriad of practitioners now want to deploy NLP systems on billions of documents across many domains. However, state-of-the-art NLP systems are typically optimized for neither cross-domain robustness nor computational efficiency. In this dissertation I develop machine learning methods to facilitate fast and robust inference across many common NLP tasks.
First, I describe paired learning and inference algorithms for dynamic feature selection, which accelerate inference in linear classifiers, the heart of the fastest NLP models, by 5-10 times. I then present iterated dilated convolutional neural networks (ID-CNNs), a distinct combination of network structure, parameter sharing, and training procedures that increases inference speed by 14-20 times while matching the accuracy of bidirectional LSTMs, the most accurate models for NLP sequence labeling. Finally, I describe linguistically-informed self-attention (LISA), a neural network model that combines multi-head self-attention with multi-task learning to facilitate improved generalization to new domains. We show that incorporating linguistic structure in this way leads to substantial improvements over the previous state-of-the-art (syntax-free) neural network models for SRL, especially when evaluating out of domain. I conclude with a brief discussion of potential future directions stemming from my thesis work.
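To make the ID-CNN mechanism concrete, the sketch below implements a single dilated 1-D convolution in NumPy; stacking such layers with dilations 1, 2, 4, ... grows the receptive field exponentially with depth while each layer stays cheap and parallel. This is a hedged illustration of the mechanism, not the dissertation's code:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Same'-padded 1-D dilated convolution: x is (n, d_in), w is
    (k, d_in, d_out). Output position t reads inputs at offsets
    dilation * (i - k // 2), so stacked dilations 1, 2, 4 with k = 3
    give a 15-token receptive field after only three layers."""
    n, d_in = x.shape
    k, _, d_out = w.shape
    y = np.zeros((n, d_out))
    for t in range(n):
        for i in range(k):
            s = t + dilation * (i - k // 2)
            if 0 <= s < n:
                y[t] += x[s] @ w[i]
        # a real ID-CNN block would add a bias and nonlinearity here
    return y
```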
Integrating learning and search for structured prediction
We are witnessing the rise of the data-driven science paradigm, in which massive amounts of data - much of it collected as a side-effect of ordinary human activity - can be analyzed to make sense of the data and to make useful predictions. To fully realize the promise of this paradigm, we need automated systems that can transform structured inputs to structured outputs. Examples include parsing a sentence, resolving coreferences of entity and event mentions in a piece of text, interpreting a visual scene, and translating from one language to another. Problems such as these are often referred to as structured prediction problems in the machine learning community. These prediction problems pose severe learning and inference challenges due to the huge number of possible outputs.
This thesis explores how to integrate two fundamental branches of Artificial Intelligence, namely learning and search, to solve structured prediction tasks. We study a new framework for structured prediction called HC-Search, which formulates structured prediction as an explicit search process in the combinatorial space of outputs. The system starts from a reasonably good initial solution and performs a heuristic search guided by a learned heuristic function H until a fixed number of alternative solutions has been generated or a fixed time limit is reached. It then evaluates each of these alternatives using a learned cost function C and returns the minimum-cost solution.
There are three key learning challenges in this framework. Search space design: how can we automatically design an efficient search space over structured outputs? Heuristic learning: how can we learn a heuristic function H that effectively guides the search? Cost function learning: how can we learn a cost function C that accurately selects the best output among the candidates? We develop generic solutions for each of these challenges, along with an engineering methodology for applying the framework, and show that HC-Search significantly exceeds the previous best results on a wide range of structured prediction problems.
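The search-then-rerank loop described above fits in a few lines. The sketch below uses a greedy variant with hypothetical interfaces (`initial`, `successors`, `H`, `C` are assumptions, not the thesis's API), since the thesis explores richer search spaces and strategies:

```python
def hc_search(x, initial, successors, H, C, budget=100):
    """Sketch of HC-Search-style inference: expand outputs guided by a
    learned heuristic H, then rerank everything visited with a learned
    cost function C. All callables are assumed interfaces."""
    y = initial(x)
    visited = [y]
    for _ in range(budget):                         # fixed expansion budget
        candidates = successors(x, y)
        if not candidates:
            break
        y = min(candidates, key=lambda c: H(x, c))  # follow learned heuristic H
        visited.append(y)
    return min(visited, key=lambda c: C(x, c))      # select by learned cost C
```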
Deep Energy-Based Models for Structured Prediction
We introduce structured prediction energy networks (SPENs), a flexible framework for structured prediction. A deep architecture is used to define an energy function over candidate outputs, and predictions are produced by gradient-based energy minimization. This deep energy captures dependencies between labels that would lead to intractable graphical models, and allows us to automatically discover discriminative features of the structured output. Furthermore, practitioners can explore a wide variety of energy function architectures without having to hand-design prediction and learning methods for each model. This is because all of our prediction and learning methods interact with the energy only via the standard interface for deep networks: forward and back-propagation. In a variety of applications, we find that we can obtain better accuracy using approximate minimization of non-convex deep energy functions than baseline models that employ simple energy functions for which exact minimization is tractable.
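The prediction-by-optimization idea is simple to sketch: relax the discrete output and run gradient descent on the learned energy. The PyTorch code below is a minimal illustration under assumed shapes and a generic `energy` callable, not the authors' implementation:

```python
import torch

def spen_predict(energy, x, y_dim, steps=50, lr=0.1):
    """Gradient-based inference for a SPEN-style model: relax the discrete
    output to y in [0, 1]^d and descend the learned energy E(x, y).
    `energy` is any differentiable module taking (x, y); a sketch of
    prediction-by-optimization, not the authors' exact method."""
    logits = torch.zeros(y_dim, requires_grad=True)   # sigmoid(0) = 0.5 start
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        e = energy(x, torch.sigmoid(logits))  # sigmoid keeps y in [0, 1]
        e.backward()                          # standard deep-net interface:
        opt.step()                            # forward and back-propagation
    return (torch.sigmoid(logits) > 0.5).float()  # round to discrete labels
```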
Optimization tools for non-asymptotic statistics in exponential families
Exponential families are a ubiquitous class of models in statistics.
On the one hand, they can model any data type.
In fact, most common distributions are exponential families: Gaussian, categorical, Poisson, Gamma, Wishart, and Dirichlet.
On the other hand, they sit at the core of generalized linear models (GLM), a foundational class of models in machine learning.
They are also supported by beautiful mathematics thanks to their connection with convex duality and the Laplace transform.
This beauty is definitely responsible for the existence of this thesis.
In this manuscript, we make three contributions at the intersection of optimization and statistics, all revolving around exponential families.
The first contribution adapts and improves a variance-reduced optimization algorithm called stochastic dual coordinate ascent (SDCA) to train a particular class of GLMs called conditional random fields (CRFs). CRFs are one of the cornerstones of structured prediction. They were notoriously hard to train until the advent of variance reduction techniques, and our improved version of SDCA performs favorably compared to the previous state of the art.
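As a rough illustration of SDCA's per-coordinate dual updates, here is the textbook version on ridge regression, where the coordinate maximization has a closed form; the CRF variant studied in the thesis is substantially more involved, so this is only the core variance-reduction idea:

```python
import numpy as np

def sdca_ridge(X, y, lam, epochs=10, seed=0):
    """Minimal SDCA sketch for ridge regression (squared loss): maximize
    the dual one randomly chosen coordinate at a time, keeping the primal
    iterate w = X.T @ alpha / (lam * n) in sync. Not the thesis code."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)                  # one dual variable per example
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # closed-form maximization of the dual in coordinate i
            delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + X[i] @ X[i] / (lam * n))
            alpha[i] += delta
            w += delta * X[i] / (lam * n)
    return w
```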
The second contribution focuses on causal discovery.
Exponential families are widely used in graphical models, and in particular in causal graphical models.
This contribution investigates a specific conjecture that gained some traction in previous work: causal models adapt faster to perturbations of the environment.
Using results from optimization, we find strong support for this assumption when the perturbation comes from an intervention on a cause, and evidence against it when the perturbation comes from an intervention on an effect. These findings call for a refinement of the conjecture.
The third contribution addresses a fundamental property of exponential families.
One of the most appealing properties of exponential families is their closed-form maximum likelihood estimator (MLE) and maximum a posteriori (MAP) estimator under a natural choice of conjugate prior. These two estimators are used almost everywhere, often unknowingly: how often are a mean and a variance computed for bell-shaped data without a thought for the underlying Gaussian model? Nevertheless, the literature to date lacks finite-sample convergence results for the Kullback-Leibler (KL) divergence between these estimators and the true distribution.
Drawing on a parallel with optimization, we take some steps towards such a result, and we highlight directions for progress both in statistics and optimization.
These three contributions all put tools from optimization at the service of statistics in exponential families: improving an algorithm for learning structured prediction GLMs, characterizing the adaptation speed of causal models, and estimating the learning speed of ubiquitous estimators. By tying together optimization and statistics, this thesis takes a step towards a better understanding of the fundamentals of machine learning.
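A worked Bernoulli example of the closed-form estimators and the KL quantity in question (the prior parameters and sample size below are illustrative only):

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL(Ber(p) || Ber(q)) in nats; assumes 0 < p, q < 1."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

rng = np.random.default_rng(0)
p_true = 0.3
x = rng.binomial(1, p_true, size=200)             # one finite sample

p_mle = x.mean()                                  # closed-form MLE: the sample mean
a, b = 2.0, 2.0                                   # illustrative Beta(2, 2) conjugate prior
p_map = (x.sum() + a - 1) / (len(x) + a + b - 2)  # closed-form MAP

# Finite-sample KL for this single draw; the thesis asks how fast such
# quantities shrink as the sample size grows.
print(bernoulli_kl(p_true, p_mle), bernoulli_kl(p_true, p_map))
```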
Neural networks versus Logistic regression for 30 days all-cause readmission prediction
Heart failure (HF) is one of the leading causes of hospital admissions in the
US. Readmission within 30 days after an HF hospitalization is both a recognized
indicator for disease progression and a source of considerable financial burden
to the healthcare system. Consequently, the identification of patients at risk
for readmission is a key step in improving disease management and patient
outcome. In this work, we used a large administrative claims dataset to
(1) explore the systematic application of neural network-based models versus logistic regression for predicting 30-day all-cause readmission after discharge from an HF admission, and (2) examine the additive value of
patients' hospitalization timelines on prediction performance. Based on data
from 272,778 (49% female) patients with a mean (SD) age of 73 years (14) and
343,328 HF admissions (67% of total admissions), we trained and tested our
predictive readmission models following a stratified 5-fold cross-validation
scheme. Among the deep learning approaches, a recurrent neural network (RNN) combined with a conditional random field (CRF), the RNNCRF model, achieved the best readmission prediction performance with an AUC of 0.642 (95% CI, 0.640-0.645). Other models, such as those based on RNNs, convolutional neural networks, and CRFs alone, performed worse, with a non-timeline-based model (MLP) performing worst. A competitive model based on logistic regression with LASSO achieved an AUC of 0.643 (95% CI, 0.640-0.646). We conclude that data from patient timelines improve 30-day readmission prediction for neural network-based models, that logistic regression with LASSO performs on par with the best neural network model, and that the use of administrative data results in competitive performance compared to published approaches based on richer clinical datasets.
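For reference, the evaluation protocol described above (an L1-penalized, i.e. LASSO, logistic regression scored by AUC under stratified 5-fold cross-validation) can be sketched with scikit-learn; the feature matrix, labels, and hyperparameters below are placeholders, not the study's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def lasso_lr_cv_auc(X, y, C=0.1, seed=0):
    """Stratified 5-fold cross-validated AUC for an L1-penalized (LASSO)
    logistic regression, mirroring the protocol in the abstract. X is
    (n_samples, n_features), y is a binary label vector; C is assumed."""
    aucs = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X[train_idx], y[train_idx])
        p = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], p))
    return float(np.mean(aucs))
```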