39 research outputs found

    Sparse Model Selection using Information Complexity

    This dissertation studies the application of information complexity to statistical model selection through three projects. Specifically, we design statistical models that incorporate sparsity features to make the models more explanatory and computationally efficient. In the first project, we propose a Sparse Bridge Regression model for variable selection when the number of variables is much greater than the number of observations and model misspecification occurs. The model is demonstrated to have excellent explanatory power in high-dimensional data analysis through numerical simulations and real-world data analysis. The second project proposes a novel hybrid modeling method, a mixture of sparse principal component regression (MIX-SPCR), to segment high-dimensional time series data. Using the MIX-SPCR model, we empirically analyze S&P 500 index data (from 1999 to 2019) and identify two key change points. The third project investigates the use of nonlinear features in the Sparse Kernel Factor Analysis (SKFA) method and derives the corresponding information criterion. Using a variety of wide datasets, we demonstrate the benefits of SKFA in the nonlinear representation and classification of data. The results show the flexibility and utility of information complexity in such data modeling problems.
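
    For concreteness, a bridge-type penalty of the kind referenced above can be written as the following generic objective; the exponent and the information-complexity criterion shown are standard textbook forms (Frank and Friedman's bridge penalty and Bozdogan's ICOMP), not necessarily the dissertation's exact formulation:

        \hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \; \| y - X\beta \|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|^{\gamma}, \qquad \lambda > 0, \; 0 < \gamma \le 1,

    where exponents \gamma \le 1 induce sparsity (\gamma = 1 recovers the lasso). A generic information complexity criterion for choosing \lambda and \gamma is

        \mathrm{ICOMP} = -2 \log L(\hat{\theta}) + 2\, C_1\!\big(\widehat{\mathrm{Cov}}(\hat{\theta})\big), \qquad C_1(\Sigma) = \frac{s}{2} \log \frac{\operatorname{tr}(\Sigma)}{s} - \frac{1}{2} \log \det(\Sigma), \quad s = \operatorname{rank}(\Sigma),

    so model selection amounts to scanning a grid of (\lambda, \gamma) values and keeping the fit with the smallest criterion value.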

    Classification of clinical outcomes using high-throughput and clinical informatics.

    It is widely recognized that many cancer therapies are effective only for a subset of patients. However, clinical studies are most often powered to detect an overall treatment effect. To address this issue, classification methods are increasingly being used to predict a subset of patients who respond differently to treatment. This study begins with a brief history of classification methods with an emphasis on applications involving melanoma. Nonparametric methods suitable for predicting subsets of patients responding differently to treatment are then reviewed. Each method has different ways of incorporating continuous, categorical, clinical and high-throughput covariates. For nonparametric and parametric methods, distance measures specific to the method are used to make classification decisions. Approaches are outlined which employ these distances to measure treatment interactions and predict patients more sensitive to treatment. Simulations are also carried out to examine the empirical power of some of these classification methods in an adaptive signature design. Results were compared with logistic regression models. It was found that parametric and nonparametric methods performed reasonably well, with the relative performance of the methods depending on the simulation scenario. Finally, a method was developed to evaluate the power and sample size needed for an adaptive signature design in order to predict the subset of patients sensitive to treatment. It is hoped that this study will stimulate more development of nonparametric and parametric methods to predict subsets of patients responding differently to treatment.
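
    As a rough illustration of the kind of power simulation described above, the sketch below estimates the empirical power of a simplified adaptive signature design (in the spirit of Freidlin and Simon, 2005) with a logistic-regression signature. The sample sizes, effect sizes, alpha split and sensitivity cut-off are illustrative assumptions, not the study's settings.

import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def one_trial(n=400, p=10, effect=1.5):
    """Simulate one trial where treatment helps only patients with X[:, 0] > 0."""
    X = rng.normal(size=(n, p))
    trt = rng.integers(0, 2, size=n)
    sensitive = (X[:, 0] > 0).astype(float)
    logit = -0.5 + effect * trt * sensitive
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return X, trt, y

def fisher_p(y, trt):
    """Two-arm comparison of response rates via Fisher's exact test."""
    table = [[np.sum((trt == a) & (y == b)) for b in (0, 1)] for a in (0, 1)]
    return stats.fisher_exact(table)[1]

def asd_rejects(X, trt, y, alpha_overall=0.04, alpha_subset=0.01, cut=0.10):
    n = len(y)
    # Stage 1: overall treatment effect at a reduced significance level.
    if fisher_p(y, trt) < alpha_overall:
        return True
    # Stage 2: learn a signature on one half, then test the predicted-sensitive
    # subset of the other half at the remaining alpha.
    dev, val = np.arange(n) < n // 2, np.arange(n) >= n // 2
    Z = np.column_stack([X, trt[:, None], X * trt[:, None]])   # main effects + interactions
    model = LogisticRegression(max_iter=1000).fit(Z[dev], y[dev])
    def prob(arm):
        Zv = np.column_stack([X[val], np.full((val.sum(), 1), arm), X[val] * arm])
        return model.predict_proba(Zv)[:, 1]
    predicted_sensitive = (prob(1) - prob(0)) > cut             # predicted treatment benefit
    if predicted_sensitive.sum() < 10:
        return False
    return fisher_p(y[val][predicted_sensitive], trt[val][predicted_sensitive]) < alpha_subset

power = np.mean([asd_rejects(*one_trial()) for _ in range(200)])
print(f"empirical power ~ {power:.2f}")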

    Machine Learning in Credit Risk Management: An Empirical Analysis for Recovery Rates


    Advances in Hyperspectral Image Classification Methods for Vegetation and Agricultural Cropland Studies

    Hyperspectral data are becoming more widely available via sensors on airborne and unmanned aerial vehicle (UAV) platforms, as well as proximal platforms. While space-based hyperspectral data continue to be limited in availability, multiple spaceborne Earth-observing missions on traditional platforms are scheduled for launch, and companies are experimenting with small satellites for constellations to observe the Earth, as well as for planetary missions. Land cover mapping via classification is one of the most important applications of hyperspectral remote sensing and will increase in significance as time series of imagery are more readily available. However, while the narrow bands of hyperspectral data provide new opportunities for chemistry-based modeling and mapping, challenges remain. Hyperspectral data are high dimensional, and many bands are highly correlated or irrelevant for a given classification problem. For supervised classification methods, the quantity of training data is typically limited relative to the dimension of the input space. The resulting Hughes phenomenon, often referred to as the curse of dimensionality, increases potential for unstable parameter estimates, overfitting, and poor generalization of classifiers. This is particularly problematic for parametric approaches such as Gaussian maximum likelihood-based classifiers that have been the backbone of pixel-based multispectral classification methods. This issue has motivated investigation of alternatives, including regularization of the class covariance matrices, ensembles of weak classifiers, development of feature selection and extraction methods, adoption of nonparametric classifiers, and exploration of methods to exploit unlabeled samples via semi-supervised and active learning. Data sets are also quite large, motivating computationally efficient algorithms and implementations. This chapter provides an overview of the recent advances in classification methods for mapping vegetation using hyperspectral data. Three data sets that are used in the hyperspectral classification literature (e.g., Botswana Hyperion satellite data and AVIRIS airborne data over both Kennedy Space Center and Indian Pines) are described in Section 3.2 and used to illustrate methods described in the chapter. An additional high-resolution hyperspectral data set acquired by a SpecTIR sensor on an airborne platform over the Indian Pines area is included to exemplify the use of new deep learning approaches, and a multiplatform example of airborne hyperspectral data is provided to demonstrate transfer learning in hyperspectral image classification. Classical approaches for supervised and unsupervised feature selection and extraction are reviewed in Section 3.3. In particular, nonlinearities exhibited in hyperspectral imagery have motivated development of nonlinear feature extraction methods in manifold learning, which are outlined in Section 3.3.1.4. Spatial context is also important in classification of both natural vegetation with complex textural patterns and large agricultural fields with significant local variability within fields. Approaches to exploit spatial features at both the pixel level (e.g., co-occurrence-based texture and extended morphological attribute profiles [EMAPs]) and integration of segmentation approaches (e.g., HSeg) are discussed in this context in Section 3.3.2. Recently, classification methods that leverage nonparametric methods originating in the machine learning community have grown in popularity.
An overview of both widely used and newly emerging approaches, including support vector machines (SVMs), Gaussian mixture models, and deep learning based on convolutional neural networks, is provided in Section 3.4. Strategies to exploit unlabeled samples, including active learning and metric learning, which combine feature extraction and augmentation of the pool of training samples in an active learning framework, are outlined in Section 3.5. Integration of image segmentation with classification to accommodate spatial coherence typically observed in vegetation is also explored, including as an integrated active learning system. Exploitation of multisensor strategies for augmenting the pool of training samples is investigated via a transfer learning framework in Section 3.5.1.2. Finally, we look to the future, considering opportunities soon to be provided by new paradigms, as hyperspectral sensing is becoming common at multiple scales from ground-based and airborne autonomous vehicles to manned aircraft and space-based platforms.
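
    As a minimal illustration of the pixel-wise supervised workflow surveyed above (dimensionality reduction followed by an SVM under scarce labels), the sketch below assumes a hyperspectral cube already flattened to an array X of shape (n_pixels, n_bands) with integer class labels y; the file names, training fraction and SVM settings are illustrative, not the chapter's.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed inputs: X (n_pixels, n_bands) reflectance spectra, y (n_pixels,) class labels.
X = np.load("hyperspectral_pixels.npy")   # hypothetical pre-extracted labeled pixels
y = np.load("hyperspectral_labels.npy")

# A small labeled fraction mimics the limited-training-data regime discussed above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.05, stratify=y, random_state=0)

# Standardize bands, compress the highly correlated band space, then fit an RBF SVM.
clf = make_pipeline(StandardScaler(),
                    PCA(n_components=30),
                    SVC(kernel="rbf", C=10, gamma="scale"))
clf.fit(X_tr, y_tr)
print("overall accuracy:", clf.score(X_te, y_te))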

    High-dimensional and one-class classification

    When dealing with high-dimensional data and, in particular, when the number of attributes p is large compared to the sample size n, several classification methods cannot be applied. Fisher's linear discriminant rule and the quadratic discriminant rule are infeasible, as the inverse of the involved covariance matrices cannot be computed. A recent approach to overcoming this problem is based on Random Projections (RPs), which have emerged as a powerful method for dimensionality reduction. In 2017, Cannings and Samworth introduced the RP method in the ensemble context to extend classification methods originally designed for low-dimensional data to the high-dimensional domain. Although the RP ensemble classifier improves classification accuracy, it may still include redundant information. Moreover, unlike other ensemble classifiers (e.g. Random Forest), it does not provide any insight into the actual classification importance of the input features. To account for these aspects, in the first part of this thesis we investigate two new directions for the RP ensemble classifier. Firstly, combining the original idea of using the Multiplicative Binomial distribution as the reference model to describe and predict the ensemble accuracy with an important result on that distribution, we introduce a stepwise strategy for post-pruning (called the Ensemble Selection Algorithm). Secondly, we propose a criterion (called Variable Importance in Projection) that uses the feature coefficients in the best discriminant projections to measure variable importance in classification. In the second part, we face the new challenges posed by high-dimensional data in a recently emerging classification context: one-class classification. This is a special classification task where only one class is fully known (the target class), while information on the others is completely missing. In particular, we address this task by using Gini's transvariation probability as a measure of typicality, aimed at identifying the best boundary around the target class.
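
    A minimal sketch of the random-projection ensemble idea referred to above (in the spirit of Cannings and Samworth, 2017): each ensemble member draws several random projections into a low dimension, keeps the one on which a base classifier (here LDA) does best on held-out data, and the members then vote. The dimensions, ensemble sizes and fixed voting threshold are simplifying assumptions, not the thesis' settings.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def rp_ensemble_predict(X_tr, y_tr, X_te, d=5, B1=50, B2=20, vote_threshold=0.5):
    """Simplified random-projection ensemble classifier for binary labels 0/1."""
    votes = np.zeros(len(X_te))
    for _ in range(B1):
        # Split once per member to score candidate projections out of sample.
        X_fit, X_val, y_fit, y_val = train_test_split(
            X_tr, y_tr, test_size=0.3, random_state=int(rng.integers(1_000_000)))
        best_acc, best_model, best_A = -1.0, None, None
        for _ in range(B2):
            A = rng.normal(size=(X_tr.shape[1], d)) / np.sqrt(d)   # Gaussian projection
            model = LinearDiscriminantAnalysis().fit(X_fit @ A, y_fit)
            acc = model.score(X_val @ A, y_val)
            if acc > best_acc:
                best_acc, best_model, best_A = acc, model, A
        votes += best_model.predict(X_te @ best_A)
    return (votes / B1 > vote_threshold).astype(int)

# Toy usage with p >> n and signal only in the first two coordinates.
n, p = 100, 500
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
X_new = rng.normal(size=(20, p))
print(rp_ensemble_predict(X, y, X_new))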

    Self organisation for 4G/5G networks

    Nowadays, the rapid growth of mobile communications is changing the world towards a fully connected society. Current 4G networks account for almost half of total mobile traffic, and in the forthcoming years, the overall mobile data traffic is expected to dramatically increase. To manage this increase in data traffic, operators adopt network topologies such as Heterogeneous Networks. Thus, operators can deploy hundreds of small cells for each macro cell, allowing them to reduce coverage holes and/or lack of capacity. The advent of this technology is expected to tremendously increase the number of nodes in this new ecosystem, so that traditional network management activities based on, e.g., classic manual and field trial design approaches are simply not viable anymore. As a consequence, the academic literature has dedicated a significant amount of effort to Self-Organising Network (SON) algorithms. These solutions aim to bring intelligence and autonomous adaptability into cellular networks, thereby reducing capital and operational expenditures (CAPEX/OPEX). Another aspect to take into account is that these types of networks generate a large amount of data during their normal operation, in the form of control, management and data measurements. This data is expected to increase in 5G due to different aspects, such as densification, heterogeneity in layers and technologies, additional control and management complexity in Network Functions Virtualisation (NFV) and Software Defined Networking (SDN), and the advent of the Internet of Things (IoT), among others. In this context, operators face the challenge of designing efficient technologies while introducing new services, achieving targets in terms of customer satisfaction, where the overall operator goal is to build networks which are self-aware, self-adaptive, and intelligent. This dissertation provides a contribution to the design, analysis, and evaluation of SON solutions to improve network operator performance, expenses, and users' experience, by making the network more self-adaptive and intelligent. It also provides a contribution to the design of a self-aware network planning tool, which makes it possible to predict the Quality of Service (QoS) offered to end-users based on data already available in the network. The main thesis contributions are divided into two parts. The first part presents a novel functional architecture based on an automatic and self-organised Reinforcement Learning (RL) based approach to model SON functionalities, in which the main task is the self-coordination of the different actions taken by different SON functions to be automatically executed in a self-organised, realistic Long Term Evolution (LTE) network. The proposed approach introduces a new paradigm to deal with the conflicts generated by the concurrent execution of multiple SON functions, revealing that the proposed approach is general enough to model all the SON functions and their derived conflicts. The second part of the thesis is dedicated to the problem of QoS prediction. In particular, we aim at finding patterns of knowledge in physical layer data acquired from heterogeneous LTE networks. We propose an approach that is not only able to verify the QoS level experienced by the users, through physical layer measurements of the UEs, but is also able to predict it based on measurements collected at different times and from different regions of the heterogeneous network.
We then propose to make predictions independently of the physical location, in order to exploit the experience gained in other sectors of the network to properly dimension and deploy heterogeneous nodes. In this context, we use Machine Learning (ML) as a tool to allow the network to learn from experience, improving performance, and big data analytics to drive the network from reactive to predictive.
During the development of this thesis, two main conclusions have been drawn. First, we highlight the importance of designing efficient SON algorithms to effectively address several challenges, such as the most suitable placement of SON functions and algorithms, in order to properly solve the distributed versus centralised implementation problem, or the resolution of conflicts between SON functions executed at different nodes or networks. Second, in terms of network planning tools, a variety of tools can be found covering a wide range of systems and applications, oriented towards industry as well as research purposes. In this context, the investigated solutions are continuously subject to important changes, where one of the main drivers is to deliver more cost-effective solutions.
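
    To make the QoS-prediction part of this abstract concrete, the sketch below trains a supervised model to predict a downlink QoS class from per-UE physical-layer measurements. The feature names (RSRP, RSRQ, SINR, CQI), the random-forest choice and the synthetic labels are illustrative assumptions only; the thesis may use different measurements and learners.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000

# Hypothetical per-UE physical-layer measurements (units are indicative).
df = pd.DataFrame({
    "rsrp_dbm": rng.uniform(-120, -70, n),
    "rsrq_db": rng.uniform(-20, -3, n),
    "sinr_db": rng.uniform(-5, 30, n),
    "cqi": rng.integers(1, 16, n),
})
# Synthetic QoS label: 1 = "good" throughput class, 0 = "poor".
score = 0.05 * df["sinr_db"] + 0.1 * df["cqi"] + 0.01 * (df["rsrp_dbm"] + 120)
qos_class = (score + rng.normal(scale=0.3, size=n) > 1.7).astype(int)

# Train on one region / time window and estimate accuracy by cross-validation; in
# practice the model would be validated on measurements from other sectors and periods.
model = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(model, df, qos_class, cv=5).mean())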

    Old Dogs, New Tricks: Authoritarian Regime Persistence Through Learning

    How does diffusion lead to authoritarian regime persistence? Political decisions, regardless of what the actors involved might believe or espouse, do not happen in isolation. Policy changes, institutional alterations, regime transitions: these political phenomena are all in some part a product of diffusion processes as much as they are derived from internal determinants. As such, political regimes do not exist in a vacuum, nor do they ignore the outside world. When making decisions about policy and practice, we should expect competent political actors to look at the wider external world. This dissertation project presents a theory of regime learning and authoritarian persistence to augment the extant literature on diffusion and democratization. While this literature provides important links between outcomes across borders, it falls short in explaining whether and how diffusion can account for the absence of change: authoritarian persistence. The new theoretical approach is rooted in concepts drawn from the democratization literature as well as the psychology of learning, and distinguishes simplistic learning (emulation), based on the availability heuristic, from a more sophisticated learning process rooted in the representativeness heuristic. To test the implications of this theory, I develop a pair of new measures of change: liberalization (making concessions) and deliberalization (increasing repression). Using a combination of human and machine coding of yearly Freedom House country reports, I determine whether authoritarian regimes made liberalizing or deliberalizing moves which fall short of the significant regime changes that aggregate measures such as POLITY, Freedom House, and similar capture. An empirical examination employing these new measures reveals that diffusion does exist among authoritarian regimes at the regional level, among contiguous neighborhoods, and within more carefully confined groups of peers. These results add to our understanding of persistent authoritarianism and establish that emulation can be identified. Although authoritarian regimes seem to be copying the liberalization and deliberalization strategies of their peers, there is not yet clear support for more sophisticated learning processes.

    Machine Learning Methods with Noisy, Incomplete or Small Datasets

    In many machine learning applications, the available datasets are incomplete, noisy or affected by artifacts. In supervised scenarios, the label information may be of low quality, including unbalanced training sets, noisy labels and other problems. Moreover, in practice, it is very common that the available data samples are not enough to derive useful supervised or unsupervised classifiers. All these issues are commonly referred to as the low-quality data problem. This book collects novel contributions on machine learning methods for low-quality datasets, to contribute to the dissemination of new ideas to solve this challenging problem, and to provide clear examples of application in real scenarios.

    Dimensionality reduction methods for microarray cancer data using prior knowledge

    Microarray studies are currently a very popular source of biological information. They allow the simultaneous measurement of hundreds of thousands of genes, drastically increasing the amount of data that can be gathered in a small amount of time and also decreasing the cost of producing such results. Large numbers of high-dimensional data sets are currently being generated and there is an ongoing need to find ways to analyse them to obtain meaningful interpretations. Many microarray experiments are concerned with answering specific biological or medical questions regarding diseases and treatments. Cancer is one of the most popular research areas and there is a plethora of data available requiring in-depth analysis. Although the analysis of microarray data has been thoroughly researched over the past ten years, new approaches still appear regularly, and may lead to a better understanding of the available information. The size of modern data sets presents considerable difficulties to traditional methodologies based on hypothesis testing, and there is a new move towards the use of machine learning in microarray data analysis. Two new methods of using prior genetic knowledge in machine learning algorithms have been developed and their results are compared with existing methods. The prior knowledge consists of biological pathway data that can be found in online databases, and gene ontology (GO) terms. The first method, called "a priori manifold learning", uses the prior knowledge when constructing a manifold for non-linear feature extraction. It was found to perform better than both linear principal components analysis (PCA) and the non-linear Isomap algorithm (without prior knowledge) in both classification accuracy and quality of the clusters. Both pathway and GO terms were used as prior knowledge, and results showed that using GO terms can make the models over-fit the data. In the cases where the use of GO terms does not over-fit, the results are better than PCA, Isomap and a priori manifold learning using pathways. The second method, called "the feature selection over pathway segmentation algorithm", uses the pathway information to split a big dataset into smaller ones. Then, using AdaBoost, decision trees are constructed for each of the smaller sets and the sets that achieve higher classification accuracy are identified. The individual genes in these subsets are assessed to determine their role in the classification process. Using data sets concerning chronic myeloid leukaemia (CML), two subsets based on pathways were found to be strongly associated with the response to treatment. Using a different data set from measurements on lower grade glioma (LGG) tumours, four informative gene sets were discovered. Further analysis based on the Gini importance measure identified a set of genes for each cancer type (CML, LGG) that could predict the response to treatment very accurately (> 90%). Moreover, a single gene that can predict the response to CML treatment accurately was identified.
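
    A minimal sketch of the pathway-segmentation idea described above: the expression matrix is split column-wise by pathway membership, an AdaBoost classifier (whose default base learner is a depth-1 decision tree) is cross-validated on each gene subset, and the best-scoring subsets are inspected via the trees' Gini-based feature importances. The pathway dictionary, data arrays and sizes are placeholders, not the thesis' actual data.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder inputs: expression matrix (samples x genes), binary response labels,
# gene names, and a pathway -> gene-list mapping (e.g. from an online pathway database).
n_samples, n_genes = 60, 300
X = rng.normal(size=(n_samples, n_genes))
y = rng.integers(0, 2, n_samples)
genes = np.array([f"gene_{i}" for i in range(n_genes)])
pathways = {f"pathway_{k}": list(genes[rng.choice(n_genes, 25, replace=False)])
            for k in range(10)}

results = {}
for name, members in pathways.items():
    cols = np.isin(genes, members)                      # genes belonging to this pathway
    clf = AdaBoostClassifier(n_estimators=100, random_state=0)
    acc = cross_val_score(clf, X[:, cols], y, cv=5).mean()
    clf.fit(X[:, cols], y)
    importance = dict(zip(genes[cols], clf.feature_importances_))  # Gini-based importances
    results[name] = (acc, importance)

# Rank pathways by cross-validated accuracy and report the top genes of the best subset.
best = max(results, key=lambda k: results[k][0])
top_genes = sorted(results[best][1], key=results[best][1].get, reverse=True)[:5]
print(best, round(results[best][0], 3), top_genes)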

    Linking topiramate exposure to changes in electrophysiological activity and behavioral deficits through quantitative pharmacological modeling

    University of Minnesota Ph.D. dissertation. May 2019. Major: Experimental & Clinical Pharmacology. Advisors: Susan Marino, Angela Birnbaum. 1 computer file (PDF); xv, 166 pages. Topiramate is a broad-spectrum anti-epileptic drug used to treat a variety of conditions, including epilepsy, migraine, substance abuse, and mood and eating disorders. We investigated the effects of topiramate on the working memory system using population pharmacokinetic-pharmacodynamic modeling and unsupervised machine learning approaches. Working memory is the capacity-limited neurocognitive system responsible for the simultaneous maintenance and manipulation of information in order to achieve a goal. Behavioral and electrophysiological indices of working memory function were measured using data collected during a double-blind, placebo-controlled crossover study in healthy volunteers. Subjects completed a Sternberg working memory task, during which accuracy and reaction time were measured, while subjects' EEG was recorded. A pharmacokinetic-pharmacodynamic model was constructed which demonstrated that accuracy decreased linearly as a function of plasma concentration, and that the magnitude of individual deficits was predicted by working memory capacity. A separate pharmacokinetic-pharmacodynamic model was developed which showed that spectral power in the theta frequency band (4-8 Hz) recorded during the retention phase of the Sternberg task increased as a function of plasma concentration. Furthermore, a mixture model identified two subpopulations with differential sensitivity in topiramate-induced theta reactivity. In the subpopulation defined by lower reactivity, reaction times were 20% slower than in the high theta reactivity subpopulation. Principal component regression was used to quantify the relationship between changes in multiple measures of electrophysiological activity and behavioral deficits. Theta power during retention was found to be the best predictor of topiramate-related behavioral deficits. Performance on another working memory task, Digit Span Forward, was also predicted by theta power during retention, as well as by alpha (8-12 Hz) power during the encoding and retrieval stages. In conclusion, two treatment-independent factors that predict differences in behavioral and electrophysiological responses to topiramate administration were identified: working memory capacity and theta reactivity. Future research will be needed to determine the utility of these factors in predicting the risk of cognitive side effects in patients eligible for treatment with topiramate.
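
    As an illustration of the principal component regression step mentioned above (relating multiple electrophysiological measures to a behavioral deficit), the sketch below uses a scikit-learn pipeline on synthetic data; the feature layout, dimensions and number of components are placeholders, not the study's variables.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder design: rows are subject/occasion observations, columns are changes in
# EEG band power (e.g. theta and alpha at several electrodes and task stages).
n_obs, n_features = 80, 12
eeg_changes = rng.normal(size=(n_obs, n_features))
# Synthetic behavioral deficit driven mostly by the first two EEG measures.
accuracy_change = (eeg_changes[:, 0] - 0.5 * eeg_changes[:, 1]
                   + rng.normal(scale=0.5, size=n_obs))

# Principal component regression: standardize, keep a few components, regress on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
r2 = cross_val_score(pcr, eeg_changes, accuracy_change, cv=5, scoring="r2").mean()
print(f"cross-validated R^2 ~ {r2:.2f}")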