
    Discriminative Features via Generalized Eigenvectors

    Representing examples in a way that is compatible with the underlying classifier can greatly enhance the performance of a learning system. In this paper we investigate scalable techniques for inducing discriminative features by taking advantage of simple second-order structure in the data. We focus on multiclass classification and show that features extracted from the generalized eigenvectors of the class-conditional second moments lead to classifiers with excellent empirical performance. Moreover, these features have attractive theoretical properties, such as inducing representations that are invariant to linear transformations of the input. We evaluate classifiers built from these features on three different tasks, obtaining state-of-the-art results.
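
    The core computation the abstract describes can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it extracts features from the generalized eigenvectors of the second-moment matrices of two classes using SciPy's generalized symmetric eigensolver; the regularization term and the choice of k are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def gev_features(X, y, class_a, class_b, k=3, reg=1e-6):
    """Project onto the top-k generalized eigenvectors of the
    class-conditional second moments E[xx^T | y=a] and E[xx^T | y=b]."""
    A, B = X[y == class_a], X[y == class_b]
    Ca = A.T @ A / len(A)                               # second moment, class a
    Cb = B.T @ B / len(B) + reg * np.eye(X.shape[1])    # regularized for stability
    # Solve Ca v = lambda Cb v; eigh returns eigenvalues in ascending order.
    w, V = eigh(Ca, Cb)
    V = V[:, np.argsort(w)[::-1][:k]]                   # keep largest-ratio directions
    return X @ V

# Example: two Gaussian classes with different scales in 10 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(0, 2, (100, 10))])
y = np.array([0] * 100 + [1] * 100)
print(gev_features(X, y, 0, 1).shape)  # (200, 3)
```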

    Discriminative Clustering by Regularized Information Maximization

    Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance, and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes, and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework as unsupervised multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method.
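
    A minimal sketch of the kind of objective the abstract describes, assuming a linear softmax classifier and an L2 complexity penalty (the paper's instantiation is kernelized logistic regression; the penalty form and lam are assumptions). The estimated mutual information decomposes into a marginal-entropy term (class balance) minus a conditional-entropy term (class separation).

```python
import numpy as np

def rim_objective(W, X, lam=1e-2):
    """RIM objective (to be maximized): I_hat(x; y) - lam * ||W||^2,
    with I_hat = H(mean p(y|x)) - mean H(p(y|x))."""
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)                     # p(y|x) per example
    p_bar = P.mean(axis=0)                                # marginal class distribution
    H_marg = -(p_bar * np.log(p_bar + 1e-12)).sum()       # rewards balanced clusters
    H_cond = -(P * np.log(P + 1e-12)).sum(axis=1).mean()  # rewards confident assignments
    return H_marg - H_cond - lam * np.square(W).sum()
```

    Maximizing this by gradient ascent over W clusters unlabeled data while training the classifier at the same time.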

    Improving Query Classification by Features’ Weight Learning

    This work aims to enhance query classification in call routing applications. A new method is introduced to learn weights from training data by means of a regression model. The work investigates applying the tf-idf weighting method, but the approach is not limited to a specific method and can be used with any weighting scheme. Empirical evaluations with several classifiers, including Support Vector Machines (SVM), Maximum Entropy, Naive Bayes, and k-Nearest Neighbor (k-NN), show substantial improvement in both macro and micro F1 measures.
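
    The abstract does not spell out the regression model, so the following is only a hypothetical reading of the pipeline: fit a regression on the training data, use the magnitude of its coefficients as per-term weights, rescale the tf-idf matrix, and train a classifier on the result. Ridge as the regression model and LinearSVC as the downstream classifier are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVC

docs = ["reset my password", "billing question", "cancel my account"]  # toy queries
labels = [0, 1, 2]                                                     # route ids

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Learn per-term weights from the training data with a regression model;
# here the magnitude of the ridge coefficients serves as the learned weight.
reg = Ridge(alpha=1.0).fit(X.toarray(), labels)
w = np.abs(reg.coef_)

X_weighted = X.multiply(w).tocsr()      # rescale each tf-idf column
clf = LinearSVC().fit(X_weighted, labels)
```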

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbating those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
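
    As a concrete illustration of two of the five challenges named above (missing data and class imbalance), a minimal scikit-learn pipeline might combine imputation with a class-weighted classifier. The specific choices below (median imputation, L2-regularized logistic regression) are illustrative assumptions, not recommendations from the review.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Missing data handled by imputation; class imbalance by re-weighting the loss;
# the curse of dimensionality mitigated here only via L2 regularization (C).
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", C=0.1, max_iter=1000)),
])
```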

    Systems Analytics and Integration of Big Omics Data

    A “genotype” is essentially an organism's full hereditary information, which is obtained from its parents. A “phenotype” is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This “Big Data” is so large and complex that traditional data processing applications are not up to the task. Challenges arise in the collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.

    Contabilidad forense y blanqueo de capitales: aplicación del aprendizaje automático en un proceso judicial español

    This PhD dissertation contributes two new results in detecting signs of financial fraud: (1) the application of machine learning techniques to companies' internal accounting databases to detect money laundering, and (2) the provision of information to the investigating authorities on how the money laundering network is organized, with the objective of orienting the judicial investigation towards those companies or natural persons who present suspicious patterns. In the context of a real macro-case on money laundering in which the author collaborated as forensic accountant, this study analyses the available database of operations carried out between a core company and a set of 643 supplier companies, 26 of which had already been identified a priori by the Judicial Police as fraudulent. Faced with a well-founded suspicion that other suppliers within the network might have committed criminal acts, and in order to better manage the scarce police resources available, machine learning techniques are proposed with two different approaches to detecting patterns of fraud. The first approach is the implementation of neural network models to support the expert-assisted detection of fraudulent operations; the network structure used is the back-propagation network proposed by Hastie et al. (2008). The second approach is a more ambitious pattern-detection procedure in which Benford's Law (Nigrini and Mittermaier, 1997), a tool to characterize the accounting records of the commercial operations between the main company and its suppliers, is combined with four classification models: Ridge Logistic Regression (LG) (le Cessie and van Houwelingen, 1992), Artificial Neural Networks (NN) (Hastie et al., 2008), the C4.5 Decision Tree (DT) (Quinlan, 1993, 1996), and Random Forest (RF) (Breiman, 2001). Overall, the Random Forest with the SMOTE transformation showed the best results, obtaining 96.15% true negatives (TN rate) and 94.98% true positives (TP rate); the classification capacity of this methodology is undoubtedly very high. The machine learning techniques proposed here thus represent an efficient and objective new tool for detecting fraudulent patterns of behaviour in the investigation of money laundering offences, allowing police investigators to focus the limited economic and human resources available in judicial processes on those companies under suspicion whose pattern of behaviour resembles that of previously recognized fraudulent companies.
    The dissertation is structured in two parts. The first part, composed of three chapters, establishes the theoretical framework on which the research is based: Chapter I outlines the concept of money laundering and studies the trend of this crime in Spain; Chapter II describes the process of managing and accessing the information prior to applying the proposed techniques (data pre-processing); and Chapter III specifies the methodology, based on machine learning techniques, for detecting money laundering patterns. The second part is devoted to the presentation of the judicial process and the analysis of the results: after presenting the judicial process and describing the sample, Chapter IV reports the results obtained by applying the machine learning techniques proposed in the two approaches. The dissertation ends with conclusions and proposals for further research.
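
    A minimal sketch of the second approach described above, under several assumptions: leading-digit frequencies serve as the Benford's Law features per supplier, synthetic lognormal amounts stand in for the confidential transaction data, and imbalanced-learn's SMOTE is applied before fitting the Random Forest.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

def benford_features(amounts):
    """Leading-digit (1-9) frequencies of one supplier's transaction amounts,
    to be compared against Benford's expected distribution."""
    first = np.array([int(f"{abs(a):e}"[0]) for a in amounts if a != 0])
    return np.bincount(first, minlength=10)[1:] / len(first)

# Hypothetical supplier-level feature matrix: 643 suppliers, 26 flagged as fraudulent.
rng = np.random.default_rng(1)
X = np.vstack([benford_features(rng.lognormal(7, 1, 500)) for _ in range(643)])
y = np.zeros(643, dtype=int)
y[:26] = 1

# Rebalance the rare fraud class with SMOTE, then fit a Random Forest.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_res, y_res)
```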

    Contribuciones al meta-análisis de pruebas diagnósticas en enfermedades de baja prevalencia

    The discriminative capacity of a diagnostic test is commonly expressed in terms of sensitivity and specificity, and there is generally a trade-off between these two measures, since raising the threshold that defines test positivity decreases sensitivity and increases specificity. The recommended methods for the meta-analysis of diagnostic tests, such as the bivariate model, focus on estimating a summary sensitivity and specificity at a common threshold, whereas the HSROC model focuses on estimating a summary curve from studies that used different thresholds. However, these models do not report the overall mean but rather the mean of a central study, and they differ in their estimation of the correlation between sensitivity and specificity when the number of studies in the meta-analysis is small and/or the between-study variance is relatively large. To address these problems, structures that can be modelled using copula functions are considered as an alternative approach to modelling dependence. This work presents the steps for carrying out a meta-analysis of diagnostic tests in low-prevalence settings, by means of a novel decision scheme grounded in the analysis and study of hierarchical modelling and copulas. Using the HSROC model, meta-analyses of low-prevalence diseases with different cut-off points were simulated and, from the simulated measures, predictors and a response variable were constructed in order to train a machine learning model that suggests the best statistical model for summarizing the results of a meta-analysis of diagnostic tests. The problem of multicollinearity was addressed by identifying inflated predictors through a proposed algorithm that computes the correlation matrix of the predictors and identifies all pairwise correlations that exceed a threshold. Finally, within the analysis presented, a novel flow diagram is proposed, together with an algorithm, that allows the researcher to build a statistical model with machine learning algorithms in order to obtain reliable results in both prediction and classification.
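
    The multicollinearity filter described above reduces to a few lines: compute the predictor correlation matrix and flag every pair whose absolute correlation exceeds a threshold. A minimal sketch (the 0.9 threshold is an assumption):

```python
import numpy as np

def correlated_pairs(X, threshold=0.9):
    """Return index pairs of predictors whose absolute pairwise
    correlation exceeds `threshold` (candidates for removal)."""
    R = np.corrcoef(X, rowvar=False)       # predictor correlation matrix
    iu = np.triu_indices_from(R, k=1)      # visit each pair once
    mask = np.abs(R[iu]) > threshold
    return list(zip(iu[0][mask], iu[1][mask]))
```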

    Combinación de clasificadores mediante el método boosting. Una aplicación a la predicción del fracaso empresarial en España

    The work presented is structured in three parts. The first part comprises Chapters I to IV. After setting out some general aspects of classification problems, it analyses some of the individual classification methods most widely used today, highlighting their main advantages and drawbacks.
    The second part (Chapters V to VII) analyses aspects related to the behaviour and properties of individual classifiers. Specifically, it discusses the difficulties that can arise from using individual classifiers, namely their accuracy and their stability. Chapter VI then addresses the combination of classifiers, paying special attention to the boosting method; it also presents a taxonomy of combination methods and introduces bagging and random forests. Finally, the first algorithms that led to the later development of boosting are studied, some of the modifications proposed to the AdaBoost algorithm are described, including those designed to handle more than two classes, and the appropriate size of the trees used in the combination is analysed.
    The third part (Chapters VIII to X) provides an overview of business failure prediction, its background, and its current state. A list is compiled of the financial ratios that have proved most useful for predicting failure, and the evolution of failed companies in Spain is described. Chapter IX focuses on the practical application. After briefly reviewing some theoretical considerations on the treatment of the information, an exploratory analysis of the data is carried out. In addition to fourteen financial ratios, three less common variables are used that attempt to capture the size of the company, its business activity, and its legal form. The boosting method is compared with classification trees, both in the dichotomous case and when three classes are distinguished. A somewhat less detailed comparison with five other classification methods follows. The ability of the previously established models to predict business failure is then examined as the time horizon to the failure event increases. Finally, it is concluded that boosting improves on the results of individual classification trees. The main contributions of this work include the use of a novel technique, the boosting method, and the consideration of a broader concept of business failure than is usual.
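
    A minimal sketch of the central comparison (boosting versus a single classification tree) on synthetic stand-ins for the thesis's fourteen ratios and three extra firm variables, with three failure classes. Scikit-learn's AdaBoostClassifier is used here (the estimator parameter name assumes scikit-learn >= 1.2), not the thesis's own implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for 14 financial ratios + 3 extra firm variables, 3 failure classes.
X, y = make_classification(n_samples=1000, n_features=17, n_informative=8,
                           n_classes=3, random_state=0)

stump = DecisionTreeClassifier(max_depth=2)  # small trees, as the thesis discusses
boost = AdaBoostClassifier(estimator=stump, n_estimators=300, random_state=0)

print(cross_val_score(DecisionTreeClassifier(), X, y).mean())  # single tree
print(cross_val_score(boost, X, y).mean())                     # boosted small trees
```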

    Kernel multilogit algorithm for multiclass classification

    An algorithm for multi-class classification is proposed. The soft classification problem is considered, where the target variable is a multivariate random variable. The proposed algorithm transforms the original target variable into a new space using the multilogit function. Assuming Gaussian noise on this transformation and using a standard Bayesian approach, the model yields a quadratic functional whose global minimum can easily be obtained by solving a linear system of equations. To obtain the classification, the inverse multilogit-based transformation is applied, and the result can be interpreted as a 'soft' or probabilistic classification. The final classification is then obtained using the 'winner takes all' strategy. A kernel-based formulation is presented in order to capture the non-linearities associated with the feature space of the data. The proposed algorithm is applied to real data, using databases available online. The experimental study shows that the algorithm is competitive with respect to other classical algorithms for multiclass classification.
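
    A minimal reconstruction of the algorithm as the abstract describes it, with assumed details: one-hot targets are softened by eps so that the multilogit (log) transform is finite, an RBF kernel and ridge penalty lam define the quadratic functional solved as a linear system, and prediction inverts the transform with a softmax before the winner-takes-all step.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B, gamma=0.5):
    return np.exp(-gamma * cdist(A, B, "sqeuclidean"))

def fit_kernel_multilogit(X, y, n_classes, lam=1e-2, eps=1e-3):
    # Soften one-hot targets so the multilogit (log) transform is finite.
    Yhot = np.full((len(y), n_classes), eps)
    Yhot[np.arange(len(y)), y] = 1.0 - eps * (n_classes - 1)
    Z = np.log(Yhot)                       # multilogit transform (up to a constant)
    K = rbf(X, X)
    # Gaussian noise + standard Bayesian approach gives a quadratic functional
    # whose global minimum solves a linear system (kernel ridge regression).
    return np.linalg.solve(K + lam * np.eye(len(X)), Z)

def predict(alpha, Xtrain, Xnew):
    Z = rbf(Xnew, Xtrain) @ alpha
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)      # inverse transform: 'soft' classification
    return P.argmax(axis=1)                # 'winner takes all' strategy
```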