94 research outputs found

    An extensive experimental survey of regression methods

    Regression is a highly relevant problem in machine learning, with many available approaches. The present work compares a large collection of 77 popular regression models belonging to 19 families: linear and generalized linear models, generalized additive models, least squares, projection methods, LASSO and ridge regression, Bayesian models, Gaussian processes, quantile regression, nearest neighbors, regression trees and rules, random forests, bagging and boosting, neural networks, deep learning, and support vector regression. These methods are evaluated on all the regression datasets of the UCI machine learning repository (83 datasets), with some exceptions due to technical reasons. The experiments identify several outstanding regression models: the M5 rule-based model with nearest-neighbor corrections (cubist), the gradient boosted machine (gbm), the boosting ensemble of regression trees (bstTree), and the M5 regression tree. Cubist achieves the best squared correlation (R²) in 15.7% of the datasets and is very near the best elsewhere: its difference from the best R² is below 0.2 for 89.1% of the datasets, and the median of these differences over the collection is very low (0.0192), compared, e.g., to 0.150 for classical linear regression. However, cubist is slow and fails on several large datasets, while similar models such as M5 never fail, and M5's difference from the best R² is below 0.2 for 92.8% of the datasets. Other well-performing models are the committee of neural networks (avNNet), extremely randomized regression trees (extraTrees, which achieves the best R² in 33.7% of the datasets), random forest (rf), and ε-support vector regression (svr), but they are slower and fail on several datasets. The fastest model is least angle regression (lars), which is 70 and 2,115 times faster than M5 and cubist, respectively. The model requiring the least memory is non-negative least squares (nnls), about 2 GB, similar to cubist, while M5 requires about 8 GB. For 97.6% of the datasets there is a regression model among the 10 best that is very near (difference below 0.1) the best R², rising to 100% when differences of 0.2 are allowed. Therefore, provided that our dataset and model collections are representative enough, the main conclusion of this study is that, for a new regression problem, some model in our top 10 should achieve an R² near the best attainable for that problem.

    This work has received financial support from the Erasmus Mundus Euphrates programme [project number 2013-2540/001-001-EMA2], from the Xunta de Galicia (Centro singular de investigación de Galicia, accreditation 2016–2019) and the European Union (European Regional Development Fund, ERDF), and from Project MTM2016-76969-P (Spanish State Research Agency, AEI), co-funded by the ERDF and the IAP network of the Belgian Science Policy.
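    To make the evaluation protocol concrete, here is a minimal sketch in Python: a handful of regressors are scored by cross-validated R² on one dataset and ranked by their gap to the best R². The scikit-learn models merely stand in for the surveyed implementations (cubist, gbm, bstTree, etc. are not scikit-learn models), and the synthetic data is a placeholder for a UCI dataset.

```python
# Sketch of the survey's protocol: score several regressors by
# cross-validated R^2 and report each model's gap to the best R^2.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Lars
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

# placeholder for one UCI regression dataset
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

models = {
    "lm": LinearRegression(),
    "lars": Lars(),
    "knn": KNeighborsRegressor(),
    "rf": RandomForestRegressor(random_state=0),
    "gbm": GradientBoostingRegressor(random_state=0),
    "svr": SVR(),
}

scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}
best = max(scores.values())
for name, r2 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:5s}  R2={r2:.4f}  gap to best={best - r2:.4f}")
```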

    Application of machine learning to agricultural soil data

    Agriculture is a major sector of the Indian economy. One key advantage of classifying and predicting soil parameters is to save the time of specialized technicians performing expensive chemical analyses. In this context, this PhD thesis was developed in three stages: 1. Classification of soil data: we used chemical soil measurements to classify several relevant soil parameters: village-wise fertility indices; soil pH and type; soil nutrients, in order to recommend suitable amounts of fertilizers; and the preferable crop. 2. Regression on generic data: we developed an experimental comparison of many regressors on a large collection of generic datasets selected from the University of California, Irvine (UCI) machine learning repository. 3. Regression on soil data: we applied the regressors used in stage 2 to the soil datasets, directly predicting their numeric values. The accuracy of the prediction was evaluated on the ten soil problems, as an alternative to the prediction of the quantified values (classification) developed in stage 1.
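    As an illustration of the stage-1 classification task, the sketch below predicts a categorical fertility index from chemical measurements. All column names, value ranges, and the label construction are hypothetical stand-ins for the thesis's soil data, and the classifier is one arbitrary choice among the many the thesis evaluates.

```python
# Toy stage-1 setup: classify a fertility index from soil chemistry.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic stand-in for a village-wise soil table (columns hypothetical)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pH": rng.uniform(5.5, 8.5, 600),
    "organic_carbon": rng.uniform(0.2, 1.5, 600),
    "N": rng.uniform(100, 500, 600),
    "P": rng.uniform(5, 50, 600),
    "K": rng.uniform(100, 400, 600),
})
# toy label: fertility driven mainly by nitrogen and organic carbon
df["fertility"] = pd.cut(df["N"] * df["organic_carbon"],
                         bins=3, labels=["low", "medium", "high"]).astype(str)

X, y = df.drop(columns="fertility"), df["fertility"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```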

    Assessing skin lesion evolution from multispectral image sequences

    During the evaluation of skin disease treatments, dermatologists must clinically measure the evolution of each patient's pathology severity over the treatment period. Such a process is sensitive to intra- and inter-dermatologist variability. To make this severity measurement more objective, we quantify the pathology severity using a new image-processing-based method. We focus on a hyperpigmentation disorder called melasma. During a treatment period, multispectral images are taken of patients receiving the same treatment. After co-registration and segmentation steps, we propose an algorithm to measure the evolution of the intensity, size, and homogeneity of the pathological areas. The obtained results are compared with a dermatologist's diagnosis using statistical tests on two clinical studies containing 384 images from 16 patients and 352 images from 22 patients, respectively. This research report is an update of report 8136: it describes the methods and experiments in more detail and provides more references.
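    A minimal sketch of per-visit severity measures on a segmented lesion follows, assuming one spectral band and a binary lesion mask. The homogeneity proxy used here (an inverse-dispersion score) is an illustrative choice, not necessarily the report's exact definition.

```python
# Per-visit lesion metrics: mean intensity, area, and a homogeneity proxy.
import numpy as np

def lesion_metrics(band: np.ndarray, mask: np.ndarray) -> dict:
    pixels = band[mask.astype(bool)]      # intensities inside the lesion
    return {
        "mean_intensity": float(pixels.mean()),
        "area_px": int(mask.sum()),
        "homogeneity": float(1.0 / (1.0 + pixels.std())),  # illustrative
    }

# evolution = compare these metrics across the co-registered time series
band = np.random.rand(128, 128)           # placeholder spectral band
mask = np.zeros((128, 128), dtype=bool)
mask[40:80, 40:80] = True                 # placeholder segmented lesion
print(lesion_metrics(band, mask))
```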

    Stock market uncertainty determination with news headlines: a digital twin approach

    Published online: 13 December 2023

    We present a novel digital twin model that implements advanced artificial intelligence techniques to robustly link news and stock market uncertainty. Building on central results in financial economics, our model efficiently identifies, quantifies, and forecasts the uncertainty encapsulated in the news by mirroring the human mind's information-processing mechanisms. After obtaining full statistical descriptions of the timeline and contextual patterns of the appearances of specific words, the applied data mining techniques lead to the definition of regions of homogeneous knowledge. The absence of a clear assignment of informative elements to specific knowledge regions is regarded as uncertainty, which is then measured and quantified using Shannon entropy. Compared with standard models, the empirical analyses demonstrate the effectiveness of this approach in anticipating stock market uncertainty, showcasing a meaningful integration of natural language processing, artificial intelligence, and information theory to understand the perception of uncertainty encapsulated in the news by market agents and its subsequent impact on stock markets.
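    The uncertainty measure itself is easy to make concrete. In the sketch below, an informative element receives a probability distribution over knowledge regions, and an ambiguous assignment scores higher Shannon entropy. The assignment probabilities are illustrative; the paper derives them from the mined word patterns.

```python
# Shannon entropy as an uncertainty score over knowledge-region assignments.
import numpy as np

def shannon_entropy(p: np.ndarray) -> float:
    p = p[p > 0]                           # ignore zero-probability regions
    return float(-(p * np.log2(p)).sum())  # bits

clear = np.array([0.90, 0.05, 0.05])    # near-certain assignment -> low entropy
unclear = np.array([0.40, 0.35, 0.25])  # ambiguous assignment -> high entropy
print(shannon_entropy(clear), shannon_entropy(unclear))
```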

    Vol. 15, No. 1 (Full Issue)


    Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

    The present paper explores the technical efficiency of four hotels of the Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking of these four hotel units located in Portugal is established using Stochastic Frontier Analysis. This methodology makes it possible to discriminate between measurement error and systematic inefficiencies in the estimation process, enabling investigation of the main causes of inefficiency. Several suggestions for efficiency improvement are made for each hotel studied.
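    For readers unfamiliar with the method, here is a compact sketch of the standard normal/half-normal stochastic production frontier (Aigner-Lovell-Schmidt), estimated by maximum likelihood on synthetic data. The paper's actual specification for the hotel inputs and outputs may differ.

```python
# Normal/half-normal SFA: y = X @ beta + v - u, with noise v ~ N(0, s_v^2)
# and one-sided inefficiency u ~ |N(0, s_u^2)|; estimate by ML.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 0.5, 0.3])
y = X @ beta_true + rng.normal(0, 0.2, n) - np.abs(rng.normal(0, 0.4, n))

def neg_loglik(theta):
    beta, ls_v, ls_u = theta[:3], theta[3], theta[4]
    sv, su = np.exp(ls_v), np.exp(ls_u)       # log-params keep scales positive
    sigma, lam = np.hypot(sv, su), su / sv
    eps = y - X @ beta                        # composed error v - u
    ll = (np.log(2) - np.log(sigma) + norm.logpdf(eps / sigma)
          + norm.logcdf(-eps * lam / sigma))
    return -ll.sum()

res = minimize(neg_loglik, x0=np.array([0.0, 0.0, 0.0, -1.0, -1.0]),
               method="BFGS")
print("beta:", res.x[:3], " sigma_v, sigma_u:", np.exp(res.x[3:]))
```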

    How on Earth Did Spanish Banking Sell the Housing Stock?

    The accumulation of properties by Spanish banks during the crisis of the first decade of the 21st century has definitively changed the housing market. An optimal house price valuation is useful for determining a bank's actual financial situation. Furthermore, properties valued according to the market can be sold in a shorter span of time and at a better price. Using a sample of 24,781 properties and a simulation exercise, we aim to identify the decision criteria that Spanish banks used to decide which properties would be sold and at what price. The comparison between four methods used to value real estate (artificial neural networks, semi-log regressions, a combined model by means of weighted least squares regression, and quantile regressions) and the actual situation suggests that banks aimed to maximize the reversal of impairment losses, even though this meant capital losses, selling fewer properties, and decreasing their revenues. Therefore, the actual combined result was very detrimental to banking and, consequently, to Spanish society because of its banking bailout.
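    As an illustration of one of the four valuation methods, the sketch below fits a hedonic quantile regression of log price on property characteristics. The variables and data are hypothetical stand-ins for the study's 24,781-property sample; lower quantiles can proxy quick-sale valuations.

```python
# Hedonic quantile regression of log price on property characteristics.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "area_m2": rng.uniform(40, 200, 1000),   # hypothetical features
    "rooms": rng.integers(1, 6, 1000),
})
df["log_price"] = (10 + 0.008 * df["area_m2"] + 0.05 * df["rooms"]
                   + rng.normal(0, 0.2, 1000))

# median (q=0.5) and lower-tail (q=0.1) hedonic valuations
for q in (0.1, 0.5):
    fit = smf.quantreg("log_price ~ area_m2 + rooms", df).fit(q=q)
    print(f"q={q}:", fit.params.round(4).to_dict())
```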

    A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative.

    Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients’ predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, whose removal may introduce severe bias. Several multiple imputation algorithms have been proposed to attempt to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm’s parameters and data-related modeling choices are also both crucial and challenging.
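    One common way to compare imputation algorithms, in the spirit of the comparison described here, is to mask known entries, impute them with each algorithm, and score the results against the held-out truth. The sketch below does this with three scikit-learn imputers as illustrative choices; the study's own algorithm set and metrics may differ.

```python
# Compare imputers by masking known values and scoring the reconstructions.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(300, 5))
X_miss = X_true.copy()
mask = rng.random(X_true.shape) < 0.2     # 20% missing completely at random
X_miss[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(),
    "mice-like": IterativeImputer(random_state=0),
}
for name, imp in imputers.items():
    X_hat = imp.fit_transform(X_miss)
    rmse = np.sqrt(((X_hat[mask] - X_true[mask]) ** 2).mean())
    print(f"{name:10s} RMSE on masked entries: {rmse:.4f}")
```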