
    Conjecturing-Based Computational Discovery of Patterns in Data

    Modern machine learning methods are designed to exploit complex patterns in data regardless of their form, while not necessarily revealing them to the investigator. Here we demonstrate situations where modern machine learning methods are ill-equipped to reveal feature interaction effects and other nonlinear relationships. We propose the use of a conjecturing machine that generates feature relationships, in the form of bounds for numerical features and boolean expressions for nominal features, that are ignored by machine learning algorithms. The proposed framework is demonstrated for a classification problem with an interaction effect and a nonlinear regression problem. In both settings, true underlying relationships are revealed and generalization performance improves. The framework is then applied to patient-level data on COVID-19 outcomes to suggest possible risk factors.
    Comment: 25 pages, 6 figures
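    A minimal illustration of the idea above (not the authors' code; the data, conjectured rule, and classifier are hypothetical): a linear classifier misses an XOR-style feature interaction, and appending the conjectured boolean expression as an explicit feature recovers it.

```python
# Sketch only: a conjectured boolean relationship added as an engineered feature.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)   # hidden XOR-style interaction

# A linear model alone cannot represent the interaction ...
plain = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# ... but appending the conjectured boolean expression as a feature fixes that.
X_aug = np.column_stack([X, ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(float)])
augmented = cross_val_score(LogisticRegression(), X_aug, y, cv=5).mean()

print(f"accuracy, raw features:       {plain:.3f}")
print(f"accuracy, + conjectured rule: {augmented:.3f}")
```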

    A Multi-Contextual Approach to Modeling the Impact of Critical Highway Work Zones in Large Urban Corridors

    Accurate Construction Work Zone (CWZ) impact assessments of unprecedented travel inconvenience to the general public are required for all federally funded highway infrastructure improvement projects. These assessments are critical, but they are also very difficult to perform. Most existing prediction approaches are project-specific, short-term, and univariate, and thus incapable of benchmarking the potential traffic impact of CWZs for highway construction projects. This study fills these gaps by creating a big-data-based decision-support framework and testing whether it can reliably predict the potential impact of a CWZ under arbitrary lane closure scenarios. The proposed analytical framework, “Multi-contextual learning for the Impact of Critical Urban highway work Zones” (MICUZ), is unique in that it models the impact of CWZ operations through a multi-contextual quantitative method utilizing sensor-based big transportation data. MICUZ was developed through a three-phase modeling process. First, the robustness of the collected sensor data was examined through a Wheeler’s repeatability and reproducibility analysis to verify the homogeneity of the variability of traffic flow data. The analysis led to the notable conclusion that the proposed framework is feasible due to the relative simplicity and periodicity of highway traffic profiles. Second, a machine-learning algorithm using a Feedforward Neural Network (FNN) technique was applied to model the multi-contextual aspects of long-term traffic flow predictions. The validation study showed that the proposed multi-contextual FNN yields accurate predictions of traffic flow rates and truck percentages. Third, employing these predicted traffic parameters, a curve-fitting modeling technique was implemented to quantify the impact of what-if lane closures on the overall traffic flow. The robustness of the proposed curve-fitting models was then verified and validated by measuring forecast accuracy. The results show that MICUZ can recognize how typical regional traffic patterns react to existing CWZs and lane closure tactics, and can reliably quantify the probable travel time delays at CWZs in heavily trafficked urban cores. The proposed framework provides a rigorous theoretical basis for comparatively analyzing what-if construction scenarios, enabling engineers and planners to choose the most efficient transportation management plans more quickly and accurately.
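    As an illustrative sketch of the framework's second phase only, the following fits a small feedforward neural network to synthetic, hypothetical contextual features (hour of day, weekday, number of closed lanes); it does not reproduce the MICUZ dataset or model.

```python
# Sketch: FNN regression of traffic flow on contextual inputs (synthetic data).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

rng = np.random.default_rng(42)
n = 2000
hour = rng.integers(0, 24, n)          # time of day
weekday = rng.integers(0, 7, n)        # day of week (5, 6 = weekend)
closed_lanes = rng.integers(0, 3, n)   # hypothetical lane-closure context

# Synthetic hourly flow with a daily peak, a weekend drop, and a closure penalty.
flow = (2000 + 800 * np.sin((hour - 6) / 24 * 2 * np.pi)
        - 300 * (weekday >= 5) - 250 * closed_lanes
        + rng.normal(0, 80, n))

X = np.column_stack([hour, weekday, closed_lanes])
X_train, X_test, y_train, y_test = train_test_split(X, flow, random_state=0)

fnn = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(32, 16),
                                 max_iter=2000, random_state=0))
fnn.fit(X_train, y_train)
print("test MAPE:", mean_absolute_percentage_error(y_test, fnn.predict(X_test)))
```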

    Machine Learning for Biometrics

    Biometrics aims at reliable and robust identification of humans from their personal traits, mainly for security and authentication purposes, but also for identifying and tracking the users of smarter applications. Frequently considered modalities are fingerprint, face, iris, palmprint and voice, but there are many other possible biometrics, including gait, ear image, retina, DNA, and even behaviours. This chapter presents a survey of machine learning methods used for biometrics applications, and identifies relevant research issues. We focus on three areas of interest: offline methods for biometric template construction and recognition, information fusion methods for integrating multiple biometrics to obtain robust results, and methods for dealing with temporal information. By introducing exemplary and influential machine learning approaches in the context of specific biometrics applications, we hope to provide the reader with the means to create novel machine learning solutions to challenging biometrics problems.
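    One information-fusion strategy surveyed in such work is score-level fusion of multiple matchers; the sketch below is a hypothetical example with made-up matcher scores, weights, and threshold, not an excerpt from the chapter.

```python
# Sketch: score-level fusion of two biometric matchers (e.g., face + fingerprint).
import numpy as np

def min_max_normalize(scores):
    """Map raw matcher scores to [0, 1] so scores on different scales can be combined."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

face_scores = np.array([0.42, 0.91, 0.55, 0.13])        # one score per probe
finger_scores = np.array([310.0, 720.0, 505.0, 150.0])  # different scale

# Equal-weight sum rule on normalized scores; weights/threshold are illustrative.
fused = 0.5 * min_max_normalize(face_scores) + 0.5 * min_max_normalize(finger_scores)
decisions = fused >= 0.5   # accept / reject per probe
print(fused, decisions)
```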

    Distribution of Mutual Information from Complete and Incomplete Data

    Mutual information is widely used, in a descriptive way, to measure the stochastic dependence of categorical random variables. In order to address questions such as the reliability of the descriptive value, one must consider sample-to-population inferential approaches. This paper deals with the posterior distribution of mutual information, as obtained in a Bayesian framework by a second-order Dirichlet prior distribution. The exact analytical expression for the mean, and analytical approximations for the variance, skewness and kurtosis are derived. These approximations have a guaranteed accuracy level of the order O(1/n^3), where n is the sample size. Leading order approximations for the mean and the variance are derived in the case of incomplete samples. The derived analytical expressions allow the distribution of mutual information to be approximated reliably and quickly. In fact, the derived expressions can be computed with the same order of complexity needed for descriptive mutual information. This makes the distribution of mutual information a concrete alternative to descriptive mutual information in many applications which would benefit from moving to the inductive side. Some of these prospective applications are discussed, and one of them, namely feature selection, is shown to perform significantly better when inductive mutual information is used.
    Comment: 26 pages, LaTeX, 5 figures, 4 tables
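    The following sketch is not the paper's closed-form approximations; it only illustrates the object being studied, approximating the posterior distribution of mutual information for a small contingency table by Monte Carlo sampling from a Dirichlet posterior and comparing it with the descriptive (plug-in) estimate. The counts and prior below are hypothetical.

```python
# Sketch: Monte Carlo view of the posterior distribution of mutual information.
import numpy as np

def mutual_information(p):
    """Mutual information (in nats) of a joint probability table p."""
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p * np.log(p / (px * py))
    return np.nansum(terms)   # treats 0 * log(0) terms as 0

counts = np.array([[30.0, 10.0], [5.0, 25.0]])   # observed cell counts n_ij
alpha = 1.0                                      # symmetric Dirichlet prior pseudo-count

plug_in = mutual_information(counts / counts.sum())

rng = np.random.default_rng(0)
samples = rng.dirichlet((counts + alpha).ravel(), size=5000)
mi_samples = np.array([mutual_information(s.reshape(counts.shape)) for s in samples])

print(f"descriptive MI : {plug_in:.4f}")
print(f"posterior mean : {mi_samples.mean():.4f}  sd: {mi_samples.std():.4f}")
```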

    Predicting class-imbalanced business risk using resampling, regularization, and model ensembling algorithms

    We aim to develop and improve imbalanced business risk modeling by jointly using proper evaluation criteria, resampling, cross-validation, classifier regularization, and ensembling techniques. The Area Under the Receiver Operating Characteristic Curve (ROC AUC) is used for model comparison based on 10-fold cross-validation. Two undersampling strategies, random undersampling (RUS) and cluster centroid undersampling (CCUS), as well as two oversampling methods, random oversampling (ROS) and the Synthetic Minority Oversampling Technique (SMOTE), are applied. Three highly interpretable classifiers are implemented: logistic regression without regularization (LR), L1-regularized LR (L1LR), and decision tree (DT). Two ensembling techniques, Bagging and Boosting, are applied to the DT classifier for further model improvement. The results show that Boosting on DT using the oversampled data containing 50% positives via SMOTE is the optimal model, achieving AUC, recall, and F1 score values of 0.8633, 0.9260, and 0.8907, respectively.
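    A minimal sketch of the reported best configuration, on synthetic data (the actual business-risk dataset, features, and tuning are not reproduced): SMOTE oversampling to a 1:1 class ratio applied inside each training fold, an AdaBoost-boosted decision tree, and 10-fold cross-validated ROC AUC.

```python
# Sketch: SMOTE + boosted decision tree evaluated with 10-fold CV ROC AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Synthetic imbalanced data standing in for the business-risk dataset.
X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)

model = Pipeline([
    ("smote", SMOTE(sampling_strategy=1.0, random_state=0)),   # resample to 50% positives
    ("boosted_dt", AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=3),         # 'base_estimator' in older sklearn
        n_estimators=200, random_state=0)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"10-fold CV AUC: {auc.mean():.3f}")
```

    Wrapping SMOTE in an imblearn pipeline ensures the oversampling is fit only on each training fold, so the validation folds remain untouched and the cross-validated AUC is not inflated by synthetic samples.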