353 research outputs found

    Usability of one-class classification in mapping and detecting changes in bare peat surfaces in the tundra

    Arctic areas have experienced greening and changes in permafrost caused by climate change during recent decades. However, automated methods for mapping changes in the fine-scale patterns of permafrost landscapes have been lacking. We mapped the areal coverage of bare peat areas, and changes in that coverage, in a peat plateau located in north-western Russia between 2007 and 2015. We utilized QuickBird and WorldView-3 satellite image data in an object-based setting. We compared four classifiers (one-class support vector machine, binary support vector machine, random forest, rotation forest), both in a fully supervised binary setting and with positive and unlabelled training data. There was notable variation in classification performance. The bare peat F-score varied between 0.77 and 0.96 when evaluated by cross-validated training data and between 0.22 and 0.57 when evaluated by independent test data. Overall, random forest performed the most robustly, but every classifier performed well in at least some classifications. Over the 8-year period, there was a 21%-26% decrease in bare peat areal coverage. We conclude that (1) the tested classifiers can be used in one-class settings and (2) there is a need to develop methods for tracking changes in single land cover types. Peer reviewed.
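    The one-class-versus-binary contrast in this abstract can be sketched as follows. This is an illustrative example with synthetic "spectral" features and invented parameters, assuming scikit-learn; it is not the study's code.

```python
# Sketch: a one-class SVM trained on positive ("bare peat") samples only,
# versus a fully supervised binary random forest. All cluster means,
# spreads, and hyperparameters here are invented for illustration.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Two synthetic spectral clusters: "bare peat" (positive) and "other"
pos = rng.normal(loc=0.8, scale=0.1, size=(200, 3))
neg = rng.normal(loc=0.3, scale=0.1, size=(200, 3))
X = np.vstack([pos, neg])
y = np.array([1] * 200 + [0] * 200)

# One-class SVM: sees positive samples only at training time
ocsvm = OneClassSVM(nu=0.1, gamma="scale").fit(pos)
pred_oc = (ocsvm.predict(X) == 1).astype(int)  # +1 = inlier = bare peat

# Binary random forest: trained with both classes
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
pred_rf = rf.predict(X)

print(f"OC-SVM F-score: {f1_score(y, pred_oc):.2f}")
print(f"RF F-score:     {f1_score(y, pred_rf):.2f}")
```

    On well-separated synthetic clusters both do well; the abstract's point is that the gap widens on independent test data, which a sketch like this cannot show.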

    Real-time head movement tracking through earables in moving vehicles

    Abstract. The Internet of Things is enabling innovations in the automotive industry by expanding the capabilities of vehicles through connection to the cloud. One important application domain is traffic safety, which can benefit from monitoring the driver's condition to determine whether they are capable of safely handling the vehicle. By detecting drowsiness, inattentiveness, and distraction of the driver, it is possible to react before accidents happen. This thesis explores how accelerometer and gyroscope data collected using earables can be used to classify the orientation of the driver's head in a moving vehicle. It is found that machine learning algorithms such as Random Forest and K-Nearest Neighbor can reach fairly accurate classifications even without applying any noise reduction to the signal data. Data cleaning and transformation approaches are studied to see how the models could be improved further. This study paves the way for the development of driver monitoring systems capable of reacting to anomalous driving behavior before traffic accidents can happen.
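    A minimal sketch of the kind of classification the thesis describes: because gravity's projection onto the sensor axes shifts with head pitch and roll, even raw accelerometer means separate orientations. The sensor values, class labels, and cluster centres below are all invented, and the classifier is K-Nearest Neighbor, one of the algorithms the abstract names.

```python
# Illustrative sketch (not the thesis code): classify head orientation
# from synthetic 3-axis accelerometer readings with K-Nearest Neighbors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# Simulated mean accelerometer vectors (m/s^2) for three orientations,
# plus Gaussian sensor noise. Values are invented for the example.
centers = {"straight": [0.0, 0.0, 9.8],
           "left":     [4.0, 0.0, 8.9],
           "right":    [-4.0, 0.0, 8.9]}
X, y = [], []
for label, c in centers.items():
    X.append(rng.normal(c, 0.5, size=(100, 3)))
    y += [label] * 100
X = np.vstack(X)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[3.9, 0.1, 8.8]])[0])  # a reading near the "left" cluster
```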

    Deep learning predicts total knee replacement from magnetic resonance images

    Knee osteoarthritis (OA) is a common musculoskeletal disorder in the United States. When diagnosed at early stages, lifestyle interventions such as exercise and weight loss can slow OA progression, but at later stages, only an invasive option is available: total knee replacement (TKR). Though a generally successful procedure, only 2/3 of patients who undergo it report their knees feeling "normal" post-operation, and complications can arise that require revision. This necessitates a model to identify a population at higher risk of TKR, particularly at less advanced stages of OA, so that appropriate treatments can be implemented to slow OA progression and delay TKR. Here, we present a deep learning pipeline that leverages MRI images and clinical and demographic information to predict TKR with AUC 0.834 ± 0.036 (p < 0.05). Most notably, the pipeline predicts TKR with AUC 0.943 ± 0.057 (p < 0.05) for patients without OA. Furthermore, we develop occlusion maps for case-control pairs in the test data and compare the regions used by the model in both, thereby identifying TKR imaging biomarkers. As such, this work takes strides towards a pipeline with clinical utility, and the identified biomarkers further our understanding of OA progression and eventual TKR onset. Comment: 18 pages, 5 figures (4 in main article, 1 supplemental), 8 tables (5 in main article, 3 supplemental). Submitted to Scientific Reports and currently in revision.
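    The occlusion-map idea mentioned in the abstract can be sketched generically: slide a masking patch over the image and record how much the model's score drops at each position; large drops mark regions the model relies on. The "model" below is a toy stand-in function, not the paper's network, and the patch size and fill value are arbitrary choices.

```python
# Generic occlusion-map sketch: mask each patch in turn and record
# the drop in the model's score (drop = importance of that region).
import numpy as np

def occlusion_map(image, model_score, patch=4, stride=4, fill=0.0):
    """Return a grid of score drops when each patch region is occluded."""
    h, w = image.shape
    base = model_score(image)
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = image.copy()
            occluded[i*stride:i*stride+patch, j*stride:j*stride+patch] = fill
            heat[i, j] = base - model_score(occluded)
    return heat

# Toy model: the "risk score" is the mean intensity of one fixed region,
# standing in for a network attending to, say, the joint space.
score = lambda img: img[8:12, 8:12].mean()
img = np.ones((16, 16))
heat = occlusion_map(img, score)
print(heat)  # the only nonzero drop is where the occluder covers that region
```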

    Land of 10,000 pixels: applications of remote sensing & geospatial data to improve forest management in northern Minnesota, USA

    2018 Summer. Includes bibliographical references. The use of remote sensing and geospatial data has become commonplace in a wide variety of ecological applications. However, the utility of these applications is often limited by field sampling design or by the constraints on spatial resolution inherent in remote sensing technology. Because land managers require map products that more accurately reflect habitat composition at local, operational levels, there is a need to overcome these limitations and improve upon currently available data products. This study addresses that need through two applications demonstrating the ability of remote sensing to enhance operational forest management at local scales. In the first chapter, remote sensing products were evaluated to improve upon regional estimates of the spatial configuration, extent, and distribution of black ash derived from forest inventory and analysis (FIA) survey data. To do this, spectral and topographic indices, as well as ancillary geospatial data, were combined with FIA survey information in a non-parametric modeling framework to predict the presence and absence of black ash dominated stands in northern Minnesota, USA. The final model produced low error rates (overall: 14.5%, presence: 14.3%, absence: 14.6%; AUC: 0.92) and was strongly informed by an optimized set of predictors related to soil saturation and seasonal growth patterns. The model allowed the production of accurate, fine-scale presence/absence maps of black ash stand dominance that can ultimately be used in support of invasive species risk management. In the second chapter, metrics from low-density LiDAR were evaluated for improving upon estimates of forest canopy attributes traditionally accessed through the LANDFIRE program.
    To do this, LiDAR metrics were combined with a Landsat time-series derived canopy cover layer in a random forest k-nearest neighbor imputation approach to estimate canopy bulk density, two measures of canopy base height, and stand age across the Boundary Waters Canoe Area in northern Minnesota, USA. These models produced strong relationships between the estimates of canopy fuel attributes and field-based data for stand age (R2 = 0.82, RMSE = 10.12 years), crown fuel base height (R2 = 0.79, RMSE = 1.10 m), live crown base height (R2 = 0.71, RMSE = 1.60 m), and canopy bulk density (R2 = 0.58, RMSE = 0.09 kg/m3). An additional standard randomForest model of canopy height was less successful (R2 = 0.33, RMSE = 2.08 m). The map products generated from these models improve upon the accuracy of nationally available canopy fuel products and provide local forest managers with cost-efficient and operationally ready data required to simulate fire behavior and support management efforts.
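    The imputation step described above can be sketched in simplified form: estimate a canopy attribute for an unsampled pixel from its k nearest field plots in LiDAR-metric feature space. This uses a plain k-NN regressor as a stand-in for the random-forest-proximity variant the thesis uses, and the feature names, value ranges, and the synthetic relationship between metrics and canopy bulk density are all invented.

```python
# Hedged sketch of k-NN imputation of a canopy attribute from LiDAR
# metrics. Features: [mean return height (m), canopy cover fraction].
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# 150 synthetic field plots with plausible but invented metric ranges
plots = rng.uniform([2.0, 0.1], [25.0, 0.95], size=(150, 2))
# Invented canopy bulk density: rises with cover, falls with height
cbd = 0.15 * plots[:, 1] / np.sqrt(plots[:, 0]) + rng.normal(0, 0.002, 150)

# Standardize features so height (2-25 m) doesn't dominate the distance
knn = make_pipeline(StandardScaler(),
                    KNeighborsRegressor(n_neighbors=5)).fit(plots, cbd)

# Impute for a "pixel" whose LiDAR metrics fall inside the plot range
pixel = np.array([[12.0, 0.6]])
print(f"imputed CBD: {knn.predict(pixel)[0]:.3f} kg/m^3")
```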

    The impact of training data characteristics on ensemble classification of land cover

    Supervised classification of remote sensing imagery has long been recognised as an essential technology for large area land cover mapping. Remote sensing derived land cover and forest classification maps are important sources of information for understanding environmental processes and informing natural resource management decision making. In recent years, the supervised transformation of remote sensing data into thematic products has been advanced through the introduction and development of machine learning classification techniques. Applied to a variety of science and engineering problems over the past twenty years (Lary et al., 2016), machine learning provides greater accuracy and efficiency than traditional parametric classifiers and is capable of dealing with large data volumes across complex measurement spaces. The random forest (RF) classifier in particular has become popular in the remote sensing community, with a range of commonly cited advantages, including its low parameterisation requirements, excellent classification results, and ability to handle noisy observation data and outliers in a complex measurement space, even with training data that are small relative to the study area size. In the context of large area land cover classification for forest cover, using multisource remote sensing and geospatial data, this research sets out to examine two proposed advantages of the RF classifier: insensitivity to training data noise (mislabelling) and the handling of training data class imbalance. Through margin theory, the research also investigates the utility of ensemble learning – in which multiple base classifiers are combined to reduce generalisation error in classification – as a means of designing more efficient classifiers, improving classification performance, and reducing reference (training and test) data redundancy.
    The first part of the thesis (chapters 2 and 3) introduces the experimental setting and data used in the research, including a description (in chapter 2) of the sampling framework for the reference data used in the classification experiments that follow. Chapter 3 evaluates the performance of the RF classifier applied across 7.2 million hectares of public land study area in Victoria, Australia. This chapter describes an open-source framework for deploying the RF classifier over large areas and processing significant volumes of multi-source remote sensing and ancillary spatial data. The second part of the thesis (research chapters 4 through 6) examines the effect of training data characteristics (class imbalance and mislabelling) on the performance of RF, and explores the application of the ensemble margin as a means of both examining RF classification performance and informing training data sampling to improve classification accuracy. Results of the binary and multiclass experiments described in chapter 4 provide insights into the behaviour of RF when training data are not evenly distributed among classes and contain systematically mislabelled instances. Results show that while the error rate of the RF classifier is relatively insensitive to mislabelled training data (in the multiclass experiment, overall Kappa fell from 78.3% with no mislabelled instances to 70.1% with 25% mislabelling in each class), the associated confidence falls at a faster rate than overall accuracy as the rate of mislabelled training data increases. This section also demonstrates that imbalanced training data can be introduced to reduce error in the classes that are most difficult to classify. The relationship between per-class and overall classification performance and the diversity of members in an RF ensemble classifier is explored through experiments presented in chapter 5.
    This research examines ways of targeting particular training data samples to induce RF ensemble diversity and improve per-class and overall classification performance and efficiency. Through use of the ensemble margin, this study offers insights into the trade-off between ensemble classification accuracy and diversity. The research shows that boosting diversity among RF ensemble members, by emphasising the contribution of lower-margin training instances in the learning process, is an effective means of improving classification performance, particularly for more difficult or rarer classes, and of reducing information redundancy and improving the efficiency of classification problems. Research chapter 6 looks at the application of the RF classifier for calculating Landscape Pattern Indices (LPIs) from classification prediction maps, and examines the sensitivity of these indices to training data characteristics and to sampling based on the ensemble margin. This research reveals that a range of commonly used LPIs have significant sensitivity to training data mislabelling in RF classification, as well as to margin-based training data sampling. In conclusion, this thesis examines proposed advantages of the popular machine learning classifier, random forests: its relative insensitivity to training data noise (mislabelling) and its ability to handle class imbalance. The research also explores the utility of the ensemble margin for designing more efficient classifiers, measuring and improving classification performance, and designing ensemble classification systems that use reference data more efficiently and effectively, with less data redundancy. These findings have practical applications and implications for large area land cover classification, for which the generation of high quality reference data is often a time consuming, subjective, and expensive exercise.
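    The ensemble margin central to this thesis can be sketched for a random forest: for each sample, the margin is the vote share of the true class minus the largest vote share among the other classes, so low-margin instances are the "difficult" ones that margin-based sampling would emphasise. The data below are synthetic and the vote-counting is one common formulation, not necessarily the thesis's exact definition.

```python
# Sketch: per-sample ensemble margin from a random forest's tree votes.
# Margin = vote fraction of true class minus the best competing fraction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Collect per-tree predictions, then per-class vote fractions per sample
votes = np.stack([tree.predict(X) for tree in rf.estimators_])  # (trees, n)
frac = np.stack([(votes == c).mean(axis=0) for c in rf.classes_], axis=1)

true_frac = frac[np.arange(len(y)), y]
others = frac.copy()
others[np.arange(len(y)), y] = -1.0          # mask the true class
margin = true_frac - others.max(axis=1)      # in [-1, 1]

print(f"mean margin: {margin.mean():.2f}, hardest: {margin.min():.2f}")
```

    Sorting training instances by this margin and re-weighting the low end is one way to realise the sampling strategy the abstract describes.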

    Non-H3 CDR template selection in antibody modeling through machine learning

    Antibodies are proteins generated by the adaptive immune system to recognize and counteract a plethora of pathogens through specific binding. This adaptive binding is mediated by structural diversity in the six complementarity-determining region (CDR) loops (H1, H2, H3, L1, L2 and L3), which also makes accurate structural modeling of CDRs challenging. Both homology and de novo modeling approaches have been used; to date, the former has achieved greater accuracy for the non-H3 loops. The homology modeling of non-H3 CDRs is more accurate because non-H3 CDR loops of the same length and type can be grouped into a few structural clusters. Most antibody-modeling suites utilize homology modeling for the non-H3 CDRs, differing only in the alignment algorithm and how/if they utilize structural clusters. RosettaAntibody and SAbPred do not explicitly assign query CDR sequences to clusters, whereas two other approaches, PIGS and Kotai Antibody Builder, utilize sequence-based rules to do so. While manually curated sequence rules can identify better structural templates, their curation requires extensive literature search and human effort, so they lag behind the deposition of new antibody structures and are infrequently updated. In this study, we propose a machine learning approach (Gradient Boosting Machine [GBM]) to learn the structural clusters of non-H3 CDRs from sequence alone. Compared to manual sequence rule curation, the GBM method simplifies feature selection and can easily integrate new data. We compare the classification results of the GBM method with those of RosettaAntibody in a 3-repeat 10-fold cross-validation (CV) scheme on the cluster-annotated antibody database PyIgClassify and observe an improvement in the classification accuracy of the concerned loops from 84.5% ± 0.24% to 88.16% ± 0.056%.
    The GBM models reduce errors in specific cluster membership misclassifications when the involved clusters have relatively abundant data. Based on the factors identified, we suggest methods that can enrich structural classes with sparse data to further improve prediction accuracy in future studies.
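    The core idea, a gradient boosting classifier assigning loop sequences to structural clusters from sequence alone, can be sketched on toy data. The sequences, the two "clusters", their positional preferences, and the one-hot encoding below are all invented; only the classifier family (GBM) and the cross-validation evaluation mirror the abstract.

```python
# Hypothetical sketch: gradient boosting on one-hot encoded peptide
# sequences, evaluated with cross-validation. Data are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
AA = "ACDEFGHIKLMNPQRSTVWY"

def encode(seq):
    """One-hot encode a fixed-length peptide, position by position."""
    v = np.zeros(len(seq) * len(AA))
    for i, a in enumerate(seq):
        v[i * len(AA) + AA.index(a)] = 1.0
    return v

def sample(cluster, n=60):
    """Invented clusters with a strong positional motif at the N-terminus."""
    motif = "GSG" if cluster == 0 else "PNP"
    return [motif + "".join(AA[rng.integers(20)] for _ in range(4))
            for _ in range(n)]

X = np.array([encode(s) for c in (0, 1) for s in sample(c)])
y = np.array([0] * 60 + [1] * 60)

gbm = GradientBoostingClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(gbm, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
```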

    Fast learning optimized prediction methodology for protein secondary structure prediction, relative solvent accessibility prediction and phosphorylation prediction

    Computational methods are rapidly gaining importance in the field of structural biology, mostly due to the explosive progress of genome sequencing projects and the large disparity between the number of known sequences and the number of known structures. There has been exponential growth in the number of available protein sequences and much slower growth in the number of structures. There is therefore an urgent need to compute structures for these sequences and to identify their functions. Developing methods that satisfy these needs both efficiently and accurately is of paramount importance for advances in many biomedical fields, including a better basic understanding of aberrant states of stress and disease, drug discovery, and the discovery of biomarkers. Several aspects of secondary structure prediction and other protein structure-related predictions are investigated using different types of information, such as knowledge-based potentials derived from amino acids in protein sequences, physicochemical properties of amino acids, and propensities of amino acids to appear at the ends of secondary structures. Investigating the performance of these secondary structure predictions by type of amino acid highlights some interesting aspects of the influence of individual amino acid types on the formation of secondary structures and points toward ways to make further gains. Other research areas include Relative Solvent Accessibility (RSA) prediction and the prediction of phosphorylation sites, one of the Post-Translational Modification (PTM) sites in proteins. Protein secondary structures and other features of proteins are predicted efficiently, reliably, less expensively and more accurately.
    A novel method called Fast Learning Optimized PREDiction (FLOPRED) Methodology is proposed for predicting protein secondary structures and other features, using knowledge-based potentials, a Neural Network based Extreme Learning Machine (ELM) and advanced Particle Swarm Optimization (PSO) techniques that yield better and faster convergence and more accurate results. These techniques yield superior classification of secondary structures, with a training accuracy of 93.33% and a testing accuracy of 92.24% (standard deviation 0.48%) obtained for a small group of 84 proteins. The Matthews correlation coefficient for these secondary structures ranges between 80.58% and 84.30%. Accuracies for individual amino acids range between 83% and 92%, with an average standard deviation between 0.3% and 2.9% for the 20 amino acids. On a larger set of 415 proteins, we obtain a testing accuracy of 86.5% with a standard deviation of 1.38%. These results are significantly higher than those found in the literature. Prediction of protein secondary structure from amino acid sequence is a common step toward predicting a protein's 3-D structure. Additional information, such as the biophysical properties of the amino acids, can help improve the results of secondary structure prediction. A database of protein physicochemical properties is used to encode protein sequences as features, and these data are used for secondary structure prediction with FLOPRED. Preliminary studies using a Genetic Algorithm (GA) for feature selection, Principal Component Analysis (PCA) for feature reduction, and FLOPRED for classification give promising results. Some amino acids appear more often at the ends of secondary structures than others. A preliminary study has indicated that secondary structure accuracy can be improved by as much as 6% by including these effects for residues present at the ends of alpha-helix, beta-strand and coil.
    A study on RSA prediction using ELM shows large gains in processing speed compared to using support vector machines for classification. This indicates that ELM offers a distinct advantage in terms of processing speed and performance for RSA. Additional gains in accuracy are possible when the more advanced FLOPRED algorithm and PSO optimization are implemented. Phosphorylation is a post-translational modification that often controls and regulates the activities of proteins; it is an important mechanism for regulation. Phosphorylated sites are often present in intrinsically disordered regions of proteins that lack unique tertiary structures, and thus less information is available about the structures of phosphorylated sites. It is important to be able to computationally predict phosphorylation sites in protein sequences obtained from mass-scale sequencing of genomes. Phosphorylation sites may aid in determining the functions of a protein and in better understanding the mechanisms of protein function in healthy and diseased states. FLOPRED is used to model and predict experimentally determined phosphorylation sites in protein sequences. The new PSO optimization included in FLOPRED enables the prediction of phosphorylation sites with higher accuracy and better generalization. Preliminary studies on 984 sequences demonstrate that this model can predict phosphorylation sites with a training accuracy of 92.53%, a testing accuracy of 91.42%, and a Matthews correlation coefficient of 83.9%. In summary, secondary structure prediction, Relative Solvent Accessibility prediction and phosphorylation site prediction have been carried out on multiple sets of data, encoded with a variety of information drawn from proteins and the physicochemical properties of their constituent amino acids.
    Improved and efficient algorithms called S-ELM and FLOPRED, which are based on Neural Networks and Particle Swarm Optimization, are used for classifying and predicting protein sequences. Analysis of the results of these studies provides new and interesting insights into the influence of amino acids on secondary structure prediction. S-ELM and FLOPRED have also proven to be robust and efficient for predicting the relative solvent accessibility of proteins and phosphorylation sites. These studies show that the method is robust and resilient and can be applied for a variety of purposes. It can be expected to yield higher classification accuracy and better generalization performance than previous methods.
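    The Extreme Learning Machine at the heart of FLOPRED has a simple core that can be sketched in a few lines: hidden-layer weights are drawn at random and fixed, and only the output weights are solved in closed form by least squares, which is what makes training fast. The data below are a synthetic toy problem, not protein features, and the layer sizes are arbitrary.

```python
# Minimal ELM sketch: random fixed hidden layer + least-squares output layer.
import numpy as np

rng = np.random.default_rng(4)

def elm_fit(X, Y, n_hidden=50):
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights
    b = rng.normal(size=n_hidden)                 # random biases
    H = np.tanh(X @ W + b)                        # hidden activations
    beta, *_ = np.linalg.lstsq(H, Y, rcond=None)  # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy 2-class problem with +/-1 targets and a linear true boundary
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])
W, b, beta = elm_fit(X, y[:, None], n_hidden=50)
pred = np.sign(elm_predict(X, W, b, beta)).ravel()
print(f"training accuracy: {(pred == y).mean():.2f}")
```

    The abstract's speed claim follows from this structure: there is no iterative weight training, only one linear solve.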

    Bridging the Domain-Gap in Computer Vision Tasks


    Permanent water and flash flood detection using global navigation satellite system reflectometry

    In this thesis, research on inland water extent and flash flood remote sensing using Global Navigation Satellite System Reflectometry (GNSS-R) data from the Cyclone Global Navigation Satellite System (CYGNSS) is presented. Firstly, a high-resolution Machine Learning (ML) method for detecting inland water extent using CYGNSS data is implemented via the Random Under-Sampling Boosted (RUSBoost) algorithm. The CYGNSS data of the year 2018 over the Congo and Amazon basins are gridded into 0.01° × 0.01° cells. The RUSBoost-based classifier is trained and tested with the CYGNSS data over the Congo basin. The Amazon basin data, which are unknown to the classifier, are then used for further evaluation. Using only three observables extracted from the CYGNSS Delay-Doppler Maps (DDMs), the proposed technique is able to detect 95.4% and 93.3% of the water bodies over the Congo and Amazon basins, respectively. The performance of the RUSBoost-based classifier is also compared with an image-processing-based inland water detection method; over the Congo and Amazon basins, the RUSBoost-based classifier achieves 3.9% and 14.2% higher water detection accuracy, respectively. Secondly, a flash flood detection method using the CYGNSS data is investigated. Considering Hurricane Harvey and Hurricane Irma as two case studies, six different Data Preparation Approaches (DPAs) for flood detection based on the CYGNSS data and the RUSBoost classification algorithm are investigated. Taking flood and land as two classes, flash flood detection is tackled as a binary classification problem. Eleven observables are extracted from the DDMs and, alongside two features from ancillary data, are considered in feature selection. All combinations of these observables, with and without the ancillary data, are fed into the classifier one by one with 5-fold cross-validation.
    Based on the test results, five observables plus the ancillary data are selected as a suitable feature vector for flood detection. Using the selected feature vector, six different DPAs are investigated and compared to find the best one for flash flood detection. The performance of the proposed method is then compared with that of a Support Vector Machine (SVM) based classifier. For Hurricane Harvey and Hurricane Irma, the selected method is able to detect 89.00% and 85.00% of flooded points, respectively, at a resolution of 500 m × 500 m, and the detection accuracy for non-flooded land points is 97.20% and 71.00%, respectively.
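    The class-imbalance idea behind RUSBoost, randomly under-sampling the majority ("land") class so the boosted ensemble is not swamped by it, can be sketched as follows. A full RUSBoost re-samples inside each boosting round (as in imbalanced-learn's RUSBoostClassifier); this stand-in under-samples once before AdaBoost, and the "land"/"flood" feature distributions and class sizes are invented.

```python
# Sketch: random under-sampling of the majority class + AdaBoost, as a
# simplified stand-in for RUSBoost on an imbalanced flood/land problem.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score

rng = np.random.default_rng(5)
# Imbalanced synthetic problem: 1000 "land" cells vs 50 "flood" cells
land = rng.normal(0.0, 1.0, size=(1000, 3))
flood = rng.normal(1.5, 1.0, size=(50, 3))
X = np.vstack([land, flood])
y = np.array([0] * 1000 + [1] * 50)

# Randomly under-sample the majority class to match the minority count
keep = rng.choice(np.where(y == 0)[0], size=(y == 1).sum(), replace=False)
idx = np.concatenate([keep, np.where(y == 1)[0]])
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X[idx], y[idx])

print(f"flood recall: {recall_score(y, clf.predict(X), pos_label=1):.2f}")
```

    Without the under-sampling step, a classifier trained on all 1050 points tends to favour the majority class and miss flooded cells, which is the failure mode RUSBoost is designed to avoid.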