
    Analysis of Microarray Data using Machine Learning Techniques on Scalable Platforms

    Microarray-based gene expression profiling has emerged as an efficient technique for the classification, diagnosis, prognosis, and treatment of cancer. Frequent changes in the behavior of this disease generate a huge volume of data. The data retrieved from microarrays carry veracity, and changes are observed over time (velocity). Moreover, microarray data are high-dimensional, with a very large number of features relative to the number of samples. Therefore, analyzing such high-dimensional microarray datasets within a short period is essential. A dataset often contains a huge number of genes, only a fraction of which are significantly expressed. Identifying the precise and interesting genes responsible for the cause of cancer is imperative in microarray data analysis. Most existing schemes employ a two-phase process: feature selection/extraction followed by classification. Our investigation starts with the analysis of microarray data using kernel-based classifiers, with feature selection performed using the statistical t-test. In this work, various kernel-based classifiers, namely the Extreme learning machine (ELM), the Relevance vector machine (RVM), and a newly proposed method called the kernel fuzzy inference system (KFIS), are implemented. The proposed models are investigated on three microarray datasets: Leukemia, Breast cancer, and Ovarian cancer. Finally, the performance of these classifiers is measured and compared with the Support vector machine (SVM). The results reveal that the proposed models classify the datasets efficiently, with performance comparable to the existing kernel-based classifiers. As data size increases, handling and processing these datasets becomes a bottleneck. Hence, a distributed, scalable cluster such as Hadoop is needed for storing (HDFS) and processing (MapReduce as well as Spark) the datasets efficiently. The next contribution in this thesis deals with the implementation of feature selection methods that are able to process the data in a distributed manner. Various statistical tests, namely the ANOVA, Kruskal-Wallis, and Friedman tests, are implemented using the MapReduce and Spark frameworks and executed on top of a Hadoop cluster. The performance of these scalable models is measured and compared with the conventional system. The results show that the proposed scalable models process data of large dimensions (GBs, TBs, etc.) efficiently, which is not possible with the traditional implementation of those algorithms. After selecting the relevant features, the next contribution of this thesis is a scalable implementation of the proximal support vector machine classifier, an efficient variant of SVM. The proposed classifier is implemented on the two scalable frameworks, MapReduce and Spark, and executed on the Hadoop cluster. The obtained results are compared with those obtained on the conventional system, and it is observed that the scalable cluster is well suited for Big data. Furthermore, it is concluded that Spark is more efficient than both MapReduce and the conventional system for analyzing Big datasets, owing to its intelligent handling of the data through Resilient distributed datasets (RDDs) and in-memory processing. Therefore, the next contribution of the thesis is the implementation of various scalable classifiers based on Spark.
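As a concrete illustration of the two-phase pipeline described above, the following is a minimal sketch only (not the thesis code): it uses synthetic data in place of the Leukemia/Breast/Ovarian datasets, and scikit-learn's SVC stands in for the ELM/RVM/KFIS models, which are not standard library components.

```python
# Minimal sketch: two-sample t-test feature ranking followed by an SVM baseline
# on a synthetic "microarray-like" matrix with many more features than samples.
# All data, feature counts, and thresholds here are illustrative.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 5000))          # 72 samples, 5000 genes
y = rng.integers(0, 2, size=72)          # binary class labels (e.g. two cancer subtypes)
X[y == 1, :50] += 1.0                    # make the first 50 genes informative

# Phase 1: rank genes by two-sample t-test p-value and keep the top k.
_, p_values = ttest_ind(X[y == 0], X[y == 1], axis=0)
top_genes = np.argsort(p_values)[:100]

# Phase 2: train a kernel classifier on the selected genes.
X_tr, X_te, y_tr, y_te = train_test_split(X[:, top_genes], y, test_size=0.3,
                                          random_state=0, stratify=y)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```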
    In this work, various classifiers, namely Logistic regression (LR), Support vector machine (SVM), Naive Bayes (NB), K-Nearest Neighbor (KNN), Artificial neural network (ANN), and Radial basis function network (RBFN) with two variants (hybrid and gradient-descent learning algorithms), are proposed and implemented using the Spark framework. The proposed scalable models are executed on the Hadoop cluster as well as on a conventional system, and the results are investigated. From the obtained results, it is observed that the scalable algorithms are far more efficient than the conventional system for processing Big datasets. The efficacy of the proposed scalable algorithms in handling Big datasets is investigated and compared with the conventional system (where data are not distributed, but kept on a standalone machine and processed in a traditional manner). The comparative analysis shows that the scalable algorithms process Big datasets much more efficiently on the Hadoop cluster than on the conventional system.
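A hedged sketch of what a Spark-based classifier of this kind can look like, using PySpark's built-in LogisticRegression; the HDFS path and column names are placeholders, not the thesis's actual data or code.

```python
# Illustrative PySpark sketch: train a logistic regression model on data stored
# in HDFS. "hdfs:///data/genes.csv" and the column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("scalable-classifier").getOrCreate()
df = spark.read.csv("hdfs:///data/genes.csv", header=True, inferSchema=True)

# Pack all non-label columns into a single feature vector column.
feature_cols = [c for c in df.columns if c != "label"]
assembled = VectorAssembler(inputCols=feature_cols,
                            outputCol="features").transform(df)

train, test = assembled.randomSplit([0.7, 0.3], seed=42)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

accuracy = MulticlassClassificationEvaluator(
    labelCol="label", metricName="accuracy").evaluate(model.transform(test))
print("test accuracy:", accuracy)
spark.stop()
```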

    Machine Learning with Metaheuristic Algorithms for Sustainable Water Resources Management

    The main aim of this book is to present various implementations of ML methods and metaheuristic algorithms to improve the modelling and prediction of hydrological and water resources phenomena that are of vital importance in water resources management.
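As an illustration of how such a hybrid setup can be wired together (a minimal sketch under stated assumptions only: the rainfall-runoff data are synthetic, and the choice of differential evolution to tune a support vector regressor is an assumption, not taken from the book):

```python
# Illustrative sketch: use a metaheuristic (differential evolution) to tune the
# hyperparameters of an ML regressor (SVR) on a synthetic rainfall-runoff task.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(1)
rainfall = rng.gamma(2.0, 5.0, size=(300, 3))            # 3 lagged rainfall inputs
runoff = 0.6 * rainfall[:, 0] + 0.3 * rainfall[:, 1] + rng.normal(0, 1, 300)

def objective(params):
    """Negative cross-validated R^2 for a given (C, epsilon, gamma)."""
    C, epsilon, gamma = params
    model = SVR(C=C, epsilon=epsilon, gamma=gamma)
    return -cross_val_score(model, rainfall, runoff, cv=5, scoring="r2").mean()

bounds = [(0.1, 100.0), (0.01, 1.0), (1e-3, 1.0)]        # C, epsilon, gamma
result = differential_evolution(objective, bounds, seed=1, maxiter=20)
print("best (C, epsilon, gamma):", result.x, "best CV R^2:", -result.fun)
```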

    Dynamic contrast enhanced (DCE) MRI estimation of vascular parameters using knowledge-based adaptive models

    We introduce and validate four adaptive models (AMs) to perform a physiologically based Nested-Model-Selection (NMS) estimation of such microvascular parameters as the forward volumetric transfer constant, K(trans), the plasma volume fraction, v(p), and the extravascular, extracellular space, v(e), directly from Dynamic Contrast-Enhanced (DCE) MRI raw information, without the need for an Arterial Input Function (AIF). In sixty-six immune-compromised RNU rats implanted with human U-251 cancer cells, DCE-MRI studies estimated pharmacokinetic (PK) parameters using a group-averaged radiological AIF and an extended Patlak-based NMS paradigm. One hundred and ninety features extracted from the raw DCE-MRI information were used to construct and validate (via nested cross-validation, NCV) four AMs for estimation of model-based regions and their three PK parameters. NMS-based a priori knowledge was used to fine-tune the AMs and improve their performance. Compared to the conventional analysis, the AMs produced stable maps of vascular parameters and nested-model regions that were less affected by AIF dispersion. The performance (correlation coefficient and adjusted R-squared for the NCV test cohorts) of the AMs was 0.914/0.834, 0.825/0.720, 0.938/0.880, and 0.890/0.792 for prediction of nested-model regions, v(p), K(trans), and v(e), respectively. This study demonstrates an application of AMs that quickens and improves DCE-MRI-based quantification of the microvasculature properties of tumors and normal tissues relative to conventional approaches.
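For orientation, the standard nested pharmacokinetic models typically used in this kind of NMS analysis are sketched below. This is a generic formulation, not necessarily the exact parameterisation used in the paper; C_t denotes the tissue tracer concentration and C_p the plasma concentration.

```latex
% Generic nested DCE-MRI models (illustrative; notation may differ from the paper)
\begin{align*}
\text{Model 1 (no leakage):} \quad
  & C_t(t) = v_p\, C_p(t) \\
\text{Model 2 (Patlak, negligible efflux):} \quad
  & C_t(t) = v_p\, C_p(t) + K^{\mathrm{trans}} \int_0^t C_p(\tau)\, d\tau \\
\text{Model 3 (extended Tofts):} \quad
  & C_t(t) = v_p\, C_p(t) + K^{\mathrm{trans}} \int_0^t C_p(\tau)\,
    e^{-\frac{K^{\mathrm{trans}}}{v_e}(t-\tau)}\, d\tau
\end{align*}
```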

    Navigating the Statistical Minefield of Model Selection and Clustering in Neuroscience

    Model selection is often implicit: when performing an ANOVA, one assumes that the normal distribution is a good model of the data; fitting a tuning curve implies that an additive and a multiplicative scaler describe the behavior of the neuron; even calculating an average implicitly assumes that the data were sampled from a distribution that has a finite first statistical moment: the mean. Model selection may also be explicit, when the aim is to test whether one model provides a better description of the data than a competing one. As a special case, clustering algorithms identify groups with similar properties within the data. They are widely used, from spike sorting to cell type identification to gene expression analysis. We discuss model selection and clustering techniques from a statistician's point of view, revealing the assumptions behind them and the logic that governs the various approaches. We also showcase important neuroscience applications and provide suggestions on how neuroscientists can put model selection algorithms to best use, as well as what mistakes should be avoided.
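As a small, hedged illustration of explicit model selection applied to clustering (synthetic data, not an example from the article), one can compare Gaussian mixture models with different numbers of components using an information criterion such as the BIC:

```python
# Illustrative sketch: choose the number of clusters in synthetic 2-D "spike
# feature" data by comparing Gaussian mixtures via the Bayesian information criterion.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Three well-separated clusters standing in for, e.g., spike waveform features.
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2))
                  for c in ([0, 0], [4, 0], [2, 3])])

bic_scores = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(data)
    bic_scores[k] = gmm.bic(data)

best_k = min(bic_scores, key=bic_scores.get)
print("BIC per k:", bic_scores)
print("selected number of clusters:", best_k)   # expected: 3
```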

    Machine Learning for Microcontroller-Class Hardware -- A Review

    Advancements in machine learning have opened a new opportunity to bring intelligence to low-end Internet-of-Things nodes such as microcontrollers. Conventional machine learning deployment has a high memory and compute footprint, hindering direct deployment on ultra-resource-constrained microcontrollers. This paper highlights the unique requirements of enabling onboard machine learning for microcontroller-class devices. Researchers use a specialized model development workflow for resource-limited applications to ensure that the compute and latency budget is within the device limits while still maintaining the desired performance. We characterize a closed-loop, widely applicable workflow of machine learning model development for microcontroller-class devices and show that several classes of applications adopt a specific instance of it. We present both qualitative and numerical insights into different stages of model development by showcasing several use cases. Finally, we identify the open research challenges and unsolved questions demanding careful consideration moving forward. Comment: Accepted for publication in the IEEE Sensors Journal.
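A minimal sketch of one common step in such a workflow, post-training quantization with TensorFlow Lite; the tiny model architecture and size targets below are illustrative assumptions, not drawn from the review.

```python
# Illustrative sketch: shrink a small Keras model with post-training quantization
# so it can fit within a microcontroller's flash/RAM budget. The model and the
# resulting size are placeholders, not taken from the paper.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Convert to a TensorFlow Lite flatbuffer with default (dynamic-range) quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

print("quantized model size: %.1f KiB" % (len(tflite_model) / 1024))
# The resulting .tflite bytes would then be compiled into firmware, e.g. with
# TensorFlow Lite Micro, for on-device inference.
```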

    Data Science: Measuring Uncertainties

    With the increase in data processing and storage capacity, a large amount of data is available. Data without analysis does not have much value. Thus, the demand for data analysis is increasing daily, and the consequence is the appearance of a large number of jobs and published articles. Data science has emerged as a multidisciplinary field to support data-driven activities, integrating and developing ideas, methods, and processes to extract information from data. It draws on methods built from different knowledge areas: Statistics, Computer Science, Mathematics, Physics, Information Science, and Engineering. This mixture of areas has given rise to what we call Data Science. New solutions to new problems are multiplying rapidly as ever larger volumes of data are generated. Current and future challenges require greater care in creating new solutions that are sound for each type of problem. Labels such as Big Data, Data Science, Machine Learning, Statistical Learning, and Artificial Intelligence are demanding more sophistication in their foundations and in how they are applied. This point highlights the importance of building the foundations of Data Science. This book is dedicated to solutions for, and discussions of, measuring uncertainties in data analysis problems.
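As one small example of what "measuring uncertainty" can mean in practice (a generic sketch, not taken from the book), a bootstrap confidence interval quantifies the uncertainty of a sample estimate:

```python
# Illustrative sketch: bootstrap 95% confidence interval for a sample mean.
import numpy as np

rng = np.random.default_rng(3)
sample = rng.exponential(scale=2.0, size=100)     # observed data (synthetic)

# Resample with replacement many times and collect the resampled means.
boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(5000)])
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {sample.mean():.2f}, 95% bootstrap CI: [{lower:.2f}, {upper:.2f}]")
```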

    Mining previously unknown patterns in time series data

    The emerging importance of distributed computing systems raises the need to gain a better understanding of system performance. As a major indicator of system performance, analysing CPU host load helps evaluate system performance in many ways. Discovering similar patterns in CPU host load is very useful, since many applications rely on the patterns mined from the CPU host load, such as pattern-based prediction, classification, and relative rule mining of CPU host load. Essentially, the problem of mining patterns in CPU host load is one of mining time series data. Due to the complexity of the problem, many traditional mining techniques for time series data are no longer suitable. Compared to mining known patterns in time series, mining unknown patterns is a much more challenging task. In this thesis, we investigate the major difficulties of the problem and develop techniques for mining unknown patterns by extending the traditional techniques for mining known patterns. We develop two different CPU host load pattern discovery methods, a segment-based method and a reduction-based method, to optimize the pattern discovery process. The segment-based method works by extracting segment features, while the reduction-based method works by reducing the size of the raw data. The segment-based pattern discovery method maps the CPU host load segments to a 5-dimensional space and then applies the DBSCAN clustering method to discover similar segments. The reduction-based method reduces the dimensionality and numerosity of the CPU host load to shrink the search space. A cascade method is proposed to support accurate pattern mining while maintaining efficiency. The investigations into CPU host load data inspired us to further develop a pattern mining algorithm for general time series data. The method filters out unlikely starting positions for recurring patterns at an early stage and then iteratively locates all best-matching patterns. The results obtained by our method do not contain any meaningless patterns, which has long been a problematic issue for other approaches. Compared to state-of-the-art techniques, our method is more efficient and effective in most scenarios.
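A minimal sketch of the segment-then-cluster idea described above; the five features chosen here (mean, standard deviation, minimum, maximum, and linear trend) and the DBSCAN parameters are illustrative assumptions, and the thesis's exact feature set may differ.

```python
# Illustrative sketch: split a synthetic CPU host-load trace into fixed-length
# segments, map each segment to a 5-dimensional feature vector, and cluster the
# segments with DBSCAN to find recurring behaviour.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
t = np.arange(10_000)
load = 0.5 + 0.3 * np.sin(2 * np.pi * t / 500) + rng.normal(0, 0.05, t.size)

window = 100
segments = load[: t.size - t.size % window].reshape(-1, window)

def features(seg):
    slope = np.polyfit(np.arange(seg.size), seg, 1)[0]   # linear trend of the segment
    return [seg.mean(), seg.std(), seg.min(), seg.max(), slope]

X = StandardScaler().fit_transform([features(s) for s in segments])
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print("clusters found:", sorted(set(labels) - {-1}),
      "noise segments:", int((labels == -1).sum()))
```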

    Efficient Learning Machines

    Computer science

    Uncertainty analysis of 100-year flood maps under climate change scenarios

    Floods are disastrous natural hazards that throughout history have had, and still have, major adverse impacts on people's lives, the economy, and the environment. One of the useful tools for flood management is the flood map, which is developed to identify flood-prone areas and can be used by insurance companies, local authorities, and land planners for rescue operations and for taking proper action against flood hazards. Developing flood maps is often carried out with flood inundation modeling tools such as 2D hydrodynamic models. However, flood maps are often generated from a single deterministic model outcome, without considering the uncertainty that arises from different sources and propagates through the modeling process. Moreover, the increasing number of flood events in recent decades, combined with the effects of global climate change, requires developing accurate and safe flood maps in which this uncertainty has been considered. Therefore, in this thesis the uncertainty of 100-year flood maps under three scenarios (present, and future RCP4.5 and RCP8.5) is assessed through intensive Monte Carlo simulations. The uncertainty introduced by the model input data, namely the roughness coefficient, the runoff coefficient, and the precipitation intensity (which incorporates three different sources of uncertainty: the RCP scenario, the climate model, and the probability distribution function), is propagated through a surrogate hydrodynamic/hydrologic model developed from a physical 2D model. The results obtained from this study challenge the use of deterministic flood maps and recommend using probabilistic approaches for developing safe and reliable flood maps. Furthermore, they show that the main source of uncertainty is the precipitation, specifically the selected probability distribution, rather than the selected RCP and climate model.
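A highly simplified sketch of the Monte Carlo propagation idea; the surrogate model here is a made-up placeholder function and the parameter distributions are illustrative, not the ones calibrated in the thesis.

```python
# Illustrative sketch: propagate input uncertainty through a (placeholder)
# surrogate flood model by Monte Carlo sampling, and summarise the spread of the
# predicted 100-year flood depth. Distributions and the surrogate are invented
# for illustration only.
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

roughness = rng.uniform(0.02, 0.06, n)              # Manning's n (illustrative range)
runoff_coeff = rng.uniform(0.3, 0.9, n)             # runoff coefficient
precip = rng.gumbel(loc=80.0, scale=25.0, size=n)   # 100-year rainfall intensity (mm/h)

def surrogate_depth(n_man, c_run, p):
    """Placeholder surrogate: depth grows with runoff, rainfall and roughness."""
    return 0.02 * c_run * p * (1.0 + 5.0 * n_man)

depths = surrogate_depth(roughness, runoff_coeff, precip)
print("median depth: %.2f m" % np.median(depths))
print("5th-95th percentile band: %.2f - %.2f m" % tuple(np.percentile(depths, [5, 95])))
```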