Automating the Hunt for Volcanoes on Venus
Our long-term goal is to develop a trainable tool for locating patterns of interest in large image databases. Toward this goal we have developed a prototype system, based on classical filtering and statistical pattern recognition techniques, for automatically locating volcanoes in the Magellan SAR database of Venus. Training for the specific volcano-detection task is obtained by synthesizing feature templates (via normalization and principal components analysis) from a small number of examples provided by experts. Candidate regions identified by a focus of attention (FOA) algorithm are classified based on correlations with the feature templates. Preliminary tests show performance comparable to that of trained human observers.
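A minimal sketch of the pipeline described above, assuming hypothetical 15x15-pixel candidate windows and synthetic data in place of the Magellan SAR imagery: patches are normalized, principal components extracted from expert-labelled examples serve as feature templates, and FOA candidates are scored by their correlation with (projection onto) those templates. This illustrates the general technique only, not the authors' system.

```python
import numpy as np

def build_templates(examples, n_components=6):
    """Build feature templates from expert-labelled example windows.

    examples: array of shape (n_examples, h, w) -- small image patches.
    Returns the mean patch and the top principal components (eigen-patches).
    """
    X = examples.reshape(len(examples), -1).astype(float)
    X -= X.mean(axis=1, keepdims=True)          # normalize each patch (zero mean)
    X /= (X.std(axis=1, keepdims=True) + 1e-9)  # ...and unit variance
    mean_patch = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean_patch, full_matrices=False)
    return mean_patch, Vt[:n_components]        # principal components as templates

def score_candidates(candidates, mean_patch, components):
    """Correlate FOA candidate windows with the feature templates.

    Returns one score per candidate: projection energy onto the template subspace.
    """
    X = candidates.reshape(len(candidates), -1).astype(float)
    X -= X.mean(axis=1, keepdims=True)
    X /= (X.std(axis=1, keepdims=True) + 1e-9)
    coeffs = (X - mean_patch) @ components.T    # correlation with each eigen-template
    return np.sum(coeffs ** 2, axis=1)          # higher score -> more volcano-like

# Usage with synthetic data standing in for SAR patches:
rng = np.random.default_rng(0)
train = rng.normal(size=(40, 15, 15))           # expert-labelled examples (hypothetical)
cands = rng.normal(size=(200, 15, 15))          # windows flagged by the FOA stage
mean_patch, comps = build_templates(train)
scores = score_candidates(cands, mean_patch, comps)
detections = np.argsort(scores)[::-1][:10]      # top-scoring candidate windows
```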
Condition monitoring of an advanced gas-cooled nuclear reactor core
A critical component of an advanced gas-cooled reactor station is the graphite core. As a station ages, the graphite bricks that comprise the core can distort and may eventually crack. Since the core cannot be replaced, the core integrity ultimately determines the station life. Monitoring these distortions is usually restricted to the routine outages, which occur every few years, as this is the only time that the reactor core can be accessed by external sensing equipment. This paper presents a monitoring module based on model-based techniques using measurements obtained during the refuelling process. A fault detection and isolation filter based on unknown input observer techniques is developed. The role of this filter is to estimate the friction force produced by the interaction between the wall of the fuel channel and the fuel assembly supporting brushes. This allows an estimate to be made of the shape of the graphite bricks that comprise the core and, therefore, any distortion in them to be monitored.
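The paper's filter rests on unknown input observer theory and the plant model, which cannot be reproduced here. As a rough illustration of the underlying idea only, the sketch below treats the unknown friction force as a slowly varying extra state of a hypothetical one-dimensional fuel-assembly model and estimates it with a Kalman filter; every matrix and noise level is an assumption made purely for illustration.

```python
import numpy as np

# Hypothetical 1-D model of the fuel assembly during refuelling:
# state = [position, velocity, friction_force]; the friction force is the
# "unknown input", modelled here as a slowly varying random-walk state.
dt, mass = 0.1, 100.0
A = np.array([[1, dt, 0],
              [0, 1, -dt / mass],       # friction decelerates the assembly
              [0, 0, 1]])               # friction assumed roughly constant per step
B = np.array([[0], [dt / mass], [0]])   # known hoist force input
H = np.array([[1.0, 0.0, 0.0]])         # only position is measured
Q = np.diag([1e-6, 1e-6, 1e-2])         # let the friction state drift
R = np.array([[1e-4]])                  # measurement noise (assumed)

def estimate_friction(y, u):
    """Kalman filter tracking position, velocity and the unknown friction force."""
    x = np.zeros((3, 1))
    P = np.eye(3)
    friction = []
    for yk, uk in zip(y, u):
        # predict
        x = A @ x + B * uk
        P = A @ P @ A.T + Q
        # update with the position measurement
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (np.atleast_2d(yk) - H @ x)
        P = (np.eye(3) - K @ H) @ P
        friction.append(float(x[2]))
    return np.array(friction)   # estimated friction profile along the channel

# Usage with made-up signals: 50 steps of position measurements and hoist force
y_meas, u_force = np.zeros(50), np.ones(50)
friction_profile = estimate_friction(y_meas, u_force)
```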
SkICAT: A cataloging and analysis tool for wide field imaging surveys
We describe an integrated system, SkICAT (Sky Image Cataloging and Analysis Tool), for the automated reduction and analysis of the Palomar Observatory-ST ScI Digitized Sky Survey. The Survey will consist of the complete digitization of the photographic Second Palomar Observatory Sky Survey (POSS-II) in three bands, comprising nearly three Terabytes of pixel data. SkICAT applies a combination of existing packages, including FOCAS for basic image detection and measurement and SAS for database management, as well as custom software, to the task of managing this wealth of data. One of the most novel aspects of the system is its method of object classification. Using state-of-the-art machine learning classification techniques (GID3* and O-BTree), we have developed a powerful method for automatically distinguishing point sources from non-point sources and artifacts, achieving comparably accurate discrimination a full magnitude fainter than in previous Schmidt plate surveys. The learning algorithms produce decision trees for classification by examining instances of objects classified by eye on both plate and higher quality CCD data. The same techniques will be applied to perform higher-level object classification (e.g., of galaxy morphology) in the near future. Another key feature of the system is the facility to integrate the catalogs from multiple plates (and portions thereof) to construct a single catalog of uniform calibration and quality down to the faintest limits of the survey. SkICAT also provides a variety of data analysis and exploration tools for the scientific utilization of the resulting catalogs. We include initial results of applying this system to measure the counts and distribution of galaxies in two bands down to Bj ≈ 21 mag over an approximately 70 square degree multi-plate field from POSS-II. SkICAT is constructed in a modular and general fashion and should be readily adaptable to other large-scale imaging surveys.
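GID3* and O-BTree are not available in standard libraries, so the sketch below uses scikit-learn's CART decision tree as a stand-in to show the same workflow: learn a tree from objects "classified by eye" (here replaced by synthetic catalogue attributes and labels) and use it to separate point sources from non-point sources. The feature set is invented for illustration and is not the SkICAT attribute list.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
# Hypothetical per-object catalogue attributes (e.g. magnitude, ellipticity,
# FWHM, peak/area ratio) -- synthetic stand-ins for real survey measurements.
X = rng.normal(size=(n, 4))
# Synthetic "classified by eye" labels: 1 = point source, 0 = non-point/artifact
y = (0.8 * X[:, 2] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=n) < 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A CART tree standing in for GID3*/O-BTree: learns axis-aligned rules on the attributes
tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=20)
tree.fit(X_tr, y_tr)
print("held-out accuracy:", tree.score(X_te, y_te))
```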
A Mixed-Attribute Approach in Ant-Miner Classification Rule Discovery Algorithm
In this paper, we introduce Ant-MinerMA to tackle mixed-attribute classification problems. Most classification problems involve continuous, ordinal and categorical attributes. The majority of Ant Colony Optimization (ACO) classification algorithms have the limitation of being able to handle categorical attributes only, with few exceptions that use a discretisation procedure when handling continuous attributes, either in a preprocessing stage or during rule creation. Using a solution archive as a pheromone model, inspired by the ACO for mixed-variable optimization (ACO-MV), we eliminate the need for a discretisation procedure, and attributes can be treated directly as continuous, ordinal, or categorical. We compared the proposed Ant-MinerMA against cAnt-Miner, an ACO-based classification algorithm that uses a discretisation procedure in the rule construction process. Our results show that Ant-MinerMA achieved significant improvements in computational time, due to the elimination of the discretisation procedure, without affecting predictive performance.
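A much-simplified sketch of the solution-archive pheromone model borrowed from ACO-MV: archive entries are ranked by quality and given Gaussian rank weights, continuous attribute values are sampled from a Gaussian kernel centred on a chosen archive entry, and categorical values are sampled in proportion to the weights of the entries that use them. This is not the Ant-MinerMA implementation; the attribute names and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

class SolutionArchive:
    """Simplified ACO-MV-style archive: ranked solutions act as the pheromone model."""
    def __init__(self, solutions, qualities, q=0.1, xi=0.85):
        order = np.argsort(qualities)[::-1]          # best solutions first
        self.solutions = [solutions[i] for i in order]
        k = len(solutions)
        ranks = np.arange(1, k + 1)
        w = np.exp(-(ranks - 1) ** 2 / (2 * (q * k) ** 2)) / (q * k * np.sqrt(2 * np.pi))
        self.weights = w / w.sum()                   # selection probability per rank
        self.xi = xi                                 # spread of the Gaussian kernels

    def sample_continuous(self, attr):
        """Sample a value for a continuous attribute from a Gaussian kernel."""
        j = rng.choice(len(self.solutions), p=self.weights)
        values = np.array([s[attr] for s in self.solutions])
        sigma = self.xi * np.mean(np.abs(values - values[j])) + 1e-9
        return rng.normal(values[j], sigma)

    def sample_categorical(self, attr, categories):
        """Sample a category with probability proportional to archive weights."""
        probs = np.array([sum(w for s, w in zip(self.solutions, self.weights)
                              if s[attr] == c) + 1e-9 for c in categories])
        return rng.choice(categories, p=probs / probs.sum())

# Usage with hypothetical rule terms (age: continuous, colour: categorical):
archive = SolutionArchive(
    solutions=[{"age": 30.0, "colour": "red"}, {"age": 45.0, "colour": "blue"},
               {"age": 32.0, "colour": "red"}],
    qualities=[0.9, 0.6, 0.8])
print(archive.sample_continuous("age"),
      archive.sample_categorical("colour", ["red", "blue"]))
```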
The Art of Data Science
To flourish in the new data-intensive environment of 21st century science, we need to evolve new skills. These can be expressed in terms of the systemized framework that formed the basis of mediaeval education: the trivium (logic, grammar, and rhetoric) and quadrivium (arithmetic, geometry, music, and astronomy). However, rather than focusing on number, data is the new keystone. We need to understand what rules it obeys, how it is symbolized and communicated, and what its relationship to physical space and time is. In this paper, we will review this understanding in terms of the technologies and processes that it requires. We contend that, at least, an appreciation of all these aspects is crucial to enable us to extract scientific information and knowledge from the data sets which threaten to engulf and overwhelm us.
Comment: 12 pages; invited talk at the Astrostatistics and Data Mining in Large Astronomical Databases workshop, La Palma, Spain, 30 May - 3 June 2011; to appear in the Springer Series on Astrostatistics.
An intelligent assistant for exploratory data analysis
In this paper we present an account of the main features of SNOUT, an intelligent assistant for exploratory data analysis (EDA) of social science survey data that incorporates a range of data mining techniques. EDA has much in common with existing data mining techniques: its main objective is to help an investigator reach an understanding of the important relationships in a data set rather than simply develop predictive models for selected variables. Brief descriptions of a number of novel techniques developed for use in SNOUT are presented. These include heuristic variable level inference and classification, automatic category formation, the use of similarity trees to identify groups of related variables, interactive decision tree construction and model selection using a genetic algorithm.
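The similarity-tree idea, grouping related variables, can be illustrated with a generic surrogate: compute a pairwise association matrix between survey variables and apply hierarchical clustering to it. This is not SNOUT's algorithm, only the common pattern it builds on; the variables below are synthetic and their names invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

rng = np.random.default_rng(3)
# Synthetic survey data: 6 variables, some deliberately correlated
income = rng.normal(size=500)
data = {
    "income": income,
    "spending": income + rng.normal(scale=0.3, size=500),
    "age": rng.normal(size=500),
    "household_size": rng.normal(size=500),
    "savings": income + rng.normal(scale=0.5, size=500),
    "commute_time": rng.normal(size=500),
}
names = list(data)
X = np.column_stack([data[n] for n in names])

# Distance between variables = 1 - |correlation|; cluster the variables, not the cases
dist = 1 - np.abs(np.corrcoef(X, rowvar=False))
np.fill_diagonal(dist, 0.0)
tree = linkage(squareform(dist, checks=False), method="average")
# Leaf order of the resulting "similarity tree": related variables end up adjacent
print(dendrogram(tree, labels=names, no_plot=True)["ivl"])
```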
Data mining: a tool for detecting cyclical disturbances in supply networks.
Disturbances in supply chains may be either exogenous or endogenous. The ability automatically to detect, diagnose, and distinguish between the causes of disturbances is of prime importance to decision makers in order to avoid uncertainty. The spectral principal component analysis (SPCA) technique has been utilized to distinguish between real and rogue disturbances in a steel supply network. The data set used was collected from four different business units in the network and consists of 43 variables; each is described by 72 data points. The present paper will utilize the same data set to test an alternative approach to SPCA in detecting the disturbances. The new approach employs statistical data pre-processing, clustering, and classification learning techniques to analyse the supply network data. In particular, the incremental k-means clustering and the RULES-6 classification rule-learning algorithms, developed by the present authors' team, have been applied to identify important patterns in the data set. Results show that the proposed approach has the capability automatically to detect and characterize network-wide cyclical disturbances and generate hypotheses about their root cause.
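RULES-6 and the authors' incremental k-means are not publicly packaged, so the sketch below reproduces only the two-stage pattern with scikit-learn stand-ins: cluster the variables' time-series profiles with k-means, then fit an interpretable classifier to characterize the clusters. The data are synthetic, with shapes (43 variables, 72 points) chosen to mirror the abstract.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
t = np.arange(72)
# Synthetic stand-in for the supply network data: 43 variables x 72 time points.
# Roughly half the variables carry a shared cyclical disturbance, the rest are noise.
cyclical = np.sin(2 * np.pi * t / 12)
series = np.vstack([cyclical * rng.uniform(0.5, 2.0) + rng.normal(scale=0.3, size=72)
                    if i < 21 else rng.normal(size=72) for i in range(43)])

# Stage 1: cluster the variables by the shape of their time series (k-means)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(series)

# Stage 2: learn interpretable rules that characterize the clusters
# (a shallow decision tree stands in for the RULES-6 rule learner)
features = np.column_stack([series.std(axis=1),                  # volatility
                            np.abs(np.fft.rfft(series)[:, 6])])  # energy at the 12-step cycle
tree = DecisionTreeClassifier(max_depth=2).fit(features, clusters)
print(export_text(tree, feature_names=["volatility", "cycle_energy"]))
```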
Software defect prediction: do different classifiers find the same defects?
During the last 10 years, hundreds of different defect prediction models have been published. The performance of the classifiers used in these models is reported to be similar, with models rarely performing above the predictive performance ceiling of about 80% recall. We investigate the individual defects that four classifiers predict and analyse the level of prediction uncertainty produced by these classifiers. We perform a sensitivity analysis to compare the performance of Random Forest, Naïve Bayes, RPart and SVM classifiers when predicting defects in NASA, open source and commercial datasets. The defect predictions that each classifier makes are captured in a confusion matrix and the prediction uncertainty of each classifier is compared. Despite similar predictive performance values for these four classifiers, each detects different sets of defects. Some classifiers are more consistent in predicting defects than others. Our results confirm that a unique subset of defects can be detected by specific classifiers. However, while some classifiers are consistent in the predictions they make, other classifiers vary in their predictions. Given our results, we conclude that classifier ensembles with decision-making strategies not based on majority voting are likely to perform best in defect prediction.
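A compact way to reproduce this kind of comparison on any labelled defect dataset: train the four classifiers (a scikit-learn decision tree stands in for RPart), record each one's confusion matrix, and compare the sets of true defects each classifier finds. The dataset here is synthetic; the study itself used NASA, open source and commercial data, and its exact settings are not reproduced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier   # stand-in for RPart
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Synthetic, imbalanced stand-in for a defect dataset (defective modules are the minority)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

classifiers = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "NaiveBayes": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

found = {}  # indices of true defects each classifier detects
for name, clf in classifiers.items():
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(name, confusion_matrix(y_te, pred).ravel())      # tn, fp, fn, tp
    found[name] = {i for i, (t, p) in enumerate(zip(y_te, pred)) if t == 1 and p == 1}

# Defects found by Random Forest but missed by every other classifier
others = set.union(*(s for n, s in found.items() if n != "RandomForest"))
print("unique to RandomForest:", len(found["RandomForest"] - others))
```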
Using data mining for prediction of hospital length of stay: an application of the CRISP-DM Methodology
Hospitals are nowadays collecting vast amounts of data related to patient records. All these data hold valuable knowledge that can be used to improve hospital decision making. Data mining techniques aim precisely at the extraction of useful knowledge from raw data. This work describes an implementation of a medical data mining project based on the CRISP-DM methodology. Recent real-world data, from 2000 to 2013, were collected from a Portuguese hospital and relate to inpatient hospitalization. The goal was to predict generic hospital Length Of Stay based on indicators that are commonly available at the hospitalization process (e.g., gender, age, episode type, medical specialty). At the data preparation stage, the data were cleaned and variables were selected and transformed, leading to 14 inputs. Next, at the modeling stage, a regression approach was adopted, in which six learning methods were compared: Average Prediction, Multiple Regression, Decision Tree, Artificial Neural Network ensemble, Support Vector Machine and Random Forest. The best learning model was obtained by the Random Forest method, which presents a high coefficient of determination (0.81). This model was then opened using a sensitivity analysis procedure that revealed three influential input attributes: the hospital episode type, the physical service where the patient is hospitalized and the associated medical specialty. Such extracted knowledge confirmed that the obtained predictive model is credible and has potential value for supporting the decisions of hospital managers.
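A hedged sketch of the modelling and "model opening" steps on synthetic data: fit a Random Forest regressor to a handful of made-up hospitalization indicators, report the coefficient of determination, and rank the inputs with a permutation-based sensitivity analysis. The variables and their effect sizes are assumptions for illustration, not the study's data or its exact sensitivity method.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(5)
n = 3000
# Synthetic stand-in for the hospitalization indicators (the real study used 14 inputs)
df = pd.DataFrame({
    "age": rng.integers(0, 95, n),
    "gender": rng.integers(0, 2, n),
    "episode_type": rng.integers(0, 3, n),
    "medical_specialty": rng.integers(0, 10, n),
})
# Made-up length of stay (days) driven mostly by age and episode type
los = 2 + 0.05 * df["age"] + 3 * df["episode_type"] + rng.exponential(2, n)

X_tr, X_te, y_tr, y_te = train_test_split(df, los, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out data:", round(r2_score(y_te, rf.predict(X_te)), 2))

# "Opening" the model: a permutation-based sensitivity analysis of input influence
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for name, score in sorted(zip(df.columns, imp.importances_mean), key=lambda p: -p[1]):
    print(f"{name:20s} {score:.3f}")
```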
A bioinformatics knowledge discovery in text application for grid computing
Background: A fundamental activity in biomedical research is Knowledge Discovery, which has the ability to search through large amounts of biomedical information such as documents and data. High-performance computational infrastructures, such as Grid technologies, are emerging as a possible infrastructure to tackle the intensive use of Information and Communication Technology (ICT) resources in life science. The goal of this work was to develop a middleware solution that allows knowledge discovery applications to exploit scalable and distributed computing systems and thus make intensive use of ICT resources.
Methods: The development of a grid application for Knowledge Discovery in Text using a middleware-based methodology is presented. The system must be able to accept a user application model and process jobs by creating many parallel jobs to distribute across the computational nodes. Finally, the system must be aware of the available computational resources and their status, and must be able to monitor the execution of the parallel jobs. These operational requirements led to the design of a middleware that is specialized through user application modules. It includes a graphical user interface for accessing a node search system, a load-balancing system and a transfer optimizer that reduces communication costs.
Results: A prototype of the middleware solution and its performance evaluation in terms of the speed-up factor are presented. The prototype was written in Java on Globus Toolkit 4, with the grid infrastructure built on GNU/Linux grid nodes. A test was carried out and the results are shown for the named entity recognition search of symptoms and pathologies. The search was applied to a collection of 5,000 scientific documents taken from PubMed.
Conclusion: In this paper we discuss the development of a grid application based on a middleware solution. It has been tested on a knowledge discovery in text process to extract new and useful information about symptoms and pathologies from a large collection of unstructured scientific documents. As an example, a Knowledge Discovery in Databases computation was applied to the output produced by the KDT user module to extract new knowledge about symptom and pathology bio-entities.
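The grid-specific machinery (Globus Toolkit 4, node search, load balancing) cannot be shown in a few lines, so the sketch below illustrates only the embarrassingly parallel structure of the KDT step: partition the document collection into chunks, run a trivially simplified dictionary-based symptom/pathology recognizer on each chunk in parallel, and merge the results. A local process pool stands in for the grid nodes; the term lists and toy documents are invented.

```python
from concurrent.futures import ProcessPoolExecutor
from collections import Counter

# Toy dictionary-based recognizer standing in for the real NER module
SYMPTOMS = {"fever", "cough", "fatigue"}
PATHOLOGIES = {"pneumonia", "diabetes", "asthma"}

def recognize(docs):
    """Count symptom/pathology mentions in one chunk of documents."""
    counts = Counter()
    for doc in docs:
        for token in doc.lower().split():
            if token in SYMPTOMS:
                counts["symptom:" + token] += 1
            elif token in PATHOLOGIES:
                counts["pathology:" + token] += 1
    return counts

def split(docs, n_jobs):
    """Partition the collection into n_jobs chunks (one per 'grid node')."""
    return [docs[i::n_jobs] for i in range(n_jobs)]

if __name__ == "__main__":
    documents = ["Patient presents fever and cough , suspected pneumonia ."] * 5000
    with ProcessPoolExecutor(max_workers=4) as pool:        # 4 workers stand in for grid nodes
        partial = pool.map(recognize, split(documents, 4))  # distribute the parallel jobs
    merged = sum(partial, Counter())                        # gather and merge the results
    print(merged.most_common())
```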