
    Integration of Data Mining and Data Warehousing: a practical methodology

    The ever-growing repositories of data in all fields pose new challenges to modern analytical systems. Real-world datasets, with mixed numeric and nominal variables, are difficult to analyze and require effective visual exploration that conveys the semantic relationships in the data. Traditional data mining techniques such as clustering handle only numeric data, and little research has tackled the problem of clustering high-cardinality nominal variables to gain better insight into the underlying dataset. Several works in the literature have demonstrated the feasibility of integrating data mining with warehousing to discover knowledge from data. For seamless integration, the mined data has to be modeled in the form of a data warehouse schema. Schema generation is a complex manual task that requires familiarity with both the domain and warehousing, so automated techniques for generating the warehouse schema are needed to remove these dependencies. To fulfill the growing analytical needs and to overcome the existing limitations, we propose in this paper a novel methodology that permits efficient analysis of mixed numeric and nominal data, effective visual data exploration, automatic warehouse schema generation, and integration of data mining and warehousing. The proposed methodology is evaluated through a case study on a real-world dataset. Results show that multidimensional analysis can be performed in an easier and more flexible way to discover meaningful knowledge from large datasets.
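    The abstract does not detail its clustering algorithm, so purely as a generic illustration of the mixed-data problem it raises, the sketch below (hypothetical data and column names, scikit-learn assumed) scales the numeric column, one-hot encodes the nominal one, and clusters the result with ordinary k-means:

```python
# Generic sketch of clustering a mixed numeric/nominal table (not the paper's
# proposed methodology): scale numerics, one-hot encode nominals, run k-means.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Hypothetical example data with one numeric and one nominal variable.
df = pd.DataFrame({
    "income": [42_000, 58_000, 39_000, 71_000, 65_000, 44_000],
    "region": ["north", "south", "north", "east", "east", "south"],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["income"]),
    ("nom", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

model = make_pipeline(preprocess, KMeans(n_clusters=2, n_init=10, random_state=0))
labels = model.fit_predict(df)   # one cluster label per row
print(labels)
```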

    Cross-Recurrence Quantification Analysis of Categorical and Continuous Time Series: an R package

    This paper describes the R package crqa for performing cross-recurrence quantification analysis of two time series of either a categorical or continuous nature. Streams of behavioral information, from eye movements to linguistic elements, unfold over time. When two people interact, such as in conversation, they often adapt to each other, leading these behavioral levels to exhibit recurrent states. In dialogue, for example, interlocutors adapt to each other by exchanging interactive cues: smiles, nods, gestures, choice of words, and so on. In order to capture closely the goings-on of dynamic interaction, and to uncover the extent of coupling between two individuals, we need to quantify how much recurrence is taking place at these levels. The methods available in crqa allow researchers in cognitive science to ask how recurrent two people are at some level of analysis, what the characteristic lag time is for one person to maximally match the other, or whether one person is leading the other. First, we set the theoretical ground for understanding the difference between 'correlation' and 'co-visitation' when comparing two time series, using an aggregative or cross-recurrence approach. Then, we describe more formally the principles of cross-recurrence and show with the current package how to carry out analyses applying them. We end the paper by comparing the computational efficiency and the consistency of results of the crqa R package with the benchmark MATLAB toolbox crptoolbox, and show perfect comparability between the two libraries on both counts.
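    As a plain-Python illustration of the idea behind cross-recurrence for categorical series (not the crqa package itself, whose interface is described in the paper), the sketch below marks every pair of time points at which the two series visit the same state and summarizes the density of matches overall and along diagonals (lags); the example series are made up:

```python
# Minimal sketch of cross-recurrence between two categorical time series.
import numpy as np

x = np.array(list("ABBABCABBA"))   # hypothetical categorical series 1
y = np.array(list("ABABBCABAB"))   # hypothetical categorical series 2

# Cross-recurrence matrix: R[i, j] = 1 when x[i] == y[j].
R = (x[:, None] == y[None, :]).astype(int)

# Recurrence rate: overall fraction of recurrent points.
rr = R.mean()

# Diagonal-wise recurrence profile: how strongly the series match at each lag,
# a crude way of asking at which delay one series best follows the other.
profile = {k: np.diagonal(R, offset=k).mean() for k in range(-3, 4)}
print(rr, profile)
```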

    Homogeneity analysis with k sets of variables: An alternating least squares method with optimal scaling features

    Homogeneity analysis, or multiple correspondence analysis, is usually applied to k separate variables. In this paper, it is applied to sets of variables by using sums within sets. The resulting technique is referred to as OVERALS. It uses the notion of optimal scaling, with transformations that can be multiple or single. The single transformations are of three types: (1) nominal, (2) ordinal, and (3) numerical. The corresponding OVERALS computer program minimizes a least squares loss function by means of an alternating least squares algorithm. Many existing linear and non-linear multivariate analysis techniques are shown to be special cases of OVERALS. Disadvantages of the OVERALS method include the possibility of local minima in some complicated special cases, a lack of information on the stability of results, and its inability to handle incomplete data matrices. Means of dealing with some of these problems are suggested (i.e., an alternating least squares algorithm to solve the minimization problem). An application of the method to data from an epidemiological survey is provided.
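    For reference, the loss function minimized by homogeneity analysis can be written as below; the notation (object scores X, indicator matrices G_j, category quantifications Y_j) follows the standard Gifi formulation and is not reproduced from the paper itself. OVERALS generalizes this by replacing single variables with sums within sets of variables.

```latex
% Homogeneity-analysis loss over k variables (standard Gifi notation, assumed
% rather than copied from the paper): X holds object scores, G_j is the
% indicator matrix of variable j, and Y_j its category quantifications.
\sigma(X; Y_1,\dots,Y_k) \;=\; \frac{1}{k}\sum_{j=1}^{k}
  \operatorname{tr}\!\bigl[(X - G_j Y_j)^{\top}(X - G_j Y_j)\bigr],
\qquad \text{subject to } X^{\top}X = n I,\ \mathbf{1}^{\top}X = 0 .
```

    Alternating least squares minimizes this loss by updating each Y_j for fixed X, then updating and renormalizing X for fixed quantifications, iterating until convergence.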

    An integrated analysis of micro- and macro-habitat features as a tool to detect weather-driven constraints: a case study with cavity nesters

    The effects of climate change on animal populations may be shaped by habitat characteristics at both the micro- and the macro-habitat level; however, empirical studies integrating these two scales of observation are lacking. Because analyses of the effects of climate change commonly rely on data from a much larger scale than the microhabitat level at which organisms are actually affected, this mismatch risks hampering progress in understanding the ecological and evolutionary responses of organisms and, ultimately, in designing effective actions to preserve their populations. Cavity nesters, often of conservation concern, are an ideal model because the cavity is a microenvironment potentially different from the macroenvironment but nonetheless inevitably interacting with it. The lesser kestrel (Falco naumanni) is a cavity nester that until recently was classified as a Vulnerable species. For nine years, starting in 2004, we collected detailed biotic and abiotic data at both the micro- and the macro-scale of observation in a kestrel population breeding in the Gela Plain (Italy), a Mediterranean area where high temperatures may reach values lethal for the nest content. We show that macroclimatic features needed to be integrated with both abiotic and biotic factors recorded at the microscale before nest temperatures could be reliably predicted. Among the nest types used by lesser kestrels, we detected preferential occupation of the cooler nest type, roof tiles, by early breeders, whereas, paradoxically, late breeders nesting at hotter temperatures occupied the overheated nest holes. Inconsistent with such apparent nest selection, the coolest nest type did not host higher reproductive success than the overheated nests. We discuss our findings in the light of cavity temperatures and of the nest types deployed in conservation actions, assessed by integrating selected factors at different observation scales.

    Interactive Correspondence Analysis in a Dynamic Object-Oriented Environment

    A highly interactive, user-friendly, object-oriented software package written in Lisp-Stat is introduced that performs simple and multiple correspondence analysis as well as profile analysis. These three techniques are integrated into a single environment driven by a graphical interface that takes advantage of Lisp-Stat's advanced graphical capabilities. Techniques that assess the stability of the solution are also introduced. Features of the package include colored graphics, incremental graph zooming, manual point separation to determine the identities of overlapping points, and stability and fit measures. The features of the package are used to show some interesting trends in a large educational dataset.
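    The package's interactive features cannot be reproduced here, but the core computation behind simple correspondence analysis is a weighted SVD of standardized residuals; a minimal NumPy sketch on a hypothetical contingency table (not the Lisp-Stat code) follows:

```python
# Minimal sketch of simple correspondence analysis via SVD.
import numpy as np

N = np.array([[20, 10,  5],
              [ 8, 25, 12],
              [ 4,  6, 30]], dtype=float)   # hypothetical contingency table

P = N / N.sum()                  # correspondence matrix
r = P.sum(axis=1)                # row masses
c = P.sum(axis=0)                # column masses

# Standardized residuals and their SVD.
S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of rows and columns (first two dimensions).
row_coords = (np.diag(1 / np.sqrt(r)) @ U * sv)[:, :2]
col_coords = (np.diag(1 / np.sqrt(c)) @ Vt.T * sv)[:, :2]
print(row_coords, col_coords, sv**2)   # sv**2 are the principal inertias
```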

    Designing and sample size calculation in presence of heterogeneity in biological studies involving high-throughput data.

    The design and determination of sample size are important for conducting high-throughput biological experiments such as proteomics experiments and RNA-Seq expression studies, and lead to a better understanding of the complex mechanisms underlying various biological processes. Variation in the biological data, as well as in the technical approaches to data collection, introduces heterogeneity among the samples under study. We worked critically on the issues of technical and biological heterogeneity. Quantitative measurements based on liquid chromatography (LC) coupled with mass spectrometry (MS) often suffer from missing values (MVs) and data heterogeneity. We considered a proteomics data set generated from human kidney biopsy material to investigate the technical effects of sample preparation and quantitative MS. We studied the effect of tissue storage methods (TSMs) and tissue extraction methods (TEMs) on data analysis, comparing two TSMs, frozen (FR) and formalin-fixed paraffin-embedded (FFPE), and three TEMs: MAX, TX followed by MAX, and SDS followed by MAX. We assessed the impact of different strategies for analyzing the data while accounting for heterogeneity and MVs. We found that FFPE is better than FR for tissue storage, and that the one-step TEM (MAX) is better than the two-step TEMs. Furthermore, we found imputation to be a better approach than excluding the proteins with MVs or using an unbalanced design. We introduce a web application, PWST (Proteomics Workflow Standardization Tool), to standardize the proteomics workflow. The tool helps in deciding the most suitable choice for each step and in studying the variability associated with technical steps as well as the effects of continuous variables. We used special cases of the general linear model, ANCOVA and ANOVA with fixed effects, to study the effects due to various sources of variability. We introduce an interactive tool, “SATP: Statistical Analysis Tool for Proteomics”, for analyzing proteomics expression data that is scalable to large clinical proteomic studies; the user can perform differential expression analysis of proteomics data at either the protein or the peptide level using multiple approaches. We developed statistical approaches for calculating sample size for proteomics experiments under allocation and cost constraints, together with R programs and a shiny app, “SSCP: Sample Size Calculator for Proteomics Experiment”, for computing sample sizes. We also proposed statistical approaches for calculating sample size for RNA-Seq experiments considering allocation and cost, and developed R programs and shiny apps to calculate sample size for conducting RNA-Seq experiments.
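    The thesis's allocation- and cost-constrained formulas are not given in the abstract; as a baseline, a standard two-group sample-size calculation under the normal approximation looks like the sketch below (a generic textbook formula, not the SSCP app itself):

```python
# Generic two-group sample-size calculation (normal approximation).
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Samples per group needed to detect a mean difference `delta` given SD `sigma`."""
    z_a = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_b = norm.ppf(power)           # power requirement
    return 2 * ((z_a + z_b) * sigma / delta) ** 2

# e.g. detect a 1-unit shift in log-intensity with SD 1.5 at 80% power
print(n_per_group(delta=1.0, sigma=1.5))   # ~35.3, so round up to 36 per group
```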

    A General Spatio-Temporal Clustering-Based Non-local Formulation for Multiscale Modeling of Compartmentalized Reservoirs

    Representing the reservoir as a network of discrete compartments with neighbor and non-neighbor connections is a fast yet accurate method for analyzing oil and gas reservoirs. Automatic and rapid detection of coarse-scale compartments with distinct static and dynamic properties is an integral part of such high-level reservoir analysis. In this work, we present a novel hybrid framework specific to reservoir analysis in which a physics-based non-local multiscale modeling approach is coupled with data-driven clustering techniques for automatic detection of clusters in space from spatial and temporal field data, providing fast and accurate modeling of compartmentalized reservoirs. This research also adds to the literature by presenting a comprehensive treatment of spatio-temporal clustering for reservoir-study applications that accounts for the complexities of clustering, the intrinsically sparse and noisy nature of the data, and the interpretability of the outcome. Keywords: Artificial Intelligence; Machine Learning; Spatio-Temporal Clustering; Physics-Based Data-Driven Formulation; Multiscale Modeling
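    The paper's hybrid, physics-coupled formulation is not spelled out in the abstract; purely as a generic illustration of spatio-temporal clustering on field data, the sketch below (synthetic well locations and production rates) combines each well's coordinates with a normalized production profile and groups the wells with k-means:

```python
# Generic spatio-temporal clustering illustration (not the paper's framework).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_wells, n_months = 30, 24
xy = rng.uniform(0, 10, size=(n_wells, 2))                            # well locations
rates = rng.lognormal(mean=5, sigma=0.4, size=(n_wells, n_months))    # production rates

# Feature vector per well: coordinates plus a normalized decline profile.
profiles = rates / rates[:, :1]       # scale each series by its initial rate
features = np.hstack([xy, profiles])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(features)
)
print(labels)   # coarse-scale "compartment" label per well
```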

    SMT 2.0: A Surrogate Modeling Toolbox with a focus on Hierarchical and Mixed Variables Gaussian Processes

    The Surrogate Modeling Toolbox (SMT) is an open-source Python package that offers a collection of surrogate modeling methods, sampling techniques, and a set of sample problems. This paper presents SMT 2.0, a major new release that introduces significant upgrades and new features to the toolbox. This release adds the capability to handle mixed-variable surrogate models and hierarchical variables, types of variables that are becoming increasingly important in several surrogate modeling applications. SMT 2.0 also improves SMT by extending the sampling methods, adding new surrogate models, and computing variance and kernel derivatives for Kriging. The release further includes new functions to handle noisy data and to use multi-fidelity data. To the best of our knowledge, SMT 2.0 is the first open-source surrogate library to offer surrogate models for hierarchical and mixed inputs. This open-source software is distributed under the New BSD license.
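    A brief usage sketch of the basic SMT workflow follows (continuous Kriging on a toy 1-D function, assuming the standard LHS sampling and KRG surrogate classes; the mixed- and hierarchical-variable interfaces introduced in SMT 2.0 are not shown):

```python
# Basic SMT workflow: sample a toy function, fit Kriging, predict with variance.
import numpy as np
from smt.sampling_methods import LHS
from smt.surrogate_models import KRG

fun = lambda x: np.sin(x) + 0.1 * x            # toy function to approximate
xlimits = np.array([[0.0, 10.0]])

xt = LHS(xlimits=xlimits)(20)                  # 20 training points via Latin hypercube
yt = fun(xt)

sm = KRG(theta0=[1e-2])                        # Kriging surrogate
sm.set_training_values(xt, yt)
sm.train()

xe = np.linspace(0.0, 10.0, 5).reshape(-1, 1)
print(sm.predict_values(xe))                   # predicted mean
print(sm.predict_variances(xe))                # predictive variance
```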