Integration of Data Mining and Data Warehousing: a practical methodology
The ever-growing repositories of data in all fields pose new challenges to modern analytical systems. Real-world datasets, with mixed numeric and nominal variables, are difficult to analyze and require effective visual exploration that conveys the semantic relationships in the data. Traditional data mining techniques such as clustering handle only numeric data, and little research has tackled the problem of clustering high-cardinality nominal variables to gain better insight into the underlying dataset. Several works in the literature have demonstrated the feasibility of integrating data mining with warehousing to discover knowledge from data. For seamless integration, the mined data has to be modeled in the form of a data warehouse schema. Schema generation is a complex manual task that requires familiarity with both the domain and warehousing, so automated schema-generation techniques are needed to remove these dependencies. To meet growing analytical needs and overcome these limitations, we propose in this paper a novel methodology that permits efficient analysis of mixed numeric and nominal data, effective visual data exploration, automatic warehouse schema generation, and integration of data mining and warehousing. The proposed methodology is evaluated through a case study on a real-world dataset. Results show that multidimensional analysis can be performed in an easier and more flexible way to discover meaningful knowledge from large datasets.
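The mixed numeric/nominal clustering problem the abstract raises can be illustrated with a minimal k-prototypes-style sketch (all function and parameter names here are illustrative, not from the paper): numeric attributes contribute squared Euclidean distance, nominal attributes contribute a weighted mismatch count, and prototypes are updated with means and modes respectively.

```python
import random
from collections import Counter

def kprototypes(rows, num_idx, cat_idx, k, gamma=1.0, iters=10, seed=0):
    """Cluster rows with mixed numeric and nominal attributes.

    Distance = squared Euclidean over numeric columns
             + gamma * number of nominal mismatches.
    """
    rng = random.Random(seed)
    protos = [list(r) for r in rng.sample(rows, k)]
    for _ in range(iters):
        # Assignment step: each row goes to the nearest prototype.
        clusters = [[] for _ in range(k)]
        for r in rows:
            d = [sum((r[i] - p[i]) ** 2 for i in num_idx)
                 + gamma * sum(r[i] != p[i] for i in cat_idx)
                 for p in protos]
            clusters[d.index(min(d))].append(r)
        # Update step: mean for numeric columns, mode for nominal ones.
        for c, members in enumerate(clusters):
            if not members:
                continue
            for i in num_idx:
                protos[c][i] = sum(m[i] for m in members) / len(members)
            for i in cat_idx:
                protos[c][i] = Counter(m[i] for m in members).most_common(1)[0][0]
    return protos, clusters
```

The weight `gamma` balances the two attribute kinds; choosing it is itself a modeling decision, which is part of why clustering mixed data is harder than the purely numeric case.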
Cross-Recurrence Quantification Analysis of Categorical and Continuous Time Series: an R package
This paper describes the R package crqa to perform cross-recurrence
quantification analysis of two time series of either a categorical or
continuous nature. Streams of behavioral information, from eye movements to
linguistic elements, unfold over time. When two people interact, such as in
conversation, they often adapt to each other, leading these behavioral levels
to exhibit recurrent states. In dialogue, for example, interlocutors adapt to
each other by exchanging interactive cues: smiles, nods, gestures, choice of
words, and so on. In order for us to capture closely the goings-on of dynamic
interaction, and uncover the extent of coupling between two individuals, we
need to quantify how much recurrence is taking place at these levels. Methods
available in crqa would allow researchers in cognitive science to pose such
questions as how much are two people recurrent at some level of analysis, what
is the characteristic lag time for one person to maximally match another, or
whether one person is leading another. First, we set the theoretical ground to
understand the difference between 'correlation' and 'co-visitation' when
comparing two time series, using an aggregative or cross-recurrence approach.
Then, we describe more formally the principles of cross-recurrence, and show
with the current package how to carry out analyses applying them. We end the
paper by comparing the computational efficiency, and the consistency of results, of the crqa R package with the benchmark MATLAB toolbox crptoolbox. We show perfect comparability between the two libraries on both levels.
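As a rough illustration of what cross-recurrence quantification computes (a toy sketch, not the crqa API), one can build the cross-recurrence matrix of two equal-length series and read off the recurrence rate and the diagonal lag profile, which speaks to the "characteristic lag" question above:

```python
def cross_recurrence(x, y, radius=0):
    """Build a cross-recurrence matrix between two equal-length series.

    For categorical series use radius=0 (exact match); for continuous
    series a pair of points recurs when |x[i] - y[j]| <= radius.
    """
    n = len(x)
    if radius == 0:
        cr = [[1 if x[i] == y[j] else 0 for j in range(n)] for i in range(n)]
    else:
        cr = [[1 if abs(x[i] - y[j]) <= radius else 0 for j in range(n)]
              for i in range(n)]
    rr = sum(map(sum, cr)) / (n * n)  # recurrence rate
    return cr, rr

def diagonal_profile(cr, max_lag):
    """Recurrence rate along each diagonal: the lag with the highest
    rate indicates how far one series trails the other."""
    n = len(cr)
    prof = {}
    for lag in range(-max_lag, max_lag + 1):
        vals = [cr[i][i + lag] for i in range(n) if 0 <= i + lag < n]
        prof[lag] = sum(vals) / len(vals)
    return prof
```

The crqa package computes these and further measures (determinism, line-length statistics, and so on) efficiently for long series; this sketch only conveys the core idea.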
Homogeneity analysis with k sets of variables: An alternating least squares method with optimal scaling features
Homogeneity analysis, or multiple correspondence analysis, is usually applied to k separate variables. In this paper, it is applied to sets of variables by using sums within sets. The resulting technique is referred to as OVERALS. It uses the notion of optimal scaling, with transformations that can be multiple or single. The single transformations are of three types: (1) nominal, (2) ordinal, and (3) numerical. The corresponding OVERALS computer program minimizes a least squares loss function by using an alternating least squares algorithm. Many existing linear and nonlinear multivariate analysis techniques are shown to be special cases of OVERALS. Disadvantages of the OVERALS method include the possibility of local minima in some complicated special cases, a lack of information on the stability of results, and its inability to handle incomplete data matrices. Means of dealing with some of these problems are suggested. An application of the method to data from an epidemiological survey is provided.
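The alternating least squares iteration underlying homogeneity analysis can be sketched in a few lines (a simplified HOMALS-style loop under assumed conventions; the actual OVERALS program additionally handles sets of variables and multiple/single, ordinal, and numerical transformations, none of which are shown here): alternate between computing category quantifications as conditional means of the object scores and updating the object scores as averages of the quantified variables, renormalizing each pass.

```python
import numpy as np

def homals(columns, ndim=1, iters=100, seed=0):
    """Homogeneity analysis by alternating least squares (sketch).

    columns: list of equal-length sequences of category labels.
    Returns object scores X (n x ndim) and, per variable, the
    category quantifications.
    """
    n = len(columns[0])
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, ndim))
    # Build an indicator (dummy) matrix for each variable.
    Gs = []
    for col in columns:
        cats = sorted(set(col))
        G = np.zeros((n, len(cats)))
        for i, v in enumerate(col):
            G[i, cats.index(v)] = 1.0
        Gs.append(G)
    for _ in range(iters):
        # Category quantifications: conditional means of object scores.
        Ys = [np.linalg.solve(G.T @ G, G.T @ X) for G in Gs]
        # Object scores: average of the quantified variables.
        X = sum(G @ Y for G, Y in zip(Gs, Ys)) / len(Gs)
        # Center and renormalize so the solution does not collapse.
        X -= X.mean(axis=0)
        Q, _ = np.linalg.qr(X)
        X = Q * np.sqrt(n)
    return X, Ys
```

The loss being minimized is the sum of squared differences between the object scores and each quantified variable; the local-minimum caveat in the abstract applies to the restricted (single, ordinal) variants rather than to this unrestricted case.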
An integrated analysis of micro- and macro-habitat features as a tool to detect weather-driven constraints: a case study with cavity nesters
The effects of climate change on animal populations may be shaped by habitat characteristics at both the micro- and macro-habitat level; however, empirical studies integrating these two scales of observation are lacking. Because analyses of the effects of climate change commonly rely on data from a much larger scale than the microhabitat level at which organisms are affected, this mismatch risks hampering progress in understanding the ecological and evolutionary responses of organisms and, ultimately, in developing effective actions to preserve their populations. Cavity nesters, often of conservation concern, are an ideal model because the cavity is a microenvironment potentially different from the macroenvironment but nonetheless inevitably interacting with it. The lesser kestrel (Falco naumanni) is a cavity nester that was until recently classified as a Vulnerable species. For nine years, beginning in 2004, we collected detailed biotic and abiotic data at both micro- and macro-scales of observation in a kestrel population breeding in the Gela Plain (Italy), a Mediterranean area where high temperatures may reach values lethal for the nest content. We show that macroclimatic features had to be integrated with both abiotic and biotic factors recorded at the microscale before nest temperatures could be reliably predicted. Among the nest types used by lesser kestrels, we detected preferential occupation of the cooler nest type, roof tiles, by early breeders whereas, paradoxically, late breeders nesting at hotter temperatures occupied the overheated nest holes. Inconsistent with this apparent nest selection, the coolest nest type did not host higher reproductive success than the overheated nests. We discuss our findings in light of cavity temperatures and of the nest types deployed in conservation actions, assessed by integrating selected factors at different observation scales.
Interactive Correspondence Analysis in a Dynamic Object-Oriented Environment
A highly interactive, user-friendly object-oriented software package written in Lisp-Stat is introduced that performs simple and multiple correspondence analysis, and profile analysis. These three techniques are integrated into a single environment driven by a user-friendly graphical interface that takes advantage of Lisp-Stat's advanced graphical capabilities. Techniques that assess the stability of the solution are also introduced. Features of the package include colored graphics, incremental graph zooming, manual point separation to determine the identities of overlapping points, and stability and fit measures. The features of the package are used to show some interesting trends in a large educational dataset.
Designing and sample size calculation in presence of heterogeneity in biological studies involving high-throughput data.
The design and determination of sample size are important for conducting high-throughput biological experiments such as proteomics experiments and RNA-Seq expression studies, leading to a better understanding of the complex mechanisms underlying various biological processes. Variation in the biological data, or in the technical approaches to data collection, introduces heterogeneity among the samples under study. We critically examined the issues of technical and biological heterogeneity. Quantitative measurements based on liquid chromatography (LC) coupled with mass spectrometry (MS) often suffer from missing values (MVs) and data heterogeneity. We considered a proteomics data set generated from human kidney biopsy material to investigate the technical effects of sample preparation and quantitative MS. We studied the effect of tissue storage methods (TSMs) and tissue extraction methods (TEMs) on data analysis. There were two TSMs: frozen (FR) and formalin-fixed paraffin-embedded (FFPE); and three TEMs: MAX, TX followed by MAX, and SDS followed by MAX. We assessed the impact of different strategies for analyzing the data while accounting for heterogeneity and MVs. We found that FFPE is better than FR for tissue storage, and that the one-step TEM (MAX) is better than the two-step TEMs. Furthermore, we found that imputation is a better approach than excluding the proteins with MVs or using an unbalanced design. We introduce a web application, PWST (Proteomics Workflow Standardization Tool), to standardize the proteomics workflow. The tool helps in deciding the most suitable choice for each step and in studying the variability associated with technical steps, as well as the effects of continuous variables. We have used special cases of the general linear model (ANCOVA and ANOVA with fixed effects) to study the effects due to various sources of variability.
We introduce an interactive tool, "SATP: Statistical Analysis Tool for Proteomics", for analyzing proteomics expression data that scales to large clinical proteomic studies. The user can perform differential expression analysis of proteomics data at either the protein or the peptide level using multiple approaches. We have developed statistical approaches for calculating sample size for proteomics experiments under allocation and cost constraints, along with R programs and a Shiny app, "SSCP: Sample Size Calculator for Proteomics Experiment", for computing sample sizes. We have also proposed statistical approaches for calculating sample size for RNA-Seq experiments considering allocation and cost, and have developed R programs and Shiny apps to calculate sample size for conducting RNA-Seq experiments.
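As a baseline for the kind of computation such sample-size tools perform (a generic two-group formula only; the actual SSCP approaches additionally account for allocation and cost constraints, which are not modeled here), the per-group n for detecting a mean difference with a two-sided z-test is:

```python
from math import ceil
from statistics import NormalDist

def two_group_sample_size(delta, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-sample z-test that
    detects a mean difference `delta` with common standard deviation
    `sigma`, significance level `alpha`, and target power."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)   # critical value for the test
    zb = z.inv_cdf(power)           # quantile for the target power
    return ceil(2 * ((za + zb) * sigma / delta) ** 2)
```

For an effect of half a standard deviation at the conventional 5% level and 80% power, this gives 63 subjects per group; high-throughput designs then inflate such baselines to absorb technical heterogeneity and missing values.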
A General Spatio-Temporal Clustering-Based Non-local Formulation for Multiscale Modeling of Compartmentalized Reservoirs
Representing the reservoir as a network of discrete compartments with
neighbor and non-neighbor connections is a fast, yet accurate method for
analyzing oil and gas reservoirs. Automatic and rapid detection of coarse-scale
compartments with distinct static and dynamic properties is an integral part of
such high-level reservoir analysis. In this work, we present a novel hybrid framework specific to reservoir analysis that couples a physics-based non-local multiscale modeling approach with data-driven techniques for automatic detection of clusters in space from spatial and temporal field data, yielding fast and accurate multiscale models of compartmentalized reservoirs. This research also adds to the literature by presenting a comprehensive treatment of spatio-temporal clustering for reservoir-study applications that accounts for the complexities of clustering, the intrinsically sparse and noisy nature of the data, and the interpretability of the outcome.
Keywords: Artificial Intelligence; Machine Learning; Spatio-Temporal
Clustering; Physics-Based Data-Driven Formulation; Multiscale Modeling
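A minimal sketch of the data-driven half of such a workflow (illustrative only; the function and weight names are assumptions, and the paper's actual formulation is non-local and physics-coupled rather than plain k-means): standardize well locations and per-well temporal features, weight the two blocks, and cluster the stacked features into candidate compartments.

```python
import numpy as np

def cluster_wells(coords, series, w_space=1.0, w_time=1.0,
                  k=2, iters=20, seed=0):
    """Group wells into coarse compartments by joint spatial/temporal
    similarity: standardize (x, y) locations and per-well time-series
    features, weight them, then run k-means on the stacked features."""
    def standardize(a):
        return (a - a.mean(axis=0)) / (a.std(axis=0) + 1e-12)

    X = np.hstack([w_space * standardize(np.asarray(coords, float)),
                   w_time * standardize(np.asarray(series, float))])
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each well to the nearest center, then recompute centers.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels
```

The spatial/temporal weights make explicit the trade-off the abstract alludes to: sparse, noisy dynamic data must not be allowed to override clear spatial compartment boundaries, and vice versa.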
SMT 2.0: A Surrogate Modeling Toolbox with a focus on Hierarchical and Mixed Variables Gaussian Processes
The Surrogate Modeling Toolbox (SMT) is an open-source Python package that
offers a collection of surrogate modeling methods, sampling techniques, and a
set of sample problems. This paper presents SMT 2.0, a major new release of SMT
that introduces significant upgrades and new features to the toolbox. This
release adds the capability to handle mixed-variable surrogate models and
hierarchical variables. These types of variables are becoming increasingly
important in several surrogate modeling applications. SMT 2.0 also improves SMT
by extending sampling methods, adding new surrogate models, and computing
variance and kernel derivatives for Kriging. This release also includes new
functions to handle noisy data and to use multifidelity data. To the best of our
knowledge, SMT 2.0 is the first open-source surrogate library to propose
surrogate models for hierarchical and mixed inputs. This open-source software
is distributed under the New BSD license.
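To illustrate the kind of model and variance output described, here is a NumPy-only Kriging-style sketch (not SMT's implementation; with SMT installed one would instead use the `KRG` class from `smt.surrogate_models`, training via `set_training_values`/`train` and querying `predict_values`/`predict_variances`):

```python
import numpy as np

def gp_fit_predict(xt, yt, xnew, theta=1.0, nugget=1e-10):
    """Minimal 1-D Gaussian-process (Kriging-style) surrogate with a
    squared-exponential kernel, returning predictive means and
    variances -- the kind of output SMT's Kriging models provide."""
    xt, yt, xnew = (np.asarray(a, float) for a in (xt, yt, xnew))
    k = lambda a, b: np.exp(-theta * (a[:, None] - b[None, :]) ** 2)
    K = k(xt, xt) + nugget * np.eye(len(xt))   # regularized Gram matrix
    Ks = k(xnew, xt)                           # cross-covariances
    mean = Ks @ np.linalg.solve(K, yt)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, var
```

The predictive variance is what makes Kriging attractive for adaptive sampling and optimization; SMT 2.0's added variance and kernel derivatives extend exactly this quantity to gradient-based workflows.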