328 research outputs found
Data Cube Approximation and Mining using Probabilistic Modeling
On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and under different abstraction levels in a dimension hierarchy. However, such techniques are not aimed at mining multidimensional data.
Since data cubes are nothing but multi-way tables, we propose to analyze the potential of two probabilistic modeling techniques, namely non-negative multi-way array factorization and log-linear modeling, with the ultimate objective of compressing and mining aggregate and multidimensional values. With the first technique, we compute the set of components that best fit the initial data set and whose superposition coincides with the original data; with the second technique we identify a parsimonious model (i.e., one with a reduced set of parameters), highlight strong associations among dimensions and discover possible outliers in data cells. A real-life example will be used to (i) discuss the potential benefits of the modeling output on cube exploration and mining, (ii) show how OLAP queries can be answered in an approximate way, and (iii) illustrate the strengths and limitations of these modeling approaches.
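As a minimal illustration of the log-linear idea in the abstract above, the sketch below fits the simplest such model (independence of two dimensions) to a small two-way slice of a cube, uses the fitted values as an approximation of the cells, and flags cells with large Pearson residuals as possible outliers. The table values, the function names and the residual threshold of 2 are assumptions made for illustration; they are not taken from the paper.

```python
from math import sqrt

def independence_fit(table):
    """Expected counts under the independence log-linear model:
    m[i][j] = row_total[i] * col_total[j] / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def pearson_residuals(table, fitted):
    """(observed - expected) / sqrt(expected) for every cell."""
    return [[(o - e) / sqrt(e) for o, e in zip(obs_row, exp_row)]
            for obs_row, exp_row in zip(table, fitted)]

# Hypothetical counts: rows = regions, columns = product lines.
cube_slice = [
    [20, 30, 50],
    [40, 60, 100],
    [10, 15, 95],
]

fitted = independence_fit(cube_slice)
residuals = pearson_residuals(cube_slice, fitted)

# Cells whose residual exceeds the (assumed) threshold of 2 in absolute value.
outliers = [(i, j)
            for i, row in enumerate(residuals)
            for j, r in enumerate(row)
            if abs(r) > 2.0]

# Approximate answer to a cell-level query, read from the compressed model.
approx_cell = fitted[2][2]
```

The independence model stores only the marginal totals, so roll-up queries along a single dimension are reproduced exactly, while cell-level queries are answered approximately.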
A Genealogy of Correspondence Analysis: Part 2 - The Variants
In 2012, a comprehensive historical and genealogical discussion of correspondence analysis was published in the Australian and New Zealand Journal of Statistics. That genealogy consisted of more than 270 key books and articles and focused on the historical development of correspondence analysis, a statistical tool which provides the analyst with a visual inspection of the association between two or more categorical variables. In this new genealogy, we provide a brief overview of over 30 variants of correspondence analysis that now exist outside of the traditional approaches used to analyse the association between two or more categorical variables. It comprises a bibliography of more than 300 books and articles that were not included in the 2012 bibliography and highlights the growth in the development of correspondence analysis across all areas of research.
The zCOSMOS redshift survey: the three-dimensional classification cube and bimodality in galaxy physical properties
Aims. We investigate the relationships between three main optical galaxy
observables (spectral properties, colours, and morphology), exploiting the data
set provided by the COSMOS/zCOSMOS survey. The purpose of this paper is to
define a simple galaxy classification cube, using a carefully selected sample
of around 1000 galaxies. Methods. Using medium resolution spectra of the first
1k zCOSMOS-bright sample, optical photometry from the Subaru/COSMOS
observations, and morphological measurements derived from ACS imaging, we
analyze the properties of the galaxy population out to z~1. Applying three
straightforward classification schemes (spectral, photometric, and
morphological), we identify two main galaxy types, which appear to be linked to
the bimodality of galaxy population. The three parametric classifications
constitute the axes of a "classification cube". Results. A very good agreement
exists between the classification from spectral data (quiescent/star-forming
galaxies) and that based on colours (red/blue galaxies). The third parameter
(morphology) is less well correlated with the first two: in fact a good
correlation between the spectral classification and that based on morphological
analysis (early-/late-type galaxies) is achieved only after partially
complementing the morphological classification with additional colour
information. Finally, analyzing the 3D-distribution of all galaxies in the
sample, we find that about 85% of the galaxies show a fully concordant
classification, being either quiescent, red, bulge-dominated galaxies (~20%) or
star-forming, blue, disk-dominated galaxies (~65%). These results imply that
the galaxy bimodality is a consistent behaviour both in morphology, colour and
dominant stellar population, at least out to z~1. Comment: 11 pages, Accepted for publication in A&
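The concordance figure quoted in the abstract above can be illustrated with a small sketch that places each galaxy in a 2x2x2 cube of binary classes and counts the two fully concordant corners. The labels below are invented toy data chosen to mirror the quoted ~20% and ~65% shares; they are not survey measurements, and the function names are hypothetical.

```python
from collections import Counter

def classification_cube(galaxies):
    """Count galaxies per (spectral, colour, morphology) cell."""
    return Counter(galaxies)

def concordant_fraction(cube):
    """Share of galaxies in the two corners where all three schemes agree."""
    total = sum(cube.values())
    early = cube[("quiescent", "red", "bulge")]
    late = cube[("star-forming", "blue", "disk")]
    return (early + late) / total

# Toy sample of 100 galaxies (assumed, for illustration only).
toy_sample = (
    [("quiescent", "red", "bulge")] * 20
    + [("star-forming", "blue", "disk")] * 65
    + [("star-forming", "red", "disk")] * 9
    + [("quiescent", "red", "disk")] * 6
)

cube = classification_cube(toy_sample)
fraction = concordant_fraction(cube)
```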
The zCOSMOS redshift survey: the three-dimensional classification cube and bimodality in galaxy physical properties
Aims. We investigate the relationships between three main optical galaxy observables (spectral properties, colors, and morphology), exploiting the data set provided by the COSMOS/zCOSMOS survey. The purpose of this paper is to define a simple galaxy classification cube, with a carefully selected sample of ~1000 galaxies.
Methods. Using medium resolution spectra of the first zCOSMOS-bright sample, optical photometry from the Subaru/COSMOS observations, and morphological measurements derived from ACS imaging, we analyze the properties of the galaxy population out to z ~ 1. Applying three straightforward classification schemes (spectral, photometric, and morphological), we identify two main galaxy types, which appear to be linked to the bimodality of galaxy population. The three parametric classifications constitute the axes of a "classification cube".
Results. A very good agreement exists between the classification from spectral data (quiescent/star-forming galaxies) and the one based on colors (red/blue galaxies). The third parameter (morphology) is not as well correlated with the first two; in fact, a good correlation between the spectral classification and the classification based on morphological analysis (early-/late-type galaxies) is achieved only after partially complementing the morphological classification with additional color information. Finally, analyzing the 3D-distribution of all galaxies in the sample, we find that about 85% of the galaxies show a fully concordant classification, being either quiescent, red, bulge-dominated galaxies (~20%) or star-forming, blue, disk-dominated galaxies (~65%). These results imply that the galaxy bimodality is a consistent behavior both in morphology, color, and dominant stellar population, at least out to z ~ 1.
Mignoli, M.; Zamorani, G.; Scodeggio, M.; Cimatti, A.; Halliday, C.; Lilly, S. J.; Pozzetti, L.; Vergani, D.; Carollo, C. M.; Contini, T.; Le Févre, O.; Mainieri, V.; Renzini, A.; Bardelli, S.; Bolzonella, M.; Bongiorno, A.; Caputi, K.; Coppa, G.; Cucciati, O.; de la Torre, S.; de Ravel, L.; Franzetti, P.; Garilli, B.; Iovino, A.; Kampczyk, P.; Kneib, J.-P.; Knobel, C.; Kovač, K.; Lamareille, F.; Le Borgne, J.-F.; Le Brun, V.; Maier, C.; Pellò, R.; Peng, Y.; Perez Montero, E.; Ricciardelli, E.; Scarlata, C.; Silverman, J. D.; Tanaka, M.; Tasca, L.; Tresse, L.; Zucca, E.; Abbas, U.; Bottini, D.; Capak, P.; Cappi, A.; Cassata, P.; Fumana, M.; Guzzo, L.; Leauthaud, A.; Maccagni, D.; Marinoni, C.; McCracken, H. J.; Memeo, P.; Meneux, B.; Oesch, P.; Porciani, C.; Scaramella, R.; Scoville, N.
Explanation of Exceptional Values in Multi-dimensional Business Databases
"How can the functionality of multi-dimensional business databases be extended with
diagnostic capabilities to support managerial decision-making?" This question states
the main research problem addressed in this thesis. Before giving an answer, the question
first requires clarification and delineation. In this chapter, the research question
is placed briefly into context, both regarding academic and business relevance. This
leads to the formulation of three specific research questions. Subsequently, a section
is dedicated to each specific research question. An outline of this thesis concludes the
chapter.
Contributions to the multivariate Analysis of Marine Environmental Monitoring
The thesis proceeds from the view that statistics starts with data, and begins by introducing the data sets studied: marine benthic species counts and chemical measurements made at a set of sites in the Norwegian Ekofisk oil field, with replicates and annually repeated. An introductory chapter details the sampling procedure and shows with reliability calculations that the (transformed) chemical variables have excellent reliability, whereas the biological variables have poor reliability, except for a small subset of abundant species. Transformed chemical variables are shown to be approximately normal. Bootstrap methods are used to assess whether the biological variables follow a Poisson distribution, and lead to the conclusion that the Poisson distribution must be rejected, except for rare species. A separate chapter details more work on the distribution of the species variables: truncated and zero-inflated Poisson distributions as well as Poisson mixtures are used in order to account for sparseness and overdispersion. Species are thought to respond to environmental variables, and regressions of the abundance of a few selected species onto chemical variables are reported. For rare species, logistic regression and Poisson regression are the tools considered, though there are problems of overdispersion. For abundant species, random coefficient models are needed in order to cope with intraclass correlation. The environmental variables, mainly heavy metals, are highly correlated, leading to multicollinearity problems. The next chapters use a multivariate approach, in which all species data are treated simultaneously. The theory of correspondence analysis is reviewed, and some theoretical results on this method are reported (bounds for singular values, centring matrices). 
An applied chapter discusses the correspondence analysis of the species data in detail, detects outliers, addresses stability issues, and considers different ways of stacking data matrices to obtain an integrated analysis of several years of data, and to decompose variation into a within-sites and a between-sites component. More than 40% of the total inertia is due to variation within stations. Principal components analysis is used to analyse the set of chemical variables. Attempts are made to integrate the analysis of the biological and chemical variables. A detailed theoretical development shows how continuous variables can be mapped in an optimal manner as supplementary vectors into a correspondence analysis biplot. Geometrical properties are worked out in detail, and measures for the quality of the display are given, whereas artificial data and data from the monitoring survey are used to illustrate the theory developed. The theory of display of supplementary variables in biplots is also worked out in detail for principal component analysis, with attention to the different types of scaling, and optimality of displayed correlations. A theoretical chapter follows that gives an in-depth treatment of canonical correspondence analysis (linearly constrained correspondence analysis, CCA for short), detailing many mathematical properties and aspects of this multivariate method, such as geometrical properties, biplots, use of generalized inverses, relationships with other methods, etc. Some applications of CCA to the survey data are dealt with in a separate chapter, with their interpretation and indication of the quality of the display of the different matrices involved in the analysis. Weighted principal component analysis of weighted averages is proposed as an alternative for CCA. 
This leads to a better display of the weighted averages of the species, and in the cases so far studied, also leads to biplots with a higher amount of explained variance for the environmental data. The thesis closes with a bibliography and outlines some suggestions for further research, such as the generalization of canonical correlation analysis for working with singular covariance matrices, the use of partial least squares methods to account for the excess of predictors, and data fusion problems to estimate missing biological data. Postprint (published version
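The abstract above mentions bootstrap checks of the Poisson assumption for species counts. One way such a check could look is a parametric bootstrap of the variance-to-mean ratio, sketched below. This is an assumed procedure for illustration, not necessarily the one used in the thesis; the counts are invented and the function names are hypothetical.

```python
import math
import random

def dispersion_index(counts):
    """Sample variance divided by sample mean (about 1 for Poisson data)."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean

def poisson_sample(lam, rng):
    """Draw one Poisson variate with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def bootstrap_p_value(counts, n_boot=2000, seed=0):
    """Share of Poisson-simulated samples at least as dispersed as the data."""
    rng = random.Random(seed)
    observed = dispersion_index(counts)
    lam = sum(counts) / len(counts)
    exceed = 0
    for _ in range(n_boot):
        sim = [poisson_sample(lam, rng) for _ in counts]
        if sum(sim) == 0:
            continue  # index undefined for an all-zero sample; count as no exceedance
        if dispersion_index(sim) >= observed:
            exceed += 1
    return observed, exceed / n_boot

# Invented counts of one species at 12 sites: many zeros, a few large values.
species_counts = [0, 0, 1, 0, 12, 0, 3, 0, 25, 0, 1, 0]
observed_index, p_value = bootstrap_p_value(species_counts)
```

A dispersion index far above 1 together with a small bootstrap p-value indicates overdispersion relative to the Poisson model, which is the situation that zero-inflated Poisson models and Poisson mixtures are meant to handle.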