
    Change detection in categorical evolving data streams

    Detecting change in evolving data streams is a central issue for accurate adaptive learning. In real-world applications, data streams often have categorical features, yet changes in the distribution of these features have received little attention so far. Previous work on change detection focused on detecting changes in the accuracy of the learners, without considering changes in the data distribution. To address these issues, we propose a new unsupervised change detection method, called CDCStream (Change Detection in Categorical Data Streams), well suited for categorical data streams. The proposed method detects changes in a batch-incremental scenario. It rests on two components: (i) a summarization strategy that compresses the current batch by extracting a descriptive summary and (ii) a new segmentation algorithm that highlights changes and issues warnings for a data stream. To evaluate our proposal, we employ it in a learning task over real-world data and compare its results with state-of-the-art methods. We also report a qualitative evaluation to show the behavior of CDCStream.

    A High-Fidelity Realization of the Euclid Code Comparison $N$-body Simulation with Abacus

    We present a high-fidelity realization of the cosmological $N$-body simulation from the Schneider et al. (2016) code comparison project. The simulation was performed with our Abacus $N$-body code, which offers high force accuracy, high performance, and minimal particle integration errors. The simulation consists of $2048^3$ particles in a $500\ h^{-1}\mathrm{Mpc}$ box, for a particle mass of $1.2\times 10^9\ h^{-1}\mathrm{M}_\odot$ with $10\ h^{-1}\mathrm{kpc}$ spline softening. Abacus executed 1052 global time steps to $z=0$ in 107 hours on one dual-Xeon, dual-GPU node, for a mean rate of 23 million particles per second per step. We find Abacus is in good agreement with Ramses and Pkdgrav3 and less so with Gadget3. We validate our choice of time step by halving the step size and find sub-percent differences in the power spectrum and 2PCF at nearly all measured scales, with $<0.3\%$ errors at $k<10\ \mathrm{Mpc}^{-1}h$. On large scales, Abacus reproduces linear theory better than $0.01\%$. Simulation snapshots are available at http://nbody.rc.fas.harvard.edu/public/S2016. Comment: 13 pages, 8 figures. Minor changes to match MNRAS accepted version.
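The quoted mean rate can be checked directly from the other figures in the abstract. The sketch below assumes the rate is defined as total particle updates (particles times global time steps) divided by wall-clock seconds; the function name is hypothetical and not part of Abacus.

```python
def mean_particle_update_rate(particles_per_side, n_steps, wall_hours):
    """Return particle updates per second of wall-clock time.

    Assumes every particle is updated once per global time step,
    which is how the abstract's "per second per step" rate is read here.
    """
    n_particles = particles_per_side ** 3      # 2048^3 particles
    total_updates = n_particles * n_steps      # one update per particle per step
    wall_seconds = wall_hours * 3600
    return total_updates / wall_seconds


# Figures taken from the abstract: 2048^3 particles, 1052 steps, 107 hours.
rate = mean_particle_update_rate(2048, 1052, 107)
print(f"{rate / 1e6:.1f} million particle updates per second")
```

Under this reading the result lands between 23 and 24 million updates per second, consistent with the 23 million stated in the abstract.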

    SOM-based algorithms for qualitative variables

    It is well known that the SOM algorithm achieves a clustering of data which can be interpreted as an extension of Principal Component Analysis, because of its topology-preserving property. But the SOM algorithm can only process real-valued data. In previous papers, we have proposed several methods based on the SOM algorithm to analyze categorical data, as is typical of survey data. In this paper, we present these methods in a unified manner. The first one (Kohonen Multiple Correspondence Analysis, KMCA) deals only with the modalities, while the two others (Kohonen Multiple Correspondence Analysis with individuals, KMCA_ind, and Kohonen algorithm on DISJonctive table, KDISJ) can take into account the individuals and the modalities simultaneously. Comment: Special issue following WSOM 03 in Kitakiushu.

    Improving the family orientation process in Cuban Special Schools through Nearest Prototype classification

    Cuban Schools for children with Affective-Behavioral Maladies (SABM) aim to achieve a major change in children's behavior, so that the children can be integrated effectively into society. A key element of this objective is giving adequate orientation to the children's families, because the family is one of the most important educational contexts in which children develop their personality. The family orientation process in SABM involves clustering and classification of mixed-type data with non-symmetric similarity functions. To improve this process, this paper introduces some novel characteristics in clustering and prototype selection. The proposed approach uses hierarchical clustering based on compact sets, making it suitable for non-symmetric similarity functions as well as for mixed and incomplete data. The proposal obtains very good results on the SABM data and on repository databases.

    Robust PCA as Bilinear Decomposition with Outlier-Sparsity Regularization

    Principal component analysis (PCA) is widely used for dimensionality reduction, with well-documented merits in various applications involving high-dimensional data, including computer vision, preference measurement, and bioinformatics. In this context, the fresh look advocated here permeates benefits from variable selection and compressive sampling, to robustify PCA against outliers. A least-trimmed squares estimator of a low-rank bilinear factor analysis model is shown closely related to that obtained from an $\ell_0$-(pseudo)norm-regularized criterion encouraging sparsity in a matrix explicitly modeling the outliers. This connection suggests robust PCA schemes based on convex relaxation, which lead naturally to a family of robust estimators encompassing Huber's optimal M-class as a special case. Outliers are identified by tuning a regularization parameter, which amounts to controlling sparsity of the outlier matrix along the whole robustification path of (group) least-absolute shrinkage and selection operator (Lasso) solutions. Beyond its neat ties to robust statistics, the developed outlier-aware PCA framework is versatile to accommodate novel and scalable algorithms to: i) track the low-rank signal subspace robustly, as new data are acquired in real time; and ii) determine principal components robustly in (possibly) infinite-dimensional feature spaces. Synthetic and real data tests corroborate the effectiveness of the proposed robust PCA schemes, when used to identify aberrant responses in personality assessment surveys, as well as unveil communities in social networks, and intruders from video surveillance data. Comment: 30 pages, submitted to IEEE Transactions on Signal Processing.
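The idea of modeling outliers with an explicit sparse matrix can be illustrated with a small alternating scheme. This is a minimal sketch under stated assumptions, not the estimator from the paper: it uses an entrywise $\ell_1$ penalty as the convex surrogate for the $\ell_0$ (pseudo)norm, omits mean centering and the group-Lasso variant, and minimizes $\|Y - L - O\|_F^2 + \lambda\|O\|_1$ subject to $\mathrm{rank}(L) \le q$. The function names, `rank`, `lam`, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def soft_threshold(x, tau):
    """Entrywise soft-thresholding: shrink magnitudes by tau, clip at zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def robust_pca_sketch(Y, rank, lam, n_iter=50):
    """Alternate between a low-rank fit L and a sparse outlier matrix O.

    Illustrative only: entrywise l1 penalty, no centering, fixed iteration count.
    """
    O = np.zeros_like(Y)
    for _ in range(n_iter):
        # Low-rank step: best rank-`rank` approximation of the outlier-compensated data.
        U, s, Vt = np.linalg.svd(Y - O, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Outlier step: the minimizer of ||R - O||_F^2 + lam * ||O||_1
        # is soft-thresholding of the residual R = Y - L at lam / 2.
        O = soft_threshold(Y - L, lam / 2.0)
    return L, O


# Synthetic demo: rank-2 data, small noise, five planted gross outliers.
rng = np.random.default_rng(0)
Y = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 30))
Y += 0.01 * rng.standard_normal(Y.shape)
planted = [(0, 0), (5, 7), (12, 3), (20, 15), (33, 28)]
for i, j in planted:
    Y[i, j] += 10.0

L, O = robust_pca_sketch(Y, rank=2, lam=1.0)
# Entries of O with the largest magnitude are the candidate outliers.
top = np.argsort(np.abs(O).ravel())[-len(planted):]
print(sorted(zip(*np.unravel_index(top, O.shape))))
```

Larger values of `lam` drive more entries of `O` to zero, so sweeping `lam` traces out progressively sparser outlier matrices, in the spirit of the robustification path the abstract describes.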