1,960 research outputs found

    Incremental procedures for partitioning highly intermixed multi-class datasets into hyper-spherical and hyper-ellipsoidal clusters

    Two procedures for partitioning large collections of highly intermixed datasets of different classes into a number of hyper-spherical or hyper-ellipsoidal clusters are presented. The incremental procedures generate a minimum number of hyper-spherical or hyper-ellipsoidal clusters, with each cluster containing a maximum number of data points of the same class. The procedures extend the move-to-front algorithms originally designed for constructing minimum-sized enclosing balls or ellipsoids for a dataset of a single class. The resulting clusters can be used for data modeling, outlier detection, discriminant analysis, and knowledge discovery.
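    The core building block the abstract mentions, an incrementally grown enclosing ball, can be illustrated with a minimal sketch. This is a common streaming heuristic (expand the ball just enough to cover each outside point), not the paper's exact move-to-front procedure; the function name is illustrative.

```python
import numpy as np

def incremental_enclosing_ball(points):
    """Grow one enclosing ball incrementally over a stream of points."""
    points = np.asarray(points, dtype=float)
    center = points[0].copy()
    radius = 0.0
    for p in points[1:]:
        d = np.linalg.norm(p - center)
        if d > radius:
            # Expand minimally: the new ball touches the old far side and p.
            new_radius = (radius + d) / 2.0
            center += (p - center) * ((new_radius - radius) / d)
            radius = new_radius
    return center, radius
```

    A class-aware variant, as in the paper, would additionally split a ball whenever it starts absorbing points of a different class.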

    Clustering of nonstationary data streams: a survey of fuzzy partitional methods

    Data streams have arisen as a relevant research topic during the past decade. They are real-time, incremental in nature, temporally ordered, and massive; they contain outliers, and the objects in a data stream may evolve over time (concept drift). Clustering is often one of the earliest and most important steps in the streaming data analysis workflow. A comprehensive literature is available on stream data clustering; however, less attention has been devoted to the fuzzy clustering approach, even though the nonstationary nature of many data streams makes it especially appealing. This survey discusses relevant data stream clustering algorithms, focusing mainly on fuzzy methods, including their treatment of outliers and of concept drift and shift. Funded by the Ministero dell'Istruzione, dell'Università e della Ricerca.
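    The fuzzy-partition idea the survey centers on can be shown with the standard fuzzy c-means membership rule, applied to one streaming sample at a time; this is a generic illustration, not any specific algorithm from the survey.

```python
import numpy as np

def fuzzy_memberships(x, centers, m=2.0):
    """Fuzzy c-means membership degrees of one sample x to each center.

    m > 1 is the fuzzifier: larger m gives softer partitions.
    """
    d = np.linalg.norm(np.asarray(centers, dtype=float)
                       - np.asarray(x, dtype=float), axis=1)
    d = np.maximum(d, 1e-12)              # avoid division by zero
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum()                # memberships sum to 1
```

    In a streaming setting, these soft memberships let a sample update several cluster prototypes at once, which is what makes the fuzzy approach attractive under gradual concept drift.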

    DENFIS: Dynamic Evolving Neural-Fuzzy Inference System and its Application for Time Series Prediction

    This paper introduces a new type of fuzzy inference system, denoted DENFIS (dynamic evolving neural-fuzzy inference system), for adaptive on-line and off-line learning, and its application to dynamic time series prediction. DENFIS evolves through incremental, hybrid (supervised/unsupervised) learning and accommodates new input data, including new features and new classes, through local element tuning. New fuzzy rules are created and updated during the operation of the system. At each time moment the output of DENFIS is calculated through a fuzzy inference system based on the m most activated fuzzy rules, which are dynamically chosen from a fuzzy rule set. Two approaches are proposed: (1) dynamic creation of a first-order Takagi-Sugeno-type fuzzy rule set for a DENFIS on-line model; (2) creation of a first-order Takagi-Sugeno-type fuzzy rule set, or an expanded higher-order one, for a DENFIS off-line model. A set of fuzzy rules can be inserted into DENFIS before or during its learning process, and fuzzy rules can also be extracted during or after learning. An evolving clustering method (ECM), employed in both the on-line and off-line DENFIS models, is also introduced. It is demonstrated that DENFIS can effectively learn complex temporal sequences in an adaptive way and outperform some well-known existing models.
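    The evolving clustering method (ECM) mentioned above can be sketched as a distance-threshold, one-pass procedure: a sample either falls inside an existing cluster, minimally enlarges the nearest cluster, or spawns a new one. The threshold name `dthr` and the exact update rule below are simplifying assumptions, not the paper's precise formulation.

```python
import numpy as np

class EvolvingClustering:
    """One-pass, threshold-based evolving clustering in the spirit of ECM."""

    def __init__(self, dthr):
        self.dthr = dthr        # maximum allowed cluster radius after update
        self.centers = []       # cluster centers
        self.radii = []         # cluster radii

    def update(self, x):
        """Assimilate one sample; return the index of its cluster."""
        x = np.asarray(x, dtype=float)
        if not self.centers:
            self.centers.append(x)
            self.radii.append(0.0)
            return 0
        dists = [np.linalg.norm(x - c) for c in self.centers]
        j = int(np.argmin(dists))
        if dists[j] <= self.radii[j]:
            return j                              # inside an existing cluster
        if dists[j] + self.radii[j] <= 2.0 * self.dthr:
            # Enlarge cluster j just enough to cover x.
            new_r = (dists[j] + self.radii[j]) / 2.0
            self.centers[j] = self.centers[j] + (x - self.centers[j]) * (
                (new_r - self.radii[j]) / dists[j])
            self.radii[j] = new_r
            return j
        self.centers.append(x)                    # too far: new cluster
        self.radii.append(0.0)
        return len(self.centers) - 1
```

    In DENFIS, each such cluster would anchor a fuzzy rule, so rule creation and tuning follow directly from this clustering step.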

    Unsupervised tracking of time-evolving data streams and an application to short-term urban traffic flow forecasting

    I am indebted to many people for the help and support I received during my Ph.D. study and research at DIBRIS, University of Genoa. First and foremost, I would like to express my sincere thanks to my supervisors, Prof. Dr. Masulli and Prof. Dr. Rovetta, for their invaluable guidance, frequent meetings and discussions, and their encouragement and support along my research path. I thank all the members of DIBRIS for their support and kindness during my four-year Ph.D. I would also like to acknowledge the contribution of the projects Piattaforma per la mobilità Urbana con Gestione delle INformazioni da sorgenti eterogenee (PLUG-IN) and COST Action IC1406 High Performance Modelling and Simulation for Big Data Applications (cHiPSet). Last and most importantly, I wish to thank my family: my wife Shaimaa, who stays with me through the joys and pains; my daughter and son, who give me happiness every day; and my parents, for their constant love and encouragement.

    Semistructured and structured data manipulation.

    by Kuo Yin-Hung. Thesis (M.Phil.)--Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 91-97). Abstracts in English and Chinese. Contents:
    Chapter 1, Introduction: Web Document Classification; Web Document Integration; Dictionary and Incremental Update; IR-Tree; Thesis Overview.
    Chapter 2, Related Works: Semi-structured Data and OEM (Object Exchange Model); Web Document Partitioning (Retrieval of Authoritatives, Document Categorization Methodology); Semi-structured Data Indexing (Lore, Tsimmis, other algorithms); SAMs (R-Tree and R*-Tree, SS-Tree and SR-Tree, TV-Tree and X-Tree); Clustering Algorithms (DBSCAN and Incremental-DBSCAN).
    Chapter 3, Web Document Classification: Basic Definitions; Similarity Computation (Structural Transformation, Node Similarity, Edge Label Similarity, Structural Similarity, Overall Similarity, Representative Selection); Incremental Update (documents related to a subset, unrelated to any subset, or linking up two or more subsets); Experimental Results (comparison with K-NN, representative vs feature vector).
    Chapter 4, Web Document Integration: Structure Borrowing; Integration of Seeds; Incremental Update (new OEM record is a normal record, new record is a potential seed).
    Chapter 5, Dictionary: Structure of a Dictionary Entry; Dictionary as Relation Identifier; Dictionary as Complement of Representative; Incremental Update; Experimental Results (search based on keyword, search by submitting ambiguous words, retrieval of related words).
    Chapter 6, Structured Data Manipulation, IR-Tree: Range Search vs Nearest Neighbor Search; Why R*-Tree and Incremental-DBSCAN?; IR-Tree, the Integration of Clustering and Indexing (index structure, insertion, deletion, nearest neighbor search, discussion); Experimental Results (general knn-search performance, performance on varying dimensionality and distribution).
    Chapter 7, IM-Tree, a Review: Indexing Techniques on Metric Space; Clustering Algorithms on Metric Space; the Integration of Clustering and Metric-Space Indexing Algorithms; Proposed Algorithm (index structure, nearest neighbor search); Future Works.
    Chapter 8, Conclusion and Future Works: Semi-structured Data Manipulation; Structured Data Manipulation.

    Query-driven learning for predictive analytics of data subspace cardinality

    Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace exploration, data subspace visualization, and query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly money-wise, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain and maintain, or (iii) infeasible, e.g., due to privacy issues. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate in its predictions and accommodates the well-known selection query types: multi-dimensional range queries and distance-nearest-neighbor (radius) queries. Our function estimation model (i) quantizes the vectorial query space by learning the analysts' access patterns over a data space, (ii) associates query vectors with the corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers performance superior to that of data-driven approaches.
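    The central idea, predicting an unseen query's cardinality from the cardinalities of similar past queries, can be sketched with a simple inverse-distance-weighted k-NN estimator over query vectors. This is a stand-in for the paper's learned function estimation model; the function and parameter names are illustrative.

```python
import numpy as np

def predict_cardinality(past_queries, past_cards, q, k=3):
    """Estimate the cardinality of query vector q from the k most
    similar previously executed query vectors and their observed
    cardinalities."""
    Q = np.asarray(past_queries, dtype=float)
    y = np.asarray(past_cards, dtype=float)
    d = np.linalg.norm(Q - np.asarray(q, dtype=float), axis=1)
    nn = np.argsort(d)[:k]                # indices of k nearest queries
    w = 1.0 / (d[nn] + 1e-9)              # inverse-distance weights
    return float(np.dot(w, y[nn]) / w.sum())
```

    The paper's model goes further by quantizing the query space into learned prototypes and adapting them when the query workload shifts, but the similarity-based prediction step follows this pattern.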

    Advanced similarity queries and their application in data mining

    Ph.D. thesis (Doctor of Philosophy).