4,047 research outputs found

    A systematic review of data quality issues in knowledge discovery tasks

    The volume of data is growing rapidly because organizations continuously capture data to support better decision-making. The fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of the data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.

    Streaming Feature Grouping and Selection (Sfgs) For Big Data Classification

    Real-time data has always been an essential element for organizations when the quickness of data delivery is critical to their businesses. Today, organizations understand the importance of real-time data analysis to maintain benefits from their generated data. Real-time data analysis is also known as real-time analytics, streaming analytics, real-time streaming analytics, and event processing. Stream processing is the key to getting results in real-time. It allows us to process the data stream in real-time as it arrives. The concept of streaming data means the data are generated dynamically, and the full stream is unknown or even infinite. This data becomes massive and diverse and forms what is known as a big data challenge. In machine learning, streaming feature selection has always been a preferred method in the preprocessing of streaming data. Recently, feature grouping, which can measure the hidden information between selected features, has begun gaining attention. This dissertation’s main contribution is in solving the issue of the extremely high dimensionality of streaming big data by delivering a streaming feature grouping and selection algorithm. Also, the literature review presents a comprehensive review of the current streaming feature selection approaches and highlights the state-of-the-art algorithms trending in this area. The proposed algorithm is designed with the idea of grouping together similar features to reduce redundancy and handle the stream of features in an online fashion. This algorithm has been implemented and evaluated using benchmark datasets against state-of-the-art streaming feature selection algorithms and feature grouping techniques. The results showed better performance regarding prediction accuracy than with state-of-the-art algorithms
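
As a rough illustration of grouping similar features as they arrive one at a time, the sketch below assigns each incoming feature to the first existing group whose representative it correlates with strongly, and opens a new group otherwise. This is not the dissertation's algorithm: the function names, the use of Pearson correlation as the similarity measure, and the 0.9 threshold are all assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def group_streaming_features(feature_stream, threshold=0.9):
    """Group features online; `threshold` is a hypothetical similarity cut-off.

    `feature_stream` yields (name, values) pairs, one feature at a time.
    The first member of each group acts as its representative.
    """
    groups = []  # each group: {"rep_values": [...], "members": [names]}
    for name, values in feature_stream:
        for group in groups:
            if abs(pearson(values, group["rep_values"])) >= threshold:
                group["members"].append(name)
                break
        else:
            # No sufficiently similar group exists, so start a new one.
            groups.append({"rep_values": values, "members": [name]})
    return [group["members"] for group in groups]


stream = [
    ("a", [1, 2, 3, 4]),
    ("b", [2, 4, 6, 8]),    # perfectly correlated with "a"
    ("c", [1, -1, 1, -1]),  # weakly correlated with "a"
    ("d", [4, 3, 2, 1]),    # perfectly anti-correlated with "a"
]
feature_groups = group_streaming_features(stream)
```

Keeping one representative per group is one simple way to reduce redundancy without revisiting earlier features, which suits a setting where the full feature stream is unknown in advance.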

    Automatic Target Recognition Strategy for Synthetic Aperture Radar Images Based on Combined Discrimination Trees

    A strategy is introduced for achieving high accuracy in synthetic aperture radar (SAR) automatic target recognition (ATR) tasks. Initially, a novel pose rectification process and an image normalization process are applied sequentially to produce images with less variation prior to the feature processing stage. Then, feature sets rich in texture and edge information are extracted using wavelet coefficients, and more effective and compact feature sets are obtained by reducing the redundancy and dimensionality of the extracted feature set. Finally, a group of discrimination trees is learned and combined into a final classifier in the framework of Real-AdaBoost. The proposed method is evaluated on the public release database for moving and stationary target acquisition and recognition (MSTAR). Several comparative studies are conducted to evaluate the effectiveness of the proposed algorithm. Experimental results show the distinct superiority of the proposed method under both standard operating conditions (SOCs) and extended operating conditions (EOCs). Moreover, our additional tests suggest that good recognition accuracy can be achieved even with a limited number of training images, as long as these are captured with an appropriately incremental sample step in target poses.

    Sélection de variables pour l’analyse des données semi-supervisées dans les systèmes d’Information décisionnels

    Feature selection is an important task in data mining and machine learning. The task is well known in both supervised and unsupervised contexts, whereas semi-supervised feature selection is still under development and far from mature. Recently, machine learning on partially labeled data has been well developed. Feature selection has therefore gained special importance in the semi-supervised context, which is better suited to real-world applications where labels are costly to obtain. In this thesis, we present a literature review on semi-supervised feature selection, with regard to supervised and unsupervised contexts. The goal is to show the importance of finding a compromise between the structure of the unlabeled part of the data and the background information from the labeled part. In particular, we are interested in the so-called «small labeled-sample problem», where the gap between the two parts of the data is very large. To deal with the problem of semi-supervised feature selection, we propose two groups of approaches. The first group is of «Filter» type, in which we propose algorithms that evaluate the relevance of features by a scoring function. In our case, this function is based on spectral-graph theory and the integration of pairwise constraints that can be extracted from the data at hand. The second group of methods is of «Embedded» type, where feature selection becomes an internal function integrated in the learning process. To realize embedded feature selection, we propose algorithms based on feature weighting. The proposed methods rely on constrained clustering. In this sense, we propose two visions: (1) a global vision, based on relaxed satisfaction of pairwise constraints, achieved by integrating the constraints in the objective function of the proposed clustering model; and (2) a second vision, which is local and based on strict control of constraint violation. 
Both approaches evaluate the relevance of features by weights that are learned during the construction of the clustering model. In addition to the main task of feature selection, we are interested in redundancy elimination. To tackle this problem, we propose a novel algorithm that combines mutual information with a maximum spanning tree-based algorithm. We construct this tree from the relevant features in order to optimize the final number of selected features. Finally, all methods proposed in this thesis are analyzed and their complexities are studied. Furthermore, they are validated on high-dimensional data against other representative methods in the literature.
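
The combination of mutual information with a maximum spanning tree can be sketched as follows. The abstract does not give the details of the thesis's algorithm, so this is only one plausible reading: pairwise mutual information between discrete features serves as edge weight, and Kruskal's algorithm keeps the heaviest edges that do not form a cycle. The function names and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((count_x[a] / n) * (count_y[b] / n)))
        for (a, b), c in count_xy.items()
    )


def maximum_spanning_tree(features):
    """Kruskal's algorithm on the complete graph weighted by mutual information.

    `features` maps a feature name to its list of discrete values.
    Returns a list of (feature_u, feature_v, weight) tree edges.
    """
    names = list(features)
    edges = sorted(
        (
            (mutual_information(features[u], features[v]), u, v)
            for i, u in enumerate(names)
            for v in names[i + 1:]
        ),
        reverse=True,  # heaviest (most redundant) pairs first
    )
    parent = {name: name for name in names}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for weight, u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding this edge creates no cycle
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


toy_features = {
    "f1": [0, 0, 1, 1],
    "f2": [0, 0, 1, 1],  # identical to f1, so fully redundant with it
    "f3": [0, 1, 0, 1],  # independent of f1 and f2
}
tree_edges = maximum_spanning_tree(toy_features)
```

Heavy tree edges connect features that carry largely the same information, which makes them natural candidates when deciding which relevant features can be dropped as redundant.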

    Broadcasting Protocol for Effective Data Dissemination in Vehicular Ad Hoc Networks

    VANET topology is highly dynamic because the nodes move frequently. Using beacon information, a connected dominating set is formed, and the nodes are further enhanced with a neighbor elimination scheme. Acknowledgements resolve the intersection issues. A modified Broadcast Conquest and Delay De-synchronization mechanism addresses the broadcast storm problem. Although data dissemination is possible in all directions, the performance of data dissemination in the opposite direction is investigated and compared against existing protocols.

    Derivation of continuous zoomable road network maps through utilization of Space-Scale-Cube

    Performing cartographic generalization of geographic information automatically is of high interest in the field of cartography, both in academia and in industry. Many research efforts have been made to implement different automatic generalization approaches. Once the research question on automatic generalization can be answered, another interesting question opens up: "Is it possible to retrieve and visualize geographic information at any arbitrary scale?" This is the question in the field of vario-scale geoinformation. Potential research works should answer this question with solutions that provide a valid and efficient representation of geoinformation at any on-demand scale. Better solutions will also provide smooth transitions between these on-demand arbitrary scales. Space-Scale-Cube (Meijers and Van Oosterom 2011) is a reactive tree (Van Oosterom 1991) data structure which shows positive potential for achieving smooth automatic vario-scale generalization of area features. The topic of this research work is the investigation of the adaptation of this approach to an interesting class of geographic information: road network datasets. First, the theoretical background is introduced and discussed; afterwards, the implementation of the adaptation is described. 
    This research work includes the development of a hierarchical data structure based on road network datasets and the potential use of this data structure in vario-scale geoinformation retrieval and visualization.
    Contents:
    Front matter: Declaration of Authorship; Abstract; Acknowledgements; List of Figures; Abbreviations
    1 Introduction: 1.1 Problem Definition (1.1.1 Research Questions; 1.1.2 Objectives); 1.2 Proposed Solution; 1.3 Structure of the Thesis; 1.4 Notes on Terminology
    2 Cartographic Generalization: 2.1 Cartographic Generalization: Definitions and Classifications; 2.2 Generalization Operators; 2.3 Efforts on Vario-Scale Visualization of Geoinformation; 2.4 Efforts on Generalization of Road Networks and Similar Other Networks (2.4.1 Geometric Generalization of Networks; 2.4.2 Model Generalization of Networks); 2.5 Clarification of Interest
    3 Theory of Road Network SSC: 3.1 Background of an SSC (3.1.1 tGAP; 3.1.2 Smoothing tGAP); 3.2 Road Network as a 'Network' (3.2.1 Short Background on Graph Theory); 3.3 Formation of Road Network SSC (3.3.1 Geometry; 3.3.2 Network Topology; 3.3.3 Building up tGAP on The Road Network; 3.3.4 Smoothing of Road Network SSC: 3.3.4.1 Smoothing Elimination, 3.3.4.2 Smoothing Simplification); 3.4 Reading from a road network SSC (3.4.1 Discussion on Scale; 3.4.2 Iterating Over The Forest; 3.4.3 Planar Slices; 3.4.4 Non-Planar Slices)
    4 Implementation of Road Network SSC: 4.1 General Information Regarding The Implementation (4.1.1 Programming Language; 4.1.2 RDBMS; 4.1.3 Geometry Library; 4.1.4 Graph Library); 4.2 Data Structure (4.2.1 Node; 4.2.2 Edge; 4.2.3 Edge-Node-Relation); 4.3 Software Architecture (4.3.1 More Detail on Building The SSC: 4.3.1.1 Initial Data Processing, 4.3.1.2 Network Processing; 4.3.2 More Detail on Querying The SSC: 4.3.2.1 Database Query, 4.3.2.2 Building Geometry, 4.3.2.3 Interface and Visualization); 4.4 Results
    5 Conclusions and Outlook
    Bibliography

    A Multiscale Pyramid Transform for Graph Signals

    Multiscale transforms designed to process analog and discrete-time signals and images cannot be directly applied to analyze high-dimensional data residing on the vertices of a weighted graph, as they do not capture the intrinsic geometric structure of the underlying graph data domain. In this paper, we adapt the Laplacian pyramid transform for signals on Euclidean domains so that it can be used to analyze high-dimensional data residing on the vertices of a weighted graph. Our approach is to study existing methods and develop new methods for the four fundamental operations of graph downsampling, graph reduction, and filtering and interpolation of signals on graphs. Equipped with appropriate notions of these operations, we leverage the basic multiscale constructs and intuitions from classical signal processing to generate a transform that yields both a multiresolution of graphs and an associated multiresolution of a graph signal on the underlying sequence of graphs. Comment: 16 pages, 13 figures
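
To make the Laplacian pyramid idea concrete, the sketch below runs one analysis and synthesis level on a small weighted graph: filter the signal, keep a subset of vertices, interpolate back, and store the prediction error so the original can be rebuilt exactly. The operators are deliberately simplified placeholders, not the ones developed in the paper: the filter is plain neighbor averaging, the kept vertex set is supplied by hand, interpolation averages over kept neighbors, and graph reduction is omitted. All function names are assumptions.

```python
def lowpass(adj, signal):
    """Simple low-pass filter: y = 0.5 * x + 0.5 * (weighted neighbor average)."""
    filtered = {}
    for v, nbrs in adj.items():
        total = sum(nbrs.values())
        avg = sum(w * signal[u] for u, w in nbrs.items()) / total if total else signal[v]
        filtered[v] = 0.5 * signal[v] + 0.5 * avg
    return filtered


def interpolate(adj, coarse):
    """Extend coarse values to all vertices by averaging over kept neighbors."""
    full = {}
    for v, nbrs in adj.items():
        if v in coarse:
            full[v] = coarse[v]
        else:
            kept = [(u, w) for u, w in nbrs.items() if u in coarse]
            total = sum(w for _, w in kept)
            full[v] = sum(w * coarse[u] for u, w in kept) / total if total else 0.0
    return full


def analyze(adj, signal, kept_vertices):
    """One pyramid level: coarse approximation plus prediction error."""
    smooth = lowpass(adj, signal)
    coarse = {v: smooth[v] for v in kept_vertices}
    prediction = interpolate(adj, coarse)
    residual = {v: signal[v] - prediction[v] for v in adj}
    return coarse, residual


def synthesize(adj, coarse, residual):
    """Invert `analyze`: interpolate the coarse signal and add the residual back."""
    prediction = interpolate(adj, coarse)
    return {v: prediction[v] + residual[v] for v in adj}


# Path graph 0 - 1 - 2 - 3 with unit weights; keep every other vertex.
path = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0}}
x = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
coarse, residual = analyze(path, x, kept_vertices={0, 2})
rebuilt = synthesize(path, coarse, residual)
```

Because the residual is defined as the difference between the signal and its interpolated prediction, reconstruction is exact whatever filter and interpolation are chosen; better operators only make the residual smaller.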