
    A systematic review of data quality issues in knowledge discovery tasks

    The volume of data keeps growing because organizations continuously capture data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of the data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.

    How to Address the Data Quality Issues in Regression Models: A Guided Process for Data Cleaning

    Today, data availability has gone from scarce to superabundant. Technologies such as IoT, trends in social media, and the capabilities of smartphones are producing and digitizing large amounts of data that were previously unavailable. This massive increase in data creates opportunities for new business models, but it also demands new techniques and methods for data quality in knowledge discovery, especially when the data come from different sources (e.g., sensors, social networks, cameras). The data quality process draws conclusions about the information a dataset contains, and it is increasingly carried out with the aid of data cleaning approaches. Guaranteeing high data quality is therefore considered the primary goal of the data scientist. In this paper, we propose a process for data cleaning in regression models (DC-RM). The proposed process is evaluated on real datasets from the UCI Repository of Machine Learning Databases. To assess the data cleaning process, the datasets cleaned by DC-RM were used to train the same regression models proposed by the authors of the UCI datasets. The results achieved by the models trained on the datasets produced by DC-RM are better than or equal to those reported by the datasets' authors. This work has also been supported by the Spanish Ministry of Economy, Industry and Competitiveness (Projects TRA2015-63708-R and TRA2016-78886-C3-1-R).

    A new predictive neural architecture for solving temperature inverse problems in microwave-assisted drying processes

    In this paper, a novel learning architecture based on neural networks is used for temperature inverse modeling in microwave-assisted drying processes. The proposed design combines the accuracy of radial basis functions (RBF) and the algebraic capabilities of matrix polynomial structures by using a two-level structure. This architecture is trained with temperature curves, Tc(t), previously generated by a validated drying model. The interconnection of the learning-based networks has made it possible to find the optimal electric field (E) values that provide the Tc(t) curve that best fits a desired temperature target in a specific time slot.

    New details about the frequency behavior of irradiated bipolar operational amplifiers

    The frequency behavior of a bipolar operational amplifier (op amp) is always expected to worsen when the device is irradiated. In other words, parameters such as the slew rate and the gain-bandwidth product are expected to decrease after either neutron or gamma tests. However, neutron and TID tests performed on a large variety of bipolar op amps have shown that the evolution of the frequency behavior is not as simple as is usually believed. In fact, there is evidence of an increasing influence of the power supply values on these parameters, which can be extremely important in some devices. The relationship among different frequency parameters has also been investigated and, finally, an interesting and scarcely reported phenomenon is described: the appearance of spontaneous oscillations in fed-back op amps, undoubtedly related to the modification of the gain and phase margins of the device.

    Sample selection method for arbitrary fading emulation using mode-stirred chambers

    Mode-stirred chambers (MSC) consist of one or more resonant cavities coupled in some way to allow the measurement of different antenna parameters such as antenna efficiency, correlation, diversity gain, or MIMO capacity, among others. In a single-cavity mode-stirred chamber, also known as a reverberation chamber (RC), the environment is isotropic and the amplitude of the signal is Rayleigh distributed. Real environments, however, rarely follow an isotropic Rayleigh-fading scenario. Previous results have shown that Rician-fading emulation can be obtained via hardware modification using an RC. These methods lack accurate emulation performance and are strongly dependent upon chamber size and antenna configuration. Given the innate complexity of an MSC with more than one cavity, the coupling structure generates sample sets that are complex enough to contain different clusters with diverse fading characteristics. This paper presents a novel method to accurately emulate a more realistic Rician-fading distribution from a Rayleigh-fading distribution by selecting parts of the sample set that form different statistical ensembles, using a complex two-cavity multi-iris-coupled MSC. Sample selection is performed using a genetic algorithm. Results demonstrate the potential of MSCs for versatile MIMO fading emulation and OTA testing. The method is patent protected by EMITE Ing. This work was supported in part by the Spanish National R&D Programme through TEC2008-05811 and by Fundación Séneca, the R&D coordinating agency for the Region of Murcia (Spain), under the 11783/PI/09 project.

    A case-based reasoning system for recommendation of data cleaning algorithms in classification and regression tasks

    Recent advances in Information Technologies (social networks, mobile applications, Internet of Things, etc.) generate a deluge of digital data, but converting these data into useful information for business decisions is a growing challenge. Exploiting this massive amount of data through a knowledge discovery (KD) process involves identifying valid, novel, potentially useful, and understandable patterns in a huge volume of data. However, preparing the data is a non-trivial refinement task that requires technical expertise in data cleaning methods and algorithms. Consequently, choosing a suitable data analysis technique is a headache for inexpert users. To address these problems, we propose a case-based reasoning (CBR) system that recommends data cleaning algorithms for classification and regression tasks. In our approach, we represent the problem space by the meta-features of the dataset, its attributes, and the target variable. The solution space contains the data cleaning algorithms used for each dataset. We represent the cases through a Data Cleaning Ontology. The case retrieval mechanism is composed of a filter phase and a similarity phase. In the first phase, we define two filter approaches based on clustering and quartile analysis. These filters retrieve a reduced number of relevant cases. The second phase scores the similarity between a new case and the cases retrieved by the filter approaches, and ranks the retrieved cases accordingly. The proposed retrieval mechanism was evaluated by a panel of judges, who scored the similarity between a query case and all cases in the case base (ground truth). The retrieval mechanism reaches an average precision against the judges' ranking of 94.5% for the top 3, 84.55% for the top 7, and 78.35% for the top 10. The authors are grateful to the research groups Control Learning Systems Optimization Group (CAOS) of the Carlos III University of Madrid and Telematics Engineering Group (GIT) of the University of Cauca for the technical support. In addition, the authors are grateful to COLCIENCIAS for the PhD scholarship granted to David Camilo Corrales. This work has also been supported by: Project Alternativas Innovadoras de Agricultura Inteligente para sistemas productivos agrícolas del departamento del Cauca soportado en entornos de IoT, financed by Convocatoria 04C-2018 Banco de Proyectos Conjuntos UEES-Sostenibilidad of Project Red de formación de talento humano para la innovación social y productiva en el Departamento del Cauca InnovAcción Cauca, ID-3848; and the Spanish Ministry of Economy, Industry and Competitiveness (Projects TRA2015-63708-R and TRA2016-78886-C3-1-R).
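
    The two-phase retrieval described in this abstract (filter, then similarity ranking) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the case layout, the meta-feature names `n_instances` and `missing_ratio`, the choice of a quartile filter on a single meta-feature, and the distance-based similarity score. The authors' actual filters, ontology, and similarity measure may differ, and a real system would normalize features before computing distances.

```python
import math
import statistics


def quartile_bin(value, cut_points):
    """Return the index (0 to 3) of the quartile bin that value falls into."""
    return sum(value > cut for cut in cut_points)


def retrieve(case_base, query, filter_key, top_k=3):
    """Hypothetical two-phase retrieval: quartile filter, then similarity ranking."""
    # Phase 1 (filter): keep only the cases whose filter_key meta-feature
    # lies in the same quartile bin as the query's value.
    values = [case["features"][filter_key] for case in case_base]
    cut_points = statistics.quantiles(values, n=4)  # three quartile cut points
    target_bin = quartile_bin(query[filter_key], cut_points)
    candidates = [
        case for case in case_base
        if quartile_bin(case["features"][filter_key], cut_points) == target_bin
    ]

    # Phase 2 (similarity): score each candidate against the query and rank.
    # The score 1 / (1 + Euclidean distance) is an illustrative assumption.
    def similarity(case):
        distance = math.sqrt(
            sum((case["features"][key] - query[key]) ** 2 for key in query)
        )
        return 1.0 / (1.0 + distance)

    return sorted(candidates, key=similarity, reverse=True)[:top_k]


# Toy case base: eight cases that differ only in dataset size.
case_base = [
    {"features": {"n_instances": 100 * i, "missing_ratio": 0.1},
     "algorithm": f"alg{i}"}
    for i in range(1, 9)
]
query = {"n_instances": 520, "missing_ratio": 0.1}
recommended = [case["algorithm"] for case in retrieve(case_base, query, "n_instances")]
```

    In this toy run the filter keeps only the cases in the query's quartile of `n_instances`, so the ranking step compares the query against two candidates instead of all eight.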

    SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation

    The availability of real-time semantics greatly improves the core geometric functionality of SLAM systems, enabling numerous robotic and AR/VR applications. We present a new methodology for real-time semantic mapping from RGB-D sequences that combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping. When segmenting a new frame, we perform latent feature re-projection from previous frames based on differentiable rendering. Fusing re-projected feature maps from previous frames with current-frame features greatly improves image segmentation quality compared to a baseline that processes images independently. For 3D map processing, we propose a novel geometric quasi-planar over-segmentation method that groups 3D map elements likely to belong to the same semantic classes, relying on surface normals. We also describe a novel neural network design for lightweight semantic map post-processing. Our system achieves state-of-the-art semantic mapping quality among 2D-3D network-based systems and matches the performance of 3D convolutional networks on three real indoor datasets, while working in real time. Moreover, it shows better cross-sensor generalization abilities than 3D CNNs, enabling training and inference with different depth sensors. Code and data will be released on the project page: http://jingwenwang95.github.io/SeMLaPS. Comment: 8 pages, 7 figures, submitted to RA-L.

    From Theory to Practice: A Data Quality Framework for Classification Tasks

    Data preprocessing is an essential step in knowledge discovery projects. Experts affirm that preprocessing tasks take between 50% and 70% of the total time of the knowledge discovery process. In this sense, several authors consider data cleaning one of the most cumbersome and critical tasks. Failure to provide high data quality in the preprocessing stage will significantly reduce the accuracy of any data analytics project. In this paper, we propose DQF4CT, a framework to address data quality issues in classification tasks. Our approach is composed of: (i) a conceptual framework that guides the user on how to deal with data problems in classification tasks; and (ii) an ontology that represents data cleaning knowledge and suggests the proper data cleaning approaches. We present two case studies with real datasets: physical activity monitoring (PAM) and occupancy detection of an office room (OD). To evaluate our proposal, the datasets cleaned by DQF4CT were used to train the same algorithms used in classification tasks by the authors of PAM and OD. Additionally, we evaluated DQF4CT on datasets from the Repository of Machine Learning Databases of the University of California, Irvine (UCI). In 84% of the results, the models trained on the datasets cleaned by DQF4CT perform better than the models of the datasets' authors. This work has also been supported by: Project “Red de formación de talento humano para la innovación social y productiva en el Departamento del Cauca InnovAcción Cauca”, Convocatoria 03-2018 Publicación de artículos en revistas de alto impacto; Project “Alternativas Innovadoras de Agricultura Inteligente para sistemas productivos agrícolas del departamento del Cauca soportado en entornos de IoT - ID 4633”, financed by Convocatoria 04C–2018 “Banco de Proyectos Conjuntos UEES-Sostenibilidad” of Project “Red de formación de talento humano para la innovación social y productiva en el Departamento del Cauca InnovAcción Cauca”; and the Spanish Ministry of Economy, Industry and Competitiveness (Projects TRA2015-63708-R and TRA2016-78886-C3-1-R).