105 research outputs found
Core clustering as a tool for tackling noise in cluster labels
Real-world data sets often contain mislabelled entities. This can be particularly problematic if the data set is used by a supervised classification algorithm during its learning phase, as the accuracy of this algorithm, when applied to unlabelled data, is then likely to suffer considerably. In this paper we introduce a clustering-based method capable of reducing the number of mislabelled entities in data sets. Our method can be summarised as follows: (i) cluster the data set; (ii) select the entities that have the most potential to be assigned to correct clusters; (iii) use the entities of the previous step to define the core clusters and map them to the labels using a confusion matrix; (iv) use the core clusters and our cluster membership criterion to correct the labels of the remaining entities. We perform numerous experiments to validate our method empirically, using k-nearest neighbour classifiers as a benchmark, on both synthetic and real-world data sets with different proportions of mislabelled entities. Our experiments demonstrate that the proposed method produces promising results. Thus, it could be used as a pre-processing data correction step of a supervised machine learning algorithm.
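The four steps above can be sketched in a few lines. The snippet below is an illustrative stand-in only: it uses k-means for step (i) and distance-to-centroid as a simple proxy for the paper's cluster-membership criterion, with a majority vote over core entities playing the role of the confusion-matrix mapping.

```python
import numpy as np
from sklearn.cluster import KMeans

def correct_labels(X, y, n_clusters, core_fraction=0.5):
    """Sketch of clustering-based label correction:
    (i) cluster the data, (ii) keep the entities closest to their centroid
    as 'core', (iii) map each cluster to the majority label among its core
    entities, (iv) relabel every entity with its cluster's mapped label.
    Distance-to-centroid is an illustrative stand-in for the paper's
    cluster-membership criterion."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    y_corrected = y.copy()
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        # core = the fraction of entities nearest the centroid
        n_core = max(1, int(core_fraction * len(idx)))
        core = idx[np.argsort(dists[idx])[:n_core]]
        majority = np.bincount(y[core]).argmax()  # cluster-to-label mapping
        y_corrected[idx] = majority
    return y_corrected
```

On well-separated data with a minority of flipped labels, the core majority recovers the true cluster label and the flipped entities are corrected.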
Analysis of Students' Emotions in Twitter Data Using Naïve Bayes and Non-Linear Support Vector Machine Approaches
Students' informal discussions on social media (e.g., Twitter, Facebook) shed light on their educational experiences: opinions, feelings, and concerns about the learning process. Data from such environments can provide valuable knowledge about students' learning. Examining such data, however, can be challenging. The complexity of students' experiences reflected in social media content requires human analysis, yet the growing scale of data demands automatic data analysis techniques. This work focuses on engineering students' posts on Twitter in order to understand the issues and problems students encounter in their learning experiences. Analysis is conducted on samples taken from tweets related to engineering students' college life. Problems emerging from the tweets, such as heavy study load, lack of social engagement, and sleep deprivation, are used as labels. To classify tweets reflecting students' problems, multi-label classification algorithms are implemented. Non-linear Support Vector Machine, Naïve Bayes, and linear Support Vector Machine methods are used as multi-label classifiers, which are implemented and compared in terms of accuracy. The non-linear SVM showed higher accuracy than the Naïve Bayes and linear Support Vector Machine classifiers. The algorithms are used to train a detector of student problems from tweets.
DOI: 10.17762/ijritcc2321-8169.150515
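The multi-label setup described above can be sketched with scikit-learn's one-vs-rest wrapper, which turns Naïve Bayes, a linear SVM, and a non-linear (RBF) SVM into multi-label classifiers over the three problem labels. The tiny corpus below is invented purely for illustration; it is not the paper's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC

# Invented example tweets, one per learning problem (plus an extra).
tweets = [
    "so many assignments due this week, drowning in work",
    "no time to hang out with friends anymore",
    "pulled another all nighter, running on two hours of sleep",
    "exams plus labs plus homework, the load is brutal",
]
# Multi-label indicator matrix; columns: [study_load, social, sleep]
Y = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

X = TfidfVectorizer().fit_transform(tweets)
models = {
    "naive_bayes": OneVsRestClassifier(MultinomialNB()),
    "linear_svm": OneVsRestClassifier(LinearSVC()),
    "rbf_svm": OneVsRestClassifier(SVC(kernel="rbf")),
}
for name, model in models.items():
    model.fit(X, Y)  # one binary classifier per label
```

In practice each fitted model's accuracy would be compared on a held-out set of labelled tweets, which is the comparison the abstract reports.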
Robust Registration of Dynamic Facial Sequences.
Accurate face registration is a key step for several image analysis applications. However, existing registration methods are prone to temporal drift errors or jitter among consecutive frames. In this paper, we propose an iterative rigid registration framework that estimates the misalignment with trained regressors. The input of the regressors is a robust motion representation that encodes the motion between a misaligned frame and the reference frame(s), and enables reliable performance under non-uniform illumination variations. Drift errors are reduced when the motion representation is computed from multiple reference frames. Furthermore, we use the L2 norm of the representation as a cue for performing coarse-to-fine registration efficiently. Importantly, the framework can identify registration failures and correct them. Experiments show that the proposed approach achieves significantly higher registration accuracy than state-of-the-art techniques on challenging sequences. The research work of Evangelos Sariyanidi and Hatice Gunes has been partially supported by the EPSRC under its IDEAS Factory Sandpits call on Digital Personhood (Grant Ref.: EP/L00416X/1).
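The core idea of regressor-based iterative alignment can be illustrated on a toy 1-D problem. Everything below is a deliberate simplification: signals replace frames, a raw difference from the reference stands in for the paper's motion representation, and an ordinary least-squares fit stands in for the trained regressors. The loop shows the iterative estimate-and-undo structure.

```python
import numpy as np

def shift(signal, t):
    """Circularly shift a 1-D signal by integer t (toy misalignment)."""
    return np.roll(signal, t)

rng = np.random.default_rng(0)
reference = rng.normal(size=64)

# Training: known shifts paired with their difference-from-reference features.
shifts = np.arange(-5, 6)
features = np.stack([shift(reference, t) - reference for t in shifts])
# Least-squares 'regressor' mapping a feature vector to an estimated shift.
w, *_ = np.linalg.lstsq(features, shifts.astype(float), rcond=None)

def register(frame, max_iters=10):
    """Iteratively estimate the misalignment and undo it."""
    total = 0
    for _ in range(max_iters):
        est = int(round(w @ (frame - reference)))
        if est == 0:  # converged: no residual misalignment detected
            break
        frame = shift(frame, -est)
        total += est
    return total
```

The real framework differs substantially (2-D rigid motion, illumination-robust representations, multiple reference frames, failure detection), but the estimate-correct-repeat loop is the same shape.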
W2WNet: A two-module probabilistic Convolutional Neural Network with embedded data cleansing functionality
Ideally, Convolutional Neural Networks (CNNs) should be trained with high-quality images with minimum noise and correct ground-truth labels. Nonetheless, in many real-world scenarios, such high quality is very hard to obtain, and datasets may be affected by all sorts of image degradation and mislabelling issues. This negatively impacts the performance of standard CNNs, both during the training and the inference phase. To address this issue we propose Wise2WipedNet (W2WNet), a new two-module Convolutional Neural Network, where a Wise module exploits Bayesian inference to identify and discard spurious images during training, and a Wiped module takes care of the final classification while broadcasting information on the prediction confidence at inference time. The goodness of our solution is demonstrated on a number of public benchmarks addressing different image classification tasks, as well as on a real-world case study on histological image analysis. Overall, our experiments demonstrate that W2WNet is able to identify image degradation and mislabelling issues both at training and at inference time, with a positive impact on the final classification accuracy.
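One common way to operationalise the "Bayesian inference to discard spurious images" idea is to score each image by the entropy of its mean predictive distribution over several stochastic forward passes, and drop high-entropy images. The sketch below shows only that scoring-and-filtering step; the simulated softmax samples and the threshold are assumptions of this sketch, not the W2WNet architecture itself.

```python
import numpy as np

def predictive_entropy(prob_samples):
    """prob_samples: (T, n_classes) softmax outputs from T stochastic passes.
    Returns the entropy (in nats) of the mean predictive distribution."""
    p = prob_samples.mean(axis=0)
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def wise_filter(all_samples, threshold=0.5):
    """Keep the indices of images whose predictive entropy is below
    threshold; the rest are flagged as spurious and discarded."""
    return [i for i, s in enumerate(all_samples)
            if predictive_entropy(s) < threshold]
```

A confident image (sharply peaked predictions across passes) is kept, while an ambiguous one (near-uniform predictions) is flagged.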
Incremental algorithm for Decision Rule generation in data stream contexts
Nowadays, data science is gaining a lot of attention in many different sectors.
Specifically in industry, many applications might be considered. Using data science techniques in the decision-making process is one such application that can bring value to industry. Along with this, the growth of data availability and the appearance of continuous data flows in the form of data streams raise new challenges when dealing with changing data. This work presents a novel proposal, the Incremental Decision Rules Algorithm (IDRA), an algorithm that incrementally generates and modifies decision rules for data stream contexts, incorporating the changes that may appear over time. This method aims to propose new rule structures that improve the decision-making process by providing a descriptive and transparent knowledge base that can be integrated into a decision tool. This work describes the logic underneath IDRA, in all its versions, and proposes a variety of experiments to compare them with a classical method (CREA) and an adaptive method (VFDR). Real datasets, together with simulated scenarios with different error types and rates, are used to compare these algorithms. The study shows that IDRA, specifically the reactive version of IDRA (RIDRA), improves on the accuracy of VFDR and CREA in all the studied scenarios, both real and simulated, in exchange for more computation time.
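The general idea of maintaining decision rules incrementally over a stream can be illustrated minimally as follows. This is not the IDRA algorithm itself: each rule here is just a conjunction of discrete attribute values whose label statistics are updated as instances arrive, so that predictions can adapt over time.

```python
from collections import defaultdict

class IncrementalRules:
    """Toy incremental rule base for a labelled data stream (illustrative
    only; IDRA's actual rule structures and reactivity are richer)."""

    def __init__(self):
        # rule antecedent (tuple of attribute values) -> per-label counts
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, x, label):
        """Incorporate one labelled instance from the stream."""
        self.counts[tuple(x)][label] += 1

    def predict(self, x):
        """Majority label of the matching rule, or None if no rule fires."""
        stats = self.counts.get(tuple(x))
        if not stats:
            return None
        return max(stats, key=stats.get)
```

Because the rule base is just explicit antecedent-label statistics, it stays descriptive and transparent, which is the property the thesis emphasises for integration into a decision tool.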