Meta-learning for dynamic tuning of active learning on stream classification
Supervised data stream learning depends on the incoming sample's true label to update the classifier's model. In real life, obtaining the ground truth for each instance is challenging, highly costly, and time-consuming. Active Learning bridges this gap by finding a reduced set of instances that supports the creation of a reliable stream classifier. However, identifying a small number of informative instances that enable suitable classifier updates and drift adaptation is difficult. To better adapt to concept drifts using a reduced number of samples, we propose online tuning of the Uncertainty Sampling threshold using a meta-learning approach. Our approach exploits statistical meta-features extracted from adaptive windows to meta-recommend a suitable threshold, addressing the trade-off between the number of labelling queries and high accuracy. Experiments showed that the proposed approach provides the best trade-off between accuracy and query reduction by dynamically tuning the uncertainty threshold with lightweight meta-features.
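As a rough illustration of the querying mechanism described above, the sketch below pairs threshold-based Uncertainty Sampling with an incremental scikit-learn classifier on a synthetic stream. The meta-recommender is reduced to a stub that maps a simple window statistic to a threshold; the stub, the window length, and the warm-up budget are assumptions for illustration, not the authors' implementation.

```python
# Threshold-based uncertainty sampling on a stream (minimal sketch).
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])

def recommend_threshold(window):
    # Stub for the meta-recommender: maps a window statistic
    # (mean uncertainty) to a querying threshold.
    return float(np.clip(np.mean(window), 0.55, 0.95)) if window else 0.8

window, queries = [], 0
for t in range(2000):
    x = rng.normal(size=(1, 5))
    y = int(x[0, 0] + 0.1 * rng.normal() > 0)      # hidden concept
    if t < 50:                                     # warm-up on true labels
        clf.partial_fit(x, [y], classes=classes)
        continue
    confidence = clf.predict_proba(x)[0].max()
    window = (window + [1.0 - confidence])[-100:]  # sliding uncertainty window
    if confidence < recommend_threshold(window):   # query the label only when unsure
        clf.partial_fit(x, [y])
        queries += 1
print(f"labels queried: {queries} / 1950")
```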
Selecting Optimal Trace Clustering Pipelines with Meta-learning
Trace clustering has been extensively used to discover aspects of the data in event logs. Process Mining techniques guide the identification of sub-logs by grouping traces with similar behaviors, producing more understandable models and improving conformance indicators. Nevertheless, little attention has been paid to the relationship among event log properties, the pipeline of encoding and clustering algorithms, and the quality of the obtained outcome. The present study contributes to the understanding of these relationships and provides an automatic selection of a proper combination of algorithms for clustering a given event log. We propose a Meta-Learning framework to recommend the most suitable trace clustering pipeline, encompassing the encoding method, the clustering algorithm, and its hyperparameters. Our experiments were conducted using a thousand event logs, four encoding techniques, and three clustering methods. Results indicate that our framework sheds light on the trace clustering problem and can assist users in choosing the best pipeline for their environment.
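A minimal sketch of the recommendation step, assuming the meta-knowledge has already been collected offline as a table of event-log meta-features labelled with the best-performing pipeline; the meta-feature columns and pipeline names below are illustrative placeholders, not the paper's actual choices.

```python
# Meta-model: maps event-log meta-features to a recommended pipeline (sketch).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
pipelines = ["onehot+kmeans", "word2vec+agglomerative", "position+dbscan"]

# Offline meta-dataset: rows = event logs, columns = meta-features
# (e.g. number of traces, number of activities, mean trace length).
X_meta = rng.uniform(size=(1000, 3))
y_meta = np.array([pipelines[int(3 * row[0]) % 3] for row in X_meta])  # toy labels

meta_model = RandomForestClassifier(n_estimators=200, random_state=0)
meta_model.fit(X_meta, y_meta)

new_log = [[0.4, 0.7, 0.2]]          # meta-features of an unseen event log
print(meta_model.predict(new_log))   # recommended clustering pipeline
```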
Time series segmentation based on stationarity analysis to improve new samples prediction
A wide range of applications based on sequential data, known as time series, have become increasingly popular in recent years, mainly those based on the Internet of Things (IoT). Several machine learning algorithms exploit the patterns extracted from sequential data to support multiple tasks. However, such data can suffer from unreliable readings that lead to low-accuracy models due to the low-quality training sets available. Detecting the change point between highly representative segments is an important ally for finding and treating biased subsequences. By constructing a framework based on the Augmented Dickey-Fuller (ADF) test for data stationarity, we developed two proposals to automatically segment subsequences in a time series. The former, called Change Detector segmentation, relies on change detection methods from data stream mining. The latter, called ADF-based segmentation, builds on a new change detector derived solely from the ADF test. Experiments on real-life IoT databases and benchmarks showed the improvement provided by our proposals for prediction tasks with traditional Autoregressive Integrated Moving Average (ARIMA) and Deep Learning (Long Short-Term Memory and Temporal Convolutional Network) methods. The Long Short-Term Memory predictive model reduced the relative prediction error from 1 to 0.67, compared with the time series without segmentation.
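A minimal sketch of the ADF-based idea, assuming the statsmodels adfuller test: grow a window and open a new segment once the window stops looking stationary. The window sizes, step, and 0.05 significance level are assumptions for illustration, not the paper's exact detector.

```python
# Stationarity-driven segmentation with the ADF test (minimal sketch).
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(2)
series = np.concatenate([rng.normal(0, 1, 300),              # stationary part
                         np.cumsum(rng.normal(0, 1, 300))])  # drifting part

def adf_segments(x, min_len=50, step=25, alpha=0.05):
    cuts, start = [], 0
    end = start + min_len
    while end <= len(x):
        p_value = adfuller(x[start:end], autolag="AIC")[1]
        if p_value > alpha:          # unit root not rejected: change point
            cuts.append(end)
            start, end = end, end + min_len
        else:
            end += step
    return cuts

print(adf_segments(series))          # indices where new segments begin
```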
Multi-target prediction of wheat flour quality parameters with near infrared spectroscopy
Near Infrared (NIR) spectroscopy is an analytical technology widely used for the non-destructive characterisation of organic samples, considering both qualitative and quantitative attributes. In the present study, the combination of Multi-target (MT) prediction approaches and Machine Learning algorithms was evaluated as an effective strategy to improve the prediction performance on NIR data from wheat flour samples. Three Multi-target approaches were tested: Multi-target Regressor Stacking (MTRS), Ensemble of Regressor Chains (ERC), and Deep Structure for Tracking Asynchronous Regressor Stack (DSTARS). Each technique was tested with different regression methods (Support Vector Machine (SVM), Random Forest (RF), and Linear Regression (LR)) on a dataset composed of NIR spectra of bread wheat flours, for the prediction of quality-related parameters. By combining all MT techniques and predictors, we obtained an improvement of up to 7% in predictive performance compared with the corresponding Single-target (ST) approaches. The results support the potential advantage of MT techniques over ST techniques for analysing NIR spectra.
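As a sketch of one of the named approaches, the snippet below approximates an Ensemble of Regressor Chains with scikit-learn's RegressorChain, averaging chains built over random target orders; synthetic regression data stands in for the NIR spectra and quality parameters, and the chain count and SVR settings are assumptions.

```python
# Ensemble of Regressor Chains (ERC) approximated with scikit-learn (sketch).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import RegressorChain
from sklearn.svm import SVR

# Synthetic stand-in: 50 "wavelengths" predicting 4 quality parameters.
X, Y = make_regression(n_samples=300, n_features=50, n_targets=4,
                       noise=5.0, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

chains = [RegressorChain(SVR(C=10.0), order="random", random_state=s)
          .fit(X_tr, Y_tr) for s in range(10)]
Y_hat = np.mean([c.predict(X_te) for c in chains], axis=0)  # ensemble average
print("RMSE per target:",
      np.sqrt(((Y_hat - Y_te) ** 2).mean(axis=0)).round(2))
```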
Classification of fermented cocoa beans (cut test) using computer vision
Fermentation of cocoa beans is a critical step in chocolate manufacturing, since fermentation influences the development of flavour, affecting components such as free amino acids, peptides, and sugars. The degree of fermentation is determined by visual inspection of changes in the internal colour and texture of the beans, through the cut-test. Although considered the standard for evaluating fermentation in cocoa beans, this method is time-consuming and relies on specialized personnel. Therefore, this study aims to classify fermented cocoa beans using computer vision as a fast and accurate method. Image acquisition and analysis provide hand-crafted features computed from the beans, which were used as predictors in random decision forests to classify the samples. A total of 1800 beans were classified into four grades of fermentation. Using all image features, an accuracy of 0.93 was obtained in the validation of the unbalanced dataset, with a precision of 0.85 and a recall of 0.81. Although the unbalanced dataset represents the actual variation of fermentation, the method was also tested on a balanced dataset, to investigate the influence of a smaller number of samples per class, obtaining 0.92, 0.92, and 0.90 for accuracy, precision, and recall, respectively. The technique can evolve into an industrial application with a proper integration framework, replacing the traditional method for classifying fermented cocoa beans.
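A minimal sketch of this kind of pipeline: simple per-channel colour statistics as hand-crafted features feeding a random forest. The feature set is deliberately simplified and the images and labels are synthetic stand-ins, so this is an illustration of the approach, not the paper's feature extraction.

```python
# Hand-crafted colour features + random forest for bean grading (sketch).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

def colour_features(img):
    # img: H x W x 3 array; mean and std per colour channel.
    return np.concatenate([img.mean(axis=(0, 1)), img.std(axis=(0, 1))])

# Synthetic stand-in for segmented bean images in 4 fermentation grades.
images = rng.uniform(0, 255, size=(400, 32, 32, 3))
labels = rng.integers(0, 4, size=400)
X = np.array([colour_features(img) for img in images])

rf = RandomForestClassifier(n_estimators=300, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, labels, cv=5).mean().round(2))
```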
Benchmarking Change Detector Algorithms from Different Concept Drift Perspectives
The stream mining paradigm has become increasingly popular due to the vast number of algorithms and methodologies it provides to address the current challenges of Internet of Things (IoT) and modern machine learning systems. Change detection algorithms, which focus on identifying drifts in the data distribution during the operation of a machine learning solution, are a crucial aspect of this paradigm. However, selecting the best change detection method for different types of concept drift can be challenging. This work provides a benchmark of four drift detection algorithms (EDDM, DDM, HDDMW, and HDDMA) for abrupt, gradual, and incremental drift types. To shed light on the capacity and possible trade-offs involved in selecting a concept drift algorithm, we compare their detection capability, detection time, and detection delay. The experiments were carried out using synthetic datasets in which attributes such as stream size, number of drifts, and drift duration can be controlled and manipulated with our synthetic stream generator. Our results show that HDDMW provides the best trade-off among all performance indicators, demonstrating superior consistency in detecting abrupt drifts, but it has suboptimal time consumption and a limited ability to detect incremental drifts. Nevertheless, it outperforms the other algorithms in detection delay for both abrupt and gradual drifts, with efficient detection and detection-time performance.
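A minimal sketch of such a benchmark, assuming scikit-multiflow's drift detection API (add_element/detected_change) and a toy binary error stream with one abrupt drift; the paper's own stream generator and drift types are not reproduced here.

```python
# Comparing drift detectors on a synthetic error stream (minimal sketch).
import numpy as np
from skmultiflow.drift_detection import DDM, EDDM, HDDM_A, HDDM_W

rng = np.random.default_rng(4)
# Classifier error indicators: error rate jumps from 10% to 40% at t = 1000.
stream = np.concatenate([rng.binomial(1, 0.1, 1000),
                         rng.binomial(1, 0.4, 1000)])

for name, detector in [("DDM", DDM()), ("EDDM", EDDM()),
                       ("HDDM_A", HDDM_A()), ("HDDM_W", HDDM_W())]:
    alarms = []
    for t, err in enumerate(stream):
        detector.add_element(err)      # feed one error indicator
        if detector.detected_change():
            alarms.append(t)           # detection time, in samples
    print(f"{name}: alarms at {alarms}")
```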
Language-independent fake news detection: English, Portuguese, and Spanish mutual features
Online Social Media (OSM) have been substantially transforming the process of spreading news, improving its speed and reducing barriers toward reaching out to a broad audience. However, OSM are very limited in providing mechanisms to check the credibility of news propagated through their structure. The majority of studies on automatic fake news detection are restricted to English documents, with few works evaluating other languages and none comparing language-independent characteristics. Moreover, the spreading of deceptive news tends to be a worldwide problem; therefore, this work evaluates textual features that are not tied to a specific language when describing textual data for news detection. Corpora of news written in American English, Brazilian Portuguese, and Spanish were explored to study complexity, stylometric, and psychological text features. The extracted features support the detection of fake, legitimate, and satirical news. We compared four machine learning algorithms (k-Nearest Neighbors (k-NN), Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGB)) to induce the detection model. Results show that our proposed language-independent features are successful in describing fake, satirical, and legitimate news across the three languages, with an average detection accuracy of 85.3% with RF.
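A minimal sketch of what "language-independent" means in practice: statistics such as sentence length, word length, and type-token ratio need no language-specific lexicon. The four features and the two toy documents below are illustrative assumptions, not the paper's feature set or corpora.

```python
# Language-independent text features + random forest (minimal sketch).
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def text_features(text):
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [
        len(words) / max(len(sentences), 1),      # average sentence length
        float(np.mean([len(w) for w in words])),  # average word length
        len(set(words)) / max(len(words), 1),     # type-token ratio (complexity)
        text.count(",") / max(len(words), 1),     # punctuation rate (stylometry)
    ]

docs = ["Breaking!!! You won't believe this shocking miracle cure...",
        "The ministry of health reported 120 new cases on Tuesday."]
labels = [1, 0]                                   # 1 = fake, 0 = legitimate
X = np.array([text_features(d) for d in docs])
clf = RandomForestClassifier(random_state=0).fit(X, labels)
print(clf.predict(X))
```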
How people interact with a chatbot against disinformation and fake news in COVID-19 in Brazil: The CoronaAI case
Background: The search for valid information was one of the main challenges encountered during the COVID-19 pandemic, which resulted in the development of several online alternatives. Objectives: To describe the development of a computational solution to interact with users of different levels of digital literacy on topics related to COVID-19, and to map the correlations between user behavior and the events and news that occurred throughout the pandemic. Method: CoronaAI, a chatbot based on Google's Dialogflow technology, was developed at a public university in Brazil and made available on WhatsApp. The dataset of users' interactions with the chatbot comprises approximately 7,000 hits recorded throughout eleven months of CoronaAI usage. Results: CoronaAI was widely accessed by users in search of valuable and updated information on COVID-19, including checking the veracity of possible fake news about the spread of cases, deaths, symptoms, tests, and protocols, among others. The mapping of users' behavior revealed that as the number of cases and deaths increased and COVID-19 drew closer, users showed a greater need for information applicable to self-care rather than for statistical data. The results also showed that the constant updating of this technology may contribute to public health at the population level, by enhancing general information on the pandemic, and at the individual level, by clarifying specific doubts about COVID-19. Conclusion: Our findings reinforce the potential usefulness of chatbot technology for resolving a wide spectrum of citizens' doubts about COVID-19, acting as a cost-effective tool against the parallel pandemic of misinformation and fake news.
Football player dominant region determined by a novel model based on instantaneous kinematics variables
Dominant regions are defined as regions of the pitch that a player can reach before any other, and they are commonly determined without considering the free-spaces on the pitch. We present an approach to analysing football players' dominant regions based on movement models created from players' positions, displacement, velocity, and acceleration vectors. 109 Brazilian male professional football players were analysed during official matches, comprising over 15 million positional data points obtained by a video-based tracking system. Movement models were created from players' instantaneous vectorial kinematic variables; probability models and dominant regions were then determined. The accuracy of the proposed model in determining dominant regions was tested for different time-lag windows. We calculated the areas of dominant regions, free-spaces, and Voronoi regions. Mean correct predictions of the dominant region were 96.56%, 88.64%, and 72.31% for one, two, and three seconds, respectively. Dominant region areas were smaller than those computed by Voronoi diagrams, with median values of 73 and 171 m², respectively. A median value of 5537 m² was found for free-space regions, representing a large part of the pitch. The proposed movement model proved to be more realistic, representing the match dynamics, and can be a useful method to evaluate players' tactical behaviours during matches.
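To make the contrast with Voronoi regions concrete, here is a toy sketch, not the paper's model: each pitch cell is assigned to the player with the shortest estimated arrival time computed from position and velocity, while assigning cells by distance alone recovers the Voronoi partition. The arrival-time formula, top speed, and grid resolution are all assumptions.

```python
# Movement-model dominant regions vs. Voronoi assignment (toy sketch).
import numpy as np

rng = np.random.default_rng(5)
pos = rng.uniform([0, 0], [105, 68], size=(10, 2))   # player positions (m)
vel = rng.normal(0, 3, size=(10, 2))                 # velocity vectors (m/s)
V_MAX = 8.0                                          # assumed top speed (m/s)

xs, ys = np.meshgrid(np.linspace(0, 105, 106), np.linspace(0, 68, 69))
cells = np.stack([xs.ravel(), ys.ravel()], axis=1)   # 1 m grid over the pitch

def arrival_time(p, v, targets):
    d = targets - p
    dist = np.linalg.norm(d, axis=1)
    # Crude kinematic model: velocity along the target direction helps or hurts.
    v_along = (d @ v) / np.maximum(dist, 1e-9)
    return dist / np.clip(V_MAX + 0.5 * v_along, 1.0, None)

times = np.stack([arrival_time(p, v, cells) for p, v in zip(pos, vel)])
dominant = times.argmin(axis=0)                      # movement-model owner per cell
voronoi = np.linalg.norm(cells[None] - pos[:, None], axis=2).argmin(axis=0)
print("fraction of cells assigned differently:",
      float((dominant != voronoi).mean()))
```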
Deep computer vision system for cocoa classification
Cocoa hybridisation generates new varieties that are resistant to several plant diseases but have individual chemical characteristics that affect chocolate production. Image analysis is a useful method for visual discrimination of cocoa beans, while deep learning (DL) has emerged as the de facto technique for image processing. However, these algorithms require a large amount of data and careful tuning of hyperparameters. Since it is necessary to acquire a large number of images to encompass the wide range of agricultural products, in this paper we compare a Deep Computer Vision System (DCVS) and a traditional Computer Vision System (CVS) for classifying cocoa beans into different varieties. For the DCVS, we used ResNet18 and ResNet50 as backbones, while for the CVS we experimented with traditional machine learning algorithms: Support Vector Machine (SVM) and Random Forest (RF). All the algorithms were selected since they provide good classification performance and have potential applications in food classification. A dataset with 1,239 samples was used to evaluate both systems. The best accuracy was 96.82% for the DCVS (ResNet18), compared to 85.71% obtained by the CVS using SVM. The essential hand-crafted features were reported and discussed regarding their influence on cocoa bean classification. Class Activation Maps were applied to the DCVS's predictions, providing a meaningful visualisation of the regions of the images most important to the model.
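A minimal PyTorch sketch of the DCVS side: a ResNet18 backbone with its classification head replaced for the cocoa variety classes and one dummy training step. The class count, optimiser settings, and random batch are placeholders; weights=None keeps the sketch self-contained, whereas the pretrained backbone would use torchvision's ResNet18_Weights.DEFAULT.

```python
# ResNet18 backbone adapted for cocoa variety classification (sketch).
import torch
import torch.nn as nn
from torchvision import models

NUM_VARIETIES = 4                     # placeholder class count
model = models.resnet18(weights=None) # pretrained: weights=ResNet18_Weights.DEFAULT
model.fc = nn.Linear(model.fc.in_features, NUM_VARIETIES)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One dummy training step on a random batch stands in for the real data loader.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_VARIETIES, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"step loss: {loss.item():.3f}")
```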