21 research outputs found

    Time series segmentation based on stationarity analysis to improve new samples prediction

    Get PDF
    A wide range of applications based on sequential data, known as time series, have become increasingly popular in recent years, mainly those based on the Internet of Things (IoT). Several machine learning algorithms exploit the patterns extracted from sequential data to support multiple tasks. However, these data can suffer from unreliable readings that lead to low-accuracy models due to the low-quality training sets available. Detecting the change points between highly representative segments is an important ally for finding and treating biased subsequences. By constructing a framework based on the Augmented Dickey-Fuller (ADF) test for data stationarity, two proposals to automatically segment subsequences in a time series were developed. The former, called Change Detector segmentation, relies on change detection methods from data stream mining. The latter, called ADF-based segmentation, is built on a new change detector derived from the ADF test alone. Experiments over real-life IoT databases and benchmarks showed the improvement provided by our proposals for prediction tasks with traditional Autoregressive Integrated Moving Average (ARIMA) and Deep Learning (Long Short-Term Memory and Temporal Convolutional Network) methods. Results obtained by the Long Short-Term Memory predictive model reduced the relative prediction error from 1 to 0.67, compared with time series without segmentation.
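
    A minimal sketch of the ADF-based change detector is given below; the window size and significance level are illustrative, not the paper's actual parameters. The ADF test classifies each window as stationary or not, and a segment boundary is placed wherever that verdict flips.

        # Sketch of ADF-based segmentation; window and alpha are illustrative.
        import numpy as np
        from statsmodels.tsa.stattools import adfuller

        def adf_segments(series, window=50, alpha=0.05):
            """Split a series wherever the ADF stationarity verdict changes."""
            boundaries, prev_stationary = [0], None
            for start in range(0, len(series) - window + 1, window):
                p_value = adfuller(series[start:start + window])[1]  # index 1 = p-value
                stationary = p_value < alpha  # reject unit root => stationary
                if prev_stationary is not None and stationary != prev_stationary:
                    boundaries.append(start)  # verdict flipped: new segment starts
                prev_stationary = stationary
            boundaries.append(len(series))
            return [series[a:b] for a, b in zip(boundaries, boundaries[1:])]

        rng = np.random.default_rng(0)
        toy = np.concatenate([rng.normal(0, 1, 200),              # stationary noise
                              np.cumsum(rng.normal(0, 1, 200))])  # random walk
        print([len(s) for s in adf_segments(toy)])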

    Multi-target prediction of wheat flour quality parameters with near infrared spectroscopy

    Get PDF
    Near Infrared (NIR) spectroscopy is an analytical technology widely used for the non-destructive characterisation of organic samples, considering both qualitative and quantitative attributes. In the present study, the combination of Multi-target (MT) prediction approaches and Machine Learning algorithms has been evaluated as an effective strategy to improve the prediction performance of NIR data from wheat flour samples. Three different Multi-target approaches have been tested: Multi-target Regressor Stacking (MTRS), Ensemble of Regressor Chains (ERC) and Deep Structure for Tracking Asynchronous Regressor Stack (DSTARS). Each of these techniques has been tested with different regression methods: Support Vector Machine (SVM), Random Forest (RF) and Linear Regression (LR), on a dataset composed of NIR spectra of bread wheat flours for the prediction of quality-related parameters. By combining all MT techniques and predictors, we obtained an improvement of up to 7% in predictive performance compared with the corresponding Single-target (ST) approaches. The results support the potential advantage of MT techniques over ST techniques for analysing NIR spectra.
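
    To make the stacking idea concrete, here is a hedged sketch of Multi-target Regressor Stacking with scikit-learn; the synthetic data merely stands in for NIR spectra and flour quality targets, and Random Forest is used for both stages.

        # Sketch of Multi-target Regressor Stacking (MTRS); data are synthetic.
        import numpy as np
        from sklearn.datasets import make_regression
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import train_test_split

        X, Y = make_regression(n_samples=300, n_features=100, n_targets=3,
                               noise=0.1, random_state=0)
        X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

        # Stage 1: one single-target model per quality parameter.
        stage1 = [RandomForestRegressor(n_estimators=50, random_state=0)
                  .fit(X_tr, Y_tr[:, j]) for j in range(Y.shape[1])]
        P_tr = np.column_stack([m.predict(X_tr) for m in stage1])
        P_te = np.column_stack([m.predict(X_te) for m in stage1])

        # Stage 2: retrain on spectra augmented with all stage-1 predictions,
        # letting each target exploit its correlation with the others.
        stage2 = [RandomForestRegressor(n_estimators=50, random_state=0)
                  .fit(np.hstack([X_tr, P_tr]), Y_tr[:, j]) for j in range(Y.shape[1])]
        Y_hat = np.column_stack([m.predict(np.hstack([X_te, P_te])) for m in stage2])
        print(Y_hat.shape)  # one prediction per target parameter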

    Classification of fermented cocoa beans (cut test) using computer vision

    Get PDF
    Fermentation of cocoa beans is a critical step in chocolate manufacturing, since fermentation influences the development of flavour, affecting components such as free amino acids, peptides and sugars. The degree of fermentation is determined by visual inspection of changes in the internal colour and texture of beans, through the cut-test. Although considered the standard for evaluating fermentation in cocoa beans, this method is time consuming and relies on specialized personnel. Therefore, this study aims to classify fermented cocoa beans using computer vision as a fast and accurate method. Imaging and image analysis provide hand-crafted features computed from the beans, which were used as predictors in random decision forests to classify the samples. A total of 1800 beans were classified into four grades of fermentation. Using all image features, an accuracy of 0.93 was obtained in the validation of the unbalanced dataset, with a precision of 0.85 and a recall of 0.81. Although the unbalanced dataset represents the actual variation of fermentation, the method was also tested on a balanced dataset, to investigate the influence of a smaller number of samples per class, obtaining 0.92, 0.92 and 0.90 for accuracy, precision and recall, respectively. The technique can evolve into an industrial application with a proper integration framework, substituting the traditional method for classifying fermented cocoa beans.
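
    As an illustration of this kind of pipeline, the sketch below computes simple per-channel colour statistics for each bean crop and feeds them to a random forest; the images and the feature set are synthetic stand-ins, not the study's actual descriptors.

        # Illustrative cut-test pipeline: colour statistics -> random forest.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        def colour_features(img):
            """Per-channel mean and standard deviation of an RGB bean crop."""
            return np.concatenate([img.mean(axis=(0, 1)), img.std(axis=(0, 1))])

        rng = np.random.default_rng(0)
        images = rng.integers(0, 256, size=(200, 32, 32, 3)).astype(float)
        grades = rng.integers(0, 4, size=200)  # four fermentation grades

        X = np.array([colour_features(img) for img in images])
        clf = RandomForestClassifier(random_state=0).fit(X, grades)
        print(clf.predict(X[:5]))  # predicted grade for the first five beans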

    Language-independent fake news detection: English, Portuguese, and Spanish mutual features

    Get PDF
    Online Social Media (OSM) have been substantially transforming the process of spreading news, improving its speed, and reducing barriers toward reaching out to a broad audience. However, OSM are very limited in providing mechanisms to check the credibility of news propagated through their structure. The majority of studies on automatic fake news detection are restricted to English documents, with few works evaluating other languages, and none comparing language-independent characteristics. Moreover, the spreading of deceptive news tends to be a worldwide problem; therefore, this work evaluates textual features that are not tied to a specific language when describing textual data for detecting deceptive news. Corpora of news written in American English, Brazilian Portuguese, and Spanish were explored to study complexity, stylometric, and psychological text features. The extracted features support the detection of fake, legitimate, and satirical news. We compared four machine learning algorithms (k-Nearest Neighbors (k-NN), Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGB)) to induce the detection model. Results show our proposed language-independent features are successful in describing fake, satirical, and legitimate news across the three languages, with an average detection accuracy of 85.3% with RF.
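
    A hedged sketch of the feature idea: the descriptors below (average word length, type-token ratio, punctuation ratio) need no language-specific lexicon, so they apply to English, Portuguese, and Spanish alike. They are illustrative examples, not the paper's exact feature set, and the labels are toy values.

        # Toy language-independent stylometric features + Random Forest.
        from sklearn.ensemble import RandomForestClassifier

        def stylometric_features(text):
            words = text.split()
            n_words = max(len(words), 1)
            avg_word_len = sum(len(w) for w in words) / n_words
            type_token_ratio = len({w.lower() for w in words}) / n_words
            punct_ratio = sum(c in ".,;:!?" for c in text) / max(len(text), 1)
            return [avg_word_len, type_token_ratio, punct_ratio]

        docs = ["Breaking!!! You won't believe what happened...",
                "The ministry published its annual report today.",
                "Increible!!! Nadie esperaba esto...",
                "El ministerio publico su informe anual."]
        labels = [1, 0, 1, 0]  # toy labels: 1 = fake-style, 0 = legitimate-style

        clf = RandomForestClassifier(random_state=0)
        clf.fit([stylometric_features(d) for d in docs], labels)
        print(clf.predict([stylometric_features("Shocking secret revealed!!!")]))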

    Football player dominant region determined by a novel model based on instantaneous kinematics variables

    Get PDF
    Dominant regions are defined as the regions of the pitch that a player can reach before any other and are commonly determined without considering the free spaces on the pitch. We present an approach to analysing football players’ dominant regions, based on movement models created from players’ positions, displacement, velocity, and acceleration vectors. 109 Brazilian male professional football players were analysed during official matches, comprising over 15 million positional records obtained by a video-based tracking system. Movement models were created from the players’ instantaneous vectorial kinematic variables, and then probability models and dominant regions were determined. The accuracy of the proposed model in determining dominant regions was tested for different time-lag windows. We calculated the areas of dominant, free-space, and Voronoi regions. Mean correct predictions of the dominant region were 96.56%, 88.64%, and 72.31% for one, two, and three seconds, respectively. Dominant region areas were smaller than those computed by Voronoi, with median values of 73 and 171 m², respectively. Free-space regions had a median value of 5537 m², representing a large part of the pitch. The proposed movement model proved to be more realistic, representing the match dynamics, and can be a useful method for evaluating players’ tactical behaviours during matches.
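
    The sketch below conveys the idea on a discrete pitch grid: each cell is assigned to the player with the shortest estimated arrival time, with the current velocity speeding up (or slowing down) movement toward a cell. The constant top speed and the naive arrival-time rule are simplifying assumptions, not the paper's full probabilistic model.

        # Grid-based dominant regions under a naive kinematic arrival-time rule.
        import numpy as np

        def dominant_regions(positions, velocities,
                             pitch=(105.0, 68.0), cell=1.0, v_max=8.0):
            """Assign each pitch cell to the player who reaches it first."""
            gx, gy = np.meshgrid(np.arange(cell / 2, pitch[0], cell),
                                 np.arange(cell / 2, pitch[1], cell))
            arrival = []
            for (px, py), (vx, vy) in zip(positions, velocities):
                dx, dy = gx - px, gy - py
                dist = np.hypot(dx, dy)
                # Velocity component toward the cell: moving toward it helps.
                toward = (dx * vx + dy * vy) / np.maximum(dist, 1e-9)
                arrival.append(dist / np.clip(v_max + toward, 0.5, None))
            return np.argmin(np.stack(arrival), axis=0)  # owner index per cell

        positions = [(30.0, 34.0), (60.0, 20.0), (80.0, 50.0)]
        velocities = [(2.0, 0.0), (-1.0, 1.0), (0.0, -3.0)]
        owners = dominant_regions(positions, velocities)
        print(np.bincount(owners.ravel()))  # cells (~m²) dominated per player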

    Process Mining Encoding via Meta-learning for an Enhanced Anomaly Detection

    No full text
    Anomalous traces diminish an event log’s quality due to bad execution or security issues, for instance. To mitigate this phenomenon, organizations spend effort detecting anomalous traces in their business processes to save resources and improve process execution. Conformance checking techniques are usually employed in these situations. These methods rely on comparing the obtained event log with the designed process model. However, in many real-world environments, the log is noisy and the model unavailable, requiring more robust techniques and expert assistance to perform conformance checking. The considerable number of techniques and the reduced availability of experts pose an additional challenge to detecting anomalous traces in particular event log scenarios. In this work, we combine the representational power of encoding with a Meta-learning strategy to enhance the detection of anomalous traces in event logs, seeking the best discriminative capability between common and irregular traces. Our method extracts meta-features from an event log and recommends the most suitable encoding technique to increase anomaly detection performance. We used three encoding techniques from different families, 80 log descriptors, 168 event logs, and six anomaly types in the experiments. Results indicate that event log characteristics influence the representational capability of encodings differently. Our proposed Meta-learning method outperforms the baseline, reaching an F-score of 0.73. This performance demonstrates that traditional process mining analysis can be leveraged when matched with intelligent decision support approaches.
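
    The recommendation step can be pictured as a standard supervised problem, as in the hedged sketch below; the meta-feature matrix and the encoding labels are synthetic stand-ins for the paper's 80 descriptors and three encoding families.

        # Meta-learning sketch: log descriptors -> best-encoding recommendation.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(0)
        meta_features = rng.random((168, 80))         # one row per event log
        best_encoding = rng.integers(0, 3, size=168)  # winning encoding per log

        recommender = RandomForestClassifier(random_state=0)
        recommender.fit(meta_features, best_encoding)

        unseen_log = rng.random((1, 80))              # descriptors of a new log
        print(recommender.predict(unseen_log))        # suggested encoding family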

    Analysis of Language Inspired Trace Representation for Anomaly Detection

    No full text
    A great concern for organizations is to detect anomalous process instances within their business processes. For that, conformance checking performs model-aware analysis by comparing process logs to business models to detect anomalous process executions. However, in several scenarios, a model is either unavailable or its generation is costly, which requires the employment of alternative methods to allow a confident representation of traces. This work analyses a language-inspired trace representation grounded in the word2vec encoding algorithm. We argue that natural language encodings correctly model the behavior of business processes, supporting a proper distinction between common and anomalous behavior. In the experiments, we compared accuracy and time cost among different word2vec setups and classic encoding methods (token-based replay and alignment features), addressing seven different anomaly scenarios. Feature importance values and the impact of different anomalies in seven event logs were also evaluated to bring insights on the trace representation subject. Results show the proposed encoding surpasses the representational capability of traditional conformance metrics for the anomaly detection task.
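
    A minimal sketch of the encoding idea, assuming activities play the role of words and traces the role of sentences, with a trace vector taken as the mean of its activity embeddings (the averaging and the hyperparameters are illustrative choices, not the paper's setup):

        # word2vec over traces: activities as words, traces as sentences.
        import numpy as np
        from gensim.models import Word2Vec

        traces = [["register", "check", "approve", "notify"],
                  ["register", "check", "reject", "notify"],
                  ["register", "approve", "approve", "notify"]]  # odd repetition
        model = Word2Vec(traces, vector_size=16, window=2, min_count=1, seed=0)

        def encode(trace):
            """Represent a trace as the mean of its activity embeddings."""
            return np.mean([model.wv[a] for a in trace], axis=0)

        X = np.stack([encode(t) for t in traces])
        print(X.shape)  # (3, 16): ready for a downstream anomaly detector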

    Pre-trained Data Augmentation for Text Classification

    No full text
    Data augmentation is a widely adopted method for improving model performance in image classification tasks. Although it is still not as ubiquitous in the Natural Language Processing (NLP) community, some methods have already been proposed to increase the amount of training data using simple text transformations or text generation through language models. However, recent text classification tasks need to deal with domains characterized by small amounts of text and informal writing, e.g., Online Social Networks content, reducing the capabilities of current methods. Facing these challenges by taking advantage of pre-trained language models, low computational resource consumption, and model compression, we propose the PRE-trained Data AugmenTOR (PREDATOR) method. Our data augmentation method is composed of two modules: the Generator, which synthesizes new samples grounded on a lightweight model, and the Filter, which selects only the high-quality ones. Experiments comparing Bidirectional Encoder Representations from Transformers (BERT), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) and Multinomial Naive Bayes (NB) on three datasets showed effective accuracy improvements: 28.5% with LSTM in the best scenario and an average improvement of 8% across all scenarios. PREDATOR was able to augment real-world social media datasets and other domains, outperforming recent text augmentation techniques.
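
    To illustrate the two-module design, here is a hedged sketch of a generate-then-filter loop: a masked language model proposes variants of a sentence and only confident proposals are kept. The model choice, the single-word masking strategy, and using the MLM's own score as the Filter are all simplifying assumptions, not PREDATOR's actual components.

        # Generate-then-filter augmentation sketch (not PREDATOR itself).
        from transformers import pipeline

        fill = pipeline("fill-mask", model="distilbert-base-uncased")

        def augment(text, word, min_score=0.05):
            """Mask one word, let the LM propose variants, keep confident ones."""
            masked = text.replace(word, fill.tokenizer.mask_token, 1)
            return [p["sequence"] for p in fill(masked)  # Generator proposals
                    if p["score"] >= min_score]          # Filter by confidence

        print(augment("the movie was great fun", "great"))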

    Artificial Immune Systems and Fuzzy Logic to Detect Flooding Attacks in Software-Defined Networks

    Get PDF
    Software-Defined Networking (SDN) has emerged as an architecture that uses applications to make networks flexible and centrally controlled. Although SDN provides innovative management, it is still susceptible to daily attacks. Traditional detection approaches may not be sufficient to contain these threats. In this paper, we present an Artificial Immune System based IDS named AIS-IDS, inspired by the human body's defense cells. AIS-IDS can detect variations in network behavior and identify attacks without prior knowledge of them. Along with the AIS, fuzzy logic is applied during detection to minimize the uncertainty when there is no clear boundary between anomalous and normal traffic behavior. We simulated portscan and flooding attacks and also used a public dataset with several types of DDoS attacks to assess our proposal. We compared the AIS-IDS performance with Naive Bayes, k-nearest neighbors, and the Local Outlier Factor. AIS-IDS outperformed the compared algorithms, achieving F-measure rates of 99.97% and 92.28% on the simulated and the public dataset, respectively.
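
    As a rough illustration of the immune-system metaphor, the sketch below implements a textbook negative-selection step (random detectors that fail to match normal traffic are kept) plus a fuzzy membership over the distance to the nearest detector; the features, radius, and thresholds are toy values, not AIS-IDS's actual design.

        # Negative selection with a fuzzy anomaly degree (toy values throughout).
        import numpy as np

        rng = np.random.default_rng(0)
        self_traffic = rng.normal(0.3, 0.05, size=(200, 4))  # normal flow features

        # Keep random detectors that do NOT match any normal ("self") sample.
        candidates, radius = rng.random((2000, 4)), 0.15
        detectors = np.array([c for c in candidates
                              if np.linalg.norm(self_traffic - c, axis=1).min() > radius])

        def fuzzy_alert(flow):
            """Map distance to the nearest detector into a [0, 1] anomaly degree."""
            d = np.linalg.norm(detectors - flow, axis=1).min()
            return float(np.clip(1.0 - d / radius, 0.0, 1.0))

        print(fuzzy_alert(rng.normal(0.3, 0.05, 4)))         # normal-looking flow
        print(fuzzy_alert(np.array([0.9, 0.9, 0.9, 0.9])))   # flooding-like flow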

    Evaluating the Four-Way Performance Trade-Off for Data Stream Classification in Edge Computing

    Get PDF
    Edge computing (EC) is a promising technology capable of bridging the gap between Cloud computing services and the demands of emerging technologies such as the Internet of Things (IoT). Most EC-based solutions, from wearable devices to smart city architectures, benefit from Machine Learning (ML) methods to perform various tasks, such as classification. In these cases, ML solutions need to deal efficiently with a huge amount of data, while balancing predictive performance, memory and time costs, and energy consumption. The fact that these data usually come in the form of a continuous and evolving data stream makes the scenario even more challenging. Many algorithms have been proposed to cope with data stream classification, e.g., the Very Fast Decision Tree (VFDT) and the Strict VFDT (SVFDT). Recently, Online Local Boosting (OLBoost) has also been introduced to improve predictive performance without modifying the underlying structure of the decision tree produced by these algorithms. In this work, we compared the four-way trade-off among time efficiency, energy consumption, predictive performance, and memory costs, tuning the hyperparameters of the VFDT and the two versions of the SVFDT, with and without OLBoost. Experiments over six benchmark datasets using an EC device revealed that the VFDT and SVFDT-I were the most energy-friendly algorithms, with SVFDT-I also significantly reducing memory consumption. OLBoost, as expected, improved predictive performance, but caused a deterioration in memory and energy consumption.
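
    A minimal sketch of VFDT-style stream classification under the test-then-train protocol, here using the river library's Hoeffding tree and a synthetic stream as stand-ins for the paper's algorithms and benchmark datasets:

        # Test-then-train stream classification with a Hoeffding (VFDT-style) tree.
        from river import metrics, tree
        from river.datasets import synth

        stream = synth.SEA(seed=42)              # synthetic concept stream
        model = tree.HoeffdingTreeClassifier()   # incremental decision tree
        acc = metrics.Accuracy()

        for x, y in stream.take(5000):
            y_pred = model.predict_one(x)        # test first...
            if y_pred is not None:
                acc.update(y, y_pred)
            model.learn_one(x, y)                # ...then train on the same item

        print(acc)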