
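    Text Classification Using Genetic Programming with an Implementation of Web Scraping and Map Reduce (Klasifikasi Teks menggunakan Genetic Programming dengan Implementasi Web Scraping dan Map Reduce)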

    Classification of text documents on online media is a big data problem and requires automation. This research developed a text classification system with pre-processing using map-reduce and data collection via web scraping. The study aims to evaluate text classification performance by combining genetic programming algorithms, map-reduce and web scraping for processing large volumes of text data. Data was collected by web scraping, and 8126 duplicates were removed. Map-reduce performed tokenization and stop-word removal, yielding 28507 terms, of which 4306 are unique terms and 24201 are duplicate terms. The text classification evaluation shows that a single tree produces better accuracy (0.7072) than a decision tree (0.6874), and the lowest is a multi-tree (0.6726). For the genetic programming support values, the multi-tree has the highest average support at 0.3854, followed by the decision tree at 0.3584, and the single tree has the lowest at 0.3494. In general, the amount of support is not in line with the accuracy value achieved.
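    The term counts in this abstract are consistent with a word-count style map-reduce pass, since 4306 unique terms plus 24201 duplicate terms gives the 28507 total. The following is a minimal sketch of such a pass, not the authors' implementation: the stop-word list, the regular-expression tokenizer, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical stop-word list; the study's actual list is not given.
STOP_WORDS = {"the", "a", "is", "and", "of"}


def map_tokens(document):
    """Map step: tokenize one document and drop stop words."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def reduce_counts(left, right):
    """Reduce step: merge two partial term-count tables."""
    return left + right


def term_statistics(documents):
    """Return (total terms, unique terms, duplicate terms)."""
    totals = reduce(reduce_counts, map(map_tokens, documents), Counter())
    total_terms = sum(totals.values())
    unique_terms = len(totals)
    # Duplicates are occurrences beyond the first of each term.
    return total_terms, unique_terms, total_terms - unique_terms


if __name__ == "__main__":
    docs = ["The cat and the dog", "A dog is a dog"]
    print(term_statistics(docs))
```

    In this sketch the duplicate count is defined as total minus unique occurrences, which is the relationship the reported figures satisfy.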

    ACP Dashboard: an interactive visualization tool for selecting analytics configurations in an industrial setting

    The production process in a factory can be described by a large amount of data. This data is used to optimize the production process, reduce the number of failures and control material waste. For this, the data is processed, analyzed and classified using analysis techniques (text classification algorithms). Thus there should be an approach that supports the choice of algorithms on both the technical and the management level. We propose a tool called Analytics Configuration Performance Dashboard, which facilitates the comparison of algorithm configurations. It is based on a meta-learning approach. Additionally, we introduce three business metrics on which algorithms are compared; they map onto machine learning evaluation metrics and help to assess algorithms from an industry perspective. Moreover, we develop a visualization to provide a clear representation of the data. Clustering is used to define groups of algorithms that perform similarly on the business metrics. We conclude with an evaluation of the proposed approach and of the techniques chosen for its implementation.

    Design of Multi-View Based Email Classification for IoT Systems via Semi-Supervised Learning

    Suspicious emails, which aim to induce users to click and then redirect them to a phishing webpage, are a major threat to Internet of Things (IoT) security. To protect IoT systems, email classification is an essential mechanism for separating spam from legitimate emails. In the literature, most email classification approaches adopt supervised learning algorithms that require a large amount of labeled data for classifier training. However, data labeling is very time consuming and expensive, so only a very small set of labeled data is available in practice, which would greatly degrade the effectiveness of email classification. To mitigate this problem, in this work, we develop an email classification approach based on multi-view disagreement-based semi-supervised learning. The underlying idea is that a multi-view method can offer richer information for classification, which is often ignored in the literature. The use of semi-supervised learning can help leverage both labeled and unlabeled data. In the evaluation, we investigate the performance of our proposed approach with datasets and in real network environments. Experimental results demonstrate that multi-view can achieve better classification performance than single view, and that our approach can achieve better performance than existing similar algorithms.

    Surveying alignment-free features for Ortholog detection in related yeast proteomes by using supervised big data classifiers

    Background: The development of new ortholog detection algorithms and the improvement of existing ones are of major importance in functional genomics. We have previously introduced a successful supervised pairwise ortholog classification approach implemented in a big data platform that considered several pairwise protein features and the low ortholog pair ratios found between two annotated proteomes (Galpert, D et al., BioMed Research International, 2015). The supervised models were built and tested using a Saccharomycete yeast benchmark dataset proposed by Salichos and Rokas (2011). Although several pairwise protein features were combined in a supervised big data approach, they were all, to some extent, alignment-based features, and the proposed algorithms were evaluated on a single test set. Here, we aim to evaluate the impact of alignment-free features on the performance of supervised models implemented in the Spark big data platform for pairwise ortholog detection in several related yeast proteomes. Results: The Spark Random Forest and Decision Trees with oversampling and undersampling techniques, built with only alignment-based similarity measures or combined with several alignment-free pairwise protein features, showed the highest classification performance for ortholog detection in three yeast proteome pairs. Although such supervised approaches outperformed traditional methods, there were no significant differences between the exclusive use of alignment-based similarity measures and their combination with alignment-free features, even within the twilight zone of the studied proteomes. Only when alignment-based and alignment-free features were combined in Spark Decision Trees with imbalance management could a higher success rate (98.71%) within the twilight zone be achieved for a yeast proteome pair that underwent a whole genome duplication. The feature selection study showed that alignment-based features were top-ranked for the best classifiers, while the runners-up were alignment-free features related to amino acid composition. Conclusions: The incorporation of alignment-free features in supervised big data models did not significantly improve ortholog detection in yeast proteomes compared with the classification quality achieved with just alignment-based similarity measures. However, the similarity of their classification performance to that of traditional ortholog detection methods encourages the evaluation of other alignment-free protein pair descriptors in future research. This work was supported by the following financial sources: Postdoc fellowship (SFRH/BPD/92978/2013) granted to GACh by the Portuguese Fundação para a Ciência e a Tecnologia (FCT). AA was supported by the MarInfo – Integrated Platform for Marine Data Acquisition and Analysis (reference NORTE-01-0145-FEDER-000031), a project supported by the North Portugal Regional Operational Program (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF).

    Big Data Classification of Ultrasound Doppler Scan Images Using a Decision Tree Classifier Based on Maximally Stable Region Feature Points

    The classification of ultrasound scan images is important in monitoring the development of prenatal and maternal structures. This paper proposes a big data classification system for ultrasound Doppler scan images that combines the residual of maximally stable extremal regions and speeded up robust features (SURF) with a decision tree classifier. The algorithm first preprocesses the ultrasound scan images before detecting the maximally stable extremal regions (MSER). A few essential regions are chosen from the MSER regions, along with the residual region that provides the best Region of Interest (ROI). SURF feature points that best represent the region are detected using the gradient of the estimated cumulative region of interest. To extract features from the pixels that surround the SURF feature points, the Triangular Vertex Transform (TVT) is used. A decision tree classifier is trained on the extracted TVT features. The proposed ultrasound scan image classification system is validated using performance parameters such as accuracy, specificity, precision, sensitivity, and F1 score. For validation, a large dataset of 12,400 scan images collected from 1792 patients is used. The proposed method has an F1 score of 94.12%, sensitivity, specificity, precision, and accuracy of 93.57%, 93.57%, and 97.96%, respectively. The evaluation results show that the proposed algorithm for classifying Doppler scan images is better than other algorithms that have been used in the past.
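    The validation metrics named in this abstract all follow from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name and the example counts are hypothetical and are not taken from the paper, whose confusion matrix is not given here.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    for name, value in binary_metrics(tp=8, fp=2, tn=9, fn=1).items():
        print(f"{name}: {value:.4f}")
```

    The sketch assumes every denominator is non-zero; a production version would need to handle empty classes.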