Text Classification Using Genetic Programming with Web Scraping and Map Reduce Implementation (Klasifikasi Teks menggunakan Genetic Programming dengan Implementasi Web Scraping dan Map Reduce)
Classification of text documents in online media is a big data problem that requires automation. This research developed a text classification system with map-reduce pre-processing and web scraping for data collection. The study aims to evaluate text classification performance when genetic programming, map-reduce, and web scraping are combined to process large volumes of text. Data were collected by web scraping, and 8126 duplicates were removed. The map-reduce stage performed tokenization and stop-word removal, yielding 28507 terms: 4306 unique terms and 24201 duplicate terms. The evaluation shows that a single tree produces better accuracy (0.7072) than a decision tree (0.6874), and a multi-tree produces the lowest (0.6726). For genetic programming support values, the multi-tree has the highest average support (0.3854), followed by the decision tree (0.3584) and the single tree (0.3494). In general, the support values are not in line with the accuracy achieved.
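The term counts reported in this abstract (total, unique, and duplicate terms after tokenization and stop-word removal) can be illustrated with a minimal map-reduce style sketch. It runs in a single process rather than on a cluster; the stop-word list, the regular-expression tokenizer, and the function names are assumptions for illustration, not the authors' implementation.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical stop-word list; the abstract does not say which list was used.
STOP_WORDS = {"the", "a", "is", "of", "and"}


def map_tokens(document):
    """Map step: tokenize one document and count its non-stop-word terms."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def reduce_counts(left, right):
    """Reduce step: merge two partial term-count tables."""
    return left + right


def term_statistics(documents):
    """Return (total terms, unique terms, duplicate terms) for a corpus."""
    totals = reduce(reduce_counts, map(map_tokens, documents), Counter())
    total_terms = sum(totals.values())
    unique_terms = len(totals)
    # Duplicates are the occurrences beyond the first of each unique term.
    return total_terms, unique_terms, total_terms - unique_terms
```

Under this reading, duplicate terms are simply total terms minus unique terms, which agrees with the figures in the abstract (28507 - 4306 = 24201).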
ACP Dashboard: an interactive visualization tool for selecting analytics configurations in an industrial setting
The production process in a factory can be described by a large amount of data. These data are used to optimize the production process, reduce the number of failures, and control material waste. To this end, the data are processed, analyzed, and classified using analysis techniques, specifically text classification algorithms. There should therefore be an approach that supports the choice of algorithms at both the technical and management levels. We propose a tool called the Analytics Configuration Performance Dashboard, which facilitates the comparison of algorithm configurations. It is based on a meta-learning approach. Additionally, we introduce three business metrics on which algorithms are compared; they map onto machine learning evaluation metrics and help to assess algorithms from an industry perspective. Moreover, we develop a visualization to provide a clear representation of the data. Clustering is used to define groups of algorithms that perform similarly on the business metrics. We conclude with an evaluation of the proposed approach and of the techniques chosen for its implementation.
Design of Multi-View Based Email Classification for IoT Systems via Semi-Supervised Learning
Suspicious emails are a major threat to Internet of Things (IoT) security: they aim to induce users to click a link that redirects them to a phishing webpage. To protect IoT systems, email classification is an essential mechanism for separating spam from legitimate emails. In the literature, most email classification approaches adopt supervised learning algorithms that require a large amount of labeled data for classifier training. However, data labeling is time consuming and expensive, so only a very small set of labeled data is available in practice, which greatly degrades the effectiveness of email classification. To mitigate this problem, we develop an email classification approach based on multi-view disagreement-based semi-supervised learning. The underlying idea is that a multi-view method can offer richer information for classification, which is often ignored in the literature, while semi-supervised learning can leverage both labeled and unlabeled data. In the evaluation, we investigate the performance of the proposed approach on datasets and in real network environments. Experimental results demonstrate that multi-view classification achieves better performance than single-view classification, and that our approach outperforms existing similar algorithms.
Surveying alignment-free features for Ortholog detection in related yeast proteomes by using supervised big data classifiers
Abstract
Background: The development of new ortholog detection algorithms and the improvement of existing ones are of
major importance in functional genomics. We have previously introduced a successful supervised pairwise ortholog
classification approach implemented in a big data platform that considered several pairwise protein features and the
low ortholog pair ratios found between two annotated proteomes (Galpert, D et al., BioMed Research International,
2015). The supervised models were built and tested using a Saccharomycete yeast benchmark dataset proposed by
Salichos and Rokas (2011). Although several pairwise protein features were combined in a supervised big data approach,
they were all, to some extent, alignment-based features, and the proposed algorithms were evaluated on a single test
set. Here, we aim to evaluate the impact of alignment-free features on the performance of supervised models
implemented in the Spark big data platform for pairwise ortholog detection in several related yeast proteomes.
Results: The Spark Random Forest and Decision Trees with oversampling and undersampling techniques, and built
with only alignment-based similarity measures or combined with several alignment-free pairwise protein features
showed the highest classification performance for ortholog detection in three yeast proteome pairs. Although such
supervised approaches outperformed traditional methods, there were no significant differences between the exclusive
use of alignment-based similarity measures and their combination with alignment-free features, even within the
twilight zone of the studied proteomes. Only when alignment-based and alignment-free features were combined in
Spark Decision Trees with imbalance management could a higher success rate (98.71%) within the twilight zone be
achieved for a yeast proteome pair that underwent a whole genome duplication. The feature selection study showed
that alignment-based features were top-ranked for the best classifiers while the runners-up were alignment-free
features related to amino acid composition.
Conclusions: The incorporation of alignment-free features in supervised big data models did not significantly improve
ortholog detection in yeast proteomes regarding the classification qualities achieved with just alignment-based
similarity measures. However, the similarity of their classification performance to that of traditional ortholog detection
methods encourages the evaluation of other alignment-free protein pair descriptors in future research.
This work was supported by the following financial sources: Postdoc
fellowship (SFRH/BPD/92978/2013) granted to GACh by the Portuguese
Fundação para a Ciência e a Tecnologia (FCT). AA was supported by the
MarInfo – Integrated Platform for Marine Data Acquisition and Analysis
(reference NORTE-01-0145-FEDER-000031), a project supported by the
North Portugal Regional Operational Program (NORTE 2020), under the
PORTUGAL 2020 Partnership Agreement, through the European Regional
Development Fund (ERDF)
Big Data Classification of Ultrasound Doppler Scan Images Using a Decision Tree Classifier Based on Maximally Stable Region Feature Points
The classification of ultrasound scan images is important in monitoring the development of prenatal and maternal structures. This paper proposes a big data classification system for ultrasound Doppler scan images that combines the residual of maximally stable extremal regions (MSER) and speeded up robust features (SURF) with a decision tree classifier. The algorithm first preprocesses the ultrasound scan images and then detects the MSER regions. A few essential regions are chosen from the MSER regions, along with the residual region that provides the best Region of Interest (ROI). SURF feature points that best represent the region are detected using the gradient of the estimated cumulative region of interest. The Triangular Vertex Transform (TVT) is used to extract features from the pixels that surround the SURF feature points. A decision tree classifier is trained on the extracted TVT features. The proposed ultrasound scan image classification system is validated using performance parameters such as accuracy, specificity, precision, sensitivity, and F1 score. For validation, a large dataset of 12,400 scan images collected from 1792 patients is used. The proposed method has an F1 score of 94.12%, sensitivity, specificity, precision, and accuracy of 93.57%, 93.57%, and 97.96%, respectively. The evaluation results show that the proposed algorithm classifies Doppler scan images better than previously used algorithms.
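For reference, the five validation measures named in this abstract follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not the paper's data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives,
    and false negatives. Assumes no denominator is zero.
    """
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fp=10, tn=80, fn=20)
```

The F1 score is the harmonic mean of precision and sensitivity, so it can differ noticeably from accuracy when the classes are imbalanced.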