81,975 research outputs found
Algorithm Comparison and Feature Selection for Classification of Broiler Chicken Harvest
Broiler chickens are superior breeds raised to produce large quantities of meat. In practice, however, many breeders experience crop failure, which has a serious economic impact and can also lower a farmer's quality rating, resulting in sanctions. The performance index produced at harvest indicates the success rate of a broiler harvest, so harvest yield data can be used to classify harvest outcomes with a suitable classification approach. This study's data mining process followed the CRISP-DM (Cross Industry Standard Process for Data Mining) method. The study compares three classification algorithms to determine the best one, and three feature selection methods to determine which best improves algorithm performance. According to the findings, Random Forest is the best algorithm for classifying the harvest data, with an accuracy of 89.14 percent, and Backward Elimination is the best method for improving the algorithm's performance, increasing accuracy by 7.53 percent. As a result, Random Forest combined with Backward Elimination yields an accuracy of 96.67 percent. The study also finds that the factors influencing crop yield are FCR (feed conversion ratio), number of harvests, and body weight.
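The abstract does not give the paper's exact pipeline, but the Random Forest plus backward elimination combination it describes can be sketched with scikit-learn. The dataset below is a synthetic stand-in; the real harvest features (FCR, number of harvests, body weight) are not available here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the broiler harvest dataset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
baseline = cross_val_score(rf, X, y, cv=5).mean()

# Backward elimination: start from all features and greedily drop
# the ones whose removal hurts cross-validated accuracy the least.
sfs = SequentialFeatureSelector(rf, n_features_to_select=4,
                                direction="backward", cv=5)
X_sel = sfs.fit_transform(X, y)
selected = cross_val_score(rf, X_sel, y, cv=5).mean()

print(f"baseline: {baseline:.3f}, after backward elimination: {selected:.3f}")
```

Whether elimination helps depends on how many irrelevant features the data contains, which is the point the abstract's 7.53 percent improvement illustrates.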
On Identifying Critical Nuggets Of Information During Classification Task
In large databases there may exist critical nuggets: small collections of records or instances that contain domain-specific, important information. This information can be used for future decision making, such as labeling critical, unlabeled data records and improving classification results by reducing false positive and false negative errors. In recent years, data mining efforts have focused on pattern and outlier detection methods, but little effort has been dedicated to finding critical nuggets within a data set. This work introduces the idea of critical nuggets, proposes a domain-independent method to measure criticality, suggests a heuristic to reduce the search space for finding critical nuggets, and isolates and validates critical nuggets from several real-world data sets. Only a few subsets typically qualify as critical nuggets, underscoring the importance of finding them, and the proposed methodology can detect them. This work also identifies certain properties of critical nuggets and validates those properties experimentally. Critical nuggets were then applied to two important classification performance metrics: classification accuracy and misclassification costs. Experimental results validated that critical nuggets can help improve classification accuracy on real-world data sets when compared with standalone classification algorithms, and the improvements were statistically significant. Extensive studies on real-world data sets also showed that the critical-nuggets-based approach yielded statistically significantly lower misclassification costs than standalone classification methods.
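The paper's criticality measure is not reproduced in this abstract, but the core intuition, that a small group of records can carry disproportionate weight for a classifier, can be illustrated with a simple ablation sketch: partition the training data into small candidate subsets and score each by the accuracy drop its removal causes. This is an illustrative proxy, not the paper's actual method.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

# Partition the training set into small candidate subsets.
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X_tr)

scores = {}
for c in range(20):
    keep = labels != c  # remove one candidate subset at a time
    acc = DecisionTreeClassifier(random_state=0).fit(
        X_tr[keep], y_tr[keep]).score(X_te, y_te)
    scores[c] = base_acc - acc  # large drop => subset carries critical information

top = max(scores, key=scores.get)
print(f"most critical cluster: {top}, accuracy drop: {scores[top]:.3f}")
```

A subset whose removal degrades accuracy the most plays the role of a "critical nugget" in this toy setting; the paper's contribution is a principled, domain-independent score and a search heuristic rather than this exhaustive ablation.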
Feature selection from colon cancer dataset for cancer classification using Artificial Neural Network
In the fast-growing and research-intensive field of medicine, studies that demonstrate significant improvements to healthcare are imperative, especially in cancer research. This research contributes such findings by including feature selection as one of its major components. Feature selection has become a vital step in applying data mining algorithms effectively to real-world classification problems, and it has long been a focus of interest with a large body of completed work. Despite the extensive research in the field, studies demonstrating near-perfect accuracy remain limited, so more scientifically driven results are needed. Drawing on prior feature selection research, the method in this study was the product of careful selection and planning. Specifically, this study used feature selection to improve classification accuracy on a cancer dataset: it proposed an Artificial Neural Network (ANN) for cancer classification with feature selection on a colon cancer dataset, using the best-first search method in the Weka toolkit for feature selection. The experiment achieved 98.4% accuracy for cancer classification after feature selection with the proposed algorithm, showing that feature selection improved classification accuracy on the colon cancer dataset. This result is comparable with other studies on colon cancer, represents another significant improvement, and is promising for future applications.
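The study uses Weka's best-first wrapper search with an ANN; a rough scikit-learn analogue can be sketched by comparing a neural network on all features against one trained on a filtered subset. The data here is a synthetic stand-in for the small, high-dimensional colon cancer dataset (62 samples, ~2000 genes), and the univariate filter is a simple substitute for the best-first search, not the paper's exact procedure.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: few samples, many features, as in gene-expression data.
X, y = make_classification(n_samples=62, n_features=200, n_informative=10,
                           random_state=0)

ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=3000, random_state=0)

# ANN on all features vs. ANN on a selected subset.
acc_all = cross_val_score(make_pipeline(StandardScaler(), ann), X, y, cv=5).mean()
acc_sel = cross_val_score(
    make_pipeline(StandardScaler(), SelectKBest(f_classif, k=20), ann),
    X, y, cv=5).mean()

print(f"all 200 features: {acc_all:.3f}, best 20 features: {acc_sel:.3f}")
```

With so few samples, discarding uninformative features typically stabilizes the network, which is the effect the 98.4% figure in the study reflects.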
Comparison of Decision Tree, Naïve Bayes and k-Nearest Neighbors for Predicting Thesis Graduation
The thesis is one of the evaluations of student learning. At Universitas Budi Luhur (UBL), especially in the Informatics Department, the thesis is one of the requirements for students to graduate with a Bachelor of Computer degree. Each semester, around 200-300 Informatics Department students take the thesis. The problem still faced is that student graduation rates on the thesis are not optimal; student failures are thought to be related to several technical and non-technical factors. In this study, data mining algorithms were used to determine the factors that influence student graduation on the thesis. The dataset was obtained from Informatics Department students who took the thesis in the 2016/2017 and 2017/2018 academic years. To find the most suitable classification method, three methods were tested: Decision Tree, Naïve Bayes, and k-Nearest Neighbors (kNN). A comparison of accuracy, precision, and recall showed that the kNN algorithm performed best, so this method was chosen to predict graduation. The study also developed an application for predicting students' thesis graduation that applies the kNN classification method. The test results showed an accuracy of 78.20%, precision of 80.32%, and recall of 96.49%. This research is expected to be useful for improving the quality of student thesis services.
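The three-way comparison the study performs is straightforward to reproduce as a pattern. The sketch below uses a public dataset in place of the (unavailable) UBL student records, and reports the same three metrics the study compares.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Public dataset as a stand-in for the student graduation records.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f} "
          f"prec={precision_score(y_te, pred):.3f} "
          f"rec={recall_score(y_te, pred):.3f}")
```

Which algorithm wins depends on the dataset; on the study's data, kNN came out ahead on all three metrics.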
An evolutionary approach for balancing effectiveness and representation level in gene selection
As data mining develops and expands into new application areas, feature selection also reveals various aspects to be considered. This paper highlights two aspects that categorize the large body of available feature selection algorithms: effectiveness and representation level. Effectiveness deals with selecting the minimum set of variables that maximizes the accuracy of a classifier, while representation level concerns discovering how relevant the variables are to the domain of interest. To balance these two aspects, the paper proposes an evolutionary framework for feature selection that realizes a hybrid method organized in layers, each exploiting a specific model of search strategy. Extensive experiments on gene selection from DNA-microarray datasets are presented and discussed. The results indicate that the framework compares well with different hybrid methods proposed in the literature, as it is capable of finding well-suited subsets of informative features while improving classification accuracy.
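The layered hybrid search the paper describes is not detailed in this abstract, but the evolutionary core, a genetic algorithm whose fitness trades classifier accuracy against subset size, can be sketched as follows. All parameters (population size, mutation rate, size penalty) are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic stand-in for a DNA-microarray gene-expression dataset.
X, y = make_classification(n_samples=100, n_features=50, n_informative=8,
                           random_state=0)

def fitness(mask):
    """Effectiveness (CV accuracy) penalized by subset size."""
    if mask.sum() == 0:
        return 0.0
    acc = cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()
    return acc - 0.002 * mask.sum()

pop = rng.random((20, X.shape[1])) < 0.2  # initial population of gene subsets
for gen in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[::-1][:10]]  # truncation selection
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = rng.integers(1, X.shape[1])          # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(X.shape[1]) < 0.02       # bit-flip mutation
        children.append(np.where(flip, ~child, child))
    pop = np.vstack([parents, np.array(children)])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(f"selected {best.sum()} features")
```

The size penalty in the fitness is what pushes the search toward small, interpretable gene subsets, the "representation level" concern, rather than accuracy alone.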
PRESISTANT: Learning based assistant for data pre-processing
Data pre-processing is one of the most time-consuming and relevant steps in a data analysis process (e.g., a classification task). A given pre-processing operator (e.g., a transformation) can have a positive, negative, or zero impact on the final result of the analysis. Expert users have the knowledge required to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of pre-processing operators, and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim to assist non-expert users by recommending data pre-processing operators ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of five different classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks
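The meta-learning idea behind the tool, predicting an operator's impact from dataset characteristics and ranking candidates by predicted gain, can be illustrated with a toy sketch. The meta-features, operator ids, and training targets below are synthetic placeholders, not PRESISTANT's real meta-dataset or feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical meta-dataset: each row pairs dataset meta-features
# (e.g. n_rows, n_features, class entropy, ...) with a pre-processing
# operator id; the target is the observed change in predictive accuracy.
meta_features = rng.random((200, 4))
operator_ids = rng.integers(0, 5, size=(200, 1))  # e.g. 0=discretize, 1=normalize, ...
X_meta = np.hstack([meta_features, operator_ids])
accuracy_gain = rng.normal(0, 0.05, 200)  # synthetic targets, for illustration only

# A Random Forest learns operator impact from past runs, as in the paper.
ranker = RandomForestRegressor(random_state=0).fit(X_meta, accuracy_gain)

# For a new dataset, score every candidate operator and rank by predicted gain.
new_meta = rng.random(4)
candidates = np.array([np.append(new_meta, op) for op in range(5)])
ranking = np.argsort(ranker.predict(candidates))[::-1]
print("operators ranked by predicted accuracy gain:", ranking)
```

In the real tool, the regression targets come from actually executing operators and classifiers on many datasets, so the ranking reflects learned experience rather than random noise.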
- …