5 research outputs found
Perceptron learning with random coordinate descent
A perceptron is a linear threshold classifier that separates examples with a hyperplane. It is perhaps the simplest learning model that is used standalone. In this paper, we propose a family of random coordinate descent algorithms for perceptron learning on binary classification problems. Unlike most perceptron learning algorithms, which require smooth cost functions, our algorithms directly minimize the training error and usually achieve the lowest training error compared with other algorithms. The algorithms are also computationally efficient. Such advantages make them favorable for both standalone use and ensemble learning on problems that are not linearly separable. Experiments show that our algorithms work very well with AdaBoost and achieve the lowest test errors for half of the datasets.
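The idea of minimizing the training error directly along one randomly chosen coordinate can be illustrated with a small sketch. This is not the paper's algorithm family, whose details the abstract does not give; it is a simplified illustration under stated assumptions: labels are +1/-1, a constant bias feature is already included in each example, and the names `training_error` and `rcd_perceptron` are hypothetical. Along a single coordinate the 0/1 error is piecewise constant, so it is enough to test one step size inside each interval between the points where some example's score crosses zero.

```python
import random


def training_error(w, X, y):
    """Fraction of examples misclassified by the rule sign(w . x)."""
    wrong = 0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi))
        pred = 1 if score > 0 else -1
        wrong += pred != yi
    return wrong / len(X)


def rcd_perceptron(X, y, iterations=200, seed=0):
    """Illustrative random coordinate descent on the 0/1 training error."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    best = training_error(w, X, y)
    for _ in range(iterations):
        i = rng.randrange(d)  # pick one coordinate at random
        # Step sizes at which some example's score crosses zero.
        crossings = set()
        for xi in X:
            if xi[i] != 0:
                score = sum(wj * xj for wj, xj in zip(w, xi))
                crossings.add(-score / xi[i])
        if not crossings:
            continue
        points = sorted(crossings)
        # One candidate per interval: beyond each end, and each midpoint.
        candidates = [points[0] - 1.0, points[-1] + 1.0]
        candidates += [(a + b) / 2 for a, b in zip(points, points[1:])]
        base = list(w)
        for step in candidates:
            trial = list(base)
            trial[i] += step
            err = training_error(trial, X, y)
            if err < best:  # accept only strict improvements
                best, w = err, trial
    return w, best
```

Because a step is accepted only when it lowers the error, the training error never increases from one iteration to the next. The exhaustive evaluation of every candidate is quadratic in the number of examples and is meant for clarity, not efficiency.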
Data complexity in machine learning
We investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by the data set. It is closely related to several existing principles used in machine learning, such as Occam's razor, the minimum description length, and the Bayesian approach. The data complexity can also be defined based on a learning model, which is more realistic for applications. We demonstrate the application of the data complexity in two learning problems: data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets, which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and we propose methods for estimating the complexity contribution. Since in practice we have to approximate the ideal data complexity measures, we also discuss the impact of such approximations.
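Kolmogorov complexity is not computable, so any practical measure has to be an approximation. The sketch below uses the compressed size of a label sequence as a crude, computable stand-in. This is an assumption made for illustration, not the estimator proposed in the paper, and the function names `description_length` and `complexity_contribution` are hypothetical. It also assumes the examples are listed in a fixed order of their inputs, so that the label sequence encodes the mapping.

```python
import zlib


def description_length(labels):
    """Compressed size in bytes of a binary label sequence.

    A rough upper-bound proxy for complexity; not the paper's measure.
    """
    data = "".join(str(int(v)) for v in labels).encode("ascii")
    return len(zlib.compress(data, 9))


def complexity_contribution(labels, index):
    """Change in the proxy when the example at `index` is removed."""
    reduced = labels[:index] + labels[index + 1:]
    return description_length(labels) - description_length(reduced)
```

A regular labeling such as alternating 0 and 1 compresses to far fewer bytes than a random labeling of the same length, which mirrors the intuition that simple mappings have low complexity. On small data sets the proxy is noisy, since compressor overhead dominates.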
Properties and identification of antibiotic drug targets
Background: We analysed 48 non-redundant antibiotic target proteins from all bacteria, 22 antibiotic target proteins from E. coli only and 4243 non-drug targets from E. coli to identify differences in their properties and to predict new potential drug targets. Results: When compared to non-targets, bacterial antibiotic targets tend to be long, have high β-sheet and low α-helix contents, are polar, are found in the cytoplasm rather than in membranes, and are usually enzymes, with ligases particularly favoured. Sequence features were used to build a support vector machine model for E. coli proteins, allowing the assignment of any sequence to the drug target or non-target classes, with an accuracy in the training set of 94%. We identified 319 proteins (7%) in the non-target set that have target-like properties, many of which have unknown function. 63 of these proteins have significant and undesirable similarity to a human protein, leaving 256 target-like proteins that are not present in humans. Conclusions: We suggest that antibiotic discovery programs would be more likely to succeed if new targets are chosen from this set of target-like proteins or their homologues. In particular, 64 are essential genes where the cell is not able to recover from a random insertion disruption.
Prediction of Solution Color (ICUMSA) and Grain Size (BJB) Values to Determine Sugar Quality Using the Support Vector Machine Method (Case Study: PT Pabrik Gula Rajawali I Surabaya)
Sugar is a commodity that we encounter every day. It is used to sweeten foods and drinks, not only in households but also widely in industry, especially the food and beverage industry. Given this use in both households and industry, it is no surprise that sugar consumption in Indonesia is large. As an ingredient in food and beverage products consumed by the public, sugar must meet certain quality standards to be fit for consumption. To this end, the government, through the National Standardization Agency (Badan Standardisasi Nasional), has set standards for sugar quality.
PT. PG Rajawali I Surabaya is one of the factories that produce sugar. To test the quality of its sugar, the company relies on a third party located outside Surabaya. This makes quality testing difficult for the company, because it is expensive and takes a long time.
To overcome these problems, the company can predict the quality of its sugar itself. Because the prediction can be done in-house, the time required is shorter, so production can be evaluated promptly if its quality is low. In this final project, the Support Vector Machine (SVM) method is used to predict the quality of the sugar produced at PT. PG Rajawali I Surabaya. SVM is a method that can be used for prediction by finding a hyperplane from the available data. The predictions can be optimized by tuning the parameters that affect them. The best model on the training data is determined by the root mean square error (RMSE) and the absolute error of the predictions, whereas the best model on the testing data is determined by the resulting mean absolute percentage error (MAPE).
In this research, three quality classes of sugar are produced: GKP 1, GKP 2, and sugar that belongs to neither GKP 1 nor GKP 2 (undefined). The production process strongly influences which quality class the sugar falls into: each stage of production has several parameters that must be met for the sugar to reach the required quality standard.
For the solution color (ICUMSA) testing data, the model with the best MAPE uses the Radial kernel with C = 18.65 and gamma = 0.045, giving a MAPE of 31%, which falls in the fair category. For the grain size (BJB) testing data, the best model gives a MAPE of 8%, which falls in the very good category; the reported settings are C = 1 for the Dot kernel, and C = 0.675 and gamma = 8.65 for the Radial kernel. The best classification accuracy for sugar quality is 73.33%, obtained with the Dot kernel.
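The two error measures named in this abstract, RMSE and MAPE, can be computed as in the following minimal sketch. It uses the standard textbook definitions and assumes that every actual value is nonzero, since MAPE divides by it. The function names are illustrative and do not come from the thesis.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values nonzero)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)
```

For example, with actual values 100 and 200 and predictions 110 and 180, each prediction is off by 10% of the actual value, so the MAPE is 10%, while the RMSE is the square root of (100 + 400) / 2 = 250.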
Improving generalization by data categorization
In most learning algorithms, examples in the training set are treated equally. Some examples, however, carry more reliable or critical information about the target than others, and some may carry wrong information. According to their intrinsic margin, examples can be grouped into three categories: typical, critical, and noisy. We propose three methods, namely the selection cost, SVM confidence margin, and AdaBoost data weight, to automatically group training examples into these three categories. Experimental results on artificial datasets show that, although the three methods are quite different in nature, they give similar and reasonable categorizations. Results with real-world datasets further demonstrate that treating the three data categories differently in learning can improve generalization.
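The grouping by margin can be illustrated with a short sketch. It assumes a margin estimate is already available from some linear scorer, and it uses two hypothetical thresholds, `low` and `high`, that are not taken from the paper. The paper's three estimators (selection cost, SVM confidence margin, AdaBoost data weight) are not reproduced here; the sketch only shows the final thresholding step.

```python
def margin(w, b, x, y):
    """Signed margin y * (w . x + b) of one example with label y in {+1, -1}."""
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)


def categorize(margins, low=0.0, high=1.0):
    """Map each margin to 'noisy', 'critical', or 'typical'.

    The thresholds `low` and `high` are illustrative assumptions.
    """
    labels = []
    for m in margins:
        if m < low:
            labels.append("noisy")      # on the wrong side of the boundary
        elif m < high:
            labels.append("critical")   # correct but close to the boundary
        else:
            labels.append("typical")    # comfortably on the correct side
    return labels
```

A learner could then, for instance, drop or down-weight the examples labeled noisy and give extra weight to the critical ones; which treatment works best is an empirical question.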