5 research outputs found

Perceptron learning with random coordinate descent

Author: Li Ling
Publication venue: 'California Institute of Technology Library'
Publication date: 01/01/2005

A perceptron is a linear threshold classifier that separates examples with a hyperplane. It is perhaps the simplest learning model that is used standalone. In this paper, we propose a family of random coordinate descent algorithms for perceptron learning on binary classification problems. Unlike most perceptron learning algorithms, which require smooth cost functions, our algorithms directly minimize the training error, and usually achieve the lowest training error compared with other algorithms. The algorithms are also computationally efficient. Such advantages make them favorable for both standalone use and ensemble learning, on problems that are not linearly separable. Experiments show that our algorithms work very well with AdaBoost, and achieve the lowest test errors for half of the datasets.
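The idea of minimizing the 0/1 training error directly along one coordinate at a time can be sketched as follows. This is a simplified illustration, not the authors' algorithm: the function names, the toy dataset, the use of axis-aligned directions only, and the exhaustive scan over candidate step sizes are all assumptions made for this example.

```python
import random

def predict(w, x):
    """Linear threshold classifier: +1 if w.x > 0, otherwise -1."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else -1

def training_error(w, X, y):
    """Fraction of misclassified examples (0/1 loss, no smoothing)."""
    return sum(predict(w, x) != t for x, t in zip(X, y)) / len(X)

def rcd_perceptron(X, y, iters=200, seed=0):
    """Hypothetical random coordinate descent on the 0/1 training error."""
    rng = random.Random(seed)
    dim = len(X[0])
    w = [0.0] * dim
    best = training_error(w, X, y)
    for _ in range(iters):
        i = rng.randrange(dim)  # pick a random coordinate direction
        # Step sizes at which some example crosses the hyperplane.
        crossings = sorted(
            -sum(wj * xj for wj, xj in zip(w, x)) / x[i]
            for x in X if x[i] != 0
        )
        if not crossings:
            continue
        # The error is constant between crossings, so one candidate
        # per interval (plus one beyond each end) is enough.
        candidates = [crossings[0] - 1.0, crossings[-1] + 1.0]
        candidates += [(a + b) / 2 for a, b in zip(crossings, crossings[1:])]
        base = w  # all candidates are steps from the same starting point
        for alpha in candidates:
            trial = base[:]
            trial[i] += alpha
            err = training_error(trial, X, y)
            if err < best:  # accept only strict improvements
                best, w = err, trial
    return w, best

# Toy data (made up); the first feature is a constant 1.0 acting as a bias.
X_demo = [[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -0.5], [1.0, -2.0, -1.0]]
y_demo = [1, 1, -1, -1]
w_demo, err_demo = rcd_perceptron(X_demo, y_demo)
```

Because a step is accepted only when it strictly lowers the error, the training error never increases from one iteration to the next.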

CiteSeerX

Caltech Authors

Data complexity in machine learning

Author: Abu-Mostafa Yaser S.
Li Ling
Publication venue: 'California Institute of Technology Library'
Publication date: 26/05/2006

We investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by the data set. It is closely related to several existing principles used in machine learning such as Occam's razor, the minimum description length, and the Bayesian approach. The data complexity can also be defined based on a learning model, which is more realistic for applications. We demonstrate the application of the data complexity in two learning problems, data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets, which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and propose methods for estimating the complexity contribution. Since in practice we have to approximate the ideal data complexity measures, we also discuss the impact of such approximations.

Caltech Authors

Properties and identification of antibiotic drug targets

Author: A Krogh
A Maxwell
A Yonath
Andrew J Doig
AP Carter
BG Spratt
DE Brodersen
DS Wishart
ED Brown
F Schlunzen
FC Tenover
GL Wang
J Kim
J Kyte
JA Cuff
JC Wootton
JD Bendtsen
K Julenius
KR Sakharkar
L Li
LJ Jensen
M Ashburner
MC McManus
MW Vetting
P Hu
REW Hancock
S Rey
SF Altschul
SY Gerdes
T Izard
T Nakama
Tala M Bakheet
TM Bakheet
VN Vapnik
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: We analysed 48 non-redundant antibiotic target proteins from all bacteria, 22 antibiotic target proteins from E. coli only and 4243 non-drug targets from E. coli to identify differences in their properties and to predict new potential drug targets. Results: When compared to non-targets, bacterial antibiotic targets tend to be long, have high β-sheet and low α-helix contents, are polar, are found in the cytoplasm rather than in membranes, and are usually enzymes, with ligases particularly favoured. Sequence features were used to build a support vector machine model for E. coli proteins, allowing the assignment of any sequence to the drug target or non-target classes, with an accuracy in the training set of 94%. We identified 319 proteins (7%) in the non-target set that have target-like properties, many of which have unknown function. 63 of these proteins have significant and undesirable similarity to a human protein, leaving 256 target-like proteins that are not present in humans. Conclusions: We suggest that antibiotic discovery programs would be more likely to succeed if new targets are chosen from this set of target-like proteins or their homologues. In particular, 64 are essential genes where the cell is not able to recover from a random insertion disruption.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The University of Manchester - Institutional Repository

Prediction of Solution Color (ICUMSA) and Grain Size (BJB) Values to Determine Sugar Quality Using the Support Vector Machine Method (Case Study: PT Pabrik Gula Rajawali I Surabaya)

Author: Ariani Ria Widiya
Publication date: 01/01/2018

Sugar is a commodity that we encounter every day. It is used to add sweetness to foods and drinks, not only in households but also widely in industry, especially the food and beverage industry. Given this use in both households and industry, it is not surprising that sugar consumption in Indonesia is large. As an ingredient in food and beverage products consumed by the public, sugar must meet certain quality standards to be fit for consumption. For this purpose, the government, through the National Standardization Agency, has set standards for sugar quality. PT. PG Rajawali I Surabaya is one of the factories that produce sugar. To test the quality of its sugar, the company relies on a third party located outside Surabaya. This makes quality testing difficult for the company, because it is expensive and takes a long time. To overcome these problems, the company can predict the quality of its sugar itself. Because the prediction can be done in-house, the time required is shorter, so an evaluation can be carried out immediately if the quality of production is low. In this final project, the Support Vector Machine (SVM) method is used to predict the quality of sugar produced at PT. PG Rajawali I Surabaya. SVM is a method that can be used for prediction by finding a hyperplane from the available data. The predictions can be optimized by tuning the parameters that affect them. The best model for the training data is determined by the root mean square error (RMSE) and the absolute error of the predictions, whereas the best model for testing is determined by the resulting MAPE. In this research, three sugar quality grades are considered: GKP 1, GKP 2, and sugar that belongs to neither GKP 1 nor GKP 2 (undefined). The production process strongly influences the quality grade: each production stage has several parameters that must be met for the sugar to reach the required quality standard. The best model for the solution color (ICUMSA) testing data uses the Radial kernel with C = 18.65 and gamma = 0.045, giving a MAPE of 31%, which falls in the fairly good category. The best model for the BJB testing data gives a MAPE of 8%, which falls in the very good category. For the Dot kernel, C = 1; for the Radial kernel, C = 0.675 and gamma = 8.65. The best classification accuracy for sugar quality is 73.33%, obtained with the Dot kernel.
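The error measures named in the abstract above, RMSE and MAPE, can be computed as in the minimal sketch below. It uses the standard textbook definitions; the function names and the sample numbers are assumptions for illustration and are not taken from the thesis.

```python
import math

def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared difference."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is nonzero, since each error is
    divided by the corresponding actual value.
    """
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

# Made-up measurements and predictions, for illustration only.
actual_demo = [100.0, 200.0]
predicted_demo = [110.0, 180.0]
rmse_demo = rmse(actual_demo, predicted_demo)
mape_demo = mape(actual_demo, predicted_demo)
```

MAPE is scale-free, which is one plausible reason to use it when comparing models for quantities measured in different units, such as a color value and a grain size.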

ITS Repository

Improving generalization by data categorization

Author: Amrit Pratap
Hsuan-tien Lin
Ling Li
Yaser S. Abu-Mostafa
Publication venue: Springer-Verlag
Publication date: 01/01/2005

Abstract. In most learning algorithms, examples in the training set are treated equally. Some examples, however, carry more reliable or critical information about the target than the others, and some may carry wrong information. According to their intrinsic margin, examples can be grouped into three categories: typical, critical, and noisy. We propose three methods, namely the selection cost, SVM confidence margin, and AdaBoost data weight, to automatically group training examples into these three categories. Experimental results on artificial datasets show that, although the three methods are quite different in nature, they give similar and reasonable categorization. Results with real-world datasets further demonstrate that treating the three data categories differently in learning can improve generalization.
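As a rough illustration of grouping examples by a margin estimate, the sketch below assigns one of the three category names from the abstract above. The rule itself is an assumption made for this example: the abstract does not state how the margin maps to categories, so the sign test, the `threshold` parameter, and its default value are all hypothetical.

```python
def categorize(margin, threshold=0.5):
    """Assign a category from a margin estimate (hypothetical rule).

    Assumed mapping, not taken from the paper:
      margin < 0              -> "noisy"    (example disagrees with the target)
      0 <= margin < threshold -> "critical" (example lies close to the boundary)
      margin >= threshold     -> "typical"  (example lies well inside its class)
    """
    if margin < 0:
        return "noisy"
    if margin < threshold:
        return "critical"
    return "typical"

# Made-up margin estimates, for illustration only.
margins_demo = [-0.3, 0.1, 0.9]
categories_demo = [categorize(m) for m in margins_demo]
```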

CiteSeerX

Crossref

Caltech Authors