Digging into acceptor splice site prediction : an iterative feature selection approach

Author: A.I. Blum
A.K. Jain
C. Mathé
D. Mladenić
E. Alpaydin
G.R. Harik
H. Mühlenbein
I. Guyon
I. Guyon
J. Weston
M. Kudo
M. Pertea
P. Larrañaga
R. Kohavi
R.O. Duda
S. Degroeve
T. Joachims
X. Zhang
Y. Saeys
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2004

Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps, improving the classification of acceptor splice sites, an important subtask of gene prediction. We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights into the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature. The results described in this paper contribute both to the domain of gene prediction and to research in feature selection techniques, describing a new wrapper-based feature weighting method that aids in knowledge discovery when dealing with complex datasets.

Crossref

Ghent University Academic Bibliography

Learning Moore Machines from Input-Output Traces

Author: A Gupta
A Solar-Lezama
AV Aleksandrov
AW Biermann
B Jonsson
C Higuera de la
CL Heitmeyer
D Angluin
D Lee
EM Gold
EM Gold
F Aarts
F Aarts
F Howar
HI Akram
IP Buzhinsky
J Oncina
K Meinke
K Takahashi
KJ Lang
LPJ Veelenturf
M Shahbaz
M Spichakova
MA Colón
MJ Heule
O Grinchtein
P Dupont
R Alur
R Dorofeeva
S Cassel
T Berg
TM Mitchell
TS Chow
V Ulyantsev
X Jin
Z Kohavi
Publication date: 02/09/2016

The problem of learning automata from example traces (but no equivalence or membership queries) is fundamental in automata learning theory and practice. In this paper we study this problem for finite state machines with inputs and outputs, and in particular for Moore machines. We develop three algorithms for solving this problem: (1) the PTAP algorithm, which transforms a set of input-output traces into an incomplete Moore machine and then completes the machine with self-loops; (2) the PRPNI algorithm, which uses the well-known RPNI algorithm for automata learning to learn a product of automata encoding a Moore machine; and (3) the MooreMI algorithm, which directly learns a Moore machine using PTAP extended with state merging. We prove that MooreMI has the fundamental identification in the limit property. We also compare the algorithms experimentally in terms of the size of the learned machine and several notions of accuracy, introduced in this paper. Finally, we compare with OSTIA, an algorithm that learns a more general class of transducers, and find that OSTIA generally does not learn a Moore machine, even when fed with a characteristic sample
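The PTAP step described in the abstract (build an incomplete Moore machine from traces, then complete it with self-loops) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the trace encoding (equal-length input and output lists, where the k-th output belongs to the state reached after the k-th input), the choice to leave the initial state's output as `None`, and the names `build_ptap` and `run` are all assumptions made for this example.

```python
def build_ptap(traces):
    """Build a prefix-tree Moore machine from traces, then add self-loops.

    Assumed encoding: each trace is (inputs, outputs) with equal lengths;
    outputs[k] is the output of the state reached after inputs[k].
    State 0 is the initial state; its output is unknown (None) here.
    """
    delta = {}            # (state, input symbol) -> next state
    output = {0: None}    # state -> output symbol
    alphabet = set()
    next_state = 1
    for inputs, outputs in traces:
        if len(inputs) != len(outputs):
            raise ValueError("input and output words must have equal length")
        state = 0
        for symbol, out in zip(inputs, outputs):
            alphabet.add(symbol)
            if (state, symbol) not in delta:
                # Unseen prefix: create a fresh tree node for it.
                delta[(state, symbol)] = next_state
                output[next_state] = out
                next_state += 1
            state = delta[(state, symbol)]
            if output[state] != out:
                raise ValueError("traces are inconsistent")
    # Completion: every missing transition becomes a self-loop.
    for state in list(output):
        for symbol in alphabet:
            delta.setdefault((state, symbol), state)
    return delta, output


def run(delta, output, inputs):
    """Return the output word the machine produces for an input word."""
    state, produced = 0, []
    for symbol in inputs:
        state = delta[(state, symbol)]
        produced.append(output[state])
    return produced


traces = [(["a", "b"], [0, 1]), (["a", "a"], [0, 0])]
delta, output = build_ptap(traces)
```

With these two traces the tree has four states, and an input word that runs past the observed prefixes simply stays in the last tree node because of the self-loops.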

arXiv.org e-Print Archive

Crossref

Preceding rule induction with instance reduction methods

Author: A. Lukasz
D. Gamberger
D.L. Wilson
D.R. Wilsson
D.R. Wilsson
D.T. Pham
D.W. Aha
G.L. Ritter
G.W. Gates
I. Tomek
J. Fürnkranz
K. Grudzinski
K. Grudziński
K. Hindi El
K.P. Zhao
O. Othman
P. Clark
P. Clark
P.E. Hart
R. Kohavi
R. Schapire
S. Weiss
T.M. Mitchell
W. Cohen
W. Cohen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

A new prepruning technique for rule induction is presented which applies instance reduction before rule induction. An empirical evaluation records the predictive accuracy and size of rule-sets generated from 24 datasets from the UCI Machine Learning Repository. Three instance reduction algorithms (Edited Nearest Neighbour, AllKnn and DROP5) are compared. Each one is used to reduce the size of the training set, prior to inducing a set of rules using Clark and Boswell's modification of CN2. A hybrid instance reduction algorithm (comprised of AllKnn and DROP5) is also tested. For most of the datasets, pruning the training set using ENN, AllKnn or the hybrid significantly reduces the number of rules generated by CN2, without adversely affecting the predictive performance. The hybrid achieves the highest average predictive accuracy
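To illustrate the kind of instance reduction applied before rule induction, here is a minimal sketch of Edited Nearest Neighbour as the rule is commonly described: an instance is dropped when its label disagrees with the majority label of its k nearest neighbours. The value k=3, the Euclidean distance, the toy data and the function name `edited_nearest_neighbour` are assumptions for this example; the abstract does not state the settings used in the paper.

```python
import math
from collections import Counter


def edited_nearest_neighbour(X, y, k=3):
    """Return indices of instances kept by ENN editing.

    An instance is kept only if the majority label among its k nearest
    neighbours (Euclidean distance, excluding itself) matches its own label.
    """
    keep = []
    for i, xi in enumerate(X):
        # Distances from instance i to every other instance, nearest first.
        neighbours = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        votes = Counter(y[j] for _, j in neighbours[:k])
        majority_label = votes.most_common(1)[0][0]
        if majority_label == y[i]:
            keep.append(i)
    return keep


# Two clean clusters plus one "b" instance sitting inside the "a" cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (5, 5), (5, 6), (6, 5), (6, 6),
     (0.5, 0.5)]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
kept = edited_nearest_neighbour(X, y)
```

The reduced training set `[X[i] for i in kept]` would then be passed to the rule learner; the noisy instance at index 8 is the only one removed in this toy example.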

CiteSeerX

University of Salford Institutional Repository

Crossref

Building cloud applications for challenged networks

Author: B Sat
D Merkel
E Brewer
FF-H Nah
J Semke
K Fall
K Winstein
M Cavage
M Claypool
R Kohavi
S Soltesz
T Lakshman
Publication venue: Springer Verlag
Publication date: 21/11/2015

Cloud computing has seen vast advancements and uptake in many parts of the world. However, many of the design patterns and deployment models are not very suitable for locations with challenged networks such as countries with no nearby datacenters. This paper describes the problem and discusses the options available for such locations, focusing specifically on community clouds as a short-term solution. The paper highlights the impact of recent trends in the development of cloud applications and how changing these could better help deployment in challenged networks. The paper also outlines the consequent challenges in bridging different cloud deployments, also known as cross-cloud computing

Crossref

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Lancaster E-Prints

Fairness in Algorithmic Decision Making: An Excursion Through the Lens of Causality

Author: Barabas C.
Barocas S.
Grgic-Hlaca N.
Hardt M.
Hernan M. A.
Kamiran F.
Kamishima T.
Kilbertus N.
Kohavi R.
Kusner M. J.
Li J.
Nabi R.
Rosenbaum P. R.
Rosenbaum P. R.
Rubin D. B.
Russell C.
van der Wal W. M.
Zafar M. B.
Zemel R.
Zhang J.
Zhang L.
Zhang L.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019

As virtually all aspects of our lives are increasingly impacted by algorithmic decision making systems, it is incumbent upon us as a society to ensure such systems do not become instruments of unfair discrimination on the basis of gender, race, ethnicity, religion, etc. We consider the problem of determining whether the decisions made by such systems are discriminatory, through the lens of causal models. We introduce two definitions of group fairness grounded in causality: fair on average causal effect (FACE), and fair on average causal effect on the treated (FACT). We use the Rubin-Neyman potential outcomes framework for the analysis of cause-effect relationships to robustly estimate FACE and FACT. We demonstrate the effectiveness of our proposed approach on synthetic data. Our analyses of two real-world data sets, the Adult income data set from the UCI repository (with gender as the protected attribute), and the NYC Stop and Frisk data set (with race as the protected attribute), show that the evidence of discrimination obtained by FACE and FACT, or lack thereof, is often in agreement with the findings from other studies. We further show that FACT, being somewhat more nuanced compared to FACE, can yield findings of discrimination that differ from those obtained using FACE. Comment: 7 pages, 2 figures, 2 tables. To appear in Proceedings of the International Conference on World Wide Web (WWW), 201

arXiv.org e-Print Archive

Crossref

SNU Open Repository and Archive

The identification of informative genes from multiple datasets with increasing complexity

Author: AH Fielding
Allan Tucker
BC Haynes
C Zhang
D Grossman
D Heckerman
D Madigan
DM Chickering
DR Rhodes
E Segal
G Schwarz
H Ma
J Bockhorst
J Pearl
J Su
JB Tobler
JM Peña
KK Tomczak
KP Murphy
M Miron
M Stone
N Friedman
N Friedman
N Friedman
Peter AC 't Hoen
R Jelier
R Kohavi
R Mac Nally
RA Irizarry
S Iezzi
S Yahya Anvar
SS Shen-Orr
TI Lee
TVan den Bulcke
W Lam
WL Buntine
X Xu
Y Cao
Y Lai
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Background: In microarray data analysis, factors such as data quality, biological variation, and the increasingly multi-layered nature of more complex biological systems complicate the modelling of regulatory networks that can represent and capture the interactions among genes. We believe that the use of multiple datasets derived from related biological systems leads to more robust models. Therefore, we developed a novel framework for modelling regulatory networks that involves training and evaluation on independent datasets. Our approach includes the following steps: (1) ordering the datasets based on their level of noise and informativeness; (2) selection of a Bayesian classifier with an appropriate level of complexity by evaluation of predictive performance on independent data sets; (3) comparing the different gene selections and the influence of increasing the model complexity; (4) functional analysis of the informative genes. Results: In this paper, we identify the most appropriate model complexity using cross-validation and independent test set validation for predicting gene expression in three published datasets related to myogenesis and muscle differentiation. Furthermore, we demonstrate that models trained on simpler datasets can be used to identify interactions among genes and select the most informative genes. We also show that these models can explain the myogenesis-related genes (genes of interest) significantly better than others (P < 0.004) since the improvement in their rankings is much more pronounced. Finally, after further evaluating our results on synthetic datasets, we show that our approach outperforms a concordance method by Lai et al. in identifying informative genes from multiple datasets with increasing complexity whilst additionally modelling the interaction between genes. Conclusions: We show that Bayesian networks derived from simpler controlled systems have better performance than those trained on datasets from more complex biological systems. Further, we show that highly predictive and consistent genes, from the pool of differentially expressed genes, across independent datasets are more likely to be fundamentally involved in the biological process under study. We conclude that networks trained on simpler controlled systems, such as in vitro experiments, can be used to model and capture interactions among genes in more complex datasets, such as in vivo experiments, where these interactions would otherwise be concealed by a multitude of other ongoing events.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Leiden University Scholary Publications

Brunel University Research Archive

P2P Lending Analysis Using the Most Relevant Graph-Based Features

Author: D Zhang
DJ Hand
I Guyon
I-C Yeh
J Han
J Pohjalainen
JM Sotoca
JN Crook
L Bai
L Yu
M Last
M Malekipirbazari
P Hájek
R Kohavi
WH Press
Y Chen
Y Guo
Y Huang
Y Saeys
Z Zhao
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Crossref

White Rose Research Online

Prospects for Genomic Selection in Cassava Breeding

Author: Alfred A. Ozimati
Bates D.
Chiedozie Egesi
Dunia Pino Del Carpio
Esuma Williams
Howeler R.
Ismail S. Kayondo
Ismail Y. Rabbi
Jean‐Luc Jannink
Kohavi R.
Liaw A.
Lydia C. Ezenwaka
Lynch M.
Marnin D. Wolfe
Meuwissen T.H.
Morota G.
Olumide Alabi
Perez‐Rodriguez P.
Peter Kulakow
Plummer M.
Robert S. Kawuki
Roberto Lozano
Uche G. Okeke
Ugochukwu N. Ikeogu
Publication venue: 'Crop Science Society of America'
Publication date: 01/11/2017

Published online: 28 Sept 2017. Cassava (Manihot esculenta Crantz) is a clonally propagated staple food crop in the tropics. Genomic selection (GS) has been implemented at three breeding institutions in Africa to reduce cycle times. Initial studies provided promising estimates of predictive abilities. Here, we expand on previous analyses by assessing the accuracy of seven prediction models for seven traits in three prediction scenarios: cross-validation within populations, cross-population prediction and cross-generation prediction. We also evaluated the impact of increasing the training population (TP) size by phenotyping progenies selected either at random or with a genetic algorithm. Cross-validation results were mostly consistent across programs, with nonadditive models predicting 10% better on average. Cross-population accuracy was generally low (mean = 0.18) but prediction of cassava mosaic disease increased up to 57% in one Nigerian population when data from another related population were combined. Accuracy across generations was poorer than within-generation accuracy, as expected, but accuracy for dry matter content and mosaic disease severity should be sufficient for rapid-cycling GS. Selection of a prediction model made some difference across generations, but increasing TP size was more important. With a genetic algorithm, selection of one-third of progeny could achieve an accuracy equivalent to phenotyping all progeny. We are in the early stages of GS for this crop but the results are promising for some traits. General guidelines that are emerging are that TPs need to continue to grow but phenotyping can be done on a cleverly selected subset of individuals, reducing the overall phenotyping burden.

Crossref

Directory of Open Access Journals

CGSpace (CGIAR)

Team diversity and categorization salience : capturing diversity-blind, intergroup biased, and multicultural perceptions

Author: Bass B. M.
Blau P. M.
Cameron A. C.
Costa P. T.
Daan van Knippenberg
Fiske S. T.
Harrison D. A.
Kenny D. A.
Kohavi R.
Kotsiantis S. B.
Lau D. C.
Laura Guillén
Law K. S.
Margarita Mayo
Pfeffer J.
Pierce J. R.
Quinlan J. R.
Randenbush S. W.
Rosch R.
Shainaz Firfiray
Thompson B.
Turner J. C.
Turner J. C.
Van de Ven A. H.
van Knippenberg D.
Weick K. E.
Williams K
Publication venue: 'SAGE Publications'
Publication date: 01/01/2016

It is increasingly recognized that team diversity with respect to various social categories (e.g., gender, race) does not automatically result in the cognitive activation of these categories (i.e., categorization salience), and that factors influencing this relationship are important for the effects of diversity. Thus, it is a methodological problem that no measurement technique is available to measure categorization salience in a way that efficiently applies to multiple dimensions of diversity in multiple combinations. Based on insights from artificial intelligence research, we propose a technique to capture the salience of different social categorizations in teams that does not prime the salience of these categories. We illustrate the importance of such measurement by showing how it may be used to distinguish among diversity-blind responses (low categorization salience), multicultural responses (positive responses to categorization salience), and intergroup biased responses (negative responses to categorization salience) in a study of gender and race diversity and the gender by race faultline in 38 manufacturing teams comprising 239 members

Crossref

EUR Research Repository

Warwick Research Archives Portal Repository

Supervised machine learning algorithms can classify open-text feedback of doctor performance with human-level accuracy

Author: Birbeck GL
Blei D
Campbell JL
Chris Gibbons
Efron B
Friedman J
Hastie T
Holsti O
John Campbell
Jose Maria Valderas
Jurka T
Kohavi R
Liaw A
Peters A
Samejima F
Suzanne Richards
Publication venue: 'JMIR Publications Inc.'
Publication date: 15/03/2017

Background: Machine learning techniques may be an effective and efficient way to classify open-text reports on doctor’s activity for the purposes of quality assurance, safety, and continuing professional development. Objective: The objective of the study was to evaluate the accuracy of machine learning algorithms trained to classify open-text reports of doctor performance and to assess the potential for classifications to identify significant differences in doctors’ professional performance in the United Kingdom. Methods: We used 1636 open-text comments (34,283 words) relating to the performance of 548 doctors collected from a survey of clinicians’ colleagues using the General Medical Council Colleague Questionnaire (GMC-CQ). We coded 77.75% (1272/1636) of the comments into 5 global themes (innovation, interpersonal skills, popularity, professionalism, and respect) using a qualitative framework. We trained 8 machine learning algorithms to classify comments and assessed their performance using several training samples. We evaluated doctor performance using the GMC-CQ and compared scores between doctors with different classifications using t tests. Results: Individual algorithm performance was high (range F score=.68 to .83). Interrater agreement between the algorithms and the human coder was highest for codes relating to “popular” (recall=.97), “innovator” (recall=.98), and “respected” (recall=.87) codes and was lower for the “interpersonal” (recall=.80) and “professional” (recall=.82) codes. A 10-fold cross-validation demonstrated similar performance in each analysis. When combined into an ensemble of multiple algorithms, mean human-computer interrater agreement was .88. Comments that were classified as “respected,” “professional,” and “interpersonal” related to higher doctor scores on the GMC-CQ compared with comments that were not classified (P<.05). Conclusions: Machine learning algorithms can classify open-text feedback of doctor performance into multiple themes derived by human raters with high performance. Colleague open-text comments that signal respect, professionalism, and being interpersonal may be key indicators of doctor’s performance.

Crossref

PubMed Central

Open Research Exeter

White Rose Research Online