A System for Induction of Oblique Decision Trees
This article describes a new system for induction of oblique decision trees.
This system, OC1, combines deterministic hill-climbing with two forms of
randomization to find a good oblique split (in the form of a hyperplane) at
each node of a decision tree. Oblique decision tree methods are tuned
especially for domains in which the attributes are numeric, although they can
be adapted to symbolic or mixed symbolic/numeric attributes. We present
extensive empirical studies, using both real and artificial data, that analyze
OC1's ability to construct oblique trees that are smaller and more accurate
than their axis-parallel counterparts. We also examine the benefits of
randomization for the construction of oblique decision trees.
Comment: See http://www.jair.org/ for an online appendix and other files
accompanying this article.
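As a rough illustration of the idea, the sketch below hill-climbs the coefficients of a hyperplane split and restarts from random hyperplanes to escape local minima. The impurity measure (Gini), the perturbation scheme, and all data are illustrative assumptions, not OC1's actual procedure.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(w, b, X, y):
    """Weighted Gini impurity of the oblique split  w . x + b >= 0."""
    left = [yi for xi, yi in zip(X, y)
            if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0]
    right = [yi for xi, yi in zip(X, y)
             if sum(wj * xj for wj, xj in zip(w, xi)) + b < 0]
    n = len(y)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

def find_oblique_split(X, y, restarts=5, steps=50, seed=0):
    """Hill-climb hyperplane coefficients, restarting from random
    hyperplanes (the flavor of OC1, not its exact moves)."""
    rng = random.Random(seed)
    d = len(X[0])
    best_w, best_b, best_imp = None, None, float("inf")
    for _ in range(restarts):
        w = [rng.uniform(-1, 1) for _ in range(d)]
        b = rng.uniform(-1, 1)
        imp = split_impurity(w, b, X, y)
        for _ in range(steps):
            j = rng.randrange(d + 1)        # perturb one coefficient at a time
            delta = rng.uniform(-0.5, 0.5)
            if j < d:
                w2, b2 = w[:], b
                w2[j] += delta
            else:
                w2, b2 = w[:], b + delta
            imp2 = split_impurity(w2, b2, X, y)
            if imp2 < imp:                  # keep only improving moves
                w, b, imp = w2, b2, imp2
        if imp < best_imp:
            best_w, best_b, best_imp = w, b, imp
    return best_w, best_b, best_imp
```

On data separable only along a diagonal (say, class 1 when x0 + x1 >= 1), a single oblique split can reach zero impurity where no single axis-parallel split can, which is the motivation for searching over hyperplanes at all.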
Proteus: A Hierarchical Portfolio of Solvers and Transformations
In recent years, portfolio approaches to solving SAT problems and CSPs have
become increasingly common. There are also a number of different encodings for
representing CSPs as SAT instances. In this paper, we leverage advances in both
SAT and CSP solving to present a novel hierarchical portfolio-based approach to
CSP solving, which we call Proteus, that does not rely purely on CSP solvers.
Instead, it may decide that it is best to encode a CSP problem instance into
SAT, selecting an appropriate encoding and a corresponding SAT solver. Our
experimental evaluation used an instance of Proteus that involved four CSP
solvers, three SAT encodings, and six SAT solvers, evaluated on the most
challenging problem instances from the CSP solver competitions, involving
global and intensional constraints. We show that Proteus achieves significant
performance improvements by exploiting alternative viewpoints and solvers for
combinatorial problem solving.
Comment: 11th International Conference on Integration of AI and OR Techniques
in Constraint Programming for Combinatorial Optimization Problems. The final
publication is available at link.springer.com.
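The hierarchical selection idea can be sketched as per-instance algorithm selection over candidate pipelines: either solve as a CSP directly, or pick a SAT encoding plus a SAT solver. Everything below is an illustrative assumption rather than Proteus itself: the pipeline names are placeholders, and a k-nearest-neighbor selector over instance features stands in for the portfolio's actual prediction models.

```python
import math

# Candidate pipelines: a CSP solver directly, or an encoding + SAT solver.
# Names are illustrative placeholders, not the actual Proteus portfolio.
PIPELINES = [
    ("csp", "csp_solver_A"),
    ("csp", "csp_solver_B"),
    ("sat", "direct_encoding+sat_solver_X"),
    ("sat", "support_encoding+sat_solver_Y"),
]

def select_pipeline(features, training_data, k=3):
    """Per-instance selection sketch: among the k training instances whose
    feature vectors are nearest to this instance, pick the pipeline with
    the lowest average observed runtime."""
    nearest = sorted(
        training_data,
        key=lambda rec: math.dist(features, rec["features"]),
    )[:k]
    totals = {p: 0.0 for p in PIPELINES}
    for rec in nearest:
        for p in PIPELINES:
            totals[p] += rec["runtimes"][p]
    return min(totals, key=totals.get)
```

The key design point the abstract highlights survives even in this toy form: because SAT pipelines sit alongside CSP solvers in one candidate set, the selector is free to decide that translating an instance to SAT beats every native CSP solver on it.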
Towards the identification of essential genes using targeted genome sequencing and comparative analysis
BACKGROUND: The identification of genes essential for survival is of theoretical importance in the understanding of the minimal requirements for cellular life, and of practical importance in the identification of potential drug targets in novel pathogens. Given the great time and expense required for experimental studies aimed at constructing a catalog of essential genes in a given organism, a computational approach that could identify essential genes with high accuracy would be of great value. RESULTS: We gathered numerous features which could be generated automatically from genome sequence data and assessed their relationship to essentiality, and subsequently utilized machine learning to construct an integrated classifier of essential genes in both S. cerevisiae and E. coli. When looking at single features, phyletic retention, a measure of the number of organisms in which an ortholog is present, was the most predictive of essentiality. Furthermore, during construction of our phyletic retention feature, we explored, for the first time, the evolutionary relationships among the set of organisms in which the presence of a gene is most predictive of essentiality. We found that in both E. coli and S. cerevisiae the optimal sets always contain host-associated organisms with small genomes which are closely related to the reference. Using five optimally selected organisms, we were able to improve predictive accuracy as compared to using all available sequenced organisms. We hypothesize that the predictive power of these genomes is a consequence of the process of reductive evolution, by which many parasites and symbionts evolved their gene content. In addition, essentiality is measured in rich media, a condition which resembles the environments of these organisms in their hosts, where many nutrients are provided. Finally, we demonstrate that integration of our most highly predictive features using a probabilistic classifier resulted in accuracies surpassing any individual feature.
CONCLUSION: Using features obtainable directly from sequence data, we were able to construct a classifier which can predict essential genes with high accuracy. Furthermore, our analysis of the set of genomes in which the presence of a gene is most predictive of essentiality may suggest ways in which targeted sequencing can be used in the identification of essential genes. In summary, the methods presented here can reduce the time and money invested in essential gene identification by targeting for experimentation those genes predicted to be essential with high probability.
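A minimal sketch of the single most predictive feature described above: phyletic retention as the fraction of a chosen organism set carrying an ortholog, with a hypothetical threshold rule on top. The threshold and the data shapes are made-up assumptions; the paper's actual classifier integrates many features probabilistically.

```python
def phyletic_retention(gene, ortholog_tables):
    """Fraction of the chosen organisms that carry an ortholog of `gene`.
    `ortholog_tables` maps organism name -> set of reference genes that
    have an ortholog in that organism."""
    present = sum(1 for orths in ortholog_tables.values() if gene in orths)
    return present / len(ortholog_tables)

def predict_essential(gene, ortholog_tables, threshold=0.8):
    """Toy single-feature classifier: call a gene essential when its
    phyletic retention over the selected organisms passes a threshold
    (the threshold is an illustrative assumption)."""
    return phyletic_retention(gene, ortholog_tables) >= threshold
```

The abstract's point about organism choice maps directly onto the `ortholog_tables` argument: restricting it to a few well-chosen small-genome, host-associated relatives improved accuracy over using every sequenced organism.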
Network-Based Analysis of Affected Biological Processes in Type 2 Diabetes Models
Type 2 diabetes mellitus is a complex disorder associated with multiple genetic, epigenetic, developmental, and environmental factors. Animal models of type 2 diabetes differ based on diet, drug treatment, and gene knockouts, and yet all display the clinical hallmarks of hyperglycemia and insulin resistance in peripheral tissue. The recent advances in gene-expression microarray technologies present an unprecedented opportunity to study type 2 diabetes mellitus at a genome-wide scale and across different models. To date, a key challenge has been to identify the biological processes or signaling pathways that play significant roles in the disorder. Here, using a network-based analysis methodology, we identified two sets of genes, associated with insulin signaling and a network of nuclear receptors, which are recurrent in a statistically significant number of diabetes and insulin resistance models and transcriptionally altered across diverse tissue types. We additionally identified a network of protein–protein interactions between members from the two gene sets that may facilitate signaling between them. Taken together, the results illustrate the benefits of integrating high-throughput microarray studies, together with protein–protein interaction networks, in elucidating the underlying biological processes associated with a complex disorder
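One way to make "recurrent in a statistically significant number of models" concrete is a simple binomial recurrence test, sketched below under the simplifying assumptions that models are independent and that each gene is called altered with a fixed background rate. This is an illustration of the general idea, not the paper's exact statistical methodology.

```python
from math import comb

def recurrence_pvalue(k, m, p):
    """P(X >= k) for X ~ Binomial(m, p): the chance a gene is called
    altered in at least k of m independent models if each call occurs
    with background probability p."""
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k, m + 1))

def recurrent_genes(altered_by_model, min_models, per_model_rate, alpha=0.05):
    """Genes transcriptionally altered in at least `min_models` of the
    models, keeping only those whose recurrence is unlikely by chance."""
    m = len(altered_by_model)
    counts = {}
    for genes in altered_by_model:
        for g in genes:
            counts[g] = counts.get(g, 0) + 1
    return {
        g for g, k in counts.items()
        if k >= min_models and recurrence_pvalue(k, m, per_model_rate) < alpha
    }
```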
Integration of heterogeneous expression data sets extends the role of the retinol pathway in diabetes and insulin resistance
Motivation: Type 2 diabetes is a chronic metabolic disease that involves both environmental and genetic factors. To understand the genetics of type 2 diabetes and insulin resistance, the DIabetes Genome Anatomy Project (DGAP) was launched to profile gene expression in a variety of related animal models and human subjects. We asked whether these heterogeneous models can be integrated to provide consistent and robust biological insights into the biology of insulin resistance
Interpretable network propagation with application to expanding the repertoire of human proteins that interact with SARS-CoV-2
BACKGROUND: Network propagation has been widely used for nearly 20 years to predict gene functions and phenotypes. Despite the popularity of this approach, little attention has been paid to the question of provenance tracing in this context, e.g., determining how much any experimental observation in the input contributes to the score of every prediction. RESULTS: We design a network propagation framework with 2 novel components and apply it to predict human proteins that directly or indirectly interact with SARS-CoV-2 proteins. First, we trace the provenance of each prediction to its experimentally validated sources, which in our case are human proteins experimentally determined to interact with viral proteins. Second, we design a technique that helps to reduce the manual adjustment of parameters by users. We find that for every top-ranking prediction, the highest contribution to its score arises from a direct neighbor in a human protein-protein interaction network. We further analyze these results to develop functional insights on SARS-CoV-2 that expand on known biology such as the connection between endoplasmic reticulum stress, HSPA5, and anti-clotting agents. CONCLUSIONS: We examine how our provenance-tracing method can be generalized to a broad class of network-based algorithms. We provide a useful resource for the SARS-CoV-2 community that implicates many previously undocumented proteins with putative functional relationships to viral infection. This resource includes potential drugs that can be opportunistically repositioned to target these proteins. We also discuss how our overall framework can be extended to other, newly emerging viruses.
Funding: DBI-1759858, National Science Foundation; Boston University.
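The provenance idea rests on the linearity of propagation: if scores are computed separately from each experimental source, the full score decomposes exactly as the sum of per-source contributions. A minimal random-walk-with-restart sketch under assumed graph, weights, and parameters (the paper's framework is more general than this):

```python
def propagate(adj, seeds, alpha=0.5, iters=100):
    """Random-walk-with-restart scores: iterate
    s <- alpha * W s + (1 - alpha) * seeds
    to (near) convergence, where W is the column-normalized adjacency.
    `adj` maps node -> {neighbor: edge weight} (assumed symmetric)."""
    nodes = list(adj)
    deg = {u: sum(adj[u].values()) for u in nodes}
    s = {u: seeds.get(u, 0.0) for u in nodes}
    for _ in range(iters):
        s = {
            u: alpha * sum(adj[u][v] / deg[v] * s[v] for v in adj[u])
               + (1 - alpha) * seeds.get(u, 0.0)
            for u in nodes
        }
    return s

def provenance(adj, seed_nodes, alpha=0.5):
    """Per-source contributions: propagate each seed alone; by linearity
    the full score of every node is the sum over sources."""
    return {src: propagate(adj, {src: 1.0}, alpha) for src in seed_nodes}
```

This decomposition is what lets one ask, for any prediction, how much each experimentally validated source (here, each human protein known to bind a viral protein) contributed to its final score.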
On the speed of constraint propagation and the time complexity of arc consistency testing
Establishing arc consistency on two relational structures is one of the most
popular heuristics for the constraint satisfaction problem. We aim at
determining the time complexity of arc consistency testing. The input
structures G and H can be supposed to be connected colored graphs, as the
general problem reduces to this particular case. We first observe the upper
bound O(e(G) v(H) + v(G) e(H)), which implies the bound O(e(G) e(H)) in terms
of the number of edges and the bound O((v(G) + v(H))^3) in terms of the number
of vertices. We then show that both bounds are tight up to a constant factor as
long as an arc consistency algorithm is based on constraint propagation (like
any algorithm currently known).
Our argument for the lower bounds is based on examples of slow constraint
propagation. We measure the speed of constraint propagation observed on a pair
(G, H) by the size of a proof, in a natural combinatorial proof system, that
Spoiler wins the existential 2-pebble game on (G, H). The proof size is bounded
from below by the game length D(G, H), and a crucial ingredient of our analysis
is the existence of pairs (G, H) with D(G, H) = Omega(v(G) v(H)). We find one
such example among old benchmark instances for the arc consistency problem and
also suggest a new, different construction.
Comment: 19 pages, 5 figures
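Constraint propagation for arc consistency, as analyzed above, repeatedly deletes values that have lost all support until a fixed point is reached. Below is a minimal AC-3-style sketch of that propagation loop; it is the standard textbook algorithm, not the paper's specific constructions.

```python
from collections import deque

def ac3(domains, constraints):
    """AC-3 constraint propagation: delete unsupported values until a
    fixed point.  `constraints[(x, y)]` is the set of allowed
    (value_x, value_y) pairs; `domains` is pruned in place.
    Returns False as soon as some domain becomes empty."""
    queue = deque(constraints)
    while queue:
        x, y = queue.popleft()
        allowed = constraints[(x, y)]
        # values of x still supported by some value of y
        supported = {a for a in domains[x]
                     if any((a, b) in allowed for b in domains[y])}
        if supported != domains[x]:
            domains[x] = supported
            if not supported:
                return False          # wipeout: no arc-consistent solution
            # x's domain shrank, so revisit every arc pointing into x
            for (u, v) in constraints:
                if v == x:
                    queue.append((u, v))
    return True
```

The fixed point (the largest arc-consistent sub-domains) is unique regardless of the order in which arcs are processed; the paper's lower bounds concern how long any such propagation schedule must run on hard instances.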
Towards a better solution to the shortest common supersequence problem: the deposition and reduction algorithm
BACKGROUND: The problem of finding a Shortest Common Supersequence (SCS) of a set of sequences is an important problem with applications in many areas. It is a key problem in biological sequence analysis. The SCS problem is well known to be NP-complete, and many heuristic algorithms have been proposed. Some heuristics work well on a few long sequences (as in sequence comparison applications); others work well on many short sequences (as in oligo-array synthesis). Unfortunately, most do not work well on large SCS instances where there are many long sequences. RESULTS: In this paper, we present a Deposition and Reduction (DR) algorithm for solving large SCS instances of biological sequences. Our DR algorithm has two processes: a deposition process, which generates a small set of common supersequences, and a reduction process, which shortens these common supersequences by removing some characters while preserving the common-supersequence property. Our evaluation on simulated data and real DNA and protein sequences shows that our algorithm consistently produces the best results compared to many well-known heuristic algorithms, especially on large instances. CONCLUSION: Our DR algorithm provides a partial answer to the open problem of designing efficient heuristic algorithms for the SCS problem on many long sequences. Our algorithm has a bounded approximation ratio and is efficient in both running time and space, and our evaluation shows that it is practical even for SCS problems on many long sequences.
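The two-phase structure can be sketched as follows: a greedy deposition step that threads each sequence onto a growing supersequence, then a reduction pass that deletes characters whenever every input remains a subsequence. This is a simplified illustration of the deposition-and-reduction idea, not the paper's exact algorithm.

```python
def deposit(sequences):
    """Deposition sketch: greedily thread each sequence onto a growing
    supersequence, reusing matching characters where possible."""
    super_seq = []
    for seq in sequences:
        i = 0                              # current position in super_seq
        for ch in seq:
            # scan forward for a reusable occurrence of ch
            while i < len(super_seq) and super_seq[i] != ch:
                i += 1
            if i == len(super_seq):
                super_seq.append(ch)       # no reusable match; extend
            i += 1
    return "".join(super_seq)

def is_supersequence(sup, seq):
    """True if `seq` is a subsequence of `sup` (iterator-consuming idiom)."""
    it = iter(sup)
    return all(ch in it for ch in seq)

def reduce_seq(sup, sequences):
    """Reduction sketch: try deleting each character in turn; keep any
    deletion that preserves the common-supersequence property."""
    s = sup
    i = 0
    while i < len(s):
        cand = s[:i] + s[i + 1:]
        if all(is_supersequence(cand, q) for q in sequences):
            s = cand                       # deletion safe; retry same index
        else:
            i += 1
    return s
```

Even in this toy form the division of labor matches the abstract: deposition cheaply guarantees a valid common supersequence, and reduction then recovers much of the slack the greedy pass left behind.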