Tensorized LSSVMs for Multitask Regression
Multitask learning (MTL) can utilize the relatedness between multiple tasks
for performance improvement. The advent of multimodal data allows tasks to be
referenced by multiple indices. High-order tensors are capable of providing
efficient representations for such tasks, while preserving structural
task-relations. In this paper, a new MTL method is proposed by leveraging
low-rank tensor analysis and constructing tensorized Least Squares Support
Vector Machines, namely the tLSSVM-MTL, where multilinear modelling and its
nonlinear extensions can be flexibly applied. We employ a high-order tensor for
all the weights with each mode relating to an index and factorize it with CP
decomposition, assigning a shared factor for all tasks and retaining
task-specific latent factors along each index. Then an alternating algorithm is
derived for the nonconvex optimization, where each resulting subproblem is
solved by a linear system. Experimental results demonstrate promising
performance of our tLSSVM-MTL.
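The factorization described above can be sketched as follows. All shapes and the CP rank are hypothetical, and the sketch only shows how one shared factor and per-index latent factors compose into the full weight tensor; the alternating LS-SVM training itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m1, m2, R = 8, 3, 4, 2   # feature dim, two task indices, CP rank (made-up sizes)

# CP factors: U is shared across all tasks; V and S are task-specific
# latent factors, one row per value of each task index.
U = rng.standard_normal((d, R))    # shared factor over features
V = rng.standard_normal((m1, R))   # latent factor along task index 1
S = rng.standard_normal((m2, R))   # latent factor along task index 2

# Weight tensor W[:, i, j] = sum_r U[:, r] * V[i, r] * S[j, r]
W = np.einsum('dr,ir,jr->dij', U, V, S)

# The weight vector of task (i, j) is a mixture of R shared directions:
w_ij = W[:, 1, 2]
assert np.allclose(w_ij, U @ (V[1] * S[2]))
```

The point of the shared factor `U` is that every task's weight vector lives in the same R-dimensional subspace, which is how the tensorized model couples the tasks.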
Learning Using Privileged Information: SVM+ and Weighted SVM
Prior knowledge can be used to improve predictive performance of learning
algorithms or reduce the amount of data required for training. The same goal is
pursued within the learning using privileged information paradigm which was
recently introduced by Vapnik et al. and is aimed at utilizing additional
information available only at training time -- a framework implemented by SVM+.
We relate the privileged information to importance weighting and show that the
prior knowledge expressible with privileged features can also be encoded by
weights associated with every training example. We show that a weighted SVM can
always replicate an SVM+ solution, while the converse is not true and we
construct a counterexample highlighting the limitations of SVM+. Finally, we
touch on the problem of choosing weights for weighted SVMs when privileged
features are not available.Comment: 18 pages, 8 figures; integrated reviewer comments, improved
typesettin
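The weighted-SVM side of this correspondence is easy to sketch with scikit-learn's `sample_weight` argument. The weights below are illustrative placeholders standing in for the knowledge a privileged feature would supply, not weights derived from an actual SVM+ solution:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # linearly separable toy labels

# Per-example weights play the role of privileged information: examples
# deemed more reliable get more influence on the decision boundary.
w = np.where(y == 1, 2.0, 1.0)

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y, sample_weight=w)
print(clf.score(X, y))
```

The paper's claim is that for any SVM+ solution there exist weights `w` making the weighted SVM reproduce it; the converse direction fails.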
Enhanced default risk models with SVM+
Default risk models have lately raised great interest due to the recent world economic crisis. In spite of the many advanced techniques that have been proposed, no comprehensive method incorporating a holistic perspective has hitherto been considered. Thus, the existing models for bankruptcy prediction lack full coverage of contextual knowledge, which may prevent decision makers such as investors and financial analysts from taking the right decisions. Recently, SVM+ has provided a formal way to incorporate additional information (beyond the training data) into learning models, improving generalization. In financial settings, examples of such non-financial (though relevant) information are marketing reports, the competitor landscape, the economic environment, customer screening, industry trends, etc. By exploiting additional information able to improve classical inductive learning, we propose a prediction model where the data are naturally separated into several structured groups clustered by the size and annual turnover of the firms. Experimental results on a heterogeneous data set of French companies demonstrate that the proposed default risk model shows better predictive performance than the baseline SVM and multi-task learning with SVM.
Convex formulation for multi-task L1-, L2-, and LS-SVMs
Quite often a machine learning problem lends itself to be split into several well-defined subproblems, or tasks. The goal of Multi-Task Learning (MTL) is to leverage the joint learning of the problem from two different perspectives: on the one hand, a single, overall model, and on the other, task-specific models. In this way, the solution found by MTL may be better than those of either the common or the task-specific models. Starting with the work of Evgeniou et al., support vector machines (SVMs) have lent themselves naturally to this approach. This paper proposes a convex formulation of MTL for the L1-, L2- and LS-SVM models that results in dual problems quite similar to the single-task ones, but with multi-task kernels; in turn, this makes it possible to train the convex MTL models using standard solvers. As an alternative approach, the direct optimal combination of the already trained common and task-specific models can also be considered. In this paper, a procedure to compute the optimal combining parameter with respect to four different error functions is derived. As shown experimentally, the proposed convex MTL approach generally performs better than the alternative optimal convex combination, and both of them are better than the straight use of either common or task-specific models.

With partial support from Spain's grant TIN2016-76406-P. Work supported also by the UAM-ADIC Chair for Data Science and Machine Learning.
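A minimal sketch of the multi-task kernel idea in the style of Evgeniou et al.: the kernel between two examples receives an extra boost when they come from the same task, so a standard solver with a precomputed kernel trains the multi-task model. The data, the task assignments, and the mixing constant `mu` are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

def multitask_kernel(X1, t1, X2, t2, mu=1.0):
    """Common linear kernel plus a task-specific boost when tasks match:
    K((x, s), (x', t)) = (mu + [s == t]) * <x, x'>."""
    base = X1 @ X2.T                        # base kernel k(x, x')
    same = (t1[:, None] == t2[None, :])     # indicator of matching tasks
    return (mu + same) * base

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3))
tasks = rng.integers(0, 2, size=60)
y = (X[:, 0] > 0).astype(int)

K = multitask_kernel(X, tasks, X, tasks)
clf = SVC(kernel='precomputed').fit(K, y)
print(clf.score(K, y))
```

Because the boosted kernel is a sum of two positive semidefinite kernels, any off-the-shelf SVM solver that accepts a precomputed Gram matrix can be used unchanged, which is the practical point of the convex formulation.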
Application of Synthetic Images for Solving the Classification Problem in the Diagnosis of Lung Cancer
Background: From a mathematical point of view, problems of medical diagnostics are data classification tasks. It is important to understand how much the distortions introduced when collecting primary diagnostic information, in particular the results of biochemical tests, can contribute to classification errors.

Aims: To determine the dependence of the prediction result on the variability of the primary diagnostic information, using a model classifier as an example.

Materials and methods: The case-control study enrolled patients who were divided into 2 groups: the main group (diagnosed with lung cancer, n=200) and the control group (conditionally healthy, n=500). All participants completed a questionnaire and underwent a biochemical study of saliva. Patients of the main group and the comparison group were hospitalized for surgical treatment, after which histological verification of the diagnosis was carried out. The biochemical composition of saliva was determined spectrophotometrically. Based on the data obtained, a model classifier for the diagnosis of lung cancer (a random forest) was constructed. Deviations within specified ranges (±1–5%, ±5–10%, ±10–15%) were introduced into each parameter underlying the classifier, creating synthetic images. The classification results were then evaluated by cross-validation.

Results: The basic diagnostic characteristics of the model classifier were determined (sensitivity 72.5%, specificity 86.0%). As the deviations of the synthetic images from the baseline increase, the diagnostic characteristics under general classification deteriorate. The result of confident classification, on the contrary, gives higher values (sensitivity 81.8%, specificity 93.1%). Under confident classification, similar images that fall into different classes are deleted, whereas under general classification they are taken into account. The difference between the two classification modes is associated with the presence of images for which the classifier outputs a class-membership score in the range 0.45–0.55. It is therefore necessary to introduce a third class into the classifier, the so-called gray zone (0.4–0.6), since the probability of an erroneous diagnosis in this region is significantly increased.

Conclusions: The obtained results allow us to conclude that measurement error in the range ±1–15% does not significantly affect the quality of the classification.
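The gray-zone rule can be sketched with a random forest whose probability outputs in 0.4–0.6 are routed to a third "uncertain" class. The synthetic data, noise level, and thresholds below mirror the abstract but are purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((700, 5))
# Labels driven by the first feature, with noise standing in for assay error.
y = (X[:, 0] + 0.5 * rng.standard_normal(700) > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
p = clf.predict_proba(Xte)[:, 1]

# Three-way decision: class 0, class 1, or the gray zone (-1) for p in 0.4-0.6.
decision = np.where(p < 0.4, 0, np.where(p > 0.6, 1, -1))
confident = decision != -1
acc_confident = (decision[confident] == yte[confident]).mean()
print(f"confident on {confident.mean():.0%} of cases, accuracy {acc_confident:.2f}")
```

Restricting the evaluation to the confident subset is exactly the mechanism by which "confident classification" reports higher sensitivity and specificity than general classification.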
Learning with privileged and sensitive information: a gradient-boosting approach
We consider the problem of learning with sensitive features under the privileged information setting, where the goal is to exploit features that are unavailable (or too sensitive to collect) at test/deployment time in order to learn a better model at training time. We focus on tree-based learners, specifically gradient-boosted decision trees, for learning with privileged information. Our methods use privileged features as knowledge to guide the algorithm when learning from the fully observed (usable) features. We derive the theory, empirically validate the effectiveness of our algorithms, and verify them on standard fairness metrics.
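One common way to realize this idea (not necessarily the exact algorithm of this paper) is generalized distillation: a teacher learns from the privileged features, and the student, which sees only the usable features, is fit against labels softened toward the teacher's scores. All data and the mixing weight `lam` below are assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 400
x_priv = rng.standard_normal((n, 3))                       # privileged: training only
x_reg = x_priv[:, :1] + 0.5 * rng.standard_normal((n, 1))  # noisy usable view
y = (x_priv[:, 0] > 0).astype(int)

# Teacher: learns the task from the privileged features.
teacher = GradientBoostingClassifier(random_state=0).fit(x_priv, y)
soft = teacher.predict_proba(x_priv)[:, 1]

# Student: sees only the usable feature, fit against labels softened toward
# the teacher's scores (lam is an assumed mixing weight).
lam = 0.5
target = lam * y + (1 - lam) * soft
student = GradientBoostingRegressor(random_state=0).fit(x_reg, target)
pred = (student.predict(x_reg) > 0.5).astype(int)
print((pred == y).mean())
```

At deployment only `student` is needed, so nothing sensitive has to be collected at test time.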
Advanced Learning Methodologies for Biomedical Applications
University of Minnesota Ph.D. dissertation. October 2017. Major: Electrical/Computer Engineering. Advisor: Vladimir Cherkassky. 1 computer file (PDF); ix, 109 pages. There has been a dramatic increase in application of statistical and machine learning methods for predictive data-analytic modeling of biomedical data. Most existing work in this area involves application of standard supervised learning techniques. Typical methods include standard classification or regression techniques, where the goal is to estimate an indicator function (classification decision rule) or a real-valued function of input variables from a finite training sample. However, real-world data often contain additional information besides labeled training samples. Incorporating this additional information into learning (model estimation) leads to nonstandard/advanced learning formalizations that represent extensions of standard supervised learning. Recent examples of such advanced methodologies include semi-supervised learning (or transduction) and learning through contradiction (or Universum learning). This thesis investigates two new advanced learning methodologies along with their biomedical applications. The first is motivated by modeling complex survival data, which can incorporate future, censored, or unknown data in addition to (traditional) labeled training data. Here we propose an original formalization for predictive modeling of survival data, under the framework of Learning Using Privileged Information (LUPI) proposed by Vapnik. Survival data represents a collection of time observations about events. Our modeling goal is to predict the state (alive/dead) of a subject at a pre-determined future time point. We explore modeling of survival data as a binary classification problem that incorporates additional information (such as time of death, censored/uncensored status, etc.) under the LUPI framework.
Then we propose two advanced constructive Support Vector Machine (SVM)-based formulations: SVM+ and Loss-Order SVM (LO-SVM). Empirical results using simulated and real-life survival data indicate that the proposed LUPI-based methods are very effective (versus classical Cox regression) when the survival time does not follow classical probabilistic assumptions. The second advanced methodology investigates a new learning paradigm for classification called Group Learning. This approach is motivated by modeling high-dimensional data when the number of input features is much larger than the number of training samples. There are two main approaches to solving such ill-posed problems: (a) selecting a small number of informative features via feature selection; (b) using all features but imposing additional complexity constraints, e.g., ridge regression, SVM, LASSO, etc. The proposed Group Learning method takes a different approach, splitting all features into many (t) groups and then estimating a classifier in a reduced space (of dimensionality d/t). This approach effectively uses all features, but implements training in a lower-dimensional input space. Note that the formation of groups reflects application-domain knowledge. For example, when classifying two-dimensional images represented as a set of pixels (the original high-dimensional input space), appropriate groups can be formed by grouping adjacent pixels or "local patches", because adjacent pixels are known to be highly correlated. We provide empirical validation of this new methodology for two real-life applications: (a) handwritten digit recognition, and (b) predictive classification of univariate signals, e.g., prediction of epileptic seizures from intracranial electroencephalogram (iEEG) signals. Prediction of epileptic seizures is particularly challenging, due to highly unbalanced data (just 4-5 observed seizures) and patient-specific modeling.
In a joint project with Mayo Clinic, we have incorporated the Group Learning approach into an SVM-based system for seizure prediction. This system performs subject-specific modeling and achieves robust prediction performance.
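A sketch of the group-splitting idea, under one plausible reading of the description above (contiguous feature groups as "local patches", one reduced-dimension training example per group, group scores averaged at prediction time); all sizes are made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, t = 30, 64, 8                  # few samples, many features (made-up sizes)
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
y = (X @ beta > 0).astype(int)

# Split the d features into t contiguous groups and treat each group-vector
# as its own training example in the reduced d/t-dimensional space.
Xg = X.reshape(n * t, d // t)        # row-major reshape: t chunks per sample
yg = np.repeat(y, t)

clf = LogisticRegression().fit(Xg, yg)

# Aggregate the t group-level scores back into one prediction per sample.
scores = clf.predict_proba(Xg)[:, 1].reshape(n, t).mean(axis=1)
pred = (scores > 0.5).astype(int)
```

The classifier now sees n*t examples of dimension d/t instead of n examples of dimension d, which is how the method eases the ill-posedness while still using every feature.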
Toxicity prediction using multi-disciplinary data integration and novel computational approaches
Current predictive tools used for human health assessment of potential chemical hazards rely primarily on either chemical structural information (i.e., cheminformatics) or bioassay data (i.e., bioinformatics). Emerging data sources such as chemical libraries, high-throughput assays and health databases offer new possibilities for evaluating chemical toxicity as an integrated system and for overcoming the limited predictive power of current fragmented efforts; yet few studies have combined the new data streams. This dissertation tested the hypothesis that integrative computational toxicology approaches drawing upon diverse data sources would improve the prediction and interpretation of chemically induced diseases. First, chemical structures and toxicogenomics data were used to predict hepatotoxicity. Compared with conventional cheminformatics or toxicogenomics models, interpretation was enriched by the chemical and biological insights, even though prediction accuracy did not improve. This motivated the second project, which developed a novel integrative method, chemical-biological read-across (CBRA), that led to predictive and interpretable models amenable to visualization. CBRA was consistently among the most accurate models on four chemical-biological data sets. It highlighted chemical and biological features for interpretation, and the visualizations aided transparency. Third, we developed an integrative workflow that interfaced cheminformatics prediction with pharmacoepidemiology validation using a case study of Stevens-Johnson syndrome (SJS), an adverse drug reaction (ADR) of major public health concern. Cheminformatics models first predicted potential SJS inducers and non-inducers, prioritizing them for subsequent pharmacoepidemiology evaluation, which then confirmed that predicted non-inducers were statistically associated with fewer SJS occurrences.
By combining cheminformatics' ability to predict SJS as soon as drug structures are known with pharmacoepidemiology's statistical rigor, we have provided a universal scheme for more effective study of SJS and other ADRs. Overall, this work demonstrated that integrative approaches can deliver more predictive and interpretable models. These models can then reliably prioritize high-risk chemicals for further testing, allowing optimization of testing resources. A broader implication of this research is the growing role we envision for integrative methods that take advantage of the various emerging data sources.
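The read-across idea behind CBRA can be sketched as nearest-neighbour prediction under a blended chemical + biological similarity. The equal 50/50 blend, the cosine similarity, and the random data are assumptions for illustration; the published method tunes these choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
chem = rng.standard_normal((n, 10))   # chemical descriptors (hypothetical)
bio = rng.standard_normal((n, 20))    # bioassay profiles (hypothetical)
y = rng.integers(0, 2, n)             # toxicity labels (hypothetical)

def cosine_sim(A):
    """Pairwise cosine similarity between the rows of A."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    return A @ A.T

# Read-across: predict each chemical from its nearest neighbours under a
# combined chemical + biological similarity.
S = 0.5 * cosine_sim(chem) + 0.5 * cosine_sim(bio)
np.fill_diagonal(S, -np.inf)          # exclude self-similarity
k = 5
nbrs = np.argsort(-S, axis=1)[:, :k]  # k most similar compounds per chemical
pred = (y[nbrs].mean(axis=1) > 0.5).astype(int)
```

Keeping the two similarity matrices separate until the final blend is what lets the method report which chemical and which biological neighbours drove each prediction, supporting the interpretability claims above.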