10 research outputs found
Recommended from our members
Developing Children's Oral Health Assessment Toolkits Using Machine Learning Algorithm.
ObjectivesEvaluating children's oral health status and treatment needs is challenging. We aim to build oral health assessment toolkits to predict Children's Oral Health Status Index (COHSI) score and referral for treatment needs (RFTN) of oral health. Parent and Child toolkits consist of short-form survey items (12 for children and 8 for parents) with and without children's demographic information (7 questions) to predict the child's oral health status and need for treatment.MethodsData were collected from 12 dental practices in Los Angeles County from 2015 to 2016. We predicted COHSI score and RFTN using random Bootstrap samples with manually introduced Gaussian noise together with machine learning algorithms, such as Extreme Gradient Boosting and Naive Bayesian algorithms (using R). The toolkits predicted the probability of treatment needs and the COHSI score with percentile (ranking). The performance of the toolkits was evaluated internally and externally by residual mean square error (RMSE), correlation, sensitivity and specificity.ResultsThe toolkits were developed based on survey responses from 545 families with children aged 2 to 17 y. The sensitivity and specificity for predicting RFTN were 93% and 49% respectively with the external data. The correlation(s) between predicted and clinically determined COHSI was 0.88 (and 0.91 for its percentile). The RMSEs of the COHSI toolkit were 4.2 for COHSI (and 1.3 for its percentile).ConclusionsSurvey responses from children and their parents/guardians are predictive for clinical outcomes. The toolkits can be used by oral health programs at baseline among school populations. The toolkits can also be used to quantify differences between pre- and post-dental care program implementation. The toolkits' predicted oral health scores can be used to stratify samples in oral health research.Knowledge transfer statementThis study creates the oral health toolkits that combine self- and proxy- reported short forms with children's demographic characteristics to predict children's oral health and treatment needs using Machine Learning algorithms. The toolkits can be used by oral health programs at baseline among school populations to quantify differences between pre and post dental care program implementation. The toolkits can also be used to stratify samples according to the treatment needs and oral health status
An Efficient Dual Approach to Distance Metric Learning
Distance metric learning is of fundamental interest in machine learning
because the distance metric employed can significantly affect the performance
of many learning methods. Quadratic Mahalanobis metric learning is a popular
approach to the problem, but typically requires solving a semidefinite
programming (SDP) problem, which is computationally expensive. Standard
interior-point SDP solvers typically have a complexity of (with
the dimension of input data), and can thus only practically solve problems
exhibiting less than a few thousand variables. Since the number of variables is
, this implies a limit upon the size of problem that can
practically be solved of around a few hundred dimensions. The complexity of the
popular quadratic Mahalanobis metric learning approach thus limits the size of
problem to which metric learning can be applied. Here we propose a
significantly more efficient approach to the metric learning problem based on
the Lagrange dual formulation of the problem. The proposed formulation is much
simpler to implement, and therefore allows much larger Mahalanobis metric
learning problems to be solved. The time complexity of the proposed method is
, which is significantly lower than that of the SDP approach.
Experiments on a variety of datasets demonstrate that the proposed method
achieves an accuracy comparable to the state-of-the-art, but is applicable to
significantly larger problems. We also show that the proposed method can be
applied to solve more general Frobenius-norm regularized SDP problems
approximately
A Clustering Algorithm Based on an Ensemble of Dissimilarities: An Application in the Bioinformatics Domain
Clustering algorithms such as k-means depend heavily on choosing an appropriate distance metric that reflect accurately the object proximities. A wide range of dissimilarities may be defined that often lead to different clustering results. Choosing the best dissimilarity is an ill-posed problem and learning a general distance from the data is a complex task, particularly for high dimensional problems. Therefore, an appealing approach is to learn an ensemble of dissimilarities. In this paper, we have developed a semi-supervised clustering algorithm that learns a linear combination of dissimilarities considering incomplete knowledge in the form of pairwise constraints. The minimization of the loss function is based on a robust and efficient quadratic optimization algorithm. Besides, a regularization term is considered that controls the complexity of the distance metric learned avoiding overfitting. The algorithm has been applied to the identification of tumor samples using the gene expression profiles, where domain experts provide often incomplete knowledge in the form of pairwise constraints. We report that the algorithm proposed outperforms a standard semi-supervised clustering technique available in the literature and clustering results based on a single dissimilarity. The improvement is particularly relevant for applications with high level of noise
Case-based similar image retrieval for weakly annotated large histopathological images of malignant lymphoma using deep metric learning
In the present study, we propose a novel case-based similar image retrieval
(SIR) method for hematoxylin and eosin (H&E)-stained histopathological images
of malignant lymphoma. When a whole slide image (WSI) is used as an input
query, it is desirable to be able to retrieve similar cases by focusing on
image patches in pathologically important regions such as tumor cells. To
address this problem, we employ attention-based multiple instance learning,
which enables us to focus on tumor-specific regions when the similarity between
cases is computed. Moreover, we employ contrastive distance metric learning to
incorporate immunohistochemical (IHC) staining patterns as useful supervised
information for defining appropriate similarity between heterogeneous malignant
lymphoma cases. In the experiment with 249 malignant lymphoma patients, we
confirmed that the proposed method exhibited higher evaluation measures than
the baseline case-based SIR methods. Furthermore, the subjective evaluation by
pathologists revealed that our similarity measure using IHC staining patterns
is appropriate for representing the similarity of H&E-stained tissue images for
malignant lymphoma
Scalable large-margin Mahalanobis distance metric learning
For many machine learning algorithms such as k-nearest neighbor ( k-NN) classifiers and k-means clustering, often their success heavily depends on the metric used to calculate distances between different data points. An effective solution for defining such a metric is to learn it from a set of labeled training samples. In this work, we propose a fast and scalable algorithm to learn a Mahalanobis distance metric. The Mahalanobis metric can be viewed as the Euclidean distance metric on the input data that have been linearly transformed. By employing the principle of margin maximization to achieve better generalization performances, this algorithm formulates the metric learning as a convex optimization problem and a positive semidefinite (p.s.d.) matrix is the unknown variable. Based on an important theorem that a p.s.d. trace-one matrix can always be represented as a convex combination of multiple rank-one matrices, our algorithm accommodates any differentiable loss function and solves the resulting optimization problem using a specialized gradient descent procedure. During the course of optimization, the proposed algorithm maintains the positive semidefiniteness of the matrix variable that is essential for a Mahalanobis metric. Compared with conventional methods like standard interior-point algorithms or the special solver used in large margin nearest neighbor , our algorithm is much more efficient and has a better performance in scalability. Experiments on benchmark data sets suggest that, compared with state-of-the-art metric learning algorithms, our algorithm can achieve a comparable classification accuracy with reduced computational complexity
Semi-supervised learning for image classification
Object class recognition is an active topic in computer vision still presenting many challenges. In most approaches, this task is addressed by supervised learning algorithms that need a large quantity of labels to perform well. This leads either to small datasets (< 10,000 images) that capture only a subset of the real-world class distribution (but with a controlled and verified labeling procedure), or to large datasets that are more representative but also add more label noise. Therefore, semi-supervised learning is a promising direction. It requires only few labels while simultaneously making use of the vast amount of images available today. We address object class recognition with semi-supervised learning. These algorithms depend on the underlying structure given by the data, the image description, and the similarity measure, and the quality of the labels. This insight leads to the main research questions of this thesis: Is the structure given by labeled and unlabeled data more important than the algorithm itself? Can we improve this neighborhood structure by a better similarity metric or with more representative unlabeled data? Is there a connection between the quality of labels and the overall performance and how can we get more representative labels? We answer all these questions, i.e., we provide an extensive evaluation, we propose several graph improvements, and we introduce a novel active learning framework to get more representative labels.Objektklassifizierung ist ein aktives Forschungsgebiet in maschineller Bildverarbeitung was bisher nur unzureichend gelöst ist. Die meisten Ansätze versuchen die Aufgabe durch überwachtes Lernen zu lösen. Aber diese Algorithmen benötigen eine hohe Anzahl von Trainingsdaten um gut zu funktionieren. Das führt häufig entweder zu sehr kleinen Datensätzen (< 10,000 Bilder) die nicht die reale Datenverteilung einer Klasse wiedergeben oder zu sehr grossen Datensätzen bei denen man die Korrektheit der Labels nicht mehr garantieren kann. Halbüberwachtes Lernen ist eine gute Alternative zu diesen Methoden, da sie nur sehr wenige Labels benötigen und man gleichzeitig Datenressourcen wie das Internet verwenden kann. In dieser Arbeit adressieren wir Objektklassifizierung mit halbüberwachten Lernverfahren. Diese Algorithmen sind sowohl von der zugrundeliegenden Struktur, die sich aus den Daten, der Bildbeschreibung und der Distanzmasse ergibt, als auch von der Qualität der Labels abhängig. Diese Erkenntnis hat folgende Forschungsfragen aufgeworfen: Ist die Struktur wichtiger als der Algorithmus selbst? Können wir diese Struktur gezielt verbessern z.B. durch eine bessere Metrik oder durch mehr Daten? Gibt es einen Zusammenhang zwischen der Qualität der Labels und der Gesamtperformanz der Algorithmen? In dieser Arbeit beantworten wir diese Fragen indem wir diese Methoden evaluieren. Ausserdem entwickeln wir neue Methoden um die Graphstruktur und die Labels zu verbessern