
    Cluster validity in clustering methods


    Sekventiaalisen tiedon louhinta : segmenttirakenteita etsimässä (Mining sequential data: in search of segment structures)

    Segmentation is a data mining technique yielding simplified representations of sequences of ordered points; the points may be either one- or multidimensional. A sequence is divided into a number of homogeneous blocks, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, for which the most likely description of each segment and the most likely segmentation into a given number of blocks can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool for learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure; applying segmentation to certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between the different dimensions and segments of the sequence. Both problems are formally defined, and algorithms for solving them are provided and analyzed. Practical applications of segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis, segmentation applications are demonstrated in the analysis of genomic sequences.
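The core computation the abstract refers to — the most likely segmentation into a given number of blocks, computed efficiently — can be sketched with the textbook dynamic program for least-squares piecewise-constant segmentation (a generic illustration, not the thesis's own code):

```python
import numpy as np

def optimal_segmentation(x, k):
    """Split x into k contiguous segments minimizing the total squared
    error of describing each segment by its mean; the classic
    O(n^2 * k) dynamic program for piecewise-constant segmentation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Prefix sums give the cost of any candidate segment in O(1):
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def cost(i, j):  # squared error of x[i:j] around its mean (j exclusive)
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    E = np.full((k + 1, n + 1), np.inf)   # E[h][j]: best cost of x[:j] in h segments
    back = np.zeros((k + 1, n + 1), dtype=int)
    E[0][0] = 0.0
    for h in range(1, k + 1):
        for j in range(h, n + 1):
            for i in range(h - 1, j):
                c = E[h - 1][i] + cost(i, j)
                if c < E[h][j]:
                    E[h][j], back[h][j] = c, i
    # Backtrack to recover the segment boundaries.
    bounds = [n]
    for h in range(k, 0, -1):
        bounds.append(int(back[h][bounds[-1]]))
    bounds.reverse()
    segments = [(bounds[t], bounds[t + 1]) for t in range(k)]
    levels = [float(x[a:b].mean()) for a, b in segments]
    return segments, levels, float(E[k][n])
```

On a sequence with three obvious levels, `optimal_segmentation([1, 1, 1, 5, 5, 5, 9, 9], 3)` recovers the boundaries at positions 3 and 6 with zero residual error.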

    A General Framework for Updating Belief Distributions

    We propose a framework for general Bayesian inference. We argue that a valid update of a prior belief distribution to a posterior can be made for parameters which are connected to observations through a loss function rather than the traditional likelihood function, which is recovered as the special case of using the self-information loss. Modern application areas make it increasingly challenging for Bayesians to attempt to model the true data-generating mechanism. Moreover, when the object of interest is low-dimensional, such as a mean or median, it is cumbersome to have to achieve this via a complete model for the whole data distribution. More importantly, there are settings where the parameter of interest does not directly index a family of density functions, and thus the Bayesian approach to learning about such parameters is currently regarded as problematic. Our proposed framework uses loss functions to connect information in the data to functionals of interest. The updating of beliefs then follows from a decision-theoretic approach involving cumulative loss functions. Importantly, the procedure coincides with Bayesian updating when a true likelihood is known, yet provides coherent subjective inference in much more general settings. Connections to other inference frameworks are highlighted.

    Comment: this is the pre-peer-reviewed version of the article "A General Framework for Updating Belief Distributions", which has been accepted for publication in the Journal of the Royal Statistical Society: Series B. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving.
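The update rule at the heart of the framework — beliefs revised through a loss function instead of a likelihood — can be illustrated with a small grid approximation. The flat prior, learning rate `w`, and data below are illustrative choices, not from the paper; with the self-information (negative log-likelihood) loss this reduces to standard Bayesian updating, while the absolute-error loss used here targets the median, a functional that indexes no density family:

```python
import numpy as np

def general_bayes_update(data, grid, prior, loss, w=1.0):
    """Grid-based loss-driven belief update:
    posterior(theta) proportional to exp(-w * sum_i loss(theta, x_i)) * prior(theta)."""
    cum_loss = np.array([sum(loss(t, x) for x in data) for t in grid])
    post = prior * np.exp(-w * (cum_loss - cum_loss.min()))  # shift for stability
    return post / post.sum()

grid = np.linspace(-5.0, 5.0, 1001)
prior = np.ones_like(grid) / len(grid)      # flat prior over the grid
data = [0.9, 1.1, 1.0, 4.0]                 # outlier-contaminated sample
post = general_bayes_update(data, grid, prior,
                            loss=lambda t, x: abs(t - x))  # absolute loss -> median
```

The posterior concentrates near the sample median (about 1.0) rather than the mean, which the outlier at 4.0 would drag upward.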

    Offline speaker segmentation using genetic algorithms and mutual information

    We present an evolutionary approach to speaker segmentation, an activity that is especially important prior to speaker recognition and audio content analysis tasks. Our approach consists of a genetic algorithm (GA), which encodes possible segmentations of an audio record, and a measure of mutual information between the audio data and possible segmentations, which is used as the fitness function for the GA. We introduce a compact encoding of the problem into the GA, which reduces the length of the GA individuals and improves the GA convergence properties. Our algorithm has been tested on the segmentation of real audio data, and its performance has been compared with several existing algorithms for speaker segmentation, obtaining very good results in all test problems.

    This work was supported in part by the Universidad de Alcalá under Project UAH PI2005/078.
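The GA scheme can be sketched in miniature: each individual is the sorted list of its boundary positions (a compact encoding of a segmentation) evolved by selection, crossover, and mutation. The paper's fitness is a mutual-information measure between the audio and the segmentation; here a negative within-segment variance on a 1-D feature track stands in for it, and all GA parameters (population size, rates) are illustrative:

```python
import random

def fitness(seq, bounds):
    """Score a segmentation: negative sum of within-segment squared
    deviations, so homogeneous segments score higher (a simplified
    stand-in for the paper's mutual-information fitness)."""
    cuts = [0] + sorted(bounds) + [len(seq)]
    total = 0.0
    for a, b in zip(cuts, cuts[1:]):
        seg = seq[a:b]
        if seg:
            m = sum(seg) / len(seg)
            total += sum((v - m) ** 2 for v in seg)
    return -total

def ga_segment(seq, n_segments, pop_size=40, gens=60, seed=0):
    """Evolve boundary positions; an individual is the sorted list of its
    n_segments - 1 boundaries."""
    rng = random.Random(seed)
    n_bounds = n_segments - 1
    pop = [sorted(rng.sample(range(1, len(seq)), n_bounds))
           for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda ind: fitness(seq, ind), reverse=True)
        survivors = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = sorted(rng.choice(pair) for pair in zip(a, b))  # uniform crossover
            if rng.random() < 0.3:                                  # jitter mutation
                i = rng.randrange(n_bounds)
                child[i] = max(1, min(len(seq) - 1,
                                      child[i] + rng.choice([-2, -1, 1, 2])))
                child.sort()
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda ind: fitness(seq, ind))
```

On a toy two-speaker feature track such as `[0.0] * 20 + [10.0] * 20`, the GA converges to a boundary near position 20.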

    Improving the Clinical Use of Magnetic Resonance Spectroscopy for the Analysis of Brain Tumours using Machine Learning and Novel Post-Processing Methods

    Magnetic Resonance Spectroscopy (MRS) provides unique and clinically relevant information for the assessment of several diseases. However, using the currently available tools, MRS processing and analysis is time-consuming and requires profound expert knowledge. For these two reasons, MRS has not yet gained general acceptance as a mainstream diagnostic technique, and the currently available clinical tools have seen little progress in recent years. MRS provides localized chemical information non-invasively, making it a valuable technique for the assessment of various diseases and conditions, namely brain, prostate and breast cancer, and metabolic diseases affecting the brain. In brain cancer, MRS is normally used for: (1.) differentiation between tumors and non-cancerous lesions, (2.) tumor typing and grading, (3.) differentiation between tumor progression and radiation necrosis, and (4.) identification of tumor infiltration. Despite the value of MRS for these tasks, susceptibility differences associated with tissue-bone and tissue-air interfaces, as well as the presence of post-operative paramagnetic particles, affect the quality of brain MR spectra and consequently reduce their clinical value. Therefore, proper quality management of MRS acquisition and processing is essential to achieve unambiguous and reproducible results. In this thesis, special emphasis was placed on this topic. This thesis addresses some of the major problems that limit the use of MRS in brain tumors and focuses on the use of machine learning for the automation of the MRS processing pipeline and for assisting the interpretation of MRS data. Three main topics were investigated: (1.) automatic quality control of MRS data, (2.) identification of spectroscopic patterns characteristic of different tissue types in brain tumors, and (3.) development of a new approach for the detection of tumor-related changes in glioblastoma (GBM) using MRSI data.
The first topic tackles the problem that MR spectra are frequently affected by signal artifacts which obscure their clinical information content. Manual identification of these artifacts is subjective and only practically feasible for single-voxel acquisitions, and only when the user has extensive experience with MRS. Therefore, the automatic distinction between data of good and bad quality is an essential step for the automation of MRS processing and routine reporting. The second topic addresses the difficulties that arise when interpreting MRS results: the interpretation requires expert knowledge, which is not available at every site. Consequently, the development of methods that enable easy comparison of new spectra with known spectroscopic patterns is of utmost importance for clinical applications of MRS. The third and last topic focuses on the use of MRSI information for the detection of tumor-related effects in the periphery of brain tumors. Several research groups have shown that MRSI information enables the detection of tumor infiltration in regions where structural MRI appears normal. However, many of the approaches described in the literature use only a very limited amount of the total information contained in each MR spectrum. Thus, a better way of exploiting MRSI information should enable improved detection of tumor borders, and consequently improve the treatment of brain tumor patients. The development of the described methods was made possible by a novel software tool for the combined processing of MRS and MRI: SpectrIm. This tool, which is currently distributed as part of the jMRUI software suite (www.jmrui.eu), underpins all of the methods presented and was one of the main outputs of the doctoral work. Overall, this thesis presents different methods that, when combined, enable the full automation of MRS processing and assist the analysis of MRS data in brain tumors.
By allowing clinical users to obtain more information from MRS with less effort, this thesis contributes to the transformation of MRS into an important clinical tool that may be available whenever its information is of relevance for patient management.

    Automated segmentation and characterisation of white matter hyperintensities

    Neuroimaging has enabled the observation of damage to the white matter that occurs frequently in the elderly population and is depicted as hyperintensities in specific magnetic resonance images. Since the pathophysiology underlying these signal abnormalities and their association with clinical risk factors and outcome are still under investigation, a robust and accurate quantification and characterisation of these observations is necessary. In this thesis, I developed a data-driven split-and-merge model selection framework that results in the joint modelling of normal-appearing and outlier observations in a hierarchical Gaussian mixture model. The resulting model can then be used to segment white matter hyperintensities (WMH) in a post-processing step. The validity of the method, in terms of robustness to data quality, acquisition protocol and preprocessing, and its comparison to the state of the art are evaluated in both simulated and clinical settings. To further characterise the lesions, a subject-specific coordinate frame is introduced that divides the WM region according to the relative distance between the ventricular surface and the cortical sheet, and to the lobar location. This coordinate frame is used for the comparison of lesion distributions in a population of twin pairs and for the prediction and standardisation of visual rating scales. Lastly, the cross-sectional method is extended into a longitudinal framework, in which a Gaussian mixture model built on an average image is used to constrain the representation of the individual time points. The method is validated using a purpose-built longitudinal lesion simulator and applied to the investigation of the relationship between APOE genetic status and lesion load progression.
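A schematic 1-D analogue of the modelling idea: fit a Gaussian mixture to "normal-appearing" intensities and flag observations the fitted model explains poorly as hyperintensity candidates. The thesis fits a hierarchical multivariate mixture with explicit outlier components and data-driven split-and-merge model selection; the fixed two-component EM and density threshold below are simplifications for illustration only:

```python
import math

def em_gmm_1d(x, iters=100):
    """Fit a two-component 1-D Gaussian mixture by EM, with a
    deterministic min/max initialization of the means."""
    mu = [min(x), max(x)]
    mean = sum(x) / len(x)
    var = [max(1e-6, sum((v - mean) ** 2 for v in x) / len(x))] * 2
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for xi in x:
            p = [w[j] * math.exp(-(xi - mu[j]) ** 2 / (2 * var[j]))
                 / math.sqrt(2 * math.pi * var[j]) for j in range(2)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: update weights, means, and variances
        for j in range(2):
            nj = sum(r[j] for r in resp)
            mu[j] = sum(r[j] * xi for r, xi in zip(resp, x)) / nj
            var[j] = max(1e-6, sum(r[j] * (xi - mu[j]) ** 2
                                   for r, xi in zip(resp, x)) / nj)
            w[j] = nj / len(x)
    return mu, var, w

def mixture_density(xi, mu, var, w):
    """Likelihood of a point under the fitted mixture; very low values
    mark candidate outliers (e.g. hyperintense lesions)."""
    return sum(wj * math.exp(-(xi - mj) ** 2 / (2 * vj)) / math.sqrt(2 * math.pi * vj)
               for mj, vj, wj in zip(mu, var, w))
```

Fitting the mixture on bimodal "normal" intensities around 80 and 120, a hyperintense value such as 200 receives near-zero density and would be flagged, while typical values remain well explained.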

    Copynumber: Efficient algorithms for single- and multi-track copy number segmentation.

    BACKGROUND: Cancer progression is associated with genomic instability and an accumulation of gains and losses of DNA. The growing variety of tools for measuring genomic copy numbers, including various types of array-CGH, SNP arrays and high-throughput sequencing, calls for a coherent framework offering unified and consistent handling of single- and multi-track segmentation problems. In addition, there is a demand for highly computationally efficient segmentation algorithms, due to the emergence of very high density scans of copy number. RESULTS: A comprehensive Bioconductor package for copy number analysis is presented. The package offers a unified framework for single-sample, multi-sample and multi-track segmentation and is based on statistically sound penalized least squares principles. Conditional on the number of breakpoints, the estimates are optimal in the least squares sense. A novel and computationally highly efficient algorithm is proposed that utilizes vector-based operations in R. Three case studies are presented. CONCLUSIONS: The R package copynumber is a software suite for segmentation of single- and multi-track copy number data using algorithms based on coherent least squares principles.
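The penalized least squares principle can be illustrated with an exact O(n^2) dynamic program: choose breakpoints minimizing the total within-segment squared error plus a penalty per segment. This is a sketch of the criterion in Python, not the package's vectorized R implementation, and `gamma` is a user-chosen penalty with an illustrative default:

```python
import numpy as np

def penalized_segmentation(x, gamma=4.0):
    """Exact penalized least-squares segmentation: minimize
    sum over segments of squared error + gamma per segment,
    via an O(n^2) dynamic program over the last breakpoint."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def cost(i, j):  # squared error of fitting x[i:j] by its mean
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    F = np.full(n + 1, np.inf)   # F[j]: best penalized cost of x[:j]
    F[0] = 0.0
    back = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):
            c = F[i] + cost(i, j) + gamma
            if c < F[j]:
                F[j], back[j] = c, i
    # Backtrack to recover the segment boundaries.
    bounds, j = [], n
    while j > 0:
        bounds.append(j)
        j = int(back[j])
    bounds.append(0)
    bounds.reverse()
    return [(bounds[t], bounds[t + 1]) for t in range(len(bounds) - 1)]
```

Unlike a fixed-k formulation, the penalty lets the data determine the number of segments: on `[0.0]*10 + [5.0]*10` the program places a single breakpoint at position 10, since any extra segment would add `gamma` without reducing the error.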