135 research outputs found

    EFSIS: Ensemble Feature Selection Integrating Stability

    Ensemble learning, which combines the predictions of multiple learners, has been widely applied in pattern recognition and has been reported to be more robust and accurate than the individual learners. This ensemble logic has recently also been applied to feature selection. There are two main strategies for ensemble feature selection: data perturbation and function perturbation. Data perturbation performs feature selection on data subsets sampled from the original dataset and then selects the features that are consistently ranked highly across those subsets. This has been found to improve both the stability of the selector and the prediction accuracy of a classifier. Function perturbation frees the user from having to decide on the most appropriate selector for any given situation and works by aggregating multiple selectors. This has been found to maintain or improve classification performance. Here we propose a framework, EFSIS, that combines these two strategies. Empirical results indicate that EFSIS gives both high prediction accuracy and stability.
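The two strategies can be sketched together in a few lines. The following is a minimal illustration, not the published EFSIS implementation: `mean_diff` is a toy univariate selector invented here, and the mean-rank aggregation over bootstrap samples is an assumption made for the example.

```python
import random
from collections import defaultdict

def mean_diff(X, y):
    """Toy univariate selector: absolute difference of class means per feature."""
    n_feat = len(X[0])
    scores = []
    for j in range(n_feat):
        a = [row[j] for row, lab in zip(X, y) if lab == 1]
        b = [row[j] for row, lab in zip(X, y) if lab == 0]
        if not a or not b:          # degenerate bootstrap sample: no signal
            scores.append(0.0)
        else:
            scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return scores

def efsis_rank(X, y, selectors, n_resamples=10, seed=0):
    """Combine function perturbation (several selectors) with data
    perturbation (bootstrap resampling) by summing per-selector feature ranks."""
    rng = random.Random(seed)
    n = len(y)
    rank_sums = defaultdict(float)
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        Xs, ys = [X[i] for i in idx], [y[i] for i in idx]
        for select in selectors:
            scores = select(Xs, ys)                  # higher = more relevant
            order = sorted(range(len(scores)), key=lambda j: -scores[j])
            for rank, feat in enumerate(order):
                rank_sums[feat] += rank
    # lower aggregated rank = consistently ranked highly across resamples
    return sorted(rank_sums, key=rank_sums.get)
```

With a toy dataset whose first feature separates the classes and whose second is constant, `efsis_rank` ranks feature 0 first.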

    Novel Extensions of Label Propagation for Biomarker Discovery in Genomic Data

    One primary goal of analyzing genomic data is the identification of biomarkers which may be causative of, correlated with, or otherwise biologically relevant to disease phenotypes. In this work, I implement and extend a multivariate feature ranking algorithm called label propagation (LP) for biomarker discovery in genome-wide single-nucleotide polymorphism (SNP) data. This graph-based algorithm utilizes an iterative propagation method to efficiently compute the strength of association between a SNP and a phenotype. I developed three extensions to the LP algorithm, with the goal of tailoring it to genomic data. The first extension is a modification to the LP score which yields a variable-level score for each SNP, rather than a score for each SNP genotype. The second extension incorporates prior biological knowledge that is encoded as a prior value for each SNP. The third extension enables the combination of rankings produced by LP and another feature ranking algorithm. The LP algorithm, its extensions, and two control algorithms (chi squared and sparse logistic regression) were applied to 11 genomic datasets, including a synthetic dataset, a semi-synthetic dataset, and nine genome-wide association study (GWAS) datasets covering eight diseases. The quality of each feature ranking algorithm was evaluated by using a subset of top-ranked SNPs to construct a classifier, whose predictive power was evaluated in terms of the area under the Receiver Operating Characteristic curve. Top-ranked SNPs were also evaluated for prior evidence of being associated with disease using evidence from the literature. The LP algorithm was found to be effective at identifying predictive and biologically meaningful SNPs. The single-score extension performed significantly better than the original algorithm on the GWAS datasets. The prior knowledge extension did not improve on the feature ranking results, and in some cases it reduced the predictive power of top-ranked variants. 
The ranking combination method was effective for some pairs of algorithms, but not for others. Overall, this work’s main results are the formulation and evaluation of several algorithmic extensions of LP for use in the analysis of genomic data, as well as the identification of several disease-associated SNPs.
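The propagation idea described above can be illustrated with a small sketch. This is a hypothetical simplification, not the dissertation's implementation: the bipartite sample-genotype graph, the clamping parameter `alpha`, and taking the best genotype score as the variable-level SNP score (cf. the first extension) are all assumptions made for illustration.

```python
from collections import defaultdict

def label_propagation_scores(genotypes, labels, n_iter=20, alpha=0.5):
    """Sketch: sample nodes carry phenotype labels (+1 case, -1 control);
    scores propagate to genotype nodes and back over a bipartite graph."""
    n_samples, n_snps = len(labels), len(genotypes[0])
    edges = defaultdict(list)            # (snp index, genotype) -> sample indices
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            edges[(j, g)].append(i)
    sample_score = [float(l) for l in labels]
    geno_score = {node: 0.0 for node in edges}
    for _ in range(n_iter):
        # genotype node <- mean score of the samples carrying that genotype
        for node, members in edges.items():
            geno_score[node] = sum(sample_score[i] for i in members) / len(members)
        # sample node <- partly its known label, partly its neighbours
        for i in range(n_samples):
            nbrs = [geno_score[(j, genotypes[i][j])] for j in range(n_snps)]
            sample_score[i] = alpha * labels[i] + (1 - alpha) * sum(nbrs) / len(nbrs)
    # variable-level score per SNP: strongest of its genotype-level scores
    snp_score = defaultdict(float)
    for (j, _), s in geno_score.items():
        snp_score[j] = max(snp_score[j], abs(s))
    return dict(snp_score)
```

On a toy dataset where SNP 0 perfectly tracks the phenotype and SNP 1 does not, SNP 0 receives the higher score.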

    The value of semantics in biomedical knowledge graphs

    Knowledge graphs use a graph-based data model to represent knowledge of the real world. They consist of nodes, which represent entities of interest such as diseases or proteins, and edges, which represent potentially different relations between these entities. Semantic properties can be attached to these nodes and edges, indicating the classes of entities they represent (e.g. gene, disease), the predicates that indicate the types of relationships between the nodes (e.g. stimulates, treats), and provenance that provides references to the sources of these relationships. Modelling knowledge as a graph emphasizes the interrelationships between the entities, making knowledge graphs a useful tool for performing computational analyses in domains where complex interactions and sequences of events exist, such as biomedicine. Semantic properties provide additional information and are assumed to benefit such computational analyses, but the added value of these properties has not yet been extensively investigated. This thesis therefore develops and compares computational methods that use these properties, and applies them to biomedical tasks: biomarker identification, drug repurposing, drug efficacy screening, identifying disease trajectories, and identifying genes targeted by disease-associated SNPs located in the non-coding part of the genome. In general, we find that methods which use concept classes, predicates, or provenance achieve superior performance over methods that do not use them. We thereby demonstrate the added value of these semantic properties for computational analyses performed on biomedical knowledge graphs.
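A minimal sketch of the data model described above may make the role of predicates and provenance concrete. The triples and identifiers below are invented for illustration, not taken from the thesis:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    subject: str
    predicate: str   # relationship type, e.g. "treats", "stimulates"
    object: str
    provenance: str  # reference to the source of the relationship

# illustrative triples only
graph = [
    Edge("drug_A", "treats", "disease_X", "PMID:0000001"),
    Edge("gene_B", "stimulates", "disease_X", "PMID:0000002"),
    Edge("drug_A", "stimulates", "gene_C", "PMID:0000003"),
]

def neighbours(graph, node, predicate=None):
    """Traverse outgoing edges, optionally restricted to one predicate --
    the kind of semantics-aware filtering that can be compared against
    predicate-agnostic traversal."""
    return [e.object for e in graph
            if e.subject == node
            and (predicate is None or e.predicate == predicate)]
```

Restricting traversal to a predicate changes which nodes are reachable, which is precisely where the semantic properties add information over the bare graph topology.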

    Biomarker Discovery Using Statistical and Machine Learning Approaches on Gene Expression Data

    My PhD is affiliated with the dCod 1.0 project (https://www.uib.no/en/dcod): decoding the systems toxicology of Atlantic cod (Gadus morhua), which aims to better understand how cod adapt and react to stressors in the environment. One of the research topics is the discovery of biomarkers that discriminate fish under normal biological conditions from fish that have been exposed to toxicants. A biomarker, or biological marker, is an indicator of a biological state in response to an intervention, such as toxic exposure (in toxicology), disease (for example cancer), or drug response (in precision medicine). Biomarker discovery is an important research topic in toxicology, cancer research, and related fields. A good set of biomarkers can give insight into disease or toxicant response mechanisms and can help determine whether a person has a disease or a fish has been exposed to a toxicant. On the molecular level, a biomarker could be a genotype, for instance a single nucleotide variant linked with a particular disease or susceptibility; another biomarker could be the expression level of a gene or a set of genes. In this thesis we focus on the latter, aiming to identify informative genes that can help distinguish samples from different groups based on their gene expression profiles. Several transcriptomics technologies can be used to generate the necessary data; among them, DNA microarray and RNA sequencing (RNA-Seq) have become the most widely used methods for whole-transcriptome gene expression profiling, and RNA-Seq in particular has become an attractive alternative to microarrays since its introduction. Prior to analysis of gene expression, RNA-Seq data need to go through a series of processing steps, so a workflow that can automate the process is highly desirable.
Even though many workflows have been proposed to facilitate this process, their applicability is often limited, for example to model organisms, high-performance computing environments, or computationally fluent users. To fill these gaps, we developed a maximally general RNA-Seq analysis workflow, RNA-Seq Analysis Snakemake Workflow (RASflow), which is applicable to a wide range of applications and requires little programming skill. It takes sequencing data as input and maps them to either a transcriptome or a genome for quantification; the resulting gene expression profile then goes through normalization and statistical tests to identify the differentially expressed genes. This work was presented in Paper I and Paper II. Differential expression analysis, as used in RASflow, and other univariate methods are widely used in biomarker discovery for their simplicity and interpretability. But they rely on the assumption that variables are independent, so they can only identify variables that carry significant information by themselves. However, biological processes usually involve many variables with complex interactions. Multivariate methods, which take the interactions between variables into consideration, are therefore also popular for biomarker discovery. To study whether one category has a significant advantage over the other, we conducted a comparative study of various methods from the two categories and evaluated them on two aspects: stability and prediction accuracy. We found that a method’s performance is quite data-dependent. This work was presented in Paper III. Since biomarker discovery methods perform quite differently on different datasets, how should one choose the most appropriate method for a particular dataset? One solution is to use the function perturbation strategy to combine the outputs of multiple methods.
Function perturbation is capable of maintaining prediction accuracy compared with the original individual methods, but its stability is not satisfactory. Data perturbation, on the other hand, uses a similar ensemble learning logic: it first generates multiple datasets by resampling the original dataset and then combines the results from those datasets. Data perturbation has been shown to improve the stability of the biomarker discovery method. We therefore proposed a framework that combines function perturbation with data perturbation, Ensemble Feature Selection Integrating Stability (EFSIS), which achieves both high prediction accuracy and stability. This work was presented in Paper IV.
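The "stability" that data perturbation improves is typically quantified by how similar the selected feature sets are across resamples of the data. A minimal sketch, assuming average pairwise Jaccard similarity as the stability measure (one common choice; the papers may use other measures):

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def selection_stability(selected_sets):
    """Average pairwise Jaccard similarity between the feature sets
    selected on different resamples of the same dataset."""
    pairs = list(combinations(selected_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A perfectly stable selector returns the same set on every resample and scores 1.0; disagreement across resamples pulls the score towards 0.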

    Random Forests Based Group Importance Scores and Their Statistical Interpretation: Application for Alzheimer's Disease

    Machine learning approaches have been increasingly used in the neuroimaging field for the design of computer-aided diagnosis systems. In this paper, we focus on the ability of these methods to provide interpretable information about the brain regions that are the most informative about the disease or condition of interest. In particular, we investigate the benefit of group-based, instead of voxel-based, analyses in the context of Random Forests. Assuming a prior division of the voxels into non-overlapping groups (defined by an atlas), we propose several procedures to derive group importances from the individual voxel importances produced by Random Forest models. We then adapt several permutation schemes to turn group importance scores into more interpretable statistical scores that make it possible to determine the truly relevant groups in the importance rankings. The good behaviour of these methods is first assessed on artificial datasets. They are then applied to our own dataset of FDG-PET scans to identify the brain regions involved in the prognosis of Alzheimer's disease.
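The aggregation-plus-permutation idea can be sketched briefly. This is an illustrative simplification: summing voxel importances within a group and the `(exceed + 1) / (n + 1)` permutation p-value are assumptions made here, whereas the paper proposes and compares several such aggregation and permutation schemes.

```python
def group_importance(voxel_importance, groups):
    """Derive a group importance by summing the individual voxel
    importances within each atlas-defined, non-overlapping group."""
    return {name: sum(voxel_importance[v] for v in voxels)
            for name, voxels in groups.items()}

def permutation_pvalue(observed, null_scores):
    """Turn a raw group importance into a statistical score: the fraction
    of permutation-derived null importances at least as large as the
    observed one (with the usual +1 correction to avoid zero p-values)."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)
```

In practice the null scores would come from refitting the forest on label-permuted data; here they are just placeholders to show the scoring step.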

    Automated detection of depression from brain structural magnetic resonance imaging (sMRI) scans

    An automated sMRI-based depression detection system is developed, whose components include acquisition and preprocessing, feature extraction, feature selection, and classification. The core focus of the research is the establishment of a new feature selection algorithm that quantifies the most relevant brain volumetric features for depression detection at the individual level.

    Scoping current and future genetic tools, their limitations and their applications for wild fisheries management

    The overarching goal of this project was to prepare a document that summarises past, present and emerging ways in which research using genetic technology can assist the Australian fishing industry to maintain productive and sustainable harvests. The project achieved the following specific objectives: 1. Documented existing and prospective biotechnologies and genetic analysis tools that are relevant to wild fisheries management, and their availability and application at a national and international level; 2. Documented the FRDC’s past and current investment in biotechnology and genetic tools used in wild fisheries management research; 3. Documented the different biotechnology and genetic tools that are being used in wild fisheries management research in Australia, and the nature and location of key research groups; 4. Described what management question each tool has been used for (e.g. stock structure, biomass estimation, product provenance, disease monitoring); 5. Identified those tools and approaches (existing and future) most likely to deliver significant advances in fisheries management; 6. Identified the potential for collaborations which could improve the focus and impact of work in this area.

    Risk assessment for progression of Diabetic Nephropathy based on patient history analysis

    Diabetic nephropathy (DN) is one of the most common complications in patients with diabetes. It is a chronic disease that progressively affects the kidneys and can result in kidney failure. Digitalization has allowed hospitals to store patient information in electronic health records (EHRs). Applying Machine Learning (ML) algorithms to these data can enable prediction of the risk of progression for these patients, leading to better disease management. The main objective of this work is to create a predictive model that takes advantage of the patient history present in the EHRs. This work used the largest dataset of Portuguese patients with DN, followed for 22 years by the Associação Protetora dos Diabéticos de Portugal (APDP). A longitudinal approach was developed in the data preprocessing phase, allowing the data to serve as input to sixteen distinct ML algorithms. After evaluation and analysis of the respective results, the Light Gradient Boosting Machine was identified as the best model, showing good predictive capabilities. This conclusion was supported not only by the evaluation of several classification metrics on training, test, and validation data, but also by the evaluation of its performance for each stage of the disease. In addition, the models were analysed using feature ranking plots and statistical analysis. As a complement, interpretability of the results is presented through the SHAP method, as well as deployment of the model using Gradio and Hugging Face servers. Through the integration of ML techniques, an interpretation method, and a Web application that provides access to the model, this study offers a potentially effective approach to anticipating the progression of DN, enabling healthcare professionals to make informed decisions for personalised care and disease management.

    Heuristic ensembles of filters for accurate and reliable feature selection

    Feature selection has become increasingly important in data mining in recent years. However, the accuracy and stability of feature selection methods vary considerably when used individually, and yet no rule exists to indicate which one should be used for a particular dataset. Thus, an ensemble method that combines the outputs of several individual feature selection methods appears to be a promising approach to address the issue and hence is investigated in this research. This research aims to develop an effective ensemble that can improve the accuracy and stability of feature selection. We proposed a novel heuristic ensemble of filters (HEF). It combines two types of filters, subset filters and ranking filters, with a heuristic consensus algorithm in order to utilise the strengths of each type. The ensemble is tested on ten benchmark datasets and its performance is evaluated by two stability measures and three classifiers. The experimental results demonstrate that HEF improves the stability and accuracy of the selected features and in most cases outperforms the other ensemble algorithms, individual filters and the full feature set. The research on the HEF algorithm is extended in several dimensions, including more filter members, three novel schemes of mean rank aggregation with partial lists, and three novel schemes for a weighted heuristic ensemble of filters. However, the experimental results demonstrate that adding weights to filters in HEF does not achieve the expected improvement in accuracy, but increases time and space complexity and clearly decreases stability. Therefore, the core ensemble algorithm (HEF) is demonstrated to be not just simpler but also more reliable and consistent than the later, more complicated weighted ensembles. In addition, we investigated how much of the data to use in feature selection, ALL or PART of it. Systematic experiments with thirty-five synthetic and benchmark real-world datasets were carried out.
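A consensus of subset and ranking filters can be sketched as follows. This is a hypothetical illustration of the general idea, not the published HEF algorithm: the keep-subset-selections-first heuristic and the mean-rank fill-in are assumptions made for the example.

```python
def consensus_select(subset_filters, ranking_filters, n_features, k):
    """Illustrative consensus of two filter types: features chosen by any
    subset filter are kept first; remaining slots up to k are filled by
    the best mean rank across the ranking filters."""
    chosen = []
    for selected in subset_filters:            # each: a list of feature indices
        for f in selected:
            if f not in chosen:
                chosen.append(f)
    # mean rank across ranking filters (each: a full ordering, best first)
    mean_rank = {f: sum(r.index(f) for r in ranking_filters) / len(ranking_filters)
                 for f in range(n_features)}
    for f in sorted(mean_rank, key=mean_rank.get):
        if len(chosen) >= k:
            break
        if f not in chosen:
            chosen.append(f)
    return chosen[:k]
```

The point of combining the two filter types is that subset filters contribute a discrete, internally consistent selection, while ranking filters break ties and fill out the feature budget.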