881 research outputs found

    Knowledge Extraction from Textual Resources through Semantic Web Tools and Advanced Machine Learning Algorithms for Applications in Various Domains

    Nowadays there is a tremendous amount of unstructured data, often in the form of text, which is created and stored in a variety of forms in many domains, such as patients' health records, social network comments, and scientific publications. This volume of data represents an invaluable source of knowledge, but mining it is challenging for machines. At the same time, novel tools and advanced methodologies have been introduced in several domains, improving the efficacy and efficiency of data-based services. Following this trend, this thesis shows how to parse data from text with Semantic Web based tools, feed the data into Machine Learning methodologies, and produce services or resources that facilitate the execution of specific tasks. More precisely, the use of Semantic Web technologies powered by Machine Learning algorithms is investigated in the Healthcare and E-Learning domains through previously untested methodologies. Furthermore, this thesis investigates the use of state-of-the-art tools to move data from texts to graphs that represent the knowledge contained in scientific literature. Finally, a Semantic Web ontology and novel heuristics for detecting insights from biological data in graph form are presented. The thesis contributes to the scientific literature in terms of results and resources. Most of the material presented in this thesis derives from research papers published in international journals or conference proceedings.

    Estimation of Relevant Variables on High-Dimensional Biological Patterns Using Iterated Weighted Kernel Functions

    BACKGROUND: The analysis of complex proteomic and genomic profiles involves the identification of significant markers within a set of hundreds or even thousands of variables that represent a high-dimensional problem space. The occurrence of noise, redundancy or combinatorial interactions in the profile makes the selection of relevant variables harder. METHODOLOGY/PRINCIPAL FINDINGS: Here we propose a method to select variables based on estimated relevance to hidden patterns. Our method combines a weighted-kernel discriminant with an iterative stochastic probability estimation algorithm to discover the relevance distribution over the set of variables. We verified the ability of our method to select predefined relevant variables in synthetic proteome-like data and then assessed its performance on biological high-dimensional problems. Experiments were run on serum proteomic datasets of infectious diseases. The resulting variable subsets achieved classification accuracies of 99% on Human African Trypanosomiasis, 91% on Tuberculosis, and 91% on Malaria serum proteomic profiles with fewer than 20% of variables selected. Our method scaled up to dimensionalities of much higher orders of magnitude, as shown with gene expression microarray datasets in which we obtained classification accuracies close to 90% with fewer than 1% of the total number of variables. CONCLUSIONS: Our method consistently found relevant variables attaining high classification accuracies across synthetic and biological datasets. Notably, it yielded very compact subsets compared to the original number of variables, which should simplify downstream biological experimentation.

    Genetic and environmental prediction of opioid cessation using machine learning, GWAS, and a mouse model

    The United States is currently experiencing an epidemic of opioid use, opioid use disorder, and overdose-related deaths. While studies have identified several loci associated with opioid use disorder (OUD) risk, the genetic basis for the ability to discontinue opioid use has not been investigated. Furthermore, very few studies have investigated the non-genetic factors that predict opioid cessation, or how well they predict it. In this thesis, I studied a novel phenotype, opioid cessation, defined as the time since last use of illicit opioids (1 year ago as cease) among persons meeting lifetime DSM-5 criteria for OUD. In chapter two, I identified novel genetic variants and biological pathways that potentially regulate opioid cessation success through a genome-wide study, as well as genetic overlap between opioid cessation and other substance cessation traits. In chapter three, I identified multiple non-genetic risk factors, specific to each racial group, that are predictive of opioid cessation in the same individuals analyzed in chapter two, by applying several linear and non-linear machine learning techniques to a set of more than 3,000 variables assessed by a structured psychiatric interview. Factors identified by this atheoretical approach can be grouped into opioid use activities, other drug use, health conditions, and demographics, and a predictive accuracy of nearly 80% was achieved. The findings from this research generated further hypotheses for future studies. In chapter four, I performed differential gene expression and network analysis on mice with different oxycodone (an opioid receptor agonist)-induced behaviors and compared the significantly associated genes and network modules with top-ranked genes identified in humans. The pathway cross-talk and gene homologs identified in both species illuminate the potential molecular mechanism of opioid behaviors.
In summary, this thesis used statistical genetics, machine learning, and a computational biology framework to address factors associated with opioid cessation in humans, and cross-referenced the genetic findings in a mouse model. These findings serve as references for future studies and provide a framework for personalizing the treatment of OUD.

    A Multi-label Text Classification Framework: Using Supervised and Unsupervised Feature Selection Strategy

    Text classification, the task of assigning metadata to documents, requires significant time and effort when done by a person. As online-generated content grows explosively, manually annotating large-scale unstructured data becomes a challenge. Recently, various state-of-the-art text mining methods have been applied to the classification process based on keyword extraction. However, when these keywords are used as features in the classification task, the number of feature dimensions is commonly large. In addition, how to select keywords from documents as features is itself a major challenge. In particular, when traditional machine learning algorithms are applied to big data, the computation time is very long. Moreover, about 80% of real-world data is unstructured and unlabeled. Conventional supervised feature selection methods cannot be used directly to select entities from massive data. Usually, statistical strategies are used to extract features from unlabeled data for classification tasks according to their importance scores. We propose a novel method to extract key features effectively before feeding them into the classification task. Another challenge in text classification is the multi-label problem, the assignment of multiple non-exclusive labels to documents. This problem makes text classification more complicated than single-label classification. To address these issues, we develop a framework for extracting data and reducing data dimensionality to solve the multi-label problem on labeled and unlabeled datasets. To reduce data dimensionality, we develop a hybrid feature selection method that extracts meaningful features according to the importance of each feature. Word2Vec is applied to represent each document by a feature vector for document categorization on the big dataset.
The unsupervised approach is used to extract features from real online-generated data for text classification. Our unsupervised feature selection method is applied to extract depression symptoms from social media such as Twitter. In the future, these depression symptoms will be used for depression self-screening and diagnosis.

    Gene selection and classification in autism gene expression data

    Autism spectrum disorders (ASD) are neurodevelopmental disorders that are currently diagnosed on the basis of abnormal stereotyped behaviour as well as observable deficits in communication and social functioning. Although a variety of candidate genes have been attributed to the disorder, no single gene is applicable to more than 1–2% of the general ASD population. Despite extensive efforts, definitive genes that contribute to autism susceptibility have yet to be identified. The major problems in dealing with autism gene expression datasets are the limited number of samples and the high level of noise due to experimental measurement errors and natural variation. In this study, a systematic combination of three filters, namely the t-test (TT), Wilcoxon Rank Sum (WRS) and Feature Correlation (COR), is applied along with an efficient wrapper algorithm based on geometric binary particle swarm optimization-support vector machine (GBPSO-SVM), with the aim of selecting and classifying the genes most strongly associated with autism. A new approach based on the criteria of median ratio, mean ratio and variance deviations is also applied to reduce the initial dataset before it is used. Results showed that one gene (CAPS2) recurred among the most discriminative genes identified in the first and last selection steps, and it was designated the top ASD risk gene. The fused result of the gene subsets selected by the GBPSO-SVM algorithm increased the classification accuracy to about 92.10%, which is higher than the accuracies reported in the literature for the same autism dataset. Notably, an ensemble using random forest (RF) performed better than previous studies. However, the ensemble approach that used an SVM as the integrator of the fused genes from the output branches of GBPSO-SVM outperformed the RF integrator.
The overall improvement was ascribed to the selection strategies used to reduce the dataset and to the use of the efficient wrapper-based GBPSO-SVM algorithm.
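As a rough illustration of what a t-test filter stage like the one named in this abstract might look like, the sketch below ranks genes by the absolute value of Welch's t-statistic between two groups. It is only a minimal sketch under stated assumptions: the function names, the toy expression values and the choice of Welch's variant are hypothetical, and it does not reproduce the WRS or COR filters or the GBPSO-SVM wrapper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def rank_genes(expr, labels, top_k):
    """Return the top_k gene names ranked by |t| between label 1 and label 0.

    expr   : dict mapping a gene name to a list of expression values
    labels : list of 0/1 class labels, aligned with the value lists
    """
    scores = {}
    for gene, values in expr.items():
        case = [v for v, y in zip(values, labels) if y == 1]
        ctrl = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(welch_t(case, ctrl))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (hypothetical): three cases followed by three controls.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "gene_a": [5.1, 5.3, 4.9, 1.0, 1.2, 0.9],  # strongly separated groups
    "gene_b": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no separation
    "gene_c": [3.0, 3.4, 2.8, 2.1, 2.5, 1.9],  # moderate separation
}
top = rank_genes(expr, labels, top_k=2)
print(top)
```

A filter of this kind is cheap because each gene is scored independently; a wrapper method would then search over the surviving genes with a classifier in the loop.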

    Building an automated platform for the classification of peptides/proteins using machine learning

    Master's dissertation in Bioinformatics. One of the challenging problems in bioinformatics is to computationally characterize the sequences, structures and functions of proteins. Sequence-derived structural and physicochemical properties of proteins have been used in the development of machine learning models for protein-related problems. However, tools and platforms to calculate features and perform machine learning (ML) with proteins are scarce and have limitations in terms of effectiveness, user-friendliness and capacity. Here, a generic, modular, automated platform for the classification of proteins based on their physicochemical properties using different ML algorithms is proposed. The tool, developed as a Python package, facilitates the major tasks of ML and includes modules to read and alter sequences, calculate protein features, preprocess datasets, execute feature reduction and selection, perform clustering, train and optimize ML models and make predictions. As it is modular, the user retains the power to alter the code to fit specific needs. This platform was tested to predict membrane-active anticancer and antimicrobial peptides and further used to explore viral fusion peptides. Membrane-interacting peptides play a crucial role in several biological processes. Fusion peptides are a subclass found in enveloped viruses that are particularly relevant for membrane fusion. Determining which properties characterize fusion peptides and distinguish them from other proteins is a very relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, different feature extraction techniques and feature selection methods (resulting in a total of over 20 datasets), seven ML models were trained and tested, using cross-validation for error estimation and grid search for model selection. The different models, feature sets and feature selection techniques were compared.
The best models obtained for distinct metrics were then used to predict the location of a known fusion peptide in a protein sequence from the Dengue virus. Feature importances were also analysed. The models obtained will be useful in future research, also providing biological insight into the distinctive physicochemical characteristics of fusion peptides. This work presents a freely available tool to perform ML-based protein classification and the first global analysis and prediction of viral fusion peptides using ML, reinforcing the usability and importance of ML in protein classification problems.
One of the most challenging problems in bioinformatics is the characterization of protein sequences, structures and functions. Physicochemical and structural properties derived from the protein sequence have been used in the development of machine learning (ML) models. However, tools to calculate these features are scarce and have limitations in terms of efficiency, ease of use and capacity to adapt to different problems. Here, a generic, modular and automated platform for the classification of proteins based on their physicochemical properties, making use of different ML algorithms, is described. The tool developed facilitates the main ML tasks and includes modules to read and alter sequences, calculate protein features, preprocess data, perform feature reduction and selection, run clustering, build ML models and make predictions. As it is built in a modular way, the user retains the power to alter the code to meet specific needs. This platform was tested with anticancer and antimicrobial peptides and was further used to explore viral fusion peptides.
Fusion peptides are a class of membrane-interacting peptides found in enveloped viruses that are particularly relevant for the fusion of the viral membrane with the host membrane. Determining which properties characterize them is a very relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, four different feature extraction techniques and five different feature selection methods (a total of 24 datasets tested), seven ML models were trained and tested with 10-fold cross-validation and a grid search approach. The best models obtained, with MCC scores between 0.7 and 0.8 and accuracy between 0.85 and 0.9, were used to predict the location of a known fusion peptide in a sequence of the Dengue virus fusion protein. The models obtained to predict the location of the fusion peptide are useful in future research, also providing biological insight into the distinctive physicochemical characteristics of fusion peptides. This work presents a freely available tool to perform protein classification with ML and the first global analysis of viral fusion peptides using ML-based methods, reinforcing the usability and importance of ML in protein classification problems.
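The MCC (Matthews correlation coefficient) scores quoted for the best models can be read against the standard definition of the metric for binary labels. The sketch below computes it from the confusion-matrix counts; the function name and the toy labels are hypothetical and are not taken from the dissertation's package.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


# Toy example (hypothetical): one false negative and one false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(mcc(y_true, y_pred))
```

Unlike plain accuracy, the MCC uses all four confusion-matrix cells, which is why it is often preferred when the classes are imbalanced.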

    Machine Learning na previsão de Cancro Colorretal em função de alterações metabólicas

    In today's world, the amount of information available in various sectors is increasing. That is the case in healthcare, where the collection and processing of biomedical data seek to improve decision-making about the treatment to be applied to a patient, using Machine Learning-based tools. Machine Learning is an area of Artificial Intelligence in which applying algorithms to a dataset makes it possible to predict results or even discover relationships that would be unnoticeable at first glance. This project's main objective is to study several Machine Learning algorithms and techniques to identify whether the acylcarnitine profile may constitute a new biochemical marker for the prediction and prognosis of Colorectal Cancer. In the course of the work, different algorithms and data preprocessing techniques were tested. Three different experiments were carried out to validate the predictions of the models built for different scenarios, namely: predicting whether the patient has Colorectal Cancer, predicting which disease the patient has (Colorectal Cancer and other metabolic diseases), and predicting whether the patient has any disease. In a first analysis, the developed models showed good results in Colorectal Cancer screening. The best results were obtained by the Random Forest and Gradient Boosting algorithms, together with data balancing and feature selection techniques, namely Random Oversampling, Synthetic Oversampling and Recursive Feature Selection.
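Of the data balancing techniques named in this abstract, random oversampling is the simplest to illustrate: minority-class samples are duplicated at random until every class is as large as the largest one. The sketch below is a minimal standard-library version under stated assumptions; the function name, the seed parameter and the toy data are hypothetical, and it is not the thesis's implementation.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until all classes match
    the size of the largest class. Returns new lists; inputs are unchanged."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        indices = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(indices)  # sample with replacement
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


# Toy data (hypothetical): four samples of class 0, one of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

In practice, oversampling is usually applied only to the training portion of each split, so that duplicated samples do not leak into the data used for evaluation.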