4,156 research outputs found

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.

    New data analytics and visualization methods in personal data mining, cancer data analysis and sports data visualization

    In this dissertation, we discuss a reading profiling system, a biological data visualization system, and a sports visualization system. Self-tracking is becoming increasingly popular in the field of personal informatics, and reading profiling can be used as a personal data collection method. We present UUAT, an unintrusive user attention tracking system. In UUAT, we use user interaction data to develop technologies that help pinpoint a user's reading region (RR). Based on the computed RR and user interaction data, UUAT can identify a reader's reading struggle or interest. A biomarker is a measurable substance that may be used as an indicator of a particular disease. We developed CancerVis for visual and interactive analysis of cancer data and demonstrate how to apply this platform in cancer biomarker research. CancerVis provides multiple interactive views of a dataset from different perspectives. The views are synchronized so that users can easily link them to the same data entry. Furthermore, CancerVis supports data mining practice in cancer biomarker research, such as visualization of optimal cutpoints and cutthrough exploration. Tennis match summarization helps after-live sports consumers assimilate a match of interest. We developed TennisVis, a comprehensive match summarization and visualization platform. TennisVis offers charts and graphs that let a client quickly get match facts. TennisVis also offers various queries of tennis points to satisfy the diverse preferences of tennis fans (such as volley shots or many-shot rallies). Furthermore, TennisVis offers video clips for every single tennis point, and a recommendation rating is computed for each tennis play. A case study shows that TennisVis identifies more than 75% of tennis points in a full-time match.

    Highly accurate detection of ovarian cancer using CA125 but limited improvement with serum matrix-assisted laser desorption/ionization time-of-flight mass spectrometry profiling

    Objectives: Our objective was to test the performance of CA125 in classifying serum samples from a cohort of malignant and benign ovarian cancers and age-matched healthy controls and to assess whether combining information from matrix-assisted laser desorption/ionization (MALDI) time-of-flight profiling could improve diagnostic performance. Materials and Methods: Serum samples from women with ovarian neoplasms and healthy volunteers were subjected to CA125 assay and MALDI time-of-flight mass spectrometry (MS) profiling. Models were built from training data sets using discriminatory MALDI MS peaks in combination with CA125 values and were tested for their ability to classify blinded test samples. These were compared with models using CA125 threshold levels from 193 patients with ovarian cancer, 290 with benign neoplasm, and 2236 postmenopausal healthy controls. Results: Using a CA125 cutoff of 30 U/mL, an overall sensitivity of 94.8% (96.6% specificity) was obtained when comparing malignancies versus healthy postmenopausal controls, whereas a cutoff of 65 U/mL provided a sensitivity of 83.9% (99.6% specificity). High classification accuracies were obtained for early-stage cancers (93.5% sensitivity). Reasons for high accuracies include recruitment bias, restriction to postmenopausal women, and inclusion of only primary invasive epithelial ovarian cancer cases. The combination of MS profiling information with CA125 did not significantly improve the specificity/accuracy compared with classifications on the basis of CA125 alone. Conclusions: We report unexpectedly good performance of serum CA125 using threshold classification in discriminating healthy controls and women with benign masses from those with invasive ovarian cancer. This highlights the dependence of diagnostic tests on the characteristics of the study population and the crucial need for authors to provide sufficient relevant details to allow comparison.
Our study also shows that MS profiling information adds little to diagnostic accuracy. This finding is in contrast with other reports and shows the limitations of serum MS profiling for biomarker discovery and as a diagnostic tool.
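
The threshold classification described in this abstract can be illustrated with a minimal sketch. It assumes a sample is called positive when its marker value is at or above the cutoff (the abstract does not say whether the comparison is inclusive), and the marker values and labels below are invented toy data, not the study's measurements.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    values: Sequence[float], labels: Sequence[bool], cutoff: float
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for a single-marker threshold rule.

    Assumption: a sample is classified positive when value >= cutoff.
    labels[i] is True for a diseased sample and False for a control.
    """
    tp = sum(1 for v, y in zip(values, labels) if y and v >= cutoff)
    fn = sum(1 for v, y in zip(values, labels) if y and v < cutoff)
    tn = sum(1 for v, y in zip(values, labels) if not y and v < cutoff)
    fp = sum(1 for v, y in zip(values, labels) if not y and v >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical values in U/mL; not taken from the study).
ca125 = [12.0, 25.0, 40.0, 70.0, 150.0, 28.0, 33.0, 500.0]
is_cancer = [False, False, False, False, True, True, True, True]

for cutoff in (30.0, 65.0):
    sens, spec = sensitivity_specificity(ca125, is_cancer, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On the toy data, raising the cutoff lowers sensitivity and raises specificity, which is the same direction of trade-off the abstract reports between the 30 U/mL and 65 U/mL cutoffs.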

    miRetrieve-an R package and web application for miRNA text mining

    microRNAs (miRNAs) regulate gene expression and thereby influence biological processes in health and disease. As a consequence, miRNAs are intensely studied and the literature on miRNAs has been constantly growing. While this growing body of literature reflects the interest in miRNAs, it makes maintaining an overview challenging, and comparing miRNAs that may function across diverse disease fields is complex because of the large number of relevant publications. To address these challenges, we designed miRetrieve, an R package and web application that provides an overview of miRNAs. By text mining, miRetrieve can characterize and compare miRNAs within specific disease fields and across disease areas. This overview provides focus and facilitates the generation of new hypotheses. Here, we explain how miRetrieve works and how it is used. Furthermore, we demonstrate its applicability in an exemplary case study and discuss its advantages and disadvantages.

    The combination of the disciplines of Techmining and semantic TRIZ for better and faster analyzing technology evolution

    Thesis by compendium of publications. The purpose of the present thesis is to explore and demonstrate how the combination of two methodological approaches, text mining and the systemic vision of TRIZ empowered by semantics, can provide a broader and more comprehensive analysis of the evolution of a technology. The two approaches had not been combined before the first of the four papers that constitute this thesis, which is based on a compendium of publications. However, this combination, applied to the evolution of technologies, is increasingly being published in the scientific literature. The combination shows a second benefit in the form of improved access to, and connection of, knowledge from disparate scientific literatures in a systematic manner. The common element in all these papers is the use of the technology mining approach, 'techmining', the application of text mining techniques based on technology management knowledge, combined with the use of semantic TRIZ, the systemic vision of TRIZ enhanced by syntactic and semantic analysis. These papers show that a better analysis of evolving technologies, e.g. by profiling technologies from a systemic point of view, or better access to knowledge, e.g. by semantically connecting concepts with meaning, can be achieved. The research on applying the combination of these approaches to scientific and technological information analysis explores the advantages and new possibilities for technology trends assessment, as well as the semantic connection of concepts, which represents a change in the way information research can be done. The different applications of this combination are explored by means of the articles presented here. The structure followed in this research is the collection of three papers published in international academic journals indexed in the most prestigious databases and one chapter in the proceedings book of an international congress.
The attached articles show the research undertaken to demonstrate the aforementioned benefits of the proposed combination. Although many methods and approaches for assessing the evolution of technologies can be found across the literature, there is still a need to better understand which technologies may emerge, which may evolve faster, and at what pace they can reach the market. The combination of the techmining and semantic TRIZ approaches allows the trends to be understood, enriched with a systemic vision of the links, functions, and influences of the constituent and enabling elements of a technology. Such a systemic link between elements and their components and ecosystem also allows for a multi-dimensional view of a technology and further reduces the uncertainty in forecasting its progress. The papers presented in this dissertation are based on the combination of the TRIZ methodology, the techmining approach and the semantic TRIZ approach, applied to different technologies in different domains, to prove the advantages and implications of the combination. The articles explore the different interactions of the combined approaches, applied to the assessment of different technologies, such as lithium batteries for the electric car, a medical case linked to a disease known as Ménière's disease, the prognosis of prostate cancer, and the use of probiotics as substitutes for antibiotics in animal health. The wide range of technologies was selected to show the clear benefits of either combining the two approaches or applying predominantly one of them, as in the case of the Ménière's disease article.
That difference in the nature of the technologies also helped to better understand the systemic point of view of a technology, exploring new applications based on Bertalanffy's general system theory as well as other related approaches to technologies.
The purpose of the present thesis is to explore and demonstrate how the combination of two methodological approaches, text mining and the systemic vision of TRIZ reinforced with semantics, can provide a broader and more exhaustive analysis of the evolution of a technology. The two approaches had not been combined before the first of the four articles that make up this thesis by compendium of publications, although the combination has since been increasingly published in the scientific literature for multiple purposes. A second contribution of this combination is improved access to knowledge, and the advance this represents for discovery across unrelated literatures ("disparate literature discovery") in a methodical and scientific way. The common element in the articles presented here is the use of techmining, that is, text mining grounded in technology management, for example through technology profiling, together with the TRIZ methodology enhanced by syntactic and semantic analysis, that is, through the semantic connection of concepts, for a more complete analysis of technological evolution while providing more rational access to knowledge. The research on applying this combination to the analysis of scientific and technological information explores the advantages and new possibilities in assessing the progress of technology, as well as the semantic connection of concepts, which represents new possibilities for how textual research can be done.
The structure of the research presented here is shown through the articles published in high-impact international journals and the chapter in the proceedings of an international congress. These articles show the research carried out to demonstrate the aforementioned benefits of the proposed combination. Despite the intense research activity and the existence of several approaches to technology foresight and forecasting in the scientific literature, there is still a need to understand which technologies may emerge, which may evolve faster, and how quickly they can reach the market. The combination of the technology mining (techmining) and semantic TRIZ approaches makes it possible to understand the trends of a given technology, enriched with a systemic view of it and taking into account the connections among its elements and the influences of its constituent elements. This connection between the components and their environment allows a multidimensional view of the technology, further reducing the uncertainty in forecasting its evolution. The articles presented in this thesis are applications and explorations of this combination in different technologies from very disparate fields, in order to demonstrate its advantages and implications. The articles address the different interactions between the two approaches, applied to technologies such as lithium batteries for electric vehicles, a medical case linked to a condition such as Ménière's syndrome, the prognosis of prostate cancer, and the use of probiotics in animal feed as a substitute for antibiotics. This wide range of technologies was selected to show, more objectively, the advantages of combining the two approaches, or of relying predominantly on one of them, as in the article exploring Ménière's syndrome.
These explorations also make it possible to better understand the systemic point of view of a technology, discovering new applications based on Bertalanffy's general systems theory as well as on other related approaches.
The purpose of the present thesis is to explore and demonstrate the combination of two methodological approaches, text mining and the systemic vision of TRIZ reinforced with syntax and semantics, showing that they can offer a broader and more holistic understanding of the evolution of a technology. The two approaches had not been combined before the first of the four articles that make up this thesis, but they have been increasingly combined in the scientific literature for multiple purposes since that first publication. A second contribution of this combination is improved access to knowledge, and the advance this represents for research across unrelated literatures ("disparate literature discovery") in a methodical and scientific way. The common element in the articles presented in this thesis is the use of text mining grounded in technology management, 'techmining', for example through technology profiling, alongside the TRIZ methodology enhanced by syntactic and semantic analysis, through the semantic connection of concepts, to achieve a more complete analysis of technological evolution and to ensure more rational access to knowledge. The research on applying the combination of the two approaches to the analysis of scientific and technological information explores the advantages and new possibilities in assessing the progress of technologies, as well as the connection of concepts, which represents new possibilities for how textual research can be done. The structure of the research presented here is shown through the published articles and the chapter in the proceedings of an international congress.
These articles show the research carried out to demonstrate the aforementioned benefits. Despite the intense research activity and the approaches to technology foresight and forecasting in the scientific literature, there is still a need to understand which technologies may emerge, which may evolve faster, and how quickly they can reach the market. The combination of the technology mining ('techmining') and semantic TRIZ approaches makes it possible to understand the trends of a given technology, with a view of its system, the connections among its elements, and the influences of its constituent elements. The articles presented in this thesis are applications and explorations of the combination of the TRIZ methodology, its enhancement through semantics, and techmining in different technologies from various fields, some very disparate, in order to demonstrate its advantages and implications. The articles address the different interactions between the two approaches, applied to technologies such as lithium batteries for electric vehicles, a medical case linked to a disease such as Ménière's syndrome, the prognosis of prostate cancer, and, in nutrition, the use of probiotics in animal feed as a substitute for antibiotics. This wide range of technologies was selected to show, more objectively, the advantages of combining the two approaches, or of relying predominantly on one of them, as in the article exploring Ménière's syndrome. These explorations also make it possible to better understand the systemic point of view of a technology, discovering new applications based on Bertalanffy's general systems theory as well as other related work.
Vicente Gomila, JM. (2017). The combination of the disciplines of Techmining and semantic TRIZ for better and faster analyzing technology evolution [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/89088

    MeInfoText 2.0: gene methylation and cancer relation extraction from biomedical literature

    Background: DNA methylation is regarded as a potential biomarker in the diagnosis and treatment of cancer. The relations between aberrant gene methylation and cancer development have been identified by a number of recent scientific studies. In a previous work, we used co-occurrences to mine those associations and compiled the MeInfoText 1.0 database. To reduce the amount of manual curation and improve the accuracy of relation extraction, we have now developed MeInfoText 2.0, which uses a machine learning-based approach to extract gene methylation-cancer relations. Description: Two maximum entropy models are trained to predict if aberrant gene methylation is related to any type of cancer mentioned in the literature. After evaluation based on 10-fold cross-validation, the average precision/recall rates of the two models are 94.7%/90.1% and 91.8%/90%, respectively. MeInfoText 2.0 provides the gene methylation profiles of different types of human cancer. The extracted relations with maximum probability, evidence sentences, and specific gene information are also retrievable. The database is available at http://bws.iis.sinica.edu.tw:8081/MeInfoText2/. Conclusion: The previous version, MeInfoText, was developed by using association rules, whereas MeInfoText 2.0 is based on a new framework that combines machine learning, dictionary lookup and pattern matching for epigenetics information extraction. The results of experiments show that MeInfoText 2.0 outperforms existing tools in many respects. To the best of our knowledge, this is the first study that uses a hybrid approach to extract gene methylation-cancer relations. It is also the first attempt to develop a gene methylation and cancer relation corpus.
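
As a rough illustration of how average precision/recall figures from k-fold cross-validation can be computed, the sketch below averages per-fold scores. This is an assumption about the averaging scheme (the abstract does not state whether scores were averaged per fold or pooled), and the fold counts are invented toy numbers, not results from MeInfoText 2.0.

```python
from typing import Sequence, Tuple

# (true positives, false positives, false negatives) for one fold
FoldCounts = Tuple[int, int, int]


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_over_folds(fold_counts: Sequence[FoldCounts]) -> Tuple[float, float]:
    """Average the per-fold precision and recall (assumed averaging scheme)."""
    if not fold_counts:
        raise ValueError("at least one fold is required")
    scores = [precision_recall(tp, fp, fn) for tp, fp, fn in fold_counts]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n


# Two toy folds (hypothetical counts); a 10-fold run would supply ten tuples.
toy_folds = [(3, 1, 1), (1, 1, 3)]
avg_precision, avg_recall = average_over_folds(toy_folds)
print(f"average precision={avg_precision:.3f}, average recall={avg_recall:.3f}")
```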

    A text mining based approach for biomarker discovery

    Master's dissertation in Bioinformatics. Biomarkers have long been heralded as potential motivators for the emergence of new treatment and diagnostic procedures for disease conditions. However, for many years, the biomarker discovery process could only be achieved through experimental means, which deterred their rise in popularity, as the usually large number of candidates resulted in a costly and time-consuming discovery process. The increase in computational capabilities has led to a change in the paradigm of biomarker discovery, migrating from the clinical laboratory to in silico environments. Furthermore, text mining, the act of automatically extracting information from text through computational means, has seen a rise in popularity in the biomedical fields. The number of studies and clinical trials in these fields has greatly increased in the past years, making the task of manually examining and annotating these, at the very least, incredibly cumbersome. Adding to this, even though the development of efficient and thorough natural language processing is still an ongoing process, the potential for the discovery of commonly reported and hidden behaviours in the scientific literature is too high to be ignored. Several tools, technologies, pipelines and frameworks already exist that are capable of, at least, giving a glimpse of how the analysis of the available pile of scientific literature can pave the way for the development of novel medical techniques that might help in the prevention, diagnosis and treatment of diseases.
As such, a novel approach to biomarker discovery is presented in this work, one that integrates both gene-disease associations extracted from current biomedical literature and RNA-Seq gene expression data in an L1-regularized mixed-integer linear programming model for identifying potential biomarkers, potentially providing an optimal and robust genetic signature for disease diagnosis and helping identify novel biomarker candidates. This analysis was carried out on five publicly available RNA-Seq datasets obtained from the Genomic Data Commons Data Portal, related to breast, colon, lung and prostate cancer, and head and neck squamous cell carcinoma. Hyperparameter optimization was also performed for this approach, and the performance of the optimal set of parameters was compared against other machine learning methods.
Biomarkers have long been considered the main drivers of the development of new diagnostic and treatment procedures for diseases. However, until relatively recently, the biomarker discovery process depended on experimental methods, which deterred their large-scale application and study, since the high number of candidates implied an extremely costly and time-consuming investigation process. The large increase in computational power in recent decades has reversed this trend, leading to the migration of the biomarker discovery process from the laboratory to the in silico environment. In addition, the application of text mining processes, which consist of extracting information from documents through computational means, has grown in popularity in the biomedical community owing to the exponential increase in the number of studies and clinical trials in this area, which makes manually analysing and annotating them very laborious.
Adding to this, although the development of efficient methods capable of fully processing natural language is still ongoing, the potential for discovering reported and hidden behaviours in the literature is too high to be ignored. Several tools and technologies already exist that are capable of, at least, giving an indication of how the analysis of the available scientific literature can pave the way for the development of new medical techniques and procedures that may help in the prevention, diagnosis and treatment of diseases. As such, this work presents a new method for biomarker discovery that simultaneously considers gene-disease associations already extracted from the biomedical literature and RNA-Seq gene expression data in an L1-regularized linear optimization model with continuous and integer variables (MILP) to identify possible biomarkers, potentially providing optimal and robust genetic signatures for disease diagnosis and helping to identify new biomarker candidates. This analysis was carried out on five RNA-Seq datasets obtained through the Genomic Data Commons (GDC) Data Portal, related to breast, colon, lung and prostate cancers and head and neck squamous cell carcinoma. Hyperparameter optimization was also performed for this method, and the performance of the optimal set of parameters was compared with that of other machine learning methods.

    The potential of text mining in data integration and network biology for plant research : a case study on Arabidopsis

    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present an extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, thereby stimulating the application of text mining data in future plant biology studies.
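
The idea of grouping genes from a merged interaction network can be sketched with connected components, a much cruder stand-in for the "clusters of tightly connected genes" the abstract mentions (the abstract does not name the clustering method used). The gene identifiers and edges below are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def connected_components(edges: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group nodes of an undirected graph into connected components via BFS."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen: Set[str] = set()
    components: List[List[str]] = []
    for start in sorted(graph):  # sorted for a deterministic result
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component: List[str] = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(graph[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical edges pooled from text mining and experimental databases.
text_mined = [("geneA", "geneB"), ("geneD", "geneE")]
experimental = [("geneB", "geneC")]
clusters = connected_components(text_mined + experimental)
print(clusters)
```

Under guilt by association, an uncharacterized gene in a group would be hypothesized to share a function with its better-characterized neighbours; a real analysis would use a density-aware clustering method instead of plain components.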