Subgroup Discovery: Real-World Applications
Subgroup discovery is a data mining technique which extracts interesting rules with respect
to a target variable. An important characteristic of this task is the combination of predictive
and descriptive induction. This paper presents an overview of subgroup discovery.
In addition, it presents different real-world applications solved through evolutionary algorithms,
which show the suitability and potential of this type of algorithm for the development of subgroup
discovery algorithms.
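As a rough illustration of what "interesting rules with respect to a target variable" can mean, the sketch below scores one candidate rule with weighted relative accuracy (WRAcc), a quality measure commonly used in subgroup discovery. The abstract does not name a quality measure, so WRAcc, the function name `wracc`, and the toy records are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def wracc(records: List[Record],
          condition: Callable[[Record], bool],
          target: Callable[[Record], bool]) -> float:
    """Weighted relative accuracy: p(cond) * (p(target | cond) - p(target))."""
    n = len(records)
    if n == 0:
        return 0.0
    covered = [r for r in records if condition(r)]
    if not covered:
        return 0.0
    p_cond = len(covered) / n
    p_target = sum(1 for r in records if target(r)) / n
    p_target_given_cond = sum(1 for r in covered if target(r)) / len(covered)
    return p_cond * (p_target_given_cond - p_target)


# Hypothetical toy data, not taken from the paper.
records: List[Record] = [
    {"age": "old", "smoker": True, "disease": True},
    {"age": "old", "smoker": True, "disease": True},
    {"age": "old", "smoker": False, "disease": False},
    {"age": "young", "smoker": True, "disease": False},
    {"age": "young", "smoker": False, "disease": False},
    {"age": "young", "smoker": False, "disease": False},
]

# Candidate subgroup: "age = old AND smoker = True", target: "disease = True".
score = wracc(
    records,
    condition=lambda r: r["age"] == "old" and bool(r["smoker"]),
    target=lambda r: bool(r["disease"]),
)
print(f"WRAcc = {score:.4f}")
```

A positive score means the target is more frequent inside the subgroup than in the whole data set; a search procedure such as an evolutionary algorithm would compare many candidate conditions by a measure of this kind.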
Systematising and scaling literature curation for genetically determined developmental disorders
The widespread availability of genomic sequencing has transformed the diagnosis of genetically-determined developmental disorders (GDD). However, this type of test often generates a number of genetic variants, which have to be reviewed and related back to the clinical features (phenotype) of the individual being tested. This frequently entails a time-consuming review of the peer-reviewed literature to look for case reports describing variants in the gene(s) of interest. This is particularly true for newly described and/or very rare disorders not covered in phenotype databases. Therefore, there is a need for scalable, automated literature curation to increase the efficiency of this process. This should lead to improvements in the speed in which diagnosis is made, and an increase in the number of individuals who are diagnosed through genomic testing.
Phenotypic data in case reports/case series is not usually recorded in a standardised, computationally-tractable format. Plain text descriptions of similar clinical features may be recorded in several different ways. For example, a technical term such as ‘hypertelorism’, may be recorded as its synonym ‘widely spaced eyes’. In addition, case reports are found across a wide range of journals, with different structures and file formats for each publication.
The Human Phenotype Ontology (HPO) was developed to store phenotypic data in a computationally-accessible format. Several initiatives have been developed to link diseases to phenotype data, in the form of HPO terms. However, these rely on manual expert curation and therefore are not inherently scalable, and cannot be updated automatically.
Methods of extracting phenotype data from text at scale developed to date have relied on abstracts or open access papers. At the time of writing, Europe PubMed Central (EPMC, https://europepmc.org/) contained approximately 39.5 million articles, of which only 3.8 million were open access. Therefore, there is likely a significant volume of phenotypic data which has not been used previously at scale, due to difficulties accessing non-open access manuscripts.
In this thesis, I present a method for literature curation which can utilise all relevant published full text through a newly developed package which can download almost all manuscripts licenced by a university or other institution. This is scalable to the full spectrum of GDD. Using manuscripts identified through manual literature review, I use a full text download pipeline and NLP (natural language processing) based methods to generate disease models. These comprise HPO terms weighted according to their frequency in the literature. I demonstrate iterative refinement of these models, and use a custom annotated corpus of 50 papers to show that the text mining process has high precision and recall. I demonstrate that these models clinically reflect true disease expressivity, as defined by manual comparison with expert literature reviews, for three well-characterised GDD.
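The idea of a disease model made of terms weighted by their frequency in the literature can be sketched as follows. This is a minimal illustration only: it assumes the weight of a term is the fraction of papers that mention it, which the abstract does not specify, and it uses plain term labels rather than HPO identifiers. The names `SYNONYMS`, `normalise` and `build_disease_model` are hypothetical; the single synonym pair is the example given earlier in the abstract.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set

# Hypothetical synonym table; the only pair taken from the text is
# 'widely spaced eyes' -> 'hypertelorism'.
SYNONYMS: Dict[str, str] = {"widely spaced eyes": "hypertelorism"}


def normalise(term: str) -> str:
    """Map a free-text term to its canonical label (lower-cased)."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)


def build_disease_model(papers: Iterable[Iterable[str]]) -> Dict[str, float]:
    """Weight each term by the fraction of papers that mention it."""
    term_sets: List[Set[str]] = [{normalise(t) for t in paper} for paper in papers]
    if not term_sets:
        return {}
    counts = Counter(term for terms in term_sets for term in terms)
    return {term: count / len(term_sets) for term, count in counts.items()}


# Hypothetical terms extracted from four papers on one disorder.
papers = [
    ["Hypertelorism", "seizure"],
    ["widely spaced eyes"],
    ["seizure", "intellectual disability"],
    ["hypertelorism"],
]
model = build_disease_model(papers)
for term, weight in sorted(model.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.2f}")
```

Counting each term once per paper keeps a single long case series from dominating the model; other weighting choices are equally plausible.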
I compare these disease models to those in the most commonly used genetic disease phenotype databases. I show that the automated disease models have increased depth of phenotyping, i.e. there are more terms than those which are manually-generated. I show that, in comparison to ‘real life’ prospectively gathered phenotypic data, automated disease models outperform existing phenotype databases in predicting diagnosis, as defined by increased area under the curve (by 0.05 and 0.08 using different similarity measures) on ROC curve plots.
I present a method for automated PubMed search at scale, to use as input for disease model generation. I annotated a corpus of 6500 abstracts. Using this corpus I show a high precision (up to 0.80) and recall (up to 1.00) for machine learning classifiers used to identify manuscripts relevant to GDD. These use hand-picked domain-specific features, for example utilising specific MeSH terms. This method can be used to scale automated literature curation to the full spectrum of GDD. I also present an analysis of the phenotypic terms used in one year of GDD-relevant papers in a prominent journal. This shows that use of supplemental data and parsing clinical report sections from manuscripts is likely to result in more patient-specific phenotype extraction in future.
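To make the reported figures concrete, the sketch below computes precision and recall for a binary "relevant to GDD" classifier. The labels are hypothetical toy data chosen so that the result matches the upper values quoted above (precision 0.80, recall 1.00); they are not data from the thesis, and `precision_recall` is an illustrative helper, not part of the described pipeline.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (precision, recall) for binary labels where 1 means 'relevant'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for ten abstracts: four are truly relevant.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# The classifier retrieves all four relevant abstracts plus one irrelevant one.
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Read this way, a recall of 1.00 means no relevant abstract is missed, while a precision of 0.80 means one in five retrieved abstracts would still need to be discarded downstream.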
In summary, I present a method for automated curation of full text from the peer-reviewed literature in the context of GDD. I demonstrate that this method is robust, reflects clinical disease expressivity, outperforms existing manual literature curation, and is scalable. Applying this process to clinical testing in future should improve the efficiency and accuracy of diagnosis.
Epilepsy
Epilepsy is the most common neurological disorder globally, affecting approximately 50 million people of all ages. It is one of the oldest diseases described in the literature, dating back to ancient civilizations 2000-3000 years ago. Despite its long history and wide prevalence, epilepsy is still surrounded by myth and prejudice, which can only be overcome with great difficulty. The term epilepsy is derived from the Greek verb epilambanein, which means to be seized or to be overwhelmed by surprise or attack. Epilepsy is therefore a condition of being overcome, seized, or attacked. The twelve chapters of this book cover various aspects of epileptology, from the history and milestones of epilepsy as a disease entity to the most recent advances in understanding and diagnosing epilepsy.
Constrained neuro fuzzy inference methodology for explainable personalised modelling with applications on gene expression data
Interpretable machine learning models for gene expression datasets are important for understanding the decision-making process of a classifier and gaining insights on the underlying molecular processes of genetic conditions. Interpretable models can potentially support early diagnosis before full disease manifestation. This is particularly important, yet challenging, for mental health. We hypothesise this is due to extreme heterogeneity issues which may be overcome and explained by personalised modelling techniques. Thus far, most machine learning methods applied to gene expression datasets, including deep neural networks, lack personalised interpretability. This paper proposes a new methodology named personalised constrained neuro fuzzy inference (PCNFI) for learning personalised rules from high dimensional datasets which are structurally and semantically interpretable. Case studies on two mental health related datasets (schizophrenia and bipolar disorders) have shown that the relatively short and simple personalised fuzzy rules provided enhanced interpretability as well as better classification performance compared to other commonly used machine learning methods. A performance test on a cancer dataset also showed that PCNFI matches previous benchmarks. Insights from our approach also indicated the importance of two genes (ATRX and TSPAN2) as possible biomarkers for early differentiation of ultra-high risk, bipolar and healthy individuals. These genes are linked to cognitive ability and impulsive behaviour. Our findings suggest a significant starting point for further research into the biological role of cognitive and impulsivity-related differences. With potential applications across bio-medical research, the proposed PCNFI method is promising for diagnosis, prognosis, and the design of personalised treatment plans for better outcomes in the future.
Machine learning techniques to discover genes with potential prognosis role in Alzheimer’s disease using different biological sources
Alzheimer’s disease is a complex progressive neurodegenerative brain disorder, and its prevalence is expected to rise over the next decades. Unconventional strategies for elucidating the genetic mechanisms
are necessary due to its polygenic nature. In this work, there are five input information sources: a public
DNA microarray that measures expression levels of control and patient samples, repositories of known
genes associated with Alzheimer’s disease, additional data, Gene Ontology and, finally, a literature review or
expert knowledge to validate the results. As a methodology to identify genes highly related to this disease,
we present the integration of three machine learning techniques: we have used decision
trees, quantitative association rules and hierarchical clustering to analyze Alzheimer’s disease gene expression profiles and identify genes highly linked to this neurodegenerative disease, through changes in their
expression levels between control and patient samples. We propose an ensemble of decision trees and
quantitative association rules to find the most suitable configurations of the multi-objective evolutionary
algorithm GarNet, in order to overcome the complex parametrization intrinsic to this type of algorithms.
To fulfill this goal, GarNet has been executed using multiple configuration settings and the well-known
C4.5 has been used to find the minimum accuracy to be satisfied. Then, GarNet is rerun to identify dependencies between genes and their expression levels, so we are able to distinguish between healthy
individuals and Alzheimer’s patients using the configurations that exceed the minimum threshold of
accuracy defined by the C4.5 algorithm. Finally, a hierarchical cluster analysis has been used to validate the
obtained gene-Alzheimer’s Disease associations provided by GarNet. The results have shown that the obtained rules were able to successfully characterize the underlying information, grouping relevant genes
for Alzheimer’s Disease. The genes reported by our approach provided two well defined groups that perfectly divided the samples between healthy and Alzheimer’s Disease patients. To prove the relevance of
the obtained results, a statistical test and gene expression fold-change were used. Furthermore, this relevance has been summarized in a volcano plot, showing two clearly separated and significant groups of
genes that are up or down-regulated in Alzheimer’s Disease patients. A biological knowledge integration
phase was performed based on the information fusion of a systematic literature review and Gene
Ontology term enrichment for the described genes found in the hippocampus of patients. Finally, a validation phase
with additional data and a permutation test was carried out, and the results were consistent with previous
studies.
Funding: Ministerio de Ciencia y Tecnología TIN2011-28956-C02-02; Ministerio de Ciencia y Tecnología TIN2014-55894-C2-1-R; Junta de Andalucía P11-TIC-752
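The fold-change and volcano-plot step mentioned in the abstract can be illustrated with a short sketch. It assumes positive, linear-scale expression values (not already log-transformed) and takes the p-value as an input from whatever statistical test is used, since the abstract does not name the test. The function names and the toy values are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence, Tuple


def log2_fold_change(control: Sequence[float], patient: Sequence[float]) -> float:
    """log2 of the ratio of mean patient expression to mean control expression."""
    return math.log2(mean(patient) / mean(control))


def volcano_point(control: Sequence[float],
                  patient: Sequence[float],
                  p_value: float) -> Tuple[float, float]:
    """Coordinates of one gene on a volcano plot: (log2 fold change, -log10 p)."""
    return log2_fold_change(control, patient), -math.log10(p_value)


# Hypothetical expression values for one gene; the p-value is supplied externally.
control = [2.0, 2.0, 2.0]
patient = [8.0, 8.0, 8.0]
x, y = volcano_point(control, patient, p_value=0.001)
print(f"log2FC = {x:.2f}, -log10(p) = {y:.2f}")
```

On such a plot, up-regulated genes fall to the right (positive log2 fold change), down-regulated genes to the left, and more significant genes sit higher, which is how two clearly separated groups become visible.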
Performance Evaluation of Smart Decision Support Systems on Healthcare
Medical activity requires responsibility not only for clinical knowledge and skill but
also for the management of an enormous amount of information related to patient care. It is
through proper treatment of information that experts can consistently build a healthy wellness
policy. The primary objective for the development of decision support systems (DSSs) is
to provide information to specialists when and where they are needed. These systems provide
information, models, and data manipulation tools to help experts make better decisions in a
variety of situations.
Most of the challenges that smart DSSs face come from the great difficulty of dealing
with large volumes of information, which is continuously generated by the most diverse types
of devices and equipment, requiring high computational resources. This situation makes this
type of system prone to retrieving information too slowly for decision making. As a
result of this adversity, information quality and the provision of an infrastructure capable
of promoting the integration and articulation among different health information systems (HIS)
become promising research topics in the field of electronic health (e-health) and, for this
reason, are addressed in this research. The work described in this thesis is motivated
by the need to propose novel approaches to deal with problems inherent to the acquisition,
cleaning, integration, and aggregation of data obtained from different sources in e-health environments,
as well as their analysis.
To ensure the success of data integration and analysis in e-health environments, it
is essential that machine-learning (ML) algorithms ensure system reliability. However, in this
type of environment, it is not possible to guarantee a fully reliable scenario. This makes
smart DSSs susceptible to prediction failures, which severely compromise overall system
performance. On the other hand, systems can have their performance compromised by an
information load beyond what they can support.
To solve some of these problems, this thesis presents several proposals and studies
on the impact of ML algorithms in the monitoring and management of hypertensive disorders
related to high-risk pregnancy. The primary goal of the proposals presented in this thesis is
to improve the overall performance of health information systems. In particular, ML-based
methods are exploited to improve the prediction accuracy and optimize the use of monitoring
device resources. It was demonstrated that the use of this type of strategy and methodology
contributes to a significant increase in the performance of smart DSSs, not only in precision
but also in reducing the computational cost of the classification process.
The observed results seek to contribute to the advance of the state of the art in methods
and strategies based on AI that aim to overcome some challenges that emerge from the integration
and performance of smart DSSs. With the use of algorithms based on AI, it is possible to
quickly and automatically analyze a larger volume of complex data and focus on more accurate
results, providing high-value predictions for better decision making in real time and without
human intervention.
Constructions of self-identity and experience of diagnosis in adults with intellectual disabilities
Background: Research exploring self-identity has focused on the meaning of having an intellectual disability with the risk of overshadowing other aspects that affect how people view themselves. Method: This systematic literature review explores the multifaceted constructions of self-identity in adults with intellectual disabilities. Thirty qualitative studies are synthesised thematically, incorporating formal quality assessments. Results: The experience of power through control, dependence and influential narratives and negotiating the self from others, considering autonomy and seeking normality were related to individuals’ constructions of their identities. The desire to live a meaningful life considering future hopes, the ability to support others and the experience of connectedness contributed to positive self-identities. Conclusions: Self-identity in adults with intellectual disabilities appears multi-faceted, with a multitude of influences on the construction and expression of identity beyond that of an intellectual disability. The review highlighted a lack of high quality research and indicates the need for further rigorous studies across the literature base.