232 research outputs found

    Clustering Approaches for Evaluation and Analysis on Formal Gene Expression Cancer Datasets

    The enormous growth of biological data, and the need to analyse it, gave rise to the field of bioinformatics. Data mining is the discipline used to derive and analyse such data by exploring its hidden patterns. Although data mining can be applied to many kinds of biological data, such as genomic and proteomic data, this study considers Gene Expression (GE) data, which is generated from microarrays such as DNA and oligonucleotide arrays. The generated data is analysed with the clustering techniques of data mining. This study implements the basic K-Means clustering approach alongside other clustering approaches such as hierarchical clustering, SOM, CLICK, and a basic fuzzy clustering approach, and then compares them to identify the most effective approach for cluster analysis of GE data. The experimental results show that the proposed algorithm achieves higher clustering accuracy and requires less clustering time than the existing algorithms
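The K-Means approach the study builds on can be sketched in a few lines. This is a generic illustration on toy data, not the paper's datasets or settings:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: assign each sample to its nearest centroid,
    then recompute centroids, until assignments stabilise."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every sample to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# toy "expression matrix": 6 samples x 4 genes, two well-separated groups
X = np.array([[0.0, 0, 0, 0], [0.1, 0, 0, 0], [0, 0.1, 0, 0],
              [5.0, 5, 5, 5], [5.1, 5, 5, 5], [5, 5.1, 5, 5]])
labels, _ = kmeans(X, k=2)
```

In practice the cluster count k and the distance measure are the knobs that such comparative studies vary.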

    Diagnosis of the disease using an ant colony gene selection method based on information gain ratio using fuzzy rough sets

    With the advancement of metagenomics, data mining research has become focused on microarrays. Microarrays are datasets with a large number of genes, most of which are usually irrelevant to the output class; hence, gene selection (feature selection) is essential: it removes redundant genes and increases the speed and accuracy of classification. After gene selection, the dataset is reduced and differentially abundant genes can be detected with greater accuracy. This, in turn, increases the statistical power with which genes that are differentially abundant across two or more phenotypes are correctly detected. The method presented in this study is a two-stage method for the functional analysis of metagenomes. The first stage uses a combined filter and wrapper gene selection method based on the ant colony algorithm, with fuzzy rough sets used to calculate the information gain ratio that serves as the evaluation measure inside the ant colony algorithm. The set of features from the first stage is used as input to the second stage, where the negative binomial distribution is used to detect genes that are statistically differentially abundant in two or more phenotypes. Applying the proposed method to a microarray dataset shows that it increases the accuracy of the classifier and selects a subset of genes with minimum length and maximum accuracy
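For reference, the classic (crisp) information gain ratio used as an evaluation measure can be computed as below; the paper itself computes the ratio via fuzzy rough sets inside the ant colony search, which this sketch does not reproduce:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a discrete label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_ratio(feature, labels):
    """Information gain of `feature` w.r.t. `labels`, normalised by the
    feature's own split entropy (the gain-ratio correction of C4.5)."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        idx = [i for i, f in enumerate(feature) if f == v]
        cond += len(idx) / n * entropy([labels[i] for i in idx])
    gain = entropy(labels) - cond
    split = entropy(feature)
    return gain / split if split > 0 else 0.0

# toy discretised expression levels for one gene across 6 samples:
# this gene perfectly separates the two phenotypes, so the ratio is 1.0
gene = ['hi', 'hi', 'hi', 'lo', 'lo', 'lo']
cls  = ['tumour', 'tumour', 'tumour', 'normal', 'normal', 'normal']
ratio = info_gain_ratio(gene, cls)
```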

    Ant Colony Optimization

    Ant Colony Optimization (ACO) is the best example of how studies aimed at understanding and modeling the behavior of ants and other social insects can provide inspiration for the development of computational algorithms for the solution of difficult mathematical problems. Introduced by Marco Dorigo in his PhD thesis (1992) and initially applied to the travelling salesman problem, the ACO field has experienced a tremendous growth, standing today as an important nature-inspired stochastic metaheuristic for hard optimization problems. This book presents state-of-the-art ACO methods and is divided into two parts: (I) Techniques, which includes parallel implementations, and (II) Applications, where recent contributions of ACO to diverse fields, such as traffic congestion and control, structural optimization, manufacturing, and genomics are presented
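The basic Ant System that Dorigo first applied to the travelling salesman problem can be sketched as follows; this is a minimal illustration with generic parameter values, not code from the book:

```python
import random, math

def aco_tsp(dist, n_ants=20, n_iter=100, alpha=1.0, beta=3.0, rho=0.5, seed=1):
    """Bare-bones Ant System for the TSP: ants build tours guided by
    pheromone (tau^alpha) and inverse distance (eta^beta); pheromone
    evaporates at rate rho and is reinforced along the best tour found."""
    random.seed(seed)
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]
    best_tour, best_len = None, float('inf')
    for _ in range(n_iter):
        for _ant in range(n_ants):
            tour = [random.randrange(n)]
            unvisited = set(range(n)) - {tour[0]}
            while unvisited:
                i = tour[-1]
                weights = [(j, tau[i][j] ** alpha * (1.0 / dist[i][j]) ** beta)
                           for j in unvisited]
                # roulette-wheel selection of the next city
                r = random.random() * sum(w for _, w in weights)
                for j, w in weights:
                    r -= w
                    if r <= 0:
                        break
                tour.append(j)
                unvisited.remove(j)
            length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
            if length < best_len:
                best_tour, best_len = tour, length
        # evaporate everywhere, then deposit on the best-so-far tour
        tau = [[(1 - rho) * t for t in row] for row in tau]
        for k in range(n):
            a, b = best_tour[k], best_tour[(k + 1) % n]
            tau[a][b] += 1.0 / best_len
            tau[b][a] += 1.0 / best_len
    return best_tour, best_len

# 4 cities on a unit square: the optimal tour is the perimeter, length 4
pts = [(0, 0), (0, 1), (1, 1), (1, 0)]
dist = [[math.dist(p, q) for q in pts] for p in pts]
tour, length = aco_tsp(dist)
```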

    Click Fraud Detection in Online and In-app Advertisements: A Learning Based Approach

    Click Fraud is the fraudulent act of clicking on pay-per-click advertisements to increase a site’s revenue, to drain revenue from the advertiser, or to inflate the popularity of content on social media platforms. In-app advertisements on mobile platforms are among the most common targets for click fraud, which makes companies hesitant to advertise their products. Fraudulent clicks are supposed to be caught by ad providers as part of their service to advertisers, which is commonly done using machine learning methods. However: (1) there is a lack of research in current literature addressing and evaluating the different techniques of click fraud detection and prevention, (2) threat models composed of active learning systems (smart attackers) can mislead the training process of the fraud detection model by polluting the training data, (3) current deep learning models have significant computational overhead, (4) training data is often in an imbalanced state, and balancing it still results in noisy data that can train the classifier incorrectly, and (5) datasets with high dimensionality cause increased computational overhead and decreased classifier correctness -- while existing feature selection techniques address this issue, they have their own performance limitations. By extending the state-of-the-art techniques in the field of machine learning, this dissertation provides the following solutions: (i) To address (1) and (2), we propose a hybrid deep-learning-based model which consists of an artificial neural network, auto-encoder and semi-supervised generative adversarial network. (ii) As a solution for (3), we present Cascaded Forest and Extreme Gradient Boosting with less hyperparameter tuning. (iii) To overcome (4), we propose a row-wise data reduction method, KSMOTE, which filters out noisy data samples both in the raw data and the synthetically generated samples. 
(iv) For (5), we propose different column-reduction methods such as multi-time-scale Time Series analysis for fraud forecasting, using binary labeled imbalanced datasets and hybrid filter-wrapper feature selection approaches
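For context on (iii), the plain SMOTE baseline can be sketched as follows. This is an illustrative implementation of standard SMOTE, not the dissertation's KSMOTE, which additionally filters noisy samples from both the raw data and the synthetic output:

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Classic SMOTE: synthesise new minority-class samples by interpolating
    between a random minority sample and one of its k nearest minority
    neighbours, placing the synthetic point at a random position between them."""
    random.seed(seed)

    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = random.choice(minority)
        neigh = sorted((p for p in minority if p is not x),
                       key=lambda p: d2(x, p))[:k]
        nn = random.choice(neigh)
        gap = random.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

# tiny fraud-class cluster; synthesise 10 extra samples inside it
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
new = smote(minority, n_new=10)
```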

    Performance Evaluation of Smart Decision Support Systems on Healthcare

    Medical practice demands responsibility grounded not only in clinical knowledge and skill but also in the management of an enormous amount of information related to patient care. It is through the proper treatment of information that experts can consistently build a sound wellness policy. The primary objective in developing decision support systems (DSSs) is to provide information to specialists when and where it is needed. These systems provide information, models, and data-manipulation tools to help experts make better decisions in a variety of situations. Most of the challenges that smart DSSs face stem from the great difficulty of dealing with large volumes of information, continuously generated by the most diverse types of devices and equipment and requiring substantial computational resources. This situation makes such systems liable to fail to retrieve information quickly enough for decision making. As a result of this adversity, information quality and the provision of an infrastructure capable of promoting integration and articulation among different health information systems (HIS) become promising research topics in the field of electronic health (e-health), and for this same reason they are addressed in this research. The work described in this thesis is motivated by the need to propose novel approaches to deal with problems inherent to the acquisition, cleaning, integration, and aggregation of data obtained from different sources in e-health environments, as well as their analysis. To ensure the success of data integration and analysis in e-health environments, it is essential that machine-learning (ML) algorithms ensure system reliability. However, in this type of environment a fully reliable scenario cannot be guaranteed, which makes smart DSSs susceptible to predictive failures that severely compromise overall system performance. 
On the other hand, systems can also have their performance compromised by the information overload they must support. To address some of these problems, this thesis presents several proposals and studies on the impact of ML algorithms on the monitoring and management of hypertensive disorders related to at-risk pregnancy. The primary goal of the proposals presented in this thesis is to improve the overall performance of health information systems. In particular, ML-based methods are exploited to improve prediction accuracy and to optimize the use of monitoring-device resources. It was demonstrated that the use of this type of strategy and methodology contributes to a significant increase in the performance of smart DSSs, not only in precision but also in reducing the computational cost of the classification process. The observed results seek to contribute to advancing the state of the art in AI-based methods and strategies that aim to overcome challenges arising from the integration and performance of smart DSSs. With AI-based algorithms it is possible to analyse a larger volume of complex data quickly and automatically and to focus on more accurate results, providing high-value predictions for better real-time decision making without human intervention

    Evolving machine learning and deep learning models using evolutionary algorithms

    Despite great success in data mining, machine learning and deep learning models are still subject to material obstacles when tackling real-life challenges, such as feature selection, initialization sensitivity, and hyperparameter optimization. The prevalence of these obstacles has severely constrained conventional machine learning and deep learning methods from fulfilling their potential. In this research, three evolving machine learning models and one evolving deep learning model are proposed to eliminate the above bottlenecks, namely improving model initialization, enhancing feature representation, and optimizing model configuration, respectively, through hybridization between advanced evolutionary algorithms and conventional ML and DL methods. Specifically, two Firefly Algorithm based evolutionary clustering models are proposed to optimize cluster centroids in K-means and to overcome initialization sensitivity as well as local stagnation. Secondly, a Particle Swarm Optimization based evolving feature selection model is developed for automatic identification of the most effective feature subset and reduction of feature dimensionality for tackling classification problems. Lastly, a Grey Wolf Optimizer based evolving Convolutional Neural Network-Long Short-Term Memory method is devised for automatic generation of the optimal topological and learning configurations for Convolutional Neural Network-Long Short-Term Memory networks to undertake multivariate time series prediction problems. Moreover, a variety of tailored search strategies are proposed to eliminate the intrinsic limitations embedded in the search mechanisms of the three employed evolutionary algorithms, i.e. the dictation of the global best signal in Particle Swarm Optimization, the constraint of diagonal movement in the Firefly Algorithm, and the acute contraction of the search territory in the Grey Wolf Optimizer, respectively. 
The remedy strategies include the diversification of guiding signals, the adaptive nonlinear search parameters, the hybrid position updating mechanisms, as well as the enhancement of population leaders. As such, the enhanced Particle Swarm Optimization, Firefly Algorithm, and Grey Wolf Optimizer variants are more likely to attain global optimality on complex search landscapes embedded in data mining problems, owing to the elevated search diversity as well as the achievement of advanced trade-offs between exploration and exploitation
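The canonical global-best Particle Swarm Optimization that such variants start from can be sketched as follows; this is a minimal continuous PSO on a toy objective, and the thesis's enhancements (diversified guiding signals, adaptive nonlinear parameters, hybrid position updates) are not shown:

```python
import random

def pso(f, dim, n_particles=20, n_iter=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Canonical global-best PSO: each particle is pulled towards its own
    best position (weight c1) and the swarm's best (weight c2), with
    inertia w damping its previous velocity."""
    random.seed(seed)
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# minimise the sphere function; the optimum is 0 at the origin
sphere = lambda x: sum(v * v for v in x)
best, best_val = pso(sphere, dim=3)
```

The "dictation of the global best signal" criticised above is visible in the update rule: every particle is attracted to the same `gbest`, which is what the diversified guiding signals relax.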

    Modelling the head and neck region for microwave imaging of cervical lymph nodes

    Master's thesis in Biomedical Engineering and Biophysics (Radiation in Diagnosis and Therapy), Universidade de Lisboa, Faculdade de Ciências, 2020. "Head and neck cancer" refers to any cancer originating in the epithelial cells of the oral and nasal cavities, the paranasal sinuses, the salivary glands, the pharynx, and the larynx. In 2018 these malignant tumours had a worldwide incidence of roughly 887,659 new cases and a mortality rate above 51%. Approximately 80% of the new cases diagnosed that year showed cancer cells spreading from the tumours to other regions of the body through the surrounding blood and lymphatic vessels. To determine the stage of the cancer and the therapies to follow, it is essential to assess the first lymph nodes that receive drainage from the primary tumour, the sentinel nodes, which are therefore the most likely first targets of tumour cells. Healthy sentinel nodes imply a lower probability of metastases, i.e. of new tumour foci arising from the spread of the cancer to other organs. The standard procedure for diagnosing the cervical lymph nodes (the nodes of the head and neck region) and staging the cancer consists of surgically removing these nodes and performing histopathology. Besides being invasive, the surgical excision of lymph nodes endangers patients' mental and physical health as well as their quality of life: pain, disfigurement due to scarring, and loss of speech or of the ability to swallow are among the possible repercussions of removing lymph nodes from the head and neck region.
Additionally, the risk of infection and of lymphoedema (the accumulation of lymph in the interstitial tissues) increases significantly when large numbers of healthy lymph nodes are removed. The burden on healthcare systems is also high, owing to the need to monitor these patients and to provide follow-up therapies and morbidity-related care such as manual lymphatic drainage and physiotherapy. The development of new imaging technologies for the head and neck requires realistic models that simulate the behaviour and properties of biological tissues. Medical microwave imaging is a promising, non-invasive technique that uses non-ionizing radiation, i.e. signals at microwave frequencies whose behaviour depends on the dielectric contrast between the tissues they traverse, making it possible to identify regions or structures of interest and thereby complement diagnosis. Because of these characteristics, this modality can only be used to assess shallow anatomical regions. Studies indicate that lymph nodes containing tumour cells have dielectric properties distinct from those of healthy lymph nodes. For this reason, and given their shallow location, the lymph nodes of the head and neck region are excellent candidates for microwave radar imaging as a diagnostic tool. To date, no studies have developed models of the head and neck region focused on realistically representing the cervical lymph nodes.
This project therefore developed two generators of three-dimensional phantoms of the head and neck region: a generator of simple numerical phantoms (generator I) and a generator of more complex, anatomically realistic numerical phantoms, derived from magnetic resonance images and including realistic dielectric properties of the biological tissues (generator II). The two generators produce phantoms with different levels of complexity, supporting different phases of the development of medical microwave imaging equipment, and all generated phantoms, particularly the anatomically realistic ones, can later be 3D printed. Building generator I involved modelling the head and neck region in accordance with human anatomy and the distribution of the main tissues, and creating an interface for customising the models (for example, the inclusion or removal of certain tissues depends on the purpose for which each model is generated). A detailed study of this region led to the inclusion of bone, muscle, and adipose tissue, skin, and lymph nodes in the models. Although these phantoms are quite simple, they are essential at the start of the development of microwave imaging devices dedicated to diagnosing the cervical lymph nodes.
Building generator II was divided into three main stages owing to its complexity. The first stage created a pipeline for processing the magnetic resonance images, comprising: data normalisation; background subtraction using manually constructed binary masks; image enhancement with linear filters (e.g. ideal low-pass, Gaussian, and Butterworth filters) and non-linear filters (e.g. the median filter); and unsupervised machine-learning algorithms (K-means, Agglomerative Hierarchical Clustering, DBSCAN, and BIRCH) to segment the biological tissues of the cervical region. Since each of these unsupervised algorithms requires different hyperparameters, a detailed study was carried out to understand how each algorithm works and how it performs on this project's data (MRI scans), in order to choose empirically the range of values of each hyperparameter and the combinations to be tested. The hyperparameter combination yielding the best segmentation of the anatomical structures was then selected by combining two methodologies: clustering-quality metrics (e.g. the Silhouette Coefficient and the Davies-Bouldin and Calinski-Harabasz indices) and visual inspection. The second stage was dedicated to the manual introduction of structures such as the skin and the lymph nodes, which were not segmented by the machine-learning algorithms because of their thinness and small size, respectively. The final stage assigned the dielectric properties of the biological tissues, at a predefined frequency, using the four-pole Cole-Cole model. As with generator I, an interface lets the user decide which features to include in the phantom: the tissues (adipose tissue, muscle tissue, skin, and/or lymph nodes); for the lymph nodes, their number, size, location by level, and clinical state (healthy or metastasised); and the frequency at which the dielectric properties (relative permittivity and conductivity) of each tissue are computed.
This project resulted in a generator of realistic models of the head and neck region focused on the cervical lymph nodes, which inserts biological tissues such as muscle and adipose tissue, skin, and lymph nodes, and assigns them state-of-the-art dielectric properties at a given microwave frequency. The computational models produced by generator II, which can later be 3D printed, may have a great impact on the development of medical microwave imaging devices aimed at screening and diagnosing cervical lymph nodes, and thus contribute to a non-invasive staging process for head and neck cancer
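The dielectric-property assignment relies on the multi-pole Cole-Cole dispersion model; a minimal sketch is given below, with illustrative single-pole parameters that are not tissue values from the thesis (real tissue parameters come from measured databases such as Gabriel et al.):

```python
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity, F/m

def cole_cole(f_hz, eps_inf, poles, sigma_s):
    """Multi-pole Cole-Cole model of complex relative permittivity:
        eps*(w) = eps_inf + sum_n deps_n / (1 + (j w tau_n)^(1 - alpha_n))
                  + sigma_s / (j w eps0)
    `poles` is a list of (deps, tau, alpha) tuples. Returns the real relative
    permittivity and the effective conductivity (S/m) at frequency f_hz."""
    w = 2 * math.pi * f_hz
    eps = complex(eps_inf, 0.0)
    for deps, tau, alpha in poles:
        eps += deps / (1 + (1j * w * tau) ** (1 - alpha))
    eps += sigma_s / (1j * w * EPS0)
    # eps = eps' - j*eps'': conductivity is w * eps0 * eps''
    return eps.real, -eps.imag * w * EPS0

# single-pole example at 2.45 GHz with ILLUSTRATIVE (non-tissue) parameters
rel_perm, cond = cole_cole(2.45e9, eps_inf=4.0,
                           poles=[(38.0, 8.0e-12, 0.1)], sigma_s=0.7)
```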

    Capsule Network-based Radiomics: From Diagnosis to Treatment

    Recent advancements in signal processing and machine learning, coupled with developments in electronic medical record keeping in hospitals, have resulted in a surge of significant interest in "radiomics". Radiomics is an emerging and relatively new research field, which refers to semi-quantitative and/or quantitative features extracted from medical images with the goal of developing predictive and/or prognostic models. Radiomics is expected to become a critical component for the integration of image-derived information for personalized treatment in the near future. The conventional radiomics workflow is typically based on extracting pre-designed features (also referred to as hand-crafted or engineered features) from a segmented region of interest. Clinical application of hand-crafted radiomics is, however, limited by the fact that features are pre-defined and extracted without taking the desired outcome into account. The aforementioned drawback has motivated trends towards the development of deep learning-based radiomics (also referred to as discovery radiomics). Discovery radiomics has the advantage of learning the desired features on its own in an end-to-end fashion, and has several applications in disease prediction/diagnosis. Through this Ph.D. thesis, we develop deep learning-based architectures to address the following critical challenges identified within the radiomics domain. First, we cover the tumor type classification problem, which is of high importance for treatment selection. We address this problem by designing a Capsule network-based architecture that has several advantages over existing solutions, such as eliminating the need for access to a huge amount of training data, and its capability to learn input transformations on its own. We apply different modifications to the Capsule network architecture to make it more suitable for radiomics. 
On the one hand, we equip the proposed architecture with access to the tumor bounding box; on the other, we design a multi-scale Capsule network architecture. Furthermore, capitalizing on the advantages of ensemble learning paradigms, we design a boosting and also a mixture-of-experts capsule network. A Bayesian capsule network is also developed to capture the uncertainty of the tumor classification. Besides knowing the tumor type (through classification), predicting the patient's response to treatment plays an important role in treatment design. Predicting the patient's response, including survival and tumor recurrence, is another goal of this thesis, which we address by designing a deep learning-based model that takes not only the medical images but also different clinical factors (such as age and gender) as inputs. Finally, COVID-19 diagnosis, another challenging and crucial problem within the radiomics domain, is dealt with using both X-ray and Computed Tomography (CT) images (in particular low-dose ones); two in-house datasets are collected for the latter, and different capsule network-based models are developed for COVID-19 diagnosis
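A core primitive of the Capsule networks used throughout is the "squash" nonlinearity of Sabour et al. (2017); a minimal sketch of just that primitive is given below (the thesis's architectures add dynamic routing, multi-scale design, and ensembling on top of it):

```python
import math

def squash(s, eps=1e-8):
    """Capsule 'squash' nonlinearity: rescales a capsule's output vector so
    its length lies in [0, 1), preserving its direction. The length then
    encodes the probability that the entity the capsule represents is present:
    short vectors shrink towards 0, long vectors saturate towards 1."""
    norm2 = sum(v * v for v in s)
    norm = math.sqrt(norm2)
    scale = norm2 / (1.0 + norm2) / (norm + eps)
    return [scale * v for v in s]

short = squash([0.1, 0.0, 0.0])   # short input  -> length shrunk towards 0
long_ = squash([10.0, 0.0, 0.0])  # long input   -> length pushed towards 1
```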

    Neural Representations of Concepts and Texts for Biomedical Information Retrieval

    Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user's query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines, where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may also maintain preprocessed structured knowledge derived from textual data as so-called knowledge graphs, so certain types of queries that are posed as questions can be parsed as such; a response can then be an output of one or more named entities instead of a ranked list of documents (e.g., "what diseases are associated with EGFR mutations?"). This refined setup is often termed question answering (QA) in the IR and natural language processing (NLP) communities. In biomedicine and healthcare, specialized corpora are often at play, including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized because the types of queries and the variations in the texts differ from those of general Web documents. For example, scientific articles are more formal with longer sentences, but clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals. 
Queries are also different and can range from simple phrases (e.g., "COVID-19 symptoms") to more complex implicitly fielded queries (e.g., "chemotherapy regimens for stage IV lung cancer patients with ALK mutations"). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR. Representations of documents and queries are at the core of IR methods, and retrieval methodology involves coming up with these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword-based indexing of documents (the so-called inverted index) and matching query phrases against the document index. It is not difficult to see that this keyword-based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at phrase/clause/sentence levels), and this has led to dimensionality reduction methods such as latent semantic indexing that generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods. This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods always begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications. In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes. 
These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high-level concepts, with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents that are likely to contain answers (e.g., factoids, lists) to a question in a QA task. The model is essentially a sentence matching neural network that learns the relevance of a candidate answer sentence to the given question, parametrized with a bilinear map. In the fourth effort, we present another document reranking approach that is tailored for precision medicine use-cases. It combines neural query-document matching and faceted text summarization. The main distinction of this effort from previous efforts is to pivot from a query manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts
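The bilinear matching score behind the third effort can be sketched directly; this is a generic illustration of the form s(q, d) = qᵀWd with a fixed W, whereas in the dissertation W is learned and q, d are neural sentence encodings:

```python
def bilinear_score(q, d, W):
    """Bilinear relevance score s(q, d) = q^T W d: the matrix W learns which
    question dimensions should align with which answer-sentence dimensions.
    Here q, d are plain lists of floats and W a list of rows."""
    Wd = [sum(W[i][j] * d[j] for j in range(len(d))) for i in range(len(q))]
    return sum(q[i] * Wd[i] for i in range(len(q)))

# with W = identity, the score reduces to a plain dot product
q = [1.0, 2.0, 0.0]
d = [0.5, 1.0, 3.0]
I = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
score = bilinear_score(q, d, I)  # = 1*0.5 + 2*1.0 + 0*3.0 = 2.5
```

A non-identity W lets the model reward cross-dimension interactions (e.g., a "disease" direction in the question aligning with a "treatment" direction in the answer), which a dot product cannot express.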

    Pattern Recognition

    A wealth of advanced pattern recognition algorithms is emerging from the interdisciplinary area between technologies for effective visual features and the human-brain cognition process. Effective visual features are made possible by rapid developments in appropriate sensor equipment, novel filter designs, and viable information processing architectures, while understanding of the human-brain cognition process broadens the ways in which computers can perform pattern recognition tasks. The present book is intended to collect representative research from around the globe focusing on low-level vision, filter design, features and image descriptors, data mining and analysis, and biologically inspired algorithms. The 27 chapters covered in this book disclose recent advances and new ideas in promoting the techniques, technology, and applications of pattern recognition