14 research outputs found

    BioCaster: detecting public health rumors with a Web-based text mining system

    Summary: BioCaster is an ontology-based text mining system for detecting and tracking the distribution of infectious disease outbreaks from linguistic signals on the Web. The system continuously analyzes documents reported from over 1700 RSS feeds, classifies them for topical relevance, and plots them onto a Google map using geocoded information. The background knowledge for bridging the gap between layman's terms and formal coding systems is contained in the freely available BioCaster ontology, which includes information in eight languages on the epidemiological role of pathogens as well as geographical locations with their latitudes/longitudes. The system consists of four main stages: topic classification, named entity recognition (NER), disease/location detection, and event recognition. Higher-order event analysis is used to detect more precisely specified warning signals, which can then be sent to registered users via email alerts. Evaluation of the system for topic recognition and entity identification is conducted on a gold standard corpus of annotated news articles.

    What's unusual in online disease outbreak news?

    Background: Accurate and timely detection of public health events of international concern is necessary to help support risk assessment and response and save lives. Novel event-based methods that use the World Wide Web as a signal source offer potential to extend health surveillance into areas where traditional indicator networks are lacking. In this paper we address the issue of systematically evaluating online health news to support automatic alerting, using daily disease-country counts text mined from real-world data by BioCaster. For 18 data sets produced by BioCaster, we compare five aberration detection algorithms (EARS C2, C3, W2, F-statistic, and EWMA) against expert-moderated ProMED-mail postings. Results: We report sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), mean alerts/100 days, and F1, with 95% confidence intervals (CI), for 287 ProMED-mail postings on 18 outbreaks across 14 countries over a 366-day period. Results indicate that W2 had the best F1, with a slight advantage over C2 in handling the day-of-week effect. In drill-down analysis we highlight issues arising from the granular choice of country-level modeling, sudden drops in reporting due to day-of-week effects, and reporting bias. Automatic alerting has been implemented in BioCaster, available from http://born.nii.ac.jp. Conclusions: Online health news alerts have the potential to enhance manual analytical methods by increasing throughput, timeliness, and detection rates. Systematic evaluation of health news aberrations is necessary to push forward our understanding of the complex relationship between news report volumes and case numbers and to select the best-performing features and algorithms.
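    The EARS C2 statistic compared above can be sketched compactly: today's count is standardized against the mean and standard deviation of a lagged baseline window, and an alert fires when the statistic exceeds a threshold. The sketch below follows the commonly described configuration (seven-day baseline, two-day guard band, three-sigma threshold); the function name and defaults are illustrative, not BioCaster's implementation.

```python
from statistics import mean, stdev

def ears_c2(counts, baseline=7, lag=2, threshold=3.0):
    """Flag days whose count exceeds mean + threshold * std of a
    lagged baseline window (the EARS C2-style statistic).

    counts: list of daily disease-country report counts.
    Returns one boolean alert decision per evaluable day.
    """
    alerts = []
    for t in range(baseline + lag, len(counts)):
        # Baseline window ends `lag` days before the current day.
        window = counts[t - lag - baseline:t - lag]
        mu, sigma = mean(window), stdev(window)
        c2 = (counts[t] - mu) / sigma if sigma > 0 else 0.0
        alerts.append(c2 > threshold)
    return alerts
```

A sudden spike over a flat baseline triggers an alert, while ordinary day-to-day variation does not.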

    Mining association language patterns using a distributional semantic model for negative life event classification

    Purpose: Negative life events, such as the death of a family member, an argument with a spouse or the loss of a job, play an important role in triggering depressive episodes. Therefore, it is worthwhile to develop psychiatric services that can automatically identify such events. This study describes the use of association language patterns, i.e., meaningful combinations of words (e.g., <loss, job>), as features to classify sentences with negative life events into predefined categories (e.g., Family, Love, Work).
    Methods: This study proposes a framework that combines a supervised data mining algorithm and an unsupervised distributional semantic model to discover association language patterns. The data mining algorithm, called association rule mining, was used to generate a set of seed patterns by incrementally associating frequently co-occurring words from a small corpus of sentences labeled with negative life events. The distributional semantic model was then used to discover more patterns similar to the seed patterns from a large, unlabeled web corpus.
    Results: The experimental results showed that association language patterns were significant features for negative life event classification. Additionally, the unsupervised distributional semantic model was not only able to improve the level of performance but also to reduce the reliance of the classification process on the availability of a large, labeled corpus.
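    The seed-pattern step above can be approximated, in miniature, by counting word pairs that co-occur in labeled sentences and keeping those with sufficient support. This is a simplified stand-in for association rule mining (no confidence or lift pruning, pairs only rather than incrementally grown patterns); the function name and threshold are assumptions, not the paper's algorithm.

```python
from collections import Counter
from itertools import combinations

def mine_word_pairs(sentences, min_support=2):
    """Count co-occurring word pairs within sentences and keep
    those appearing in at least `min_support` sentences."""
    pair_counts = Counter()
    for sent in sentences:
        # Deduplicate and sort so each pair has a canonical order.
        words = sorted(set(sent.lower().split()))
        pair_counts.update(combinations(words, 2))
    return {p: c for p, c in pair_counts.items() if c >= min_support}
```

On sentences labeled with a Work-related event, a pair like <job, lost> would surface as a candidate seed pattern.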

    Named Entity Recognition for Bacterial Type IV Secretion Systems

    Research on specialized biological systems is often hampered by a lack of consistent terminology, especially across species. In bacterial Type IV secretion systems, genes within one set of orthologs may have over a dozen different names. Classifying research publications based on biological processes, cellular components, molecular functions, and microorganism species should improve the precision and recall of literature searches, allowing researchers to keep up with the exponentially growing literature through resources such as the Pathosystems Resource Integration Center (PATRIC, patricbrc.org). We developed named entity recognition (NER) tools for four entities related to Type IV secretion systems: 1) bacteria names, 2) biological processes, 3) molecular functions, and 4) cellular components. These four entities are important to pathogenesis and virulence research but have received less attention than other entities, e.g., genes and proteins. Based on an annotated corpus, large domain terminological resources, and machine learning techniques, we developed recognizers for these entities. High accuracy rates (>80%) are achieved for bacteria, biological processes, and molecular functions. Contrastive experiments highlighted the effectiveness of alternate recognition strategies; results of term extraction on contrasting document sets demonstrated the utility of these classes for identifying T4SS-related documents.
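    One common building block behind recognizers that draw on large terminological resources is longest-match lookup of a token stream against a dictionary. The sketch below illustrates only that idea; the actual systems combine such resources with machine learning, and all names here are illustrative.

```python
def dictionary_ner(text, lexicon):
    """Greedy longest-match dictionary lookup over whitespace tokens.

    lexicon: lowercase surface form -> entity label.
    Returns (matched text, label) pairs in document order.
    """
    tokens = text.split()
    spans, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first.
        for j in range(len(tokens), i, -1):
            cand = " ".join(tokens[i:j])
            if cand.lower() in lexicon:
                match = (cand, lexicon[cand.lower()], j)
                break
        if match:
            spans.append(match[:2])
            i = match[2]
        else:
            i += 1
    return spans
```

Longest-match ordering matters: a multi-word species name should win over any shorter entry nested inside it.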

    Automatic annotation of narrative radiology reports

    Narrative texts in electronic health records can be efficiently utilized for building decision support systems in the clinic, only if they are correctly interpreted automatically in accordance with a specified standard. This paper tackles the problem of developing an automated method of labeling free-form radiology reports, as a precursor for building query-capable report databases in hospitals. The analyzed dataset consists of 1295 radiology reports concerning the condition of a knee, retrospectively gathered at the Clinical Hospital Centre Rijeka, Croatia. Reports were manually labeled with one or more labels from a set of the 10 most commonly occurring clinical conditions. After primary preprocessing of the texts, two sets of text classification methods were compared: (1) traditional classification models—Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), and Random Forests (RF)—coupled with Bag-of-Words (BoW) features (i.e., symbolic text representation) and (2) Convolutional Neural Network (CNN) coupled with dense word vectors (i.e., word embeddings as a semantic text representation) as input features. We resorted to nested 10-fold cross-validation to evaluate the performance of competing methods using accuracy, precision, recall, and F1 score. The CNN with semantic word representations as input yielded the overall best performance, having a micro-averaged F1 score of 86.7%. The CNN classifier yielded particularly encouraging results for the most represented conditions: degenerative disease (95.9%), arthrosis (93.3%), and injury (89.2%). As a data-hungry deep learning model, the CNN, however, performed notably worse than the competing models on underrepresented classes with fewer training instances such as multicausal disease or metabolic disease. LR, RF, and SVM performed comparably well, with obtained micro-averaged F1 scores of 84.6%, 82.2%, and 82.1%, respectively.
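    The micro-averaged F1 used to compare the classifiers pools true positives, false positives, and false negatives across all labels before computing precision and recall, which weights frequent conditions more heavily than a macro average would. A minimal multi-label implementation, assuming gold and predicted labels are given as one set per report:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over parallel lists of label sets:
    pool TP/FP/FN across all labels, then compute P, R, F1 once."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

This is why a model that does well on the dominant classes (here, degenerative disease and arthrosis) can post a high micro-averaged score despite weak performance on rare classes.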

    Improved relative discriminative criterion using rare and informative terms and ringed seal search-support vector machine techniques for text classification

    Classification is an important task for automatically assigning documents to their respective categories. For text classification, feature selection techniques are normally used to identify important features and to remove irrelevant and noisy features, minimizing the dimensionality of the feature space. These techniques are expected to improve the efficiency, accuracy, and comprehensibility of classification models in text labeling problems. Most feature selection techniques use document and term frequencies to rank a term. Existing techniques (e.g., RDC, NRDC) consider frequently occurring terms and ignore the counts of rarely occurring terms in a class. This study proposes the Improved Relative Discriminative Criterion (IRDC), which takes rarely occurring term counts into account, arguing that rare terms can be as meaningful and important as frequent terms in a class. The proposed IRDC is compared to the most recent feature selection techniques, RDC and NRDC. The results reveal significant improvements by IRDC for feature selection: 27% in precision, 30% in recall, 35% in macro-average, and 30% in micro-average. Additionally, this study proposes a hybrid algorithm, Ringed Seal Search-Support Vector Machine (RSS-SVM), to improve the generalization and learning capability of the SVM. The proposed RSS-SVM optimizes the kernel and penalty parameters with the help of the RSS algorithm and is compared to the most recent techniques, GA-SVM and CS-SVM. The results show significant improvements by RSS-SVM for classification: 18.8% in accuracy, 15.68% in recall, 15.62% in precision, and 13.69% in specificity. In conclusion, the proposed IRDC outperforms existing techniques because of its ability to consider rare and informative terms, and the proposed RSS-SVM outperforms existing techniques because of its ability to improve the balance between exploration and exploitation.
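    Criteria in the RDC family rank terms by how unevenly they are distributed across classes, using per-class frequencies. The sketch below shows a generic discriminative score of that flavor for a two-class case; it is not the published RDC or IRDC formula, and the function name and smoothing constant are assumptions for illustration only.

```python
from collections import Counter

def rank_terms(docs_pos, docs_neg):
    """Rank vocabulary terms by how unevenly they occur across two
    classes, using per-class document frequencies. A generic
    illustrative score, not the RDC/IRDC formula."""
    df_pos = Counter(w for d in docs_pos for w in set(d.split()))
    df_neg = Counter(w for d in docs_neg for w in set(d.split()))
    vocab = set(df_pos) | set(df_neg)
    # Large absolute difference and small overlap -> high score.
    scores = {w: abs(df_pos[w] - df_neg[w]) / (min(df_pos[w], df_neg[w]) + 1)
              for w in vocab}
    return sorted(scores, key=scores.get, reverse=True)
```

Terms concentrated in one class float to the top of the ranking, which is the behavior a feature selection criterion of this kind aims for.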

    Text mining applied to the management of public funds

    This work analyzes the textual documents submitted by Portuguese companies when applying for public business incentive programs. The aim is to extract and select relevant features, present in the texts, that have predictive power over the future actions of accepted applicant companies during their projects. The concrete goal is predicting the cancellation of projects with allocated funds during their expected duration. To carry out this analysis, a text classification pipeline was built, applying natural language processing techniques, feature extraction and selection, classification algorithms, and evaluation metrics. Standard feature extraction techniques such as TF and TF-IDF value generation were used, and further feature extraction experiments were carried out based on topic generation, textual similarity analysis, lexical diversity analysis, and exploration of domain-specific vocabulary, among other analyses of the textual content. Exploring the features created in these experiments reveals characteristics hidden in the data, for example a higher incidence of projects with high similarity levels in certain districts of the country. The main objective was to achieve the best possible performance on the confusion matrix metrics (accuracy, precision, recall, F1-score) for predicting project cancellation. The best cancellation prediction results were obtained with a set of features drawn from several extraction methods, using the Naïve Bayes classifier: 79% accuracy, 77% precision, 71% recall, and 74% F1-score. This work thus demonstrates the benefit of combining features from different feature extraction methods.
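    The TF-IDF baseline mentioned above weights a term by its frequency within a document, discounted by how many documents contain it. A minimal version using the plain log(N/df) IDF variant (real toolkits differ in smoothing and normalization; the function name is illustrative):

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf * idf} dict per whitespace-tokenized
    document, with idf = log(N / document_frequency)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(w for d in docs for w in set(d.split()))
    vectors = []
    for d in docs:
        tf = Counter(d.split())
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors
```

A term that appears in every document gets weight zero, while class-distinctive vocabulary is emphasized, which is what makes these vectors useful inputs for a classifier such as Naïve Bayes.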

    Evaluation of epidemic intelligence systems applied to the early detection of infectious diseases worldwide.

    Our work demonstrated the performance of epidemic intelligence systems for the early detection of infectious diseases worldwide, the specific added value of each system, the greater intrinsic sensitivity of moderated systems, and the variability in the types of information sources used. The creation of a combined virtual system incorporating the best result of the seven systems showed the gains in sensitivity and timeliness that would result from integrating these individual systems into a supra-system. The work also showed the limits of these tools, in particular the low positive predictive value of the raw signals detected, the variability of detection capacity for a given disease, and the significant influence of the type of pathology, the language, and the region of occurrence on the detection of infectious events. It established the wide variety of epidemic intelligence strategies used by public health institutions to meet their specific needs, and the impact of these strategies on the nature, geographic origin, and number of events reported. It also illustrated that, under conditions close to routine practice, epidemic intelligence detected infectious events on average one to two weeks before their official notification, allowing health authorities to be alerted and the implementation of any control measures to be anticipated. Our work opens new fields of investigation whose applications could be important both for users and for the systems themselves.