Normalising Medical Concepts in Social Media Texts by Learning Semantic Representation
Automatically recognising medical concepts mentioned in social media messages (e.g. tweets) enables several applications for enhancing health quality of people in a community, e.g. real-time monitoring of infectious diseases in population. However, the discrepancy between the type of language used in social media and medical ontologies poses a major challenge. Existing studies deal with this challenge by employing techniques, such as lexical term matching and statistical machine translation. In this work, we handle the medical concept normalisation at the semantic level. We investigate the use of neural networks to learn the transition between layman’s language used in social media messages and formal medical language used in the descriptions of medical concepts in a standard ontology. We evaluate our approaches using three different datasets, where social media texts are extracted from Twitter messages and blog posts. Our experimental results show that our proposed approaches significantly and consistently outperform existing effective baselines, which achieved state-of-the-art performance on several medical concept normalisation tasks, by up to 44%.
A framework for enhancing the query and medical record representations for patient search
Electronic medical records (EMRs) are digital documents stored by medical institutions that detail the observed symptoms, the conducted diagnostic tests, the identified diagnoses and the prescribed treatments. These EMRs are being increasingly used worldwide to improve healthcare services. For example, when a doctor compiles the possible treatments for a patient showing some particular symptoms, it is advantageous to consult the information about patients who were previously treated for those same symptoms. However, finding patients with particular medical conditions is challenging, due to the implicit knowledge inherent within the patients' medical records and queries: such knowledge may be known by medical practitioners, but may be hidden from an information retrieval (IR) system. For instance, the mention of a treatment such as a drug may indicate to a practitioner that a particular diagnosis has been made for the patient, but this diagnosis may not be explicitly mentioned in the patient's medical records. Moreover, the use of negated language (e.g. 'without', 'no') to describe a medical condition of a patient (e.g. the patient has no fever) may cause a search system to erroneously retrieve that patient for a query when searching for patients with that medical condition (e.g. find patients with fever).
This thesis focuses on enhancing the search of EMRs, with the aim of identifying patients with medical histories relevant to the medical conditions stated in a text query. During retrieval, a healthcare practitioner indicates a number of inclusion criteria describing the medical conditions of the patients of interest. To attain effective retrieval performance, we hypothesise that, in a patient search system, both the information needs and patients' histories should be represented based upon the medical decision process. In particular, this thesis argues that since the medical decision process typically encompasses four aspects (symptom, diagnostic test, diagnosis and treatment), a patient search system should take into account these aspects and apply inferences to recover the possible implicit knowledge. We postulate that considering these aspects and their derived implicit knowledge at three different levels of the retrieval process (namely, sentence, medical record and inter-record levels) enhances the retrieval performance. Indeed, we propose a novel framework that can gain insights from EMRs and queries, by modelling and reasoning upon information during retrieval in terms of the four aforementioned aspects at the three levels of the retrieval process, and can use these insights to enhance patient search.
Firstly, at the sentence level, we extract the medical conditions in the medical records and queries. In particular, we propose to represent only the medical conditions related to the four medical aspects in order to improve the accuracy of our search system. In addition, we identify the context (negative/positive) of terms, which leads to an accurate representation of the medical conditions both in the EMRs and queries. Specifically, we aim to prevent patients whose EMRs state the medical conditions in a context different from that of the query from being ranked highly. For example, patients whose EMRs state "no history of dementia" should not be retrieved for a query searching for patients with dementia.
Secondly, at the medical record level, using external knowledge-based resources (e.g. ontologies and health-related websites), we leverage the relationships between medical terms to infer the wider medical history of the patient in terms of the four medical aspects. In particular, we estimate the relevance of a patient to the query by exploiting association rules that we extract from the semantic relationships between medical terms using the four aspects of the medical process. For example, patients with a medical history involving a CABG surgery (treatment) can be inferred as relevant to a query searching for a patient suffering from heart disease (diagnosis), since a CABG surgery is a treatment of heart disease.
Thirdly, at the inter-record level, we enhance the retrieval of patients in two ways. First, we exploit knowledge about how the four medical aspects are handled by different hospital departments to gain a better understanding of the appropriateness of EMRs created by different departments for a given query. We propose to aggregate EMRs at the department level (i.e. inter-record level) to extract implicit knowledge (i.e. the expertise of each department) and to model this expertise when ranking patients. For instance, patients having EMRs from the cardiology department are likely to be relevant to a query searching for patients who suffered from a heart attack. Second, as a medical query typically contains several medical conditions that the relevant patients should satisfy, we propose to explicitly model the relevance towards multiple query medical conditions in the EMRs related to a particular patient during retrieval. In particular, we rank highly those patients that match all the stated medical conditions in the query by adapting coverage-based diversification approaches originally proposed for the web search domain.
Finally, we examine the combination of our aforementioned approaches that exploit the implicit knowledge at the three levels of the retrieval process to further improve the retrieval performance by adapting techniques from the fields of data fusion and machine learning. In particular, data fusion techniques, such as CombSUM and CombMNZ, are used to combine the relevance scores computed by the different approaches of the proposed framework. Alternatively, we deploy state-of-the-art learning-to-rank approaches (e.g. LambdaMART and AdaRank) to learn, from a set of training data, an effective combination of the relevance scores computed by the approaches of the framework. In addition, we introduce a novel selective ranking approach that uses a classifier to effectively apply one of the approaches of the framework on a per-query basis.
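The two data fusion techniques named above can be illustrated with a minimal sketch. It follows the standard textbook definitions (CombSUM sums a patient's scores across the input rankings; CombMNZ multiplies that sum by the number of rankings that retrieved the patient) and is not the thesis's implementation. The function name `comb_fuse`, the patient identifiers and the scores are hypothetical, and the scores are assumed to be already normalised to a comparable range.

```python
from collections import defaultdict


def comb_fuse(runs, method="CombSUM"):
    """Fuse several {patient_id: score} dicts into one ranked list.

    Assumes the scores in each run are already normalised to a
    comparable range; this sketch does no normalisation itself.
    """
    total = defaultdict(float)  # sum of scores per patient
    hits = defaultdict(int)     # number of runs that retrieved the patient
    for run in runs:
        for patient_id, score in run.items():
            total[patient_id] += score
            hits[patient_id] += 1

    if method == "CombSUM":
        fused = dict(total)
    elif method == "CombMNZ":
        fused = {pid: total[pid] * hits[pid] for pid in total}
    else:
        raise ValueError(f"unknown fusion method: {method}")

    # Highest fused score first; ties broken by patient id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores from two approaches of a framework.
sentence_level = {"p1": 0.5, "p2": 0.25}
record_level = {"p1": 0.25, "p3": 0.5}

print(comb_fuse([sentence_level, record_level], "CombSUM"))
print(comb_fuse([sentence_level, record_level], "CombMNZ"))
```

CombMNZ rewards patients retrieved by several approaches more strongly than CombSUM does, which is why the two can rank the same candidates differently.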
This thesis draws insights from a thorough evaluation and analysis of the proposed framework using a standard test collection provided by the TREC Medical Records track. The experimental results show the effectiveness of the framework. In particular, the results demonstrate the importance of dealing with the implicit knowledge in patient search by focusing on the medical decision criteria aspects at the three levels of the retrieval process.
Tweeting Behaviour during Train Disruptions within a City
In a smart city environment, citizens use social media for communicating and reporting events. Existing work has shown that social media tools, such as Twitter and Facebook, can be used as social sensors to monitor events in real-time as they happen (e.g. riots, natural disasters and sport events). In this paper, we study the reactions of citizens in social media towards train disruptions within a city. Our study using 30 days of tweets in a large city shows that citizens react differently to train disruptions by, for instance, displaying unique behaviours in tweeting depending on the time of the disruption. Specifically, for working days, tweets related to train disruptions are typically generated during rush hour periods. In contrast, during weekends, urban citizens tend to tweet about train disruptions during late evenings. Using these insights, we develop a supervised approach to predict whether a train disruption tweet will be retweeted and propagated on the social network, by using features, such as time, user, and the content of tweets. Our experimental results show that we can effectively predict when a train disruption tweet is retweeted by using such features.
Topic-centric Classification of Twitter User's Political Orientation
In the recent Scottish Independence Referendum (hereafter, IndyRef), Twitter offered a broad platform for people to express their opinions, with millions of IndyRef tweets posted over the campaign period. In this paper, we aim to classify people's voting intentions by the content of their tweets, i.e. the short messages they communicate on Twitter. By observing tweets related to the IndyRef, we find that people not only discussed the vote, but raised topics related to an independent Scotland including oil reserves, currency, nuclear weapons, and national debt. We show that the views communicated on these topics can inform us of the individuals' voting intentions ("Yes", in favour of independence, vs. "No", opposed). In particular, we argue that an accurate classifier can be designed by leveraging the differences in the features' usage across different topics related to voting intentions. We demonstrate improvements upon a Naive Bayesian classifier using the topics enrichment method. Our new classifier identifies the closest topic for each unseen tweet, based on those topics identified in the training data. Our experiments show that our Topics-Based Naive Bayesian classifier improves accuracy by 7.8% over the classical Naive Bayesian baseline.
Learning to Combine Representations for Medical Records Search
The complexity of medical terminology raises challenges when searching medical records. For example, 'cancer', 'tumour', and 'neoplasms', which are synonyms, may prevent a traditional search system from retrieving relevant records that contain only synonyms of the query terms. Prior works deal with this using bag-of-concepts approaches, which represent medical terms sharing the same meaning using concepts from medical resources (e.g. MeSH). The relevance scores are then combined with those of a traditional bag-of-words representation when inferring the relevance of medical records. Even though the existing approaches are effective, they do not take into account the predicted retrieval effectiveness of either the bag-of-words or the bag-of-concepts representation, which may be used to effectively model the score combination and hence improve retrieval performance. In this paper, we propose a novel learning framework that models the importance of the bag-of-words and the bag-of-concepts representations, combining their scores on a per-query basis. Our proposed framework leverages retrieval performance predictors, such as the clarity score and AvIDF, calculated on both representations as learning features. We evaluate our proposed framework using the TREC Medical Records track's test collections. As our proposed framework can significantly outperform an existing approach that linearly merges the relevance scores, we conclude that retrieval performance predictors can be effectively leveraged when combining the relevance scores.
A Query and Patient Understanding Framework for Medical Records Search
Electronic medical records (EMRs) are being increasingly used worldwide to facilitate improved healthcare services. Our work focuses on searching EMRs to identify patients with medical histories relevant to the medical condition(s) stated in a query. The resulting system can be beneficial to healthcare providers, administrators, and researchers who may wish to analyse the effectiveness of a particular medical procedure to combat a specific disease. To attain effective retrieval performance, we hypothesise that, in such a medical IR system, both the information needs and patients should be modelled based on how the medical process is developed. Specifically, our thesis states that since the medical decision process typically encompasses four aspects (symptom, diagnostic test, diagnosis, and treatment), a medical search system should take into account these aspects and apply inferences to recover possible implicit knowledge. We postulate that considering these aspects and their derived implicit knowledge at different levels of the retrieval process (namely, sentence, record, and inter-record level) enhances the retrieval performance. Indeed, we propose to build a query and patient understanding framework that can gain insights from EMRs and queries, by modelling and reasoning during retrieval in terms of the four aforementioned aspects (symptom, diagnostic test, diagnosis, and treatment) at three different levels of the retrieval process.
Firstly, at the sentence level, a medical negation detection tool is used to identify the context (negative/positive) of terms, which leads to an accurate representation of the medical conditions both in the EMRs and the queries. Handling negated language is challenging in medical records search, since it is commonly used by medical practitioners to indicate that a patient does not possess a particular medical condition. Thirdly, at the inter-record level, we exploit knowledge about how the four medical aspects are handled by different hospital departments to gain further understanding about the appropriateness of EMRs from different departments for a given query. Specifically, we propose to aggregate EMRs at the department level (i.e. inter-record level) to extract implicit medical knowledge (i.e. expertise of each department) and model this department's expertise, while ranking EMRs. For instance, patients having EMRs from the cardiology department are likely to be relevant to a query such as "find patients suffering from heart attack". We evaluate our work using standard test collections provided by the TREC Medical Records track.
SIGIR'13, July 28-August 1, 2013, Dublin, Ireland. ACM 978-1-4503-2034-4/13/07.
Disambiguating Biomedical Acronyms using EMIM
Expanding a query with acronyms or their corresponding ‘long-forms’ has not been shown to provide consistent improvements in the biomedical IR literature. The major open issue with expanding acronyms in a query is their inherent ambiguity, as an acronym can refer to multiple long-forms. At the same time, a long-form identified in a query can be expanded with its acronym(s); however, some of these may also be ambiguous and lead to poor retrieval performance. In this work, we propose the use of the EMIM (Expected Mutual Information Measure) between a long-form and its abbreviated acronym to measure ambiguity. We experiment with expanding both acronyms and long-forms identified in the queries from the ad hoc task of the TREC 2004 Genomics track. Our preliminary analysis shows the potential of both acronym and long-form expansions for biomedical IR.
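As an illustration of the measure named above, the following is a minimal sketch of EMIM computed over the presence or absence of two terms in a document collection. The abstract does not say how the probabilities are estimated, so the use of document co-occurrence counts here is an assumption, and the function name `emim` and its parameters are hypothetical.

```python
import math


def emim(n_docs, n_a, n_b, n_ab):
    """Expected mutual information (in bits) between two terms.

    n_docs: number of documents in the collection
    n_a:    documents containing term A (e.g. an acronym)
    n_b:    documents containing term B (e.g. a long-form)
    n_ab:   documents containing both terms
    Probabilities are estimated from document counts, which is an
    assumption of this sketch.
    """
    # Joint counts for each presence (1) / absence (0) combination.
    cells = {
        (1, 1): n_ab,
        (1, 0): n_a - n_ab,
        (0, 1): n_b - n_ab,
        (0, 0): n_docs - n_a - n_b + n_ab,
    }
    if n_docs <= 0 or any(count < 0 for count in cells.values()):
        raise ValueError("inconsistent document counts")

    p_a = {1: n_a / n_docs, 0: 1 - n_a / n_docs}
    p_b = {1: n_b / n_docs, 0: 1 - n_b / n_docs}

    total = 0.0
    for (x, y), count in cells.items():
        if count == 0:
            continue  # a zero-probability cell contributes nothing
        p_xy = count / n_docs
        total += p_xy * math.log2(p_xy / (p_a[x] * p_b[y]))
    return total
```

Under this estimate, two terms that occur independently of each other score zero, and the score grows as the presence of one term becomes more predictive of the other.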
WSDM 2017 Workshop on Mining Online Health Reports: WSDM workshop summary
The workshop on Mining Online Health Reports (MOHRS) draws upon the rapidly developing field of Computational Health, focusing on textual content that has been generated through the various facets of Web activity. Online user-generated information mining, especially from social media platforms and search engines, has been at the forefront of many research efforts, especially in the fields of Information Retrieval and Natural Language Processing. The incorporation of such data and techniques in a number of health-oriented applications has provided strong evidence about the potential benefits, which include better population coverage, timeliness and the operational ability in places with less established health infrastructure. The workshop aims to create a platform where relevant state-of-the-art research is presented and where discussions among researchers with cross-disciplinary backgrounds can take place. It will focus on the characterisation of data sources, the essential methods for mining this textual information, as well as potential real-world applications and the arising ethical issues. MOHRS '17 will feature 3 keynote talks and 4 accepted paper presentations, together with a panel discussion session.