
    A method for discovering and inferring appropriate eligibility criteria in clinical trial protocols without labeled data

    BACKGROUND: We consider the user task of designing clinical trial protocols and propose a method that discovers and outputs the most appropriate eligibility criteria from a potentially huge set of candidates. Each document d in our collection D is a clinical trial protocol which itself contains a set of eligibility criteria. Given a small set of sample documents D' that a user has initially identified as relevant, e.g., via a user query interface, our scoring method automatically suggests eligibility criteria from D, D ⊃ D', by ranking them according to how appropriate they are to the clinical trial protocol currently being designed. Appropriateness is measured by the degree to which they are consistent with the user-supplied sample documents D'.
    METHOD: We propose a novel three-step method called LDALR which views documents as a mixture of latent topics. First, we infer the latent topics in the sample documents using Latent Dirichlet Allocation (LDA). Next, we use logistic regression models to compute the probability that a given candidate criterion belongs to a particular topic. Lastly, we score each criterion by computing its expected value: the probability-weighted sum of the topic proportions inferred from the set of sample documents. Intuitively, the greater the probability that a candidate criterion belongs to the topics that are dominant in the samples, the higher its expected value or score.
    RESULTS: Our experiments show that LDALR is 8 and 9 times better (for inclusion and exclusion criteria, respectively) than randomly choosing from a set of candidates obtained from relevant documents. In user simulation experiments using LDALR, we were able to automatically construct eligibility criteria that are on average 75% and 70% similar (for inclusion and exclusion criteria, respectively) to the correct eligibility criteria.
    CONCLUSIONS: We have proposed LDALR, a practical method for discovering and inferring appropriate eligibility criteria in clinical trial protocols without labeled data. Results from our experiments suggest that LDALR models can be used to effectively find appropriate eligibility criteria from a large repository of clinical trial protocols.
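    The expected-value scoring step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the toy topic vectors are hypothetical, and the per-topic probabilities are assumed to come from the logistic regression models the abstract describes.

    ```python
    import numpy as np

    def score_criterion(topic_probs, sample_topic_proportions):
        """Expected-value score for a candidate criterion.

        topic_probs: P(topic k | criterion) for each topic k, e.g. from a
        per-topic logistic regression model.
        sample_topic_proportions: topic mixture inferred via LDA from the
        user-supplied sample documents D'.
        Returns the probability-weighted sum of the topic proportions.
        """
        probs = np.asarray(topic_probs, dtype=float)
        proportions = np.asarray(sample_topic_proportions, dtype=float)
        return float(np.dot(probs, proportions))

    # A criterion concentrated on the samples' dominant topics scores higher.
    samples = [0.7, 0.2, 0.1]                                # proportions from D'
    on_topic = score_criterion([0.9, 0.05, 0.05], samples)   # -> 0.645
    off_topic = score_criterion([0.05, 0.05, 0.9], samples)  # -> 0.135
    ```

    Candidates are then ranked by this score, so criteria aligned with the dominant topics of the sample protocols surface first.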

    Boosting Drug Named Entity Recognition using an Aggregate Classifier

    Objective: Drug named entity recognition (NER) is a critical step for complex biomedical NLP tasks such as the extraction of pharmacogenomic, pharmacodynamic and pharmacokinetic parameters. Large quantities of high-quality training data are almost always a prerequisite for employing supervised machine-learning techniques to achieve high classification performance. However, the human labour needed to produce and maintain such resources is a significant limitation. In this study, we improve the performance of drug NER without relying exclusively on manual annotations.
    Methods: We perform drug NER using either a small gold-standard corpus (120 abstracts) or no corpus at all. In our approach, we develop a voting system that combines a number of heterogeneous models, based on dictionary knowledge, gold-standard corpora and silver annotations, to enhance performance. To improve recall, we employed genetic programming to evolve 11 regular-expression patterns that capture common drug suffixes and used them as an extra means of recognition.
    Materials: Our approach uses a dictionary of drug names, i.e. DrugBank; a small manually annotated corpus, i.e. the pharmacokinetic corpus; and a part of the UKPMC database, as raw biomedical text. Gold-standard and silver annotated data are used to train maximum entropy and multinomial logistic regression classifiers.
    Results: Aggregating drug NER methods based on gold-standard annotations, dictionary knowledge and patterns improved the performance of models trained on gold-standard annotations only, achieving a maximum F-score of 95%. In addition, combining models trained on silver annotations, dictionary knowledge and patterns is shown to achieve performance comparable to models trained exclusively on gold-standard data. The main reason appears to be the morphological similarities shared among drug names.
    Conclusion: We conclude that gold-standard data are not a hard requirement for drug NER. Combining heterogeneous models built on dictionary knowledge can achieve classification performance similar or comparable to that of the best-performing model trained on gold-standard annotations.
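    The aggregation idea can be sketched with a toy vote over two heterogeneous taggers: a dictionary lookup and a suffix-based regular expression. This is a minimal sketch under assumptions: the suffix list is illustrative (the paper's 11 evolved patterns are not reproduced here), the lexicon is a stand-in for DrugBank, and the statistical classifiers are omitted.

    ```python
    import re

    # Illustrative drug-suffix pattern, in the spirit of the evolved ones.
    DRUG_SUFFIX = re.compile(r"\w+(?:mab|vir|cillin|prazole|statin|azepam)", re.IGNORECASE)

    def suffix_tagger(token):
        """Label a token DRUG if it carries a common drug suffix."""
        return "DRUG" if DRUG_SUFFIX.fullmatch(token) else "O"

    def dictionary_tagger(token, lexicon):
        """Label a token DRUG if it appears in the drug-name lexicon."""
        return "DRUG" if token.lower() in lexicon else "O"

    def vote(token, taggers):
        """Aggregate classifier: DRUG if at least half the taggers agree."""
        votes = [tagger(token) for tagger in taggers]
        return "DRUG" if votes.count("DRUG") * 2 >= len(votes) else "O"

    lexicon = {"aspirin", "warfarin"}
    taggers = [suffix_tagger, lambda tok: dictionary_tagger(tok, lexicon)]

    label_suffix = vote("amoxicillin", taggers)  # suffix pattern fires
    label_dict = vote("aspirin", taggers)        # lexicon entry fires
    label_none = vote("table", taggers)          # neither fires
    ```

    In the paper the voters also include maximum entropy and multinomial logistic regression models trained on gold or silver annotations; the suffix patterns mainly boost recall on unseen drug names.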

    A behaviour biometrics dataset for user identification and authentication

    As e-Commerce continues to shift our shopping preference from the physical to the online marketplace, we leave behind digital traces of our personally identifiable details. For example, the merchant keeps a record of your name and address; the payment processor stores your transaction details, including account or card information; and every website you visit stores other information such as your device address and type. Cybercriminals constantly steal and use some of this information to commit identity fraud, ultimately leading to devastating consequences for the victims, but also for the card issuers and payment processors with whom the financial liability most often lies. To this end, we recognise that data is generally compromised in this digital age, and personal data such as card number, password, personal identification number and account details can be easily stolen and used by someone else. However, there is a plethora of data relating to a person's behaviour biometrics that is almost impossible to steal, such as the way they type on a keyboard, move the cursor, or whether they normally do so via a mouse, touchpad or trackball. This data, commonly called keystroke, mouse and touchscreen dynamics, can be used to create a unique profile for the legitimate card owner that can be utilised as an additional layer of user authentication during online card payments. Machine learning is a powerful technique for analysing such data to gain knowledge, and has been widely used successfully in many sectors for profiling, e.g., genome classification in molecular biology and genetics, where predictions are made for one or more forms of biochemical activity along the genome. Similar techniques are applicable in the financial sector to detect anomalies in user keyboard and mouse behaviour when entering card details online, such that they can be used to distinguish between a legitimate and an illegitimate card owner.
    In this article, a behaviour biometrics (i.e., keystroke and mouse dynamics) dataset, collected from 88 individuals, is presented. The dataset holds a total of 1760 instances categorised into two classes (i.e., legitimate and illegitimate card owners' behaviour). The data was collected to facilitate an academic start-up project (called CyberSignature) which received funding from Innovate UK, under the Cyber Security Academic Startup Accelerator Programme. The dataset could be helpful to researchers who apply machine learning to develop applications using keystroke and mouse dynamics, e.g., in cybersecurity to prevent identity theft. The dataset, entitled 'Behaviour Biometrics Dataset', is freely available on the Mendeley Data repository.
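    The kind of keystroke-dynamics signal described above can be sketched with the two classic timing features: dwell time (how long a key is held down) and flight time (the gap between releasing one key and pressing the next). This is a minimal illustration with a made-up event format, not the schema of the published dataset.

    ```python
    def keystroke_features(events):
        """Extract dwell and flight times from key events.

        events: list of (key, press_time_ms, release_time_ms) tuples,
        in typing order. Dwell = release - press for each key; flight =
        next press - current release between consecutive keys.
        """
        dwell = [release - press for _, press, release in events]
        flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
        return dwell, flight

    # Hypothetical events for typing three digits of a card number.
    events = [("4", 100, 180), ("2", 260, 330), ("1", 400, 470)]
    dwell, flight = keystroke_features(events)
    # dwell -> [80, 70, 70]; flight -> [80, 70]
    ```

    Feature vectors like these, per user, are what a classifier would consume to separate legitimate from illegitimate card-owner behaviour.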

    Automatic language ability assessment method based on natural language processing

    Background and Objectives: The Wechsler Abbreviated Scales of Intelligence, second edition (WASI-II) is a standardised assessment tool that is widely used to assess cognitive ability in clinical, research, and educational settings. In one of the components of this assessment, referred to as the Vocabulary task, the assessed individuals are presented with words (called stimulus items) and asked to explain what each word means. Their responses are hand-scored based on a list of pre-rated sample responses [0-Point (poor), 1-Point (moderate), or 2-Point (excellent)] that is provided in the accompanying manual of WASI-II. This scoring method is time-consuming, and scoring of responses that do not fully match the pre-rated ones may vary between individual scorers. In this study, we aim to use natural language processing techniques to automate the scoring procedure and make it more time-efficient and reliable (objective).
    Methods: Utilising five different word embeddings (Word2vec, Global Vectors, Bidirectional Encoder Representations from Transformers, Generative Pre-trained Transformer 2, and Embeddings from Language Model), we transformed stimulus items and pre-rated responses from the WASI-II Vocabulary task into machine-readable vectors. We measured distance with cosine similarity, evaluating each model against a rational-expectations hypothesis that vector representations for stimuli should align closely with 2-Point responses and diverge from 0-Point responses. Assessment involved the frequency of consistent representation and the Pearson correlation coefficient, examining overall consistency with the manual's ranking across all items and sample responses.
    Results: The Word2vec model showed the highest consistency with the WASI-II manual (frequency = 20 out of 27; Pearson correlation coefficient = 0.61), while Bidirectional Encoder Representations from Transformers was the worst-performing model (frequency = 5; Pearson correlation coefficient = 0.05). The consistency of these two models with the WASI-II manual differed significantly, Z = 2.282, p = 0.022.
    Conclusions: Our results showed that the scoring of the WASI-II Vocabulary task can be automated with moderate accuracy relying upon off-the-shelf embedding models. These results are promising, and could be improved further by considering alternative vector dimensions, similarity metrics, and data preprocessing techniques to those used in this study.
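    The consistency check described in the Methods can be sketched as follows. The vectors below are toy stand-ins, not real embedding-model output; the hypothesis tested is simply that a stimulus vector should be more cosine-similar to a 2-Point response than to a 0-Point one.

    ```python
    import numpy as np

    def cosine(a, b):
        """Cosine similarity between two vectors."""
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Toy 3-d vectors standing in for word embeddings (hypothetical values).
    stimulus = [0.9, 0.1, 0.2]
    resp_2pt = [0.8, 0.2, 0.1]   # "excellent" sample response
    resp_0pt = [0.1, 0.9, 0.7]   # "poor" sample response

    sim_good = cosine(stimulus, resp_2pt)
    sim_poor = cosine(stimulus, resp_0pt)
    # The item is counted as consistently represented when the 2-Point
    # response is closer to the stimulus than the 0-Point response.
    consistent = sim_good > sim_poor
    ```

    Counting how many of the 27 item/response sets satisfy this ordering yields the "frequency" figures reported in the Results.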