Local Binary Patterns as a Feature Descriptor in Alignment-free Visualisation of Metagenomic Data
Shotgun sequencing has facilitated the analysis of complex microbial communities. However, clustering and visualising these communities without prior taxonomic information is a major challenge. Feature descriptor methods can be utilised to extract these taxonomic relations from the data. Here, we present a novel approach consisting of local binary patterns (LBP) coupled with randomised singular value decomposition (RSVD) and Barnes-Hut t-distributed stochastic neighbour embedding (BH-tSNE) to highlight the underlying taxonomic structure of the metagenomic data. The effectiveness of our approach is demonstrated using several simulated datasets and a real metagenomic dataset.
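The embedding stage of such a pipeline can be illustrated with off-the-shelf components. Below is a minimal sketch, assuming the LBP code histograms for each sequence fragment have already been computed; scikit-learn's randomised TruncatedSVD and Barnes-Hut TSNE are used as stand-ins for the RSVD and BH-tSNE steps, and the data are random placeholders rather than the paper's.

```python
# Minimal sketch of the embedding stage, assuming each sequence fragment has
# already been converted into a histogram of LBP codes (one row per fragment).
# TruncatedSVD with a randomised solver stands in for RSVD, and scikit-learn's
# TSNE uses the Barnes-Hut approximation.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
lbp_histograms = rng.random((500, 256))   # placeholder LBP code histograms

# Randomised SVD: compress the sparse, high-dimensional LBP space.
svd = TruncatedSVD(n_components=50, algorithm="randomized", random_state=0)
reduced = svd.fit_transform(lbp_histograms)

# Barnes-Hut t-SNE: project to 2-D to visualise taxonomic structure.
embedding = TSNE(n_components=2, method="barnes_hut", random_state=0).fit_transform(reduced)
print(embedding.shape)  # (500, 2)
```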
Interpreting Differentiable Latent States for Healthcare Time-series Data
Machine learning enables extracting clinical insights from large temporal
datasets. The applications of such machine learning models include identifying
disease patterns and predicting patient outcomes. However, limited
interpretability poses challenges for deploying advanced machine learning in
digital healthcare. Understanding the meaning of latent states is crucial for
interpreting machine learning models, assuming they capture underlying
patterns. In this paper, we present a concise algorithm that allows for i)
interpreting latent states using highly related input features; ii)
interpreting predictions using subsets of input features via latent states; and
iii) interpreting changes in latent states over time. The proposed algorithm is
applicable to any differentiable model. We demonstrate that this
approach enables the identification of a daytime behavioral pattern for
predicting nocturnal behavior in a real-world healthcare dataset.
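As an illustration of the general idea (not the paper's exact algorithm), a latent state of any differentiable model can be attributed to input features by back-propagating from that state to the inputs. The sketch below assumes a small PyTorch GRU over hourly feature vectors; the model, data, and chosen state index are placeholders.

```python
# Illustrative sketch: attribute a latent state to input features by taking
# gradients of the state with respect to the inputs (any differentiable model
# works). The model, data, and state index are placeholders.
import torch
import torch.nn as nn

model = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 24, 8, requires_grad=True)   # 24 hourly feature vectors

outputs, _ = model(x)                 # latent states for every time step
state = outputs[0, -1, 3]             # one latent dimension at the final step
state.backward()                      # d(state) / d(inputs)

# Features with the largest absolute gradient are the most related inputs.
relevance = x.grad.abs().sum(dim=1).squeeze(0)   # aggregate over time
top_features = torch.topk(relevance, k=3).indices
print(top_features)
```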
On the Effectiveness of Compact Biomedical Transformers
Language models pre-trained on biomedical corpora, such as BioBERT, have
recently shown promising results on downstream biomedical tasks. Many existing
pre-trained models, on the other hand, are resource-intensive and
computationally heavy owing to factors such as embedding size, hidden
dimension, and number of layers. The natural language processing (NLP)
community has developed numerous strategies to compress these models utilising
techniques such as pruning, quantisation, and knowledge distillation, resulting
in models that are considerably faster, smaller, and consequently easier to use
in practice. By the same token, in this paper we introduce six lightweight
models, namely, BioDistilBERT, BioTinyBERT, BioMobileBERT, DistilBioBERT,
TinyBioBERT, and CompactBioBERT which are obtained either by knowledge
distillation from a biomedical teacher or continual learning on the PubMed
dataset via the Masked Language Modelling (MLM) objective. We evaluate all of
our models on three biomedical tasks and compare them with BioBERT-v1.1, with
the aim of creating efficient lightweight models that perform on par with their
larger counterparts. All the models will be publicly available on our Huggingface
profile at https://huggingface.co/nlpie and the code used to run the
experiments will be available at
https://github.com/nlpie-research/Compact-Biomedical-Transformers.
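A sketch of how one of the released checkpoints could be loaded with the Hugging Face transformers library is shown below; the repository identifier used here is an assumption, so the exact model names should be checked on the https://huggingface.co/nlpie profile.

```python
# Sketch of loading one of the released compact models with Hugging Face
# transformers. The repository name below is an assumption; check the
# https://huggingface.co/nlpie profile for the exact model identifiers.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "nlpie/distil-biobert"   # hypothetical repo id on the nlpie profile
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

inputs = tokenizer("Aspirin inhibits [MASK] aggregation.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)   # (batch, sequence length, vocabulary size)
```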
Marginalised stacked denoising autoencoders for metagenomic data binning
Shotgun sequencing has facilitated the analysis of complex microbial communities. Recently we have shown how local binary patterns (LBP) from image processing can be used to analyse the sequenced samples. LBP codes represent the data in a sparse, high-dimensional space. To improve the performance of our pipeline, marginalised stacked autoencoders are used here to learn frequent LBP codes and map the high-dimensional space to a lower-dimensional dense space. We demonstrate its performance using both low- and high-complexity simulated metagenomic data and compare the performance of our method with several existing techniques, including principal component analysis (PCA) in the dimensionality reduction step and k-mer frequencies in the feature extraction step.
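The abstract does not spell out the exact formulation, so the sketch below assumes the closed-form marginalised denoising autoencoder of Chen et al. (2012), applied to placeholder LBP code histograms and stacked by feeding each layer's output into the next.

```python
import numpy as np

def mda_layer(X, p=0.5):
    """One marginalised denoising layer in closed form (after Chen et al., 2012).
    X: (d, n) data matrix, columns are samples; p: corruption probability."""
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])          # add a bias row
    q = np.full((d + 1, 1), 1.0 - p)
    q[-1] = 1.0                                    # the bias is never corrupted
    S = Xb @ Xb.T                                  # scatter matrix
    Q = S * (q @ q.T)                              # E[x~ x~^T], off-diagonal
    np.fill_diagonal(Q, q.ravel() * np.diag(S))    # E[x~ x~^T], diagonal
    P = S[:d, :] * q.T                             # E[x x~^T]
    W = np.linalg.solve(Q.T + 1e-5 * np.eye(d + 1), P.T).T   # W = P Q^{-1}
    return np.tanh(W @ Xb)

# Stack layers: each layer maps the previous hidden representation.
rng = np.random.default_rng(0)
H = rng.random((256, 1000))        # placeholder LBP code histograms (d x n)
for _ in range(3):
    H = mda_layer(H, p=0.5)
print(H.shape)
```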
Lightweight transformers for clinical natural language processing
Specialised pre-trained language models are becoming more frequent in natural language processing (NLP) since they can potentially outperform models trained on generic texts. BioBERT (Lee et al., BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240, 2020) and BioClinicalBERT (Alsentzer et al., Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78, 2019) are two examples of such models that have shown promise in medical NLP tasks. Many of these models are overparametrised and resource-intensive, but thanks to techniques like knowledge distillation, it is possible to create smaller versions that perform almost as well as their larger counterparts. In this work, we specifically focus on the development of compact language models for processing clinical texts (e.g. progress notes, discharge summaries). We developed a number of efficient lightweight clinical transformers using knowledge distillation and continual learning, with far fewer parameters than the full-sized models. These models performed comparably to larger models such as BioBERT and BioClinicalBERT and significantly outperformed other compact models trained on general or biomedical data. Our extensive evaluation covered several standard datasets and a wide range of clinical text-mining tasks, including natural language inference, relation extraction, named entity recognition and sequence classification. To our knowledge, this is the first comprehensive study specifically focused on creating efficient and compact transformers for clinical NLP tasks. The models and code used in this study can be found on our Huggingface profile at https://huggingface.co/nlpie and GitHub page at https://github.com/nlpie-research/Lightweight-Clinical-Transformers, respectively, promoting reproducibility of our results.
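The distillation component can be illustrated with the generic soft-target objective commonly used when compressing a large teacher into a small student; the temperature and mixing weight below are illustrative assumptions rather than the paper's settings.

```python
# Generic knowledge-distillation loss: a temperature-softened KL term that
# matches the student to the teacher, plus the usual hard-label loss. The
# temperature and mixing weight are illustrative, not the paper's recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example with random logits over a 30k-token vocabulary.
student = torch.randn(4, 30000)
teacher = torch.randn(4, 30000)
labels = torch.randint(0, 30000, (4,))
print(distillation_loss(student, teacher, labels).item())
```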
The Utility of Data Transformation for Alignment, De Novo Assembly and Classification of Short Read Virus Sequences.
Advances in DNA sequencing technology are facilitating genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and fully exploit biological sequence data. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing, in which dimensionality reduction (i.e., compression) methods are routinely used to lessen the computational burden of analyses. In this work, we explored the application of dimensionality reduction methods to numerically represent high-throughput sequence data for three important biological applications of virus sequence data: reference-based mapping, short sequence classification and de novo assembly. Leveraging highly compressed sequence transformations to accelerate sequence comparison, our approach yielded accuracy comparable to that of existing approaches, further demonstrating its suitability for sequences originating from diverse virus populations. We assessed the application of our methodology using both synthetic and real viral pathogen sequences. Our results show that the use of highly compressed sequence approximations can provide accurate results, with analytical performance retained and even enhanced through appropriate dimensionality reduction of sequence data.
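The abstract does not name the specific transform, so the sketch below is only one plausible instantiation of the idea: bases are mapped to numbers, a discrete cosine transform is applied, and only the leading coefficients are kept so that reads can be compared in a much smaller space.

```python
# Illustrative sketch only: the paper does not specify this exact transform.
# Each read is mapped to numbers, transformed with a DCT, and truncated so
# that reads can be compared in a compressed coefficient space.
import numpy as np
from scipy.fft import dct

BASE = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}

def compress_read(read, n_coeffs=16):
    signal = np.array([BASE.get(b, 1.5) for b in read.upper()])
    return dct(signal, norm="ortho")[:n_coeffs]     # keep leading coefficients

r1 = compress_read("ACGTACGTACGGTTAACCGGTTAA")
r2 = compress_read("ACGTACGTACGGTTAACCGGTAAA")
# Compare reads in the compressed space, e.g. by Euclidean distance.
print(np.linalg.norm(r1 - r2))
```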
A crowd of BashTheBug volunteers reproducibly and accurately measure the minimum inhibitory concentrations of 13 antitubercular drugs from photographs of 96-well broth microdilution plates
Tuberculosis is a respiratory disease that is treatable with antibiotics. An increasing prevalence of resistance means that, to ensure a good treatment outcome, it is desirable to test the susceptibility of each infection to different antibiotics. Conventionally, this is done by culturing a clinical sample and then exposing aliquots to a panel of antibiotics, each present at a pre-determined concentration, thereby determining whether the sample is resistant or susceptible to each drug. The minimum inhibitory concentration (MIC) of a drug is the lowest concentration that inhibits growth and is a more useful quantity, but it requires each sample to be tested at a range of concentrations for each drug. Using 96-well broth microdilution plates, with each well containing a lyophilised pre-determined amount of an antibiotic, is a convenient and cost-effective way to measure the MICs of several drugs at once for a clinical sample. Although accurate, this is still an expensive and slow process that requires highly skilled and experienced laboratory scientists. Here we show that, through the BashTheBug project hosted on the Zooniverse citizen science platform, a crowd of volunteers can reproducibly and accurately determine the MICs for 13 drugs and that simply taking the median or mode of 11–17 independent classifications is sufficient. There is therefore a potential role for crowds to support (but not supplant) the role of experts in antibiotic susceptibility testing.
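The consensus rule itself is simple to illustrate. The sketch below takes the median and mode of a set of made-up volunteer readings for one drug on one sample; the values are placeholders, not BashTheBug data.

```python
# Sketch of the consensus rule described above: take the median (or mode) of
# the independent volunteer classifications for a drug. Values are made up.
import numpy as np
from scipy import stats

# MIC readings (mg/L) from 13 volunteers for one drug on one sample.
classifications = np.array([0.25, 0.25, 0.5, 0.25, 0.25, 0.5, 0.25,
                            0.25, 0.12, 0.25, 0.25, 0.5, 0.25])

consensus_median = np.median(classifications)
consensus_mode = stats.mode(classifications, keepdims=False).mode
print(consensus_median, consensus_mode)
```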
Quantitative measurement of antibiotic resistance in Mycobacterium tuberculosis reveals genetic determinants of resistance and susceptibility in a target gene approach
The World Health Organization has a goal of universal drug susceptibility testing for
patients with tuberculosis; however, molecular diagnostics to date have focused largely
on first-line drugs and predicting binary susceptibilities. We used a multivariable linear
mixed model alongside whole genome sequencing and a quantitative microtiter plate
assay to relate genomic mutations to minimum inhibitory concentration (MIC) in 15,211
Mycobacterium tuberculosis patient isolates from 23 countries across five continents.
This identified 492 unique MIC-elevating variants across thirteen drugs, as well as 91
mutations resulting in hypersensitivity. Our results advance genetics-based diagnostics
for tuberculosis and serve as a curated training/testing dataset for development of drug
resistance prediction algorithms.
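A toy sketch of this kind of model is shown below, regressing log2(MIC) on a mutation indicator with a random intercept per group using statsmodels. The data frame, the grouping factor, and the single-mutation formula are invented for illustration and are far simpler than the paper's multivariable model.

```python
# Toy sketch of the kind of mixed model used: log2(MIC) regressed on mutation
# indicators with a random intercept per group (e.g. sampling site or lineage).
# The data, grouping variable, and effect size are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "log2_mic": rng.normal(0, 1, n),
    "rpoB_S450L": rng.integers(0, 2, n),    # example resistance-associated mutation
    "site": rng.choice(["lab_A", "lab_B", "lab_C"], n),
})
df["log2_mic"] += 3.0 * df["rpoB_S450L"]    # simulate an MIC-elevating effect

model = smf.mixedlm("log2_mic ~ rpoB_S450L", df, groups=df["site"]).fit()
print(model.summary())
```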
Multiview classification and dimensionality reduction of scalp and intracranial EEG data through tensor factorisation
Electroencephalography (EEG) signals arise as a mixture of various neural processes that occur in different spatial, frequency and temporal locations. In classification paradigms, algorithms are developed that can distinguish between these processes. In this work, we apply tensor factorisation to a set of EEG data from a group of epileptic patients and factorise the data into three modes: space, time and frequency, with each mode containing a number of components or signatures. We train separate classifiers on various feature sets corresponding to complementary combinations of those modes and components and test the classification accuracy of each set. The relative influence on the classification accuracy of the respective spatial, temporal or frequency signatures can then be analysed and useful interpretations can be made. Additionally, we show that through tensor factorisation we can perform dimensionality reduction by evaluating the classification performance with regard to the number of mode components and by rejecting components with an insignificant contribution to the classification accuracy.
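A minimal sketch of the factorisation step is shown below, using a CP (PARAFAC) decomposition from TensorLy on a random stand-in for a space x frequency x time tensor; the rank and tensor construction are assumptions for illustration only.

```python
# Sketch of the factorisation step: decompose a space x frequency x time EEG
# tensor into a small number of rank-1 components whose factor vectors act as
# spatial, spectral and temporal signatures. The tensor here is random; in
# practice it would hold, e.g., time-frequency power for each channel.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
eeg_tensor = tl.tensor(rng.random((32, 40, 200)))   # channels x frequencies x samples

rank = 5                                            # number of components
weights, (spatial, spectral, temporal) = parafac(eeg_tensor, rank=rank)

print(spatial.shape, spectral.shape, temporal.shape)  # (32, 5) (40, 5) (200, 5)
# Feature sets for the classifiers can then be built from complementary
# combinations of these mode signatures, and components contributing little to
# the classification accuracy can be dropped (dimensionality reduction).
```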