42 research outputs found

    Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models

    Tabular data is often hidden in text, particularly in medical diagnostic reports. Traditional machine learning (ML) models designed to work with tabular data cannot effectively process information in this form. On the other hand, large language models (LLMs), which excel at textual tasks, are probably not the best tool for modeling tabular data. Therefore, we propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM. Drawing upon the reasoning capabilities of LLMs, TEMED-LLM goes beyond traditional extraction techniques, accurately inferring tabular features even when their names are not explicitly mentioned in the text. This is achieved by combining domain-specific reasoning guidelines with a proposed data validation and reasoning correction feedback loop. By applying interpretable ML models such as decision trees and logistic regression over the extracted and validated data, we obtain end-to-end interpretable predictions. We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics. Given its predictive performance, simplicity, and interpretability, TEMED-LLM underscores the potential of leveraging LLMs to improve the performance and trustworthiness of ML models in medical applications.
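
    The validation and correction feedback loop described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the schema, the feature names, the `validate` and `extract_with_feedback` helpers, and the stub extractor standing in for an LLM call are all hypothetical and chosen only to show the control flow.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical schema (not from the paper): feature name -> plausible (min, max) range.
SCHEMA = {"age": (0, 120), "systolic_bp": (50, 260)}


def validate(record: Dict[str, float]) -> List[str]:
    """Return a list of human-readable problems found in an extracted record."""
    errors = []
    for name, (low, high) in SCHEMA.items():
        if name not in record:
            errors.append(f"missing feature: {name}")
        elif not (low <= record[name] <= high):
            errors.append(f"{name}={record[name]} outside [{low}, {high}]")
    return errors


def extract_with_feedback(
    report: str,
    extractor: Callable[[str, List[str]], Dict[str, float]],
    max_rounds: int = 3,
) -> Optional[Dict[str, float]]:
    """Call the extractor, validate its output, and feed errors back until it passes."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        record = extractor(report, feedback)
        feedback = validate(record)
        if not feedback:
            return record  # validated row, ready for a tabular model
    return None  # give up after max_rounds failed attempts


def stub_extractor(report: str, feedback: List[str]) -> Dict[str, float]:
    """Stand-in for an LLM call: wrong on the first try, corrected once it gets feedback."""
    if not feedback:
        return {"age": 430, "systolic_bp": 120}
    return {"age": 43, "systolic_bp": 120}


print(extract_with_feedback("43-year-old patient, BP 120/80", stub_extractor))
```

    In this sketch the validated records would then be passed to an interpretable model such as a decision tree; that step is omitted here.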

    Sequence-based typing of genetic targets encoded outside of the O-antigen gene cluster is indicative of Shiga toxin-producing Escherichia coli serogroup lineages

    Serogroup classifications based upon the O-somatic antigen of Shiga toxin-producing Escherichia coli (STEC) provide significant epidemiological information on clinical isolates. Each O-antigen determinant is encoded by a unique cluster of genes present between the gnd and galF chromosomal genes. Alternatively, serogroup-specific polymorphisms might be encoded in loci that lie outside of the O-antigen gene cluster. Segments of the core bacterial loci mdh, gnd, gcl, ppk, metA, ftsZ, relA and metG for 30 O26 STEC strains have previously been sequenced, and comparative analyses with O157 distinguished these two serogroups. To screen these loci for serogroup-specific traits within a broader range of clinically significant serogroups, DNA sequences were obtained for 19 strains of 10 additional STEC serogroups. Unique alleles were observed at the gnd locus for each examined STEC serogroup, and this correlation persisted when comparative analyses were extended to 144 gnd sequences from 26 O-serogroups (comprising 42 O : H-serotypes). These included O157, O121, O103, O26, O5 : non-motile (NM), O145 : NM, O113 : H21, O111 : NM and O117 : H7 STEC; furthermore, non-toxin encoding O157, O26, O55, O6 and O117 strains encoded distinct gnd alleles compared to STEC strains of the same serogroup. DNA sequencing of a 643 bp region of gnd was, therefore, sufficient to minimally determine the O-antigen of STEC through molecular means, and the location of gnd next to the O-antigen gene cluster offered additional support for the co-inheritance of these determinants. The gnd DNA sequence-based serogrouping method could improve the typing capabilities for STEC in clinical laboratories, and was used successfully to characterize O121 : H19, O26 : H11 and O177 : NM clinical isolates prior to serological confirmation during outbreak investigations.

    Tetrameric repeat units associated with virulence factor phase variation in Haemophilus also occur in Neisseria spp. and Moraxella catarrhalis

    The tetrameric repeat units 5'-CAAT-3' and 5'-GCAA-3' are associated with phase variable expression of lipopolysaccharide biosynthetic genes in Haemophilus influenzae. Four other tetrameric repeat units have also been reported from H. influenzae strain Rd, 5'-CAAC-3', 5'-GACA-3', 5'-AGCT-3', and 5'-TTTA-3', which are also associated with putative virulence factors. Using oligonucleotide probes corresponding to five tandem copies of each of these tetramers, we have screened three strains of Neisseria meningitidis and one each of Neisseria gonorrhoeae, Neisseria lactamica, Haemophilus parainfluenzae, Bordetella pertussis, Bordetella parapertussis, Bordetella bronchiseptica and Moraxella catarrhalis for the presence of these motifs. We have demonstrated the presence of multiple copies of the 5'-GCAA-3' motif in all the Neisseria strains tested, and also the repeated motif 5'-CAAC-3' in M. catarrhalis. We have further demonstrated by Southern blot analysis that the 5'-CAAC-3' repeats detected in M. catarrhalis are probably associated with the same genes as in H. influenzae, but that the 5'-GCAA-3' motifs in N. meningitidis are not. The use of characterised tetrameric DNA sequences as hybridisation probes may prove useful in the identification of novel phase variable virulence determinants in organisms other than H. influenzae. (Type: journal article; language: English; NRC publication: yes.)

    DNA repeats identify novel virulence genes in Haemophilus influenzae: Proc Natl Acad Sci U.S.A.

    The whole genome sequence (1.83 Mbp) of Haemophilus influenzae strain Rd was searched to identify tandem oligonucleotide repeat sequences. Loss or gain of one or more nucleotide repeats through a recombination-independent slippage mechanism is known to mediate phase variation of surface molecules of pathogenic bacteria, including H. influenzae. This facilitates evasion of host defenses and adaptation to the varying microenvironments of the host. We reasoned that iterative nucleotides could identify novel genes relevant to microbe-host interactions. Our search of the Rd genome sequence identified 9 novel loci with multiple (range 6-36, mean 22) tandem tetranucleotide repeats. All were found to be located within putative open reading frames and included homologues of hemoglobin-binding proteins of Neisseria, a glycosyltransferase (lgtC gene product) of Neisseria, and an adhesin of Yersinia. These tetranucleotide repeat sequences were also shown to be present in two other epidemiologically different H. influenzae type b strains, although the number and distribution of repeats were different. Further characterization of the lgtC gene showed that it was involved in phenotypic switching of a lipopolysaccharide epitope and that this variable expression was associated with changes in the number of tetranucleotide repeats. Mutation of lgtC resulted in attenuated virulence of H. influenzae in an infant rat model of invasive infection. These data indicate the rapidity, economy, and completeness with which whole genome sequences can be used to investigate the biology of pathogenic bacteria. (Type: journal article; language: English; NRC publication: yes.)
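
    A search for tandem tetranucleotide repeats of the kind described in this abstract can be sketched with a regular expression. This is an illustrative sketch, not the authors' search procedure: the `find_tandem_repeats` helper, the `min_copies` threshold, and the toy sequence are assumptions, and the scan covers only one strand and only the exact unit given (a run of CAAT read in a shifted frame, such as AATC, is reported starting at the first in-frame CAAT).

```python
import re


def find_tandem_repeats(sequence, unit, min_copies=5):
    """Return (start, copies) for each run of `unit` repeated at least `min_copies` times."""
    pattern = re.compile(r"(?:%s){%d,}" % (re.escape(unit), min_copies))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), copies))
    return hits


# Toy sequence invented for illustration; it is not taken from any genome.
toy = "ATG" + "CAAT" * 6 + "GGC" + "GCAA" * 3 + "TTT"

# The six tetramers named in the abstracts above.
for unit in ("CAAT", "GCAA", "CAAC", "GACA", "AGCT", "TTTA"):
    print(unit, find_tandem_repeats(toy, unit))
```

    With the default threshold only the six-copy CAAT run is reported; lowering `min_copies` to 3 would also pick up the shorter GCAA run.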

    A novel and effective natural product-based immunodetection tool for TNT-like compounds

    This study aimed to develop a fast and reliable protocol for Trinitrophenol-Tris(hydroxymethyl)aminomethane (TNP-Tris) detection using a beta-lactamase fusion protein, a natural product-based immunoreagent of competitive sensitivity developed here for the first time. Since the fusion protein 11B3-scFv-beta-lactamase is constructed from a scFv antibody (11B3) linked to an enzyme (beta-lactamase), the secondary-antibody step of the enzyme-linked immunosorbent assay (ELISA) is omitted completely. The fusion protein itself serves both as the binding agent for the model antigen and as the detecting agent, owing to the presence of the naturally occurring enzyme. It thus affords one-step TNP-Tris detection, reaching a promising LOD value of 45 ± 2 fmol or 157 ± 6 pg/mL. Taken together, the current protocol represents a much cheaper and significantly less time-consuming alternative to both the recombinant antibodies and the recombinant phages previously designed in our labs for the same purpose.