On Generative Models and Joint Architectures for Document-level Relation Extraction
Biomedical text is being generated at a high rate in scientific literature publications and electronic health records. Within these documents lies a wealth of potentially useful information in biomedicine. Relation extraction (RE), the process of automating the identification of structured relationships between entities within text, represents a highly sought-after goal in biomedical informatics, offering the potential to unlock deeper insights and connections from this vast corpus of data. In this dissertation, we tackle this problem with a variety of approaches.
We review the recent history of the field of document-level RE. Several themes emerge. First, graph neural networks dominate the methods for constructing entity and relation representations. Second, clever uses of attention allow these constructions to focus on particularly relevant tokens and object representations (such as mentions and entities). Third, aggregation of signal across mentions in entity-level RE is a key focus of research. Fourth, the injection of additional signal by adding tokens to the text prior to encoding via language model (LM) or through additional learning tasks boosts performance. Last, we explore an assortment of strategies for the challenging task of end-to-end entity-level RE.
Of particular note are sequence-to-sequence (seq2seq) methods, which have become particularly popular in the past few years. With the success of general-domain generative LMs, biomedical NLP researchers have trained a variety of these models on biomedical text under the assumption that they would be superior for biomedical tasks. As training such models is computationally expensive, we investigate whether they outperform generic models. We test this assumption rigorously by comparing the performance of all major biomedical generative language models to that of their generic counterparts across multiple biomedical RE datasets, in the traditional finetuning setting as well as in the few-shot setting. Surprisingly, we found that biomedical models tended to underperform their generic counterparts. However, we found that small-scale biomedical instruction finetuning improved performance to a similar degree as larger-scale generic instruction finetuning.
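As a concrete illustration of the seq2seq formulation, the sketch below casts RE as conditional text generation with a generic HuggingFace encoder-decoder; the model name and the relation linearization format are illustrative assumptions, not the exact setup used in these experiments.

    # Minimal sketch: RE as seq2seq generation with a generic encoder-decoder.
    # Model name and linearization format are illustrative assumptions.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = "t5-base"  # a biomedical variant would simply be swapped in here
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    passage = "Extract relations: Metformin reduces hepatic glucose production."
    target = "(Metformin, reduces, hepatic glucose production)"  # linearized gold relation

    inputs = tokenizer(passage, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # finetuning objective: cross-entropy on the target text
    print(float(loss))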
Zero-shot natural language processing (NLP) offers savings on the expenses associated with annotating datasets and the specialized knowledge required for applying NLP methods. Large, generative LMs trained to align with human objectives have demonstrated impressive zero-shot capabilities over a broad range of tasks. However, the effectiveness of these models in biomedical RE remains uncertain. To bridge this gap in understanding, we investigate how GPT-4 performs across several RE datasets. We experiment with the recent JSON generation features to produce structured output, alternately defining an explicit schema that describes the relation structure and inferring the structure from the prompt itself. Our work is the first to study zero-shot biomedical RE across a variety of datasets. Overall, performance was lower than that of fully-finetuned methods. Recall suffered in examples with more than a few relations. Entity mention boundaries were a major source of error, which future work could fruitfully address.
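A minimal sketch of the schema-in-prompt variant is shown below, using the OpenAI Python client's JSON mode; the model name, schema fields, and example sentence are illustrative, not those used in the study.

    # Zero-shot RE with JSON-mode output; the schema is described in the prompt.
    # Model name, schema fields, and the example sentence are assumptions.
    import json
    from openai import OpenAI

    client = OpenAI()
    schema_prompt = (
        "Extract chemical-disease relations from the text. Respond with JSON of the "
        'form {"relations": [{"chemical": str, "disease": str, "relation": str}]}.'
    )
    text = "Cisplatin treatment was associated with nephrotoxicity in several patients."

    response = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},  # constrains output to parseable JSON
        messages=[
            {"role": "system", "content": schema_prompt},
            {"role": "user", "content": text},
        ],
    )
    relations = json.loads(response.choices[0].message.content)["relations"]
    print(relations)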
In our previous work with generative LMs, we noted that RE performance decreased with the number of gold relations in an example. This observation aligns with the general pattern that recurrent neural network and transformer-based model performance tends to decrease with sequence length. Generative LMs also do not identify textual mentions or group them into entities, which are valuable information extraction tasks unto themselves. Therefore, in this age of generative methods, we revisit non-seq2seq methodology for biomedical RE. We adopt a sequential framework: named entity recognition (NER), clustering of mentions into entities, and relation classification (RC). As errors early in the pipeline necessarily cause downstream errors, and NER performance is near its ceiling, we focus on improving clustering. We match state-of-the-art (SOTA) performance in NER, and substantially improve mention clustering performance by incorporating dependency parsing and gating string dissimilarity embeddings.
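The gating idea can be pictured as a mention-pair scorer that combines contextual-embedding features with string-dissimilarity features, with a learned gate deciding how much of the string signal to pass through. The PyTorch sketch below is a hypothetical rendering of that mechanism, not the dissertation's actual architecture.

    # Hypothetical sketch of gated string-dissimilarity features in mention-pair scoring.
    import torch
    import torch.nn as nn

    class GatedMentionPairScorer(nn.Module):
        def __init__(self, embed_dim: int, str_feat_dim: int):
            super().__init__()
            # gate conditioned on the contextual pair representation
            self.gate = nn.Sequential(nn.Linear(2 * embed_dim, str_feat_dim), nn.Sigmoid())
            self.scorer = nn.Linear(2 * embed_dim + str_feat_dim, 1)

        def forward(self, m1, m2, str_feats):
            pair = torch.cat([m1, m2], dim=-1)
            gated = self.gate(pair) * str_feats  # suppress or pass the string-dissimilarity signal
            return self.scorer(torch.cat([pair, gated], dim=-1))  # coreference logit

    # Mention pairs scoring above a threshold would be merged into the same entity cluster.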
Overall, we advance the field of biomedical RE in a few ways. In our experiments with finetuned LMs, we show that biomedicine-specific models are unnecessary, freeing researchers to make use of SOTA generic LMs. The relatively high few-shot performance in these experiments also suggests that biomedical RE can be reasonably accessible, as it is not so difficult to construct small datasets. Our investigation into zero-shot RE shows that SOTA LMs can compete with fully finetuned smaller LMs. Together these studies also demonstrate weaknesses of generative RE. Last, we show that non-generative RE methods still outperform generative methods in the fully-finetuned setting.
Synthetic standards for mass spectrometry-based carbohydrate sequencing and the automated solution-phase syntheses of beta-glucans
This dissertation describes 1) the synthesis of disaccharide standards for mass spectrometry-based carbohydrate sequencing and 2) the automated iterative solution-phase synthesis of beta-glucans. Two libraries of mass-identical disaccharides were synthesized. Varying in linkage position and the identity of the non-reducing end monosaccharide residue, these compounds served as synthetic standards for systematic study by tandem mass spectrometry. These disaccharides were analyzed by mass spectrometry in the laboratories of Professors Edward Yeung and Young-Jin Lee. The varying degrees of fragmentation observed in the MS-MS spectra of several of these disaccharides were used to produce classification functions that were capable of correctly classifying the linkage position and identity of the non-reducing end monosaccharide residue. These results provide insight that will ultimately contribute to the development of faster carbohydrate sequencing methods.
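The classification step can be pictured, under assumptions, as supervised discriminant analysis over relative fragment-ion intensities. The sketch below uses scikit-learn with toy numbers for illustration only, not measured spectra, and the choice of linear discriminant analysis is itself an assumption.

    # Hedged sketch: classification functions over MS/MS fragment intensities.
    # Toy intensities and the use of LDA are assumptions for illustration only.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # rows: disaccharide spectra; columns: relative intensities of diagnostic fragment ions
    X = np.array([
        [0.80, 0.12, 0.08], [0.75, 0.15, 0.10],   # 1->4 linked
        [0.20, 0.65, 0.15], [0.25, 0.60, 0.15],   # 1->3 linked
        [0.10, 0.20, 0.70], [0.12, 0.18, 0.70],   # 1->6 linked
    ])
    y = ["1->4", "1->4", "1->3", "1->3", "1->6", "1->6"]

    clf = LinearDiscriminantAnalysis().fit(X, y)
    print(clf.predict([[0.70, 0.20, 0.10]]))  # predicted linkage position for a new spectrum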
The second portion of this dissertation describes the automated solution-phase syntheses of branched and linear beta-glucans. A fluorocarbon-based tag on the growing sugar chain allows for facile purification of intermediates by automated fluorous solid-phase extraction (FSPE), and also provides a means of noncovalent attachment to a fluorinated glass slide for the direct formation of carbohydrate microarrays. Our synthetic approach allows for traditional solution-phase kinetics, reaction monitoring, and chromatographic purification, techniques that are not possible with solid-phase oligosaccharide synthesis. Several new glucosyl trichloroacetimidate building blocks were synthesized and subsequently utilized for the automated synthesis of branched and linear beta-glucan fragments. Finally, conditions were developed to fully deprotect our synthetic glucans, rendering them suitable for NMR binding studies and biological assays. These studies established automation protocols that can be used for the synthesis of larger, more complex beta-glucan structures.
The healing power of words: psychotherapy in the USSR, 1956-1985
This thesis examines the growth of psychotherapy as a discipline in the Soviet Union between 1956 and 1985, looking at the types of treatment that existed in this period, the tasks that psychotherapy was to perform according to the physicians who promoted it, and their efforts to establish it as a distinct medical speciality and popularise it within the Soviet healthcare system. It looks at how different challenges encountered by the promoters of psychotherapy influenced its practice and the discourse around it, and how it was shaped by the broader political, social and cultural context of the USSR. It demonstrates that psychotherapy after Stalin was not stagnant but developed into a diverse field fuelled by the enthusiasm of its practitioners who, while sticking to methods that had lost popularity in the West by the mid-twentieth century, gave them new theoretical underpinnings, constantly worked to modify and improve them, and supplemented them with new ideas and approaches. The result was a unique form of psychotherapy characterised by a physiological language, a specific view of the human mind and body, and an unusually broad understanding of its tasks. This thesis analyses the legitimising strategies employed by psychotherapists to present their discipline as both scientifically substantiated and useful to Soviet society, showing that it was envisaged not only as a strictly therapeutic method but also as a potentially universal auxiliary treatment and as a means of prophylaxis. It examines various aspects of Soviet psychotherapy such as its goals, links to physiology, emphasis on human self-perfection, embrace of placebo as a legitimate form of therapy, and the blurring of the boundary between therapy, prophylaxis and conversation implicit in its theory, seeking to understand what psychotherapy was for its Soviet practitioners and how it came to be conceptualised in this particular way.
A strategy to identify event specific hospitalizations in large health claims databases
Background: Health insurance claims data offer a unique opportunity to study disease distribution on a large scale. Challenges arise in the process of accurately analyzing these raw data. One important challenge to overcome is the accurate classification of study outcomes. For example, using claims data, there is no clear way of classifying hospitalizations due to a specific event. This is because of the inherent disjointedness and lack of context that typically come with raw claims data.
Methods: In this paper, we propose a framework for classifying hospitalizations due to a specific event. We then tested this framework in a private health insurance claims database (Symphony) with approximately 4 million US adults who tested positive for COVID-19 between March and December 2020. Our claims-based proportion of COVID-19-related hospitalizations is then compared, by age, to nationally reported rates from the Centers for Disease Control.
Results: Across all ages (18+), the total percentage of Symphony patients who met our definition of hospitalized due to COVID-19 was 7.3%, which was similar to the CDC's estimate of 7.5%. By age group, as defined by the CDC, our estimates vs. the CDC's estimates were 18–49: 2.7% vs. 3%, 50–64: 8.2% vs. 9.2%, and 65+: 14.6% vs. 28.1%.
Conclusions: The proposed methodology is a rigorous way to define event-specific hospitalizations in claims data. This methodology can be extended to many different types of events and used on a variety of claims databases.
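To make the general idea concrete, here is a hedged pandas sketch of one way such a rule could be operationalized: flag a hospitalization as event-specific when an inpatient claim carries the event's diagnosis code within a fixed window after the index event. The column names, the ICD-10 code U07.1, and the 30-day window are illustrative assumptions, not the paper's exact criteria.

    # Illustrative sketch only; column names, code U07.1, and the 30-day window are assumptions.
    import pandas as pd

    def flag_event_hospitalizations(tests: pd.DataFrame, claims: pd.DataFrame,
                                    dx_code: str = "U07.1", window_days: int = 30) -> pd.Series:
        # inpatient claims that carry the event's diagnosis code
        inpatient = claims[(claims["place_of_service"] == "inpatient")
                           & (claims["dx_codes"].apply(lambda codes: dx_code in codes))]
        # join qualifying claims to each patient's index (positive test) date
        merged = tests.merge(inpatient, on="patient_id")
        in_window = (merged["service_date"] - merged["test_date"]).dt.days.between(0, window_days)
        hospitalized = set(merged.loc[in_window, "patient_id"])
        # boolean flag per row of the test cohort
        return tests["patient_id"].isin(hospitalized)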
Non-Poisson processes: regression to equilibrium versus equilibrium correlation functions
We study the response to perturbation of non-Poisson dichotomous fluctuations that generate super-diffusion. We adopt the Liouville perspective and with it a quantum-like approach based on splitting the density distribution into a symmetric and an anti-symmetric component. To accommodate the equilibrium condition behind the stationary correlation function, we study the time evolution of the anti-symmetric component, while keeping the symmetric component at equilibrium. For any realistic form of the perturbed distribution density we expect a breakdown of the Onsager principle, namely, of the property that the subsequent regression of the perturbation to equilibrium is identical to the corresponding equilibrium correlation function. We find the directions to follow for the calculation of higher-order correlation functions, an unsettled problem, which has been addressed in the past by means of approximations yielding quite different physical effects.
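For reference, the Onsager regression principle asserts that the normalized decay of a small perturbation coincides with the equilibrium autocorrelation function; in generic notation (not necessarily the paper's), for an observable A:

    \[
    \frac{\langle A(t)\rangle_{\mathrm{pert}} - \langle A\rangle_{\mathrm{eq}}}
         {\langle A(0)\rangle_{\mathrm{pert}} - \langle A\rangle_{\mathrm{eq}}}
    \;=\;
    \frac{\langle A(0)A(t)\rangle_{\mathrm{eq}} - \langle A\rangle_{\mathrm{eq}}^{2}}
         {\langle A^{2}\rangle_{\mathrm{eq}} - \langle A\rangle_{\mathrm{eq}}^{2}}
    \;\equiv\; \Phi_{A}(t).
    \]

The abstract's claim is that, for non-Poisson dichotomous fluctuations, realistic perturbed distributions violate this identity.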
Group representation of bicrystal invariant translations with an application to the topology of secondary grain boundary dislocations
All DSC-lattice translations of one lattice with respect to the second lattice of a bicrystal are described as a group. It is shown that the matrix representation of this group can be used to solve topological problems connected with secondary grain boundary dislocations (SGBDs), such as finding the step in the boundary associated with the SGBD. We have formulated this problem by establishing the step vector S associated with the Burgers vector b of the SGBD in a cubic bicrystal. The problem is then solved in 2D, and the way to generalize to 3D is indicated.