Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text
Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and available resources. In this study, we investigate how pre-trained language models (PLMs) handle code-switched (CS) text in three dimensions: a) the ability of PLMs to detect code-switched text, b) variations in the structural information that PLMs utilise to capture code-switched text, and c) the consistency of semantic information representation in code-switched text. To conduct a systematic and controlled evaluation of the language models in question, we create a novel dataset of well-formed naturalistic code-switched text along with parallel translations into the source languages. Our findings reveal that pre-trained language models are effective in generalising to code-switched text, shedding light on the abilities of these models to generalise representations to CS corpora. We release all our code and data, including the novel corpus, at https://github.com/francesita/code-mixed-probes
Comment: Accepted for publication at the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Data and code available at https://github.com/francesita/code-mixed-probe
SemEval-2022 Task 2: multilingual idiomaticity detection and sentence embedding
This paper presents the shared task on Multilingual Idiomaticity Detection and Sentence Embedding, which consists of two subtasks: (a) a binary classification task aimed at identifying whether a sentence contains an idiomatic expression, and (b) a task based on semantic text similarity which requires the model to adequately represent potentially idiomatic expressions in context. Each subtask includes different settings regarding the amount of training data. Besides the task description, this paper introduces the datasets in English, Portuguese, and Galician and their annotation procedure, the evaluation metrics, and a summary of the participant systems and their results. The task had close to 100 registered participants organised into twenty-five teams, making over 650 and 150 submissions in the practice and evaluation phases, respectively.
SIRT6 protein deacetylase interacts with MYH DNA glycosylase, APE1 endonuclease, and Rad9-Rad1-Hus1 checkpoint clamp
Background: SIRT6, a member of the NAD+-dependent histone/protein deacetylase family, regulates genomic stability, metabolism, and lifespan. MYH glycosylase and APE1 are two base excision repair (BER) enzymes involved in mutation avoidance from oxidative DNA damage. Rad9-Rad1-Hus1 (9-1-1) checkpoint clamp promotes cell cycle checkpoint signaling and DNA repair. BER is coordinated with the checkpoint machinery and requires chromatin remodeling for efficient repair. SIRT6 is involved in DNA double-strand break repair and has been implicated in BER. Here we investigate the direct physical and functional interactions between SIRT6 and BER enzymes. Results: We show that SIRT6 interacts with and stimulates MYH glycosylase and APE1. In addition, SIRT6 interacts with the 9-1-1 checkpoint clamp. These interactions are enhanced following oxidative stress. The interdomain connector of MYH is important for interactions with SIRT6, APE1, and 9-1-1. Mutagenesis studies indicate that SIRT6, APE1, and Hus1 bind overlapping but different sequence motifs on MYH. However, there is no competition of APE1, Hus1, or SIRT6 binding to MYH. Rather, one MYH partner enhances the association of the other two partners to MYH. Moreover, APE1 and Hus1 act together to stabilize the MYH/SIRT6 complex. Within human cells, MYH and SIRT6 are efficiently recruited to confined oxidative DNA damage sites within transcriptionally active chromatin, but not within repressive chromatin. In addition, Myh foci induced by oxidative stress and Sirt6 depletion are frequently localized on mouse telomeres. Conclusions: Although SIRT6, APE1, and 9-1-1 bind to the interdomain connector of MYH, they do not compete for MYH association. Our findings indicate that SIRT6 forms a complex with MYH, APE1, and 9-1-1 to maintain genomic and telomeric integrity in mammalian cells.
PVS: a web server for protein sequence variability analysis tuned to facilitate conserved epitope discovery
We have developed PVS (Protein Variability Server), a web-based tool that uses several variability metrics to compute the absolute site variability in multiple protein-sequence alignments (MSAs). The variability is then assigned to a user-selected reference sequence consisting of either the first sequence in the alignment or a consensus sequence. Subsequently, PVS performs tasks that are relevant for structure-function studies, such as plotting and visualizing the variability in a relevant 3D-structure. In addition, PVS implements other tasks intended to facilitate the design of epitope discovery-driven vaccines against pathogens where sequence variability largely contributes to immune evasion. Thus, PVS can return the conserved fragments in the MSA, as defined by a user-provided variability threshold, and locate them in a relevant 3D-structure. Furthermore, PVS can return a variability-masked sequence, which can be directly submitted to the RANKPEP server for the prediction of conserved T-cell epitopes. PVS is freely available at: http://imed.med.ucm.es/PVS/
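The thresholding and masking steps described in this abstract can be illustrated with a minimal Python sketch. It is not the PVS implementation: the abstract does not name its variability metrics, so per-column Shannon entropy is assumed here, and the function names, the `.` mask character, and the toy alignment are all hypothetical. Gap handling is ignored.

```python
from collections import Counter
from math import log2

def shannon_variability(msa):
    """Per-column Shannon entropy (bits) of equal-length aligned sequences.
    Assumed metric; the abstract only says 'several variability metrics'."""
    scores = []
    for column in zip(*msa):
        n = len(column)
        counts = Counter(column)
        scores.append(-sum((c / n) * log2(c / n) for c in counts.values()))
    return scores

def mask_variable_sites(reference, scores, threshold, mask_char="."):
    """Keep reference residues at sites with variability <= threshold; mask the rest."""
    return "".join(r if s <= threshold else mask_char
                   for r, s in zip(reference, scores))

def conserved_fragments(masked, min_length=1, mask_char="."):
    """Split the masked sequence into runs of conserved residues."""
    return [frag for frag in masked.split(mask_char) if len(frag) >= min_length]

# Toy alignment (hypothetical); the first sequence serves as the reference.
msa = ["ACDE", "ACDF", "AGDE", "ACDE"]
scores = shannon_variability(msa)
masked = mask_variable_sites(msa[0], scores, threshold=0.5)
fragments = conserved_fragments(masked)
```

With this toy input, columns 2 and 4 vary and are masked, so the reference `ACDE` becomes `A.D.`; a real alignment would need an explicit policy for gap characters.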
The Türkiye earthquake sequence of February 2023: A longitudinal study report by EEFIT
On 6 February 2023 at 4:17 am local time, a large area in southeastern Türkiye and northern
Syria was hit by an Mw 7.8 earthquake, which was followed by an Mw 7.5 earthquake at 1:24
pm local time, causing the loss of more than 50,000 lives, some 100,000 injuries and
significant damage to buildings and infrastructure, estimated to be in the range of 84.1 billion
USD for Türkiye alone. The largest earthquake in Türkiye since the deadly 1939 Erzincan
earthquake, though with much larger losses, the sequence immediately attracted the
attention of the global post-disaster reconnaissance/engineering communities. This included
the Earthquake Engineering Field Investigation Team (EEFIT), who, within one week of the
event, gathered a team of 30 people from academia and industry in the UK (19), Türkiye
(5), New Zealand (1), Hungary (1), Bulgaria (1), Greece (1) and USA (1) with two support
members from the UK and the Netherlands, to study the events and their impacts, and also to
develop suggestions to reduce the existing vulnerabilities in the future. The team was
organised into 6 working groups: (1) strong ground
motions and seismotectonics, (2) geotechnics, (3) structures, (4) infrastructure, (5) remote
sensing and (6) relief response and recovery.
How accurate and statistically robust are catalytic site predictions based on closeness centrality?
Background: We examine the accuracy of enzyme catalytic residue predictions from a network representation of protein structure. In this model, amino acid α-carbons specify vertices within a graph and edges connect vertices that are proximal in structure. Closeness centrality, which has shown promise in previous investigations, is used to identify important positions within the network. Closeness centrality, a global measure of network centrality, is calculated as the reciprocal of the average distance between vertex i and all other vertices. Results: We benchmark the approach against 283 structurally unique proteins within the Catalytic Site Atlas. Our results, which are in line with previous investigations of smaller datasets, indicate closeness centrality predictions are statistically significant. However, unlike previous approaches, we specifically focus on residues with the very best scores. Over the top five closeness centrality scores, we observe an average true to false positive rate ratio of 6.8 to 1. As demonstrated previously, adding a solvent accessibility filter significantly improves predictive power; the average ratio is increased to 15.3 to 1. We also demonstrate (for the first time) that filtering the predictions by residue identity improves the results even more than accessibility filtering. Here, we simply eliminate residues with physicochemical properties unlikely to be compatible with catalytic requirements from consideration. Residue identity filtering improves the average true to false positive rate ratio to 26.3 to 1. Combining the two filters has little effect on the results. Calculated p-values for the three prediction schemes range from 2.7E-9 to less than 8.8E-134. Finally, the sensitivity of the predictions to structure choice and slight perturbations is examined. Conclusion: Our results resolutely confirm that closeness centrality is a viable prediction scheme whose predictions are statistically significant. Simple filtering schemes substantially improve the method's predictive power. Moreover, no clear effect on performance is observed when comparing ligated and unligated structures. Similarly, the CC prediction results are robust to slight structural perturbations from molecular dynamics simulation.
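The closeness centrality definition given in this abstract (the reciprocal of the average distance between vertex i and all other vertices) can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the 8.0 Å contact cutoff, the use of unweighted shortest-path (hop) distance, the function names, and the toy coordinates are all assumptions.

```python
from collections import deque
from math import dist

def contact_graph(coords, cutoff=8.0):
    """Connect residues whose alpha-carbons lie within `cutoff`.
    The cutoff value is an assumption; the abstract does not state one."""
    n = len(coords)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if dist(coords[i], coords[j]) <= cutoff:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def closeness_centrality(adj, i):
    """Reciprocal of the average shortest-path distance from i to all other vertices."""
    hops = {i: 0}
    queue = deque([i])
    while queue:  # breadth-first search gives hop distances in an unweighted graph
        u = queue.popleft()
        for v in adj[u]:
            if v not in hops:
                hops[v] = hops[u] + 1
                queue.append(v)
    others = len(adj) - 1
    if others == 0 or len(hops) - 1 < others:
        return 0.0  # sketch-level choice: score unreachable cases as 0
    return others / sum(hops.values())

# Toy chain of five alpha-carbons spaced 5 units apart (hypothetical coordinates).
coords = [(5.0 * k, 0.0, 0.0) for k in range(5)]
adj = contact_graph(coords)
scores = {i: closeness_centrality(adj, i) for i in adj}
best = max(scores, key=scores.get)
```

In this toy chain the middle residue scores highest, which matches the intuition that closeness centrality favours positions near the core of the contact network.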
Joint Evolutionary Trees: A Large-Scale Method To Predict Protein Interfaces Based on Sequence Sampling
The Joint Evolutionary Trees (JET) method detects protein interfaces, the core
residues involved in the folding process, and residues susceptible to
site-directed mutagenesis and relevant to molecular recognition. The approach,
based on the Evolutionary Trace (ET) method, introduces a novel way to treat
evolutionary information. Families of homologous sequences are analyzed through
a Gibbs-like sampling of distance trees to reduce effects of erroneous multiple
alignment and impacts of weakly homologous sequences on distance tree
construction. The sampling method makes sequence analysis more sensitive to
functional and structural importance of individual residues by avoiding effects
of the overrepresentation of highly homologous sequences and improves
computational efficiency. A carefully designed clustering method is parametrized
on the target structure to detect and extend patches on protein surfaces into
predicted interaction sites. Clustering takes into account residues'
physical-chemical properties as well as conservation. Large-scale application of
JET requires the system to be adjustable for different datasets and to guarantee
predictions even if the signal is low. Flexibility was achieved by a careful
treatment of the number of retrieved sequences, the amino acid distance between
sequences, and the selective thresholds for cluster identification. An iterative
version of JET (iJET) that guarantees finding the most likely interface residues
is proposed as the appropriate tool for large-scale predictions. Tests are
carried out on the Huang database of 62 heterodimer, homodimer, and transient
complexes and on 265 interfaces belonging to signal transduction proteins,
enzymes, inhibitors, antibodies, antigens, and others. A specific set of
proteins chosen for their special functional and structural properties
illustrate JET behavior on a large variety of interactions covering proteins,
ligands, DNA, and RNA. JET is compared at a large scale to ET and to Consurf,
Rate4Site, siteFiNDER|3D, and SCORECONS on specific structures. A significant
improvement in performance and computational efficiency is shown.
Recruitment of rare 3-grams at functional sites: Is this a mechanism for increasing enzyme specificity?
Background: A wealth of unannotated and functionally unknown protein sequences has accumulated in recent years with rapid progress in sequence genomics, giving rise to ever increasing demands for developing methods to efficiently assess functional sites. Sequence and structure conservation have traditionally been the major criteria adopted in various algorithms to identify functional sites. Here, we focus on the distributions of the 20^3 different types of 3-grams (or triplets of sequentially contiguous amino acids) in the entire space of sequences accumulated to date in the UniProt database, and focus in particular on the rare 3-grams distinguished by their high entropy-based information content. Results: Comparison of the UniProt distributions with those observed near/at the active sites on a non-redundant dataset of 59 enzyme/ligand complexes shows that the active sites preferentially recruit 3-grams distinguished by their low frequency in UniProt. Three cases, Src kinase, hemoglobin, and tyrosyl-tRNA synthetase, are discussed in detail to illustrate the biological significance of the results. Conclusion: The results suggest that recruitment of rare 3-grams may be an efficient mechanism for increasing specificity at functional sites. Rareness/scarcity emerges as a feature that may assist in identifying key sites for protein function, providing information complementary to that derived from sequence alignments. In addition, it provides us (for the first time) with a means of identifying potentially functional sites from sequence information alone, when sequence conservation properties are not available.
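As a rough illustration of how 3-gram rarity might be quantified, the sketch below counts overlapping 3-grams across a set of sequences and scores each by its self-information, -log2(frequency). The abstract does not give the exact entropy-based measure, so this formula is an assumption, and the function names and toy sequences are hypothetical.

```python
from collections import Counter
from math import log2

def count_3grams(sequences):
    """Count overlapping triplets of sequentially contiguous residues."""
    counts = Counter()
    for seq in sequences:
        for k in range(len(seq) - 2):
            counts[seq[k:k + 3]] += 1
    return counts

def information_content(counts):
    """Self-information in bits of each 3-gram: -log2(relative frequency).
    Assumed stand-in for the 'entropy-based information content' in the abstract."""
    total = sum(counts.values())
    return {gram: -log2(c / total) for gram, c in counts.items()}

# Toy corpus (hypothetical); a real analysis would use the UniProt sequences.
counts = count_3grams(["AAAA", "AAAC"])
info = information_content(counts)
rarest = max(info, key=info.get)
```

Here `AAC` occurs once out of four 3-grams and so carries 2 bits, more than the common `AAA`; rare 3-grams are those with the highest scores.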