36 research outputs found

    Identifying the oncogenic potential of gene fusions exploiting miRNAs

    It is estimated that oncogenic gene fusions cause about 20% of human cancer morbidity. Identifying potentially oncogenic gene fusions may improve the diagnosis and treatment of affected patients. Previous approaches to this issue exploited specific gene-related information, such as gene function and regulation. Here we propose a model that builds on these findings and adds microRNAs to the oncogenic assessment. We present ChimerDriver, a tool to classify gene fusions as oncogenic or not oncogenic. ChimerDriver is based on a specifically designed neural network trained on genetic and post-transcriptional information to obtain a reliable classification. The network integrates information on transcription factors, gene ontologies, microRNAs, the functions of the genes involved in the fusion, and the gene fusion structure. On the test set, the model reached an F1-score of 0.83 and a recall of 96%. The comparison with state-of-the-art tools returned comparable or higher results. Moreover, ChimerDriver performed well in a real-world case where 21 out of 24 validated gene fusion samples were detected by the gene fusion detection tool Starfusion. ChimerDriver integrates transcriptional and post-transcriptional information in an ad-hoc designed neural network to effectively discriminate oncogenic gene fusions from passenger ones. ChimerDriver source code is freely available at https://github.com/martalovino/ChimerDriver
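The abstract reports an F1-score and a recall but no precision. As a rough illustration only, the sketch below solves the F1 definition for precision. It assumes both figures refer to the same class (oncogenic) on the same test set, which the abstract does not state; the function names are hypothetical and not part of ChimerDriver.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision exists for these values")
    return f1 * recall / denom


# Values quoted in the abstract; the resulting precision is an inference,
# not a number reported by the authors.
p = implied_precision(0.83, 0.96)
print(round(p, 3))  # 0.731
```

Under that assumption, a recall of 96% with an F1-score of 0.83 would correspond to a precision of roughly 0.73, i.e. a classifier tuned to miss few oncogenic fusions at the cost of some false positives.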

    A survey on data integration for multi-omics sample clustering

    Due to the current high availability of omics data, data-driven biology has greatly expanded, and several papers have reviewed state-of-the-art technologies. Two main types of investigation are available for a multi-omics dataset: extraction of relevant features for a meaningful biological interpretation, and clustering of the samples. For the latter, some reviews refer to outdated or no longer available methods, whereas others do not describe the clustering metrics needed to compare the main approaches. This work provides a general overview of the major techniques in this area, divided into four groups: graph, dimensionality reduction, statistical and neural-based. In addition, eight tools have been tested on both a synthetic and a real biological dataset. An extensive performance comparison is provided using four clustering evaluation scores: Peak Signal-to-Noise Ratio (PSNR), Davies-Bouldin (DB) index, Silhouette value and the harmonic mean of cluster purity and efficiency. The best results were obtained by using dimensionality reduction, either explicitly or implicitly, as in the neural architecture.
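To make one of the evaluation scores named above concrete, here is a minimal standard-library sketch of the Davies-Bouldin index (lower is better). It follows the usual textbook definition with Euclidean distances and is not taken from the survey's own code; the toy clusters are invented for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def davies_bouldin(clusters):
    """clusters: list of clusters, each a non-empty list of coordinate tuples."""
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    centers = [centroid(c) for c in clusters]
    # Scatter of a cluster: mean distance of its members to its centroid.
    scatter = [
        sum(dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    total = 0.0
    for i in range(len(clusters)):
        # Worst-case similarity of cluster i to any other cluster.
        # Note: raises ZeroDivisionError if two centroids coincide.
        total += max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in range(len(clusters))
            if j != i
        )
    return total / len(clusters)


# Toy data: two tight, well-separated clusters.
toy = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
db = davies_bouldin(toy)
print(db)
```

In the toy case each cluster has scatter 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2.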

    SARS-CoV-2 variants classification and characterization

    Since late 2019, the SARS-CoV-2 virus has spread globally, giving rise to several variants over time. These variants differ from the original sequence identified in Wuhan and thus risk compromising the efficacy of the vaccines developed. Some software has been released to recognize currently known and newly spread variants. However, some of these tools are not entirely automatic, while others do not return a detailed characterization of all the mutations in the samples. Such a characterization can help biologists understand the variability between samples. This paper presents a Machine Learning (ML) approach to identifying existing and new variants completely automatically. In addition, a detailed table showing all the alterations and mutations found in the samples is provided as output to the user. SARS-CoV-2 sequences are obtained from the GISAID database, and a list of features is custom designed (e.g., number of mutations in each gene of the virus) to train the algorithm. Existing variants are recognized by a Random Forest classifier, while newly spread variants are identified by the DBSCAN algorithm. Both Random Forest and DBSCAN techniques demonstrated high precision on a new variant that arose during the drafting of this paper (used only in the testing phase of the algorithm). Therefore, researchers will significantly benefit from the proposed algorithm and the detailed output with the main alterations of the samples. Data availability: the tool is freely available at https://github.com/sofiaborgato/-SARS-CoV-2-variants-classification-and-characterization
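The abstract gives "number of mutations in each gene" as an example feature. The sketch below shows one way such a feature could be computed; it is an assumption-laden illustration, not the authors' pipeline. It assumes the sample is already aligned to the reference with no insertions or deletions, treats `N` as an ambiguous base rather than a mutation, and uses hypothetical toy gene coordinates and sequences.

```python
def mutations_per_gene(reference, sample, genes):
    """Count mismatching positions per gene.

    genes maps a gene name to a half-open (start, end) interval,
    0-based, on the aligned reference (hypothetical coordinates here).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    counts = {}
    for name, (start, end) in genes.items():
        counts[name] = sum(
            1
            for ref_base, smp_base in zip(reference[start:end], sample[start:end])
            if ref_base != smp_base and smp_base != "N"  # skip ambiguous calls
        )
    return counts


# Toy example: invented sequences and gene boundaries.
TOY_GENES = {"geneA": (0, 6), "geneB": (6, 12)}
reference = "ATGGCCTTTAAA"
sample = "ATGGTCTTNAAC"

features = mutations_per_gene(reference, sample, TOY_GENES)
print(features)  # {'geneA': 1, 'geneB': 1}
```

A vector of such per-gene counts is the kind of input a classifier such as Random Forest, or a density-based clustering such as DBSCAN, could then consume.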

    Phenolic composition of red grapes grown in Southern Italy

    The phenolic composition of red grapes native to Southern Italy (Aglianico, Carignano, Frappato, Gaglioppo, Negro Amaro, Nero d'Avola, Primitivo, Tintilia, and Uva di Troia) and of an "international" grape (Cabernet Sauvignon) introduced into the Apulia region was investigated. Results showed that these cultivars could be divided into two groups on the basis of both their anthocyanin content and the presence of ortho-hydroxylated groups. Further differences concerned the ratio between flavans reacting with vanillin and proanthocyanidins. The anthocyanin profile of the skin of Negro Amaro, Primitivo and Uva di Troia grapes was found to be a specific characteristic of the grape variety, affected only slightly by the place of growing. The different phenolic composition of the cultivars determines a different aptitude for wine production. The Cabernet Sauvignon grapes, due to their high concentration of polyphenolic substances, could be added to the native grape varieties in order to produce wines with a more complex aroma.

    Interannual-to-multidecadal hydroclimate variability and its sectoral impacts in northeastern Argentina

    This study examines the joint variability of precipitation, river streamflow and temperature over northeastern Argentina; advances the understanding of their links with global SST forcing; and discusses their impacts on water resources, agriculture and human settlements. The leading patterns of variability, and their nonlinear trends and cycles, are identified by means of a principal component analysis (PCA) complemented with a singular spectrum analysis (SSA). Interannual hydroclimatic variability centers on two broad frequency bands: one of 2.5–6.5 years corresponding to El Niño Southern Oscillation (ENSO) periodicities and the second of about 9 years. The higher frequencies of the precipitation variability (2.5–4 years) favored extreme events after 2000, even during moderate extreme phases of the ENSO. Minimum temperature is correlated with ENSO with a main periodicity close to 3 years. Maximum temperature time series correlate well with SST variability over the South Atlantic, Indian and Pacific oceans with a 9-year periodicity. Interdecadal variability is characterized by low-frequency trends and multidecadal oscillations that have induced a transition from a drier and cooler climate to wetter and warmer decades starting in the mid-twentieth century. The Paraná River streamflow is influenced by North and South Atlantic SSTs with bidecadal periodicities. The hydroclimate variability at all timescales had significant sectoral impacts. Frequent wet events between 1970 and 2005 favored floods that affected agricultural and livestock productivity and forced population displacements. On the other hand, agricultural droughts resulted in soil moisture deficits that affected crops at critical growth stages. Hydrological droughts affected surface water resources, causing water and food scarcity and stressing the capacity for hydropower generation. Lastly, increases in minimum temperature reduced wheat and barley yields.

    Enhancing PFI Prediction with GDS-MIL: A Graph-Based Dual Stream MIL Approach

    Whole-Slide Images (WSIs) are emerging as a promising resource for studying biological tissues, demonstrating great potential in aiding cancer diagnosis and improving patient treatment. However, the manual pixel-level annotation of WSIs is extremely time-consuming and practically unfeasible in real-world scenarios. Multi-Instance Learning (MIL) has gained attention as a weakly supervised approach able to address tasks that lack annotations. MIL models aggregate patches (e.g., crops of a WSI) into bag-level representations (e.g., a WSI label), but neglect the spatial information of the WSIs, which is crucial for histological analysis. In the High-Grade Serous Ovarian Cancer (HGSOC) context, spatial information is essential to predict a prognosis indicator (the Platinum-Free Interval, PFI) from WSIs. Such a prediction would bring highly valuable insights both for patient treatment and for the prognosis of chemotherapy resistance. However, NeoAdjuvant ChemoTherapy (NACT) induces changes in tumor tissue morphology and composition, making the prediction of PFI from WSIs extremely challenging. In this paper, we propose GDS-MIL, a method that integrates a state-of-the-art MIL model with a Graph ATtention layer (GAT in short) to inject a local context into each instance before MIL aggregation. Our approach achieves a significant improvement in accuracy on the “Ome18” PFI dataset. In summary, this paper presents a novel solution for enhancing PFI prediction in HGSOC, with the potential of significantly improving treatment decisions and patient outcomes.