Learning Markov networks with context-specific independences
Learning the Markov network structure from data is a problem that has
received considerable attention in machine learning, and in many other
application fields. This work focuses on a particular approach for this purpose
called independence-based learning. This approach is guaranteed to learn
the correct structure efficiently whenever the data is sufficient to represent
the underlying distribution. However, an important limitation of this approach
is that the learned structures are encoded in an undirected graph. The problem
with graphs is that they cannot encode some types of independence relations,
such as context-specific independences. These are a particular case of
conditional independences that hold only for a certain assignment of the
conditioning set, in contrast to conditional independences, which must hold for
all of its assignments. In this work we present CSPC, an independence-based
algorithm that learns structures encoding context-specific independences
and represents them in a log-linear model instead of a graph. The central idea
of CSPC is combining the theoretical guarantees provided by the
independence-based approach with the benefits of representing complex
structures by using features in a log-linear model. We present experiments in a
synthetic case, showing that CSPC is more accurate than the state-of-the-art
independence-based (IB) algorithms when the underlying distribution contains
context-specific independences (CSIs). Comment: 8 pages, 6 figures
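As an illustration of the distinction drawn in the abstract above, the following is a minimal sketch (not taken from the paper) of a context-specific independence over three binary variables. The joint distribution, its probabilities, and the function names `joint` and `independent_given` are assumptions chosen for illustration only. X and Y are independent in the context Z = 0 but dependent in the context Z = 1, so a single undirected graph over X, Y, Z must keep the X-Y edge and cannot express the independence that holds when Z = 0.

```python
from itertools import product

# Hypothetical joint distribution P(x, y, z) over binary variables.
# All numbers are illustrative assumptions, not values from the paper.
def joint(x, y, z):
    p_z = 0.5  # P(Z = 0) = P(Z = 1) = 0.5
    if z == 0:
        # In the context Z = 0 the conditional factorises: P(x|z) * P(y|z).
        p_x = 0.3 if x == 1 else 0.7
        p_y = 0.6 if y == 1 else 0.4
        return p_z * p_x * p_y
    # In the context Z = 1, X and Y tend to agree, so they are dependent.
    return p_z * (0.4 if x == y else 0.1)

def independent_given(z, tol=1e-12):
    """Return True if P(x, y | z) == P(x | z) * P(y | z) for all x, y."""
    p_z = sum(joint(x, y, z) for x, y in product((0, 1), repeat=2))
    for x, y in product((0, 1), repeat=2):
        p_xy = joint(x, y, z) / p_z
        p_x = sum(joint(x, other, z) for other in (0, 1)) / p_z
        p_y = sum(joint(other, y, z) for other in (0, 1)) / p_z
        if abs(p_xy - p_x * p_y) > tol:
            return False
    return True

if __name__ == "__main__":
    for z in (0, 1):
        print(f"X independent of Y given Z={z}: {independent_given(z)}")
```

An ordinary conditional independence of X and Y given Z would require `independent_given` to hold for every value of Z. Here it holds for only one of them, which is the kind of relation that features in a log-linear model can represent.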
Learning context-specific independences in Markov random fields
Undirected models, or Markov random fields, are widely used for problems that learn an unknown distribution from a dataset. This is because they can represent a distribution efficiently by making explicit the conditional independences that may exist among its variables. Besides these independences, it is possible to represent others, the context-specific independences (CSIs), which, unlike the former, hold only under certain values taken by subsets of the variables. Because of this, they are difficult to represent and to learn from data. In this work we present an approach for representing CSIs in undirected models, and an algorithm that learns them from data using statistical tests. We show results in which the models learned by our algorithm are better than or comparable to models learned by other algorithms that do not use CSIs. Presented at the XII Workshop Agentes y Sistemas Inteligentes (WASI). Red de Universidades con Carreras en Informática (RedUNCI).
Male sterility and somatic hybridization in plant breeding
Plant male sterility refers to the failure in the production of fertile pollen. It occurs spontaneously in natural populations and may be caused by genes encoded in the nuclear (genic male sterility; GMS) or mitochondrial (cytoplasmic male sterility; CMS) genomes. This feature has great agronomic value for the production of hybrid seeds, since it prevents self-pollination without the need for emasculation, which is time-consuming and cost-intensive. CMS has been widely used in crops, such as corn, rice, wheat, citrus, and several species of the family Solanaceae. Mitochondrial genes determining CMS have been uncovered in a wide range of plant species. The modes of action of CMS have been classified in terms of the effect they produce in the cell, which ultimately leads to a failure in the production of fertile pollen. Male fertility can be restored by nuclear-encoded genes, termed restorer-of-fertility (Rf) factors. CMS from wild plants has been transferred to species of agronomic interest through somatic hybridization. Somatic hybrids have also been produced to generate CMS de novo upon recombination of the mitochondrial genomes of two parental plants or by separating the CMS cytoplasm from the nuclear Rf alleles. As a result, somatic hybridization can be used as a highly efficient and useful strategy to incorporate CMS in breeding programs.
Fil: Garcia, Laura Evangelina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina. Universidad Nacional de Cuyo; Argentina
Fil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina
Fil: Marfil, Carlos Federico. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina
Fil: Sánchez Puerta, María Virginia. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina
The IBMAP approach for Markov networks structure learning
In this work we consider the problem of learning the structure of Markov
networks from data. We present an approach for tackling this problem called
IBMAP, together with an efficient instantiation of the approach: the IBMAP-HC
algorithm, designed for avoiding important limitations of existing
independence-based algorithms. These algorithms proceed by performing
statistical independence tests on data, completely trusting the outcome of each
test. In practice, tests may be incorrect, resulting in potential cascading
errors and a consequent reduction in the quality of the structures learned.
IBMAP accounts for this uncertainty in the outcome of the tests through a
probabilistic maximum-a-posteriori approach. The approach is instantiated in
the IBMAP-HC algorithm, a structure selection strategy that performs a
polynomial heuristic local search in the space of possible structures. We
present an extensive empirical evaluation on synthetic and real data, showing
that our algorithm significantly outperforms the current independence-based
algorithms in terms of data efficiency and quality of learned structures, with
equivalent computational complexity. We also show the performance of IBMAP-HC
in a real-world application of knowledge discovery: estimation of distribution
algorithms (EDAs), which are evolutionary algorithms that use structure learning
in each generation for modeling the distribution of populations. The experiments
show that when IBMAP-HC is used to learn the structure, EDAs improve their
convergence to the optimum.
Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks
A representation method is an algorithm that calculates numerical feature vectors for samples in a dataset. Such vectors, also known as embeddings, define a relatively low-dimensional space able to efficiently encode high-dimensional data. Very recently, many types of learned data representations based on machine learning have appeared and are being applied to several tasks in bioinformatics. In particular, protein representation learning methods integrate different types of protein information (sequence, domains, etc.), in supervised or unsupervised learning approaches, and provide embeddings of protein sequences that can be used for downstream tasks. One task of special interest is the automatic function prediction of the huge number of novel proteins that are being discovered nowadays and are still totally uncharacterized. However, despite its importance, to date there is no fair benchmark study of the predictive performance of existing proposals on the same large set of proteins and for very concrete and common bioinformatics tasks. This lack of benchmark studies prevents the community from using adequate predictive methods for accelerating the functional characterization of proteins. In this study, we performed a detailed comparison of protein sequence representation learning methods, explaining each approach and comparing them with an experimental benchmark on several bioinformatics tasks: (i) determining protein sequence similarity in the embedding space; (ii) inferring protein domains; and (iii) predicting ontology-based protein functions. We examine the advantages and disadvantages of each representation approach over the benchmark results. We hope the results and the discussion of this study can help the community to select the most adequate machine learning-based technique for protein representation according to the bioinformatics task at hand.
Fil: Fenoy, Luis Emilio. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
Fil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
Fil: Stegmayer, Georgina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
Anc2vec: Embedding gene ontology terms by preserving ancestors relationships
The gene ontology (GO) provides a hierarchical structure with a controlled vocabulary composed of terms describing functions and localization of gene products. Recent works propose vector representations, also known as embeddings, of GO terms that capture meaningful information about them. Significant performance improvements have been observed when these representations are used on diverse downstream tasks, such as the measurement of semantic similarity between GO terms and functional similarity between proteins. Despite the success shown by these approaches, existing embeddings of GO terms still fail to capture crucial structural features of the GO. Here, we present anc2vec, a novel protocol based on neural networks for constructing vector representations of GO terms by preserving three important ontological features: ontological uniqueness, ancestor hierarchy, and sub-ontology membership. The advantages of using anc2vec are demonstrated by systematic experiments on diverse tasks: visualization, sub-ontology prediction, inference of structurally related terms, retrieval of terms from aggregated embeddings, and prediction of protein-protein interactions. In these tasks, experimental results show that the performance of anc2vec representations is better than that of recent approaches. This demonstrates that embeddings can achieve higher performance on diverse tasks when the structure of the GO is better represented. Full source code and data are available at https://github.com/sinc-lab/anc2vec.
Fil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
Fil: Milone, Diego Humberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
Fil: Stegmayer, Georgina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
Simultaneous Profiling of Chromatin Accessibility and DNA Methylation in Complete Plant Genomes Using Long-Read Sequencing
Epigenetic regulations, including chromatin accessibility, nucleosome positioning, and DNA methylation, intricately shape genome function. However, current chromatin profiling techniques relying on short-read sequencing technologies face limitations in adequately characterising repetitive genomic regions and detecting multiple chromatin features simultaneously. Here, we present Simultaneous Accessibility and DNA Methylation Sequencing (SAM-seq), a robust method leveraging bacterial adenine methyltransferases (m6A-MTases) to label accessible regions in purified plant nuclei. Coupled with Oxford Nanopore Technology sequencing, SAM-seq enables high-resolution profiling of m6A-tagged chromatin accessibility together with cytosine methylation along chromatin fibres in plants. Analysis of naked genomic DNA revealed significant sequence preference biases of m6A-MTases, controllable through a normalisation step. By applying SAM-seq to Arabidopsis and maize nuclei we obtained fine-grained accessibility and DNA methylation landscapes at genome-wide and local scales. We characterised crosstalk between chromatin accessibility and DNA methylation, notably within nucleosomes of genes, TEs, and centromeric repeats. SAM-seq also facilitated the identification of DNA footprints over cis-regulatory regions. Furthermore, using the single-molecule information provided by SAM-seq we unveiled extensive cellular heterogeneity at chromatin domains harbouring antagonistic chromatin marks, suggesting that bivalency reflects cell-specific regulations of gene activity. In summary, we introduce a robust method for acquiring high-resolution accessibility and DNA methylation landscapes across entire plant genomes. Our results underscore the importance of considering the intrinsic substrate preferences of m6A-MTases for reliable chromatin profiling. SAM-seq opens new opportunities to simultaneously study multiple epigenetic features at unprecedented scale, enabling the investigation of non-model species with limited genomic and epigenomic information.
Learning Markov Network Structures Constrained by Context-Specific Independences
This work focuses on learning the structure of Markov networks from data. Markov networks are parametric models for compactly representing complex probability distributions. These models are composed of a structure and numerical weights, where the structure describes independences that hold in the distribution. Depending on the goal of structure learning, learning algorithms can be divided into density estimation algorithms, where structure is learned for answering inference queries, and knowledge discovery algorithms, where structure is learned for describing independences qualitatively. The latter algorithms have an important limitation for describing independences because they use a single graph, a coarse-grained structure representation that cannot represent flexible independences. For instance, context-specific independences cannot be described by a single graph. To overcome this limitation, this work proposes a new alternative representation, named the canonical model, as well as the CSPC algorithm, a novel knowledge discovery algorithm for learning canonical models by using context-specific independences as constraints. In an extensive empirical evaluation, CSPC learns more accurate structures than state-of-the-art density estimation and knowledge discovery algorithms. Moreover, for answering inference queries, our approach obtains competitive results against density estimation algorithms, significantly outperforming knowledge discovery algorithms.
Fil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; Argentina. Universidad Tecnológica Nacional. Facultad Regional Mendoza. Departamento de Sistemas de Información; Argentina
Fil: Schluter, Federico Enrique Adolfo. Universidad Tecnológica Nacional. Facultad Regional Mendoza. Departamento de Sistemas de Información; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; Argentina
Fil: Bromberg, Facundo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; Argentina. Universidad Tecnológica Nacional. Facultad Regional Mendoza. Departamento de Sistemas de Información; Argentina
Simultaneous profiling of chromatin accessibility and DNA methylation in complete plant genomes using long-read sequencing
Epigenetic regulations, including chromatin accessibility, nucleosome positioning, and DNA methylation, intricately shape genome function. However, current chromatin profiling techniques relying on short-read sequencing technologies fail to characterise highly repetitive genomic regions and cannot detect multiple chromatin features simultaneously. Here, we performed Simultaneous Accessibility and DNA Methylation Sequencing (SAM-seq) of purified plant nuclei. Thanks to the use of long-read nanopore sequencing, SAM-seq enables high-resolution profiling of m6A-tagged chromatin accessibility together with endogenous cytosine methylation in plants. Analysis of naked genomic DNA revealed significant sequence preference biases of m6A-MTases, controllable through a normalisation step. By applying SAM-seq to Arabidopsis and maize nuclei we obtained fine-grained accessibility and DNA methylation landscapes genome-wide. We uncovered crosstalk between chromatin accessibility and DNA methylation within nucleosomes of genes, TEs, and centromeric repeats. SAM-seq also detects DNA footprints over cis-regulatory regions. Furthermore, using the single-molecule information provided by SAM-seq we identified extensive cellular heterogeneity at chromatin domains with antagonistic chromatin marks, suggesting that bivalency reflects cell-specific regulations. SAM-seq is a powerful approach to simultaneously study multiple epigenetic features over unique and repetitive sequences, opening new opportunities for the investigation of epigenetic mechanisms.