35 research outputs found

    Improving the translation environment for professional translators

    When using computer-aided translation systems in a typical professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological one. This paper describes the SCATE research on improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
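    As a rough illustration of the fuzzy matching this abstract refers to (and not SCATE's actual algorithm), the sketch below ranks translation-memory entries by a character-level similarity ratio; the sample sentences and threshold are invented.

```python
# Minimal sketch of baseline fuzzy matching against a translation memory.
# This is NOT the SCATE algorithm; it only illustrates ranking TM entries
# by surface similarity. Sentences and threshold are invented.
from difflib import SequenceMatcher

translation_memory = {
    "The printer is out of paper.": "De printer heeft geen papier meer.",
    "Please restart the application.": "Start de toepassing opnieuw op.",
    "The device is out of memory.": "Het apparaat heeft geen geheugen meer.",
}

def fuzzy_matches(source, tm, threshold=0.6):
    """Return TM entries whose source side is similar to `source`, best first."""
    scored = []
    for tm_source, tm_target in tm.items():
        score = SequenceMatcher(None, source.lower(), tm_source.lower()).ratio()
        if score >= threshold:
            scored.append((score, tm_source, tm_target))
    return sorted(scored, reverse=True)

for score, src, tgt in fuzzy_matches("The printer is out of memory.", translation_memory):
    print(f"{score:.2f}  {src}  ->  {tgt}")
```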

    A Survey of Machine Learning for Big Code and Naturalness

    Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages with natural languages and discuss how their similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. We then review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities. A website accompanying this survey can be found at https://ml4code.github.io
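    To illustrate the "naturalness" idea the survey builds on, here is a minimal sketch of a bigram language model over code tokens; the toy corpus and tokenizer are invented and far simpler than the models the survey covers.

```python
# Toy illustration of the "naturalness" hypothesis: a bigram language model
# over code tokens. The tokenizer and corpus are invented for the example;
# real models in the survey are far richer.
import re
from collections import Counter, defaultdict

CORPUS = [
    "for i in range ( n ) :",
    "for j in range ( m ) :",
    "if i in seen :",
]

def tokenize(line):
    return re.findall(r"\w+|[^\s\w]", line)

bigrams = defaultdict(Counter)
for line in CORPUS:
    tokens = ["<s>"] + tokenize(line)
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[prev][cur] += 1

def predict_next(prev_token):
    """Most likely next tokens after `prev_token`, by relative frequency."""
    counts = bigrams[prev_token]
    total = sum(counts.values()) or 1
    return [(tok, c / total) for tok, c in counts.most_common(3)]

print(predict_next("in"))  # 'range' ranks above 'seen'
```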

    Query-Time Data Integration

    Today, data is collected at ever increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources preclude up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear whether and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established. This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated by answering those queries with ranked lists of alternative results. Each result is then based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need. To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state of the art by producing a set of individually consistent but mutually diverse alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality when compared to using separate systems for the two tasks. Finally, we study the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort. Collectively, these three contributions are the foundation of a Query-time Data Integration architecture that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
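    The sketch below illustrates the flavor of top-k entity augmentation described above, using a greedy heuristic that covers the query entities with few sources and favors previously unused sources across alternatives; it is not the thesis's actual algorithm, and the entities, attribute values, and source names are invented.

```python
# Hedged sketch of top-k entity augmentation in the spirit the abstract describes:
# cover a set of query entities with values from as few sources as possible, and
# build k alternative coverings that prefer not to reuse previously chosen sources
# (diversity). This greedy heuristic is illustrative, not the thesis's actual
# augmentation algorithm; entities and sources are invented.

ENTITIES = {"Germany", "France", "Italy", "Spain"}

# source name -> {entity: value for the requested attribute, e.g. "population"}
SOURCES = {
    "web_table_A": {"Germany": 83.2, "France": 67.8},
    "web_table_B": {"Italy": 58.9, "Spain": 47.4},
    "web_table_C": {"Germany": 83.0, "Italy": 59.0, "Spain": 47.5},
    "web_table_D": {"France": 68.0, "Spain": 47.6},
}

def augment_top_k(entities, sources, k=3):
    alternatives, used_before = [], set()
    for _ in range(k):
        uncovered, chosen, values = set(entities), [], {}
        while uncovered:
            # prefer sources covering many uncovered entities; break ties toward unused sources
            best = max(
                sources,
                key=lambda s: (len(uncovered & sources[s].keys()), s not in used_before),
            )
            gain = uncovered & sources[best].keys()
            if not gain:
                break  # remaining entities cannot be covered by any source
            chosen.append(best)
            values.update({e: sources[best][e] for e in gain})
            uncovered -= gain
        alternatives.append({"sources": chosen, "values": values})
        used_before.update(chosen)
    return alternatives

for i, alt in enumerate(augment_top_k(ENTITIES, SOURCES), 1):
    print(i, alt["sources"], alt["values"])
```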

    Exploratory visualizations and statistical analysis of large, heterogeneous epigenetic datasets

    Epigenetic marks, such as DNA methylation and histone modifications, are important regulatory mechanisms that allow a single genomic sequence to give rise to a complex multicellular organism. When studying mechanisms of epigenetic regulation, the analyses depend on the experimental technologies and the available data. Recent advancements in sequencing technologies allow for the efficient extraction of genome-wide maps of epigenetic marks. A number of large-scale mapping projects, such as ENCODE and IHEC, intensively produce data for different tissues and cell cultures. The increasing quantity of data highlights a major bottleneck in bioinformatic research, namely the lack of bioinformatic tools for analyzing these data. To date, there are bioinformatics tools for detailed (mostly visual) inspection of single genomic loci, allowing biologists to focus research on regions of interest. Efficient tools for manipulating and analyzing the data have also been published, but they often require computer-science skills. Furthermore, the available tools provide solutions only to already well-formulated biological questions. What is missing, in our opinion, are tools (or pipelines of tools) for exploring the data interactively, in a process that helps a trained biologist recognize interesting aspects and pursue them until concrete hypotheses are formulated. A possible solution stems from best practices in the fields of information retrieval and exploratory search. In this thesis, I propose EpiExplorer, a paradigm for the integration of state-of-the-art information retrieval methods and indexing structures, applied to offer instant interactive exploration of large epigenetic datasets. The algorithms we use were developed for semi-structured text data, but we apply them to bioinformatic data through a clever textual mapping of biological properties. We demonstrate the power of EpiExplorer in a series of studies that address interesting biological problems. We also present EpiGRAPH, a bioinformatic tool that we developed with colleagues. EpiGRAPH helps identify and model significant biological associations among epigenetic and genetic properties for sets of regions. Using EpiExplorer and EpiGRAPH, independently or in a pipeline, provides the bioinformatic community with access to large databases of annotations, allows for exploratory visualization and statistical analysis, and facilitates the reproduction and sharing of results.
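    A minimal sketch of the central idea, assuming invented tokens, regions, and thresholds (this is not EpiExplorer's implementation): discretize region properties into text-like tokens and answer conjunctive filters from an inverted index, as text search engines do.

```python
# Hedged sketch of the textual mapping described in the abstract: biological
# properties of genomic regions become search-engine-style tokens, and an
# inverted index supports instant filtering. Tokens, regions, and thresholds
# are invented; this is not EpiExplorer's implementation.
from collections import defaultdict

# each region: (chrom, start, end, annotations)
REGIONS = [
    ("chr1", 10_000, 10_500, {"cpg_island": True,  "h3k4me3": 0.8, "overlaps": "promoter"}),
    ("chr1", 55_000, 55_400, {"cpg_island": False, "h3k4me3": 0.1, "overlaps": "intergenic"}),
    ("chr2", 7_200,  7_900,  {"cpg_island": True,  "h3k4me3": 0.6, "overlaps": "promoter"}),
]

def region_to_tokens(annotations):
    """Discretize region properties into textual tokens."""
    tokens = []
    if annotations["cpg_island"]:
        tokens.append("cpg_island")
    tokens.append("h3k4me3_high" if annotations["h3k4me3"] >= 0.5 else "h3k4me3_low")
    tokens.append(f"overlaps_{annotations['overlaps']}")
    return tokens

# inverted index: token -> set of region ids
index = defaultdict(set)
for region_id, (_, _, _, ann) in enumerate(REGIONS):
    for token in region_to_tokens(ann):
        index[token].add(region_id)

def query(*tokens):
    """Regions carrying all requested tokens (conjunctive filter)."""
    result = set(range(len(REGIONS)))
    for token in tokens:
        result &= index[token]
    return [REGIONS[i][:3] for i in sorted(result)]

print(query("cpg_island", "h3k4me3_high"))  # both CpG-island promoter regions
```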

    Maximizing Insight from Modern Economic Analysis

    The last decade has seen a growing trend of economists exploring how to extract economic insight from "big data" sources such as the Web. As economists move towards this model of analysis, their traditional workflow starts to become infeasible. The amount of noisy data from which to draw insights presents data management challenges for economists and limits their ability to discover meaningful information. As a result, economists need to invest a great deal of energy in training to be data scientists (a catch-all role that has grown to describe the use of statistics, data mining, and data management in the big data age), leaving little time for applying their domain knowledge to the problem at hand. We envision an ideal workflow in which accurate and reliable results are generated in near-interactive time and systems handle the "heavy lifting" required for working with big data. This dissertation presents several systems and methodologies that bring economists closer to this ideal workflow, helping them address many of the challenges faced in transitioning to working with big data sources like the Web. To help users generate accurate and reliable results, we present approaches to identifying relevant predictors in nowcasting applications, as well as methods for identifying potentially invalid nowcasting models and their inputs. We show how a streamlined workflow, combined with pruning and shared computation, can help handle the heavy lifting of big data analysis, allowing users to generate results in near-interactive time. We also present a novel user model and architecture for helping users avoid undesirable bias during data preparation: users interactively define constraints for transformation code and the data that the code produces, and an explain-and-repair system satisfies these constraints as best it can, providing an explanation for any problems along the way. Together, these systems represent a unified effort to streamline the transition for economists to this new big data workflow.
    PhD thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144007/1/dol_1.pd
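    As a hedged illustration of one step mentioned above, identifying relevant predictors for nowcasting, the sketch below screens synthetic candidate series by correlation with the target and fits a simple linear nowcast; it is not the dissertation's method, and the series names and data are invented.

```python
# Illustrative sketch (not the dissertation's method) of predictor screening
# for nowcasting: rank candidate web-data series by correlation with an
# economic indicator, then fit a simple linear nowcast. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
T = 60  # months of history

target = np.cumsum(rng.normal(size=T))  # e.g. an unemployment index
candidates = {
    "search_volume_jobs": target + rng.normal(scale=0.5, size=T),    # informative
    "search_volume_cats": rng.normal(size=T),                        # noise
    "search_volume_resume": 0.7 * target + rng.normal(scale=1.0, size=T),
}

def screen_predictors(y, series, top_n=2):
    """Rank candidate series by absolute correlation with the target."""
    scores = {name: abs(np.corrcoef(y, x)[0, 1]) for name, x in series.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

chosen = screen_predictors(target, candidates)
X = np.column_stack([candidates[name] for name in chosen] + [np.ones(T)])
beta, *_ = np.linalg.lstsq(X, target, rcond=None)
print("selected predictors:", chosen)
print("nowcast for latest period:", X[-1] @ beta)
```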

    From Nose to Lung: Using Systems Biology to Fight Pathogens in the Human Respiratory Tract

    In December 2019, the world was hit by the global SARS-CoV-2 pandemic. Two years later, the number of infections and deaths is still increasing, affecting everyday life. Although several vaccines have been successfully approved for SARS-CoV-2, therapeutic options are still limited. As viruses rely on the host’s metabolism for replication, analyzing the viral reprogramming of host cells might reveal potential antiviral targets. Such metabolic alterations can be evaluated and analyzed using genome-scale metabolic models (GEMs). These models represent large metabolic networks that connect metabolites with biochemical reactions facilitated by proteins and encoded by genes. With the help of genomic information, the so-called genotype, we can create metabolic models that predict the phenotypic behavior of an organism. Such GEMs can be used to analyze virus-host interactions, predict potential antiviral targets, and understand the genotype-phenotype relationship of pathogens and commensals such as Staphylococcus aureus. High-quality models with a high predictive value help us to better understand an organism, determine metabolic capabilities in health and disease, identify potential targets for treatment interventions, and analyze the interplay between different cells and organisms. Such models can answer relevant and urgent questions of our time quickly and efficiently and can become an indispensable constituent of future research. In my thesis, I demonstrate (I) how the quality and predictive value of an existing genome-scale metabolic model can be assessed, (II) how high-quality genome-scale metabolic models can be curated, and (III) how high-quality genome-scale metabolic models can be used for model-driven discoveries. All three points are addressed in the context of pathogens and commensals in the human respiratory tract. To assess the quality and predictive value of GEMs, we collected all currently available models of the pathogen Staphylococcus aureus, which colonizes the human nose. We evaluated the models with respect to their validity, compliance with the FAIR data principles, quality, simulatability, and predictive value. Using high-quality models with a high predictive value enables model-driven hypotheses and discoveries. If no such model is available, however, one needs to curate a high-quality model. For this purpose, we developed a pipeline that focuses on the model curation of nasal pathogens and commensals. This pipeline is adaptable to incorporate other tools and bacteria, pathogens, or cells while maintaining community standards. We demonstrated the applicability of this pipeline by curating the first model of the nasal commensal Dolosigranulum pigrum. Finally, we showed how to use high-quality GEMs for model-driven discoveries by identifying novel antiviral targets: to do so, we virtually infected human alveolar macrophages in the lung with SARS-CoV-2.
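    For readers unfamiliar with GEMs, the sketch below shows the kind of computation typically run on such models, flux balance analysis: maximize a biomass reaction subject to steady-state and flux-bound constraints. The three-metabolite network is invented and is not one of the thesis's curated models.

```python
# Toy flux balance analysis (FBA), the core computation behind genome-scale
# metabolic models: maximize a biomass reaction subject to steady state
# (S @ v = 0) and flux bounds. The 3-metabolite, 4-reaction network below is
# invented for illustration and is not one of the thesis's curated GEMs.
import numpy as np
from scipy.optimize import linprog

# reactions (columns): R_uptake, R_conv, R_biomass, R_secrete
# metabolites (rows): A, B, C
S = np.array([
    [ 1, -1,  0,  0],   # A: produced by uptake, consumed by conversion
    [ 0,  1, -1,  0],   # B: produced by conversion, consumed by biomass
    [ 0,  0,  1, -1],   # C: by-product of biomass, secreted
])

bounds = [(0, 10), (0, 1000), (0, 1000), (0, 1000)]  # uptake limited to 10
c = np.array([0, 0, -1, 0])  # linprog minimizes, so negate the biomass flux

res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds, method="highs")
print("max biomass flux:", -res.fun)   # 10.0, capped by the uptake bound
print("flux distribution:", res.x)
```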

    Polyglot software development

    The languages we choose for designing solutions influence the way we think about the problem, the words we use when discussing it with colleagues, and the processes we adopt in developing the software that should solve that problem. We should therefore strive to use the best possible language for depicting each facet of the system. To do that we have to address two challenges: i) understand the merits and issues brought by the languages we could adopt and their far-reaching effects on organizations, and ii) combine them wisely, trying to reduce the overhead of assembling them. In the first part of this dissertation we study the adoption of modeling and domain-specific languages. On the basis of an industrial survey we identify a list of benefits attainable through these languages, how frequently they can be reached, and which techniques improve the chances of obtaining a particular benefit. In the same way we also study the common problems which either prevent or hinder the adoption of these languages. We then analyze the processes through which these languages are employed, studying the relative frequency of use of the different techniques and the factors influencing it. Finally, we present two case studies, performed in a small and in a very large company, to highlight the peculiarities of adoption in different contexts. As a consequence of adopting specialized languages, many of them have to be employed together to represent the complete solution. The second part of the thesis therefore focuses on the integration of these languages. Because this topic is still new, we performed preliminary studies to first understand the phenomenon, examining the different ways in which languages interact and their effects on defectivity. We then present prototypal solutions for i) automatically spotting cross-language relations, and ii) designing language-integration tool support in language workbenches by exploiting a common meta-metamodel. This thesis aims to contribute towards the productive adoption of multiple, specialized languages in the same software development project, hence polyglot software development. From this approach we should be able to reduce the complexity due to misrepresentation of solutions, offer better facilities for thinking about problems, and, finally, solve more difficult problems with our limited cognitive resources. Our results consist of a better understanding of MDD and DSL adoption in companies. From that we derive guidelines for practitioners, lessons learned for deployment in companies of different sizes, and implications for other actors involved in the process: company management and universities. Regarding cross-language relations, our contribution is an initial definition of the problem, supported by empirical evidence of its importance. The solutions we propose are not yet mature, but we believe future work can stem from them.
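    As a hedged sketch of what automatic spotting of cross-language relations can look like (a simplification, not the prototype described above), the code below flags identifiers that occur in files written in different languages, since such shared names often encode implicit links; the file contents are invented.

```python
# Hedged sketch of spotting cross-language relations: report identifiers or
# string literals that occur in files of different languages (e.g. a column
# name used in both SQL and Java). This is a simplification, not the thesis's
# prototype; the file contents below are invented.
import re
from collections import defaultdict

FILES = {
    "UserDao.java": 'String sql = "SELECT user_name, email FROM users";',
    "schema.sql":   "CREATE TABLE users (user_name VARCHAR(64), email VARCHAR(128));",
    "report.py":    "print(row['email'])",
}

def language_of(path):
    return path.rsplit(".", 1)[-1]

def identifiers(text):
    """Crude lexer: every word-like token of length >= 3, lowercased."""
    return {tok.lower() for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_]{2,}", text)}

occurrences = defaultdict(set)  # identifier -> set of (file, language)
for path, content in FILES.items():
    for ident in identifiers(content):
        occurrences[ident].add((path, language_of(path)))

for ident, locs in sorted(occurrences.items()):
    if len({lang for _, lang in locs}) > 1:  # appears in more than one language
        print(f"cross-language relation on '{ident}':", sorted(locs))
```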
