
    Context-Aware Notebook Search in a Jupyter-Based Virtual Research Environment

    Computational notebook environments such as Jupyter play an increasingly important role in data-centric research for prototyping computational experiments, documenting code implementations, and sharing scientific results. Effectively discovering and reusing notebooks available on the web can reduce repetitive work and facilitate scientific innovation. However, general-purpose web search engines (e.g., Google Search) do not explicitly index the contents of notebooks, and notebook repositories (e.g., Kaggle and GitHub) require users to create domain-specific queries based on the metadata in the notebook catalogs, which fail to capture the working context in the notebook environment. This poster presents a Context-aware Notebook Search Framework (CANSF) to enable a researcher to seamlessly discover external notebooks based on the semantic context of literate programming activities in the Jupyter environment.
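
    The abstract does not spell out how CANSF matches a working context against external notebooks, so the following is only a minimal sketch of one plausible mechanism: build a query from the cells around the user's current position and rank candidate notebooks by TF-IDF cosine similarity. The helper names (build_context_query, rank_notebooks) and the cell/window representation are assumptions, not part of CANSF.

        # Illustrative sketch only; not CANSF's actual retrieval model.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def build_context_query(cells, cursor_index, window=2):
            """Concatenate the source of the cells around the user's working position."""
            lo = max(0, cursor_index - window)
            hi = min(len(cells), cursor_index + window + 1)
            return " ".join(cell["source"] for cell in cells[lo:hi])

        def rank_notebooks(query, candidate_texts):
            """Rank candidate notebook texts by cosine similarity to the working context."""
            vectorizer = TfidfVectorizer(stop_words="english")
            matrix = vectorizer.fit_transform([query] + candidate_texts)
            scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
            return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)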

    Doctor of Philosophy

    Serving as a record of what happened during a scientific process, often computational, provenance has become an important piece of computing. The importance of archiving not only data and results but also the lineage of these entities has led to a variety of systems that capture provenance as well as models and schemas for this information. Despite significant work focused on obtaining and modeling provenance, there has been little work on managing and using this information. Using the provenance from past work, it is possible to mine common computational structure or determine differences between executions. Such information can be used to suggest possible completions for partial workflows, summarize a set of approaches, or extend past work in new directions. These applications require infrastructure to support efficient queries and accessible reuse. In order to support knowledge discovery and reuse from provenance information, the management of those data is important. One component of provenance is the specification of the computations; workflows provide structured abstractions of code and are commonly used for complex tasks. Using change-based provenance, it is possible to store large numbers of similar workflows compactly. This storage also allows efficient computation of differences between specifications. However, querying for specific structure across a large collection of workflows is difficult because comparing graphs depends on computing subgraph isomorphism, which is NP-complete. Graph indexing methods identify features that help distinguish graphs of a collection to filter results for a subgraph containment query and reduce the number of subgraph isomorphism computations. For provenance, this work extends these methods to work for more exploratory queries and collections with significant overlap. However, comparing workflow or provenance graphs may not require exact equality; a match between two graphs may allow paired nodes to be similar yet not equivalent. This work presents techniques to better correlate graphs to help summarize collections. Using this infrastructure, provenance can be reused so that users can learn from their own and others' history. Just as textual search has been augmented with suggested completions based on past or common queries, provenance can be used to suggest how computations can be completed or which steps might connect to a given subworkflow. In addition, provenance can help further science by accelerating publication and reuse. By incorporating provenance into publications, authors can more easily integrate their results, and readers can more easily verify and repeat results. However, reusing past computations requires maintaining stronger associations with any input data and underlying code as well as providing paths for migrating old work to new hardware or algorithms. This work presents a framework for maintaining data and code as well as supporting upgrades for workflow computations.
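
    The computational core of the abstract is the use of graph indexing to avoid unnecessary subgraph isomorphism tests: candidate workflows are filtered by cheap structural features, and the NP-complete exact check runs only on the survivors. The sketch below illustrates that idea with labeled-edge features and networkx; it assumes workflows are nx.DiGraph objects with a "label" node attribute and is not the dissertation's actual index.

        # Sketch of feature-based filtering for subgraph containment queries.
        import networkx as nx
        from networkx.algorithms import isomorphism

        def edge_features(graph):
            """Feature set: the (source label, target label) pair of every edge."""
            return {(graph.nodes[u]["label"], graph.nodes[v]["label"]) for u, v in graph.edges}

        def query_collection(query, workflows):
            """Return the workflows containing the query as a label-preserving (induced) subgraph."""
            query_feats = edge_features(query)
            node_match = isomorphism.categorical_node_match("label", None)
            matches = []
            for wf in workflows:
                # Cheap filter: a containing graph must exhibit every feature of the query.
                if not query_feats <= edge_features(wf):
                    continue
                # Expensive exact check only for candidates that survive the filter.
                if isomorphism.DiGraphMatcher(wf, query, node_match=node_match).subgraph_is_isomorphic():
                    matches.append(wf)
            return matches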

    Querying Large Collections of Semistructured Data

    An increasing amount of data is published as semistructured documents formatted with presentational markup. Examples include data objects such as mathematical expressions encoded with MathML or web pages encoded with XHTML. Our intention is to improve the state of the art in retrieving, manipulating, or mining such data. We focus first on mathematics retrieval, which is appealing in various domains, such as education, digital libraries, engineering, patent documents, and medical sciences. Capturing the similarity of mathematical expressions also greatly enhances document classification in such domains. Unlike text retrieval, where keywords carry enough semantics to distinguish text documents and rank them, math symbols do not contain much semantic information on their own. Unfortunately, considering the structure of mathematical expressions to calculate relevance scores of documents results in ranking algorithms that are computationally more expensive than the typical ranking algorithms employed for text documents. As a result, current math retrieval systems either limit themselves to exact matches, or they ignore the structure completely; they sacrifice either recall or precision for efficiency. We propose instead an efficient end-to-end math retrieval system based on a structural similarity ranking algorithm. We describe novel optimization techniques to reduce the index size and the query processing time. Thus, with the proposed optimizations, mathematical contents can be fully exploited to rank documents in response to mathematical queries. We demonstrate the effectiveness and the efficiency of our solution experimentally, using a special-purpose testbed that we developed for evaluating math retrieval systems. We finally extend our retrieval system to accommodate rich queries that consist of combinations of math expressions and textual keywords. As a second focal point, we address the problem of recognizing structural repetitions in typical web documents. Most web pages use presentational markup standards, in which the tags control the formatting of documents rather than semantically describing their contents. Hence, their structures typically contain more irregularities than descriptive (data-oriented) markup languages. Even though applications would greatly benefit from a grammar inference algorithm that captures structure to make it explicit, the existing algorithms for XML schema inference, which target data-oriented markup, are ineffective in inferring grammars for web documents with presentational markup. There is currently no general-purpose grammar inference framework that can handle irregularities commonly found in web documents and that can operate with only a few examples. Although inferring grammars for individual web pages has been partially addressed by data extraction tools, the existing solutions rely on simplifying assumptions that limit their application. Hence, we describe a principled approach to the problem by defining a class of grammars that can be inferred from very small sample sets and can capture the structure of most web documents. The effectiveness of this approach, together with a comparison against various classes of grammars including DTDs and XSDs, is demonstrated through extensive experiments on web documents. We finally use the proposed grammar inference framework to extend our math retrieval system and to optimize it further
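
    The abstract argues that math retrieval must weigh the structure of expressions, not just their symbols. The proposed ranking algorithm itself is not given here, so the following toy sketch only illustrates the general idea of structural similarity: treat an expression as a tree, extract its root-to-leaf label paths, and score two expressions by the Dice coefficient of their path sets.

        # Toy structural similarity between expression trees (not the paper's algorithm).
        def label_paths(node, prefix=()):
            """Expressions are nested tuples such as ('frac', ('+', 'x', '1'), 'y').
            Yield every root-to-leaf path of operator/symbol labels."""
            if isinstance(node, tuple):
                head, *children = node
                for child in children:
                    yield from label_paths(child, prefix + (head,))
            else:
                yield prefix + (node,)

        def dice_similarity(expr_a, expr_b):
            """Dice coefficient over the two expressions' sets of label paths."""
            paths_a, paths_b = set(label_paths(expr_a)), set(label_paths(expr_b))
            if not paths_a and not paths_b:
                return 1.0
            return 2 * len(paths_a & paths_b) / (len(paths_a) + len(paths_b))

        # (x + 1) / y and (x + 2) / y share most of their structure: similarity 2/3.
        print(dice_similarity(("frac", ("+", "x", "1"), "y"),
                              ("frac", ("+", "x", "2"), "y")))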

    Development of a scalable database for recognition of printed mathematical expressions

    Searching for information in printed scientific documents is a challenging problem that has recently received special attention from the Pattern Recognition research community. Mathematical Expressions are complex elements that appear in scientific documents, and developing techniques for locating and recognizing them requires the preparation of data sets that can be used as benchmarks. Most current techniques for dealing with Mathematical Expressions are based on Pattern Recognition and Machine Learning techniques, and therefore these data sets have to be prepared with ground-truth information for automatic training and testing. However, preparing large data sets with ground truth is a very expensive and time-consuming task. This project introduces a data set of scientific documents that has been prepared for Mathematical Expression recognition and searching. This data set has been automatically generated from the LaTeX version of the documents and consequently can be enlarged easily. The ground truth includes the position at page level, the LaTeX version of Mathematical Expressions both embedded in the text and displayed, and the sequence of mathematical symbols, represented as Unicode code points, used to define these expressions. Based on this data set, statistics such as the total number and type of expressions, the average number of expressions per document, and their frequency distribution were extracted. A baseline classification experiment with mathematical symbols from this data set is also reported. Anitei, D. (2020). Development of a scalable database for recognition of printed mathematical expressions. Universitat Politècnica de València. http://hdl.handle.net/10251/150390
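
    Since the data set is generated automatically from the LaTeX sources, the extraction step itself is easy to sketch: pull out displayed and embedded expressions with pattern matching and record their symbols as Unicode code points. The snippet below is only an illustration of that idea under simplifying assumptions (no macros; only $...$, $$...$$ and \[...\] delimiters); the real pipeline also records page-level positions, which are not reproduced here.

        # Simplified extraction of math expressions from LaTeX source.
        import re

        def extract_math(latex_source):
            """Return (displayed, embedded) lists of raw expression strings."""
            displayed = re.findall(r"\$\$(.+?)\$\$", latex_source, re.S)
            displayed += re.findall(r"\\\[(.+?)\\\]", latex_source, re.S)
            # Strip displayed math before pairing the remaining single $ delimiters.
            stripped = re.sub(r"\$\$.+?\$\$", " ", latex_source, flags=re.S)
            stripped = re.sub(r"\\\[.+?\\\]", " ", stripped, flags=re.S)
            embedded = re.findall(r"\$(.+?)\$", stripped, re.S)
            return displayed, embedded

        def to_code_points(expression):
            """Represent the characters of an expression as Unicode code points."""
            return [ord(ch) for ch in expression]

        displayed, embedded = extract_math(open("paper.tex", encoding="utf-8").read())
        print(len(displayed), len(embedded), to_code_points(embedded[0]) if embedded else [])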

    Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments

    Motivation: The ever-growing size of sequencing data is a major bottleneck in bioinformatics, as advances in hardware development cannot keep up with the data growth. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data. Results: As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in <2 h and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. This is then used to quantify the query.
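
    Needle's actual index interleaves Bloom filters over representative k-mers from thousands of experiments; the sketch below reduces that to a single experiment to show the core idea: one Bloom filter per multiplicity threshold, and a query transcript is quantified by the highest threshold at which most of its k-mers are still found. The thresholds, filter sizes, and the 50% cutoff are illustrative assumptions, not Needle's parameters.

        # Simplified, single-experiment illustration of multiplicity-threshold Bloom filters.
        import hashlib

        class BloomFilter:
            def __init__(self, size=1 << 20, num_hashes=3):
                self.size, self.num_hashes = size, num_hashes
                self.bits = bytearray(size)

            def _positions(self, item):
                for seed in range(self.num_hashes):
                    digest = hashlib.blake2b(item.encode(), digest_size=8, salt=bytes([seed])).digest()
                    yield int.from_bytes(digest, "little") % self.size

            def add(self, item):
                for pos in self._positions(item):
                    self.bits[pos] = 1

            def __contains__(self, item):
                return all(self.bits[pos] for pos in self._positions(item))

        def k_mers(seq, k=21):
            return (seq[i:i + k] for i in range(len(seq) - k + 1))

        def build_index(kmer_counts, thresholds=(1, 10, 100)):
            """kmer_counts: dict mapping k-mer -> multiplicity in one experiment."""
            filters = {t: BloomFilter() for t in thresholds}
            for kmer, count in kmer_counts.items():
                for t in thresholds:
                    if count >= t:
                        filters[t].add(kmer)
            return filters

        def estimate_quantification(filters, transcript, k=21):
            """Highest threshold at which at least half of the transcript's k-mers are present."""
            estimate = 0
            total = max(1, len(transcript) - k + 1)
            for t in sorted(filters):
                hits = sum(1 for km in k_mers(transcript, k) if km in filters[t])
                if hits / total >= 0.5:
                    estimate = t
            return estimate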

    Bioinformatics process management: information flow via a computational journal

    This paper presents the Bioinformatics Computational Journal (BCJ), a framework for conducting and managing computational experiments in bioinformatics and computational biology. These experiments often involve series of computations, data searches, filters, and annotations which can benefit from a structured environment. Systems to manage computational experiments exist, ranging from libraries with standard data models to elaborate schemes that chain together input and output between applications. Yet, although such frameworks are available, their use is not widespread; ad hoc scripts are often required to bind applications together. The BCJ explores another solution to this problem through a computer-based environment suitable for on-site use, which builds on the traditional laboratory notebook paradigm. It provides an intuitive, extensible paradigm designed for expressive composition of applications. Extensive features facilitate sharing data, computational methods, and entire experiments. By focusing on the bioinformatics and computational biology domain, the scope of the computational framework was narrowed, permitting us to implement a capable set of features for this domain. This report discusses the features determined critical by our system and other projects, along with design issues. We illustrate the use of our implementation of the BCJ on two domain-specific examples.

    Automatic non-linear video editing for home video collections

    The video editing process consists of deciding what elements to retain, delete, or combine from various video sources so that they come together in an organized, logical, and visually pleasing manner. Before the digital era, non-linear editing involved the arduous process of physically cutting and splicing video tapes, and was restricted to the movie industry and a few video enthusiasts. Today, when digital cameras and camcorders have made large personal video collections commonplace, non-linear video editing has gained renewed importance and relevance. Almost all video editing systems available today depend on considerable user interaction to produce coherent edited videos. In this work, we describe an automatic non-linear video editing system for generating coherent movies from a collection of unedited personal videos. Our thesis is that computing image-level visual similarity in an appropriate manner forms a good basis for automatic non-linear video editing. To our knowledge, this is a novel approach to solving this problem. The generation of the output video is guided by one or more input keyframes from the user, which determine its content. The output video is generated in a manner such that it is non-repetitive and follows the dynamics of the input videos. When no input keyframes are provided, our system generates "video textures" with the content of the output chosen at random. Our system demonstrates promising results on large video collections and is a first step towards increased automation in non-linear video editing.
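
    The abstract's central claim is that image-level visual similarity is a sufficient basis for automatic editing, but the specific similarity measure is not stated here. As a simple stand-in, the sketch below compares frames by per-channel colour histograms (histogram intersection) and uses that score to pick which clip best follows a user-supplied keyframe; the real system's features and selection logic may differ.

        # Histogram-intersection similarity between frames (H x W x 3 uint8 arrays).
        import numpy as np

        def color_histogram(frame, bins=16):
            """Concatenated, normalised per-channel histograms of an RGB frame."""
            hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
            hist = np.concatenate(hists).astype(float)
            return hist / hist.sum()

        def similarity(frame_a, frame_b):
            """Histogram intersection: 1.0 for identical colour distributions."""
            return float(np.minimum(color_histogram(frame_a), color_histogram(frame_b)).sum())

        def best_next_clip(keyframe, clips):
            """Pick the clip (a list of frames) whose first frame best matches the keyframe."""
            return max(clips, key=lambda clip: similarity(keyframe, clip[0]))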

    Genotypic Variation in Sweetpotato Ipomoea batatas (L.) Lam. Clones.

    Arbitrarily-primed PCR-based assays established the presence of sweetpotato intra-clonal genetic variability. These DNA polymorphism assays provided benchmark information regarding cultivar genetic uniformity in sweetpotato foundation seed programs. Arbitrarily-primed markers were also used to compare the genetic uniformity among sweetpotato clones derived conventionally, i.e., through adventitious sprouts, and nodally-based propagation systems. Initially, 38 primers generated 110 scorable DNA fragments using two virus-indexed plants from each clone source. Twenty-one bands (18.6%) were scored as putative polymorphic markers based on the presence or absence of amplified products. A subset of 14 marker loci generated by four selected primers was used to further assay 10 sample plants per clone group. Polymorphism ranged from 7.1% to 35.7% in five of eight clone groups. Field studies show variation in nearly all yield grades measured. In three tests during the 1991 and 1992 seasons, yield differences ranged from 27% to 46% within the economically important U.S. No. 1 root grade. The results suggest the usefulness of arbitrarily-primed markers in detecting intra-clonal genomic variability in the crop. To determine the role of propagation method in sweetpotato genotypic uniformity, a single sprout each of 'Jewel,' 'Sumor,' and L87-95 served as the source of clonal plants simultaneously propagated through conventional adventitious procedures and an in vitro-based nodal technique. Fifteen arbitrary primers generated 64 scorable amplified fragments, 29 of which were putatively polymorphic across n = 60 samples (10 each of nodally and adventitiously derived plants per genotype). Within adventitiously derived materials, putative polymorphisms ranged from 4.7% to 31.3% depending upon genotypic class. In contrast, putative polymorphisms ranged from 0.0% to 3.1% among nodally-derived samples. The marker loci differentiated the genotypes and putative marker phenotype variants as revealed through multidimensional scaling analysis. An 'analysis of molecular variance' shows that genotypic effects accounted for 88.7% of the total marker variability, while propagation effects (within genotypic groups) accounted for 11.3%. The results suggest variability associated with propagation, wherein clonal plants derived from pre-existing meristematic regions are more genetically uniform than plants propagated from adventitious origins.

    On Weighted k-mer Dictionaries

    We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri, Bioinformatics 2022) to also store the weights of the k-mers compactly. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing (several times) better compression than the empirical entropy of the weights. We also study the problem of reducing the number of runs in the weights to improve compression even further and illustrate a lower bound for this problem. We propose an efficient greedy algorithm to reduce the number of runs and show empirically that it performs well, i.e., very similarly to the lower bound. Lastly, we corroborate our findings with experiments on real-world datasets and a comparison with competitive alternatives. To date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.
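
    The key technical observation is that weights listed in the dictionary's k-mer order form long runs, so run-length encoding can beat the empirical entropy of the weights. SSHash's actual layout (and the run-minimisation step) is more refined, but the sketch below shows the basic encoding and how a weight-of-rank query is answered with prefix sums over run lengths.

        # Run-length encoding of weights in k-mer order, with rank -> weight lookup.
        from bisect import bisect_right
        from itertools import groupby

        def run_length_encode(weights):
            """Collapse the weight sequence into (value, run_length) pairs."""
            return [(value, sum(1 for _ in group)) for value, group in groupby(weights)]

        class RunLengthWeights:
            def __init__(self, weights):
                self.runs = run_length_encode(weights)
                self.run_ends = []            # exclusive cumulative end of each run
                total = 0
                for _, length in self.runs:
                    total += length
                    self.run_ends.append(total)

            def weight_at(self, rank):
                """Weight of the k-mer whose rank in the dictionary order is `rank`."""
                return self.runs[bisect_right(self.run_ends, rank)][0]

        # Five k-mers with weights 7,7,7,3,3 compress to two runs.
        rlw = RunLengthWeights([7, 7, 7, 3, 3])
        assert rlw.weight_at(0) == 7 and rlw.weight_at(4) == 3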

    Systematic Analysis of the Factors Contributing to the Variation and Change of the Microbiome

    Understanding changes and trends in biomedical knowledge is crucial for individuals, groups, and institutions as biomedicine improves people's lives, supports national economies, and facilitates innovation. However, as knowledge changes, what evidence illustrates these changes? In the case of the microbiome, a multi-dimensional concept from biomedicine, there are significant increases in publications, citations, funding, collaborations, and other explanatory variables or contextual factors. What is observed in the microbiome, or in any historical evolution of a scientific field or of scientific knowledge, is that these changes are related to changes in knowledge, but what is not understood is how to measure and track changes in knowledge. This investigation highlights how contextual factors from the language and social context of the microbiome are related to changes in the usage, meaning, and scientific knowledge of the microbiome. Two interconnected studies integrating qualitative and quantitative evidence to examine the variation and change of the microbiome are presented. First, the concepts microbiome, metagenome, and metabolome are compared to determine the boundaries of the microbiome concept in relation to other concepts whose conceptual boundaries have been cited as overlapping. A collection of publications for each concept, or corpus, is presented, with a focus on how to create, collect, curate, and analyze large data collections. This study concludes with suggestions on how to analyze biomedical concepts using a hybrid approach that combines results from the larger language context and individual words. Second, the results of a systematic review that describes the variation and change of microbiome research, funding, and knowledge are examined. A corpus of approximately 28,000 articles on the microbiome is characterized, and a spectrum of microbiome interpretations is suggested based on differences related to context. The collective results suggest the microbiome is a separate concept from the metagenome and metabolome, and that the variation and change of the microbiome concept was influenced by contextual factors. These results provide insight into how concepts with extensive resources behave within biomedicine and suggest the microbiome is possibly representative of conceptual change, or a preview of new dynamics within science that are expected in the future.