String Searching with Ranking Constraints and Uncertainty
Strings play an important role in many areas of computer science. Searching for a pattern in a string or a string collection is one of the most classic problems. Different variations of this problem, such as document retrieval, ranked document retrieval, and dictionary matching, have been well studied. The enormous growth of the internet, large genomic projects, sensor networks, and digital libraries necessitates not just efficient algorithms and data structures for general string indexing, but also indexes for texts with fuzzy information and support for queries with different constraints. This dissertation addresses some of these problems and proposes indexing solutions. One such variation is the document retrieval query with included and excluded/forbidden patterns, where the objective is to retrieve all the relevant documents that contain the included patterns and do not contain the excluded patterns. We continue the previous work done on this problem and propose a more efficient solution. We conjecture that any significant improvement over these results is highly unlikely. We also consider the scenario where the query consists of more than two patterns. The forbidden pattern problem suffers from the drawback that linear-space (in words) solutions are unlikely to yield better than O(√(n/occ)) per-document reporting time, where n is the total length of the documents and occ is the number of output documents. Continuing this path, we introduce a new variation, namely the document retrieval with forbidden extension query, where the forbidden pattern is an extension of the included pattern. We also address the more general top-k version of the problem, which retrieves the top k documents, where the ranking is based on a PageRank relevance metric. This problem finds motivation in search applications. It also holds theoretical interest, as we show that the hardness of the forbidden pattern problem is alleviated in this variation. We achieve linear space and optimal query time for this variation.
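As a point of reference for the query semantics only (not the dissertation's index), the included/excluded pattern query can be sketched as a naive linear scan; the document list below is a made-up example:

```python
def forbidden_pattern_query(docs, included, excluded):
    """Report all documents that contain `included` and avoid `excluded`.

    A naive O(n)-per-query baseline that only illustrates the query
    semantics; the indexes described in the text answer the same query
    far more efficiently.
    """
    return [i for i, d in enumerate(docs)
            if included in d and excluded not in d]

docs = ["abracadabra", "banana", "cabbage"]
print(forbidden_pattern_query(docs, "ab", "ba"))  # [0]
```

Only "abracadabra" contains "ab" without also containing the forbidden "ba"; "cabbage" contains both, so it is filtered out.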
We also propose succinct indexes for both these problems. Position-restricted pattern matching considers the scenario where only part of the text is searched. We propose a succinct index for this problem with efficient query time. An important application for this problem stems from searching in genomic sequences, where only part of the gene sequence is searched for interesting patterns. The problem of computing discriminating (resp. generic) words is to report all minimal (resp. maximal) extensions of a query pattern which are contained in at most (resp. at least) a given number of documents. These problems are motivated by applications in computational biology, text mining, and automated text classification. We propose succinct indexes for these problems. Strings with uncertainty and fuzzy information play an important role in increasingly many applications. We propose a general framework for indexing uncertain strings such that a deterministic query string can be searched efficiently. String matching becomes a probabilistic event when a string contains uncertainty, i.e. each position of the string can have different probable characters, with an associated probability of occurrence for each character. Such uncertain strings are prevalent in various applications such as biological sequence data, event monitoring, and automatic ECG annotations. We consider two basic problems of string searching, namely substring searching and string listing. We formulate these well-known problems in the uncertain-string paradigm and propose exact and approximate solutions for them. We also discuss a constrained variation of orthogonal range searching. Given a set of points, the task of orthogonal range searching is to build a data structure such that all the points inside an orthogonal query region can be reported. We introduce a new variation, namely shared constraint range searching, which naturally arises in constrained pattern matching applications.
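The position-restricted variant can likewise be illustrated with a naive scan that only considers starting positions inside a query window; the succinct index described above answers the same query far more efficiently:

```python
def position_restricted_search(text, pattern, lo, hi):
    """Return all occurrences of `pattern` starting in positions
    lo..hi of `text` (0-based, inclusive). A linear-scan stand-in
    for the succinct position-restricted index, shown only to pin
    down the query semantics.
    """
    m = len(pattern)
    return [i for i in range(lo, min(hi, len(text) - m) + 1)
            if text[i:i + m] == pattern]

print(position_restricted_search("acgtacgtacgt", "acg", 2, 9))  # [4, 8]
```

The pattern "acg" occurs at positions 0, 4, and 8, but only the occurrences starting inside the window [2, 9] are reported.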
Shared constraint range searching is a special four-sided range reporting query problem where two of the constraints have a shared component, effectively reducing the number of independent constraints. For this problem, we propose a linear-space index that can match the best known bound for the three-dimensional dominance reporting problem. We extend our data structure to the external memory model.
Advanced rank/select data structures: succinctness, bounds and applications.
The thesis explores new theoretical results and applications of rank and select data structures. Given a string, select(c, i) gives the position of the ith occurrence of character c in the string, while rank(c, p) counts the number of instances of character c on the left of position p.
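The semantics above can be pinned down with a naive O(n) reference implementation (0-based positions assumed; the point of succinct structures is to answer both queries in near-constant time while using compressed space):

```python
def rank(s, c, p):
    """Count occurrences of character c strictly to the left of position p."""
    return s[:p].count(c)

def select(s, c, i):
    """0-based position of the i-th (1-based) occurrence of c, or -1."""
    pos = -1
    for _ in range(i):
        pos = s.find(c, pos + 1)
        if pos == -1:
            return -1
    return pos

s = "abracadabra"
print(rank(s, "a", 4))    # 2: "abra" holds two a's
print(select(s, "a", 3))  # 5: the third 'a' sits at position 5
```

Note that rank and select are inverses in the sense that rank(s, c, select(s, c, i) + 1) == i whenever the i-th occurrence exists.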
Succinct rank/select data structures are space-efficient versions of standard ones, designed to keep data compressed and at the same time answer queries rapidly. They are the basis of more involved compressed and succinct data structures, which in turn are motivated by today's need to analyze and operate on massive data sets quickly, where space efficiency is crucial. The thesis builds on the state of the art left by years of study and produces results on multiple fronts.
Analyzing binary succinct data structures and their link with predecessor data structures, we integrate data structures for the latter problem into the former. The result is a data structure which outperforms that of Patrascu [2008] in a range of cases that were not studied before, namely when the lower bound for predecessor search does not apply and constant-time rank is not feasible.
Further, we propose the first lower bound for succinct data structures on generic strings, achieving a linear trade-off between the time for rank/select execution and the additional space (w.r.t. the plain data) needed by the data structure. The proposal addresses systematic data structures, namely those that only access the underlying string through ADT calls and do not encode it directly.
Also, we propose a matching upper bound that proves the tightness of our lower bound.
Finally, we apply rank/select data structures to the substring counting problem, where we seek to preprocess a text and generate a summary data structure which is stored in lieu of the text and answers substring counting queries with additive error. The results include a provably optimal data structure with generic additive error and a data structure that errs only on infrequent patterns, with significant practical space gains.
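To make "additive error" concrete, here is a much simpler summary with the same flavor: a Count-Min sketch fed every substring up to a bounded length. This is only an illustration of additive-error counting, not the thesis's data structure, and `summarize` with its parameters is a made-up helper:

```python
import hashlib

class CountMin:
    """Count-Min sketch: frequency estimates with one-sided additive
    error (it never underestimates). A simple stand-in for the
    additive-error substring summaries discussed above."""

    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One hashed column per row; the row index salts the hash.
        for row in range(self.depth):
            h = hashlib.blake2b((str(row) + "|" + key).encode(),
                                digest_size=8).digest()
            yield row, int.from_bytes(h, "big") % self.width

    def add(self, key):
        for row, col in self._cells(key):
            self.table[row][col] += 1

    def count(self, key):
        return min(self.table[row][col] for row, col in self._cells(key))

def summarize(text, max_len, width=1024, depth=4):
    """Feed every substring of length <= max_len into the sketch,
    which can then be stored in lieu of the text."""
    cm = CountMin(width, depth)
    for i in range(len(text)):
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            cm.add(text[i:j])
    return cm

cm = summarize("abracadabra", max_len=3)
print(cm.count("abr"))  # true count is 2; the sketch may only overestimate
```

The sketch's space depends on `width * depth`, not on the text length, which is the trade that makes storing the summary instead of the text worthwhile.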
Evaluating the quality of society and public services
A person’s quality of life is not only shaped by individual choices and behaviour: the surrounding environment and the public services on offer have a big influence on how people perceive the society they live in and on their evaluation of their own quality of life. Institutions influence the quality of society through collective actions that individuals cannot undertake themselves: for example maintaining schools, hospitals and roads. Public policies are also responsible for ensuring that water and air are not polluted, and for reducing tensions between different social groups. If public policies are effective and these services are provided to a high standard, the quality of society will improve, with a positive impact on the overall quality of life of citizens. This is why European policymakers and citizens share a common concern regarding the quality of society and public services: the actions of policymakers should contribute to improving the quality of citizens’ lives. To evaluate whether this is in fact happening, one needs to look beyond objective measures of material wealth such as gross domestic product (GDP) and find out how citizens assess the conditions in their society. The second European Quality of Life Survey (EQLS), carried out by the European Foundation for the Improvement of Living and Working Conditions (Eurofound) in 2007, asked European citizens to evaluate multiple aspects of the quality of society. The result is a comprehensive picture of the diverse social realities in the 27 EU Member States, in Norway, Croatia, the former Yugoslav Republic of Macedonia and Turkey.
Representation and Exploitation of Event Sequences
The Ten Commandments, the thirty best smartphones on the market, and the five most wanted people by the FBI. Our life is ruled by sequences: thought sequences, number sequences, event sequences... A history book is nothing more than a compilation of events, and our favorite film is just a sequence of scenes. All of them have something in common: it is possible to acquire relevant information from them. Frequently, by accumulating some data from the elements of each sequence we may access hidden information (e.g. the number of passengers transported by a bus on a journey is the sum of the passengers who got on over the sequence of stops made); other times, reordering the elements by any of their characteristics facilitates access to the elements of interest (e.g. the books published in 2019 can be ordered chronologically, by author, by literary genre, or even by a combination of characteristics); but it will always be sought to store them in the smallest space possible.
Thus, this thesis proposes technological solutions for the storage and subsequent processing of events, focusing specifically on three fundamental aspects that can be found in any application that needs to manage them: compressed and dynamic storage, aggregation or accumulation of elements of the sequence, and reordering of the elements of the sequence by their different characteristics or dimensions.
The first contribution of this work is a compact structure for the dynamic compression of event sequences. This structure allows any sequence to be compressed in a single pass, that is, it is capable of compressing in real time as elements arrive. This contribution is a milestone in the world of compression since, to date, it is the first proposal for a general-purpose variable-to-variable dynamic compressor.
Regarding aggregation, a data-warehouse-like proposal is presented, capable of storing information on any characteristic of the events in a sequence in an aggregated, compact, and accessible way. Following the philosophy of current data warehouses, we avoid repeating cumulative operations and speed up aggregate queries by preprocessing the information and keeping it in this separate structure.
Finally, this thesis addresses the problem of indexing event sequences considering their different characteristics and possible reorderings. A new approach for simultaneously keeping the elements of a sequence ordered by different characteristics is presented through compact structures. Thus, it is possible to consult the information and perform operations on the elements of the sequence using any possible rearrangement in a simple and efficient way.
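The thesis's compressor is a variable-to-variable dynamic structure; as a much simpler illustration of the one-pass, compress-as-elements-arrive idea, here is streaming LZW (a classic variable-to-fixed scheme, and not the method proposed in the thesis):

```python
def lzw_encode(stream, alphabet):
    """One-pass LZW encoder: the dictionary grows as symbols arrive,
    so each element is compressed in real time with no second pass.
    Illustrative only; the thesis proposes a different, variable-to-
    variable dynamic compressor."""
    dic = {c: i for i, c in enumerate(alphabet)}
    w, out = "", []
    for c in stream:
        if w + c in dic:
            w += c               # extend the current phrase
        else:
            out.append(dic[w])   # emit code for the longest known phrase
            dic[w + c] = len(dic)  # learn the new phrase on the fly
            w = c
    if w:
        out.append(dic[w])
    return out

print(lzw_encode("abababab", "ab"))  # [0, 1, 2, 4, 1]
```

Note how the repeated "ab" pattern is quickly absorbed into ever-longer dictionary phrases, which is exactly the adapt-as-you-go behavior a dynamic compressor needs.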
Transepithelial accelerated corneal collagen crosslinking in patients with progressive keratoconus: long term follow up results
Purpose: to systematically evaluate the long-term efficacy of transepithelial accelerated corneal collagen crosslinking (TE-ACXL) in the treatment of eyes with progressive keratoconus by reporting its visual and morphological outcomes throughout a 4-year follow-up.
Methods: eyes of patients who underwent TE-ACXL (6mW/cm2 for 15 minutes) for progressive keratoconus were included in this retrospective cohort study. Best-corrected visual acuity (BCVA), keratometry measurements, thinnest corneal thickness (PachyMin), and topographic indexes were analyzed preoperatively and every 6 months after TE-ACXL, up to a maximum of 48 months. Disease progression was defined as an increase ≥ 1.00 D in corneal astigmatism, an increase ≥ 1.00 D in maximum keratometry (Kmax), a decrease ≥ 2% in PachyMin, or an increase ≥ 0.42 units in D-index.
Results: the study enrolled 39 eyes from 30 patients. No significant differences were observed in BCVA, corneal astigmatism, Kmax, index of surface variance (ISV), index of height decentration (IHD), and keratoconus index (KI) between baseline and subsequent follow-up evaluations (p>0.05). There was a significant increase at the 12-, 24-, and 36-month follow-ups in mean keratometry (Km) (0.66 ± 1.07 D, p=0.001; 0.94 ± 1.42 D, p=0.001; 1.48 ± 1.19 D, p=0.002) and D-index (0.50 ± 1.05 units, p=0.011; 0.53 ± 1.19 units, p=0.024; 1.29 ± 1.11 units, p=0.003). There were significant decreases in PachyMin at 36 months (-10.45 ± 15.20 µm, p=0.046) and in the index of vertical asymmetry (IVA) at 24 months (-0.07 ± 0.16 units, p=0.024). Twenty-eight (71.8%) eyes maintained progression by at least one criterion, and 2 (5.1%) eyes fulfilled all 4 progression criteria. Surgery and follow-up were uneventful in all subjects.
Conclusion: TE-ACXL seems to be a safe and effective treatment for progressive keratoconus. The definition of new, specific, and meaningful progression criteria and further prospective studies with larger cohorts are recommended.
Novel computational techniques for mapping and classifying Next-Generation Sequencing data
Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining extensive amounts of short or long DNA sequence reads from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, and understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computing capacities, which creates new computational challenges in NGS data processing.
In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and only little attention has been paid to non-standard mapping approaches. Here, we propound so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing.
An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which maintains smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments to disk.
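The role of an online consensus caller can be sketched with a toy stream of aligned bases. Ococo's compact bit counters are replaced here by plain dictionaries, so this shows only the interface, not the actual data structure:

```python
from collections import defaultdict

class OnlineConsensus:
    """Streaming consensus sketch: per-position base counters are
    updated as aligned bases arrive, and the consensus base can be
    queried at any time without storing the alignments themselves."""

    def __init__(self):
        self.counts = defaultdict(lambda: {b: 0 for b in "ACGT"})

    def update(self, pos, base):
        """Record one aligned base observed at reference position pos."""
        if base in "ACGT":
            self.counts[pos][base] += 1

    def consensus(self, pos):
        """Most frequently observed base at pos (ties broken A<C<G<T)."""
        c = self.counts[pos]
        return max("ACGT", key=lambda b: c[b])

cc = OnlineConsensus()
for base in ["A", "A", "G", "A"]:  # bases aligned to position 7, one read at a time
    cc.update(7, base)
print(cc.consensus(7))  # "A"
```

Because each update touches only one position's counters, the structure works on a stream and never needs the alignments on disk, which is the property the text highlights.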
Metagenomic classification of NGS reads is another major topic studied in the thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly estimate the relative abundance of the involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We also provide a modified version of BWA that improves the BWT-index for quick k-mer look-up.
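A spaced seed keeps only the read positions marked '1' in a binary mask, so a mismatch falling on a '0' position does not break the match. A minimal extraction sketch (the mask here is a made-up example, not Seed-Kraken's actual seed):

```python
def spaced_seeds(read, mask):
    """Extract one spaced seed per offset of `mask` over `read`,
    keeping only the characters under the '1' positions."""
    keep = [i for i, m in enumerate(mask) if m == "1"]
    return [''.join(read[o + i] for i in keep)
            for o in range(len(read) - len(mask) + 1)]

print(spaced_seeds("ACGTAC", "1101"))  # ['ACT', 'CGA', 'GTC']
```

Two reads differing only at a masked-out ('0') position still produce identical seeds, which is the intuition behind the accuracy gains reported above.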
Scalable Profiling and Visualization for Characterizing Microbiomes
Metagenomics is the study of the combined genetic material found in microbiome samples, and it serves as an instrument for studying microbial communities, their biodiversities, and the relationships to their host environments. Creating, interpreting, and understanding microbial community profiles produced from microbiome samples is a challenging task as it requires large computational resources along with innovative techniques to process and analyze datasets that can contain terabytes of information.
The community profiles are critical because they provide information about what microorganisms are present in the sample, and in what proportions. This is particularly important as many human diseases and environmental disasters are linked to changes in microbiome compositions.
In this work we propose novel approaches for the creation and interpretation of microbial community profiles. This includes: (a) a cloud-based, distributed computational system that generates detailed community profiles by processing large DNA sequencing datasets against large reference genome collections, (b) the creation of Microbiome Maps: interpretable, high-resolution visualizations of community profiles, and (c) a machine learning framework for characterizing microbiomes from the Microbiome Maps that delivers deep insights into microbial communities.
The proposed approaches have been implemented in three software solutions: Flint, a large scale profiling framework for commercial cloud systems that can process millions of DNA sequencing fragments and produces microbial community profiles at a very low cost; Jasper, a novel method for creating Microbiome Maps, which visualizes the abundance profiles based on the Hilbert curve; and Amber, a machine learning framework for characterizing microbiomes using the Microbiome Maps generated by Jasper with high accuracy.
Results show that Flint scales well for reference genome collections that are an order of magnitude larger than those used by competing tools, while using less than a minute to profile a million reads on the cloud with 65 commodity processors. Microbiome Maps produced by Jasper are compact, scalable representations of extremely complex microbial community profiles with numerous demonstrable advantages, including the ability to display latent relationships that are hard to elicit. Finally, experiments show that by using images as input instead of unstructured tabular input, the carefully engineered software, Amber, can outperform other sophisticated machine learning tools available for the classification of microbiomes.
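Jasper's Microbiome Maps rely on the Hilbert curve's locality: consecutive entries of a linear abundance profile land in adjacent grid cells. The standard index-to-coordinate conversion can be sketched as follows (an illustration of the curve itself, not Jasper's code):

```python
def hilbert_d2xy(n, d):
    """Map 1-D index d to (x, y) on an n-by-n Hilbert curve, with n a
    power of two. Consecutive indices map to adjacent cells, which is
    what lets a linear profile become a locality-preserving image."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate the quadrant if needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx                  # move into the right quadrant
        y += s * ry
        t //= 4
        s *= 2
    return x, y

print([hilbert_d2xy(2, d) for d in range(4)])
# [(0, 0), (0, 1), (1, 1), (1, 0)]
```

Plotting abundance values at `hilbert_d2xy(n, i)` for each profile index `i` keeps taxonomically nearby entries spatially close, unlike a simple row-major layout.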
Financial Problems as Predictors of Divorce: A Social Exchange Perspective
By using a conceptual framework derived from social exchange theory, this study examined the relationship between financial problems and divorce. Nationally representative data from the Marital Instability Over the Life Course panel study were used to determine if financial problems reported at one interview could predict those who would divorce by the subsequent interview. A self-replicating design allowed data analyses for three separate time periods: 1980-1983, 1983-1988, and 1988-1992.
The sample used in this study consisted of 1,620 married men and women under the age of 55. Additionally, the participants were in their first marriages.
Divorce was the only dependent variable. The independent variables included eight financial problems: (a) husband's job interferes with family life, (b) husband's job satisfaction, (c) wife's job satisfaction, (d) wife's work preference, (e) satisfaction with spouse as breadwinner, (f) satisfaction with financial situation, (g) spending money foolishly/unwisely, and (h) financial situation getting better or worse. Additionally, the total number of financial problems, age at marriage, gender, income, and presence of children under age 6 were used as independent variables in the analyses. Bivariate correlation and discriminant analysis procedures were used to analyze the data.
The results indicated statistically significant relationships between financial problems and divorce for all independent variables except wife's job satisfaction, gender, and income. However, none of the independent variables (singularly or in combination) explained more than 5% of the variance in divorce; financial problems were inadequate predictors of divorce.
Although the results of this investigation did not provide substantive support for the popular belief that money problems are a major cause of divorce, this research filled a gap in the divorce literature, posited a clearer definition of financial problems, and provided a more complete conceptual model of the relationships between marital problems and divorce. Finally, the unanswered questions raised by this study indicate the need for continued investigation of the impact that financial issues have on marital relationships.
Interprofessional working: perceptions of healthcare professionals in Nepalese hospitals
Interprofessional working (IPW) is an essential part of the health service delivery system. Effective delivery of health services relies on the contribution of healthcare professionals (HCPs) from all groups. The aim of the study is to examine how HCPs collaborate and to assess their perceptions of the effect of IPW on healthcare delivery. This study follows a qualitative research approach. It was conducted in three hospitals in Nepal using a semi-structured interview schedule. A purposive sampling method was used to select the hospitals and the participants. Altogether, thirty-eight HCPs participated in the research. This study suggests that IPW is an integral part of HCPs' working lives, and that they view it as a booster that supports them in delivering optimal and desired health outcomes. HCPs perceived that organisational support and the involvement of service users are important for the successful delivery of IPW. Verbal means of communication are mostly used during IPW. Nursing and allied health professionals (AHPs) are more critical of the medical professionals because they experience domination and professional isolation from them. This study recognises factors that support IPW and also identifies various barriers to IPW in Nepalese hospitals.