
    DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication

    Metrics for set similarity are a core aspect of several data mining tasks. To remove duplicate results in a Web search, for example, a common approach looks at the Jaccard index between all pairs of pages. In social network analysis, a much-celebrated metric is the Adamic-Adar index, widely used to compare node neighborhood sets in the important problem of predicting links. However, with the increasing amount of data to be processed, calculating the exact similarity between all pairs can be intractable. The challenge of working at this scale has motivated research into efficient estimators for set similarity metrics. The two most popular estimators, MinHash and SimHash, are indeed used in applications such as document deduplication and recommender systems, where large volumes of data need to be processed. Given the importance of these tasks, the demand for better estimators is evident. We propose DotHash, an unbiased estimator for the intersection size of two sets. DotHash can be used to estimate the Jaccard index and, to the best of our knowledge, is the first method that can also estimate the Adamic-Adar index and a family of related metrics. We formally define this family of metrics, provide theoretical bounds on the probability of estimation errors, and analyze its empirical performance. Our experimental results indicate that DotHash is more accurate than the other estimators at link prediction and duplicate-document detection, with the same complexity and similar comparison time.
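
    The abstract above describes DotHash as an unbiased estimator of the intersection size of two sets. As a rough illustration of how a dot-product estimator of this kind can work, the Python sketch below maps every element of the universe to a random unit vector, represents a set as the sum of its elements' vectors, and uses the dot product of two set representations to estimate the intersection size (and from it the Jaccard index). The dimension, the weighting, and the vector construction are illustrative assumptions, not necessarily the exact scheme of the paper.

        import numpy as np

        def make_embeddings(universe, dim, seed=0):
            # Hypothetical construction: one random unit vector per element.
            rng = np.random.default_rng(seed)
            vecs = rng.standard_normal((len(universe), dim))
            vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
            return {e: vecs[i] for i, e in enumerate(universe)}

        def sketch(s, emb, weight=lambda e: 1.0):
            # A set is represented by the (optionally weighted) sum of its vectors.
            return sum(weight(e) * emb[e] for e in s)

        universe = list(range(1000))
        emb = make_embeddings(universe, dim=2048)
        A, B = set(range(0, 600)), set(range(400, 1000))

        # Distinct random unit vectors are orthogonal in expectation, so the dot
        # product of the two sketches estimates |A ∩ B| (here, 200) without bias.
        est_inter = float(sketch(A, emb) @ sketch(B, emb))
        est_jaccard = est_inter / (len(A) + len(B) - est_inter)
        print(round(est_inter), round(est_jaccard, 3))  # roughly 200 and 0.2

        # An Adamic-Adar-style estimate could weight each element e by
        # 1 / sqrt(log(degree(e))), so that the dot product approximates
        # the sum of 1 / log(degree(e)) over the common neighbors.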

    Matching Profiles from Social Network Sites

    In recent years, social networking sites have become very popular. Many people are members of one or more of these profile sites and tend to put a lot of information about themselves online. This often publicly available data can be useful for many purposes; retrieving all available data about one person and merging it into a single profile is even more useful. Detecting which profiles belong to the same person therefore becomes very important. This task is called Entity Resolution (ER).
    In this research we develop a model to solve the ER problem for profiles from social networking sites. First we present a simple model. Then we try to improve this model by making use of the social networks a member can have on these sites. We believe that involving the networks can improve the results significantly.
    The general idea is that we have two sites with profiles. With the model we try to find out which profiles of the first profile site correspond to which profiles of the second profile site, whereby we assume a person to have at most one profile at each profile site.
    In the simple model, we compare all profiles of the first profile site against all profiles of the second site. This comparison results in a score for each pair: the pairwise similarity score. The higher this score, the higher the probability that these profiles belong to the same person. The pairs that satisfy the so-called pairwise threshold are the candidate matches. From these candidate matches, the matches are chosen.
    In the network model, we start the same way. When the list of candidate matches is determined, the network phase starts. For each candidate match the network similarity score is calculated by determining the overlap in the networks of both profiles in the candidate match. The more overlap between the networks, the higher the network similarity score, and the higher the probability that the profiles in the candidate match belong to the same person. This time, the candidate matches must also satisfy a network threshold in order to remain candidate matches. Then, from the remaining candidate matches, the matches are chosen.
    To test whether the network model would indeed improve on the simple model, we set up experiments. Since no suitable data sets were available, we retrieved our own data set, which unfortunately turned out to have some limitations. We also built a prototype that implements the model. The prototype has several parameters whose values we could vary in the experiments to find a good configuration.
    The network model ensures that more conditions need to be met to be a match, and the experimental results confirm this: the precision of the results increases. On the other hand, due to these strict conditions, corresponding profiles are missed, which is undesired. However, in case there are ambiguous profiles in the set, the network model can distinguish the correct profile, which is highly desired. This situation will occur frequently in real life, hence we think the network model can really contribute to solving the ER problem.
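
    The two-phase model described above (pairwise scores filtered by a pairwise threshold, followed by a network-overlap filter) can be sketched in a few lines of Python. The attribute compared, the similarity functions, and the threshold values are all illustrative assumptions; the sketch also assumes that friend identifiers are directly comparable across the two sites, which in practice is itself part of the problem.

        from difflib import SequenceMatcher

        def pairwise_similarity(p, q):
            # Compare profile attributes; here only the display name, as an example.
            return SequenceMatcher(None, p["name"].lower(), q["name"].lower()).ratio()

        def network_similarity(p, q):
            # Overlap (Jaccard) between the friend networks of the two profiles.
            n1, n2 = set(p["friends"]), set(q["friends"])
            return len(n1 & n2) / len(n1 | n2) if n1 | n2 else 0.0

        def match(site_a, site_b, pair_threshold=0.8, network_threshold=0.2):
            # Phase 1: all cross-site pairs scoring above the pairwise threshold
            # become candidate matches.
            candidates = [(p, q) for p in site_a for q in site_b
                          if pairwise_similarity(p, q) >= pair_threshold]
            # Phase 2: a candidate survives only if the networks also overlap enough;
            # the survivors are the matches (the full model would additionally keep
            # at most one match per profile).
            return [(p["id"], q["id"]) for p, q in candidates
                    if network_similarity(p, q) >= network_threshold]

        site_a = [{"id": "a1", "name": "Jan Jansen", "friends": {"f1", "f2"}}]
        site_b = [{"id": "b7", "name": "Jan Jansen", "friends": {"f1", "f2", "f3"}}]
        print(match(site_a, site_b))  # [('a1', 'b7')]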

    The Fourth International VLDB Workshop on Management of Uncertain Data


    Decoding the Past

    The human genome is continuously evolving, hence the sequenced genome is a snapshot in time of this evolving entity. Over time, the genome accumulates mutations that can be associated with different phenotypes, such as physical traits, diseases, etc. Underlying mutation accumulation is an evolution channel (the term "channel" is motivated by the notion of a communication channel, introduced by Shannon [1] in 1948, which started the field of information theory), which is controlled by hereditary, environmental, and stochastic factors. The premise of this thesis is to understand the human genome using an information-theoretic framework. In particular, it focuses on: (i) the analysis and characterization of the evolution channel using measures of capacity, expressiveness, evolution distance, and uniqueness of ancestry, and the use of these insights for (ii) the design of error-correcting codes for DNA storage, (iii) inversion symmetry in the genome, and (iv) cancer classification. The mutational events characterizing this evolution channel can be divided into two categories, namely point mutations and duplications. While evolution through point mutations is unconstrained, giving rise to combinatorially many possibilities of what could have happened in the past, evolution through duplications adds constraints limiting the number of those possibilities. Further, more than 50% of the genome has been observed to consist of repeated sequences. We focus on the highly constrained form of duplication known as tandem duplication in order to understand the limits of evolution by duplication. Our sequence evolution model consists of a starting sequence, called the seed, and a set of tandem duplication rules. We find limits on the diversity of sequences that can be generated by tandem duplications using measures of capacity and expressiveness. Additionally, we calculate bounds on the duplication distance, which is used to measure the timing of generation by these duplications. We also ask questions about the uniqueness of the seed for a given sequence and completely characterize the duplication length sets for which the seed is unique or non-unique. These insights also led us to design error-correcting codes for any number of tandem duplication errors, which are useful for DNA-storage applications. For uniform duplication lengths and duplication lengths bounded by 2, our codes achieve channel capacity. We also define and measure uncertainty in decoding when the duplication channel is misinformed. Moreover, we add substitutions to our tandem duplication model and calculate sequence generation diversity for a given budget of substitutions. We also use our duplication model to explain the inversion symmetry observed in the genome of many species. This inversion symmetry is popularly known as the second Chargaff rule, according to which, in single-stranded DNA, the frequency of a k-mer is almost the same as the frequency of its reverse complement. The insights gained from these problems led us to investigate the tandem repeat regions in the genome. Tandem repeat regions in the genome can be traced back in time algorithmically to make inferences about the effect of hereditary, environmental, and stochastic factors on the mutation rate of the genome. By inferring the evolutionary history of the tandem repeat regions, we show how this knowledge can be used to make predictions about the risk of incurring a mutation-based disease, specifically cancer.
    More precisely, we introduce the concept of mutation profiles, which are computed without any comparative analysis, but instead by analyzing the short tandem repeat regions in a single healthy genome and capturing information about the individual's evolution channel. Using gradient boosting on data from more than 5,000 TCGA (The Cancer Genome Atlas) cancer patients, we demonstrate that these mutation profiles can accurately distinguish between patients with various types of cancer. For example, the pairwise validation accuracy of the classifier between PAAD (pancreas) patients and GBM (brain) patients is 93%. Our results show that healthy unaffected cells still contain a cancer-specific signal, which opens the possibility of cancer prediction from a healthy genome.
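
    The evolution model sketched above starts from a seed sequence and repeatedly applies tandem duplications, i.e. a substring is copied and pasted immediately next to itself. The short Python sketch below generates sequences under such a process; the duplication-length set, the random choices, and the seed are arbitrary illustrative parameters, not those analyzed in the thesis.

        import random

        def tandem_duplicate(seq, start, length):
            # Copy seq[start:start+length] and paste it immediately after the original.
            block = seq[start:start + length]
            return seq[:start + length] + block + seq[start + length:]

        def evolve(seed, steps, lengths=(1, 2, 3), rng=None):
            # Apply `steps` random tandem duplications with lengths drawn from `lengths`.
            rng = rng or random.Random(0)
            seq = seed
            for _ in range(steps):
                length = rng.choice(lengths)
                start = rng.randrange(0, len(seq) - length + 1)
                seq = tandem_duplicate(seq, start, length)
            return seq

        print(evolve("ACGT", steps=5))
        # Every inserted block repeats the block to its left, which is what constrains
        # the set of sequences reachable from a given seed (its capacity/expressiveness).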

    Report Linking: Information Extraction for Building Topical Knowledge Bases

    Human language artifacts represent a plentiful source of rich, unstructured information created by reporters, scientists, and analysts. In this thesis we provide approaches for adding structure: extracting and linking entities, events, and relationships from a collection of documents about a common topic. We pursue this linking at two levels of abstraction. At the document level we propose models for aligning the entities and events described in coherent and related discourses: these models are useful for deduplicating repeated claims, finding implicit arguments to events, and measuring semantic overlap between documents. Then, at a higher level of abstraction, we construct knowledge graphs containing salient entities and relations linked to supporting documents: these graphs can be augmented with facts and summaries to give users a structured understanding of the information in a large collection.
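
    At the second level of abstraction described above, a topical knowledge base can be thought of as a set of entity-relation-entity triples, each linked back to the documents that support it. The snippet below is only a toy illustration of that data structure (all names and fields are invented), not the extraction pipeline itself.

        from collections import defaultdict

        # A toy topical knowledge graph: relation triples with document provenance.
        triples = [
            ("ACME Corp", "acquired", "Widget Inc", "report_014.txt"),
            ("ACME Corp", "headquartered_in", "Springfield", "report_002.txt"),
            ("Widget Inc", "founded_by", "J. Doe", "report_014.txt"),
        ]

        graph = defaultdict(list)    # entity -> outgoing (relation, object) edges
        support = defaultdict(set)   # (subject, relation, object) -> supporting documents
        for subj, rel, obj, doc in triples:
            graph[subj].append((rel, obj))
            support[(subj, rel, obj)].add(doc)

        # Salient entities could be ranked, e.g., by how many extracted facts mention them.
        salience = sorted(graph, key=lambda e: len(graph[e]), reverse=True)
        print(salience[0], graph[salience[0]])
        print(support[("ACME Corp", "acquired", "Widget Inc")])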

    Transient Relaxation of DNA Methylation at the Onset of Meiosis

    Meiotic prophase I (MPI) is a unique phase of the cell cycle, specific to germ cells and defining of sexual reproduction. MPI is a period of extensive and specialized homologous chromosome interactions and genetic exchange. Proper progression of MPI requires elaborate epigenetic control, deficiencies in which often lead to infertility. Changes in DNA methylation during MPI can endanger genome integrity by activating transposable elements (TEs) that, when mobilized, induce DNA breaks and mutations. Therefore, MPI was thought to be under strict surveillance by DNA methylation, whose levels were assumed to be high and stable throughout MPI. Interestingly, expression of LINE retrotransposons, specifically the LINE-1 (L1)-encoded protein ORF1p, has been observed in MPI germ cells of wild-type male mice. Since tight epigenetic regulation is associated with transposon silencing, we hypothesized that L1 expression in MPI may indicate relaxation of epigenetic silencing in meiotic germ cells. Thus, we investigated the dynamics of CpG DNA methylation during MPI. We enriched and isolated individual MPI stages by fluorescence-activated cell sorting (FACS) and profiled MPI germ cells using whole-genome bisulfite sequencing and RNA sequencing. Using this approach, we uncovered transient and stage-specific changes in DNA methylation dynamics. In contrast to the prevailing view, we show that male germ cells undergo genome-wide transient relaxation of DNA methylation (TRDM) during early MPI. Specifically, we find that the transition from pre-meiotic spermatogonia to meiotic onset in preleptotene spermatocytes is accompanied by genome-wide hypomethylation. Gradual but uneven remethylation of the genome creates hypomethylated domains throughout meiotic prophase, with pre-meiotic levels of DNA methylation achieved only by late MPI. Our data are most consistent with a DNA replication-coupled mechanism of DNA demethylation in pre-meiotic S-phase. Intriguingly, a TRDM-independent set of hypomethylated domains emerges in mid to late MPI and is enriched in transcriptionally upregulated spermatogenic genes. Using Mael-/- mice defective in the piRNA pathway, we show that early MPI offers an opportunity for TE expression and reactivation. We demonstrate that if germ cells enter MPI with insufficient levels of DNA methylation at L1 elements, then during TRDM, meiotic onset can be hijacked to reactivate potentially active L1s. Cumulatively, we demonstrate that early MPI is epigenetically relaxed, exhibits dynamic DNA methylation patterns, and that transient genome-wide DNA hypomethylation at meiotic onset might have implications for gamete quality control.

    Management of Degenerative Cervical Myelopathy and Spinal Cord Injury

    The present Special Issue is dedicated to presenting current research topics in DCM and SCI in an attempt to bridge knowledge gaps for both main forms of SCI. The issue consists of fourteen studies, of which the majority focused on DCM, the more common pathology, while three studies focused on tSCI. It includes two narrative reviews, three systematic reviews, and nine original research papers. Areas of research covered include imaging studies, predictive modeling, prognostic factors, and multiple systematic or narrative reviews on various aspects of these conditions. These articles include the contributions of a diverse group of researchers, with various approaches to studying SCI, from multiple countries, including Canada, the Czech Republic, Germany, Poland, Switzerland, the United Kingdom, and the United States.

    Building on Progress - Expanding the Research Infrastructure for the Social, Economic, and Behavioral Sciences. Vol. 1

    The publication provides a comprehensive compendium of the current state of Germany's research infrastructure in the social, economic, and behavioral sciences. In addition, the book presents detailed discussions of the current needs of empirical researchers in these fields and of opportunities for future development. The book contains 68 advisory reports by more than 100 internationally recognized authors from a wide range of fields, together with recommendations by the German Data Forum (RatSWD) on how to improve the research infrastructure so as to create ideal conditions for making Germany's social, economic, and behavioral sciences more innovative and internationally competitive. The German Data Forum (RatSWD) has discussed the broad spectrum of issues covered by these advisory reports extensively and has developed general recommendations on how to expand the research infrastructure to meet the needs of scholars in the social and economic sciences.