13 research outputs found

    QUERY-SPECIFIC SUBTOPIC CLUSTERING IN RESPONSE TO BROAD QUERIES

    Information Retrieval (IR) refers to obtaining valuable and relevant information from various sources in response to a specific information need. In the textual domain, the most common information source is a collection of textual documents, or text corpus. Depending on the scope of the information need, also referred to as the query, the relevant information can span a wide range of topical themes. Hence, the relevant information may often be scattered across multiple documents in the corpus, each of which satisfies the information need to a varying degree. Traditional IR systems present the relevant documents as a ranking in which the rank of a document corresponds to its degree of relevance to the query. If the query is sufficiently specific, the relevant documents will be about more or less similar topics. However, they will be much more topically diverse when the query is vague or about a general topic, e.g., "Computer Science". In such cases, multiple documents may be of equal importance, as each represents a specific facet of the broad topic of the query. Consider, for example, documents related to information retrieval and machine learning for the query "Computer Science". In this case, how to rank documents from these two subtopics against each other is ambiguous. Instead, presenting the retrieved results as clusters of documents, where each cluster represents one subtopic, would be more appropriate. Subtopic clustering of search results has been explored in the domain of Web search, where users receive relevant clusters of search results in response to their query. This thesis explores query-specific subtopic clustering that incorporates queries into the clustering framework. We develop a query-specific similarity metric that governs a hierarchical clustering algorithm.
The similarity metric is trained to predict whether a pair of relevant documents should share the same subtopic cluster in the context of the query. Our empirical study shows that direct involvement of the query in the clustering model significantly improves clustering performance over a state-of-the-art neural approach on two publicly available datasets. Further qualitative studies provide insights into the strengths and limitations of our proposed approach. In addition to query-specific similarity metrics, this thesis also explores a new supervised clustering paradigm that directly optimizes for a clustering metric. Because clustering metrics are discrete functions, existing approaches to supervised clustering find it difficult to use them as optimization objectives. We propose a scalable training strategy for document embedding models that directly optimizes for the Rand index, a clustering quality metric. Our method outperforms a strong neural approach and other unsupervised baselines on two publicly available datasets. This suggests that optimizing directly for the clustering outcome indeed yields better document representations for clustering. This thesis also studies the generalizability of our findings by incorporating the query-specific clustering approach and our clustering metric-based optimization technique into a single end-to-end supervised clustering model. We also extend our methods to different clustering algorithms to show that our approaches do not depend on any specific clustering algorithm. Having such a generalized query-specific clustering model will help to revolutionize the way digital information is organized, archived, and presented to the user in a context-aware manner.
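
For readers unfamiliar with the Rand index mentioned in the abstract above, the following is a minimal sketch of the standard (unadjusted) metric in plain Python. It is not the thesis's training strategy; the function name `rand_index` and the toy labels are illustrative assumptions. The metric is the fraction of item pairs on which two clusterings agree (both place the pair together, or both place it apart). Because it only counts pairs, it is piecewise constant in any underlying document embedding, which is one way to see why it is hard to optimize directly.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two clusterings agree.

    Illustrative helper (name and signature are assumptions, not from the
    thesis). Each argument is a sequence of cluster labels, one per item.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        # With fewer than two items there is nothing to disagree on.
        return 1.0

    agreements = 0
    for i, j in pairs:
        same_in_true = labels_true[i] == labels_true[j]
        same_in_pred = labels_pred[i] == labels_pred[j]
        # A pair counts as an agreement if both clusterings group it
        # together or both keep it apart.
        if same_in_true == same_in_pred:
            agreements += 1
    return agreements / len(pairs)


# Toy usage: cluster label names do not matter, only the grouping does.
gold = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]
print(rand_index(gold, relabelled))
```

In practice a library implementation (for example, a scientific Python package) would usually be preferred over this quadratic-time loop; the sketch is only meant to make the definition concrete.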

    Theory and Applications for Advanced Text Mining

    Due to the growth of computer and web technologies, we can easily collect and store large amounts of text data, and it is reasonable to believe that these data contain useful knowledge. Text mining techniques have been studied intensively since the late 1990s in order to extract that knowledge. Although many important techniques have been developed, the text mining research field continues to expand to meet the needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques, ranging from relation extraction to under-resourced or less-resourced languages. I believe that this book will contribute new knowledge to the text mining field and help many readers open up new research fields.

    Comparative Summarization of Document Collections

    Comparing documents is an important task that helps us understand the differences between documents. Examples of document comparison include comparing laws on the same or related subject matter in different jurisdictions, and comparing the specifications of similar products from different manufacturers. The need for comparison does not stop at individual documents; it extends to large collections of documents. Examples include comparing the writing styles of an author early versus late in their life, identifying linguistic and lexical patterns of different political ideologies, or discovering commonalities of political arguments in disparate events. Comparing large document collections calls for automated algorithms. Every day a huge volume of documents is produced in social and news media. There has been a lot of research on summarizing individual documents, such as a news article, or document collections, such as a collection of news articles on a related topic or event. However, comparatively summarizing different document collections, or comparative summarization, is an under-explored problem in terms of methodology, datasets, evaluation, and applicability in different domains. To address this, in this thesis we make three types of contributions to comparative summarization: methodology, datasets and evaluation, and empirical measurements on a range of settings where comparative summarization can be applied. We propose a new formulation of the problem of comparative summarization as competing binary classifiers. This formulation helps us develop new unsupervised and supervised methods for comparative summarization. Our methods are based on Maximum Mean Discrepancy (MMD), a metric that measures the distance between two sets of datapoints (or documents).
The unsupervised methods incorporate information coverage, information diversity, and discriminativeness of the prototypes based on a global model of sentence-sentence similarity, and can be optimized with greedy and gradient methods. We show the efficacy of the approach in summarizing a long-running news topic over time. Our supervised method improves on the unsupervised methods, and can learn the importance of prototypes based on surface features (e.g., position, length, presence of cue words) and combine different text feature representations. Our supervised method meets or exceeds state-of-the-art performance on benchmark datasets. We design new scalable automatic and crowd-sourced extrinsic evaluations of comparative summaries for cases where human-written ground truth summaries are not available. To evaluate our methods, we develop two new datasets on controversial news topics, CONTROVNEWS2017 and NEWS2019+BIAS, which we use in different experiments. We use CONTROVNEWS2017, which consists of news articles on controversial topics, to evaluate our unsupervised methods in summarizing over time. We use NEWS2019+BIAS, which again consists of news articles on controversial news topics, along with media bias labels, to empirically study the applicability of our methods. Finally, we measure the distinguishability and summarizability of document collections to quantify the applicability of our methods in different domains. We measure these metrics on the newly curated NEWS2019+BIAS dataset, comparing articles over time and across ideological leanings of media outlets. First, we observe that summarizability is proportional to distinguishability, and identify the groups of articles that are less or more distinguishable. Second, better distinguishability and summarizability can be obtained by choosing document representations according to the comparison being made, either over time or across ideological leanings of media outlets.
We also apply the comparative summarization method to the task of comparing stances in the social media domain.
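
To make the Maximum Mean Discrepancy (MMD) mentioned in the abstract above concrete, here is a minimal sketch of the standard biased estimate of squared MMD between two sets of vectors, using a Gaussian (RBF) kernel. It is not the thesis's summarization method; the kernel choice, the `gamma` bandwidth parameter, the function names, and the toy 2-D points standing in for document embeddings are all illustrative assumptions.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel exp(-gamma * ||a - b||^2) between two vectors."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two non-empty sets of vectors.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    Hypothetical helper for illustration only.
    """
    return (
        mean_kernel(xs, xs, gamma)
        + mean_kernel(ys, ys, gamma)
        - 2.0 * mean_kernel(xs, ys, gamma)
    )


# Toy usage: 2-D points stand in for document embeddings (an assumption).
collection_a = [[0.0, 0.0], [0.0, 1.0]]
collection_near = [[0.0, 0.1], [0.0, 0.9]]
collection_far = [[5.0, 5.0], [5.0, 6.0]]

print(mmd_squared(collection_a, collection_near))  # small: similar sets
print(mmd_squared(collection_a, collection_far))   # larger: dissimilar sets
```

The estimate is zero when the two sets are identical and grows as the sets move apart in the kernel's feature space, which is the sense in which MMD measures a distance between collections.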

    Study on open science: The general state of the play in Open Science principles and practices at European life sciences institutes

    Nowadays, open science is a hot topic at all levels and is also one of the priorities of the European Research Area. Components commonly associated with open science are open access, open data, open methodology, open source, open peer review, open science policies, and citizen science. Open science may have great potential to connect and influence the practices of researchers, funding institutions, and the public. In this paper, we evaluate the level of openness based on public surveys at four European life sciences institutes.

    Praxishandbuch Forschungsdatenmanagement

    Incorporating Benefit-Risk Consideration and Feature Selection into Optimal Dynamic Treatment Regimens

    The optimal dynamic treatment regimen (DTR) is one of the most important strategies in precision medicine: it sequentially assigns the best treatment to patients based on their evolving health status in order to maximize the cumulative outcome. For many chronic diseases, treatments are often multifaceted, and aggressive treatments with a higher beneficial reward are usually accompanied by an elevated risk of adverse outcomes; ideal DTRs should yield a higher beneficial gain while avoiding unnecessary risk. In addition, among many possible tailoring variables, often only a small subset is essential for treatment, and identifying these variables is particularly important for developing sparse DTRs, which are useful in practice. To address these challenges, in the first project we propose a new machine learning-based method to learn optimal DTRs that maximize patients' cumulative reward while keeping, at each stage, the acute short-term risk induced by the treatments below a pre-specified threshold. We show that this multistage-constrained problem can be decomposed into a series of single-stage single-constrained problems, which can be efficiently solved using a backward algorithm. We provide theoretical guarantees for the method and demonstrate its performance via simulation studies and an application to a clinical trial for T2D patients (DURABLE study). In the second project, we develop a general approach to estimate optimal DTRs that maximize patients' cumulative reward while leading to a cumulative risk no higher than a pre-specified threshold. This procedure converts the problem into solving unconstrained DTR problems, which can be handled by existing DTR methods. Furthermore, we propose an estimation procedure (MRL) to solve the decision rules across all stages simultaneously. The method is justified via theoretical guarantees, simulation studies, and an application to the DURABLE study.
In the third project, we develop a new machine learning-based method by extending the MRL framework with an L1-penalty to implement variable selection while learning optimal DTRs across all stages contingently. A DC algorithm is developed to solve the L1-MRL problem efficiently, and the performance is demonstrated via simulation studies and an application to observational electronic health record (EHR) data from T2D patients. Doctor of Philosophy