
    Search Among Sensitive Content

    Current search engines are designed to find what we want. But many collections cannot be made available to search engines because they contain sensitive content that needs to be protected. Before release, such content must be examined through a sensitivity review process, which can be difficult and time-consuming. To address this challenge, search technology should be capable of providing access to relevant content while protecting sensitive content. In this dissertation, we present an approach that leverages evaluation-driven information retrieval (IR) techniques, which optimize an objective function that balances the value of finding relevant content with the imperative to protect sensitive content. This requires evaluation measures that balance relevance against sensitivity. Baselines for the problem are introduced, and a proposed approach based on a listwise learning-to-rank model is described. The model is trained with a modified loss function to optimize for the evaluation measure. Initial experiments re-purpose a LETOR benchmark dataset, OHSUMED, by using Medical Subject Heading (MeSH) labels to represent sensitivity. A second test collection is based on the Avocado Research Email Collection. Search topics were developed as a basis for assessing relevance, and two personas describing the sensitivities of representative (but fictional) content creators were created as a basis for assessing sensitivity. These personas were based on interviews with potential donors of historically significant email collections and with archivists who currently manage access to such collections. Two annotators then created relevance and sensitivity judgments for 65 topics for one or both personas. Experimental results show the efficacy of the learning-to-rank approach. The dissertation also includes four extensions that improve the quality of retrieved results with respect to relevance and sensitivity. First, the use of alternative optimization measures is explored. Second, transformer-based rankers are compared with rankers based on hand-crafted features. Third, a cluster-based replacement strategy that can further improve the score of our evaluation measures is introduced. Fourth, a policy that truncates the ranked list according to the query's expected difficulty is investigated. Results show improvements in each case.
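
    As a rough illustration of the balancing objective described above, the sketch below scores a ranked list by rewarding graded relevance with a standard DCG discount while charging a fixed penalty for each sensitive document retrieved. The gain-minus-penalty formulation, the penalty weight, and the rank cutoff are illustrative assumptions; the dissertation's actual evaluation measure may differ.

        import math

        def penalized_dcg(ranking, relevance, sensitivity, penalty=2.0, k=10):
            """Score a ranked list, rewarding relevance and penalizing sensitivity.

            ranking:     doc ids in rank order
            relevance:   dict doc_id -> graded relevance (0, 1, 2, ...)
            sensitivity: dict doc_id -> 1 if sensitive, else 0
            penalty:     assumed cost charged per sensitive document retrieved
            """
            score = 0.0
            for i, doc in enumerate(ranking[:k]):
                discount = 1.0 / math.log2(i + 2)          # standard DCG position discount
                gain = relevance.get(doc, 0)               # value of finding relevant content
                cost = penalty * sensitivity.get(doc, 0)   # cost of exposing sensitive content
                score += discount * (gain - cost)
            return score

    Under a measure of this kind, a ranker scores well by surfacing relevant documents early and suppressing sensitive ones, which is the trade-off a modified listwise loss would be trained to optimize.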

    Information Retrieval with Entity Linking

    Despite the advantage of their low resource requirements, traditional sparse retrievers depend on exact matching between high-dimensional bag-of-words (BoW) representations of the queries and the collection. As a result, retrieval performance is restricted by semantic discrepancies and vocabulary gaps. Transformer-based dense retrievers, on the other hand, bring significant improvements to information retrieval tasks by exploiting low-dimensional contextualized representations of the corpus. While dense retrievers are known for their relative effectiveness, they suffer from lower efficiency and from generalization issues when compared to sparse retrievers. For lightweight retrieval tasks, high computational cost and time consumption are major barriers that discourage the use of dense models despite their potential gains. In this work, I propose boosting the performance of sparse retrievers by expanding both the queries and the documents with linked entities, using two formats for the entity names: 1) explicit and 2) hashed. A zero-shot end-to-end dense entity linking system is employed for entity recognition and disambiguation to augment the corpus. By leveraging advanced entity linking methods, I believe the effectiveness gap between sparse and dense retrievers can be narrowed. Experiments are conducted on the MS MARCO passage dataset using the original qrel set, the re-ranked qrels favoured by MonoT5, and the latter set further re-ranked by DuoT5. Since I am concerned with early-stage retrieval in the cascaded ranking architectures of large information retrieval systems, results are evaluated using recall@1000. The suggested approach is also capable of retrieving documents for query subsets judged to be particularly difficult in prior work. In addition, it is demonstrated that the non-expanded run and the runs expanded with explicit and hashed entities retrieve complementary results. Consequently, run combination methods such as run fusion and classifier selection are explored to maximize the benefits of entity linking. Given the success of entity methods for sparse retrieval, the proposed approach is also tested on dense retrievers, with the corresponding results reported in MRR@10.
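
    To make the two entity formats concrete, the sketch below shows one way linked entities could be appended to a query or document, either explicitly by name or as hashed single tokens, together with a simple reciprocal rank fusion routine of the kind usable for run combination. The MD5 hashing scheme and the RRF constant are illustrative assumptions, not details taken from this work.

        import hashlib

        def entity_token(name, hashed=False):
            """Render a linked entity name as an index token."""
            if hashed:
                # Hashing (MD5 assumed here) collapses a multi-word entity
                # name into a single token that can only match exactly.
                return hashlib.md5(name.encode("utf-8")).hexdigest()
            return name

        def expand(text, linked_entities, hashed=False):
            """Append linked-entity tokens to a query or document string."""
            tokens = [entity_token(e, hashed) for e in linked_entities]
            return text + " " + " ".join(tokens)

        def reciprocal_rank_fusion(runs, k=60):
            """Fuse ranked doc-id lists; k=60 is the conventional RRF constant."""
            scores = {}
            for run in runs:
                for rank, doc in enumerate(run, start=1):
                    scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
            return sorted(scores, key=scores.get, reverse=True)

    Because the non-expanded and expanded runs retrieve complementary documents, fusing them can recover relevant documents that any single run misses.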

    COMMUNITY DETECTION IN GRAPHS

    Thesis (Ph.D.) - Indiana University, Luddy School of Informatics, Computing, and Engineering/University Graduate School, 2020.
    Community detection has always been one of the fundamental research topics in graph mining. As a type of unsupervised or semi-supervised approach, community detection aims to capture high-order node closeness by leveraging graph topological structure. By grouping similar nodes or edges into the same community while separating dissimilar ones into different communities, graph structure can be revealed at a coarser resolution. This benefits numerous applications, such as shopping recommendation and advertising in e-commerce, protein-protein interaction prediction in bioinformatics, and literature recommendation or scholar collaboration in citation analysis. However, identifying communities is an ill-defined problem. Due to the No Free Lunch theorem [1], there is neither a gold standard representing a perfect community partition nor a universal method able to detect satisfactory communities for all tasks across the various types of graphs. To give a global view of this research topic, I summarize state-of-the-art community detection methods by categorizing them based on graph types, research tasks, and methodological frameworks. As academic work on community detection has grown rapidly in recent years, I focus in particular on work published in the latest decade, which may leave out some classic models published decades ago. Three subtle community detection tasks are also proposed and assessed in this dissertation. First, whereas general models consider only graph structure, personalized community detection uses user needs as auxiliary information to guide community detection, yielding fine-grained communities for nodes that closely match user needs and coarser-resolution communities for the remaining, less relevant nodes. Second, graphs often suffer from sparse connectivity, and applying conventional models directly to such graphs can severely degrade the quality of the generated communities. To tackle this problem, cross-graph techniques are used to propagate information from an external graph to support community detection in the target graph. Third, graph community structure supports a natural language processing (NLP) task that depicts the intrinsic characteristics of nodes by producing node summarizations with a text generative model. The contribution of this dissertation is threefold. First, a substantial body of research is reviewed and summarized under a well-defined taxonomy; existing work on methods, evaluation, and applications is all addressed in the literature review. Second, three novel community detection tasks are demonstrated, and the associated models are proposed and evaluated against state-of-the-art baselines on various datasets. Third, the limitations of current work are pointed out, and promising directions for future research are discussed.
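
    As a small, self-contained illustration of the general task (not one of the dissertation's proposed models), the following uses networkx's greedy modularity maximization to recover two dense groups of nodes joined by a single bridge edge.

        import networkx as nx
        from networkx.algorithms.community import greedy_modularity_communities

        # Two triangles (dense groups) connected by one bridge edge.
        G = nx.Graph()
        G.add_edges_from([(0, 1), (0, 2), (1, 2),    # group A
                          (3, 4), (3, 5), (4, 5),    # group B
                          (2, 3)])                   # bridge between A and B

        # Greedy modularity maximization groups tightly connected nodes
        # together and separates loosely connected ones.
        communities = greedy_modularity_communities(G)
        print([sorted(c) for c in communities])      # expected: [[0, 1, 2], [3, 4, 5]]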

    Advances in Knowledge Discovery and Data Mining, Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II