94 research outputs found

    Searching to Translate and Translating to Search: When Information Retrieval Meets Machine Translation

    With the adoption of web services in daily life, people have access to tremendous amounts of information, beyond any human's reading and comprehension capabilities. As a result, search technologies have become a fundamental tool for accessing information. Furthermore, the web contains information in multiple languages, introducing another barrier between people and information. Therefore, search technologies need to handle content written in multiple languages, which requires techniques to account for the linguistic differences. Information Retrieval (IR) is the study of search techniques, in which the task is to find material relevant to a given information need. Cross-Language Information Retrieval (CLIR) is a special case of IR in which the search takes place over a multilingual collection. Of course, it is not helpful to retrieve content in languages the user cannot understand. Machine Translation (MT) studies the translation of text from one language into another efficiently (within a reasonable amount of time) and effectively (producing fluent output that retains the original meaning), which helps people understand what is written, regardless of the source language. Putting these together, we observe that search and translation technologies serve the same user need, calling for a better integration of search (IR) and translation (MT), since the two must work together to produce high-quality output. In this dissertation, the main goal is to build better connections between IR and MT, for which we present solutions to two problems: Searching to translate explores approximate search techniques for extracting bilingual data from multilingual Wikipedia collections to train better translation models. Translating to search explores the integration of a modern statistical MT system into the cross-language search process. In both cases, our best-performing approach yielded improvements over strong baselines for a variety of language pairs. Finally, we propose a general architecture in which various components of IR and MT systems can be connected into a feedback loop, with potential improvements to both search and translation tasks. We hope that the ideas presented in this dissertation will spur more interest in the integration of search and translation technologies.

    Fast Similarity Graph Construction via Data Sketching Techniques

    Graphs are mathematical structures used to model objects and their pairwise relationships. Due to their simple but expressive abstract representation, they are commonly used to model various types of relations and processes in technological, social, or biological systems and have found numerous applications. A special type of graph is the similarity graph, in which nodes represent entities and an edge connects two nodes if the corresponding entities are similar under some similarity measure. In a typical scenario, raw data about the entities are provided in the form of a relational dataset, a matrix, or a tensor, and a similarity graph is built to facilitate graph-based analysis such as node importance, node classification, link prediction, community detection, outlier detection, and more. The ability to construct similarity graphs quickly is important and has high potential impact, so several approximation techniques have been proposed. In this work, we propose data-sketching-based methods for fast approximate similarity graph construction. Data sketching techniques are applied to the raw data and are designed to achieve desired error guarantees. They can drastically reduce the size of the raw data on which we operate, allowing for faster construction and analysis of similarity graphs, at the cost of approximate results. This is a desirable tradeoff for many applications in diverse domains. Through a thorough experimental evaluation, we demonstrate that our sketching methods outperform sensible baselines and competitor methods proposed for the problem. First, they are much faster than exact methods while maintaining high accuracy in constructing the similarity graph. Furthermore, our methods demonstrate significantly higher accuracy than competing methods on generic graph analysis tasks. We demonstrate the effectiveness of our methods on different real-world graph applications.
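
    The abstract does not name the specific sketches used, but as a rough illustration of the general idea, the following minimal Python sketch (all function names and the toy data are hypothetical, not from the paper) summarizes each entity's feature set into a fixed-length MinHash signature and then thresholds estimated Jaccard similarities into graph edges, so pairwise comparisons operate on small signatures instead of the raw data:

```python
import random
from itertools import combinations

PRIME = 2_147_483_647  # large prime for universal hashing

def make_minhash_funcs(num_hashes, seed=42):
    """Generate (a, b) pairs for hash functions h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]

def minhash_signature(items, hash_funcs):
    """Sketch a set of hashable items into a fixed-length signature of minima."""
    return [min((a * hash(x) + b) % PRIME for x in items) for a, b in hash_funcs]

def estimated_jaccard(sig1, sig2):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

def sketch_similarity_graph(entity_features, num_hashes=128, threshold=0.5):
    """Approximate similarity graph: edge (u, v) if estimated Jaccard >= threshold."""
    funcs = make_minhash_funcs(num_hashes)
    sigs = {e: minhash_signature(feats, funcs) for e, feats in entity_features.items()}
    return [(u, v) for u, v in combinations(sigs, 2)
            if estimated_jaccard(sigs[u], sigs[v]) >= threshold]

# Toy usage: entities described by sets of features (e.g., rows of a relational dataset).
entities = {
    "a": {"x1", "x2", "x3", "x4"},
    "b": {"x2", "x3", "x4", "x5"},
    "c": {"y1", "y2", "y3"},
}
print(sketch_similarity_graph(entities, threshold=0.4))  # likely [("a", "b")]
```

    Comparing fixed-length integer signatures instead of arbitrarily large feature sets is what yields the speed/accuracy tradeoff the abstract describes; the estimates converge to the true Jaccard similarity as the number of hash functions grows.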

    Finding and exploring memes in social media

    Critical literacy challenges us to question how what we read has been shaped by external context, especially when information comes from less established sources. While cross-checking multiple sources provides a foundation for critical literacy, trying to keep pace with the constant deluge of new online information is a daunting proposition, especially for casual readers. To help address this challenge, we propose a new form of technological assistance which automatically discovers and displays underlying memes: ideas embodied by similar phrases which are found in multiple sources. Once detected, these underlying memes are revealed to users via generated hypertext, allowing memes to be explored in context. Given the massive volume of online information today, we propose a highly scalable system architecture based on MapReduce, extending work by Kolak and Schilit [11]. To validate our approach, we report on using our system to process and browse a 1.5 TB collection of crawled social media. Our contributions include a novel technological approach to support critical literacy and a highly scalable system architecture for meme discovery optimized for Hadoop [25]. Our source code and Meme Browser are both available online.
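
    The abstract does not spell out the discovery step, but as a hedged sketch of the "similar phrases found in multiple sources" idea, the following Python example (function and variable names are illustrative, not from the paper) groups word n-grams by the set of source documents that contain them, MapReduce-style, and keeps the phrases shared by several sources:

```python
from collections import defaultdict

def shingles(text, n=4):
    """Emit word n-grams ("shingles") as candidate meme phrases."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def find_memes(documents, n=4, min_sources=2):
    """Map each phrase to the set of sources containing it, then keep phrases
    shared by at least `min_sources` distinct sources (a MapReduce-style
    group-by, run here in memory for illustration)."""
    phrase_sources = defaultdict(set)
    for doc_id, text in documents.items():          # "map" phase
        for phrase in shingles(text, n):
            phrase_sources[phrase].add(doc_id)
    return {p: s for p, s in phrase_sources.items()  # "reduce" phase
            if len(s) >= min_sources}

# Toy corpus standing in for crawled social media.
docs = {
    "blog1": "a rising tide lifts all boats in the market",
    "blog2": "they say a rising tide lifts all boats",
    "forum": "completely unrelated discussion about gardening",
}
print(find_memes(docs, n=4))  # phrases shared by blog1 and blog2
```

    At web scale, the same group-by maps naturally onto MapReduce: the mapper emits (phrase, source) pairs and the reducer collects, per phrase, the set of sources that mention it.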

    Efficient Pair-wise Similarity Computation Using Apache Spark

    Entity matching is the process of identifying different manifestations of the same real-world entity. These entities can be referred to as objects (strings) or data instances. The entities are in turn partitioned across several databases or clusters based on their signatures. When entity matching algorithms are run on these databases or clusters, there is a high possibility that a particular entity pair is compared more than once; the number of comparisons for any two entities depends on the number of common signatures or keys they possess, which affects the performance of any entity matching algorithm. This paper implements the algorithm by Erhard Rahm et al. for redundancy-free pair-wise similarity computation using MapReduce. As an improvement over the existing implementation, this project implements the algorithm in Apache Spark, in standalone mode for a sample of the data and in cluster mode for large volumes of data.
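
    As a hedged illustration only (not the exact redundancy-free algorithm of Rahm et al.), the following PySpark sketch groups entities into blocks by their signature keys and emits each candidate pair only in the block of its lexicographically smallest shared signature, so every pair's similarity is computed exactly once across all blocks; all identifiers and the toy data are assumptions for the example:

```python
from itertools import combinations
from pyspark import SparkContext

sc = SparkContext("local[*]", "redundancy-free-pairs")

# Toy input: (entity_id, set of tokens). The tokens double as blocking
# signatures here; a real system would use a dedicated signature scheme.
entities = [
    ("e1", {"alpha", "beta", "gamma"}),
    ("e2", {"beta", "gamma", "delta"}),
    ("e3", {"epsilon", "zeta"}),
]
rdd = sc.parallelize(entities)

# Map phase: emit (signature, (entity_id, token_set)) for every signature.
by_signature = rdd.flatMap(lambda e: [(sig, e) for sig in e[1]])

def block_pairs(block):
    """Within one signature block, keep a candidate pair only if this block's
    key is the smallest signature the two entities share, so duplicates are
    never generated in other blocks."""
    sig, members = block
    out = []
    for (id1, toks1), (id2, toks2) in combinations(sorted(members, key=lambda e: e[0]), 2):
        shared = toks1 & toks2
        if shared and min(shared) == sig:
            jaccard = len(shared) / len(toks1 | toks2)
            out.append(((id1, id2), jaccard))
    return out

pairs = by_signature.groupByKey().flatMap(block_pairs)
print(pairs.collect())  # [(('e1', 'e2'), 0.5)]
```

    The point of this "least common signature" test is that redundant comparisons are never generated in the first place, rather than generated and then removed with a distinct-style deduplication step.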

    Using random search and brute force algorithm in factoring the RSA modulus

    The security of the RSA cryptosystem depends directly on the size of its modulus n. The modulus n is the product of two very large prime numbers, denoted p and q. Since the modulus n is public, a cryptanalyst can use factorization algorithms such as Euler's and Pollard's algorithms to recover p and q and thus derive the private key. Brute force is an algorithm that searches for a solution to a problem by generating all the possible candidate solutions and testing them one by one in order to find the correct one. Random search is a numerical optimization algorithm that starts by generating one candidate solution at random and iteratively compares it with other randomly generated candidates in order to find the most suitable solution. This work aims to compare the performance of the brute force and random search algorithms in factoring the RSA modulus into its two prime factors, by experimental means, in the Python programming language. The primality test is done with Fermat's algorithm and the sieve of Eratosthenes.
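
    For concreteness, here is a minimal Python sketch of the two search strategies on a toy modulus (the helper names, the toy modulus, and the iteration cap are arbitrary choices for illustration, not from the paper; real RSA moduli are far too large for either approach):

```python
import random

def brute_force_factor(n):
    """Try candidate divisors in order until one splits n (feasible only for tiny moduli)."""
    for p in range(2, int(n ** 0.5) + 1):
        if n % p == 0:
            return p, n // p
    return None

def random_search_factor(n, max_iters=1_000_000, seed=0):
    """Sample odd candidates at random from [3, sqrt(n)] until a divisor is found or the budget runs out."""
    rng = random.Random(seed)
    limit = int(n ** 0.5)
    for _ in range(max_iters):
        p = rng.randrange(3, limit + 1, 2)
        if n % p == 0:
            return p, n // p
    return None

# Toy modulus: n = 10403 = 101 * 103.
print(brute_force_factor(10403))    # (101, 103)
print(random_search_factor(10403))  # (101, 103), once a divisor is sampled
```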

    Efficient partial-duplicate detection based on sequence matching
