7 research outputs found

    Streaming and Sketch Algorithms for Large Data NLP

    The emergence of the World Wide Web, social media, and mobile devices has made large and rich quantities of text data available. Such vast data sets have led to leaps in the performance of many statistically based applications. Given the magnitude of text data available, it is computationally prohibitive to train many complex Natural Language Processing (NLP) models on large data. This motivates the hypothesis that simple models trained on big data can outperform more complex models trained on small data. My dissertation provides a solution for effectively and efficiently exploiting large data in many NLP applications. Datasets are growing at an exponential rate, much faster than memory capacity. To provide a memory-efficient solution for handling large datasets, this dissertation shows the limitations of existing streaming and sketch algorithms when applied to canonical NLP problems and proposes several new variants to overcome those shortcomings. Streaming and sketch algorithms process large data sets in one pass and represent them with a compact summary, much smaller than the full size of the input. These algorithms can easily be implemented in a distributed setting and provide a solution that is both memory- and time-efficient. However, the memory and time savings come at the expense of approximate solutions. In this dissertation, I demonstrate that approximate solutions computed on large data are comparable to exact solutions on the same data and outperform exact solutions computed on smaller data. I focus on NLP problems that boil down to tracking many statistics: storing approximate counts, computing approximate association scores such as pointwise mutual information (PMI), finding frequent items (such as n-grams), building streaming language models, and measuring distributional similarity. First, I introduce the concept of approximate streaming large-scale language models in NLP. Second, I present a novel variant of the Count-Min sketch that maintains approximate counts of all items. Third, I conduct a systematic study comparing many sketch algorithms that approximate counts of items, with a focus on large-scale NLP tasks. Last, I develop the fast large-scale approximate graph (FLAG), a system that quickly constructs a large-scale approximate nearest-neighbor graph from a large corpus.
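    As a point of reference for the sketches this dissertation builds on, the standard Count-Min sketch (not the dissertation's novel variant) can be written in a few lines; the class name, hash construction, and default dimensions below are illustrative assumptions:

    ```python
    import hashlib

    class CountMinSketch:
        """Textbook Count-Min sketch: `depth` hash rows of size `width`.
        Point queries never underestimate; hash collisions can only
        inflate a cell, and taking the minimum over rows bounds that
        overestimate."""

        def __init__(self, width=2048, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _cells(self, item):
            # one independent hash per row, derived by salting blake2b
            for row in range(self.depth):
                h = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                    salt=bytes([row])).digest()
                yield row, int.from_bytes(h, "big") % self.width

        def add(self, item, count=1):
            for row, col in self._cells(item):
                self.table[row][col] += count

        def query(self, item):
            # min over rows: the least-collided cell is the best estimate
            return min(self.table[row][col] for row, col in self._cells(item))
    ```

    Feeding a token stream through `add` and reading counts back with `query` gives approximate n-gram counts in fixed memory, which is the building block behind the approximate PMI scores and streaming language models described above.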

    Incremental Skip-gram Model with Negative Sampling

    Full text link
    This paper explores an incremental training strategy for the skip-gram model with negative sampling (SGNS) from both empirical and theoretical perspectives. Existing methods for neural word embeddings, including SGNS, are multi-pass algorithms and thus cannot perform incremental model updates. To address this problem, we present a simple incremental extension of SGNS and provide a thorough theoretical analysis to demonstrate its validity. Empirical experiments demonstrate the correctness of the theoretical analysis as well as the practical usefulness of the incremental algorithm.
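    The incremental idea can be illustrated with a toy one-pass SGNS updater. This is a hypothetical sketch, not the paper's algorithm: the class name, learning rate, and the growing unigram noise distribution are all assumptions made for illustration.

    ```python
    import math
    import random

    def _sigmoid(x):
        # clamp to keep exp() from overflowing on extreme scores
        x = max(-30.0, min(30.0, x))
        return 1.0 / (1.0 + math.exp(-x))

    class IncrementalSGNS:
        """Toy one-pass skip-gram with negative sampling.
        The vocabulary and the unigram noise distribution grow as new
        words arrive, so no preliminary pass over the corpus is needed."""

        def __init__(self, dim=16, lr=0.05, negatives=2, seed=0):
            self.dim, self.lr, self.negatives = dim, lr, negatives
            self.rng = random.Random(seed)
            self.vec = {}   # input (word) vectors
            self.ctx = {}   # output (context) vectors
            self.freq = {}  # running unigram counts for noise sampling

        def _ensure(self, w):
            self.freq[w] = self.freq.get(w, 0) + 1
            if w not in self.vec:
                self.vec[w] = [self.rng.uniform(-0.5, 0.5) / self.dim
                               for _ in range(self.dim)]
                self.ctx[w] = [0.0] * self.dim

        def _step(self, word, context, label):
            v, u = self.vec[word], self.ctx[context]
            score = _sigmoid(sum(a * b for a, b in zip(v, u)))
            g = self.lr * (label - score)  # gradient of the log-sigmoid loss
            for i in range(self.dim):
                v[i], u[i] = v[i] + g * u[i], u[i] + g * v[i]

        def train_pair(self, word, context):
            self._ensure(word)
            self._ensure(context)
            self._step(word, context, 1.0)      # observed pair: positive label
            words = list(self.freq)
            weights = list(self.freq.values())
            for _ in range(self.negatives):     # noise pairs: negative label
                noise = self.rng.choices(words, weights=weights)[0]
                self._step(word, noise, 0.0)
    ```

    A production implementation would resample when the noise word equals the true context and would smooth the noise distribution (e.g. the usual 3/4 power); both are omitted here for brevity.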

    On Frequency Estimation and Detection of Heavy Hitters in Data Streams

    A stream can be thought of as a very large, sometimes even infinite, set of data that arrives sequentially and must be processed without the possibility of being stored. Since the memory available to the algorithm is limited and the whole stream cannot be stored, the data are instead scanned on arrival and summarized in a succinct data structure that maintains only the information of interest. Two of the main tasks in data stream processing are frequency estimation and heavy hitter detection. Frequency estimation requires estimating the frequency of each item, that is, the number of times (or the weight with which) it appears in the stream, while heavy hitter detection means finding all items whose frequency exceeds a fixed threshold. In this work we design and analyze ACMSS, an algorithm for frequency estimation and heavy hitter detection, and compare it against the state-of-the-art ASKETCH algorithm. We show that, given the same budgeted amount of memory, our algorithm outperforms ASKETCH in accuracy on the frequency estimation task. Furthermore, we show that, under the assumptions stated by its authors, ASKETCH may fail to report all of the heavy hitters, whereas ACMSS provides the full list of heavy hitters with high probability.
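    Neither ACMSS nor ASKETCH is specified in this abstract, but the heavy-hitter guarantee under discussion can be illustrated with the classic Misra-Gries summary, a one-pass algorithm that keeps at most k - 1 counters and retains every item occurring more than n/k times:

    ```python
    def misra_gries(stream, k):
        """Misra-Gries summary: one pass over the stream, at most
        k - 1 counters. Every item occurring more than n/k times in a
        stream of length n is guaranteed to survive in the returned
        summary; surviving counts undercount true frequencies by at
        most n/k."""
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k - 1:
                counters[item] = 1
            else:
                # item matched no counter and none are free:
                # decrement everything and evict counters that hit zero
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters
    ```

    The surviving items are heavy-hitter candidates; a second pass (or a sketch, as in the algorithms compared above) can then verify their exact or approximate frequencies.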

    A Movie Recommendation System Based on the PERMA Model

    The master's thesis comprises 95 pages, 20 figures, 2 tables, 2 appendices, and 10 references. The task of the work is to develop a recommendation system for films based on the PERMA model. The objects of the investigation are film metadata, film subtitles, and the PERMA-model data set. The subject of the investigation is the application of statistical learning algorithms to data that characterize a film. The purpose of the work is to develop a novel film recommendation method based on the content of films and the sentiment of the characters' language. The investigation methods are big data processing, NLP, recurrent neural networks, statistical models, Python, Numpy, Pandas, and Tableau. 
    This work is the result of a new approach to generating film recommendations that draws together current knowledge of psychology, machine learning, and methods for attracting and retaining customers. The resulting model is built into an extension of the MEGOGO service and can improve interaction between the customer and the service. The method can be used not only for recommending films, books, or music: the model can be applied to any domain whose objects are sequences, in order to assess the sentiment of those sequences.