22 research outputs found

    A method for analyzing customer reviews in natural-language texts

    No full text
    The article presents a method for analyzing natural-language texts containing customer reviews. The method differs from existing ones in its combination of different vectorizer types and the introduction of a component hierarchy. Applying the vectorizers in sequence makes it possible to build a hierarchy of features and markers. Using support vector machines and island clustering, with subsequent training of a model for sentiment prediction, is among the better approaches to sentiment analysis for both binary and non-binary aspects. Based on an open dataset, a software product for analyzing customer preferences and visualizing the analysis results was built with Python and Tableau.
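    As a rough illustration of the described pipeline, the sketch below combines two vectorizer types and feeds them to a linear SVM using scikit-learn. The paper's specific vectorizer hierarchy, its island-clustering step, and its data are not reproduced here; the FeatureUnion layout, the toy reviews, and the bigram range are illustrative assumptions.

    ```python
    # Minimal sketch, assuming scikit-learn; not the paper's exact pipeline.
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.svm import LinearSVC

    reviews = ["great service, will come back", "terrible support, very slow"]
    labels = ["positive", "negative"]  # toy data for illustration

    model = Pipeline([
        # Concatenate features from two vectorizer types: raw term counts
        # and TF-IDF-weighted unigrams/bigrams.
        ("features", FeatureUnion([
            ("counts", CountVectorizer()),
            ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ])),
        # Linear support vector machine for sentiment prediction.
        ("svm", LinearSVC()),
    ])

    model.fit(reviews, labels)
    print(model.predict(["the service was great"]))
    ```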

    Cross-language high similarity search using a conceptual thesaurus

    Full text link
    This work addresses the issue of cross-language high-similarity and near-duplicate search, where, for a given document, a highly similar one is to be identified in a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs, English-German and English-Spanish, using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models, and we find that, although the proposed model is very generic, it produces competitive results and is notably stable and consistent across the corpora. This work was done in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems and was partially funded by the European Commission as part of the WIQ-EI IRSES project (grant no. 269180) within the FP7 Marie Curie People Framework, and by the Text-Enterprise 2.0 research project (TIN2009-13391-C04-03). The research work of the second author is supported by CONACyT grant 192021/302009. Gupta, P.; Barrón Cedeño, L. A.; Rosso, P. (2012). Cross-language high similarity search using a conceptual thesaurus. In: Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. Springer Verlag (Germany). 7488:67-75. https://doi.org/10.1007/978-3-642-33247-0_8
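    The general idea of comparing documents in a shared concept space can be sketched as follows. The tiny per-language lexicons below stand in for a real Eurovoc term-to-concept mapping and are purely hypothetical, as is the bag-of-concepts weighting.

    ```python
    # Sketch: map terms in each language to shared thesaurus concept IDs
    # and compare documents in concept space. LEXICON is a hypothetical
    # stand-in for a real Eurovoc mapping.
    from collections import Counter
    import math

    LEXICON = {
        "en": {"unemployment": "C1", "labour": "C2", "market": "C3"},
        "de": {"arbeitslosigkeit": "C1", "arbeitsmarkt": "C3"},
    }

    def concept_vector(text: str, lang: str) -> Counter:
        """Bag of thesaurus concepts for the terms found in the text."""
        lex = LEXICON[lang]
        return Counter(lex[w] for w in text.lower().split() if w in lex)

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[c] * b[c] for c in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    en_doc = "unemployment and the labour market"
    de_doc = "arbeitslosigkeit am arbeitsmarkt"
    print(cosine(concept_vector(en_doc, "en"), concept_vector(de_doc, "de")))
    ```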

    Generating Clusters of Duplicate Documents: An Approach Based on Frequent Closed Itemsets

    Full text link
    A vast number of documents on the Web have duplicates, which necessitates efficient methods for computing clusters of duplicate documents [1-5, 8-10, 13-14]. In this paper, Data Mining algorithms are applied to constructing clusters of duplicate documents, with documents represented by both syntactic and lexical methods. A series of experiments suggests some conclusions about how to choose the parameters of the methods.
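    A brute-force sketch of the itemset view of duplicate clustering is given below: documents are sets of word shingles, and a candidate cluster is the support set of a frequent closed itemset. The shingle size, the support threshold, and the exhaustive enumeration are illustrative assumptions; the paper's actual mining algorithms are not reproduced.

    ```python
    # Naive illustration only; real systems use dedicated FCI miners.
    from itertools import combinations

    def shingles(text: str, k: int = 3) -> frozenset:
        words = text.lower().split()
        return frozenset(" ".join(words[i:i + k]) for i in range(len(words) - k + 1))

    docs = {
        "d1": shingles("the quick brown fox jumps over the lazy dog"),
        "d2": shingles("the quick brown fox jumps over a lazy dog"),
        "d3": shingles("completely unrelated text about something else entirely"),
    }

    min_support = 2  # an itemset is frequent if >= 2 documents contain it

    def support(itemset):
        # Support set of an itemset = documents containing all its items.
        return frozenset(d for d, feats in docs.items() if itemset <= feats)

    # Closed itemset for a group of documents: the intersection of their
    # features, kept only when no other document also contains it.
    clusters = {}
    for r in range(min_support, len(docs) + 1):
        for group in combinations(docs, r):
            common = frozenset.intersection(*(docs[d] for d in group))
            if common and support(common) == frozenset(group):
                clusters[frozenset(group)] = common

    for group, itemset in clusters.items():
        print(sorted(group), "share", len(itemset), "shingles")
    ```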

    The method for detecting plagiarism in a collection of documents

    Get PDF
    The development of an intelligent plagiarism-detection system that combines two fuzzy-duplicate search algorithms is considered in this article. The combination yields high computational efficiency. Another advantage of the algorithm is its effectiveness when small documents are compared. In practice, the algorithm improves the quality of plagiarism detection, and it can also be used in various text search systems.
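    The abstract does not name the two combined algorithms, so the sketch below pairs two common fuzzy-duplicate measures purely as an assumed example: word-shingle Jaccard overlap, which is robust on longer texts, and difflib's sequence ratio, which stays informative on the small documents the abstract mentions.

    ```python
    # Assumed combination for illustration; not the paper's algorithm pair.
    from difflib import SequenceMatcher

    def jaccard_shingles(a: str, b: str, k: int = 3) -> float:
        def sh(t):
            w = t.lower().split()
            return {" ".join(w[i:i + k]) for i in range(max(len(w) - k + 1, 1))}
        sa, sb = sh(a), sh(b)
        return len(sa & sb) / len(sa | sb)

    def is_plagiarised(a: str, b: str, threshold: float = 0.6) -> bool:
        # Take the stronger of the two signals: short documents lean on
        # the sequence ratio, where shingle sets are too small to help.
        score = max(jaccard_shingles(a, b), SequenceMatcher(None, a, b).ratio())
        return score >= threshold

    print(is_plagiarised("the cat sat on the mat", "a cat sat on the mat"))
    ```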

    The problem of fuzzy duplicate detection of large texts

    Get PDF
    In the paper, we consider the problem of fuzzy duplicate detection. The basic approaches to detecting text duplicates are given: distance between strings, fuzzy search algorithms without data indexing, and fuzzy search algorithms with data indexing. A review of existing methods for fuzzy duplicate detection is provided, and an algorithm for fuzzy duplicate detection is presented. The algorithm was implemented in the AVTOR.NET system. Text filtering, stemming, and character replacement allow the algorithm to find duplicates even in slightly modified texts.
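    The normalization steps the abstract mentions can be sketched as below. The homoglyph replacement table, the toy suffix-stripping stemmer, and the set-overlap similarity are illustrative stand-ins, not the AVTOR.NET implementation.

    ```python
    # Sketch of filtering, character replacement, and stemming before
    # comparison; the specific rules here are illustrative assumptions.
    import re

    HOMOGLYPHS = str.maketrans("аеорсух", "aeopcyx")  # Cyrillic -> Latin lookalikes

    def normalize(text: str) -> list:
        text = text.lower().translate(HOMOGLYPHS)
        text = re.sub(r"[^a-z0-9\s]", " ", text)  # filter punctuation/noise
        def stem(w):
            return re.sub(r"(ing|ed|es|s)$", "", w)  # toy stemmer
        return [stem(w) for w in text.split() if w]

    def similarity(a: str, b: str) -> float:
        sa, sb = set(normalize(a)), set(normalize(b))
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

    # A minor edit plus a Cyrillic lookalike letter still scores as a
    # near-duplicate after normalization.
    print(similarity("The tested systems worked", "the tested sуstem works"))
    ```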

    Efficient partial-duplicate detection based on sequence matching

    Full text link