61 research outputs found

    Cross-Language Information Retrieval Using Two Methods: LSI via SDD and LSI via SVD

    This chapter presents a method for bilingual information retrieval based on the semidiscrete matrix decomposition (SDD); that is, it studies the problem of retrieving information in two languages, Spanish and English, when queries are made only in Spanish. Four case studies show the performance of latent semantic indexing (LSI) via SDD for cross-language information retrieval (CLIR). These results are compared with those obtained by applying LSI via singular value decomposition (SVD). All experiments used a bilingual database built from the Gospels of the Bible, which combines documents in Spanish and English. A fusion strategy was used that increases the size of the database by 10%. In terms of errors, the methods are comparable, since equal results were obtained in 58.3% of the queries made. In addition, the methods achieved a success rate of at least 65% in retrieving relevant information in the two languages considered.
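    The SVD variant of LSI mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the toy corpus, the rank k = 2, and the helper names (`to_vector`, `fold_in`, `rank`) are all assumptions, and the training documents are assumed to pair each Spanish text with its English translation, which the abstract does not confirm. The SDD variant is not shown.

```python
import numpy as np

# Hypothetical merged training documents: each one concatenates a Spanish
# text with its English translation (toy data, for illustration only).
train_docs = [
    "pan vino mesa bread wine table",
    "barca mar pescador boat sea fisherman",
    "pan mesa bread table",
    "mar sea",
]

vocab = sorted({w for d in train_docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}


def to_vector(text):
    """Raw term-frequency vector over the training vocabulary."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v


# Term-by-document matrix A (terms x documents).
A = np.column_stack([to_vector(d) for d in train_docs])

# Rank-k truncated SVD: A is approximated by U_k diag(s_k) V_k^T.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]


def fold_in(text):
    """Project a text into the k-dimensional LSI space: q^T U_k diag(1/s_k)."""
    return to_vector(text) @ U_k / s_k


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def rank(query, docs):
    """Return document indices sorted by decreasing similarity to the query."""
    q = fold_in(query)
    scores = [cosine(q, fold_in(d)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# A Spanish-only query ranked against English-only documents.
english_docs = ["bread wine table", "boat sea fisherman"]
print(rank("pan vino", english_docs))
```

    Because Spanish and English terms co-occur in the merged training documents, they load on the same latent dimensions, which is what lets a Spanish query score against English text that shares no surface terms with it.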

    A Survey of Multilingual Text Retrieval

    This report reviews the present state of the art in the selection of texts in one language based on queries in another, a problem we refer to as "multilingual" text retrieval. Present applications of multilingual text retrieval systems are limited by the cost and complexity of developing and using the multilingual thesauri on which they are based and by the level of user training that is required to achieve satisfactory search effectiveness. A general model for multilingual text retrieval is used to review the development of the field and to describe modern production and experimental systems. The report concludes with some observations on the present state of the art and an extensive bibliography of the technical literature on multilingual text retrieval. The research reported herein was supported, in part, by Army Research Office contract DAAL03-91-C-0034 through Battelle Corporation, NSF NYI IRI-9357731, Alfred P. Sloan Research Fellow Award BR3336, and a General Research Board Semester Award. (Also cross-referenced as UMIACS-TR-96-19)

    A Method of Cross-Language Aspect-Oriented Analysis of Statements Using Machine Learning of a Categorization Model

    Product reviews are the foremost source of information for customers and manufacturers, helping them make appropriate purchasing and production decisions. Today, the Internet has become the largest source of consumer opinion. Sentiment analysis and opinion mining is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. In this paper, we present a study of aspect-based opinion mining using a lexicon-based approach and its adaptation to the processing of responses written in Ukrainian and English. This information helps to build systems that understand customer feedback and to plan business strategies accordingly. It also helps in predicting the chances of product failure. The paper explains how machine learning can be used for opinion mining. The research methods used in the work are based on data mining, Web mining, machine learning, and information retrieval. The stages of the method of cross-language aspect-oriented analysis of statements are presented. The cross-language categorization of product characteristics is considered. The algorithm describes model learning on cross-language virtual contextual documents.

    Cross-Lingual Semantic Similarity Measure for Comparable Articles

    In this research, we aim to find and compare cross-lingual articles concerning a specific topic, which requires a similarity measure. This measure can be based on bilingual dictionaries or on numerical methods such as Latent Semantic Indexing (LSI). In this paper, we use LSI in two ways to retrieve Arabic-English comparable articles. The first is monolingual: the English article is translated into Arabic and then mapped into the Arabic LSI space. The second is cross-lingual: Arabic and English documents are mapped into an Arabic-English LSI space. We then compare the LSI approaches to the dictionary-based approach on several English-Arabic parallel and comparable corpora. Results indicate that the performance of the cross-lingual LSI approach is competitive with the monolingual approach, or even better for some corpora. Moreover, both LSI approaches outperform the dictionary approach.

    A history and theory of textual event detection and recognition


    Large-scale, Language-agnostic Discourse Classification of Tweets During COVID-19

    Quantifying the characteristics of public attention is an essential prerequisite for appropriate crisis management during severe events such as pandemics. For this purpose, we propose language-agnostic tweet representations to perform large-scale Twitter discourse classification with machine learning. Our analysis of more than 26 million COVID-19 tweets shows that large-scale surveillance of public discourse is feasible with computationally lightweight classifiers by out-of-the-box utilization of these representations. Comment: 14 pages, 4 figures.

    Analyzing user reviews of messaging Apps for competitive analysis

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. The rise of various messaging apps has resulted in fierce competition, and the era of Web 2.0 enables business managers to gain competitive intelligence from user-generated content (UGC). Text-mining UGC for competitive intelligence has drawn great interest from researchers. However, relevant studies mostly focus on industries such as hospitality and products, and few studies have applied such techniques to perform competitive analysis for messaging apps. Here, we conducted a competitive analysis based on topic modeling and sentiment analysis by text-mining 27,479 user reviews of four iOS messaging apps, namely Messenger, WhatsApp, Signal, and Telegram. The results show that the performance of topic modeling and sentiment analysis is encouraging, and that a combination of the extracted app aspect-based topics and the adjusted sentiment scores can effectively reveal meaningful competitive insights into user concerns, competitive strengths and weaknesses, and changes in user sentiment over time. We anticipate that this study will not only advance the existing literature on competitive analysis using text mining techniques for messaging apps but also help existing players and new entrants in the market to sharpen their competitive edge by better understanding user needs and industry trends.

    Document Clustering using Self-Organizing Maps

    Cluster analysis of textual documents is a common technique for better filtering, navigation, understanding, and comprehension of large document collections. Document clustering is an autonomous method that separates a large heterogeneous document collection into smaller, more homogeneous sub-collections called clusters. A self-organizing map (SOM) is a type of artificial neural network (ANN) that can be used to perform autonomous self-organization of a high-dimensional feature space into low-dimensional projections called maps. It is considered a good method for clustering, as both require unsupervised processing. In this paper, we propose a multi-layer, multi-feature SOM to cluster documents. The paper implements a SOM using four layers containing lexical terms, phrases, and sequences in the bottom layers, respectively, and combining all at the top layers. The documents are processed to extract these features to feed the SOM. The internal weights and interconnections between these layers' features (neurons) automatically settle through iterations with a small learning rate to discover the actual clusters. We performed an extensive set of experiments on standard text mining datasets such as NEWS20, Reuters, and WebKB, with the evaluation measures F-Measure and Purity. The evaluation gives encouraging results and outperforms some of the existing approaches. We conclude that a SOM with multiple features (lexical terms, phrases, and sequences) and multiple layers can be very effective in producing high-quality clusters on large document collections.
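    The basic SOM update that this abstract builds on can be sketched as follows. This is a plain single-layer map, not the four-layer, multi-feature architecture the paper proposes; the grid size, the linearly decaying learning rate and neighborhood width, the function names, and the toy document vectors are all assumptions made for illustration.

```python
import numpy as np


def best_matching_unit(weights, x):
    """Grid index (row, col) of the neuron whose weight vector is closest to x."""
    dist = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dist), dist.shape)


def train_som(data, grid=(3, 3), epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a single-layer SOM; returns weights of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every neuron, shape (rows, cols, 2).
    coords = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays toward 0
            sigma = sigma0 * (1.0 - frac) + 1e-3    # neighborhood shrinks over time
            bmu = best_matching_unit(weights, x)
            # Squared grid distance of each neuron from the winning neuron.
            d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))    # Gaussian neighborhood function
            # Move the winner and its grid neighbors toward the input vector.
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights


# Hypothetical document feature vectors (e.g. normalized term weights).
data = np.array([
    [1.0, 0.9, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
])

weights = train_som(data)
# Each document is assigned to the grid cell of its best matching unit.
assignments = [tuple(int(i) for i in best_matching_unit(weights, x)) for x in data]
print(assignments)
```

    Documents that land on the same or adjacent grid cells are treated as belonging to the same cluster; a multi-feature version would need separate maps or layers for terms, phrases, and sequences, which this sketch does not attempt.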