440 research outputs found

    Natural language processing

    Get PDF
    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of www and digital libraries ; and (iv) evaluation of NLP systems

    Spoken content retrieval: A survey of techniques and technologies

    Get PDF
    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR

    Sentiment Analysis in Karonese Tweet using Machine Learning

    Get PDF
    Recently, many social media users expressed their conditions, ideas, emotions using local languages ​​on social media, for example via tweets or status. Due to the large number of texts, sentiment analysis is used to identify opinions, ideas, or thoughts from social media. Sentiment analysis research has also been widely applied to local languages. Karonese is one of the largest local languages ​​in North Sumatera, Indonesia. Karo society actively use the language in expression on twitter. This study proposes two things: Karonese tweet dataset for classification and analysis of sentiment on Karonese. Several machine learning algorithms are implemented in this research, that is Logistic regression, Naive bayes, K-nearest neighbor, and Support Vector Machine (SVM). Karonese tweets is obtained from timeline twitter based on several keywords and hashtags. Transcribers from ethnic figures helped annotating the Karo tweets into three classes: positive, negative, and neutral. To get the best model, several scenarios were run based on various compositions of training data and test data. The SVM algorithm has highest accuracy, precision, recall, and F-1 scores than others. As the research is a preliminary research of sentiment analysis on Karonese language, there are many feature works to improvement

    Inquest of Current Situation in Afghanistan Under Taliban Rule Using Sentiment Analysis and Volume Analysis

    Get PDF
    Microblogging websites and social media platforms serve as a potential source for mining public opinions and sentiments on a variety of subjects including the prevailing situations in war-afflicted countries. In particular, Twitter has a large number of geotagged tweets that make the analysis of sentiments across time and space possible. This study performs volume analysis and sentiment analysis using LDA (Latent Dirichlet Allocation) and text mining over two datasets collected for different periods. To increase the adequacy and efficacy of the sentiment analysis, a hybrid feature engineering approach is proposed that elevates the performance of machine learning models. Geotagged tweets are used for volume analysis indicating that the highest number of tweets is originated from India, the US, the UK, Pakistan, and Afghanistan. Analysis of positive and negative tweets reveals that negative tweets are mostly originated from India and the US. On the contrary, positive tweets belong to Pakistan and Afghanistan. LDA is used for topic modeling on two datasets containing tweets about the current situation after the Taliban take control of Afghanistan. Topics extracted through LDA suggest that majority of the Afghanistan people seem satisfied with the Taliban�s takeover while the topics from negative tweets reveal that issues discussed in negative tweets are related to the US concerns in Afghanistan. Sentiment analysis over two different datasets indicates that the trend of the sentiments has been shifted positively over three weeks

    ANALYZING CUSTOMER REVIEWS IN TURKISH USING MACHINE LEARNING AND DATA SCIENCE METHODOLOGIES

    Get PDF
    Digital life, especially after the introduction of Web 2.0, has significantly altered human relations, providing all people the “right of public speech”. Ideas, emotions, and opinions on many topics are generously shared in virtual environments. A new age global and digital Mouth of World is shaping the society where knowledge is the most influential power. Being fed by social media data highly dynamic in either amount or shape, automatic handling is indispensable. Natural Language Processing, in cooperation with Machine Language techniques, has an important say in analyzing written textual data. Traditional techniques exploited in the literature are empowered when hybrid ones are applied, in accordance also with the characteristic properties of the language used and the domain-specific data. Although all the subsequent steps of the text classification chain are important, adequate feature selecting has a notable huge impact on accurate classification prediction. In this study, a simple classification of the sentiment polarity of comments in document level of subjective texts in Turkish is done. Different domains include reviews of customers towards company products, movies, and healthcare services, deciding on the positivity or negativity of the comments. Another domain includes doctors’ notes on patients’ symptoms aiming to predict and thus recommend some of the most often used medical tests according to general doctors’ procedures. The features used included a part of or all distinct words roots together with their binary or frequency information. Linear or vector analysis of the feature sets was done employing Machine Learning algorithms provided by the Weka tool. Hybrid features set was proposed and found more efficient combining binary vectors and frequency meta-features from nodes and leaves of J48 tree classifier for all or a set of correlation based selected features, improving both prediction accuracy and classification performance

    Document analysis by means of data mining techniques

    Get PDF
    The huge amount of textual data produced everyday by scientists, journalists and Web users, allows investigating many different aspects of information stored in the published documents. Data mining and information retrieval techniques are exploited to manage and extract information from huge amount of unstructured textual data. Text mining also known as text data mining is the processing of extracting high quality information (focusing relevance, novelty and interestingness) from text by identifying patterns etc. Text mining typically involves the process of structuring input text by means of parsing and other linguistic features or sometimes by removing extra data and then finding patterns from structured data. Patterns are then evaluated at last and interpretation of output is performed to accomplish the desired task. Recently, text mining has got attention in several fields such as in security (involves analysis of Internet news), for commercial (for search and indexing purposes) and in academic departments (such as answering query). Beyond searching the documents consisting the words given in a user query, text mining may provide direct answer to user by semantic web for content based (content meaning and its context). It can also act as intelligence analyst and can also be used in some email spam filters for filtering out unwanted material. Text mining usually includes tasks such as clustering, categorization, sentiment analysis, entity recognition, entity relation modeling and document summarization. In particular, summarization approaches are suitable for identifying relevant sentences that describe the main concepts presented in a document dataset. Furthermore, the knowledge existed in the most informative sentences can be employed to improve the understanding of user and/or community interests. Different approaches have been proposed to extract summaries from unstructured text documents. Some of them are based on the statistical analysis of linguistic features by means of supervised machine learning or data mining methods, such as Hidden Markov models, neural networks and Naive Bayes methods. An appealing research field is the extraction of summaries tailored to the major user interests. In this context, the problem of extracting useful information according to domain knowledge related to the user interests is a challenging task. The main topics have been to study and design of novel data representations and data mining algorithms useful for managing and extracting knowledge from unstructured documents. This thesis describes an effort to investigate the application of data mining approaches, firmly established in the subject of transactional data (e.g., frequent itemset mining), to textual documents. Frequent itemset mining is a widely exploratory technique to discover hidden correlations that frequently occur in the source data. Although its application to transactional data is well-established, the usage of frequent itemsets in textual document summarization has never been investigated so far. A work is carried on exploiting frequent itemsets for the purpose of multi-document summarization so a novel multi-document summarizer, namely ItemSum (Itemset-based Summarizer) is presented, that is based on an itemset-based model, i.e., a framework comprise of frequent itemsets, taken out from the document collection. Highly representative and not redundant sentences are selected for generating summary by considering both sentence coverage, with respect to a sentence relevance score, based on tf-idf statistics, and a concise and highly informative itemset-based model. To evaluate the ItemSum performance a suite of experiments on a collection of news articles has been performed. Obtained results show that ItemSum significantly outperforms mostly used previous summarizers in terms of precision, recall, and F-measure. We also validated our approach against a large number of approaches on the DUC’04 document collection. Performance comparisons, in terms of precision, recall, and F-measure, have been performed by means of the ROUGE toolkit. In most cases, ItemSum significantly outperforms the considered competitors. Furthermore, the impact of both the main algorithm parameters and the adopted model coverage strategy on the summarization performance are investigated as well. In some cases, the soundness and readability of the generated summaries are unsatisfactory, because the summaries do not cover in an effective way all the semantically relevant data facets. A step beyond towards the generation of more accurate summaries has been made by semantics-based summarizers. Such approaches combine the use of general-purpose summarization strategies with ad-hoc linguistic analysis. The key idea is to also consider the semantics behind the document content to overcome the limitations of general-purpose strategies in differentiating between sentences based on their actual meaning and context. Most of the previously proposed approaches perform the semantics-based analysis as a preprocessing step that precedes the main summarization process. Therefore, the generated summaries could not entirely reflect the actual meaning and context of the key document sentences. In contrast, we aim at tightly integrating the ontology-based document analysis into the summarization process in order to take the semantic meaning of the document content into account during the sentence evaluation and selection processes. With this in mind, we propose a new multi-document summarizer, namely Yago-based Summarizer, that integrates an established ontology-based entity recognition and disambiguation step. Named Entity Recognition from Yago ontology is being used for the task of text summarization. The Named Entity Recognition (NER) task is concerned with marking occurrences of a specific object being mentioned. These mentions are then classified into a set of predefined categories. Standard categories include “person”, “location”, “geo-political organization”, “facility”, “organization”, and “time”. The use of NER in text summarization improved the summarization process by increasing the rank of informative sentences. To demonstrate the effectiveness of the proposed approach, we compared its performance on the DUC’04 benchmark document collections with that of a large number of state-of-the-art summarizers. Furthermore, we also performed a qualitative evaluation of the soundness and readability of the generated summaries and a comparison with the results that were produced by the most effective summarizers. A parallel effort has been devoted to integrating semantics-based models and the knowledge acquired from social networks into a document summarization model named as SociONewSum. The effort addresses the sentence-based generic multi-document summarization problem, which can be formulated as follows: given a collection of news articles ranging over the same topic, the goal is to extract a concise yet informative summary, which consists of most salient document sentences. An established ontological model has been used to improve summarization performance by integrating a textual entity recognition and disambiguation step. Furthermore, the analysis of the user-generated content coming from Twitter has been exploited to discover current social trends and improve the appealing of the generated summaries. An experimental evaluation of the SociONewSum performance was conducted on real English-written news article collections and Twitter posts. The achieved results demonstrate the effectiveness of the proposed summarizer, in terms of different ROUGE scores, compared to state-of-the-art open source summarizers as well as to a baseline version of the SociONewSum summarizer that does not perform any UGC analysis. Furthermore, the readability of the generated summaries has also been analyzed

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    Get PDF
    Translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data but using external morphological resources. A set of new phrase associations is added to translation and reordering models; each of them corresponds to a morphological variation of the source/target/both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations and results showed improvements of the performance in terms of automatic scores (BLEU and Meteor) and reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.JRC.G.2-Global security and crisis managemen

    ARNLI: ARABIC NATURAL LANGUAGE INFERENCE ENTAILMENT AND CONTRADICTION DETECTION

    Get PDF
    Natural Language Inference (NLI) is a hot topic research in natural language processing, contradiction detection between sentences is a special case of NLI. This is considered a difficult NLP task which has a big influence when added as a component in many NLP applications, such as Question Answering Systems, text Summarization. Arabic Language is one of the most challenging low-resources languages in detecting contradictions due to its rich lexical, semantics ambiguity. We have created a dataset of more than 12k sentences and named ArNLI, that will be publicly available. Moreover, we have applied a new model inspired by Stanford contradiction detection proposed solutions on English language. We proposed an approach to detect contradictions between pairs of sentences in Arabic language using contradiction vector combined with language model vector as an input to machine learning model. We analyzed results of different traditional machine learning classifiers and compared their results on our created dataset (ArNLI) and on an automatic translation of both PHEME, SICK English datasets. Best results achieved using Random Forest classifier with an accuracy of 99%, 60%, 75% on PHEME, SICK and ArNLI respectively

    Reproducibility Issues for BERT-based Evaluation Metrics

    Full text link
    Reproducibility is of utmost concern in machine learning and natural language processing (NLP). In the field of natural language generation (especially machine translation), the seminal paper of Post (2018) has pointed out problems of reproducibility of the dominant metric, BLEU, at the time of publication. Nowadays, BERT-based evaluation metrics considerably outperform BLEU. In this paper, we ask whether results and claims from four recent BERT-based metrics can be reproduced. We find that reproduction of claims and results often fails because of (i) heavy undocumented preprocessing involved in the metrics, (ii) missing code and (iii) reporting weaker results for the baseline metrics. (iv) In one case, the problem stems from correlating not to human scores but to a wrong column in the csv file, inflating scores by 5 points. Motivated by the impact of preprocessing, we then conduct a second study where we examine its effects more closely (for one of the metrics). We find that preprocessing can have large effects, especially for highly inflectional languages. In this case, the effect of preprocessing may be larger than the effect of the aggregation mechanism (e.g., greedy alignment vs. Word Mover Distance).Comment: EMNLP 2022 Camera-Ready (captions fixed
    corecore