12 research outputs found

    Delineating Knowledge Domains in Scientific Literature using Machine Learning (ML)

    Recent years have seen an upsurge in the number of published documents, and organizations are showing increased interest in text classification to make effective use of this information. Manual text classification can work for a handful of documents, but it loses credibility as the number of documents grows, besides being laborious and time-consuming. Text mining techniques assign text strings to categories, making classification fast, accurate, and hence reliable. This paper classifies chemistry documents using machine learning and statistical methods. The text classification procedure is described step by step: data preparation, followed by processing, transformation, and application of classification techniques, culminating in validation of the results.

    Text Categorization and Machine Learning Methods: Current State Of The Art

    In this information age, many documents are available in digital form and need to be classified. To address this problem, researchers have focused on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of pre-classified documents, the characteristics of the categories. The main benefits of this approach over the manual definition of a classifier by domain experts are effectiveness, reduced expert effort, and straightforward portability to different domains. The paper examines the main approaches to text categorization, comparing the machine learning paradigm with the present state of the art. Various issues pertaining to three different text similarity problems, namely semantic, conceptual, and contextual, are also discussed.

    Nonprofit Partisanship

    We establish a novel measurement of nonprofit organization ideology using semantic text analysis and validate it with a large-scale online experiment. On average, health-related nonprofits as well as education-related organizations, including US universities, are the most left-leaning group. Religion-related nonprofits, on the other hand, are the most conservative. We then examine whether "rage donations" for selected liberal nonprofits right after the Trump elections documented by the media hold true more generally across different sectors over different presidential elections. We find no evidence that expected shifts in ideology of a government systematically influence donations differently depending on nonprofit ideology.

    What have fruits to do with technology? The case of Orange, Blackberry and Apple.

    Twitter is a micro-blogging service on the Web, where people can enter short messages, which then become visible to other users of the service. While the topics of these messages vary, there are many messages in which users express their opinions about companies or products. Since the Twitter service is very popular, the messages form a rich source of information for companies. With the help of data mining and sentiment analysis techniques, they can learn how their customers like their products or what the general perception of the company is. There is, however, a great obstacle to analyzing the data directly: as company names are often ambiguous, one first needs to identify which messages are related to the company. In this paper we address this question. We present various techniques to classify tweet messages by whether or not they are related to a given company, for example, whether a message containing the keyword "apple" is about the company Apple Inc. We present simple techniques that make use of company profiles, which we created semi-automatically from external Web sources. Our advanced techniques take ambiguity estimations into account and also automatically extend the company profiles from the Twitter stream itself. We demonstrate the effectiveness of our methods through an extensive set of experiments.

    Exploring Crosslingual Word Embeddings for Semantic Classification in Text and Dialogue

    Current approaches to learning crosslingual word embeddings perform well when based on a large amount of parallel data. Because most languages are under-resourced and lack structured lexical materials, it is difficult to include them in such methods and, consequently, in any human language technologies. In this thesis we explore whether a crosslingual mapping between two sets of separately obtained monolingual word embeddings is strong enough to give competitive results on semantic classification tasks. Our experiment involves learning crosslingual transfer between German and French word vectors based on a combination of an adversarial approach and the Procrustes algorithm. We evaluate the embeddings on topic classification, sentiment analysis, and humour detection tasks. We use a German subset of a multilingual data set for training and a French subset for testing our models. Results across German and French show that word vectors mapped into a shared vector space can successfully capture semantic information and transfer it from one language to another. We also show that crosslingual mapping does not weaken the monolingual connections between words in one language.
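
    The Procrustes step mentioned in this abstract can be sketched as follows. This is only a minimal illustration under stated assumptions, not the thesis's implementation: it assumes the rows of `src` and `tgt` are already aligned translation pairs, it omits the adversarial step entirely, and the function name `procrustes_map` and the synthetic data are hypothetical.

```python
import numpy as np


def procrustes_map(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W that minimizes ||src @ W - tgt||_F.

    Assumes row i of `src` and row i of `tgt` are vectors of a
    translation pair (an assumption of this sketch).
    """
    # Closed-form solution: W = U V^T, where U S V^T is the SVD of src^T tgt.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Synthetic check: build "target" vectors as an exact rotation of the
# "source" vectors, then recover that rotation.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 5))               # 50 hypothetical source-language vectors
q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
tgt = src @ q                                 # hypothetical target-language vectors

w = procrustes_map(src, tgt)
mapped = src @ w                              # source vectors in the target space
```

    Because `w` is constrained to be orthogonal, distances and angles between source vectors are unchanged by the mapping, which is consistent with the abstract's observation that monolingual connections are not weakened.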

    Entity-based Classification of Twitter Messages

    Twitter is a popular micro-blogging service on the Web, where people can enter short messages, which then become visible to some other users of the service. While the topics of these messages vary, there are many messages in which users express their opinions about companies or their products. These messages are a rich source of information for companies for sentiment analysis or opinion mining. There is, however, a great obstacle to analyzing the messages directly: as company names are often ambiguous (e.g. apple, the fruit vs. Apple Inc.), one first needs to identify which messages are related to the company. In this paper we address this question. We present various techniques for classifying tweet messages containing a given keyword by whether or not they are related to a particular company with that name. We first present simple techniques that make use of company profiles, which we created semi-automatically from external Web sources. Our advanced techniques take ambiguity estimations into account and also automatically extend the company profiles from the Twitter stream itself. We demonstrate the effectiveness of our methods through an extensive set of experiments. Moreover, we extensively analyze the sources of errors in the classification. The analysis not only brings further improvement but also enables more efficient use of human input.
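
    A profile-based classifier of the simple kind described here might look like the sketch below. The abstract does not specify how the company profiles are structured, so everything in it is an assumption made for illustration: the profile is modelled as two hypothetical keyword sets (terms suggesting the company, terms suggesting another sense of the word), and a tweet counts as related when it matches more of the former than the latter.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_related(tweet: str, company_terms: Set[str], other_terms: Set[str]) -> bool:
    """Hypothetical rule: related if more company terms than other-sense terms match."""
    tokens = tokenize(tweet)
    return len(tokens & company_terms) > len(tokens & other_terms)


# Hypothetical profile for the keyword "apple"; these term lists are
# invented for the example and are not taken from the paper.
APPLE_COMPANY_TERMS = {"iphone", "ipad", "mac", "ios", "store"}
APPLE_OTHER_TERMS = {"pie", "juice", "tree", "fruit", "eat"}

about_company = is_related(
    "The new iPhone from apple looks great", APPLE_COMPANY_TERMS, APPLE_OTHER_TERMS
)
about_fruit = is_related(
    "I baked an apple pie today", APPLE_COMPANY_TERMS, APPLE_OTHER_TERMS
)
```

    A tweet that matches neither set is treated as unrelated by this rule; the ambiguity estimations mentioned in the abstract would be one way to handle such cases more carefully.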

    Social review-based recommender systems from theory to practice

    Award for the best final-year project (PFC) in the Information Systems area of Telecommunications Engineering or Electronic Engineering at ETSETB-UPC (academic year 2013-2014). Awarded by Cátedra Red.es.
    Social Recommender Systems were born with the goal of mitigating the current information overload caused, among other things, by the rise of social networks. They have enabled Internet actors (e.g. users, web browsers, sensors, actuators) to make more informed decisions based on the information shown to them, to the point that some actors even blindly trust the recommendations generated by these systems. Within this scenario, this thesis proposes a novel Hybrid Social Recommender System based purely on the text reviews typed by users. The proposed engine treats review content and sentiment separately and finally combines both into a single recommendation. Very little scientific research has been published on mining text reviews with the aim of performing item recommendation. Moreover, among all Hybrid Recommendation Systems in the literature, none use the above-mentioned review features in a collaborative and content-based recommender. To assess the effectiveness of the platform, we present a methodology that covers extracting the data directly from a social network, cleaning and pre-processing the text data, building the predictive model with different state-of-the-art machine learning techniques, and evaluating the system in terms of several key metrics. The data extraction process receives particular attention because of the challenges most social platforms impose on obtaining all the geo-positioned data generated in a bounded region. To overcome the platform limitations, we introduce the use of the Quadtree algorithm with the goal of crawling all the geo-positioned reviews. The algorithm is enhanced with a module that copes with the time dynamics and captures the time-stamped data as well. Moreover, we study the effectiveness of the Quadtree partition method to crawl any type of spatial data, which tends to be softly distributed in the area. This thesis draws several conclusions from the available data about the use of several state-of-the-art text mining techniques and the effectiveness of the proposed recommender setup. Nonetheless, future work needs to design and propose novel evaluation methodologies that uncouple the system evaluation from the data. Award-winnin
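
    The Quadtree crawling idea can be sketched as follows. The abstract does not say what the platform limitations are, so this sketch assumes one common form: a region query returns at most a fixed number of results. Under that assumption, a region whose query hits the cap is split into four quadrants and each is queried again. The names `quadtree_crawl`, `fake_query`, and `CAP` are hypothetical, the "platform" is a local list of random points, and the time-stamp module mentioned in the abstract is not modelled.

```python
import random
from typing import Callable, List, Set, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def quadtree_crawl(box: Box, query: Callable[[Box], List[Point]], cap: int,
                   min_size: float = 1e-6) -> Set[Point]:
    """Collect all points in `box` when each query returns at most `cap` results."""
    results = query(box)
    min_x, min_y, max_x, max_y = box
    # Fewer results than the cap means the query was not truncated.
    # The size check stops the recursion for degenerate, very dense cells.
    if len(results) < cap or max(max_x - min_x, max_y - min_y) <= min_size:
        return set(results)
    mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
    quadrants = [
        (min_x, min_y, mid_x, mid_y),
        (mid_x, min_y, max_x, mid_y),
        (min_x, mid_y, mid_x, max_y),
        (mid_x, mid_y, max_x, max_y),
    ]
    found: Set[Point] = set()
    for quadrant in quadrants:
        # Points on shared borders may be returned twice; the set removes duplicates.
        found |= quadtree_crawl(quadrant, query, cap, min_size)
    return found


# Stand-in for a capped platform API (hypothetical): 200 random points,
# at most CAP of them returned per query.
rng = random.Random(0)
ALL_POINTS = [(rng.random(), rng.random()) for _ in range(200)]
CAP = 10


def fake_query(box: Box) -> List[Point]:
    min_x, min_y, max_x, max_y = box
    inside = [p for p in ALL_POINTS
              if min_x <= p[0] <= max_x and min_y <= p[1] <= max_y]
    return inside[:CAP]


crawled = quadtree_crawl((0.0, 0.0, 1.0, 1.0), fake_query, CAP)
```

    The recursion only goes deep where the data is dense, so sparsely populated regions cost a single query each.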