    Categorization and classification of big data news into technologies according to the Gartner Magic Quadrant

    The development of technologies in recent years has led to a continuous increase in data and to its accumulation at an incalculable speed. These factors have made a new concept commonplace: Big Data. In this study, 11,505 news articles about Big Data were extracted from Google News, and Text Mining techniques were applied to obtain relevant knowledge and a categorization of the news through Latent Dirichlet Allocation. Big Data Technologies are examined in relation to the Gartner Quadrants in order to understand the type of Technologies in which companies of a specific Quadrant invest. The study thus contributes to the literature by providing concrete results on market behavior derived from factual data. It confirms the strength of the companies in the Gartner leaders Quadrant, revealing that they are increasingly market leaders, offering a very complete and diverse Big Data Technology solution. It also shows that the companies in the Gartner challengers Quadrant do not demonstrate an understanding of the direction in which the market is moving, and that a company in the Gartner visionaries Quadrant that bets heavily on the Big Data technology stream analytics will see its position in the Gartner Quadrant change, moving increasingly close to the leaders Quadrant and, at the same time, to the niche players Quadrant.
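
As a rough illustration of how Latent Dirichlet Allocation can assign news items to categories, the following is a minimal sketch of a collapsed Gibbs sampler in plain Python. It is not the pipeline used in the study: the function names (`lda_gibbs`, `dominant_topic`), the hyperparameters `alpha` and `beta`, the number of topics, and the toy headlines are all assumptions made for this example.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents (illustrative)."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (document, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for word in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # Resample its topic in proportion to
                # p(topic | document) * p(word | topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


def dominant_topic(doc_topic_row):
    """Index of the topic holding most of a document's tokens."""
    return max(range(len(doc_topic_row)), key=doc_topic_row.__getitem__)


# Toy tokenized headlines, invented for this example.
headlines = [
    ["stream", "analytics", "platform", "launch"],
    ["stream", "analytics", "real", "time"],
    ["data", "warehouse", "cloud", "storage"],
    ["cloud", "storage", "data", "lake"],
]
doc_topic, topic_word = lda_gibbs(headlines, num_topics=2)
categories = [dominant_topic(row) for row in doc_topic]
```

Each headline receives the topic that holds most of its tokens. On a real corpus the topics would then be inspected by their most frequent words and mapped to technology categories by hand.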

    Mining Online Text Data for Sentiment and News Impact Analysis

    With the continuous growth of the Internet, an ever-increasing amount of information becomes available on the World Wide Web (WWW). Information on the WWW has grown so explosively that search engines using traditional keyword-based search strategies can hardly meet people's needs to retrieve knowledge from massive online text data. The motivation for this thesis comes from the great demand for discovering implicit knowledge and rich semantics in online documents. This thesis focuses on analyzing online business news, representative of objective information, and online customer reviews, representative of subjective information. For online business news, a topic-driven impact analysis model is proposed that quantifies the impact of the topic of a news article. Building on this model, an exploratory visual analysis system called ImpactWheel is developed to help users better navigate and understand topic-specific impact relationships among companies by mining the rich information source of online business news. For online customer reviews, both overall document sentiment classification and attribute-based sentiment analysis are performed. For overall document sentiment classification, a frequency-based algorithm is proposed that takes advantage of the high frequency of Co-occurring Term (CoT) patterns in customer reviews to generate complex features that benefit sentiment classifiers. To search for effective features and discard useless ones produced by the frequency-based complex feature generation algorithm, an Effective Feature Search (EFS) framework is proposed, which makes a novel connection between feature candidate generation and a Stochastic Local Search process. For attribute-based sentiment analysis, the concept of a Sentiment Ontology Tree (SOT) is proposed, which organizes a product's domain-specific knowledge as well as sentiments in a tree-like ontology structure. 
With the concept of SOT, a Hierarchical Learning via Sentiment Ontology Tree (HL-SOT) approach is proposed to solve the sentiment analysis tasks in a hierarchical classification process. To enhance the classification performance and computational efficiency of the HL-SOT approach, which encodes texts using a globally unified index term space, a Localized Feature Selection (LFS) framework is developed that generates a customized index term space for each node of the SOT. Because the HL-SOT approach is estimated by an RLS estimator, which is not well suited to finding the maximum class separation, and because the statistical linear classifier has proven fallible at classifying sentiment, a more pragmatic Hybrid Hierarchical Classification Process (HHCP) is proposed. The HHCP approach employs a linear classifier that is capable of maximizing the class separation while minimizing the within-class variance for attribute detection and turns to a rule-based solution for sentiment orientation
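
The criterion described for attribute detection, maximizing class separation while minimizing within-class variance, is the one optimized by Fisher's linear discriminant. The abstract does not name the classifier, so reading it as a Fisher discriminant is an assumption, and the two-dimensional sketch below only illustrates the criterion. The function names and the toy points are hypothetical.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, center):
    """2x2 within-class scatter matrix around the class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in points:
        dx, dy = x - center[0], y - center[1]
        s[0][0] += dx * dx
        s[0][1] += dx * dy
        s[1][0] += dy * dx
        s[1][1] += dy * dy
    return s


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def fisher_discriminant(class_a, class_b):
    """Return (w, threshold) with w proportional to S_w^{-1} (mean_a - mean_b)."""
    mean_a, mean_b = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, mean_a), scatter(class_b, mean_b)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [
        [sw[1][1] / det, -sw[0][1] / det],
        [-sw[1][0] / det, sw[0][0] / det],
    ]
    diff = [mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]]
    w = [dot(inv[0], diff), dot(inv[1], diff)]
    # Decision threshold: midpoint between the two projected class means.
    threshold = 0.5 * (dot(w, mean_a) + dot(w, mean_b))
    return w, threshold


def predict(w, threshold, point):
    """Label a point by which side of the threshold its projection falls on."""
    return "a" if dot(w, point) > threshold else "b"


# Toy feature vectors, invented for this example.
class_a = [(2, 3), (3, 3), (2, 4), (3, 4)]
class_b = [(6, 1), (7, 1), (6, 2), (7, 2)]
w, threshold = fisher_discriminant(class_a, class_b)
```

Here `w` is the projection direction along which the two class means are furthest apart relative to the spread within each class. A text classifier would apply the same idea to high-dimensional term vectors, where the scatter matrix usually needs regularization before it can be inverted.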