    Topic-dependent sentiment analysis of financial blogs

    While most work on sentiment analysis in the financial domain has focused on content from traditional finance news, in this work we concentrate on a more subjective source of information: blogs. We aim to automatically determine the sentiment of financial bloggers towards companies and their stocks. To do this we develop a corpus of financial blogs, annotated with the polarity of sentiment with respect to a number of companies. We conduct an analysis of the annotated corpus, from which we show that there is a significant level of topic shift within this collection, and we also illustrate the difficulty that human annotators have when annotating certain sentiment categories. To deal with the problem of topic shift within blog articles, we propose text extraction techniques to create topic-specific sub-documents, which we use to train a sentiment classifier. We show that such approaches provide a substantial improvement over full-document classification and that word-based approaches perform better than sentence-based or paragraph-based approaches.
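
    As a rough illustration of what a word-based topic-specific sub-document could look like, the sketch below keeps only the words within a fixed distance of each mention of a company name. The function name `word_window_subdocument`, the `window` parameter and the single-token matching rule are assumptions made for this example; the abstract does not specify the actual extraction technique.

```python
import re


def word_window_subdocument(text, company, window=5):
    """Keep only the words within `window` tokens of any mention of `company`.

    Hypothetical helper: matches the company name as a single
    case-insensitive word token, which is a simplification.
    """
    tokens = re.findall(r"\w+", text)
    target = company.lower()
    keep = set()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            keep.update(range(lo, hi))
    # Preserve the original word order in the extracted sub-document.
    return " ".join(tokens[i] for i in sorted(keep))
```

    The resulting string could then be passed to any sentiment classifier in place of the full blog post, so that text about other companies has less influence on the prediction.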

    Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems

    Master's dissertation in Computer Science. Continuous social and economic development has led over time to an increase in consumption, as well as to greater consumer demand for better and cheaper products. Hence, the selling price of a product plays a fundamental role in the consumer's purchase decision. In this context, online stores must carefully analyse and define the best price for each product, based on several factors such as production/acquisition cost, the positioning of the product (e.g. anchor product) and competitors' strategies. The work done by market analysts has changed drastically in recent years. As the number of Web sites increases exponentially, the number of E-commerce Web sites grows as well. Web page classification is becoming more important in fields such as Web mining and information retrieval. Traditional classifiers are usually hand-crafted and non-adaptive, which makes them unsuitable for use in a broader context. We introduce an ensemble of methods, and a subsequent study of its results, to create a more generic and modular crawler and scraper for detection and information extraction on E-commerce Web pages. The collected information may then be processed and used in the pricing decision. This framework is named Prometheus and has the goal of extracting knowledge from E-commerce Web sites. The process requires crawling an online store and gathering product pages. This implies that, given a Web page, the framework must be able to determine whether it is a product page. To achieve this we classify pages into three categories: catalogue, product and “spam”. Page classification is based on the HTML text as well as on the visual layout, using both traditional methods and Deep Learning approaches. Once a set of product pages has been identified, we proceed to the extraction of the pricing information. This is not a trivial task, given the many different ways in which a Web page can be built. Furthermore, most product pages are dynamic in the sense that they are really a page for a family of related products. For instance, in a shoe store a particular model is probably available in a number of sizes and colours. Such a model may be displayed in a single dynamic Web page, making it necessary for our framework to explore all the relevant combinations. This process is called scraping and is the last stage of the Prometheus framework.
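
    The need to explore every relevant combination of product options during scraping can be pictured with a small sketch. It assumes the option names and values (here sizes and colours) have already been read from the page; the function name `variant_combinations` and the dictionary layout are hypothetical and are not taken from Prometheus itself.

```python
from itertools import product


def variant_combinations(options):
    """Return every combination of option values as a list of dicts.

    `options` maps an option name (e.g. "size") to its possible values.
    Hypothetical helper for illustration only.
    """
    names = list(options)
    value_lists = [options[name] for name in names]
    # The Cartesian product yields one tuple per product variant.
    return [dict(zip(names, values)) for values in product(*value_lists)]
```

    For a shoe model with two sizes and two colours this yields four variants, each of which a scraper would have to select on the dynamic page before reading the price.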

    Connected Information Management

    Society is currently inundated with more information than ever, making efficient management a necessity. Alas, most current information management suffers from several kinds of disconnectedness: applications partition data into segregated islands; small notes don’t fit into traditional application categories; navigation differs for each kind of data; and data is available either on a particular computer or only online, but rarely both. Connected information management (CoIM) is an approach to information management that avoids these forms of disconnectedness. The core idea of CoIM is to keep all information in a central repository, with generic means for organization such as tagging. The heterogeneity of data is taken into account by offering specialized editors. The central repository eliminates the islands of application-specific data and is formally grounded in a CoIM model. The foundation for structured data is an RDF repository. The RDF editing meta-model (REMM) enables form-based editing of this data, similar to database applications such as MS Access. Further kinds of data are supported by extending RDF, as follows. Wiki text is stored as RDF and can both contain structured text and be combined with structured data. Files are also supported by the CoIM model and are kept externally. Notes can be quickly captured and annotated with meta-data. Generic means for organization and navigation apply to all kinds of data. Ubiquitous availability of data is ensured via two CoIM implementations, the web application HYENA/Web and the desktop application HYENA/Eclipse. All data can be synchronized between these applications. The applications were used to validate the CoIM ideas.

    Adaptive Semantic Annotation of Entity and Concept Mentions in Text

    Recent years have seen an increase in interest in knowledge repositories that are useful across applications, in contrast to the creation of ad hoc or application-specific databases. These knowledge repositories act as a central provider of unambiguous identifiers and semantic relationships between entities. As such, these shared entity descriptions serve as a common vocabulary to exchange and organize information in different formats and for different purposes. Therefore, there has been remarkable interest in systems that are able to automatically tag textual documents with identifiers from shared knowledge repositories so that the content in those documents is described in a vocabulary that is unambiguously understood across applications. Tagging textual documents according to these knowledge bases is a challenging task. It involves recognizing the entities and concepts that have been mentioned in a particular passage and attempting to resolve any ambiguity of language in order to choose one of many possible meanings for a phrase. There has been substantial work on recognizing and disambiguating entities for specialized applications, or constrained to limited entity types and particular types of text. In the context of shared knowledge bases, since each application has potentially very different needs, systems must have unprecedented breadth and flexibility to ensure their usefulness across applications. Documents may exhibit different language and discourse characteristics, discuss very diverse topics, or require a focus on parts of the knowledge repository that are inherently harder to disambiguate. In practice, for developers looking for a system to support their use case, it is often unclear whether an existing solution is applicable, leading those developers to trial-and-error and ad hoc usage of multiple systems in an attempt to achieve their objective. 
In this dissertation, I propose a conceptual model that unifies related techniques in this space under a common multi-dimensional framework that makes the strengths and limitations of each technique explicit, supporting developers in their search for a suitable tool for their needs. Moreover, the model serves as the basis for the development of flexible systems that are able to support document tagging for different use cases. I describe such an implementation, DBpedia Spotlight, along with extensions that we made to the knowledge base DBpedia to support this implementation. I report evaluations of this tool on several well-known data sets, and demonstrate applications to diverse use cases for further validation.

    Web content mining with multi-source machine learning for intelligent web agents.

    The web is recognized as the largest data source in the world. Its data is characterized by partial or no structure and, even worse, there is no standard data schema even for the small volume of structured data. Web Mining aims to extract useful knowledge from the Web by using a variety of techniques that have to cope with the heterogeneity and lack of a unique and fixed way of representing information. An important role in Web Mining is played by the automation of extraction rules with proper algorithms. Machine Learning techniques have been successfully applied to Web Mining and Information Extraction tasks thanks to their generalization and adaptation capabilities, which are a key requirement on general-content, heterogeneous web pages. The World Wide Web is a graph, more precisely a directed labeled graph where the nodes are the pages and the edges are the links between them. Recent works propose exploiting the web structure (Link Analysis) for content extraction; for example, one can leverage the content category of neighbor pages to categorize the contents of difficult web pages where word-frequency-based techniques are not robust enough. In this thesis we propose an automated method, suitable for a wide range of domains, based on Machine Learning and Link Analysis. In particular, we propose an inductive model that, after being trained with proper input data, is able to recognize content pages where structured information is located. In order to keep the recognition speed high enough for real-world applications, an additional algorithm is proposed that improves the approach in both speed and quality. The proposed method has been tested with a controlled dataset in a classic train-and-test scenario and in a real-world web crawling system.
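
    The idea of using the categories of neighbor pages to categorize a difficult page can be sketched as a simple majority vote over linked pages. This is only an illustration of the general Link Analysis idea mentioned in the abstract, not the thesis's method; the function name `label_from_neighbours` and the dictionary-based graph representation are assumptions.

```python
from collections import Counter


def label_from_neighbours(page, links, known_labels):
    """Guess a page's category from the categories of the pages it links to.

    `links` maps a page to the list of pages it links to, and
    `known_labels` maps already categorized pages to their category.
    Returns None when no linked page has a known category.
    """
    neighbours = links.get(page, [])
    votes = Counter(known_labels[n] for n in neighbours if n in known_labels)
    if not votes:
        return None
    # The most frequent category among the labelled neighbours wins.
    return votes.most_common(1)[0][0]
```

    In practice such a vote would more likely be one feature among several, combined with content-based features, rather than a classifier on its own.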

    Automated subject classification of textual web documents

    Transforming Graph Representations for Statistical Relational Learning

    Relational data representations have become an increasingly important topic due to the recent proliferation of network datasets (e.g., social, biological, information networks) and a corresponding increase in the application of statistical relational learning (SRL) algorithms to these domains. In this article, we examine a range of representation issues for graph-based relational data. Since the choice of relational data representation for the nodes, links, and features can dramatically affect the capabilities of SRL algorithms, we survey approaches and opportunities for relational representation transformation designed to improve the performance of these algorithms. This leads us to introduce an intuitive taxonomy for data representation transformations in relational domains that incorporates link transformation and node transformation as symmetric representation tasks. In particular, the transformation tasks for both nodes and links include (i) predicting their existence, (ii) predicting their label or type, (iii) estimating their weight or importance, and (iv) systematically constructing their relevant features. We motivate our taxonomy through detailed examples and use it to survey and compare competing approaches for each of these tasks. We also discuss general conditions for transforming links, nodes, and features. Finally, we highlight challenges that remain to be addressed.