499 research outputs found

    Stock price change prediction using news text mining

    Along with the advent of the Internet as a new way of propagating news in a digital format came the need to understand and transform this data into information. This work presents a computational framework that aims to predict intraday changes in stock prices, given the occurrence of news articles related to the companies listed in the Dow Jones Index. For this task, an automated process that gathers, cleans, labels, classifies, and simulates investments was developed. This process integrates existing data mining and text mining algorithms with newly proposed techniques for aligning news articles with stock prices, for pre-processing, and for classifier ensembling. The experimental results, in terms of both classification measures and the cumulative return obtained through investment simulation, outperformed the other results found in an extensive review of the related literature. This work also argues that accuracy as a classification measure, together with the incorrect use of cross-validation, contributes little to investment recommendation in the financial market. Altogether, the developed methodology and results contribute to the state of the art in this emerging research field, demonstrating that the correct use of text mining techniques is an applicable alternative for predicting stock price movements in the financial market.
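
    As a concrete illustration of the alignment and labeling step described above, here is a minimal sketch in Python; it is not the authors' code. Each article is matched to the last quote before its publication time and to the first quote after an assumed reaction window, and the resulting return is discretized into up/down/stable labels. The 20-minute window, the 0.5% dead zone, and the column names are illustrative assumptions.

        import pandas as pd

        REACTION_WINDOW = pd.Timedelta(minutes=20)  # assumed price reaction time
        THRESHOLD = 0.005                           # assumed +/-0.5% "stable" dead zone

        def label_articles(news: pd.DataFrame, prices: pd.DataFrame) -> pd.DataFrame:
            """news: columns [timestamp, ticker, text]; prices: columns [timestamp, ticker, price]."""
            rows = []
            for _, article in news.iterrows():
                quotes = prices[prices["ticker"] == article["ticker"]].sort_values("timestamp")
                before = quotes[quotes["timestamp"] <= article["timestamp"]].tail(1)
                after = quotes[quotes["timestamp"] >= article["timestamp"] + REACTION_WINDOW].head(1)
                if before.empty or after.empty:
                    continue  # no aligned quote on both sides; drop the article
                ret = after["price"].iloc[0] / before["price"].iloc[0] - 1.0
                label = "up" if ret > THRESHOLD else "down" if ret < -THRESHOLD else "stable"
                rows.append({"text": article["text"], "ticker": article["ticker"], "label": label})
            return pd.DataFrame(rows)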

    Automatic inference of causal reasoning chains from student essays

    While there has been an increasing focus on higher-level thinking skills arising from the Common Core Standards, many high-school and middle-school students struggle to combine and integrate information from multiple sources when writing essays. Writing is an important learning skill, and there is increasing evidence that writing about a topic develops a deeper understanding in the student. However, grading essays is time consuming for teachers, resulting in an increasing focus on shallower forms of assessment that are easier to automate, such as multiple-choice tests. Existing essay grading software has attempted to ease this burden but relies on shallow lexico-syntactic features and is unable to understand the structure or validity of a student's arguments or explanations. Without the ability to understand a student's reasoning processes, it is impossible to write automated formative assessment systems to assist students with improving their thinking skills through essay writing. In order to understand the arguments put forth in an explanatory essay in the science domain, we need a method of representing the causal structure of a piece of explanatory text. Psychologists use a representation called a causal model to represent a student's understanding of an explanatory text. This consists of a number of core concepts and a set of causal relations linking them into one or more causal chains, forming a causal model. In this thesis I present a novel system for automatically constructing causal models from student scientific essays using Natural Language Processing (NLP) techniques. The problem was decomposed into four sub-problems: assigning essay concepts to words, detecting causal relations between these concepts, resolving coreferences within each essay, and using the structure of the whole essay to reconstruct a causal model. Solutions to each of these sub-problems build upon the predictions from the solutions to earlier problems, forming a sequential pipeline of models. Designing a system in this way allows later models to correct false positive predictions from earlier (upstream) models. However, it also has the disadvantage that errors made in earlier models can propagate through the system, negatively impacting the later (downstream) models and limiting their accuracy. Producing robust solutions for the initial two sub-problems, detecting concepts and parsing causal relations between them, was therefore critical to building a robust system. A number of sequence labeling models were trained to classify the concepts associated with each word, with the most effective approach being a bidirectional recurrent neural network (RNN), a deep learning model commonly applied to word labeling problems. This is because the RNN used pre-trained word embeddings to generalize better to rarer words, and was able to use information from both ends of each sentence to infer a word's concept. The concepts predicted by this model were then used to develop causal relation parsing models for detecting causal connections between these concepts. A shift-reduce dependency parsing model was trained using the SEARN algorithm and out-performed a number of other approaches by better utilizing the structure of the problem and directly optimizing the error metric used. Two pre-trained coreference resolution systems were used to resolve coreferences within the essays; however, a word tagging model trained to predict anaphors, combined with a heuristic for determining the antecedent, out-performed both systems.
Finally, a model was developed for parsing a causal model from an entire essay, utilizing the solutions to the three previous problems. A beam search algorithm was used to produce multiple parses for each sentence, which in turn were combined to generate multiple candidate causal models for each student essay. A reranking algorithm was then used to select the optimal causal model from all of the generated candidates. An important contribution of this work is that it represents a system for parsing a complete causal model of a scientific essay from a student's written answer. Existing systems have been developed to parse individual causal relations, but no existing system attempts to parse a sequence of linked causal relations forming a causal model from an explanatory scientific essay. It is hoped that this work can lead to the development of more robust essay grading software and formative assessment tools, and can be extended to build solutions for extracting causality from text in other domains. In addition, I present two novel approaches for optimizing the micro-F1 score within the design of two of the algorithms studied: the dependency parser and the reranking algorithm. The dependency parser uses a custom cost function to estimate the impact of parsing mistakes on the overall micro-F1 score, while the reranking algorithm allows the micro-F1 score to be optimized by tuning the beam search parameter to balance recall and precision.
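
    To make the final stage concrete, the following is a minimal sketch, under my own assumptions rather than the thesis code, of how per-sentence beams can be combined into candidate causal models and reranked. The toy scorer and the example relation tuples are purely illustrative stand-ins for the learned reranker.

        from itertools import product

        def causal_model_candidates(sentence_beams):
            """sentence_beams: one beam per sentence; each beam is a list of
            candidate relation sets, e.g. [{("A", "causes", "B")}, set()]."""
            for combo in product(*sentence_beams):
                # Union the chosen relation set from each sentence into one model.
                yield set().union(*combo)

        def rerank(sentence_beams, score):
            """Return the candidate causal model with the highest reranker score."""
            return max(causal_model_candidates(sentence_beams), key=score)

        # Illustrative usage with a toy scorer that prefers larger causal models.
        beams = [
            [{("erosion", "causes", "sediment")}, set()],
            [{("sediment", "causes", "turbidity")}],
        ]
        best = rerank(beams, score=lambda model: len(model))

    Widening each beam admits more candidates (raising recall at the cost of precision), which is the lever the thesis tunes to optimize micro-F1.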

    Text detection and recognition in images and video sequences

    Text characters embedded in images and video sequences represent a rich source of information for content-based indexing and retrieval applications. However, these characters are difficult to detect and recognize due to their varying sizes, grayscale values, and complex backgrounds. This thesis investigates methods for building an efficient application system for detecting and recognizing text of any grayscale value embedded in images and video sequences. Both empirical image processing methods and statistical machine learning and modeling approaches are studied for two sub-problems: text detection and text recognition. Applying machine learning methods to text detection encounters difficulties due to character size, grayscale variations, and heavy computational cost. To overcome these problems, we propose a two-step localization/verification approach. The first step aims at quickly localizing candidate text lines, enabling the normalization of characters to a unique size. In the verification step, a trained support vector machine or multi-layer perceptron is applied to background-independent features to remove false alarms. Text recognition, even from the detected text lines, remains a challenging problem due to the variety of fonts and colors, the presence of complex backgrounds, and the short length of the text strings. Two schemes are investigated to address the text recognition problem: a bi-modal enhancement scheme and a multi-modal segmentation scheme. In the bi-modal scheme, we propose a set of filters to enhance the contrast of black and white characters and produce a better binarization before recognition. For more general cases, text recognition is addressed by a text segmentation step followed by a traditional optical character recognition (OCR) algorithm within a multi-hypotheses framework. In the segmentation step, we model the distribution of grayscale values of pixels using a Gaussian mixture model or a Markov random field. The resulting multiple segmentation hypotheses are post-processed by connected component analysis and a grayscale consistency constraint algorithm, and finally processed by OCR software. A selection algorithm based on language modeling and OCR statistics then chooses the text result from all the produced text strings. Additionally, methods for using the temporal information of video text are investigated. A Monte Carlo video text segmentation method is proposed for adapting the segmentation parameters across temporal text frames. Furthermore, a ROVER (Recognizer Output Voting Error Reduction) algorithm is studied for improving the final recognized text string by voting on the characters across temporal frames.
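
    The multi-hypotheses segmentation step lends itself to a short sketch. Below is a minimal illustration in Python of the Gaussian-mixture variant, under my own assumptions rather than the thesis implementation: pixel grayscale values are modeled with a mixture, and each component in turn is hypothesized to be the text layer, producing one binary segmentation per component that would then go through connected component analysis and OCR. The three-layer default is an assumption.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def segmentation_hypotheses(gray_image: np.ndarray, n_layers: int = 3):
            """Return one binary segmentation hypothesis per mixture component."""
            pixels = gray_image.reshape(-1, 1).astype(np.float64)
            gmm = GaussianMixture(n_components=n_layers, random_state=0).fit(pixels)
            labels = gmm.predict(pixels).reshape(gray_image.shape)
            # Each hypothesis assumes one component's pixels form the text layer.
            return [(labels == k).astype(np.uint8) * 255 for k in range(n_layers)]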

    Scene text localization and recognition in images and videos

    Scene text localization and recognition methods find all areas in an image or a video that would be considered text by a human, mark the boundaries of those areas, and output a sequence of characters associated with their content. They are used to process images and videos taken by a digital camera or a mobile phone and to "read" the content of each text area into a digital format, typically a list of Unicode character sequences, that can be processed in further applications. Three different methods for scene text localization and recognition were proposed in the course of the research, each one advancing the state of the art and improving the accuracy. The first method detects individual characters as Extremal Regions (ERs), where the probability of each ER being a character is estimated using novel features with O(1) complexity, and only ERs with locally maximal probability are selected across several image projections for the second stage, where the classification is improved using more computationally expensive features. This was the first published method to address the complete problem of scene text localization and recognition as a whole; all previous work in the literature focused solely on different subproblems. Secondly, a novel easy-to-implement stroke detector was proposed. The detector is significantly faster and produces significantly fewer false detections than the commonly used ER detector. It efficiently produces character stroke segmentations, which are exploited in a subsequent classification phase based on features effectively calculated as part of the segmentation process. Additionally, an efficient text clustering algorithm based on text direction voting is proposed, which, like the previous stages, is scale- and rotation-invariant and supports a wide variety of scripts and fonts. The third method exploits a deep-learning model, which is trained for both text detection and recognition in a single trainable pipeline. The method localizes and recognizes text in an image in a single feed-forward pass, is trained purely on synthetic data so it does not require expensive human annotations for training, and achieves state-of-the-art accuracy in end-to-end text recognition on two standard datasets, whilst being an order of magnitude faster than previous methods: the whole pipeline runs at 10 frames per second.
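
    The "locally maximal probability" selection in the first method can be sketched as a walk over the ER component tree. The following Python fragment uses hypothetical data structures of my own; it is a reading of the selection rule, not the published code: an ER is kept for the expensive second stage only if its first-stage character probability exceeds its parent's and is not exceeded by any child's.

        class ER:
            def __init__(self, prob, children=()):
                self.prob = prob          # first-stage P(region is a character)
                self.children = list(children)

        def locally_maximal(root):
            """Collect ERs whose probability is a local maximum on their branch."""
            selected = []
            def walk(node, parent_prob):
                child_max = max((c.prob for c in node.children), default=float("-inf"))
                if node.prob > parent_prob and node.prob >= child_max:
                    selected.append(node)
                for child in node.children:
                    walk(child, node.prob)
            walk(root, float("-inf"))
            return selected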

    Searching to Translate and Translating to Search: When Information Retrieval Meets Machine Translation

    With the adoption of web services in daily life, people have access to tremendous amounts of information, beyond any human's reading and comprehension capabilities. As a result, search technologies have become a fundamental tool for accessing information. Furthermore, the web contains information in multiple languages, introducing another barrier between people and information. Therefore, search technologies need to handle content written in multiple languages, which requires techniques to account for the linguistic differences. Information Retrieval (IR) is the study of search techniques, in which the task is to find material relevant to a given information need. Cross-Language Information Retrieval (CLIR) is a special case of IR in which the search takes place in a multi-lingual collection. Of course, it is not helpful to retrieve content in languages the user cannot understand. Machine Translation (MT) studies the translation of text from one language into another efficiently (within a reasonable amount of time) and effectively (fluently, and retaining the original meaning), which helps people understand what is written regardless of the source language. Putting these together, we observe that search and translation technologies are parts of an important user application, calling for a better integration of search (IR) and translation (MT), since these two technologies need to work together to produce high-quality output. In this dissertation, the main goal is to build better connections between IR and MT, for which we present solutions to two problems: searching to translate explores approximate search techniques for extracting bilingual data from multilingual Wikipedia collections to train better translation models; translating to search explores the integration of a modern statistical MT system into the cross-language search process. In both cases, our best-performing approach yielded improvements over strong baselines for a variety of language pairs. Finally, we propose a general architecture in which various components of IR and MT systems can be connected together into a feedback loop, with potential improvements to both search and translation tasks. We hope that the ideas presented in this dissertation will spur more interest in the integration of search and translation technologies.
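
    One common way MT output feeds a CLIR system, sketched below in Python under my own assumptions (the dissertation's actual integration may differ), is probabilistic structured query translation: each query term is expanded into weighted translation alternatives from an MT-derived lexical table, and document scores accumulate evidence over all retained translations. The pruning threshold and the toy scoring function are illustrative.

        from collections import defaultdict

        def translate_query(query_terms, translation_table, min_prob=0.05):
            """translation_table: {source_term: {target_term: probability}}."""
            weighted = defaultdict(float)
            for term in query_terms:
                for target, prob in translation_table.get(term, {}).items():
                    if prob >= min_prob:  # prune unlikely translations
                        weighted[target] += prob
            return dict(weighted)

        def score(document_terms, weighted_query):
            """Toy retrieval score: sum of translation weights of matched terms."""
            return sum(weighted_query.get(t, 0.0) for t in document_terms)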

    Biologically inspired evolutionary temporal neural circuits

    Biological neural networks have always motivated the creation of new artificial neural networks, in this case a new autonomous temporal neural network system. Among the more challenging problems of temporal neural networks are the design and incorporation of short- and long-term memories, as well as the choice of network topology and training mechanism. In general, delayed copies of network signals can form short-term memory (STM), providing a limited temporal history of events similar to FIR filters, whereas the synaptic connection strengths as well as delayed feedback loops (ER circuits) can constitute longer-term memories (LTM). This dissertation introduces a new general evolutionary temporal neural network framework (GETnet) through the automatic design of arbitrary neural networks with STM and LTM. GETnet is a step towards the realization of general intelligent systems that require minimal or no human intervention and can be applied to a broad range of problems. GETnet utilizes nonlinear moving average/autoregressive nodes and sub-circuits that are trained by enhanced gradient descent and evolutionary search over architecture, synaptic delay, and synaptic weight spaces. The mixture of Lamarckian and Darwinian evolutionary mechanisms facilitates the Baldwin effect and speeds up the hybrid training. The ability to evolve arbitrary adaptive time-delay connections enables GETnet to find novel answers to many classification and system identification tasks expressed in the general form of desired multidimensional input and output signals. Simulations using the Mackey-Glass chaotic time series and fingerprint perspiration-induced temporal variations demonstrate the above capabilities of GETnet.
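
    To ground the idea of a moving-average node with evolvable delays, here is a deliberately small Python sketch, illustrative only and not GETnet itself: an FIR synapse whose delayed taps act as short-term memory, and an elitist evolutionary loop that searches over tap counts and weights on a toy system identification task. Mutation here stands in for GETnet's hybrid gradient-descent/evolutionary training.

        import numpy as np

        def fir_node(signal, weights):
            """y[t] = sum_k w[k] * x[t-k]; the delayed taps are the node's STM."""
            return np.convolve(signal, weights, mode="full")[: len(signal)]

        def fitness(weights, signal, target):
            return -np.mean((fir_node(signal, weights) - target) ** 2)

        rng = np.random.default_rng(0)
        signal = rng.standard_normal(200)
        target = 0.6 * np.roll(signal, 1) + 0.3 * np.roll(signal, 2)  # toy task

        # Population of FIR nodes with 2 to 5 taps (architecture/delay search).
        population = [rng.standard_normal(rng.integers(2, 6)) for _ in range(20)]
        for generation in range(50):
            population.sort(key=lambda w: fitness(w, signal, target), reverse=True)
            parents = population[:10]                       # elitist selection
            children = [p + 0.1 * rng.standard_normal(p.shape) for p in parents]
            population = parents + children
        best = population[0]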

    Connectionist-Symbolic Machine Intelligence using Cellular Automata based Reservoir-Hyperdimensional Computing

    We introduce a novel framework of reservoir computing that is capable of both connectionist machine intelligence and symbolic computation. A cellular automaton is used as the reservoir of dynamical systems. Input is randomly projected onto the initial conditions of the automaton cells, and nonlinear computation is performed on the input via application of a rule in the automaton for a period of time. The evolution of the automaton creates a space-time volume of the automaton state space, and this volume is used as the reservoir. The proposed framework is capable of long short-term memory, and it requires orders of magnitude less computation than Echo State Networks. We prove that the cellular automaton reservoir holds a distributed representation of attribute statistics, which provides more effective computation than a local representation. It is possible to estimate the kernel for linear cellular automata via metric learning, which enables much more efficient distance computation in a support vector machine framework. Also, binary reservoir feature vectors can be combined using Boolean operations, as in hyperdimensional computing, paving a direct way for concept building and symbolic processing.
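
    A minimal sketch of the reservoir construction follows, in Python with assumptions throughout (the specific rule, widths, and projection are my choices for illustration, not the paper's settings): input bits are randomly projected onto the initial state of an elementary cellular automaton, the automaton is iterated, and the flattened space-time volume serves as the feature vector. Binary feature vectors are then bound with XOR, in the style of hyperdimensional computing.

        import numpy as np

        RULE = 90  # a linear elementary CA rule, chosen for illustration
        rng = np.random.default_rng(0)

        def ca_step(state, rule=RULE):
            left, right = np.roll(state, 1), np.roll(state, -1)
            idx = 4 * left + 2 * state + right      # 3-bit neighborhood index
            table = (rule >> np.arange(8)) & 1      # rule number as lookup table
            return table[idx]

        def reservoir_features(input_bits, width=256, steps=32):
            projection = rng.permutation(width)[: len(input_bits)]  # random sites
            state = np.zeros(width, dtype=np.uint8)
            state[projection] = input_bits
            volume = [state]
            for _ in range(steps):
                state = ca_step(state)
                volume.append(state)
            return np.concatenate(volume)           # space-time volume as features

        # Hyperdimensional-style binding of two reservoir vectors with XOR.
        a = reservoir_features(np.array([1, 0, 1, 1], dtype=np.uint8))
        b = reservoir_features(np.array([0, 1, 1, 0], dtype=np.uint8))
        bound = np.bitwise_xor(a, b)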

    15th SC@RUG 2018 proceedings 2017-2018
