29,846 research outputs found

    Application of Textual Feature Extraction to Corporate Bankruptcy Risk Assessment

    Get PDF
    The inception of the Internet in the late twentieth century made it possible to generate huge volumes of data from a multitude of sources in very short periods of time. However, most of this data is unstructured. According to recent research, unstructured data carries more comprehensive, effective and practical information than structured data because of its descriptive character, especially in finance, healthcare, manufacturing and other domains. It is anticipated that effective data mining technology can support the development of more accurate predictive models, decision-support platforms and man-machine interactive systems built on unstructured data. This thesis focuses on applying a text mining system known as TP2K (Text Pattern to Knowledge System), developed by my supervisor Professor Andrew K.C. Wong, to the finance industry. More specifically, the method proposed in this thesis is a concept-based textual feature extraction, built on TP2K, for corporate bankruptcy risk assessment. Bankruptcy risk assessment estimates the risk that a corporation will go bankrupt; it is linked to enterprise sustainability assessment, investment portfolio optimization and corporate management. Over the years, many models have been built from numerical and structured data (e.g. financial indicators and ratios), yet no model has adequately leveraged textual data for quantitative analysis in corporate bankruptcy risk assessment. Certain critical information, such as an enterprise's strategic direction and corporate governance, is reflected only in textual data (e.g. annual financial reports). It has recently been reported that combining textual and numeric features yields a more accurate assessment of corporate bankruptcy. Nevertheless, extracting features from textual data remains difficult and still requires considerable human effort. According to the existing literature, there are no obvious criteria for textual feature mining and extraction in finance, owing to the diversity of objectives and interests, so domain experts remain essential in the industry. Current textual feature extraction methods in finance fall into two distinct types: the first relies on a comprehensive handcrafted dictionary of keywords that must be continuously updated by hand; the second relies on data mining techniques (e.g. high-frequency words). The former is time-consuming, while the latter usually produces results that are ambiguous, irrelevant or hard for practitioners to interpret. In this thesis, we (my supervisor and I) propose a method, concept-based textual feature extraction based on TP2K, for corporate bankruptcy risk assessment. Compared to existing methods, it extracts and mines textual features from financial reports more accurately and succinctly, allows industrial interpretation in practice with limited human participation, and is semi-automatic and interactive.
Its algorithmic procedure is, briefly: (1) apply the linear-time, language-independent TP2K system to discover “Word, Term and Phrase” (WTP) patterns in the text without relying on explicit prior knowledge or training; (2) apply a WTP-directed search algorithm in TP2K to locate financial attribute names and their attribute values in the surrounding context, producing attribute-value pairs (AVPs) that build part of the Domain Knowledge Base (DKB) in support of predictive analysis of corporate bankruptcy risk. At the outset, domain experts still play a major role in building the DKB; as more user-supplied domain information is integrated, the system extracts and validates the relevant information for bankruptcy risk assessment increasingly automatically, with limited expert involvement. In this thesis, AVPs are used in corporate risk assessment to produce more robust and less biased textual features, allowing experts to acquire and organize individual selection rules comprehensively for conventional machine learning. To validate the proposed method, experiments on financial data were conducted: a collection of corporate annual reports containing textual and numeric information was used to evaluate corporate risk assessment in a semi-automatic manner. The extracted AVPs were first converted to binarized textual features according to criteria from the finance field, then combined with related numerical features (financial ratios) so that conventional machine learning methods could construct a predictive model for corporate bankruptcy. The experimental results demonstrated effective two-year-ahead (T-2) prediction, outperforming models based only on numeric features under 10-fold cross-validation, and all discovered features, numeric or textual, were consistent with industry standards. We therefore believe the proposed method marks an important milestone for bankruptcy assessment in practice and could ultimately support trading advice for investors.
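To make the evaluation pipeline concrete, the following minimal sketch (not the TP2K system itself) shows how binarized AVP-derived textual features could be merged with numeric financial ratios and scored under 10-fold cross-validation, mirroring the setup described above. The file names, column names and the random-forest classifier are illustrative assumptions.

```python
# Minimal sketch (not TP2K): combine binarized AVP-derived textual features
# with numeric financial ratios and evaluate a bankruptcy classifier under
# 10-fold cross-validation. File names, columns and the classifier are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical inputs: one row per firm, features taken two years before the outcome (T-2).
textual = pd.read_csv("avp_binary_features.csv", index_col="firm_id")   # 0/1 AVP flags
ratios  = pd.read_csv("financial_ratios.csv", index_col="firm_id")      # numeric ratios
labels  = pd.read_csv("bankruptcy_labels.csv", index_col="firm_id")["bankrupt"]

X = textual.join(ratios, how="inner")          # fuse textual and numeric features
y = labels.loc[X.index]

clf = RandomForestClassifier(n_estimators=300, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"10-fold AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Dropping the textual columns from X and rerunning the same loop gives the numeric-only baseline the abstract compares against.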

    Towards a New Science of a Clinical Data Intelligence

    Full text link
    In this paper we define Clinical Data Intelligence as the analysis of data generated in the clinical routine with the goal of improving patient care. We define a science of a Clinical Data Intelligence as a data analysis that permits the derivation of scientific, i.e., generalizable and reliable results. We argue that a science of a Clinical Data Intelligence is sensible in the context of a Big Data analysis, i.e., with data from many patients and with complete patient information. We discuss that Clinical Data Intelligence requires the joint efforts of knowledge engineering, information extraction (from textual and other unstructured data), and statistics and statistical machine learning. We describe some of our main results as conjectures and relate them to a recently funded research project involving two major German university hospitals. Comment: NIPS 2013 Workshop: Machine Learning for Clinical Data Analysis and Healthcare, 2013

    Econometrics meets sentiment : an overview of methodology and applications

    Get PDF
    The advent of massive amounts of textual, audio, and visual data has spurred the development of econometric methodology to transform qualitative sentiment data into quantitative sentiment variables, and to use those variables in an econometric analysis of the relationships between sentiment and other variables. We survey this emerging research field and refer to it as sentometrics, which is a portmanteau of sentiment and econometrics. We provide a synthesis of the relevant methodological approaches, illustrate with empirical results, and discuss useful software.
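As a rough illustration of the sentiment-to-econometrics workflow the survey covers (not the sentometrics toolchain itself), the sketch below aggregates article-level sentiment scores into a daily index and regresses a target series on lagged sentiment with OLS; the file names, column names and the simple specification are assumptions.

```python
# Sketch of a sentiment-to-econometrics pipeline: aggregate article-level
# sentiment into a daily index, then regress a target series on lagged
# sentiment with OLS. File names, columns and specification are assumptions.
import pandas as pd
import statsmodels.api as sm

articles = pd.read_csv("scored_articles.csv", parse_dates=["date"])              # date, sentiment in [-1, 1]
target   = pd.read_csv("daily_returns.csv", parse_dates=["date"], index_col="date")["ret"]

# Daily sentiment index: mean article sentiment per day.
index = articles.groupby("date")["sentiment"].mean().rename("sent")

df = pd.concat([target, index], axis=1).dropna()
df["sent_lag1"] = df["sent"].shift(1)            # use yesterday's sentiment
df = df.dropna()

model = sm.OLS(df["ret"], sm.add_constant(df[["sent_lag1"]])).fit()
print(model.summary())
```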

    Bank Networks from Text: Interrelations, Centrality and Determinants

    Get PDF
    In the wake of the still ongoing global financial crisis, bank interdependencies have come into focus in trying to assess linkages among banks and systemic risk. To date, such analysis has largely been based on numerical data. By contrast, this study attempts to gain further insight into bank interconnections by tapping into financial discourse. We present a text-to-network process, which has its basis in co-occurrences of bank names and can be analyzed quantitatively and visualized. To quantify bank importance, we propose an information centrality measure to rank and assess trends of bank centrality in discussion. For qualitative assessment of bank networks, we put forward a visual, interactive interface for better illustrating network structures. We illustrate the text-based approach on European Large and Complex Banking Groups (LCBGs) during the ongoing financial crisis by quantifying bank interrelations and centrality from discussion in 3M news articles, spanning 2007Q1 to 2014Q3. Comment: Quantitative Finance, forthcoming in 2015
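The text-to-network step can be approximated with a simple co-occurrence construction: banks mentioned in the same article are linked, and a standard centrality score is computed per bank. The sketch below is illustrative only; the bank list and articles are placeholders, and degree centrality stands in for the paper's information centrality measure.

```python
# Sketch of a bank co-occurrence network from news text (not the paper's
# exact method): banks mentioned in the same article become linked, edge
# weights count co-mentions, and degree centrality stands in for the
# paper's information centrality measure.
from itertools import combinations
import networkx as nx

banks = ["Deutsche Bank", "BNP Paribas", "Santander", "UniCredit"]   # assumed name list
articles = [
    "Deutsche Bank and BNP Paribas report exposure to ...",
    "Santander, UniCredit and BNP Paribas discuss funding ...",
]

G = nx.Graph()
G.add_nodes_from(banks)
for text in articles:
    mentioned = [b for b in banks if b in text]
    for a, b in combinations(mentioned, 2):
        w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=w)

# Rank banks by centrality in the discussion network.
for bank, score in sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1]):
    print(f"{bank}: {score:.2f}")
```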

    A Novel Distributed Representation of News (DRNews) for Stock Market Predictions

    Full text link
    In this study, a novel Distributed Representation of News (DRNews) model is developed and applied in deep learning-based stock market predictions. With the merit of integrating contextual information and cross-documental knowledge, the DRNews model creates news vectors that describe both the semantic information and potential linkages among news events through an attributed news network. Two stock market prediction tasks, namely short-term stock movement prediction and stock crisis early warning, are implemented in the framework of an attention-based Long Short-Term Memory (LSTM) network. The results suggest that DRNews substantially improves both tasks compared with five baseline news embedding models. Further, the attention mechanism suggests that short-term stock trends and stock market crises are both influenced by daily news, with the former responding more strongly to information related to the stock market per se, whilst the latter draws more on news about the banking sector and economic policies. Comment: 25 pages
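A minimal sketch of an attention-based LSTM classifier of the kind described is given below. It is not the DRNews model: random vectors stand in for the news embeddings, and the sequence length, dimensions and layer sizes are assumptions.

```python
# Minimal sketch of an attention-based LSTM for short-term stock movement
# prediction. Random vectors stand in for DRNews news embeddings; sequence
# length, dimensions and layer sizes are assumptions, not the paper's setup.
import numpy as np
from tensorflow.keras import layers, Model

seq_len, emb_dim, units = 10, 64, 32       # assumed: 10 days of 64-dim news vectors

inputs  = layers.Input(shape=(seq_len, emb_dim))
h       = layers.LSTM(units, return_sequences=True)(inputs)    # hidden state per day
scores  = layers.Dense(1, activation="tanh")(h)                # attention score per day
weights = layers.Softmax(axis=1)(scores)                       # normalize over time
context = layers.Flatten()(layers.Dot(axes=1)([weights, h]))   # attention-weighted sum
outputs = layers.Dense(1, activation="sigmoid")(context)       # P(price moves up)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data standing in for news-vector sequences and movement labels.
X = np.random.rand(256, seq_len, emb_dim).astype("float32")
y = np.random.randint(0, 2, size=(256, 1))
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```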

    Social media analytics: a survey of techniques, tools and platforms

    Get PDF
    This paper is written for (social science) researchers seeking to analyze the wealth of social media now available. It presents a comprehensive review of software tools for social networking media, wikis, really simple syndication feeds, blogs, newsgroups, chat and news feeds. For completeness, it also includes introductions to social media scraping, storage, data cleaning and sentiment analysis. Although principally a review, the paper also provides a methodology and a critique of social media tools. Analyzing social media, in particular Twitter feeds for sentiment analysis, has become a major research and business activity due to the availability of web-based application programming interfaces (APIs) provided by Twitter, Facebook and news services. This has led to an ‘explosion’ of data services, software tools for scraping and analysis, and social media analytics platforms. It is also a research area undergoing rapid change and evolution due to commercial pressures and the potential for using social media data for computational (social science) research. Using a simple taxonomy, this paper provides a review of leading software tools and how to use them to scrape, cleanse and analyze the spectrum of social media. In addition, it discusses the need for an experimental computational environment for social media research and presents, as an illustration, the system architecture of a social media (analytics) platform built by University College London. The principal contribution of this paper is to provide an overview (including code fragments) for scientists seeking to utilize social media scraping and analytics either in their research or business. The data retrieval techniques presented in this paper are valid at the time of writing (June 2014), but they are subject to change since social media data scraping APIs are rapidly evolving.
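In the spirit of the code fragments the paper provides, the short sketch below scores a handful of already-collected posts with NLTK's VADER sentiment analyzer; retrieval from the platform APIs is omitted since those APIs have changed since 2014, and the example posts are placeholders.

```python
# Sketch: score already-collected social media posts with VADER (NLTK).
# Retrieval from platform APIs is omitted; the posts below are placeholders.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

posts = [
    "Markets rallying today, great earnings across the board!",
    "Another bank downgrade... this keeps getting worse.",
]

sia = SentimentIntensityAnalyzer()
for post in posts:
    scores = sia.polarity_scores(post)    # neg/neu/pos plus compound score in [-1, 1]
    print(f"{scores['compound']:+.2f}  {post}")
```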