Application of Textual Feature Extraction to Corporate Bankruptcy Risk Assessment
Since the inception of the Internet in the late twentieth century, it has become possible to generate huge volumes of data from multitudinous sources in very short periods of time. However, most of this data is presented in an unstructured format. According to recent research, unstructured data contains more comprehensive, effective and practical information than structured data because of its descriptive character, especially in finance, healthcare, manufacturing and other domains. It is anticipated that data mining technology can be applied effectively to unstructured data to develop more accurate predictive models, decision-support platforms and human-machine interactive systems.
This thesis applies a text mining system known as TP2K (Text Pattern to Knowledge System), developed by my supervisor Professor Andrew K.C. Wong, to the finance industry. More specifically, this thesis proposes a concept-based textual feature extraction method based on TP2K for corporate bankruptcy risk assessment, i.e. assessing the risk that a corporation will go bankrupt. Such assessment is linked to enterprise sustainability assessment, investment portfolio optimization and corporate management. Over the years, various models have been built from numerical and structured data (e.g. financial indicators and ratios), yet no model has adequately leveraged textual data for quantitative analysis of corporate bankruptcy risk. Note that certain critical information, such as an enterprise's strategic future direction and corporate governance, is reflected only in textual data (e.g. annual financial reports). Recently, it has been reported that combining textual and numeric features renders a more accurate assessment of corporate bankruptcy.
Nevertheless, extracting features from textual data remains difficult, since it still requires considerable human effort. According to the existing literature, there are no obvious criteria for textual feature mining and extraction in finance, owing to the diversity of objectives and interests; domain experts therefore remain essential in the industry. Current textual feature extraction methods in finance fall into two distinct types. The first relies on a comprehensive handcrafted dictionary of keywords that must be continuously updated by hand. The second relies on data mining technology (e.g. high-frequency words). The former is time-consuming, while the latter usually produces results that are ambiguous, irrelevant or hard for industry practitioners to interpret.
In this thesis, we (my supervisor and I) propose a method known as concept-based textual feature extraction based on TP2K for corporate bankruptcy risk assessment. Compared to existing methods, it extracts and mines textual features from financial reports more accurately and succinctly, supports industrial interpretation in practice, and requires limited human participation; it is semi-automatic and interactive. Its algorithmic procedure is briefly as follows: (1) apply the linear-time and language-independent TP2K system to discover “Word, Term and Phrase” (WTP) patterns in text data without relying on explicit prior knowledge or training; (2) apply a WTP-directed search algorithm in TP2K to find appropriate financial attribute names and their values in the text context, yielding attribute-value pairs (AVPs) that build part of a Domain Knowledge Base (DKB) in support of predictive analysis of corporate bankruptcy risk. At the outset, domain experts still play a major role in building the DKB; as more user-supplied domain information is integrated into the DKB, the system becomes increasingly able to extract and validate related information for bankruptcy risk assessment with limited expert involvement. In this thesis, AVPs are used in corporate risk assessment to render more robust and less biased textual features, allowing experts to acquire and organize individual selection rules in a comprehensive manner for traditional machine learning processing.
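The thesis does not give TP2K's internals, but the core idea of step (2), pairing a known financial attribute name with a nearby value in report text, can be sketched with a simple proximity pattern. The attribute vocabulary and the regular expression below are illustrative assumptions, not the actual WTP-directed search:

```python
import re

# Illustrative attribute vocabulary; a real DKB would be built
# interactively with domain experts.
ATTRIBUTES = ["net income", "total debt", "operating cash flow"]

def extract_avps(text):
    """Pair each known attribute name with the nearest following number."""
    avps = {}
    for attr in ATTRIBUTES:
        # Match the attribute name, then up to 20 non-digit characters,
        # then a numeric value, e.g. "net income of $1.2 million".
        pattern = re.compile(
            re.escape(attr) + r"\D{0,20}?([\d][\d,\.]*)", re.IGNORECASE
        )
        m = pattern.search(text)
        if m:
            avps[attr] = float(m.group(1).replace(",", ""))
    return avps

report = ("The company reported net income of $1.2 million "
          "while total debt rose to 450,000.")
print(extract_avps(report))
```

A real system would also normalize units ("million" vs. raw figures) and decide which occurrence of an attribute is authoritative; both are glossed over here.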
To validate the proposed method, experiments on financial data were conducted. A collection of corporate annual reports containing textual and numeric information was used to evaluate corporate risk assessment in a semi-automatic manner. The extracted AVP data was first converted to binarized textual features according to criteria from the finance field, then integrated with related numerical features (financial ratios) so that traditional machine learning technologies could construct a predictive model for corporate bankruptcy assessment. The experimental results demonstrated effective two-year-ahead (T-2) prediction, outperforming prediction models based only on numeric features under 10-fold cross-validation. At the same time, we observed that all features discovered, numeric or textual, were consistent with industry standards. Hence, we believe the proposed method marks an important milestone for bankruptcy assessment in practice, and is potentially useful for providing trading advice to investors in the future.
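The feature-construction step described above, binarizing AVPs against finance-field thresholds and concatenating them with financial ratios before cross-validation, can be sketched as follows. The criteria, values and fold logic are hypothetical stand-ins, not the thesis's actual pipeline:

```python
import random

def binarize(avps, criteria):
    """Turn attribute-value pairs into 0/1 textual features according
    to per-attribute thresholds."""
    return [1 if avps.get(name, 0.0) > threshold else 0
            for name, threshold in criteria]

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

# Hypothetical criteria and one firm's extracted AVPs.
criteria = [("total debt", 1e6), ("net income", 0.0)]
avps = {"total debt": 2.5e6, "net income": -3.0e4}
ratios = [0.42, 1.7]  # e.g. debt-to-equity, current ratio
features = binarize(avps, criteria) + ratios
print(features)
```

Any standard classifier can then be trained on such combined vectors inside each fold.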
Towards a New Science of a Clinical Data Intelligence
In this paper we define Clinical Data Intelligence as the analysis of data
generated in the clinical routine with the goal of improving patient care. We
define a science of a Clinical Data Intelligence as a data analysis that
permits the derivation of scientific, i.e., generalizable and reliable results.
We argue that a science of a Clinical Data Intelligence is sensible in the
context of a Big Data analysis, i.e., with data from many patients and with
complete patient information. We discuss that Clinical Data Intelligence
requires the joint efforts of knowledge engineering, information extraction
(from textual and other unstructured data), and statistics and statistical
machine learning. We describe some of our main results as conjectures and
relate them to a recently funded research project involving two major German
university hospitals.
Comment: NIPS 2013 Workshop: Machine Learning for Clinical Data Analysis and Healthcare, 2013
Econometrics meets sentiment: an overview of methodology and applications
The advent of massive amounts of textual, audio, and visual data has spurred the development of econometric methodology to transform qualitative sentiment data into quantitative sentiment variables, and to use those variables in an econometric analysis of the relationships between sentiment and other variables. We survey this emerging research field and refer to it as sentometrics, which is a portmanteau of sentiment and econometrics. We provide a synthesis of the relevant methodological approaches, illustrate with empirical results, and discuss useful software.
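As a concrete illustration of turning qualitative text into a quantitative sentiment variable, the sketch below scores documents against a tiny word list and averages the scores per period. The lexicon and documents are toy assumptions; sentometrics work uses curated dictionaries or learned models:

```python
from collections import defaultdict

LEXICON = {"gain": 1, "growth": 1, "loss": -1, "crisis": -1}

def doc_sentiment(text):
    """Average lexicon score of the matched words in one document."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def sentiment_index(dated_docs):
    """Aggregate document scores into one sentiment value per period."""
    by_period = defaultdict(list)
    for period, text in dated_docs:
        by_period[period].append(doc_sentiment(text))
    return {p: sum(s) / len(s) for p, s in by_period.items()}

docs = [("2020Q1", "strong growth and gain"),
        ("2020Q1", "mild loss"),
        ("2020Q2", "crisis deepens")]
print(sentiment_index(docs))
```

The resulting per-period index can then enter a regression like any other economic variable.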
Bank Networks from Text: Interrelations, Centrality and Determinants
In the wake of the still ongoing global financial crisis, bank
interdependencies have come into focus in trying to assess linkages among banks
and systemic risk. To date, such analysis has largely been based on numerical
data. By contrast, this study attempts to gain further insight into bank
interconnections by tapping into financial discourse. We present a
text-to-network process, which has its basis in co-occurrences of bank names
and can be analyzed quantitatively and visualized. To quantify bank importance,
we propose an information centrality measure to rank and assess trends of bank
centrality in discussion. For qualitative assessment of bank networks, we put
forward a visual, interactive interface for better illustrating network
structures. We illustrate the text-based approach on European Large and Complex
Banking Groups (LCBGs) during the ongoing financial crisis by quantifying bank
interrelations and centrality from discussion in 3M news articles, spanning
2007Q1 to 2014Q3.
Comment: Quantitative Finance, forthcoming in 201
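The text-to-network step can be approximated in a few lines: count co-occurrences of bank names within the same article, then rank banks by total edge weight. Weighted degree is used here as a simple stand-in for the paper's information centrality measure, and the articles and bank names are invented:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(articles, banks):
    """Count how often each pair of bank names appears in the same article."""
    edges = Counter()
    for text in articles:
        present = sorted(b for b in banks if b in text)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges

def weighted_degree(edges):
    """Rank nodes by total co-occurrence weight."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree

articles = [
    "Bank A and Bank B discussed a merger",
    "Bank A extends a credit line to Bank C",
    "Bank B posts quarterly results",
]
edges = cooccurrence_network(articles, ["Bank A", "Bank B", "Bank C"])
print(weighted_degree(edges).most_common())
```

At the scale of 3M articles, named-entity disambiguation (aliases, tickers) becomes the hard part; plain substring matching as above would not suffice.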
A Novel Distributed Representation of News (DRNews) for Stock Market Predictions
In this study, a novel Distributed Representation of News (DRNews) model is
developed and applied in deep learning-based stock market predictions. With the
merit of integrating contextual information and cross-documental knowledge, the
DRNews model creates news vectors that describe both the semantic information
and potential linkages among news events through an attributed news network.
Two stock market prediction tasks, namely the short-term stock movement
prediction and stock crises early warning, are implemented in the framework of
the attention-based Long Short-Term Memory (LSTM) network. It is suggested
that DRNews substantially enhances the results of both tasks compared with
five baseline news embedding models. Further, the attention mechanism
suggests that short-term stock trends and stock market crises are both
influenced by daily news, with the former demonstrating more critical
responses to information related to the stock market {\em per se}, whilst
the latter is more concerned with the banking sector and economic policies.
Comment: 25 pages
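The attention pooling that produces the per-day influence weights can be illustrated without the full LSTM: score each daily news vector against a query state, softmax the scores, and take the weighted sum. The vectors and query below are made up, and a real attention layer would use learned projections:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(news_vectors, query):
    """Dot-product attention: weight each day's news vector by its
    relevance to the query state and pool them."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in news_vectors]
    weights = softmax(scores)
    dim = len(news_vectors[0])
    pooled = [sum(w * vec[i] for w, vec in zip(weights, news_vectors))
              for i in range(dim)]
    return weights, pooled

days = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three daily news embeddings
weights, pooled = attend(days, query=[1.0, 0.0])
print(weights)
```

Inspecting such weights over time is what allows attributing a prediction to particular days' news.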
Social media analytics: a survey of techniques, tools and platforms
This paper is written for (social science) researchers seeking to analyze the wealth of social media now available. It presents a comprehensive review of software tools for social networking media, wikis, Really Simple Syndication (RSS) feeds, blogs, newsgroups, chat and news feeds. For completeness, it also includes introductions to social media scraping, storage, data cleaning and sentiment analysis. Although principally a review, the paper also provides a methodology and a critique of social media tools. Analyzing social media, in particular Twitter feeds for sentiment analysis, has become a major research and business activity due to the availability of web-based application programming interfaces (APIs) provided by Twitter, Facebook and news services. This has led to an ‘explosion’ of data services, software tools for scraping and analysis, and social media analytics platforms. It is also a research area undergoing rapid change and evolution due to commercial pressures and the potential for using social media data for computational (social science) research. Using a simple taxonomy, this paper provides a review of leading software tools and how to use them to scrape, cleanse and analyze the spectrum of social media. In addition, it discusses the requirements of an experimental computational environment for social media research and presents, as an illustration, the system architecture of a social media (analytics) platform built by University College London. The principal contribution of this paper is to provide an overview (including code fragments) for scientists seeking to utilize social media scraping and analytics either in their research or business. The data retrieval techniques presented in this paper are valid at the time of writing (June 2014), but they are subject to change since social media data scraping APIs are rapidly evolving.
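Of the pipeline stages the survey covers, data cleaning and lexicon-based sentiment scoring are easy to sketch. The cleaning rules and the two word lists below are minimal assumptions, not any particular tool's behavior:

```python
import re

def clean_tweet(text):
    """Strip URLs, @-mentions and hashtag markers, then normalize case."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = text.replace("#", "")
    return " ".join(text.lower().split())

POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "terrible", "hate"}

def polarity(text):
    """Count positive minus negative lexicon hits in a cleaned tweet."""
    words = clean_tweet(text).split()
    return (sum(w in POSITIVE for w in words)
            - sum(w in NEGATIVE for w in words))

print(polarity("Love the new release! http://t.co/x #great @devteam"))
```

Production systems replace the word lists with trained classifiers, but the scrape-clean-score shape stays the same.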