Realizing EDGAR: eliminating information asymmetries through artificial intelligence analysis of SEC filings
The U.S. Securities and Exchange Commission (SEC) maintains a publicly accessible database of all required filings of all publicly traded companies. Known as EDGAR (Electronic Data Gathering, Analysis, and Retrieval), this database contains documents ranging from annual reports of major companies to personal disclosures of senior managers. However, the common user, and particularly the retail investor, is overwhelmed by the deluge of information rather than empowered by it. EDGAR as it currently functions entrenches the information asymmetry between these retail investors and the large financial institutions with which they often trade. With substantial research staffs and budgets, coupled with an industry standard of “playing both sides” of a transaction, these investors “in the know” lead price movements while others must follow.
In general, this thesis applies recent technological advancements to the development of software tools that derive valuable insights from EDGAR documents efficiently. While numerous such commercial products currently exist, all come with significant price tags and many still rely on substantial human involvement in deriving such insights. Recent years, however, have seen an explosion in the fields of Machine Learning (ML) and Natural Language Processing (NLP), which show promise in automating many of these functions with greater efficiency. ML aims to develop software that learns parameters from large datasets, as opposed to traditional software that merely applies a programmer’s logic. NLP aims to read, understand, and generate language naturally, an area where recent ML advancements have proven particularly adept.
Specifically, this thesis serves as an exploratory study in applying recent advancements in ML and NLP to the vast range of documents contained in the EDGAR database. While algorithms will likely never replace the hordes of research analysts that now saturate securities markets nor the advantages that accrue to large and diverse trading desks, they do hold the potential to provide small yet significant insights at little cost.
This study first examines methods for document acquisition from EDGAR, with a focus on a baseline efficiency sufficient for the real-time trading needs of market participants. Next, it applies recent advancements in ML and NLP, specifically recurrent neural networks, to the task of standardizing financial statements across different filers. Finally, the conclusion contextualizes these findings in an environment of continued technological and commercial evolution.
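As a concrete illustration of the acquisition step, the sketch below pulls one of EDGAR's published quarterly form indexes and filters it for 10-K filings. This is not the thesis's pipeline, just a minimal baseline: the index URL follows the SEC's documented full-index layout, and the User-Agent contact string is a placeholder (the SEC asks automated clients to identify themselves).

```python
# Minimal sketch of EDGAR document acquisition: download a quarterly form
# index and list the 10-K filings it contains. The fixed-width form.idx
# layout (Form Type, Company Name, CIK, Date Filed, File Name) follows the
# SEC's published full-index files; the User-Agent value is a placeholder.
import requests

INDEX_URL = "https://www.sec.gov/Archives/edgar/full-index/2023/QTR1/form.idx"
HEADERS = {"User-Agent": "Example Research research@example.com"}

def list_10k_filings(index_url: str = INDEX_URL) -> list[dict]:
    """Return company, CIK, filing date, and document URL for each 10-K."""
    text = requests.get(index_url, headers=HEADERS, timeout=30).text
    filings = []
    for line in text.splitlines():
        # Exact form type "10-K" (the trailing space excludes 10-K/A etc.).
        if not line.startswith("10-K "):
            continue
        parts = line.split()
        filings.append({
            "company": " ".join(parts[1:-3]),   # names may contain spaces
            "cik": parts[-3],
            "date_filed": parts[-2],
            "url": "https://www.sec.gov/Archives/" + parts[-1],
        })
    return filings

if __name__ == "__main__":
    for filing in list_10k_filings()[:5]:
        print(filing["date_filed"], filing["company"], filing["url"])
```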
An SEC 10-K XML Schema Extension to Extract Cyber Security Risks
The text sections of the SEC-mandated annual reports abound with important corporate operational information, but they are hard to manipulate in bulk because of the varying formats used by the submitting companies. Researchers and private entities have demonstrated the difficulties inherent in extracting and accumulating certain textual portions of these reports. This paper proposes an XML schema, following a specific DTD, for the 10-K (and 10-Q) reports. The ease of manipulating the reports' text sections using simple computer commands is demonstrated.
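The paper's DTD is not reproduced in the abstract, so the snippet below is only a hypothetical illustration of the kind of "simple computer commands" such a schema would enable; the element names (tenK, riskFactors, item) are invented placeholders, not the proposed schema.

```python
# Hypothetical sketch: once 10-K text sections are wrapped in a fixed XML
# schema, extracting a section in bulk reduces to a standard-library query.
# The tag names below (tenK, riskFactors, item) are illustrative placeholders,
# not the paper's actual DTD.
import xml.etree.ElementTree as ET

SAMPLE = """
<tenK cik="0000320193" fiscalYear="2016">
  <riskFactors>
    <item>Cyber attacks could compromise customer data.</item>
    <item>Supply chain disruptions may reduce margins.</item>
  </riskFactors>
</tenK>
"""

root = ET.fromstring(SAMPLE)
# One path expression pulls every risk-factor paragraph, regardless of filer.
for item in root.findall("./riskFactors/item"):
    print(item.text.strip())
```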
Applying text mining in corporate spin-off disclosure statement analysis: understanding the main concerns and recommendation of appropriate term weights
Text mining helps extract knowledge and useful information from unstructured data. It detects and extracts information from mountains of documents, allowing analysts to select the data related to a particular topic.
In this study, text mining is applied to the 10-12b filings made by companies during a corporate spin-off. The main purposes are (1) to investigate potential and/or major concerns found in these financial statements filed for corporate spin-offs, and (2) to identify appropriate text mining methods that can reveal these major concerns.
10-12b filings from thirty-four companies were collected, and only the Risk Factors section was analyzed. Five term weights (Entropy, IDF, GF-IDF, Normal, and None) were applied to the input data; of these, Entropy and GF-IDF provided acceptable results that matched human experts' expectations. The document distribution produced by these term weights formed a pattern reflecting the mood or focus of the input documents.
In addition to this analysis, the study provides a pilot study for future work in predictive text mining on similar financial documents. For example, the descriptive terms found in this study provide a start list of words, eliminating the trial-and-error process of framing an initial start list.
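The abstract does not give the exact formulas used, but under the standard text-mining definitions the two accepted weights can be sketched as follows: GF-IDF is global frequency divided by document frequency, and log-entropy down-weights terms that are spread evenly across documents.

```python
# Sketch of the two accepted global term weights, under standard text-mining
# definitions (the study's exact formulas are not stated in the abstract):
# GF-IDF = global frequency / document frequency, and log-entropy, where a
# term spread evenly across documents is down-weighted toward 0.
import math

def gf_idf(term_doc_counts: list[int]) -> float:
    """term_doc_counts[j] = occurrences of the term in document j."""
    gf = sum(term_doc_counts)                       # global frequency
    df = sum(1 for c in term_doc_counts if c > 0)   # document frequency
    return gf / df if df else 0.0

def entropy_weight(term_doc_counts: list[int]) -> float:
    n = len(term_doc_counts)
    gf = sum(term_doc_counts)
    if gf == 0 or n <= 1:
        return 0.0
    h = 0.0
    for c in term_doc_counts:
        if c > 0:
            p = c / gf
            h += p * math.log(p)
    return 1.0 + h / math.log(n)   # 1 if concentrated, ~0 if uniform

# A term concentrated in one filing gets weight 1; an evenly spread term, ~0.
print(entropy_weight([8, 0, 0, 0]))   # 1.0
print(entropy_weight([2, 2, 2, 2]))   # 0.0
print(gf_idf([2, 2, 2, 2]))           # 2.0
```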
Enhancing recall and precision of web search using genetic algorithm
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Due to the rapid growth in the number of Web pages, web users encounter two main problems: many retrieved documents are unrelated to the user query (low precision), and many relevant documents are never retrieved (low recall). Information Retrieval (IR) is an essential and useful technique for Web search, and many approaches and techniques have been developed for it. Because of its parallel mechanism in high-dimensional spaces, the Genetic Algorithm (GA) has been adopted to solve many optimization problems, IR among them. This thesis proposes a GA-based search model for retrieving HTML documents, called IR Using GA, or IRUGA. It is composed of two main units. The first is the document indexing unit, which indexes the HTML documents. The second is the GA mechanism, which applies selection, crossover, and mutation operators to produce the final result, while a specially designed fitness function evaluates the documents. The performance of IRUGA is investigated using the speed of convergence of the retrieval process, precision at rank N, recall at rank N, and precision at recall N. In addition, the proposed fitness function is compared experimentally with the Okapi-BM25 function and the Bayesian inference network model function. Moreover, IRUGA is compared with traditional IR using the same fitness function to examine the time each technique requires to retrieve the documents. The new techniques developed for document representation, the GA operators, and the fitness function achieve an improvement of over 90% on the recall and precision measures, and the relevance of the retrieved documents is much higher than that of documents retrieved by the other models. Moreover, an extensive comparison of techniques applied to GA operators is performed, highlighting the strengths and weaknesses of each existing technique. Overall, IRUGA is a promising technique in the Web search domain that provides high-quality search results in terms of recall and precision.
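IRUGA's document representation and fitness function are not specified in the abstract, so the sketch below only illustrates the general GA retrieval loop described: chromosomes are query term-weight vectors evolved by selection, one-point crossover, and mutation, with plain cosine similarity standing in for the specially designed fitness function.

```python
# Hedged sketch of a GA-based retrieval loop in the spirit of IRUGA:
# chromosomes are query term-weight vectors; selection, crossover, and
# mutation evolve the query toward documents the fitness function scores
# highly. Cosine similarity stands in for IRUGA's unspecified fitness.
import math
import random

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fitness(query: list[float], docs: list[list[float]], k: int = 5) -> float:
    """Mean similarity of the query to its top-k documents."""
    sims = sorted((cosine(query, d) for d in docs), reverse=True)
    return sum(sims[:k]) / k

def evolve(docs, vocab_size, pop_size=30, generations=50, mut_rate=0.05):
    pop = [[random.random() for _ in range(vocab_size)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda q: fitness(q, docs), reverse=True)
        parents = scored[: pop_size // 2]              # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, vocab_size)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(vocab_size):                # pointwise mutation
                if random.random() < mut_rate:
                    child[i] = random.random()
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda q: fitness(q, docs))

# Toy corpus: 20 documents over a 10-term vocabulary, binary term vectors.
docs = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
best = evolve(docs, vocab_size=10)
ranked = sorted(range(len(docs)), key=lambda i: cosine(best, docs[i]), reverse=True)
print("top documents:", ranked[:5])
```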
Intellectual Property Management in Health and Agricultural Innovation: A Handbook of Best Practices, Vol. 1
Prepared by and for policy-makers, leaders of public sector research establishments, technology transfer professionals, licensing executives, and scientists, this online resource offers up-to-date information and strategies for utilizing the power of both intellectual property and the public domain. Emphasis is placed on advancing innovation in health and agriculture, though many of the principles outlined here are broadly applicable across technology fields. Eschewing ideological debates and general proclamations, the authors always keep their eye on the practical side of IP management. The site is based on a comprehensive Handbook and Executive Guide that provide substantive discussions and analysis of the opportunities awaiting anyone in the field who wants to put intellectual property to work. This multi-volume work contains 153 chapters on a full range of IP topics and over 50 case studies, composed by over 200 authors from North, South, East, and West. If you are a policy-maker, a senior administrator, a technology transfer manager, or a scientist, we invite you to use the companion site guide available at http://www.iphandbook.org/index.html. The site guide distills the key points of each IP topic covered by the Handbook into simple language and places them in the context of evolving best practices specific to your professional role within the overall picture of IP management.
SRL2003 IJCAI 2003 Workshop on Learning Statistical Models from Relational Data