Augmentation of Financial Datasets and Evaluating Financial Text Generated By A.I.
Date
2024-06-07
Authors
Taylor, Stacey Dianne
Abstract
Information is fundamental to decision-making. Yet data in the financial domain is sparse, even though it seems abundant in this era of big data. The work presented in this thesis addresses that scarcity through seven projects that investigate the creation of synthetic financial data, both quantitative and textual. In the first two projects, we examine methods for generating synthetic financial statement data and the effects of synthetic data on a downstream classification task. The next four projects evaluate how well ChatGPT generates textual financial data for the notes to the financial statements and for selected parts of financial reports, and how it adapts its responses to the identified knowledge of its end users, ranging from a non-financial user to a financially sophisticated user.
The remaining project, on authorship attribution, is of the utmost importance, particularly since, to the best of our knowledge, company authorship attribution has not yet been studied. We have author profiles and a good understanding of identified authors such as William Shakespeare, Mary Shelley, or George Washington, but we do not yet have that depth of understanding or identifiability for corporate communication. The attribution task is non-trivial because lengthy corporate communication is often written collaboratively by many authors, many (or all) of whom are never identified, with further contributions from non-writing authors who, for example, vet, review, or sign off on the text. Because of this plethora of unidentified authors, we treat the text as the work of a single "figurehead" author, with the understanding that many (likely unidentified) authors, writing and non-writing, have contributed to it. In our experiments, the Common N-Gram Distance algorithm provided the best and most consistent results, achieving between 95% and 100% accuracy with character n-grams and 100% accuracy with word n-grams. Tools like ChatGPT can be exploited to commit fraud. Given the potential for significant harm to the capital markets, tools that can quickly detect fraudulent corporate communication will be needed. Our research contributes to that effort.
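As a rough illustration of the kind of method named above, the following is a minimal sketch of a Common N-Gram style dissimilarity over character n-gram profiles, following the formulation commonly given in the authorship attribution literature. It is not the thesis's implementation: the function names, the n-gram length `n`, the profile size `size`, and the nearest-profile decision rule are all assumptions chosen for illustration, and the abstract does not state the settings actually used.

```python
from collections import Counter


def char_ngram_profile(text, n=3, size=500):
    """Map the `size` most frequent character n-grams to normalized frequencies.

    `n` and `size` are illustrative defaults, not values from the thesis.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {g: c / total for g, c in counts.most_common(size)}


def cng_distance(p1, p2):
    """Sum of squared relative frequency differences over the union of two profiles."""
    total = 0.0
    for g in set(p1) | set(p2):
        f1 = p1.get(g, 0.0)
        f2 = p2.get(g, 0.0)
        # f1 + f2 > 0 because g occurs in at least one profile.
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def attribute(unknown_text, known_texts, n=3, size=500):
    """Return the label in `known_texts` whose profile is closest to the unknown text."""
    unknown = char_ngram_profile(unknown_text, n, size)
    distances = {
        label: cng_distance(unknown, char_ngram_profile(text, n, size))
        for label, text in known_texts.items()
    }
    return min(distances, key=distances.get)
```

In a company attribution setting, each key of `known_texts` would stand for one "figurehead" corporate author, with its value holding that company's known text. A word n-gram variant would build the profile from token sequences instead of character slices.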
Keywords
Machine Learning, Natural Language Processing, Generative AI, Accounting, Finance