ChatGPT for (Finance) research: The Bananarama Conjecture

We show, based on ratings by finance journal reviewers of generated output, that the recently released AI chatbot ChatGPT can significantly assist with finance research. In principle, these results should generalise across research domains. There are clear advantages for idea generation and data identification. The technology is, however, weaker on literature synthesis and on developing appropriate testing frameworks. Importantly, we further demonstrate that the extent of private data and researcher domain expertise provided as input are key factors in determining the quality of output. We conclude by considering the implications, particularly the ethical implications, that arise from this new technology.


Introduction
ChatGPT is an artificial intelligence language model, introduced in November 2022, that generates conversational responses to question prompts. The model is trained with a blend of reinforcement learning algorithms and human input on over 150 billion parameters. 1 The platform reached a million users in just its first week open to the public and has quickly been coined ''the industry's next big disrupter'' (Grant and Metz, 2022) due to the perceived quality of its output. Although Large Language Models, the technical term for the process underlying this chatbot, have been used for some decades, we were not able to find a study showing how they can be used in the research generation process, as opposed to being used as part of the research.
One early academic study found the platform capable of passing the notoriously-complex common core of US professional legal accreditation examinations (Bommarito II and Katz, 2022). Another author managed to produce a reasonably-comprehensive guide to quantitative trading, almost exclusively through ChatGPT output (Marti, 2022). A range of professions have even set themselves to existential pondering as to whether they have suddenly been made redundant, including educators (Herman, 2022), lawyers (Greene, 2022), and, to cover as many worried professional bases as possible, 'all writers' (Warner, 2023). It is quite the entrance for a new technology.
We are interested in the extent to which ChatGPT can assist with the production of research studies; in this case, finance research. Initial research has explored some limited aspects of this question. A broad perspective on the emergent role for AI in the production of scientific research is taken by Grimaldi and Ehrler (2023) and Hutson et al. (2022). Alshater (2022) suggests that ChatGPT should be useful for a range of tasks involved in constructing a research study, but does not test this empirically.
Most of the applied research focuses on the creation of research abstracts and literature synthesis. For example, Aydın and Karaarslan (2022) attempt to create a healthcare literature review suitable for an academic journal and find that, while it is possible, there is considerable 'plagiarism', or poor paraphrasing. Gao et al. (2022), however, find that novel abstracts can be generated without explicit plagiarism, although these are identifiable as AI-generated using an artificial intelligence output detector. 2 Chen and Eger (2022) also explore use in title and abstract generation, and, in the domain of finance, Wenzlaff and Spaeth (2022) are able to generate reasonably academically-appropriate definitions of new financial concepts. Mellon et al. (2022) explore one aspect of the application to research testing, showing that the platform can be a useful complement to scoring open-text survey results. Adesso (2022) has used GPT3 to write a full paper in physics, to be submitted to a journal ''as is'', and Zhai (2022) has also experimented with creating a research paper outline.
Building on, but distinct from, these studies, ours is the first to provide structured testing of the potential for ChatGPT to assist with writing a research study. We test and compare generated output for four stages of the research process: idea generation, literature review, data identification and processing, and empirical testing. A panel of experienced academic authors and reviewers grade each output. We also, importantly, show that the levels of private data and researcher domain expertise used to guide the output have a significant impact on the quality of what is generated. Like all tools, ChatGPT is best in experienced hands. Following the opening quote of this article, we term this the Bananarama Conjecture.
Section 2 outlines our empirical approach, and Section 3 presents and analyses the findings. We conclude in Section 4 with a framework for understanding the opportunities and limitations of ChatGPT, as well as some initial consideration of the ethical dimensions of the new technology.

Methodology
We focus on cryptocurrencies as our finance topic: a prominent and reasonably well-defined area of recent finance research. We further concentrate on letter-style articles, such as those published in the Finance Research Letters journal; thus, articles of about 2000-2500 words in length.
We start our empirical approach by noting that the standard research study creation process can be divided into five basic stages (Cargill and O'Connor, 2021):
1. Idea generation
2. Prior literature synthesis
3. Data identification and preparation
4. Testing framework determination and implementation
5. Results analysis
As ChatGPT is currently unable to analyse empirical output, we cannot evaluate the results analysis ability, so we concentrate on the first four stages of the research process. We, therefore, request the platform to generate: (1) a research idea; (2) a condensed literature review; (3) a description of suitable data for the research idea; and (4) a suitable testing framework given the research idea and the proposed data.
Three versions of the same general cryptocurrency research idea are generated, each with these four research stages. The textual prompts used to generate each stage are reported in the Appendix. The first version only utilises public data already available within ChatGPT. 3 We label this version of the research study: V1: Only Public Data.
For the second version (labelled V2: Added Private Data), we incorporate private data to assist with generating the research stages. We obtain abstracts and article identifiers for 188 articles identified as related to cryptocurrencies and published in Finance Research Letters (2021-2023) from the Elsevier Scopus database. These articles are loaded into ChatGPT in bibtex format. 4 The private data from these articles add specialist knowledge to the existing generalised expertise of the platform. We then generate the four research stages, telling the platform to take this prior research into account.
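The loading step lends itself to a short sketch. What follows is an illustrative reconstruction, not the authors' actual procedure: because ChatGPT's prompt window is limited, a bibliography of 188 entries would need to be pasted in prompt-sized chunks. The chunking function, file contents, and chunk size below are all assumptions for illustration.

```python
# Illustrative sketch (not the paper's actual script): split a Scopus
# BibTeX export into prompt-sized chunks of whole entries, so they can
# be pasted into ChatGPT one chunk at a time.

def chunk_bibtex(bibtex_text, max_chars=6000):
    """Split BibTeX text into chunks of whole entries under max_chars."""
    # Each entry starts with '@'; keep the delimiter with its entry.
    entries = ["@" + e for e in bibtex_text.split("@") if e.strip()]
    chunks, current = [], ""
    for entry in entries:
        if current and len(current) + len(entry) > max_chars:
            chunks.append(current)
            current = ""
        current += entry
    if current:
        chunks.append(current)
    return chunks

# Hypothetical sample entries, for demonstration only.
sample = """@article{a2021, title={Bitcoin volatility}, journal={FRL}}
@article{b2022, title={Crypto herding}, journal={FRL}}"""
for i, chunk in enumerate(chunk_bibtex(sample, max_chars=60)):
    print(f"--- chunk {i + 1} ---")
    print(chunk.strip())
```

Splitting on entry boundaries, rather than at a fixed character count, keeps each abstract intact within a single prompt.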
For the third version (V3: Private Data and Expertise), we further incorporate researcher domain expertise alongside the private data. In practice, we take the outputs from the second version and iterate on them, telling ChatGPT how it might improve its suggested answers. Most frequently, this iterative process involves asking the platform to be more specific on particular parts of the output, as it tends towards equivocation and generality unless guided otherwise. In none of the three cases do we manually adjust any of the output generated by the model, with the exception of one minor technical correction noted in the Appendix.

Table 1 note: The evaluation criteria column shows the questions asked of reviewers for that research stage, which they rate between 1 (highly disagree) and 10 (highly agree). The length column indicates the approximate word count of output requested from ChatGPT for that research stage. See Section 2 for further elaboration of labels and approach.
For our evaluation stage, we identify a team of experienced authors and reviewers, all of whom have prior experience as reviewers or published authors for ABS-level finance journals. 5 A total of 32 reviewers each review a single complete version of the output (that is, all four research stages of a full research study) and are randomly assigned to one of the three versions.
We administer the evaluation through Qualtrics. The three generated versions of the research study, as presented to reviewers, are contained in the Appendix. Reviewers are asked to rate two aspects of each stage of output (see Table 1 for these evaluation criteria) and may voluntarily leave comments. A review consists of a rating between 1 (highly disagree) and 10 (highly agree) of how likely the output is to be considered acceptable for a minimum ABS2-level finance journal according to the specified criterion. 6 Average scores across reviewers are reported. 7 We now proceed to present and analyse the findings.
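The reported scores are simple averages of these 1-10 ratings. As an illustration only (the ratings below are invented, not the paper's data, and the within-review-then-across-reviewers averaging order is our assumption), the per-version scores can be computed as:

```python
# Illustrative scoring sketch with invented ratings on the 1-10 scale;
# each inner list is one reviewer's ratings of the eight criteria
# (two per research stage).
from statistics import mean

ratings = {
    "V1": [[7, 8, 6, 7, 7, 6, 8, 7], [6, 7, 7, 8, 6, 7, 7, 6]],
    "V2": [[6, 6, 7, 7, 6, 5, 7, 6]],
}

# Average first within each review, then across reviewers (assumed order).
version_means = {
    version: mean(mean(review) for review in reviews)
    for version, reviews in ratings.items()
}
print(version_means)  # e.g. {'V1': 6.875, 'V2': 6.25}
```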

Findings
Table 2 reports the main findings and Fig. 1 presents a boxplot representation of the results.The table shows the findings for all three research study versions, and for the four research stages.The research stages are, in turn, each evaluated according to two criteria.
We could view a rating of 5.5 (the mid-point of the rating range between 1 and 10) as a basic minimum for a research study stage to be considered acceptable: possibly acceptable with revisions, and subject, naturally, to the element of randomness and personal preference that is always present in the reviewing process. By this basic criterion, all versions of the study 'succeed'. Reading from the bottom line of Table 2, which shows the overall average rating of each study, V1 has a rating of 7.05, V2 a rating of 6.63, and V3 a rating of 7.62. These are, therefore, all studies that have a decent chance of eventual success in the reviewing process in a good finance journal.
Examining the individual research stages, we see the highest ratings are for the generation of the research idea. This makes sense when we consider that this initial stage involves thinking broadly about existing concepts and connecting them into a coherent new idea. ChatGPT, with its access to billions of parameters and texts, should be particularly adept at this broad exploration of existing ideas. The data summary stage is also reasonably strong, perhaps because data summaries tend to be distinct sections of a research study in easily identifiable text 'chunks'. There is also a limited range of data which can be used in a given study, meaning the search process is similarly limited.
Less successful, according to our results, are literature reviews and testing frameworks. The platform particularly struggles with generating suitable testing frameworks. Our view is that this might be because these are 'internal' tasks within a research study. The literature review is the internal tool linking the research idea with the methodology. The testing framework, in turn, draws on the research idea, the literature review, and the data summary. The model appears to be less capable of linking multiple internally-generated ideas, as these stages require.
Comparing the different research versions, we see a clear outperformance by our most advanced research study, V3: Private Data and Expertise. We were surprised to see that the version with added private data underperformed compared to the version with only public data. On reflection, this appears to be because the private data model relies excessively on the provided private data, restricting the extent to which it accesses other beneficial public data. This could be improved either by instructing the platform not to ignore useful public data, or by providing a better-curated set of relevant private data. (Table 2 presents the summary findings from 32 reviews of the three versions of a ChatGPT-generated research study: 10 reviews each of V1 and V3, and 12 reviews of V2.)

The outperformance of the V3 research study is notable, not just on an overall basis, but also in the extent to which it is capable of producing acceptable literature reviews and testing frameworks, where the other research studies have less success. We suggested above that the general underperformance of the output for these research stages might be due to the difficulty ChatGPT has in linking multiple generated ideas. The advantage, therefore, for our V3 study is that the researcher can observe any missing links and ask the platform to iterate further to address these gaps. The Appendix contains sample prompts given to the platform, and this addressing of missing links can be seen in the prompt text. Researcher domain expertise appears to be key for these tasks involving conceptual complexity.
Table 3 confirms, statistically, the differences between the research studies through a range of t-tests. These two-sided t-tests assume unequal variance, which best fits our data. The main differences are observed for the evaluation criteria of ''the literature review''
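An unequal-variance (Welch) two-sample t-test of the kind reported in Table 3 can be sketched in a few lines of dependency-free Python. The rating vectors below are invented for illustration; the paper's underlying per-reviewer scores are not reproduced here.

```python
# Minimal Welch (unequal-variance) two-sample t statistic with the
# Welch-Satterthwaite degrees of freedom. Ratings below are invented.
from math import sqrt
from statistics import mean, variance  # variance = sample variance (n-1)

def welch_t(a, b):
    """Return the Welch t statistic and Welch-Satterthwaite df."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

v1 = [7, 8, 6, 7, 7, 6, 8, 7, 6, 8]   # hypothetical V1 ratings
v3 = [8, 8, 7, 9, 7, 8, 8, 9, 7, 8]   # hypothetical V3 ratings
t, df = welch_t(v3, v1)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The two-sided p-value then follows from the t distribution with df degrees of freedom (for example, `scipy.stats.t.sf(abs(t), df) * 2` when SciPy is available); the sketch stops at the statistic to stay self-contained.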

''It ain't what you do, it's the way that you do it / And that's what gets results'' [Song lyrics by Bananarama and Fun Boy Three (1982)]

Table 2
Findings from reviewer evaluations of ChatGPT-generated research studies.