Where Financial Models Meet Large Language Models

LLMs In Finance: Balancing Innovations With Accountability


LLMs offer an enormous potential productivity boost, making them a valuable asset for organizations that generate large volumes of data. Below are some of the benefits LLMs deliver to companies that leverage their capabilities. The architecture is only a first prototype, but the project shows the feasibility of designing AI models adapted specifically to the financial domain. But what if a model could call upon only the most relevant subset of its parameters to respond to a given query?

While pre-trained language representation models are versatile, they may not always perform optimally for specific tasks or domains. Fine-tuned models have undergone additional training on domain-specific data to improve their performance in particular areas. For example, a GPT-3 model could be fine-tuned on medical data to create a domain-specific medical chatbot or assist in medical diagnosis. The technical foundation of large language models comprises the transformer architecture, its layers and parameters, attention mechanisms, and deep-learning training methods.
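
To make fine-tuning concrete, here is a minimal sketch using the Hugging Face transformers library; the "gpt2" base model, the corpus file name, and the hyperparameters are placeholder assumptions rather than details of any system described in this article.

    # Minimal fine-tuning sketch with Hugging Face transformers.
    # "gpt2" and "finance_corpus.txt" are illustrative placeholders.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # One in-domain document (or paragraph) per line of plain text.
    dataset = load_dataset("text", data_files={"train": "finance_corpus.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    train_set = dataset["train"].map(tokenize, batched=True,
                                     remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ft-finance",
                               num_train_epochs=1,
                               per_device_train_batch_size=4),
        train_dataset=train_set,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()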

We (Kim, Muhn and Nikolaev) examine to what extent general-purpose large language models (LLMs) such as GPT-4 can make informed financial decisions based on mostly numerical financial data. We provide standardized and anonymized financial statements to a pre-trained LLM and design sophisticated chain-of-thought prompts that resemble how human analysts make earnings predictions. Our current results show that, even without additional narrative context or industry-specific information, the LLM outperforms the median financial analyst in its ability to forecast annual earnings.
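
As an illustration of the kind of chain-of-thought prompting the authors describe, a hedged sketch follows; the prompt wording is invented for this example and is not the study's actual prompt, though the client call reflects OpenAI's public Python API.

    # Illustrative chain-of-thought prompt in the spirit described above.
    # The prompt text and step structure are assumptions for this sketch.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT = """You are a financial analyst. The following standardized,
    anonymized financial statements cover fiscal years t-2 through t.

    {statements}

    Step 1: Identify notable trends in revenue, margins, and key ratios.
    Step 2: Compute relevant year-over-year ratio changes.
    Step 3: Reason about what these trends imply for year t+1.
    Step 4: Conclude whether earnings will increase or decrease in t+1.
    """

    def forecast_earnings(statements: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": PROMPT.format(statements=statements)}],
        )
        return response.choices[0].message.content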


For instance, the model’s performance increased from 74.2% to 82.1% on GSM8K and from 78.2% to 83.0% on DROP, two popular benchmarks used to evaluate LLM performance. The Kensho team developed the benchmark while going through the process of evaluating large language models themselves: they were using an open-source generative AI model for a product offering, then started testing other models and realized the alternatives performed better. Companies developing AI solutions for regulated industries don’t have the luxury of brute-force training giant models on the biggest databases around just to see what they’re capable of; the output from their systems is typically submitted for review by governing authorities such as the U.S. FDA and global financial regulators.
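
For readers curious how such head-to-head benchmark comparisons are scored, here is a minimal sketch of exact-match evaluation on GSM8K; the generate callable stands in for whichever model is under test, and the answer extraction relies on the dataset's "#### <number>" convention.

    # Minimal sketch of benchmark-style evaluation: exact match on the
    # final GSM8K answer. `generate` is any callable mapping a question
    # string to model output text.
    from datasets import load_dataset

    def extract_final_number(text: str) -> str:
        # GSM8K reference answers end with "#### <number>".
        return text.split("####")[-1].strip().replace(",", "")

    dataset = load_dataset("gsm8k", "main", split="test")

    def accuracy(generate, n=100):
        correct = 0
        for example in dataset.select(range(n)):
            prediction = extract_final_number(generate(example["question"]))
            if prediction == extract_final_number(example["answer"]):
                correct += 1
        return correct / n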

Reining in human knowledge for machine use

  • The idea that LLMs can generate their own training data is particularly important in light of the fact that the world may soon run out of text training data.
  • This is not yet a widely appreciated problem, but it is one that many AI researchers are worried about.
  • LLMs’ performance and accuracy rely on the quality of the training data they are fed.
  • BloombergGPT is a PyTorch model whose training sequences are all the same length, 2,048 tokens, which apparently maximizes GPU utilization (see the packing sketch after this list).
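
As a rough illustration of the fixed-length setup mentioned in the last bullet, here is a minimal sketch of sequence packing; the separator token and function shape are assumptions for illustration, not BloombergGPT's actual pipeline.

    # Minimal sketch of fixed-length sequence packing: concatenate
    # tokenized documents and slice into equal 2,048-token sequences so
    # every training batch fills the GPU the same way.
    from typing import Iterable, List

    SEQ_LEN = 2048
    EOS = 0  # hypothetical end-of-document token id

    def pack(documents: Iterable[List[int]]) -> Iterable[List[int]]:
        buffer: List[int] = []
        for doc in documents:
            buffer.extend(doc + [EOS])
            while len(buffer) >= SEQ_LEN:
                yield buffer[:SEQ_LEN]
                buffer = buffer[SEQ_LEN:]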

“Some models may be good at writing poems, other models might be really good at quantitative reasoning,” that is, at applying basic math to data analysis and problem-solving. Several forms of bias can also creep into NLP and NLG models. Human bias happens when humans improperly label data due to cultural or intentional misinterpretation; institutional bias is another common source. Technical bias occurs when systems are designed in such a way that the outputs they produce don’t follow the scientific method. A prime example would be teaching a machine to “predict” a person’s sexuality: since there is no scientific basis for this kind of AI (see our article on why supposed “gaydars” are nothing but hogwash and snake oil), such systems can only produce made-up outputs by employing pure technical bias. A further form of bias arises when an AI model skips over pertinent or essential parts of its database when it outputs information.

Tools like ChatGPT have captivated the public imagination, demonstrating an impressive ability to generate human-like text, create content and power chatbots. But in capital markets, where precision, speed and explainability are paramount, we’re seeing a different reality. Sparse expert models, for their part, are less well understood and more technically complex to build than dense models. Yet considering their potential advantages, above all their computational efficiency, don’t be surprised to see the sparse expert architecture become more prevalent in the world of LLMs going forward. Interpretability, the ability for a human to understand why a model took the action that it did, is one of AI’s greatest weaknesses today.
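
To make the sparse expert idea concrete, here is a minimal mixture-of-experts layer in PyTorch, in which a router activates only the top-k experts for each token, so only a subset of parameters is used per query; the dimensions and routing scheme are illustrative assumptions, not the design of any particular model.

    # Minimal sparse mixture-of-experts sketch: the router selects the
    # top-k experts per token, leaving the rest of the parameters idle.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoE(nn.Module):
        def __init__(self, d_model=512, n_experts=8, k=2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                               nn.Linear(4 * d_model, d_model))
                 for _ in range(n_experts)])
            self.k = k

        def forward(self, x):                   # x: (tokens, d_model)
            scores = self.router(x)             # (tokens, n_experts)
            weights, idx = scores.topk(self.k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.k):          # send each token to its chosen experts
                for e in range(len(self.experts)):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
            return out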


By automating routine analytical tasks, LLMs free up human analysts to focus on strategic decision-making and high-level planning. This shift allows for more efficient allocation of human resources and potentially leads to more informed financial strategies. Correcting errors and learning from them openly also allows us to show that we are in control. A community or an online forum where we can chat about using LLMs, share our knowledge and learn from our own and others’ mistakes can help with this.

That’s sort of the fun part about working with huge LLMs: you never quite know what you’ll get when you query them. However, there’s no room for that kind of uncertainty in medical, financial, or business intelligence reports. And while there’s no such thing as a sure bet in the world of STEM, the forecast for NLP technologies in Europe is bright and sunny for the foreseeable future.


While GenAI has its uses, its current form falls short of meeting the rigorous demands of financial applications. That’s why I believe the future of AI in finance will not be driven by the biggest models but by the smartest: those built with a deep understanding of the industry’s specific challenges and needs. This objection makes sense if we conceive of large language models as databases, storing information from their training data and reproducing it in different combinations when prompted. But, uncomfortable or even eerie as it may sound, we are better off instead conceiving of large language models along the lines of the human brain (no, the analogy is of course not perfect!). “I do think LLMs are ready to do sophisticated quantitative reasoning problems, but in a field that requires accuracy there is a need for an independent assessment,” said Aaron McPherson, principal at AFM Consulting.

An ongoing partnership between Microsoft and Nvidia is working to help enterprises meet these daunting demands. The industry giants are collaborating on products and integrations for training and deploying LLMs with billions and trillions of parameters. A key is more tightly coupling the containerized Nvidia NeMo Megatron framework and a host of other targeted products with Microsoft Azure AI Infrastructure, which can deliver a scaling efficiency of 95% on 1,400 GPUs. Originating in an influential research paper from 2017, the idea took off a year later with the release of the open-source BERT (Bidirectional Encoder Representations from Transformers) software, followed by OpenAI’s GPT-3 model. As these pre-trained models have grown in complexity and size, roughly 10x annually in recent years, so have their capabilities and popularity. The arrival of ChatGPT marked the clear coming out of a different kind of LLM as the foundation of generative AI and transformer neural networks (GPT stands for generative pre-trained transformer).

With the rise of conversational LLMs in recent months, research in this area is now rapidly accelerating. Of course, having access to an external information source does not by itself guarantee that LLMs will retrieve the most accurate and relevant information. One important way for LLMs to increase transparency and trust with human users is to include references to the source(s) from which they retrieved the information. Such citations allow human users to audit the information source as needed in order to decide for themselves on its reliability. DeepMind’s Chinchilla, one of today’s leading LLMs, was trained on 1.4 trillion tokens.
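
A minimal sketch of how such citations can be wired into retrieval-augmented generation follows; the retriever and llm callables and the numbered-tag convention are assumptions made for illustration, not any particular system's design.

    # Minimal sketch of retrieval with source citations: retrieve passages,
    # prepend them to the prompt with numbered tags, and ask the model to
    # cite the tag it used so users can audit each source.
    def build_cited_prompt(question, passages):
        context = "\n".join(f"[{i + 1}] {p['text']} (source: {p['url']})"
                            for i, p in enumerate(passages))
        return (f"Answer using only the numbered sources below and cite "
                f"them like [1].\n\n{context}\n\n"
                f"Question: {question}\nAnswer:")

    def answer_with_citations(question, retriever, llm, k=3):
        passages = retriever(question, k)  # e.g. a vector-store similarity search
        return llm(build_cited_prompt(question, passages))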
