^{Gary Smith, Valentina Liberman, and Isaac Warshaw
April 21, 2025

Business and Finance, Large Language Models (LLMs)}

LLMs Still Cannot be Trusted for Financial Advice

_{The limitations of Large Language Models (chatbots) are illustrated by their struggles with financial advice}


Few people have the knowledge and training needed to make sound financial decisions and many cannot afford guidance from expert advisors. The Internet is a possible source of advice but, once we fight our way through a thicket of advertisements, we are apt to encounter a bewildering blizzard of websites. Too often, they are barely relevant, outdated, or contradictory.

ChatGPT and other large language models (LLMs) offer a promising alternative. Pre-trained on an immense amount of text (including Wikipedia and other Internet sources) and post-trained by experts with field-specific knowledge, LLMs generate confident and seemingly authoritative answers to almost any question, including financial ones. However, a January 2024 study found that their breezy confidence is often misplaced: the answers frequently turn out to be illogical or factually incorrect.


Have LLMs improved in recent months?

LLMs have been improving rapidly and some enthusiasts now claim that they are on the verge of achieving artificial general intelligence (AGI), or have already achieved it (here, here, and here). To test whether they can now be relied on for financial advice, we posed 12 straightforward financial questions to four prominent LLMs between March 25 and March 27, 2025: ChatGPT-4o, DeepSeek-V2, Grok 3 Beta, and Gemini 2. Each response was graded as 0 (incorrect financial analysis), 0.5 (correct financial analysis but mathematical errors), or 1 (correct financial analysis and mathematical calculations).

The responses, detailed here, were consistently verbose but often incorrect. Out of 12 possible points, ChatGPT earned a score of 5.0, DeepSeek 4.0, Grok 3.0, and Gemini 1.5. We had expected the LLMs to receive plenty of expert post-training that instructed them to take into account the time value of money, but they seldom did so. They also often reported mathematically incorrect answers even for the simplest calculations; for example, Grok generated this: “Initial Monthly Cost: $3,700 (rent) + $200 (estimated utilities) = $4,900/month.”

There were also a surprising number of basic typographical and grammatical errors (including non-deductibly, over all, To accurately asses the tax benefit, deductable, inital); perhaps that’s because the LLMs had trained on text that featured these errors. Sometimes, they made bizarre assumptions that revealed the disconnect between generating statistically likely text and the real world. In a question involving annuity payments beginning at age 80, DeepSeek assumed that this “is a perpetuity, meaning payments continue indefinitely,” ignoring the fact that lives are finite, which is particularly important for an 80-year-old.

The only question that all four LLMs answered correctly did not involve any financial analysis:

I will start paying for my oldest child’s college expenses (books, board, room, and tuition), which I expect will add up to $45,000 per year. I am going to pay the college directly. Is there any tax strategy I should follow?

All four LLMs noted that the $18,000 gift tax limit could be circumvented by paying the college directly—advice that is widely circulated on the Internet and was surely encountered in their pre-training.

Flawed answers to a “best buy” question

In stark contrast, all four LLMs gave flawed answers to this simple prompt that required only a minimal amount of financial analysis:

I am thinking of buying a $999.99 TV from Best Buy. They have a rent-to-own plan where I can pay $175.50/month and after 12 months I can own the TV. Should I buy the TV through the plan?

The implicit annual percentage rate is 166% if the first payment is made a month after purchase and 213% if the first payment is made at the time of purchase. None of the LLMs noted this subtlety. Indeed, they scarcely paid any attention to the time value of money. All four LLMs correctly calculated the total payments to be $2,106, which is $1,106.01 more than the $999.99 purchase price of the television. The fact that the total cost is substantially higher than the purchase price doesn’t mean that it is a bad deal. The same is true of 30-year mortgages, even with tantalizingly low interest rates. The LLMs were oblivious to the need for a present value calculation.
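The two rates quoted above can be checked with a short calculation. What follows is a minimal sketch, not necessarily the method the authors used: it assumes the APR is the nominal annual rate (12 times the monthly rate) at which the present value of the twelve $175.50 payments equals the $999.99 price, and it finds that monthly rate by bisection. The function name `implied_monthly_rate` and its parameters are hypothetical, introduced only for illustration.

```python
def implied_monthly_rate(price, payment, n_payments, first_payment_now=False):
    """Monthly rate at which the present value of the payments equals the price.

    Assumption: equal monthly payments, the first one either today
    (first_payment_now=True) or one month from today (the default).
    """
    start = 0 if first_payment_now else 1

    def present_value(rate):
        # Discount each payment back to the purchase date.
        return sum(payment / (1 + rate) ** t
                   for t in range(start, start + n_payments))

    low, high = 1e-9, 10.0  # bracket: 0% to 1,000% per month
    for _ in range(200):
        mid = (low + high) / 2
        if present_value(mid) > price:
            low = mid   # payments still worth more than the price: rate too low
        else:
            high = mid
    return (low + high) / 2


# Figures from the rent-to-own prompt.
price, payment, months = 999.99, 175.50, 12

apr_first_payment_later = 12 * implied_monthly_rate(price, payment, months)
apr_first_payment_now = 12 * implied_monthly_rate(
    price, payment, months, first_payment_now=True)

print(f"Total paid: ${payment * months:,.2f}")
print(f"APR, first payment after one month: {apr_first_payment_later:.0%}")  # about 166%
print(f"APR, first payment at purchase:     {apr_first_payment_now:.0%}")    # about 213%
```

The timing of the first payment matters because paying at purchase means only $824.49 is effectively being borrowed, and it is repaid in eleven installments rather than twelve.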

ChatGPT concluded that,

Since you’re financing $999.99 over 12 months and end up paying $2,106, this suggests a very high implied interest rate, likely over 100% APR.

DeepSeek reported the APR to be “(1,106.01/999.99) × 100 ≈ 110.6%.” Instead of calculating the APR for the given data, Grok reported an APR that it had evidently encountered in its pre-training:

With this plan, you’d end up paying more than double the TV’s original price. That extra $1,106.01 is essentially the cost of the financing arrangement, which translates to a very high effective interest rate (around 187% APR, based on similar lease-to-own examples).

Gemini simply noted that “the total cost of the TV through the rent-to-own plan ($2106.00) is significantly higher than the original price ($999.99).”

Ignoring key decision-making factors

Another question that elicited surprisingly naive results was:

I am an established high earning business man. I make passive income from my businesses and do not have to do much work anymore and want to start investing as a hobby hopefully to leave some money to my grandchildren. My friends talk a lot about IRAs and Roth IRAs in particular, but I don’t know much about the topic. Which should I start using?

All of the LLMs regurgitated various characteristics that they had gleaned from the Internet, including contribution limits, minimum distribution requirements, backdoor Roths, and alternative investments such as taxable brokerage accounts, 529 plans, and life insurance. None mentioned one of the most important factors savers should weigh: a comparison of tax rates when money is contributed and when it is withdrawn.
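A simplified sketch shows why the two tax rates drive the choice. It assumes a single pre-tax sum, a constant annual return, and flat tax rates at contribution and at withdrawal; it ignores contribution limits, income limits, required distributions, and estate considerations. The function names and all numbers are hypothetical, chosen only for illustration.

```python
def traditional_ira(pre_tax_amount, annual_return, years, tax_at_withdrawal):
    """After-tax value: the full pre-tax amount grows, then is taxed on withdrawal."""
    return pre_tax_amount * (1 + annual_return) ** years * (1 - tax_at_withdrawal)


def roth_ira(pre_tax_amount, annual_return, years, tax_at_contribution):
    """After-tax value: tax is paid up front, then growth and withdrawal are untaxed."""
    return pre_tax_amount * (1 - tax_at_contribution) * (1 + annual_return) ** years


# Hypothetical inputs: $10,000 of pre-tax income, 6% annual return, 20 years.
amount, annual_return, years = 10_000, 0.06, 20

# Same tax rate at both ends: the two accounts come out the same.
print(traditional_ira(amount, annual_return, years, 0.30))
print(roth_ira(amount, annual_return, years, 0.30))

# Lower rate at withdrawal than at contribution: the traditional IRA comes out ahead.
print(traditional_ira(amount, annual_return, years, 0.24))
print(roth_ira(amount, annual_return, years, 0.37))
```

Under these assumptions the growth factor is identical in both accounts, so the outcome depends only on which tax rate is lower: a saver who expects a higher rate at withdrawal is better served by the Roth, and one who expects a lower rate is better served by the traditional IRA.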

Another question concerned when to start collecting Social Security benefits. DeepSeek was the only LLM to take into account the time value of money but it assumed a low 3% interest rate and miscalculated the present values.

When asked to calculate the first-year return from buying a home to live in, all four LLMs responded that, because it was not a rental property, there were no financial benefits, only costs. Ignoring the rent savings and using the costs enumerated in the question, they reported first-year returns ranging from –6.68% to –16.33%, never mind that such negative returns imply that homeownership is always a bad idea.

Unlike financial advisors, who develop expertise through education and experience, LLMs generate responses based on word patterns within vast datasets, tweaked by post-training that cannot conceivably anticipate every question that might be asked and the variety of individualized nuances required to give trustworthy responses.

Despite these fundamental limitations, LLMs create a reassuring illusion of human-like intelligence, along with a breezy conversational style enhanced by friendly exclamation points. They seem to be experts but are not. It is still the case that the real danger is not that computers are smarter than us but that we think computers are smarter than us and consequently trust them to make decisions they should not be trusted to make.

Isaac Warshaw is an Economics and Philosophy student at Pomona College with experience designing and conducting qualitative interview-based research and Information Technology.

Valentina Liberman is an Economics student at Pomona College with an interest in Finance and Technology.