GPT and Other AI Models Can't Analyze an SEC Filing, Researchers Find
According to researchers from a startup called Patronus AI, ChatGPT and other chatbots that rely on large language models frequently fail to answer questions derived from Securities and Exchange Commission filings. CNBC reports: Even the best-performing artificial intelligence model configuration they tested, OpenAI's GPT-4-Turbo, when armed with the ability to read nearly an entire filing alongside the question, got only 79% of answers right on Patronus AI's new test, the company's founders told CNBC. Oftentimes, the so-called large language models would refuse to answer, or would "hallucinate" figures and facts that weren't in the SEC filings. "That type of performance rate is just absolutely unacceptable," Patronus AI co-founder Anand Kannappan said. "It has to be much much higher for it to really work in an automated and production-ready way." [...] Patronus AI worked to write a set of more than 10,000 questions and answers drawn from SEC filings from major publicly traded companies, which it calls FinanceBench. The dataset includes the correct answers, and also where exactly in any given filing to find them. Not all of the answers can be pulled directly from the text, and some questions require light math or reasoning. Kannappan and his co-founder Rebecca Qian say it's a test that gives a "minimum performance standard" for language AI in the financial sector. Patronus AI tested four language models: OpenAI's GPT-4 and GPT-4-Turbo, Anthropic's Claude 2, and Meta's Llama 2, using a subset of 150 of the questions it had produced. It also tested different configurations and prompts, such as one setting where the OpenAI models were given the exact relevant source text in the question, which it called "Oracle" mode. In other tests, the models were told where the underlying SEC documents would be stored, or given "long context," which meant including nearly an entire SEC filing alongside the question in the prompt.
GPT-4-Turbo failed at the startup's "closed book" test, where it wasn't given access to any SEC source document. It failed to answer 88% of the 150 questions it was asked, and only produced a correct answer 14 times. It was able to improve significantly when given access to the underlying filings. In "Oracle" mode, where it was pointed to the exact text for the answer, GPT-4-Turbo answered the question correctly 85% of the time, but still produced an incorrect answer 15% of the time. But that's an unrealistic test because it requires human input to find the exact pertinent place in the filing -- the exact task that many hope that language models can address. Llama 2, an open-source AI model developed by Meta, had some of the worst "hallucinations," producing wrong answers as much as 70% of the time, and correct answers only 19% of the time, when given access to an array of underlying documents. Anthropic's Claude 2 performed well when given "long context," where nearly the entire relevant SEC filing was included along with the question. It could answer 75% of the questions it was posed, gave the wrong answer for 21%, and failed to answer only 3%. GPT-4-Turbo also did well with long context, answering 79% of the questions correctly, and giving the wrong answer for 17% of them.
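The percentages above are shares of the 150-question subset, with each answer falling into one of three outcomes: correct, incorrect, or refused. As a rough illustration of that arithmetic (not Patronus AI's actual scoring code), the Python sketch below converts counts into percentages. The function names and the three grade labels are assumptions made for this example; the only figure taken from the article is the 14 correct answers out of 150 in the closed-book test, and the four-item list is toy data.

```python
from collections import Counter

# Hypothetical grade labels; the article describes these three outcomes
# but does not say how Patronus AI encodes them.
GRADES = ("correct", "incorrect", "refused")


def share(count, total):
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * count / total


def rates(labels):
    """Return the percentage of answers that fall into each grade."""
    counts = Counter(labels)
    return {grade: share(counts[grade], len(labels)) for grade in GRADES}


# From the article: 14 correct answers out of 150 closed-book questions.
closed_book_correct = share(14, 150)

# Toy data for illustration only, not benchmark results.
toy_rates = rates(["correct", "refused", "refused", "incorrect"])

print(f"closed-book correct: {closed_book_correct:.1f}%")
print(toy_rates)
```

By this arithmetic, 14 correct answers out of 150 works out to roughly 9% of the closed-book questions.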