
Cisco tested AI's ability to write an accurate report on a tabletop security incident response exercise, and found that while the tech can save time, many risks remain.

The networking giant revealed its results in a Thursday blog post (https://blogs.cisco.com/security/ai-generated-reporting-lessons-learned-from-talos-incident-response) by Nate Pors, a senior incident commander in the Cisco Talos Incident Response team.

Pors opened by observing that when used to generate long-form technical content, large language models can deliver "significant inaccuracies, unusual conclusions, and inconsistent writing styles."

LLMs make those mistakes because they're essentially a fancy autocomplete system that makes educated guesses. Pors wrote that the nature of LLMs therefore sees them mess up in four ways:

Using different data for each query, which means it's "difficult to rely on an LLM for repeatable, standardized research outcomes."

Reaching different conclusions from the same data. "In a data breach scenario, a model might suggest a full organization-wide password reset in one instance and a targeted reset in another," Pors wrote, and AI then "often defaults to whichever recommendation it generates first" and may therefore give bad advice.

Producing inconsistent documents. Because LLMs generate content token-by-token, they can create documents with different structure and formatting on each new run. "This unpredictability is problematic for professional environments where standardized layouts, such as consistent executive summaries or recommendation sections, are essential for quality control," the Talos man observed.

Discarding data. AI can drop some of the material it is given, so its output might ignore critical information.

Talos developed several techniques to stop this sort of thing happening. One involves giving an LLM "granular, single-task instructions" that focus on "a specific, small portion of the report." Doing so means the "risk of hallucination or cross-contamination between sections is significantly reduced." Telling an LLM which sources to use also helps. So does setting rules about the style and format of output.

Using those techniques, Cisco says the time required to draft an incident report based on a tabletop exercise fell by 50 percent.
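As a rough illustration of those three techniques, and not Talos's actual tooling, the Python sketch below builds one narrow prompt per report section, passes in only the sources named for that section, and appends a fixed set of style rules. Every name in it (SectionTask, build_prompt, STYLE_RULES, the file names) is hypothetical, and it stops short of calling any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionTask:
    """One narrow writing task covering a single part of the report."""
    name: str          # hypothetical section name, e.g. "executive_summary"
    instruction: str   # the single task the model is asked to perform
    sources: tuple     # the only documents the model may draw on


# Hypothetical style rules; the blog post's actual rules are not given here.
STYLE_RULES = (
    "Write in the past tense.",
    "Use plain paragraphs with no headings.",
)


def build_prompt(task: SectionTask, documents: dict) -> str:
    """Assemble the prompt for one section from only its listed sources."""
    missing = [s for s in task.sources if s not in documents]
    if missing:
        raise KeyError(f"missing source documents: {missing}")
    parts = [
        f"Task: {task.instruction}",
        "Use only the sources below.",
        "Style rules:",
    ]
    parts += [f"- {rule}" for rule in STYLE_RULES]
    for source in task.sources:
        parts.append(f"--- source: {source} ---")
        parts.append(documents[source])
    return "\n".join(parts)


if __name__ == "__main__":
    # Invented sample inputs, for illustration only.
    documents = {
        "facilitator_notes.txt": "Team isolated the affected host.",
        "attendee_list.txt": "Alice, Bob",
    }
    task = SectionTask(
        name="executive_summary",
        instruction="Summarize the exercise outcome in one paragraph.",
        sources=("facilitator_notes.txt",),
    )
    print(build_prompt(task, documents))
```

Because each prompt is built from scratch out of its own task and sources, nothing from one section or one report is carried into the next unless it is explicitly listed.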
"A blind test of the sample report in our quality assurance process showed no noticeable drop in overall writing quality," Pors wrote. "The peer reviewer, professional editor, and management reviewer all made complimentary comments about the report while unaware that it was AI-generated. The peer reviewer commented that the incidence of typos and grammatical errors was far lower than in the average report."

But the Talos team also found that editing multiple sample reports within a single session "resulted in cross-contamination of content from one report's source material to another, even if the notes used to generate the first report were deleted from the project's reference documents." The researchers therefore recommend starting a new session, and re-entering prompts, for each new incident report.

They also developed a spelling-and-grammar-checking prompt that "hallucinated numerous grammar issues ... failed to identify actual issues," had a success rate below 50 percent, and behaved inconsistently, sometimes catching issues and sometimes overlooking them. "It is currently unsuitable for production use," Pors concluded.

Pors said Cisco found its approach could be adapted to "any cybersecurity reporting use case with standardized inputs and predictable outputs," but also warned that authors must "take ownership of every word of the final report."

"While testing, we found that the LLMs generated recommendations that were duplicative, irrelevant, or not actionable. If this were used in a production environment without manual checks, it could result in poor-quality recommendations in a final report."

Those problems arose when considering a tabletop exercise, a far simpler affair than analysis of an incident that involves analyzing log files from multiple systems. ®