Microsoft Unveils a Large Language Model That Excels At Encoding Spreadsheets
Microsoft has quietly announced the first details of its new "SpreadsheetLLM," claiming it has the "potential to transform spreadsheet data management and analysis, paving the way for more intelligent and efficient user interactions." More details about the model are available in a pre-print paper. Jasper Hamill reports via The Stack:
One of the problems with using LLMs on spreadsheets is that they get bogged down by too many tokens (the basic units of text a model processes). To tackle this, Microsoft developed SheetCompressor, an "innovative encoding framework that compresses spreadsheets effectively for LLMs." "It significantly improves performance in spreadsheet table detection tasks, outperforming the vanilla approach by 25.6% in GPT4's in-context learning setting," Microsoft added.
SheetCompressor is made up of three modules: structural-anchor-based compression, inverse index translation, and data-format-aware aggregation. The first of these modules places "structural anchors" throughout the spreadsheet to help the LLM better understand its layout. It then removes "distant, homogeneous rows and columns" to produce a condensed "skeleton" version of the table.
Index translation addresses the challenge posed by spreadsheets with numerous empty cells and repetitive values, which use up too many tokens. "To improve efficiency, we depart from traditional row-by-row and column-by-column serialization and employ a lossless inverted index translation in JSON format," Microsoft wrote. "This method creates a dictionary that indexes non-empty cell texts and merges addresses with identical text, optimizing token usage while preserving data integrity." [...]
After conducting a "comprehensive evaluation of our method on a variety of LLMs," Microsoft found that SheetCompressor reduces token usage for spreadsheet encoding by 96%.
Moreover, SpreadsheetLLM shows "exceptional performance in spreadsheet table detection," which is the "foundational task of spreadsheet understanding." The new LLM builds on the Chain of Thought methodology to introduce a framework called "Chain of Spreadsheet" (CoS), which can "decompose" spreadsheet reasoning into a table detection-match-reasoning pipeline.
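To make the inverted index idea quoted above more concrete, here is a minimal Python sketch of a dictionary that indexes non-empty cell texts and groups the addresses that share the same text. It is an illustration only, not Microsoft's implementation: the function names `column_letters` and `inverted_index`, the list-of-rows input, and the A1-style addresses are all assumptions made for this example, and the sketch does not attempt the other two modules.

```python
import json
from collections import defaultdict


def column_letters(index: int) -> str:
    """Convert a zero-based column index to spreadsheet letters (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def inverted_index(grid):
    """Map each distinct non-empty cell text to the addresses that contain it.

    `grid` is a hypothetical list-of-rows representation of a sheet.
    """
    index = defaultdict(list)
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value is None or str(value).strip() == "":
                continue  # empty cells are omitted, so they cost no tokens
            index[str(value)].append(f"{column_letters(c)}{r + 1}")
    return dict(index)


# Small illustrative sheet: one empty cell and a repeated value.
grid = [
    ["Region", "Q1", "Q2"],
    ["North", "100", "100"],
    ["South", "", "100"],
]

print(json.dumps(inverted_index(grid)))
```

In this sketch the repeated value "100" is stored once with three addresses, and the empty cell at B3 does not appear at all, which is where the token savings in sparse or repetitive sheets would come from.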