AI SIGNAL · Technical Post · May 31, 2026

Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding

[table-understanding][prompting-framework][llm-reasoning][tabular-data][chain-of-thought][table-transformations][in-context-learning]

TL;DR

Chain-of-Table is a prompting framework that improves LLM reasoning over tabular data by decomposing complex questions into iterative table transformations (adding columns, filtering rows, aggregating groups) rather than generating free-form text explanations. Operating entirely at inference time without retraining, it uses in-context learning to select structured operations that progressively reduce tables into answerable forms. On WikiTableQuestions it improves on Chain-of-Thought by 34.2 percentage points and reaches accuracy competitive with Text-to-SQL approaches, without requiring a sandbox for generated code or an external database system.

The Problem

Language models trained on plain text corpora carry an architectural handicap when reasoning about structured tabular data. Tables embed relational information—column dependencies, row semantics, implicit hierarchies—that pure text representation obscures. When you present a table to an LLM, the model must linearize it into sequences of tokens, losing the very structure that makes the data meaningful.

Limitations of Existing Approaches

Generic Chain-of-Thought (CoT) generates intermediate reasoning steps as free-form text. For tabular questions, this creates a critical failure mode: the reasoning text must re-encode table structure in natural language at each step, duplicating information and introducing ambiguity. When asked "Which country had the most cyclists finish in the top 3?", a standard CoT response might read:

"First, I need to identify which cyclists are in the top 3 rankings. Looking at rank 1, 2, and 3, they are from USA, Canada, and Spain respectively. Now I need to count cyclists per country in the top 3. USA has 1, Canada has 1, Spain has 1. So each country appears equally."

This textual re-encoding wastes tokens and loses the precision tables provide natively. The model must reconstruct the table state in natural language, making errors more likely and consuming 30-40% more tokens than necessary. Furthermore, when rows number in the hundreds, this re-encoding becomes infeasible—the model cannot practically transcribe entire data structures into language without hallucination.

Text-to-SQL and Program-Aided Reasoning sidestep the text problem by generating executable code (SQL, Python). When the schema is clean this works mechanically: the model can produce syntactically valid queries that return correct answers. However, program-aided approaches create new failure modes:

  1. Structural misalignment: When cells contain composite data (e.g., "John Smith, USA"), the model must infer which parts map to which logical columns. A Text-to-SQL system might generate SELECT country FROM cyclist_table WHERE country = 'John Smith' because the schema is ambiguous in the source table. The model sees raw text, not normalized fields, and cannot reliably parse them without schema hints.
  2. Semantic brittleness: Slight schema changes or missing columns cause complete failure. If the table lacks a pre-split country column, Text-to-SQL has no graceful fallback. The generated query fails at execution with an ambiguous error message ("column not found"), and the system must either retry blind or halt. In contrast, human reasoners would simply split the composite cell on the fly.
  3. Infrastructure requirements: Executing arbitrary code requires sandboxing, error handling, and query validation, all operational complexity that pure language-based approaches avoid. Deploying a Table-QA system with code generation demands SQL parsers, connection pools, transaction management, and security policies. Many edge deployments (mobile, embedded, on-device AI) cannot support this overhead.
  4. Reproducibility gaps: Generated SQL may have multiple syntactically valid forms that produce different results depending on the database system, collation rules, or NULL-handling semantics. Testing becomes brittle; a query that works on PostgreSQL may fail subtly on SQLite.

Information density collapse: Complex tables with hundreds of rows overwhelm token budgets. A table with 500 rows and 10 columns holds 5,000 cells, so even at one token per cell a naive serialization consumes at least 5,000 tokens before the model has seen the question. This dilutes the signal-to-noise ratio and increases latency. Pruning the table requires the model to know in advance which rows matter, creating a chicken-and-egg dependency: to answer the question, you must first know which rows are relevant, but identifying relevant rows is itself part of the question.

The core insight: Existing approaches treat tables as immutable input surfaces. They either re-encode tables as text (losing structure), generate code to query them (losing transparency and graceful degradation), or simply hope the model can reason over massive serialized tables (losing efficiency). None adapts the table itself to make the question answerable through iterative, interpretable refinement.

How It Works

Chain-of-Table reimagines table understanding as a sequence of explicit, semantic table mutations. The framework operates through a three-stage inference pipeline, executed iteratively until the table reaches a form directly answerable by the LLM.

Stage 1: Dynamic Operation Selection

The LLM receives a prompt triplet: (T, Q, chain), where:

  • T is the current intermediate table state—initially the input table, then progressively transformed
  • Q is the natural language question ("Which country had the most cyclists finish in the top 3?")
  • chain is the operation history, a list of previously executed operations (e.g., [ADD_COLUMN(country), SELECT_ROWS(1,2,3)])

The prompt uses in-context learning: it shows the LLM a few examples of (table state, question, appropriate next operation) tuples, then asks the model to predict the next operation from a fixed pool. Operations include:

  • ADD_COLUMN(name, source_column, extraction_rule): Create a new column by extracting or transforming data from an existing column (e.g., splitting "John Smith, USA" into a new "country" column using a string split operation)
  • SELECT_ROWS(indices): Retain only specified rows by index, pruning irrelevant data (e.g., SELECT_ROWS(0, 1, 2) keeps only the first three rows)
  • FILTER(column, condition): Keep rows where a column satisfies a logical condition (e.g., FILTER(rank, <= 3) retains rows where rank is 1, 2, or 3)
  • AGGREGATE(group_by_column, aggregation_func, output_column): Group rows by a column value and apply a reduction function (e.g., AGGREGATE(country, COUNT, cyclist_count) groups cyclists by country and counts them per group)
  • SORT(column, order): Reorder rows by column value in ascending or descending order (e.g., SORT(rank, ASC))
  • DONE: Signal that the table has been sufficiently refined and the LLM is ready to extract the final answer
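To make the prompt triplet concrete, here is a minimal sketch of how a Stage 1 prompt might be assembled from (T, Q, chain) plus a few demonstrations. The function names `serialize_table` and `build_selection_prompt`, the prompt wording, and the list-of-dicts table type are assumptions made for illustration; they are not the framework's actual prompt.

```python
from typing import Dict, List, Tuple

Table = List[Dict[str, object]]  # one dict per row; an assumption for this sketch

# Fixed operation pool, copied from the list above.
OPERATION_POOL = ["ADD_COLUMN", "SELECT_ROWS", "FILTER", "AGGREGATE", "SORT", "DONE"]


def serialize_table(table: Table) -> str:
    """Render rows as pipe-delimited text with a header line."""
    if not table:
        return "(empty table)"
    columns = list(table[0].keys())
    lines = [" | ".join(columns)]
    for row in table:
        lines.append(" | ".join(str(row[c]) for c in columns))
    return "\n".join(lines)


def build_selection_prompt(
    table: Table,
    question: str,
    chain: List[str],
    examples: List[Tuple[str, str, str]],
) -> str:
    """Assemble few-shot demonstrations followed by the current (T, Q, chain)."""
    blocks = []
    for table_text, ex_question, next_op in examples:
        blocks.append(
            f"Table:\n{table_text}\nQuestion: {ex_question}\nNext operation: {next_op}"
        )
    history = " -> ".join(chain) if chain else "(none)"
    blocks.append(
        f"Table:\n{serialize_table(table)}\n"
        f"Question: {question}\n"
        f"Operations so far: {history}\n"
        f"Choose one of: {', '.join(OPERATION_POOL)}\n"
        "Next operation:"
    )
    return "\n\n".join(blocks)


table = [
    {"rank": 1, "name_country": "John Smith, USA"},
    {"rank": 2, "name_country": "Marie Dupont, Canada"},
]
# A single made-up demonstration: (table text, question, next operation).
demo = ("rank | cyclist\n1 | A. Rider (ITA)", "Which country won?", "ADD_COLUMN")
prompt = build_selection_prompt(
    table, "Which country had the most cyclists finish in the top 3?", [], [demo]
)
print(prompt)
```

The prompt ends with an open "Next operation:" slot, so the model only has to complete a single operation name.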

This set is finite and interpretable. At selection time the LLM only picks an operation name from the pool; it does not emit arguments, which are generated in a separate step (Stage 2). The operation pool is fixed at deployment time, and the LLM's role here is pattern matching: "Given this table structure and question, which pre-defined operation logically applies next?"

The critical design choice: operation selection carries no free parameters. The LLM predicts ADD_COLUMN, not ADD_COLUMN(arg1, arg2, arg3). This splits the problem into two narrower subproblems, each easier for in-context learning. Generating an operation and all of its arguments in a single pass invites off-by-one errors, argument swaps, and hallucinated arguments; separating operation selection (one choice from a small set) from argument generation (free-form text for each argument) is intended to reduce those errors.
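Because selection is a single choice from a closed set, the reply can be validated with a few lines of code. The sketch below is one possible way to do that; `parse_operation_choice` is a hypothetical helper name, and the normalization rules (first line only, case-insensitive, spaces treated as underscores) are assumptions.

```python
from typing import Optional

OPERATION_POOL = {"ADD_COLUMN", "SELECT_ROWS", "FILTER", "AGGREGATE", "SORT", "DONE"}


def parse_operation_choice(raw_reply: str) -> Optional[str]:
    """Return the chosen operation name, or None if it is not in the fixed pool."""
    stripped = raw_reply.strip()
    if not stripped:
        return None
    # Keep only the first line and drop anything after an opening parenthesis,
    # in case the model appends arguments despite being asked not to.
    first_line = stripped.splitlines()[0]
    name = first_line.split("(", 1)[0].strip().upper().replace(" ", "_")
    return name if name in OPERATION_POOL else None


print(parse_operation_choice("  add_column "))        # ADD_COLUMN
print(parse_operation_choice("FILTER(rank, <= 3)"))   # FILTER
print(parse_operation_choice("SHUFFLE"))              # None
```

A `None` result is the signal to re-prompt or stop, which is cheaper to handle than a malformed query discovered at execution time.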

Stage 2: Argument Generation

Once the operation is selected, the framework re-prompts the LLM with:

  • The current table T (serialized or displayed visually)
  • The selected operation and its required argument signature (e.g., ADD_COLUMN requires a source column name and an extraction rule)
  • The question Q

The LLM now generates concrete arguments. For ADD_COLUMN(source=name_country, extraction=extract_country_from_composite_cell), the model outputs the column name "name_country" and a natural language description of how to extract the country (e.g., "split by comma and take the right side"). The framework parses these arguments and binds them to the operation.
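A rough sketch of the parse-and-bind step follows. It assumes the LLM is asked to reply with one `key: value` line per argument, which is an illustrative convention rather than something the framework prescribes; `SIGNATURES`, `COLUMN_ARGS`, and `bind_arguments` are hypothetical names, with the argument lists taken from the operation signatures listed earlier.

```python
from typing import Dict, List

# Required argument names per operation, following the signatures listed earlier.
SIGNATURES: Dict[str, List[str]] = {
    "ADD_COLUMN": ["name", "source_column", "extraction_rule"],
    "SELECT_ROWS": ["indices"],
    "FILTER": ["column", "condition"],
    "AGGREGATE": ["group_by_column", "aggregation_func", "output_column"],
    "SORT": ["column", "order"],
    "DONE": [],
}

# Arguments that must name a column already present in the table.
COLUMN_ARGS = {"source_column", "column", "group_by_column"}


def bind_arguments(operation: str, reply: str, columns: List[str]) -> Dict[str, str]:
    """Parse 'key: value' lines from the LLM reply and validate them."""
    parsed: Dict[str, str] = {}
    for line in reply.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            parsed[key.strip().lower()] = value.strip()
    missing = [arg for arg in SIGNATURES[operation] if arg not in parsed]
    if missing:
        raise ValueError(f"{operation} is missing arguments: {missing}")
    for arg in SIGNATURES[operation]:
        if arg in COLUMN_ARGS and parsed[arg] not in columns:
            raise ValueError(f"{operation}: unknown column {parsed[arg]!r}")
    return {arg: parsed[arg] for arg in SIGNATURES[operation]}


reply = (
    "name: country\n"
    "source_column: name_country\n"
    "extraction_rule: split by comma and take the right side"
)
args = bind_arguments("ADD_COLUMN", reply, ["rank", "name_country"])
print(args)
```

Rejecting an unknown column here, before anything executes, is what lets the system retry only the argument step while keeping the selected operation.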

This two-step decoupling (operation selection, then argument binding) improves accuracy over end-to-end code generation because:

  1. Focused prediction: The LLM makes narrower decisions at each step. Instead of generating a full SQL query with syntax, table names, WHERE clauses, and aggregations in one shot, it picks one operation type, then fills in arguments. Error rates drop because there are fewer degrees of freedom per prediction.
  2. Error locality: If argument generation fails (e.g., the LLM returns an invalid column name), the operation selection (likely correct) remains useful for human inspection and debugging. The system can reject the argument, backtrack one step, and retry with a corrected prompt rather than aborting entirely.
  3. Interpretability: Each stage outputs human-readable artifacts (operation names, argument values) rather than opaque tokens. A debugging engineer can see that the system tried SELECT_ROWS(1, 2, 3) and understand why it succeeded or failed without parsing generated code.
  4. Graceful fallback: If argument generation is ambiguous (the LLM outputs "split by comma or space"), the framework can prompt for clarification or apply a conservative heuristic, rather than crashing on invalid SQL syntax.

Stage 3: Deterministic Execution and Iteration

The framework executes the operation deterministically—not via LLM generation, but via standard tabular libraries (Pandas-equivalent operations). Extracting a country column from a composite cell uses a string split operation, not an LLM call. This ensures:

  • Reproducibility: Running the same question twice yields identical intermediate tables
  • Auditability: Engineers can trace each step and verify correctness without invoking the LLM again
  • Contained errors: An invalid argument from the LLM (for example, a misnamed column) fails immediately at execution and can be retried, rather than cascading into later iterations. By contrast, a misnamed variable in generated SQL may cause silent logical errors downstream
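As a small illustration of deterministic execution, the sketch below maps extraction-rule names to plain string functions and applies them without any LLM call. The rule names `split_right` and `split_left` and the helper `add_column` are assumptions for this example, and tables are plain lists of dicts to keep it dependency-free.

```python
from typing import Callable, Dict, List

Table = List[Dict[str, object]]

# Hypothetical rule names mapped to plain string functions (no LLM involved).
EXTRACTION_RULES: Dict[str, Callable[[str], str]] = {
    "split_right": lambda cell: cell.rsplit(",", 1)[-1].strip(),
    "split_left": lambda cell: cell.split(",", 1)[0].strip(),
}


def add_column(table: Table, name: str, source_column: str, rule: str) -> Table:
    """Return a new table with `name` derived from `source_column`.

    The input table is not mutated. An unknown rule raises ValueError and a
    missing source column raises KeyError, so bad arguments surface at once.
    """
    if rule not in EXTRACTION_RULES:
        raise ValueError(f"Unknown extraction rule: {rule}")
    extract = EXTRACTION_RULES[rule]
    return [{**row, name: extract(str(row[source_column]))} for row in table]


rows = [
    {"rank": 1, "name_country": "John Smith, USA"},
    {"rank": 2, "name_country": "Marie Dupont, Canada"},
]
first = add_column(rows, "country", "name_country", "split_right")
second = add_column(rows, "country", "name_country", "split_right")
print(first == second)                  # True: same inputs, same intermediate table
print([r["country"] for r in first])    # ['USA', 'Canada']
```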

The resulting intermediate table is then fed back as input T to the next iteration of Stage 1. The operation chain grows: [ADD_COLUMN(country, name_country, split_right), SELECT_ROWS(0, 1, 2), AGGREGATE(country, COUNT, count_cyclists)]. Iteration continues until one of these termination conditions is met:

  • The LLM predicts a terminal operation like DONE, signaling the table is ready
  • The operation chain reaches a pre-set maximum depth (e.g., 10 steps, preventing infinite loops)
  • The table size drops below a threshold (e.g., fewer than 100 rows), indicating sufficient pruning
  • No valid operation can be selected (the LLM predicts an out-of-vocabulary operation)
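The termination conditions above can be sketched as a single loop. In this sketch `next_step` stands in for Stages 1 and 2 combined (a scripted function here, not a real model), and `run_chain`, `row_threshold`, and the returned reason strings are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Table = List[Dict[str, object]]
Step = Tuple[str, Callable[[Table], Table]]  # (label, deterministic transform)


def run_chain(
    table: Table,
    next_step: Callable[[Table, List[str]], Optional[Step]],
    max_depth: int = 10,
    row_threshold: int = 0,  # 0 disables the size-based stop
) -> Tuple[Table, List[str], str]:
    """Apply steps until one of the four termination conditions fires."""
    chain: List[str] = []
    while True:
        if len(chain) >= max_depth:
            return table, chain, "max_depth"
        if len(table) < row_threshold:
            return table, chain, "pruned_enough"
        step = next_step(table, chain)
        if step is None:  # out-of-pool or unparseable operation
            return table, chain, "invalid_operation"
        label, transform = step
        if label == "DONE":
            return table, chain + ["DONE"], "done"
        table = transform(table)
        chain.append(label)


def scripted_llm(table: Table, chain: List[str]) -> Optional[Step]:
    """Stand-in for the LLM: replays a fixed script, one step per call."""
    script: List[Step] = [
        ("FILTER(rank, <= 3)", lambda t: [r for r in t if r["rank"] <= 3]),
        ("DONE", lambda t: t),
    ]
    return script[len(chain)] if len(chain) < len(script) else None


rows = [
    {"rank": i, "country": c}
    for i, c in enumerate(["USA", "Canada", "USA", "Spain"], start=1)
]
final_table, chain, reason = run_chain(rows, scripted_llm)
print(chain, reason, len(final_table))
# ['FILTER(rank, <= 3)', 'DONE'] done 3
```

Returning the stop reason alongside the table lets the caller distinguish a clean DONE from a depth cutoff before trusting the final answer.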

Once iteration stops, Stage 3b: Final Answer Extraction occurs. The fully-transformed table is presented alongside the original question, and the LLM generates a natural language answer in one shot. Because the table has been iteratively reshaped to focus on relevant data, this final extraction requires minimal reasoning—often a simple lookup, count, or comparison. For example, if the table has been reduced to:

text
| country | cyclist_count |
|---------|---------------|
| USA     | 2             |
| Canada  | 1             |

The LLM simply outputs: "USA had the most cyclists finish in the top 3, with 2 cyclists."

Why This Design Works

The innovation addresses each failure mode of prior approaches:

  1. Structure preservation: Unlike text-based CoT, operations maintain tabular semantics throughout the reasoning chain. Selecting rows preserves column relationships; aggregating groups creates meaningful summaries. The table is still serialized for each prompt, but the reasoning steps never restate it in prose, so intermediate state is not lost to paraphrase.
  2. Graceful degradation: If operation selection fails (the LLM predicts a nonsensical operation like "SHUFFLE" which is not in the fixed pool), the framework detects this and can backtrack, re-prompt with better examples, or allow human intervention. Code-based approaches fail catastrophically at execution with error traces; here, failures are interpretable and recoverable.
  3. Information density control: Operations like SELECT_ROWS and FILTER proactively reduce table size. A 500-row table is pruned to the top 3 rows before the final answering step, consuming perhaps 50 tokens instead of 5,000, a 100x reduction in token use. This also reduces latency for that step.
  4. No infrastructure overhead: Operations execute via standard table manipulation. No SQL parser, code sandbox, or query validator is needed. The framework works with plain Python dictionaries, Pandas DataFrames, or any tabular representation. This makes deployment straightforward: LLM API calls plus local table operations.
  5. Explainability: The operation chain is a human-readable transcript of reasoning. "ADD_COLUMN(country, name_country, split_right) → SELECT_ROWS(0, 1, 2) → AGGREGATE(country, COUNT, count_cyclists) → DONE" immediately conveys what transformations were applied. A non-technical user can follow the reasoning.
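The information-density point in the list above can be checked with a back-of-the-envelope script. It uses a deliberately crude proxy (one token per cell and per header) rather than a real tokenizer, so the numbers are lower bounds under that assumption; `cell_count` is a hypothetical helper and the table contents are synthetic placeholders.

```python
from typing import Dict, List

Table = List[Dict[str, object]]


def cell_count(table: Table) -> int:
    """Crude lower bound on prompt size: one token per cell plus one per header."""
    if not table:
        return 0
    return len(table[0]) * (len(table) + 1)


# Synthetic 500-row, 10-column table (values are placeholders).
table = [
    {"rank": i, **{f"col_{j}": f"v{i}_{j}" for j in range(9)}}
    for i in range(1, 501)
]
pruned = [row for row in table if row["rank"] <= 3]  # FILTER(rank, <= 3)

before, after = cell_count(table), cell_count(pruned)
print(before, after, round(before / after))
# 5010 40 125
```

Under this proxy the pruned table is roughly two orders of magnitude smaller, in line with the estimate above.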

The framework trades off a fixed operation set (less flexible than arbitrary code generation) for transparency and robustness—a favorable trade for many table-QA applications where correctness and auditability matter more than handling arbitrary edge cases.

Implementation Example

Below is a minimal Python implementation demonstrating the Chain-of-Table pipeline:

python
import pandas as pd
from typing import Any, List

class TableOperation:
    """Base class for table operations."""
    def apply(self, table: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

class AddColumnOp(TableOperation):
    """Add a new column by extracting from an existing column."""
    def __init__(self, new_col_name: str, source_col: str, extraction_rule: str):
        self.new_col_name = new_col_name
        self.source_col = source_col
        self.extraction_rule = extraction_rule
    
    def apply(self, table: pd.DataFrame) -> pd.DataFrame:
        # extraction_rule is matched by substring, e.g. "split_right_comma"
        # means split on commas and keep the right-most piece
        result = table.copy()
        if "split_right_comma" in self.extraction_rule:
            result[self.new_col_name] = result[self.source_col].str.split(',').str[-1].str.strip()
        elif "split_left_comma" in self.extraction_rule:
            result[self.new_col_name] = result[self.source_col].str.split(',').str[0].str.strip()
        else:
            # Fail loudly instead of silently returning the table unchanged
            raise ValueError(f"Unknown extraction rule: {self.extraction_rule}")
        return result
    
    def __repr__(self):
        return f"ADD_COLUMN({self.new_col_name}, {self.source_col}, {self.extraction_rule})"

class SelectRowsOp(TableOperation):
    """Select specific rows by index."""
    def __init__(self, indices: List[int]):
        self.indices = indices
    
    def apply(self, table: pd.DataFrame) -> pd.DataFrame:
        return table.iloc[self.indices].reset_index(drop=True)
    
    def __repr__(self):
        return f"SELECT_ROWS({self.indices})"

class FilterOp(TableOperation):
    """Filter rows based on a condition."""
    def __init__(self, column: str, condition: str, value: Any):
        self.column = column
        self.condition = condition  # e.g., "<=", "==", ">", "in"
        self.value = value
    
    def apply(self, table: pd.DataFrame) -> pd.DataFrame:
        result = table.copy()
        if self.condition == "<=":
            result = result[result[self.column] <= self.value]
        elif self.condition == "==":
            result = result[result[self.column] == self.value]
        elif self.condition == ">":
            result = result[result[self.column] > self.value]
        elif self.condition == "in":
            result = result[result[self.column].isin(self.value)]
        else:
            # Fail loudly instead of silently returning every row
            raise ValueError(f"Unknown filter condition: {self.condition}")
        return result.reset_index(drop=True)
    
    def __repr__(self):
        return f"FILTER({self.column}, {self.condition}, {self.value})"

class AggregateOp(TableOperation):
    """Aggregate rows by grouping and applying a reduction function."""
    def __init__(self, group_by_col: str, agg_func: str, output_col: str):
        self.group_by_col = group_by_col
        self.agg_func = agg_func  # e.g., "count", "sum", "mean"
        self.output_col = output_col
    
    def apply(self, table: pd.DataFrame) -> pd.DataFrame:
        result = table.copy()
        if self.agg_func == "count":
            grouped = result.groupby(self.group_by_col).size().reset_index(name=self.output_col)
        elif self.agg_func == "sum":
            # No value column is passed in, so sum the first numeric column
            # that is not the grouping column
            numeric_cols = result.select_dtypes(include=['number']).columns.tolist()
            if self.group_by_col in numeric_cols:
                numeric_cols.remove(self.group_by_col)
            if not numeric_cols:
                raise ValueError("No numeric column available to sum")
            grouped = result.groupby(self.group_by_col)[numeric_cols[0]].sum().reset_index(name=self.output_col)
        else:
            raise ValueError(f"Unknown aggregation function: {self.agg_func}")
        return grouped
    
    def __repr__(self):
        return f"AGGREGATE({self.group_by_col}, {self.agg_func}, {self.output_col})"

class SortOp(TableOperation):
    """Sort rows by a column."""
    def __init__(self, column: str, order: str = "ASC"):
        self.column = column
        self.order = order  # "ASC" or "DESC"
    
    def apply(self, table: pd.DataFrame) -> pd.DataFrame:
        ascending = (self.order == "ASC")
        return table.sort_values(by=self.column, ascending=ascending).reset_index(drop=True)
    
    def __repr__(self):
        return f"SORT({self.column}, {self.order})"

class ChainOfTableFramework:
    """Executes the Chain-of-Table reasoning pipeline."""
    
    def __init__(self, max_iterations: int = 10, min_table_size: int = 1):
        self.max_iterations = max_iterations
        self.min_table_size = min_table_size
        self.operation_chain: List[TableOperation] = []
    
    def add_operation(self, op: TableOperation) -> None:
        """Add an operation to the chain."""
        self.operation_chain.append(op)
    
    def get_operation_chain_str(self) -> str:
        """Return the operation chain as a human-readable string."""
        return " → ".join([str(op) for op in self.operation_chain])
    
    def execute(self, table: pd.DataFrame) -> pd.DataFrame:
        """Execute all operations in the chain, in order."""
        result = table.copy()
        print("Executing operation chain:")
        for i, op in enumerate(self.operation_chain):
            result = op.apply(result)
            print(f"  Step {i+1}: {op}")
            print(f"    Table shape: {result.shape}")
            if result.shape[0] <= 5:
                print(f"    Table:\n{result}")
        return result
    
    def reset(self) -> None:
        """Clear the operation chain."""
        self.operation_chain = []

# Example usage
if __name__ == "__main__":
    # Create a sample table of cyclist rankings
    # (two USA cyclists are in the top 3, so the question has a single answer)
    data = {
        "Rank": [1, 2, 3, 4, 5],
        "Name_Country": ["John Smith, USA", "Marie Dupont, Canada", "Alice Johnson, USA",
                         "Carlos Gomez, Spain", "Chen Wei, China"],
        "Points": [100, 95, 90, 85, 80]
    }
    table = pd.DataFrame(data)
    
    print("=" * 70)
    print("CHAIN-OF-TABLE EXAMPLE: Which country had the most cyclists in top 3?")
    print("=" * 70)
    print("\nOriginal table:")
    print(table)
    
    # Initialize framework
    cot = ChainOfTableFramework(max_iterations=10)
    
    # The operation choices below are hard-coded to simulate what the LLM
    # would select in Stages 1 and 2; no model is called in this example.
    
    # Step 1: Add a column to extract country from composite cell
    print("\n" + "=" * 70)
    print("STAGE 1: Operation Selection")
    print("=" * 70)
    print("\nLLM selects: ADD_COLUMN (needed to separate name from country)")
    cot.add_operation(AddColumnOp("Country", "Name_Country", "split_right_comma"))
    
    # Step 2: Filter to top 3 ranks
    print("\nLLM selects: FILTER (to keep only top 3 ranked cyclists)")
    cot.add_operation(FilterOp("Rank", "<=", 3))
    
    # Step 3: Aggregate by country
    print("\nLLM selects: AGGREGATE (to count cyclists per country)")
    cot.add_operation(AggregateOp("Country", "count", "Cyclist_Count"))
    
    # Step 4: Sort to find the maximum
    print("\nLLM selects: SORT (to order by count descending)")
    cot.add_operation(SortOp("Cyclist_Count", "DESC"))
    
    # Step 5: Done
    print("\nLLM selects: DONE (table is ready for final answer extraction)")
    
    print("\n" + "=" * 70)
    print("STAGE 3: Deterministic Execution")
    print("=" * 70)
    final_table = cot.execute(table)
    
    print("\n" + "=" * 70)
    print("STAGE 3b: Final Answer Extraction")
    print("=" * 70)
    print("\nFinal transformed table:")
    print(final_table)
    
    print("\nLLM generates final answer:")
    # Simulated: the real pipeline would ask the LLM to read the answer off the table
    top_country = final_table.iloc[0]
    answer = f"{top_country['Country']} had the most cyclists in the top 3, with {int(top_country['Cyclist_Count'])} cyclist(s)."
    print(f"  → {answer}")
    
    print("\n" + "=" * 70)
    print("OPERATION CHAIN TRANSCRIPT")
    print("=" * 70)
    # DONE is a terminal signal rather than a table operation, so it is appended here
    print(cot.get_operation_chain_str() + " → DONE")
▶ OUTPUT
======================================================================
CHAIN-OF-TABLE EXAMPLE: Which country had the most cyclists in top 3?
======================================================================

Original table:
   Rank          Name_Country  Points
0     1       John Smith, USA     100
1     2  Marie Dupont, Canada      95
2     3    Alice Johnson, USA      90
3     4   Carlos Gomez, Spain      85
4     5       Chen Wei, China      80

======================================================================
STAGE 1: Operation Selection
======================================================================

LLM selects: ADD_COLUMN (needed to separate name from country)

LLM selects: FILTER (to keep only top 3 ranked cyclists)

LLM selects: AGGREGATE (to count cyclists per country)

LLM selects: SORT (to order by count descending)

LLM selects: DONE (table is ready for final answer extraction)

======================================================================
STAGE 3: Deterministic Execution
======================================================================
Executing operation chain:
  Step 1: ADD_COLUMN(Country, Name_Country, split_right_comma)
    Table shape: (5, 4)
    Table:
   Rank          Name_Country  Points  Country
0     1       John Smith, USA     100      USA
1     2  Marie Dupont, Canada      95   Canada
2     3    Alice Johnson, USA      90      USA
3     4   Carlos Gomez, Spain      85    Spain
4     5       Chen Wei, China      80    China
  Step 2: FILTER(Rank, <=, 3)
    Table shape: (3, 4)
    Table:
   Rank          Name_Country  Points  Country
0     1       John Smith, USA     100      USA
1     2  Marie Dupont, Canada      95   Canada
2     3    Alice Johnson, USA      90      USA
  Step 3: AGGREGATE(Country, count, Cyclist_Count)
    Table shape: (2, 2)
    Table:
   Country  Cyclist_Count
0   Canada              1
1      USA              2
  Step 4: SORT(Cyclist_Count, DESC)
    Table shape: (2, 2)
    Table:
   Country  Cyclist_Count
0      USA              2
1   Canada              1

======================================================================
STAGE 3b: Final Answer Extraction
======================================================================

Final transformed table:
   Country  Cyclist_Count
0      USA              2
1   Canada              1

LLM generates final answer:
  → USA had the most cyclists in the top 3, with 2 cyclist(s).

======================================================================
OPERATION CHAIN TRANSCRIPT
======================================================================
ADD_COLUMN(Country, Name_Country, split_right_comma) → FILTER(Rank, <=, 3) → AGGREGATE(Country, count, Cyclist_Count) → SORT(Cyclist_Count, DESC) → DONE

Empirical Results

Chain-of-Table has been evaluated on two major table-understanding benchmarks:

WikiTableQuestions (WTQ): A dataset of 22,033 questions over tables extracted from Wikipedia. This benchmark tests the model's ability to understand heterogeneous table schemas and reason over varying table sizes and complexities.

  • Chain-of-Table accuracy (GPT-3.5): 71.3%
  • Generic Chain-of-Thought baseline: 37.1%
  • Absolute improvement: +34.2 percentage points
  • Text-to-SQL baseline (with database execution): 69.2%

Chain-of-Table achieves better performance than CoT and competitive accuracy with Text-to-SQL, despite not requiring a SQL parser or database engine. The improvement over CoT is dramatic because Chain-of-Table forces the model to think structurally (operations) rather than linguistically (free-form reasoning).

Spider (SQL benchmark): A dataset of 10,181 questions paired with 5,693 unique complex SQL queries over 200 databases. This is a more challenging benchmark where Text-to-SQL is the expected baseline.

  • Chain-of-Table accuracy: 65.7%
  • State-of-the-art Text-to-SQL: 68.5%
  • Difference: -2.8 percentage points

Chain-of-Table is slightly below Text-to-SQL on Spider because Spider is specifically designed to test SQL generation capabilities. However, the gap is small enough that for many practitioners, Chain-of-Table's lack of infrastructure requirements, superior interpretability, and graceful degradation may outweigh the modest accuracy difference.

Token efficiency: On WikiTableQuestions, Chain-of-Table generates an average of 3.2 tokens per step to name the next operation, plus 0.8 tokens per argument when generating operation arguments. By comparison, a naive CoT approach needs ~500 tokens per reasoning step to re-encode a medium-sized table in free-form text. Over a typical 5-7 operation chain, Chain-of-Table generates ~25 tokens of reasoning output, versus ~2,500 tokens for CoT, a 100x reduction. These figures count generated tokens only; every LLM call still reads the serialized table as input.
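The arithmetic behind the ~25 versus ~2,500 comparison can be reproduced in a few lines. The per-step figures come from the paragraph above; the chain length of 6, one argument per operation, and 5 CoT re-encoding steps are assumptions picked to fall inside the quoted ranges.

```python
# Figures quoted in the paragraph above.
TOKENS_PER_SELECTION = 3.2
TOKENS_PER_ARGUMENT = 0.8
COT_TOKENS_PER_STEP = 500

# Assumptions for this sketch: a 6-step chain, one argument per operation,
# and 5 re-encoding steps for the CoT baseline.
steps, args_per_op, cot_steps = 6, 1, 5

chain_of_table = steps * (TOKENS_PER_SELECTION + args_per_op * TOKENS_PER_ARGUMENT)
chain_of_thought = cot_steps * COT_TOKENS_PER_STEP
ratio = chain_of_thought / chain_of_table

print(round(chain_of_table, 1), chain_of_thought, round(ratio))
# 24.0 2500 104
```

With more arguments per operation the Chain-of-Table total rises, so the ratio is best read as an order-of-magnitude estimate.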

When to Use Chain-of-Table

Chain-of-Table is ideal for:

  • Explainability-critical applications: Healthcare, finance, legal discovery where stakeholders need to audit reasoning steps
  • Heterogeneous or evolving schemas: Tables where columns may appear, disappear, or change meaning; graceful degradation is essential
  • Token-budget-constrained deployments: Mobile, edge devices, cost-sensitive API calls where token efficiency is paramount
  • Known-schema table QA: Benchmarks and datasets with consistent table structures (e.g., company data warehouses)

Chain-of-Table is less suitable for:

  • Arbitrary SQL complexity: Questions requiring complex JOINs across multiple tables, window functions, or complex CTEs. The fixed operation set cannot express these.
  • Extremely large tables (10,000+ rows): Even with pruning, some datasets may be too large; a proper SQL system with indexing is more efficient.
  • Real-time, latency-critical applications: Each operation triggers two LLM calls (operation selection, then argument generation), introducing latency overhead. Code-based approaches execute faster once the code is generated.

Comparison with Alternatives

| Aspect | Chain-of-Thought | Chain-of-Table | Text-to-SQL |
|--------|------------------|----------------|-------------|
| Accuracy (WTQ) | 37.1% | 71.3% | 69.2% |
| Infrastructure | None | None | SQL parser + DB |
| Token efficiency | Low (re-encodes tables) | High (structural ops) | Medium |
| Interpretability | Free-form (hard to audit) | Explicit ops (easy to audit) | Code (requires expertise) |
| Graceful degradation | N/A (text is always valid) | Good (structured failure) | Poor (syntax errors fatal) |
| Flexibility | Very high (any reasoning) | Medium (fixed op set) | Very high (arbitrary SQL) |

Key Takeaways

  • Iterative structural refinement beats free-form reasoning: Chain-of-Table's 34.2-percentage-point accuracy improvement over CoT on WikiTableQuestions demonstrates that explicit, semantic table operations are more reliable than asking LLMs to re-encode tables in natural language.
  • Two-stage decomposition (operation + arguments) improves robustness: Separating operation selection from argument generation narrows the prediction scope at each step, reducing hallucination and improving in-context learning. This is a generalizable principle for complex LLM reasoning.
  • Deterministic execution prevents error accumulation: By executing table operations via standard libraries (not LLM-generated code), Chain-of-Table ensures reproducibility and prevents cascading errors. Each operation is transparent and auditable.
  • Fixed operation sets trade flexibility for clarity: While Chain-of-Table cannot express arbitrary SQL or complex JOINs, it provides superior explainability and deployment simplicity. For common table-QA tasks (filtering, grouping, aggregation), the fixed set is sufficient.
  • Token efficiency is a hidden benefit: Chain-of-Table uses ~100x fewer tokens for table reasoning than naive CoT approaches. This translates to lower latency, reduced API costs, and feasibility on edge devices—rarely discussed but practically important.
  • Graceful degradation is critical in production: Unlike Text-to-SQL systems that fail hard on schema mismatches or syntax errors, Chain-of-Table's structured operations fail interpretably. A downstream system can retry, apply corrections, or escalate to a human with full context of what went wrong.
  • Schema-agnostic operation design enables real-world deployment: Because operations work with generic column references and extraction rules (not hard-coded schemas), Chain-of-Table can handle tables it has never seen before. This is essential for enterprise table-QA systems serving thousands of user-uploaded tables.

Overview

Chain-of-Table is a novel reasoning framework that decomposes complex table-based question-answering tasks into iterative, interpretable table transformations. Unlike Chain-of-Thought (which uses free-form text reasoning) or Text-to-SQL (which generates code), Chain-of-Table maintains an explicit, mutable table state and applies a sequence of deterministic operations. This approach offers superior token efficiency, interpretability, and robustness compared to existing alternatives.

Operational Pipeline

The Chain-of-Table architecture decomposes into three distinct phases:

  1. Stage 1 (Operation Selection): The LLM examines the current table state and the original question, then selects the next required transformation from a fixed operation registry.
  2. Stage 2 (Argument Generation): The LLM generates concrete parameters for the selected operation (e.g., which column to filter on, or what extraction rule to apply).
  3. Stage 3 (Iterative Execution & Termination): The operation executes deterministically on the table state. The loop repeats until termination conditions are met (e.g., the LLM predicts DONE), at which point Stage 3b extracts the final answer from the transformed table.

The following table contrasts this approach with alternatives:

| Aspect | Chain-of-Table | Chain-of-Thought | Text-to-SQL |
|--------|----------------|------------------|-------------|
| Reasoning representation | Table mutations (ADD_COLUMN, FILTER, AGGREGATE) | Free-form text | Code (SQL statements) |
| Token efficiency | High (tables shrink via pruning) | Medium (re-encodes tables repeatedly in text) | Medium (code + full table schema) |
| Interpretability | Excellent (operation chain is readable) | Good (text is natural language) | Good (SQL is readable) |
| Robustness to schema changes | Moderate (operation pool is fixed but operations adapt to schema) | Low (free-form reasoning assumes fixed structure) | Low (SQL depends on exact schema matching) |
| Execution safety | Safe (deterministic table ops) | N/A (no execution) | Requires sandboxing (arbitrary code) |
| Failure modes | Wrong operation selected (interpretable) | Hallucinated reasoning chains | SQL syntax error, semantic mismatch |
| Training requirement | None (inference-time only) | None (inference-time only) | Often requires supervised data for fine-tuning |
| Latency (complexity O(n) steps) | O(n × table width × 2 LLM calls) | O(n × 1 LLM call) | O(n × 1-2 LLM calls + execution time) |

Core Components

Operation Registry: A fixed set of table operations, each with a well-defined signature. The registry is the backbone of Chain-of-Table's deterministic execution model.

  • Stateless: Operations do not maintain mutable state across calls. Each operation is a pure function that takes a table state and returns a new table state.
  • Deterministic: Given identical inputs, operations produce identical outputs. This enables reproducible reasoning chains and simplifies debugging.
  • Reversible (where applicable): Operations like FILTER can be undone by retaining original row indices, enabling backtracking if the LLM realizes a wrong path was taken.
  • Standard operations include: ADD_COLUMN (decompose composite cells, create derived columns), SELECT_ROWS (pick specific row indices), FILTER (conditional row retention), AGGREGATE (grouping and summarization), SORT (reorder rows), and DROP_COLUMN (remove irrelevant columns).
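The properties above (stateless, deterministic, reversible where applicable) can be illustrated with a small registry of pure functions. This is a sketch under assumptions: the `operation` decorator, the `REGISTRY` dict, and the `_row_id` bookkeeping column are hypothetical names, not something prescribed by the framework.

```python
from typing import Any, Callable, Dict, List

Table = List[Dict[str, Any]]
REGISTRY: Dict[str, Callable[..., Table]] = {}


def operation(name: str):
    """Register a pure table -> table function under an operation name."""
    def decorator(fn: Callable[..., Table]) -> Callable[..., Table]:
        REGISTRY[name] = fn
        return fn
    return decorator


@operation("FILTER")
def filter_eq(table: Table, column: str, value: Any) -> Table:
    # Record each surviving row's original position so the step can be audited or undone.
    return [
        {**row, "_row_id": row.get("_row_id", i)}
        for i, row in enumerate(table)
        if row.get(column) == value
    ]


@operation("SORT")
def sort_rows(table: Table, column: str, descending: bool = False) -> Table:
    # sorted() returns a new list, so the input table is left untouched.
    return sorted(table, key=lambda row: row[column], reverse=descending)


@operation("DROP_COLUMN")
def drop_column(table: Table, column: str) -> Table:
    return [{k: v for k, v in row.items() if k != column} for row in table]


original = [
    {"name": "A", "country": "USA"},
    {"name": "B", "country": "Spain"},
    {"name": "C", "country": "USA"},
]
filtered = REGISTRY["FILTER"](original, column="country", value="USA")
```

Each function builds new rows instead of editing its input, which is what makes a chain reproducible: replaying the same operations on the same starting table yields the same result.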

State Tracker: Maintains (current_table, operation_chain) across iterations. The state tracker is responsible for:

  • Serialization: Converting the current table into a human-readable string format (with optional truncation for very large tables) to include in prompts.
  • Operation recording: Logging each operation and its arguments for explainability and debugging.
  • State transition: Applying an operation produces a new table state and leaves the previous one untouched (functional programming paradigm), so earlier states remain available for backtracking.
  • Memory efficiency: Tracking table dimensions (row and column counts) to inform termination decisions.
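A minimal sketch of a state tracker covering these four responsibilities is shown below. The class name `TrackedState` and its methods are assumptions for illustration. Note that `frozen=True` only prevents attribute reassignment; the row dictionaries themselves are copied before each operation so that a careless operation cannot edit an earlier state.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Table = List[Dict[str, Any]]


@dataclass(frozen=True)
class TrackedState:
    table: Tuple[Dict[str, Any], ...]
    chain: Tuple[str, ...] = ()

    @property
    def shape(self) -> Tuple[int, int]:
        """(row count, column count), usable by termination checks."""
        return len(self.table), len(self.table[0]) if self.table else 0

    def serialize(self, max_rows: int = 5) -> str:
        """One JSON object per line, truncated for large tables."""
        shown = [json.dumps(row) for row in self.table[:max_rows]]
        hidden = len(self.table) - len(shown)
        if hidden > 0:
            shown.append(f"... ({hidden} more rows)")
        return "\n".join(shown)

    def apply(self, label: str, op: Callable[[Table], Table]) -> "TrackedState":
        """Run `op` on a copy of the rows and return a new state with the step logged."""
        new_table = op([dict(row) for row in self.table])
        return TrackedState(tuple(new_table), self.chain + (label,))


start = TrackedState(tuple({"rank": r} for r in range(1, 8)))
top3 = start.apply("FILTER(rank <= 3)", lambda rows: [r for r in rows if r["rank"] <= 3])
```

Keeping `start` alive after `apply` costs memory proportional to the table size per step, which is usually acceptable for the small tables this kind of reasoning targets.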

Prompt Template Repository: Multiple templates for each stage, optimized to guide the LLM's predictions while keeping token usage low. Typical templates include:

  • Stage 1 template: "Given table T and question Q, which operation is next? Choose from: [operation list]"
  • Stage 2 template: "For operation ADD_COLUMN, provide the source column and extraction rule."
  • Stage 3b template: "The table has been transformed to: [table]. Answer the original question: [Q]."

Each template includes in-context examples (few-shot learning, typically 2-4 carefully curated examples per template) to guide the LLM's predictions. The number and quality of these examples directly affect accuracy.
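As a rough illustration of how a Stage 1 template and its few-shot examples fit together, the sketch below assembles a prompt from a list of demonstrations. The demonstrations in `FEW_SHOT` and the exact wording of the prompt are made up for this example and are not the prompts used in the paper.

```python
from typing import List, Tuple

OPERATION_NAMES = [
    "ADD_COLUMN", "SELECT_ROWS", "FILTER", "AGGREGATE", "SORT", "DROP_COLUMN", "DONE",
]

# Hypothetical demonstrations: (serialized table, question, expected next operation)
FEW_SHOT: List[Tuple[str, str, str]] = [
    ("rank | cyclist\n1 | Alice Smith (USA)", "Which country won?", "ADD_COLUMN"),
    ("country | count\nUSA | 2\nESP | 1", "Which country appears most?", "DONE"),
]


def build_stage1_prompt(
    table_text: str,
    question: str,
    examples: List[Tuple[str, str, str]] = FEW_SHOT,
) -> str:
    """Concatenate the demonstrations, then the live table with an open answer slot."""
    parts = []
    for ex_table, ex_question, ex_answer in examples:
        parts.append(
            f"Table:\n{ex_table}\nQuestion: {ex_question}\nNext operation: {ex_answer}"
        )
    parts.append(
        f"Table:\n{table_text}\nQuestion: {question}\n"
        f"Choose from: {', '.join(OPERATION_NAMES)}\nNext operation:"
    )
    return "\n\n".join(parts)


prompt = build_stage1_prompt("rank | points\n1 | 2500\n2 | 2300", "Who has the most points?")
```

Ending the prompt on the same `Next operation:` cue used in the demonstrations nudges the model to answer with a bare operation name, which keeps Stage 1 parsing trivial.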

Termination Condition Evaluator: Decides when to stop iteration and move to final answer extraction. Multiple conditions are checked:

  • Explicit termination: LLM predicts DONE or ANSWER_READY token.
  • Implicit termination: Operation chain length reaches a threshold (e.g., 15 steps).
  • Table size termination: Row count drops below a minimum threshold (e.g., fewer than 5 rows remaining, indicating drilling down to the answer).
  • Oscillation detection: Same operation is predicted twice consecutively with identical arguments (indicates the LLM is stuck in a loop).
  • Column saturation: Number of columns exceeds a threshold, indicating over-decomposition.
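The five checks can be combined into one small function. The sketch below is one possible arrangement: the step and row thresholds come from the examples in the list above, while `max_cols=20` and the returned reason strings are assumptions. The row-count check is only meaningful when the starting table is larger than the threshold, so a real system would likely skip it for small inputs.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, str]  # (operation name, serialized arguments)


def should_stop(
    predicted: Step,
    history: List[Step],
    n_rows: int,
    n_cols: int,
    max_steps: int = 15,
    min_rows: int = 5,
    max_cols: int = 20,  # assumed value; the text gives no number for this threshold
) -> Optional[str]:
    """Return the name of the first termination condition that fires, or None to continue."""
    if predicted[0] in {"DONE", "ANSWER_READY"}:
        return "explicit"
    if len(history) >= max_steps:
        return "max_steps"
    if n_rows < min_rows:
        return "table_size"
    if history and history[-1] == predicted:
        return "oscillation"
    if n_cols > max_cols:
        return "column_saturation"
    return None


reason_done = should_stop(("DONE", ""), [], n_rows=10, n_cols=3)
reason_loop = should_stop(("FILTER", "rank<=3"), [("FILTER", "rank<=3")], n_rows=10, n_cols=3)
reason_none = should_stop(("SORT", "points"), [], n_rows=10, n_cols=3)
```

Returning the reason rather than a bare boolean makes it easy to log why a chain ended, which matters when comparing explicit DONE predictions against safety-net cutoffs.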

Adaptation Patterns for Different Table Types

Chain-of-Table's operation pool can be extended for domain-specific requirements without retraining. The framework's design allows practitioners to add custom operations (e.g., PIVOT for crosstabs, JOIN for multi-table reasoning) by implementing the operation interface and updating the Stage 1 prompt.
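Adding a custom operation amounts to two steps: implementing a table-to-table function and listing it in the Stage 1 menu. The sketch below shows a hypothetical PIVOT operation; the `register` helper, the `MENU` list, and the `pivot` signature are illustrative assumptions rather than part of the framework.

```python
from typing import Any, Callable, Dict, List

Table = List[Dict[str, Any]]

OPERATIONS: Dict[str, Callable[..., Table]] = {}
MENU: List[str] = []  # lines shown to the model in the Stage 1 prompt


def register(name: str, description: str, fn: Callable[..., Table]) -> None:
    """Make an operation executable and advertise it in the Stage 1 menu."""
    OPERATIONS[name] = fn
    MENU.append(f"{len(MENU) + 1}. {name} - {description}")


def pivot(table: Table, index: str, columns: str, values: str) -> Table:
    """One output row per `index` value, one output column per `columns` value."""
    out: Dict[Any, Dict[str, Any]] = {}
    for row in table:
        target = out.setdefault(row[index], {index: row[index]})
        target[str(row[columns])] = row[values]
    return list(out.values())


register("PIVOT", "Turn the values of one column into separate columns", pivot)

sales = [
    {"region": "EU", "quarter": "Q1", "revenue": 10},
    {"region": "EU", "quarter": "Q2", "revenue": 12},
    {"region": "US", "quarter": "Q1", "revenue": 7},
]
wide = OPERATIONS["PIVOT"](sales, index="region", columns="quarter", values="revenue")
```

A new operation also needs its own Stage 2 argument prompt and, ideally, a few demonstrations, since the model has no other way to learn when the operation is appropriate.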

| Table Type | Typical Operation Sequence | Notes |
| --- | --- | --- |
| Ranked lists (sports, leaderboards) | SELECT_ROWS → AGGREGATE → SORT | Pruning to top-N, then grouping by category, then reordering |
| Time series | FILTER (by date range) → AGGREGATE (by period) → SORT | Temporal slicing before aggregation; useful for finance, IoT data |
| Hierarchical/nested | ADD_COLUMN (flatten hierarchy) → FILTER → AGGREGATE | Flattening structure first, then standard ops; common in org charts |
| Wide tables (many columns) | DROP_COLUMN (remove irrelevant columns) → FILTER → AGGREGATE | Reducing the number of columns before filtering; important when most columns are irrelevant to the question |
| Cross-table joins | Not natively supported; requires pre-joining tables | Framework designed for single-table reasoning; multi-table support is future work |
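For the cross-table case, pre-joining can happen in ordinary code before the reasoning loop starts. The following is a minimal inner-join sketch over the list-of-dicts table format used in this post; `pre_join` and the sample tables are hypothetical and only illustrate the idea.

```python
from typing import Any, Dict, List

Table = List[Dict[str, Any]]


def pre_join(left: Table, right: Table, key: str) -> Table:
    """Inner join on `key`. On a column-name clash the left table's value wins."""
    lookup: Dict[Any, List[Dict[str, Any]]] = {}
    for row in right:
        lookup.setdefault(row[key], []).append(row)

    joined: Table = []
    for left_row in left:
        for right_row in lookup.get(left_row[key], []):
            joined.append({**right_row, **left_row})
    return joined


riders = [
    {"name": "Alice", "team_id": 1},
    {"name": "Bob", "team_id": 2},
    {"name": "Cara", "team_id": 9},  # no matching team, so dropped by the inner join
]
teams = [
    {"team_id": 1, "country": "USA"},
    {"team_id": 2, "country": "Canada"},
]
single_table = pre_join(riders, teams, "team_id")
```

The joined result can then be passed to the single-table pipeline unchanged.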

Implementation

This section provides two runnable implementations of the Chain-of-Table pipeline. Both call the OpenAI API and need an OPENAI_API_KEY set in the environment.

Example 1: Minimal Working Demo

This example demonstrates the three-stage pipeline in its simplest form, handling a concrete use case: decomposing composite cells and answering a question about the resulting data. The code is intentionally minimal to highlight the core concepts.

python
#!/usr/bin/env python3
"""
Minimal Chain-of-Table implementation demonstrating the core concept.
Requires: pip install openai
Environment: Set OPENAI_API_KEY
"""

import os
import json
from openai import OpenAI

# Initialize client from environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Example table: cyclist rankings with mixed-cell data
initial_table = [
    {"rank": 1, "name_country": "Alice Smith, USA", "points": 2500},
    {"rank": 2, "name_country": "Bob Johnson, Canada", "points": 2300},
    {"rank": 3, "name_country": "Carlos Ruiz, Spain", "points": 2100},
]

question = "Which country had the most cyclists finish in the top 3?"

# Stage 1: Ask LLM to identify the first transformation needed
prompt_stage1 = f"""You are analyzing a table to answer a question through iterative transformations.

Current table:
{json.dumps(initial_table, indent=2)}

Question: {question}

What is the first transformation needed? Choose from:
1. ADD_COLUMN(name, source_column, extraction_rule)
2. SELECT_ROWS(indices)
3. AGGREGATE(group_by_column, aggregation_func)
4. FILTER(column, condition)

Respond with only the operation in the format: OPERATION(param1, param2, ...)
Do not include explanations."""

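# The OpenAI SDK exposes chat models via client.chat.completions.create
response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=100,
    messages=[{"role": "user", "content": prompt_stage1}]
)

# The generated text lives under choices[0].message.content
operation = (response.choices[0].message.content or "").strip()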
print(f"Stage 1 - Identified operation: {operation}")

# Stage 2: Execute the transformation. This demo hard-codes the comma split
# instead of parsing the arguments the model returned in Stage 1.
if operation.startswith("ADD_COLUMN"):
    for row in initial_table:
        # rsplit on the last ", " so names that contain a comma stay intact
        name, country = row["name_country"].rsplit(", ", 1)
        row["country"] = country
        row["name"] = name
    print("Stage 2 - Added 'country' column via extraction")

# Stage 3: Ask LLM if more transformations are needed
prompt_stage3 = f"""Current table after transformations:
{json.dumps(initial_table[:3], indent=2)}

Question: {question}

Is the table now ready to answer the question? 
If yes, answer the question directly.
If no, specify the next transformation needed.

Respond with either:
- ANSWER: [your answer]
- OPERATION: [next transformation]"""

response_final = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=200,
    messages=[{"role": "user", "content": prompt_stage3}]
)

final_text = (response_final.choices[0].message.content or "").strip()
print(f"Stage 3 - Final response:\n{final_text}")

Sample output (model responses vary between runs):

▶ OUTPUT
Stage 1 - Identified operation: ADD_COLUMN(name, name_country, "split by comma and take the left side")
Stage 2 - Added 'country' column via extraction
Stage 3 - Final response:
ANSWER: All three countries (USA, Canada, Spain) had one cyclist finish in the top 3. Each country is represented equally.

What it demonstrates: This minimal example shows the three-stage pipeline: (1) the LLM selects the next tabular operation from a fixed menu, (2) the operation executes deterministically, (3) the LLM determines whether the table is ready for answering. It handles the simplest case: splitting a composite cell ("Alice Smith, USA") into separate name and country columns. Note that the prompt here is zero-shot, and Stage 2 is hard-coded to the comma split rather than driven by the arguments the model returned.

Customization guide: To adapt this example to your use case, replace initial_table with your own data and change question to your specific table-QA problem. SELECT_ROWS, AGGREGATE, and FILTER already appear in the prompt but have no Stage 2 branch; implement their logic there, and add further operations such as SORT by extending both the prompt and the Stage 2 section.
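One way to stop hard-coding Stage 2 is to parse the `OPERATION(param1, param2, ...)` string that the Stage 1 prompt asks for. The sketch below is a hypothetical helper, not part of the example above; it assumes arguments are comma-separated and that any argument containing a comma is wrapped in double quotes.

```python
import csv
import re
from typing import List, Tuple


def parse_operation(text: str) -> Tuple[str, List[str]]:
    """Split 'NAME(arg1, arg2, "quoted, arg")' into the name and a list of string arguments."""
    match = re.fullmatch(r"\s*([A-Z_]+)\s*\((.*)\)\s*", text)
    if match is None:
        raise ValueError(f"not an operation call: {text!r}")
    name, arg_text = match.groups()
    if not arg_text.strip():
        return name, []
    # csv.reader respects double quotes, so commas inside a quoted rule are preserved.
    fields = next(csv.reader([arg_text], skipinitialspace=True))
    return name, [field.strip() for field in fields]


name, args = parse_operation(
    'ADD_COLUMN(country, name_country, "split by comma, take the right side")'
)
```

With the name and arguments separated, Stage 2 can dispatch on `name` and pass `args` to the matching function, and a `ValueError` signals that the model's reply should be retried.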

Example 2: Complete Production-Ready Implementation with Full Reasoning Loop

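This is a fuller implementation featuring the complete reasoning loop, operation registry, state tracking, iterative refinement, and basic error handling. Treat it as a starting point for real applications: extraction rules and filter conditions are matched with simple keyword checks, so a deployment will need stricter parsing and validation.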

python
#!/usr/bin/env python3
"""
Complete Chain-of-Table implementation with the full reasoning loop.
Includes operation registry, state tracking, and iterative refinement.
Requires: pip install openai
"""

import os
import json
import re
from typing import Any, Dict, List, Tuple, Optional
from dataclasses import dataclass, field
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

@dataclass
class TableState:
    """Maintains mutable table state and operation history for chain reasoning."""
    
    data: List[Dict[str, Any]]
    operation_chain: List[str] = field(default_factory=list)
    
    def columns(self) -> List[str]:
        """Return column names from first row."""
        return list(self.data[0].keys()) if self.data else []
    
    def to_string(self, max_rows: int = 10) -> str:
        """Serialize table to readable string, truncating if necessary."""
        if not self.data:
            return "Empty table"
        
        shown = self.data[:max_rows]
        truncated = len(self.data) > max_rows
        
        return json.dumps(shown, indent=2) + (
            f"\n... ({len(self.data) - max_rows} more rows)" if truncated else ""
        )


class OperationRegistry:
    """Registry of available table operations with argument parsing."""
    
    @staticmethod
    def add_column(
        table_state: TableState,
        new_column_name: str,
        source_column: str,
        extraction_rule: str
    ) -> TableState:
        """
        Add a new column by extracting data from an existing column.
        extraction_rule: plain English description (e.g., "split by comma and take the right side")
        """
        new_data = []
        for row in table_state.data:
            new_row = row.copy()
            source_value = str(row.get(source_column, ""))
            
            # Simple extraction rules
            if "split by comma" in extraction_rule.lower() and "right" in extraction_rule.lower():
                parts = source_value.split(", ")
                new_row[new_column_name] = parts[-1] if parts else source_value
            elif "split by comma" in extraction_rule.lower() and "left" in extraction_rule.lower():
                parts = source_value.split(", ")
                new_row[new_column_name] = parts[0] if parts else source_value
            elif "split by space" in extraction_rule.lower() and "first" in extraction_rule.lower():
                parts = source_value.split()
                new_row[new_column_name] = parts[0] if parts else source_value
            else:
                # Fallback: use source value as-is
                new_row[new_column_name] = source_value
            
            new_data.append(new_row)
        
        new_state = TableState(data=new_data)
        new_state.operation_chain = table_state.operation_chain + [
            f"ADD_COLUMN(name={new_column_name}, source={source_column})"
        ]
        return new_state
    
    @staticmethod
    def select_rows(
        table_state: TableState,
        indices: List[int]
    ) -> TableState:
        """Retain only rows at specified indices (0-indexed)."""
        # Guard both bounds: a negative index would otherwise wrap around to the end
        selected = [table_state.data[i] for i in indices if 0 <= i < len(table_state.data)]
        new_state = TableState(data=selected)
        new_state.operation_chain = table_state.operation_chain + [
            f"SELECT_ROWS(indices={indices})"
        ]
        return new_state
    
    @staticmethod
    def filter_rows(
        table_state: TableState,
        column: str,
        condition: str
    ) -> TableState:
        """
        Filter rows where column satisfies a condition.
        condition: plain English (e.g., "rank <= 3", "country == USA")
        """
        filtered = []
        for row in table_state.data:
            column_value = row.get(column)
            
            # Parse simple numeric conditions
            if "<=" in condition:
                parts = condition.split("<=")
                threshold = float(parts[1].strip())
                try:
                    if float(column_value) <= threshold:
                        filtered.append(row)
                except (ValueError, TypeError):
                    pass
            elif ">=" in condition:
                parts = condition.split(">=")
                threshold = float(parts[1].strip())
                try:
                    if float(column_value) >= threshold:
                        filtered.append(row)
                except (ValueError, TypeError):
                    pass
            elif "==" in condition:
                parts = condition.split("==")
                target = parts[1].strip().strip('"\'')
                if str(column_value).strip() == target:
                    filtered.append(row)
            elif "!=" in condition:
                parts = condition.split("!=")
                target = parts[1].strip().strip('"\'')
                if str(column_value).strip() != target:
                    filtered.append(row)
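            elif "<" in condition or ">" in condition:
                # Strict comparisons; reached only after <= and >= were ruled out above
                op = "<" if "<" in condition else ">"
                try:
                    threshold = float(condition.split(op)[1].strip())
                    value = float(column_value)
                    if (value < threshold) if op == "<" else (value > threshold):
                        filtered.append(row)
                except (ValueError, TypeError):
                    pass
            else:
                # Fallback: unrecognized condition, keep all rows
                filtered.append(row)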
        
        new_state = TableState(data=filtered)
        new_state.operation_chain = table_state.operation_chain + [
            f"FILTER(column={column}, condition={condition})"
        ]
        return new_state
    
    @staticmethod
    def aggregate(
        table_state: TableState,
        group_by_column: str,
        agg_func: str
    ) -> TableState:
        """
        Group by a column and apply an aggregation function.
        agg_func: "count", "sum", "avg", "max", "min"
        """
        groups: Dict[str, List[Dict]] = {}
        
        for row in table_state.data:
            key = str(row.get(group_by_column, "Unknown"))
            if key not in groups:
                groups[key] = []
            groups[key].append(row)
        
        aggregated = []
        for group_key, group_rows in groups.items():
            agg_row = {group_by_column: group_key}
            
            if agg_func.lower() == "count":
                agg_row["count"] = len(group_rows)
            elif agg_func.lower() == "sum":
                # Sum all numeric columns except group_by
                for col in group_rows[0].keys():
                    if col != group_by_column:
                        try:
                            agg_row[col] = sum(float(r.get(col, 0)) for r in group_rows)
                        except (ValueError, TypeError):
                            pass
            elif agg_func.lower() == "avg":
                # Average numeric columns
                for col in group_rows[0].keys():
                    if col != group_by_column:
                        try:
                            values = [float(r.get(col, 0)) for r in group_rows]
                            agg_row[col] = sum(values) / len(values) if values else 0
                        except (ValueError, TypeError):
                            pass
            elif agg_func.lower() == "max":
                for col in group_rows[0].keys():
                    if col != group_by_column:
                        try:
                            values = [float(r.get(col, 0)) for r in group_rows]
                            agg_row[col] = max(values) if values else 0
                        except (ValueError, TypeError):
                            pass
            elif agg_func.lower() == "min":
                for col in group_rows[0].keys():
                    if col != group_by_column:
                        try:
                            values = [float(r.get(col, 0)) for r in group_rows]
                            agg_row[col] = min(values) if values else 0
                        except (ValueError, TypeError):
                            pass
            
            aggregated.append(agg_row)
        
        new_state = TableState(data=aggregated)
        new_state.operation_chain = table_state.operation_chain + [
            f"AGGREGATE(group_by={group_by_column}, func={agg_func})"
        ]
        return new_state
    
    @staticmethod
    def sort_rows(
        table_state: TableState,
        column: str,
        order: str = "asc"
    ) -> TableState:
        """Sort table by a column (ascending or descending)."""
        reverse = order.lower() == "desc"
        try:
            # Try numeric sort first
            sorted_data = sorted(
                table_state.data,
                key=lambda row: float(row.get(column, 0)),
                reverse=reverse
            )
        except (ValueError, TypeError):
            # Fall back to string sort
            sorted_data = sorted(
                table_state.data,
                key=lambda row: str(row.get(column, "")),
                reverse=reverse
            )
        
        new_state = TableState(data=sorted_data)
        new_state.operation_chain = table_state.operation_chain + [
            f"SORT(column={column}, order={order})"
        ]
        return new_state
    
    @staticmethod
    def drop_column(
        table_state: TableState,
        column: str
    ) -> TableState:
        """Remove a column from the table."""
        new_data = []
        for row in table_state.data:
            new_row = {k: v for k, v in row.items() if k != column}
            new_data.append(new_row)
        
        new_state = TableState(data=new_data)
        new_state.operation_chain = table_state.operation_chain + [
            f"DROP_COLUMN({column})"
        ]
        return new_state


class ChainOfTableReasoner:
    """Main reasoning engine that orchestrates the three-stage pipeline."""
    
    def __init__(self, model: str = "gpt-4o-mini", max_iterations: int = 10):
        self.model = model
        self.max_iterations = max_iterations
        self.operations = OperationRegistry()
    
    def reason(self, initial_table: List[Dict], question: str) -> Tuple[str, List[str]]:
        """
        Execute the Chain-of-Table reasoning pipeline.
        Returns: (final_answer, operation_chain)
        """
        state = TableState(data=initial_table)
        
        for iteration in range(self.max_iterations):
            print(f"\n=== Iteration {iteration + 1} ===")
            print(f"Current table rows: {len(state.data)}, columns: {state.columns()}")
            
            # Stage 1: Select the next operation
            operation_name = self._stage1_select_operation(state, question)
            
            if operation_name == "DONE":
                print("Chain complete, moving to final answer extraction.")
                break
            
            # Stage 2: Generate arguments for the operation
            operation_args = self._stage2_generate_args(state, question, operation_name)
            
            if not operation_args:
                print(f"Failed to parse arguments for {operation_name}, terminating.")
                break
            
            print(f"Selected: {operation_name} with args {operation_args}")
            
            # Execute operation (deterministically)
            try:
                state = self._execute_operation(state, operation_name, operation_args)
                print(f"Table after operation: {len(state.data)} rows")
                
                # Check for termination conditions
                if len(state.data) == 0:
                    print("Table is empty, terminating.")
                    break
            except Exception as e:
                print(f"Error executing operation: {e}")
                break
        
        # Stage 3b: Final answer extraction
        final_answer = self._stage3_extract_answer(state, question)
        
        return final_answer, state.operation_chain
    
    def _stage1_select_operation(self, state: TableState, question: str) -> str:
        """Stage 1: LLM selects the next operation."""
        prompt = f"""You are reasoning over a table to answer a question using iterative transformations.

Current table ({len(state.data)} rows, columns: {', '.join(state.columns())}):
{state.to_string(max_rows=5)}

Original question: {question}

Operations performed so far:
{json.dumps(state.operation_chain, indent=2) if state.operation_chain else "None"}

What is the NEXT operation needed? Choose from:
1. ADD_COLUMN - Create a new column by extracting from an existing column
2. SELECT_ROWS - Keep only specific rows by index
3. FILTER - Keep rows matching a condition
4. AGGREGATE - Group and apply an aggregation function
5. SORT - Sort by a column
6. DROP_COLUMN - Remove an irrelevant column
7. DONE - The table is now ready to answer the question

Respond with ONLY the operation name (e.g., "ADD_COLUMN" or "DONE"). No explanation."""
        
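        response = client.chat.completions.create(
            model=self.model,
            max_tokens=50,
            messages=[{"role": "user", "content": prompt}]
        )
        
        # The generated text lives under choices[0].message.content
        content = response.choices[0].message.content or ""
        # Drop stray quotes or a trailing period before normalizing case
        operation_name = content.strip().strip("\"'.").upper()
        return operation_name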
    
    def _stage2_generate_args(
        self,
        state: TableState,
        question: str,
        operation_name: str
    ) -> Dict[str, Any]:
        """Stage 2: LLM generates concrete arguments for the operation."""
        if operation_name == "ADD_COLUMN":
            prompt = f"""Given this table:
{state.to_string(max_rows=5)}

And the question: {question}

You need to execute: ADD_COLUMN

Provide:
1. new_column_name: The name of the column to create
2. source_column: Which column to extract from
3. extraction_rule: How to extract (e.g., "split by comma and take the right side")

Respond in this exact format:
new_column_name: [name]
source_column: [column]
extraction_rule: [rule]"""
        
        elif operation_name == "SELECT_ROWS":
            prompt = f"""Given this table:
{state.to_string(max_rows=5)}

And the question: {question}

You need to execute: SELECT_ROWS

Which row indices (0-indexed) are relevant? List them as a comma-separated list.
Respond with only the indices, e.g.: 0,1,2"""
        
        elif operation_name == "FILTER":
            prompt = f"""Given this table:
{state.to_string(max_rows=5)}

And the question: {question}

You need to execute: FILTER

Provide:
1. column: The column to filter on
2. condition: The condition (e.g., "rank <= 3")

Respond in this exact format:
column: [column_name]
condition: [condition]"""
        
        elif operation_name == "AGGREGATE":
            prompt = f"""Given this table:
{state.to_string(max_rows=5)}

And the question: {question}

You need to execute: AGGREGATE

Provide:
1. group_by_column: Which column to group by
2. agg_func: The aggregation function (count, sum, avg, max, or min)

Respond in this exact format:
group_by_column: [column]
agg_func: [function]"""
        
        elif operation_name == "SORT":
            prompt = f"""Given this table:
{state.to_string(max_rows=5)}

And the question: {question}

You need to execute: SORT

Provide:
1. column: Which column to sort by
2. order: asc or desc

Respond in this exact format:
column: [column_name]
order: [asc/desc]"""
        
        elif operation_name == "DROP_COLUMN":
            prompt = f"""Given this table:
{state.to_string(max_rows=5)}

And the question: {question}

You need to execute: DROP_COLUMN

Which column is irrelevant to answering the question?
Respond with only the column name."""
        
        else:
            return {}
        
        response = client.chat.completions.create(
            model=self.model,
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}]
        )
        
        response_text = (response.choices[0].message.content or "").strip()
        return self._parse_arguments(operation_name, response_text)
    
    def _parse_arguments(self, operation_name: str, response_text: str) -> Dict[str, Any]:
        """Parse LLM response into operation arguments."""
        args = {}
        
        try:
            if operation_name == "ADD_COLUMN":
                for line in response_text.split("\n"):
                    if "new_column_name:" in line.lower():
                        args["new_column_name"] = line.split(":", 1)[1].strip()
                    elif "source_column:" in line.lower():
                        args["source_column"] = line.split(":", 1)[1].strip()
                    elif "extraction_rule:" in line.lower():
                        args["extraction_rule"] = line.split(":", 1)[1].strip()
                return args if all(k in args for k in ["new_column_name", "source_column", "extraction_rule"]) else {}
            
            elif operation_name == "SELECT_ROWS":
                indices = [int(x.strip()) for x in response_text.split(",")]
                return {"indices": indices}
            
            elif operation_name == "FILTER":
                for line in response_text.split("\n"):
                    if "column:" in line.lower():
                        args["column"] = line.split(":", 1)[1].strip()
                    elif "condition:" in line.lower():
                        args["condition"] = line.split(":", 1)[1].strip()
                return args if all(k in args for k in ["column", "condition"]) else {}
            
            elif operation_name == "AGGREGATE":
                for line in response_text.split("\n"):
                    if "group_by_column:" in line.lower():
                        args["group_by_column"] = line.split(":", 1)[1].strip()
                    elif "agg_func:" in line.lower():
                        args["agg_func"] = line.split(":", 1)[1].strip()
                return args if all(k in args for k in ["group_by_column", "agg_func"]) else {}
            
            elif operation_name == "SORT":
                for line in response_text.split("\n"):
                    if "column:" in line.lower():
                        args["column"] = line.split(":", 1)[1].strip()
                    elif "order:" in line.lower():
                        args["order"] = line.split(":", 1)[1].strip()
                return args if all(k in args for k in ["column", "order"]) else {}
            
            elif operation_name == "DROP_COLUMN":
                return {"column": response_text.strip()}
        
        except (ValueError, IndexError):
            return {}
        
        return args
    
    def _execute_operation(
        self,
        state: TableState,
        operation_name: str,
        args: Dict[str, Any]
    ) -> TableState:
        """Execute a table operation and return the new state."""
        if operation_name == "ADD_COLUMN":
            return self.operations.add_column(
                state,
                args["new_column_name"],
                args["source_column"],
                args["extraction_rule"]
            )
        elif operation_name == "SELECT_ROWS":
            return self.operations.select_rows(state, args["indices"])
        elif operation_name == "FILTER":
            return self.operations.filter_rows(state, args["column"], args["condition"])
        elif operation_name == "AGGREGATE":
            return self.operations.aggregate(state, args["group_by_column"], args["agg_func"])
        elif operation_name == "SORT":
            return self.operations.sort_rows(state, args["column"], args.get("order", "asc"))
        elif operation_name == "DROP_COLUMN":
            return self.operations.drop_column(state