Initial commit: add all skills files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:52:49 +08:00
commit 6487becf60
396 changed files with 108871 additions and 0 deletions

minimax-xlsx/SKILL.md Normal file

@@ -0,0 +1,138 @@
---
name: minimax-xlsx
description: Open, create, read, analyze, edit, or validate Excel/spreadsheet files (.xlsx, .xlsm, .csv, .tsv). Use when the user asks to create, build, modify, analyze, read, validate, or format any Excel spreadsheet, financial model, pivot table, or tabular data file. Covers: creating new xlsx from scratch, reading and analyzing existing files, editing existing xlsx with zero format loss, formula recalculation and validation, and applying professional financial formatting standards. Triggers on 'spreadsheet', 'Excel', '.xlsx', '.csv', 'pivot table', 'financial model', 'formula', or any request to produce tabular data in Excel format.
license: MIT
metadata:
  version: "1.0"
  category: productivity
  sources:
    - ECMA-376 Office Open XML File Formats
    - Microsoft Open XML SDK documentation
---
# MiniMax XLSX Skill
Handle the request directly. Do NOT spawn sub-agents. Always write the output file the user requests.
## Task Routing
| Task | Method | Guide |
|------|--------|-------|
| **READ** — analyze existing data | `xlsx_reader.py` + pandas | `references/read-analyze.md` |
| **CREATE** — new xlsx from scratch | XML template | `references/create.md` + `references/format.md` |
| **EDIT** — modify existing xlsx | XML unpack→edit→pack | `references/edit.md` (+ `format.md` if styling needed) |
| **FIX** — repair broken formulas in existing xlsx | XML unpack→fix `<f>` nodes→pack | `references/fix.md` |
| **VALIDATE** — check formulas | `formula_check.py` | `references/validate.md` |
## READ — Analyze data (read `references/read-analyze.md` first)
Start with `xlsx_reader.py` for structure discovery, then pandas for custom analysis. Never modify the source file.
**Formatting rule**: When the user specifies decimal places (e.g. "2 decimal places"), apply that format to ALL numeric values — use `f'{v:.2f}'` on every number. Never output `12875` when `12875.00` is required.
**Aggregation rule**: Always compute sums/means/counts directly from the DataFrame column — e.g. `df['Revenue'].sum()`. Never re-derive column values before aggregation.
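
A minimal plain-Python sketch of the formatting rule. The helper name `format_values` is hypothetical and not part of the skill's scripts; it only illustrates applying one fixed-decimal format to every number before output.

```python
def format_values(values, places=2):
    """Render every number with the same fixed number of decimal places."""
    return [f"{v:.{places}f}" for v in values]


# Integers are padded too, so 12875 is rendered as 12875.00.
print(format_values([12875, 3.14159, 0.5]))
```

The same format string can be applied to the result of an aggregation such as `df['Revenue'].sum()` before it is reported.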
## CREATE — XML template (read `references/create.md` + `references/format.md`)
Copy `templates/minimal_xlsx/` → edit XML directly → pack with `xlsx_pack.py`. Every derived value MUST be an Excel formula (`<f>SUM(B2:B9)</f>`), never a hardcoded number. Apply font colors per `format.md`.
## EDIT — XML direct-edit (read `references/edit.md` first)
**CRITICAL — EDIT INTEGRITY RULES:**
1. **NEVER create a new `Workbook()`** for edit tasks. Always load the original file.
2. The output MUST contain the **same sheets** as the input (same names, same data).
3. Only modify the specific cells the task asks for — everything else must be untouched.
4. **After saving output.xlsx, verify it**: open with `xlsx_reader.py` or `pandas` and confirm the original sheet names and a sample of original data are present. If verification fails, you wrote the wrong file — fix it before delivering.
Never use openpyxl round-trip on existing files (corrupts VBA, pivots, sparklines). Instead: unpack → use helper scripts → repack.
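
The sheet-name part of the verification in rule 4 could be sketched as below. This is an illustration using only the standard library, not the behavior of `xlsx_reader.py`; the function names `sheet_names` and `missing_sheets` are assumptions made for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def sheet_names(xlsx_path):
    """Return the sheet names listed in xl/workbook.xml, in workbook order."""
    with zipfile.ZipFile(xlsx_path) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    return [s.get("name") for s in root.iter(f"{{{MAIN_NS}}}sheet")]


def missing_sheets(input_path, output_path):
    """Sheets present in the input but absent from the output (should be empty)."""
    present = set(sheet_names(output_path))
    return [name for name in sheet_names(input_path) if name not in present]
```

A non-empty result from `missing_sheets("input.xlsx", "output.xlsx")` suggests the wrong file was written. Sampling original cell values still needs a separate check.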
**"Fill cells" / "Add formulas to existing cells" = EDIT task.** If the input file already exists and you are told to fill, update, or add formulas to specific cells, you MUST use the XML edit path. Never create a new `Workbook()`. Example — fill B3 with a cross-sheet SUM formula:
```bash
python3 SKILL_DIR/scripts/xlsx_unpack.py input.xlsx /tmp/xlsx_work/
# Find the target sheet's XML via xl/workbook.xml → xl/_rels/workbook.xml.rels
# Then use the Edit tool to add <f> inside the target <c> element:
# <c r="B3"><f>SUM('Sales Data'!D2:D13)</f><v></v></c>
python3 SKILL_DIR/scripts/xlsx_pack.py /tmp/xlsx_work/ output.xlsx
```
**Add a column** (formulas, numfmt, styles auto-copied from adjacent column):
```bash
python3 SKILL_DIR/scripts/xlsx_unpack.py input.xlsx /tmp/xlsx_work/
python3 SKILL_DIR/scripts/xlsx_add_column.py /tmp/xlsx_work/ --col G \
--sheet "Sheet1" --header "% of Total" \
--formula '=F{row}/$F$10' --formula-rows 2:9 \
--total-row 10 --total-formula '=SUM(G2:G9)' --numfmt '0.0%' \
--border-row 10 --border-style medium
python3 SKILL_DIR/scripts/xlsx_pack.py /tmp/xlsx_work/ output.xlsx
```
The `--border-row` flag applies a top border to ALL cells in that row (not just the new column). Use it when the task requires accounting-style borders on total rows.
**Insert a row** (shifts existing rows, updates SUM formulas, fixes circular refs):
```bash
python3 SKILL_DIR/scripts/xlsx_unpack.py input.xlsx /tmp/xlsx_work/
# IMPORTANT: Find the correct --at row by searching for the label text
# in the worksheet XML, NOT by using the row number from the prompt.
# The prompt may say "row 5 (Office Rent)" but Office Rent might actually
# be at row 4. Always locate the row by its text label first.
python3 SKILL_DIR/scripts/xlsx_insert_row.py /tmp/xlsx_work/ --at 5 \
--sheet "Budget FY2025" --text A=Utilities \
--values B=3000 C=3000 D=3500 E=3500 \
--formula 'F=SUM(B{row}:E{row})' --copy-style-from 4
python3 SKILL_DIR/scripts/xlsx_pack.py /tmp/xlsx_work/ output.xlsx
```
**Row lookup rule**: When the task says "after row N (Label)", always find the row by searching for "Label" in the worksheet XML (`grep -n "Label" /tmp/xlsx_work/xl/worksheets/sheet*.xml` or check sharedStrings.xml). Use the actual row number + 1 for `--at`. Do NOT call `xlsx_shift_rows.py` separately — `xlsx_insert_row.py` calls it internally.
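
As an alternative to `grep`, the label lookup can be sketched in Python. This is a rough illustration: `find_label_row` is a hypothetical helper, and it assumes that labels are stored as shared strings (`t="s"`) and that every `<row>` carries an `r` attribute.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def find_label_row(shared_strings_xml, sheet_xml, label):
    """Return the row number of the first shared-string cell equal to label, else None."""
    sst = ET.fromstring(shared_strings_xml)
    # Join all <t> runs so rich-text strings are matched as plain text.
    strings = ["".join(t.text or "" for t in si.iter(f"{{{NS['m']}}}t"))
               for si in sst.findall("m:si", NS)]
    sheet = ET.fromstring(sheet_xml)
    for row in sheet.iterfind("m:sheetData/m:row", NS):
        for cell in row.findall("m:c", NS):
            if cell.get("t") != "s":
                continue  # only shared-string cells can hold the label here
            v = cell.find("m:v", NS)
            if v is not None and strings[int(v.text)] == label:
                return int(row.get("r"))
    return None
```

If the function returns 4 for "Office Rent", inserting after that row means passing `--at 5`.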
**Apply row-wide borders** (e.g. accounting line on a TOTAL row):
After running helper scripts, apply borders to ALL cells in the target row, not just newly added cells. In `xl/styles.xml`, append a new `<border>` with the desired style, then append a new `<xf>` in `<cellXfs>` that clones each cell's existing `<xf>` but sets the new `borderId`. Apply the new style index to every `<c>` in the row via the `s` attribute:
```xml
<!-- In xl/styles.xml, append to <borders>: -->
<border>
<left/><right/><top style="medium"/><bottom/><diagonal/>
</border>
<!-- Then append to <cellXfs> an xf clone with the new borderId for each existing style -->
```
**Key rule**: When a task says "add a border to row N", iterate over ALL cells A through the last column, not just newly added cells.
**Manual XML edit** (for anything the helper scripts don't cover):
```bash
python3 SKILL_DIR/scripts/xlsx_unpack.py input.xlsx /tmp/xlsx_work/
# ... edit XML with the Edit tool ...
python3 SKILL_DIR/scripts/xlsx_pack.py /tmp/xlsx_work/ output.xlsx
```
## FIX — Repair broken formulas (read `references/fix.md` first)
This is an EDIT task. Unpack → fix broken `<f>` nodes → pack. Preserve all original sheets and data.
## VALIDATE — Check formulas (read `references/validate.md` first)
Run `formula_check.py` for static validation. Use `libreoffice_recalc.py` for dynamic recalculation when available.
## Financial Color Standard
| Cell Role | Font Color | Hex Code |
|-----------|-----------|----------|
| Hard-coded input / assumption | Blue | `0000FF` |
| Formula / computed result | Black | `000000` |
| Cross-sheet reference formula | Green | `00B050` |
## Key Rules
1. **Formula-First**: Every calculated cell MUST use an Excel formula, not a hardcoded number
2. **CREATE → XML template**: Copy minimal template, edit XML directly, pack with `xlsx_pack.py`
3. **EDIT → XML**: Never openpyxl round-trip. Use unpack/edit/pack scripts
4. **Always produce the output file** — this is the #1 priority
5. **Validate before delivery**: `formula_check.py` exit code 0 = safe
## Utility Scripts
```bash
python3 SKILL_DIR/scripts/xlsx_reader.py input.xlsx # structure discovery
python3 SKILL_DIR/scripts/formula_check.py file.xlsx --json # formula validation
python3 SKILL_DIR/scripts/formula_check.py file.xlsx --report # standardized report
python3 SKILL_DIR/scripts/xlsx_unpack.py in.xlsx /tmp/work/ # unpack for XML editing
python3 SKILL_DIR/scripts/xlsx_pack.py /tmp/work/ out.xlsx # repack after editing
python3 SKILL_DIR/scripts/xlsx_shift_rows.py /tmp/work/ insert 5 1 # shift rows for insertion
python3 SKILL_DIR/scripts/xlsx_add_column.py /tmp/work/ --col G ... # add column with formulas
python3 SKILL_DIR/scripts/xlsx_insert_row.py /tmp/work/ --at 6 ... # insert row with data
```


@@ -0,0 +1,691 @@
# Build New xlsx from Scratch
Create new, production-quality xlsx files using the XML approach. NEVER use openpyxl
for writing. NEVER hardcode Python-computed values — every derived number must be a
live Excel formula.
---
## When to Use This Path
Use this document when the user wants:
- A brand-new Excel file that does not yet exist
- A generated report, financial model, or data table
- Any "create / build / generate / make" request
If the user provides an existing file to modify, switch to `edit.md` instead.
---
## The Non-Negotiable Rules
Before touching any file, internalize these four rules:
1. **Formula-First**: Every calculated value (`SUM`, growth rate, ratio, subtotal, etc.)
MUST be written as `<f>SUM(B2:B9)</f>`, not as a hardcoded `<v>5000</v>`. Hardcoded
numbers go stale when source data changes. Only raw inputs and assumption parameters
may be hardcoded values.
2. **No openpyxl for writing**: The entire file is built by editing XML directly. Python
is only allowed for reading/analysis (`pandas.read_excel()`) and for running helper
scripts (`xlsx_pack.py`, `formula_check.py`).
3. **Style encodes meaning**: Blue font = user input/assumption. Black font = formula
result. Green font = cross-sheet reference. See `format.md` for the full color system
and style index table.
4. **Validate before delivery**: Run `formula_check.py` and fix all errors before
handing the file to the user.
---
## Complete Creation Workflow
### Step 1 — Plan Before Writing
Define the full structure on paper before touching any XML:
- **Sheets**: names, order, purpose (e.g., Assumptions / Model / Summary)
- **Layout per sheet**: which rows are headers, inputs, formulas, totals
- **String inventory**: collect all text labels you will need in sharedStrings
- **Style choices**: what number format each column needs (currency, %, integer, year)
- **Cross-sheet links**: which sheets pull data from other sheets
This planning step prevents the costly cycle of adding strings to sharedStrings
mid-way and recomputing all indices.
---
### Step 2 — Copy Minimal Template
```bash
cp -r SKILL_DIR/templates/minimal_xlsx/ /tmp/xlsx_work/
```
The template gives you a complete, valid 7-file xlsx skeleton:
```
/tmp/xlsx_work/
├── [Content_Types].xml        ← MIME type registry
├── _rels/
│   └── .rels                  ← root relationship (points to workbook.xml)
└── xl/
    ├── workbook.xml           ← sheet list and calc settings
    ├── styles.xml             ← 13 pre-built financial style slots
    ├── sharedStrings.xml      ← text string table (starts empty)
    ├── _rels/
    │   └── workbook.xml.rels  ← maps rId → file paths
    └── worksheets/
        └── sheet1.xml         ← one empty sheet
```
After copying, rename sheets and add content. Do not create files from scratch —
always start from the template.
---
### Step 3 — Configure Sheet Structure
#### Single-Sheet Workbook
The template already has one sheet named "Sheet1". Just change the `name` attribute
in `xl/workbook.xml`:
```xml
<sheets>
<sheet name="Revenue Model" sheetId="1" r:id="rId1"/>
</sheets>
```
No other files need to change for a single-sheet workbook.
#### Multi-Sheet Workbook
Four files must be kept in sync. Work through them in this order:
**IMPORTANT — rId collision rule**: In the template's `workbook.xml.rels`, the IDs
`rId1`, `rId2`, and `rId3` are already taken:
- `rId1` → `worksheets/sheet1.xml`
- `rId2` → `styles.xml`
- `rId3` → `sharedStrings.xml`
New worksheet entries MUST start at `rId4` and count upward.
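
One way to avoid collisions is to compute the next free ID from the `.rels` file itself rather than hard-coding it. The sketch below is illustrative only; `next_rid` is a hypothetical helper that assumes all existing IDs follow the `rIdN` pattern.

```python
import re
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def next_rid(rels_xml):
    """Return 'rIdN' where N is one above the highest numeric rId already in use."""
    root = ET.fromstring(rels_xml)
    used = []
    for rel in root.iter(f"{{{REL_NS}}}Relationship"):
        match = re.fullmatch(r"rId(\d+)", rel.get("Id", ""))
        if match:
            used.append(int(match.group(1)))
    return f"rId{max(used, default=0) + 1}"
```

For the unmodified template, which already uses `rId1` to `rId3`, this yields `rId4`.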
**File 1 of 4 — `xl/workbook.xml`** (sheet list):
```xml
<sheets>
<sheet name="Assumptions" sheetId="1" r:id="rId1"/>
<sheet name="Model" sheetId="2" r:id="rId4"/>
<sheet name="Summary" sheetId="3" r:id="rId5"/>
</sheets>
```
Special characters in sheet names:
- `&` → `&amp;` in XML: `<sheet name="P&amp;L" .../>`
- Max 31 characters
- Forbidden: `/ \ ? * [ ] :`
- Sheet names with spaces need single quotes in formula references: `'Q1 Data'!B5`
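
The naming rules above can be checked mechanically before any XML is written. The following is a sketch under assumptions: `check_sheet_name` and `formula_ref` are hypothetical helpers, and `formula_ref` quotes conservatively (anything other than letters and underscores gets quoted, which Excel also accepts).

```python
import re
from xml.sax.saxutils import quoteattr

FORBIDDEN = set("/\\?*[]:")


def check_sheet_name(name):
    """Return a list of rule violations for a proposed sheet name (empty list = OK)."""
    problems = []
    if not name:
        problems.append("name is empty")
    if len(name) > 31:
        problems.append("longer than 31 characters")
    bad = sorted(FORBIDDEN & set(name))
    if bad:
        problems.append("forbidden characters: " + " ".join(bad))
    return problems


def formula_ref(sheet, cell):
    """Build a cross-sheet reference, quoting the sheet name unless it is letters/underscores only."""
    if re.fullmatch(r"[A-Za-z_]+", sheet):
        return f"{sheet}!{cell}"
    return "'" + sheet.replace("'", "''") + "'!" + cell


# quoteattr() escapes & and adds the surrounding quotes for the name attribute.
print("<sheet name=" + quoteattr("P&L") + ' sheetId="1" r:id="rId1"/>')
print(formula_ref("Q1 Data", "B5"))
```

The reference text returned by `formula_ref` still needs XML escaping when it is placed inside `<f>`.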
**File 2 of 4 — `xl/_rels/workbook.xml.rels`** (ID → file mapping):
```xml
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet"
Target="worksheets/sheet1.xml"/>
<Relationship Id="rId2"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
Target="styles.xml"/>
<Relationship Id="rId3"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/sharedStrings"
Target="sharedStrings.xml"/>
<Relationship Id="rId4"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet"
Target="worksheets/sheet2.xml"/>
<Relationship Id="rId5"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet"
Target="worksheets/sheet3.xml"/>
</Relationships>
```
**File 3 of 4 — `[Content_Types].xml`** (MIME type declarations):
```xml
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/xl/workbook.xml"
ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>
<Override PartName="/xl/worksheets/sheet1.xml"
ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
<Override PartName="/xl/worksheets/sheet2.xml"
ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
<Override PartName="/xl/worksheets/sheet3.xml"
ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
<Override PartName="/xl/styles.xml"
ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.styles+xml"/>
<Override PartName="/xl/sharedStrings.xml"
ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sharedStrings+xml"/>
</Types>
```
**File 4 of 4 — Create new worksheet XML files**
Copy `sheet1.xml` to `sheet2.xml` and `sheet3.xml`, then clear the `<sheetData>` content:
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
<sheetViews>
<sheetView workbookViewId="0"/>
</sheetViews>
<sheetFormatPr defaultRowHeight="15" x14ac:dyDescent="0.25"
xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac"/>
<sheetData>
<!-- Data rows go here -->
</sheetData>
<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>
```
**Sync checklist** — every time you add a sheet, verify all four are consistent:
| Check | What to verify |
|-------|---------------|
| `workbook.xml` | New `<sheet name="..." sheetId="N" r:id="rIdX"/>` exists |
| `workbook.xml.rels` | New `<Relationship Id="rIdX" ... Target="worksheets/sheetN.xml"/>` exists |
| `[Content_Types].xml` | New `<Override PartName="/xl/worksheets/sheetN.xml" .../>` exists |
| Filesystem | `xl/worksheets/sheetN.xml` file actually exists |
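
The sync checklist can be automated against the unpacked working directory. This is a rough sketch, not a replacement for `xlsx_pack.py` or `formula_check.py`; `sync_problems` is a hypothetical helper and it assumes relationship targets are relative to `xl/`, as in the template.

```python
import posixpath
import xml.etree.ElementTree as ET
from pathlib import Path

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
CT = "http://schemas.openxmlformats.org/package/2006/content-types"


def sync_problems(work_dir):
    """List inconsistencies between workbook.xml, its .rels, [Content_Types].xml and disk."""
    work = Path(work_dir)
    wb = ET.parse(work / "xl" / "workbook.xml").getroot()
    rels = ET.parse(work / "xl" / "_rels" / "workbook.xml.rels").getroot()
    types = ET.parse(work / "[Content_Types].xml").getroot()
    targets = {r.get("Id"): r.get("Target")
               for r in rels.iter(f"{{{PKG_REL}}}Relationship")}
    overrides = {o.get("PartName") for o in types.iter(f"{{{CT}}}Override")}

    problems = []
    for sheet in wb.iter(f"{{{MAIN}}}sheet"):
        name, rid = sheet.get("name"), sheet.get(f"{{{DOC_REL}}}id")
        target = targets.get(rid)
        if target is None:
            problems.append(f"{name}: {rid} missing from workbook.xml.rels")
            continue
        part = posixpath.normpath(posixpath.join("xl", target))
        if "/" + part not in overrides:
            problems.append(f"{name}: no <Override> for /{part}")
        if not (work / part).is_file():
            problems.append(f"{name}: file {part} does not exist")
    return problems
```

An empty list means every sheet in `workbook.xml` resolved through all four sync points.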
---
### Step 4 — Populate sharedStrings
All text values (headers, row labels, category names, any string the user will read)
must be stored in `xl/sharedStrings.xml`. Cells reference them by 0-based index.
**Recommended workflow**: collect ALL text you need first, write the complete table once,
then fill in indices while writing worksheet XML. This avoids re-counting indices mid-way.
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
count="10" uniqueCount="10">
<si><t>Item</t></si> <!-- index 0 -->
<si><t>FY2023A</t></si> <!-- index 1 -->
<si><t>FY2024E</t></si> <!-- index 2 -->
<si><t>FY2025E</t></si> <!-- index 3 -->
<si><t>YoY Growth</t></si> <!-- index 4 -->
<si><t>Revenue</t></si> <!-- index 5 -->
<si><t>Cost of Goods Sold</t></si> <!-- index 6 -->
<si><t>Gross Profit</t></si> <!-- index 7 -->
<si><t>EBITDA</t></si> <!-- index 8 -->
<si><t>Net Income</t></si> <!-- index 9 -->
</sst>
```
**Attribute rules**:
- `uniqueCount` = number of `<si>` elements (unique strings in the table)
- `count` = total number of cell references to strings across the entire workbook
(if "Revenue" appears in 3 sheets, count is `uniqueCount + 2`)
- For new files where each string appears once, `count == uniqueCount`
- Both attributes MUST be accurate — wrong values trigger warnings in some Excel versions
**Special character escaping**:
```xml
<si><t>R&amp;D Expenses</t></si> <!-- & must be &amp; -->
<si><t>Revenue &lt; Target</t></si> <!-- < must be &lt; -->
<si><t xml:space="preserve"> (note) </t></si> <!-- preserve leading/trailing spaces -->
```
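
The attribute and escaping rules can be combined in a short generator. This is an illustrative sketch and not the implementation of any script shipped with the skill; `build_shared_strings` is a hypothetical name, and it assumes the caller passes one entry per string cell, duplicates included.

```python
from xml.sax.saxutils import escape

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def build_shared_strings(cell_strings):
    """Return (sharedStrings XML, {string: index}) for a list of per-cell strings."""
    unique = list(dict.fromkeys(cell_strings))  # de-duplicate, keep first-seen order
    items = []
    for s in unique:
        # Leading/trailing whitespace is only kept when xml:space="preserve" is set.
        keep = ' xml:space="preserve"' if s != s.strip() else ""
        items.append(f"  <si><t{keep}>{escape(s)}</t></si>")
    xml = "\n".join([
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>',
        f'<sst xmlns="{MAIN_NS}" count="{len(cell_strings)}" uniqueCount="{len(unique)}">',
        *items,
        "</sst>",
    ])
    return xml, {s: i for i, s in enumerate(unique)}
```

The returned index map gives the `<v>` value to write into each `t="s"` cell.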
**Helper script**: use `shared_strings_builder.py` to generate the complete
`sharedStrings.xml` from a plain list of strings:
```bash
python3 SKILL_DIR/scripts/shared_strings_builder.py \
"Item" "FY2024" "FY2025" "Revenue" "Gross Profit" \
> /tmp/xlsx_work/xl/sharedStrings.xml
```
Or from a file listing one string per line:
```bash
python3 SKILL_DIR/scripts/shared_strings_builder.py --file strings.txt \
> /tmp/xlsx_work/xl/sharedStrings.xml
```
---
### Step 5 — Write Worksheet Data
Edit each `xl/worksheets/sheetN.xml`. Replace the empty `<sheetData>` with rows
and cells.
#### Cell XML Anatomy
```
<c r="B5" t="s" s="4">
   ↑      ↑     ↑
   │      │     └── style index (from cellXfs in styles.xml)
   │      └──────── type
   └─────────────── address
  <v>3</v>
     ↑
     └── value (for t="s": sharedStrings index; for numbers: the number itself)
</c>
```
#### Data Type Reference
| Data | `t` attr | XML Example | Notes |
|------|---------|-------------|-------|
| Shared string (text) | `s` | `<c r="A1" t="s" s="4"><v>0</v></c>` | `<v>` = sharedStrings index |
| Number | omit | `<c r="B2" s="5"><v>1000000</v></c>` | default type, `t` omitted |
| Percentage (as decimal) | omit | `<c r="C2" s="7"><v>0.125</v></c>` | 12.5% stored as 0.125 |
| Boolean | `b` | `<c r="D1" t="b"><v>1</v></c>` | 1=TRUE, 0=FALSE |
| Formula | omit | `<c r="B4" s="2"><f>SUM(B2:B3)</f><v></v></c>` | `<v>` left empty |
| Cross-sheet formula | omit | `<c r="C1" s="3"><f>Assumptions!B2</f><v></v></c>` | use s=3 (green) |
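
When cells are generated programmatically, the type rules in the table map onto a small function. This is a sketch only; `cell_xml` and its parameters are hypothetical names chosen for this example.

```python
from xml.sax.saxutils import escape


def cell_xml(ref, value=None, *, style=None, formula=None, shared_string=None):
    """Emit one <c> element following the data type table above."""
    s_attr = f' s="{style}"' if style is not None else ""
    if formula is not None:
        # Formula cells omit t and leave <v> empty; the formula text is XML-escaped.
        return f'<c r="{ref}"{s_attr}><f>{escape(formula)}</f><v></v></c>'
    if shared_string is not None:
        return f'<c r="{ref}" t="s"{s_attr}><v>{shared_string}</v></c>'
    if isinstance(value, bool):  # checked before int, since bool is an int subclass
        return f'<c r="{ref}" t="b"{s_attr}><v>{int(value)}</v></c>'
    if isinstance(value, (int, float)):
        return f'<c r="{ref}"{s_attr}><v>{value}</v></c>'
    raise TypeError("text must be passed as a sharedStrings index via shared_string=")


print(cell_xml("A1", shared_string=0, style=4))
print(cell_xml("B4", formula="SUM(B2:B3)", style=2))
```

Text is deliberately rejected as a raw value so that strings always go through `sharedStrings.xml`.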
#### A Full Sheet Data Example
```xml
<cols>
<col min="1" max="1" width="26" customWidth="1"/> <!-- A: label column -->
<col min="2" max="5" width="14" customWidth="1"/> <!-- B-E: data columns -->
</cols>
<sheetData>
<!-- Row 1: headers (style 4 = bold header) -->
<row r="1" ht="18" customHeight="1">
<c r="A1" t="s" s="4"><v>0</v></c> <!-- "Item" -->
<c r="B1" t="s" s="4"><v>1</v></c> <!-- "FY2023A" -->
<c r="C1" t="s" s="4"><v>2</v></c> <!-- "FY2024E" -->
<c r="D1" t="s" s="4"><v>3</v></c> <!-- "FY2025E" -->
<c r="E1" t="s" s="4"><v>4</v></c> <!-- "YoY Growth" -->
</row>
<!-- Row 2: Revenue — actual value (input) + formula (computed) -->
<row r="2">
<c r="A2" t="s" s="1"><v>5</v></c> <!-- "Revenue", blue input label -->
<c r="B2" s="5"><v>85000000</v></c> <!-- FY2023A actual: $85M, currency input -->
<c r="C2" s="6"><f>B2*(1+Assumptions!C3)</f><v></v></c> <!-- formula, currency -->
<c r="D2" s="6"><f>C2*(1+Assumptions!D3)</f><v></v></c>
<c r="E2" s="8"><f>D2/C2-1</f><v></v></c> <!-- YoY growth, percentage formula -->
</row>
<!-- Row 3: Gross Profit -->
<row r="3">
<c r="A3" t="s" s="2"><v>7</v></c> <!-- "Gross Profit", black formula label -->
<c r="B3" s="6"><f>B2*Assumptions!B4</f><v></v></c>
<c r="C3" s="6"><f>C2*Assumptions!C4</f><v></v></c>
<c r="D3" s="6"><f>D2*Assumptions!D4</f><v></v></c>
<c r="E3" s="8"><f>D3/C3-1</f><v></v></c>
</row>
<!-- Row 5: SUM total row -->
<row r="5">
<c r="A5" t="s" s="4"><v>8</v></c> <!-- "EBITDA" -->
<c r="B5" s="6"><f>SUM(B2:B4)</f><v></v></c>
<c r="C5" s="6"><f>SUM(C2:C4)</f><v></v></c>
<c r="D5" s="6"><f>SUM(D2:D4)</f><v></v></c>
<c r="E5" s="8"><f>D5/C5-1</f><v></v></c>
</row>
</sheetData>
```
#### Column Width and Freeze Pane
Column widths go **before** `<sheetData>`, freeze pane goes inside `<sheetView>`:
```xml
<!-- Inside <sheetViews><sheetView ...> — freeze the header row -->
<pane ySplit="1" topLeftCell="A2" activePane="bottomLeft" state="frozen"/>
<!-- Before <sheetData> — set column widths -->
<cols>
<col min="1" max="1" width="28" customWidth="1"/>
<col min="2" max="8" width="14" customWidth="1"/>
</cols>
```
---
### Step 6 — Apply Styles
The template's `xl/styles.xml` has 13 pre-built semantic style slots (indices 0–12).
**Read `format.md` for the complete style index table, color system, and how to add new styles.**
Quick reference for the most common slots:
| `s` | Role | Example |
|-----|------|---------|
| 4 | Header (bold) | Column/row titles |
| 5 / 6 | Currency input (blue) / formula (black) | `$#,##0` |
| 7 / 8 | Percentage input / formula | `0.0%` |
| 11 | Year (no comma) | 2024 not 2,024 |
Design principle: Blue = human sets this. Black = Excel computes this. Green = cross-sheet.
If you need a style not in the 13 pre-built slots, follow the append-only procedure in `format.md` section 3.2.
---
### Step 7 — Formula Cookbook
#### XML Formula Syntax Reminder
Formulas in XML have **no leading `=`**:
```xml
<!-- Excel UI: =SUM(B2:B9) → XML: -->
<c r="B10" s="6"><f>SUM(B2:B9)</f><v></v></c>
```
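
If formulas arrive in Excel UI form, they can be normalized before being written into `<f>`. A minimal sketch follows; `formula_to_xml` is a hypothetical helper, not part of the skill's scripts.

```python
from xml.sax.saxutils import escape


def formula_to_xml(formula):
    """Strip the UI-style leading '=' and XML-escape the formula text for an <f> node."""
    text = formula.strip()
    if text.startswith("="):
        text = text[1:]
    return f"<f>{escape(text)}</f>"


print(formula_to_xml("=SUM(B2:B9)"))
print(formula_to_xml('=IF(A1<B1,"R&D","")'))
```

`escape()` handles `&`, `<`, and `>`; double quotes inside element text need no escaping.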
#### Basic Aggregations
```xml
<c r="B10" s="6"><f>SUM(B2:B9)</f><v></v></c>
<c r="B11" s="6"><f>AVERAGE(B2:B9)</f><v></v></c>
<c r="B12" s="10"><f>COUNT(B2:B9)</f><v></v></c>
<c r="B13" s="10"><f>COUNTA(A2:A100)</f><v></v></c>
<c r="B14" s="6"><f>MAX(B2:B9)</f><v></v></c>
<c r="B15" s="6"><f>MIN(B2:B9)</f><v></v></c>
```
#### Financial Calculations
```xml
<!-- YoY growth rate: current / prior - 1 -->
<c r="E5" s="8"><f>D5/C5-1</f><v></v></c>
<!-- Gross profit: revenue × gross margin -->
<c r="B6" s="6"><f>B4*B3</f><v></v></c>
<!-- EBITDA margin: EBITDA / Revenue -->
<c r="B9" s="8"><f>B8/B4</f><v></v></c>
<!-- Suppress #DIV/0! when denominator may be zero -->
<c r="E5" s="8"><f>IF(C5=0,0,D5/C5-1)</f><v></v></c>
<!-- NPV and IRR (cash flows in B2:B7, discount rate in B1) -->
<c r="C1" s="6"><f>NPV(B1,B3:B7)+B2</f><v></v></c>
<c r="C2" s="8"><f>IRR(B2:B7)</f><v></v></c>
```
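
The `+B2` in the NPV formula is needed because Excel's `NPV` discounts its first argument by one full period, so a time-0 cash flow has to be added outside the function. The sketch below mirrors that convention in Python for checking expected values; both function names are hypothetical.

```python
def excel_npv(rate, flows):
    """Mirror Excel's NPV: the first flow is discounted by one full period."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(flows, start=1))


def project_npv(rate, initial, later_flows):
    """Equivalent of NPV(B1,B3:B7)+B2: the time-0 flow is added undiscounted."""
    return excel_npv(rate, later_flows) + initial


# An outlay of 100 today and 110 in one year breaks even at a 10% discount rate.
print(round(project_npv(0.10, -100, [110]), 6))
```

Use this only to sanity-check results; the delivered workbook should still contain the live `NPV` formula.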
#### Cross-Sheet References
```xml
<!-- No spaces in name: no quotes needed -->
<c r="B3" s="3"><f>Assumptions!B5</f><v></v></c>
<!-- Space in sheet name: single quotes required -->
<c r="B3" s="3"><f>'Q1 Data'!B5</f><v></v></c>
<!-- Ampersand in sheet name: write &amp; in both workbook.xml and the <f> text; Excel reads it as a literal & -->
<c r="B3" s="3"><f>'R&amp;D'!B5</f><v></v></c>
<!-- Cross-sheet range: SUM of a range in another sheet -->
<c r="B10" s="6"><f>SUM(Data!C2:C1000)</f><v></v></c>
<!-- 3D reference: sum same cell across multiple sheets -->
<c r="B5" s="6"><f>SUM(Jan:Dec!B5)</f><v></v></c>
```
Cross-sheet formula cells should use `s="3"` (green) to signal the data origin.
#### Shared Formulas (Same Pattern Repeated Down a Column)
When many consecutive cells share the same formula structure with only the row number
changing, use shared formulas to keep the XML compact:
```xml
<!-- D2: defines the shared group (si="0", ref="D2:D11") -->
<c r="D2" s="8"><f t="shared" ref="D2:D11" si="0">C2/B2-1</f><v></v></c>
<!-- D3 through D11: reference the same group, no formula text needed -->
<c r="D3" s="8"><f t="shared" si="0"/><v></v></c>
<c r="D4" s="8"><f t="shared" si="0"/><v></v></c>
<c r="D5" s="8"><f t="shared" si="0"/><v></v></c>
<c r="D6" s="8"><f t="shared" si="0"/><v></v></c>
<c r="D7" s="8"><f t="shared" si="0"/><v></v></c>
<c r="D8" s="8"><f t="shared" si="0"/><v></v></c>
<c r="D9" s="8"><f t="shared" si="0"/><v></v></c>
<c r="D10" s="8"><f t="shared" si="0"/><v></v></c>
<c r="D11" s="8"><f t="shared" si="0"/><v></v></c>
```
Excel adjusts relative references automatically (D3 computes `C3/B3-1`, etc.).
If you have multiple shared formula groups, assign sequential `si` values (0, 1, 2, …).
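
Writing the follower cells by hand is repetitive, so they can be generated. This is a sketch under assumptions: `shared_formula_cells` is a hypothetical helper, and each returned `<c>` still has to be placed inside the `<row>` for its own row number.

```python
from xml.sax.saxutils import escape


def shared_formula_cells(col, first_row, last_row, formula, *, si=0, style=None):
    """Return <c> elements for one shared-formula group running down a single column."""
    s_attr = f' s="{style}"' if style is not None else ""
    ref = f"{col}{first_row}:{col}{last_row}"
    # The first cell defines the group: it carries ref= and the formula text.
    cells = [
        f'<c r="{col}{first_row}"{s_attr}>'
        f'<f t="shared" ref="{ref}" si="{si}">{escape(formula)}</f><v></v></c>'
    ]
    # Remaining cells only point at the group through si.
    for row in range(first_row + 1, last_row + 1):
        cells.append(f'<c r="{col}{row}"{s_attr}><f t="shared" si="{si}"/><v></v></c>')
    return cells


for cell in shared_formula_cells("D", 2, 4, "C2/B2-1", style=8):
    print(cell)
```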
#### Absolute References
```xml
<!-- $B$2 locks to that cell when the formula is copied -->
<c r="C5" s="8"><f>B5/$B$2</f><v></v></c>
```
The `$` character needs no XML escaping — write it literally.
#### Lookup Formulas
```xml
<!-- VLOOKUP: exact match (last arg 0) -->
<c r="C5" s="6"><f>VLOOKUP(A5,Assumptions!A:C,2,0)</f><v></v></c>
<!-- INDEX/MATCH: more flexible -->
<c r="C5" s="6"><f>INDEX(B:B,MATCH(A5,A:A,0))</f><v></v></c>
<!-- XLOOKUP (Excel 2021 / Microsoft 365): stored with the _xlfn. prefix in file XML -->
<c r="C5" s="6"><f>_xlfn.XLOOKUP(A5,A:A,B:B)</f><v></v></c>
```
---
### Step 8 — Pack and Validate
**Pack**:
```bash
python3 SKILL_DIR/scripts/xlsx_pack.py /tmp/xlsx_work/ /path/to/output.xlsx
```
`xlsx_pack.py` will:
1. Check that `[Content_Types].xml` exists at the root
2. Parse every `.xml` and `.rels` file for well-formedness — abort if any fail
3. Create the ZIP archive with correct compression
**Validate**:
```bash
python3 SKILL_DIR/scripts/formula_check.py /path/to/output.xlsx
```
`formula_check.py` will:
1. Scan every cell for `<c t="e">` entries (cached error values) — all 7 error types
2. Extract sheet name references from every `<f>` formula
3. Verify each referenced sheet exists in `workbook.xml`
Fix every reported error before delivery. Exit code 0 = safe to deliver.
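
For intuition about the first check, a cached-error scan over one worksheet's XML might look like the sketch below. This is not the implementation of `formula_check.py`; `cached_errors` is a hypothetical function written only to show what a `t="e"` scan involves.

```python
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def cached_errors(sheet_xml):
    """Return (cell address, cached error text) for every <c t="e"> in a worksheet."""
    root = ET.fromstring(sheet_xml)
    hits = []
    for cell in root.iter(f"{{{MAIN_NS}}}c"):
        if cell.get("t") == "e":
            hits.append((cell.get("r"), cell.findtext(f"{{{MAIN_NS}}}v", default="")))
    return hits
```

Cells written with an empty `<v></v>` carry no cached value, so a scan like this only finds errors after the file has been recalculated and saved.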
---
## Pre-Delivery Checklist
Run through this list before handing the file to the user:
- [ ] `formula_check.py` reports 0 errors
- [ ] Every calculated cell has `<f>` — not just `<v>` with a number
- [ ] `sharedStrings.xml` `uniqueCount` matches the actual `<si>` count, and `count` matches the total number of string cell references
- [ ] Every cell `s` attribute value is in range `0` to `cellXfs count - 1`
- [ ] Every sheet in `workbook.xml` has a matching entry in `workbook.xml.rels`
- [ ] Every `worksheets/sheetN.xml` file has a matching `<Override>` in `[Content_Types].xml`
- [ ] Year columns use `s="11"` (format `0`, no thousands separator)
- [ ] Cross-sheet reference formulas use `s="3"` (green font)
- [ ] Assumption inputs use `s="1"` or `s="5"` or `s="7"` (blue font)
---
## Common Mistakes and Fixes
| Mistake | Symptom | Fix |
|---------|---------|-----|
| Formula has leading `=` | Cell shows `=SUM(...)` as text | Remove `=` from `<f>` content |
| sharedStrings `count` not updated | Excel warning or blank cells | Set `uniqueCount` to the number of `<si>` elements and `count` to the total number of string cell references |
| Style index out of range | File corruption / Excel repair | Ensure `s` < `cellXfs count`; append new `<xf>` if needed |
| New sheet rId conflicts with styles/sharedStrings rId | Sheet missing or styles lost | New sheets use rId4, rId5, … (rId1-3 are reserved in template) |
| Sheet name has `&` unescaped in XML | XML parse error | Use `&amp;` in `workbook.xml` name attribute |
| Cross-sheet ref to sheet with space, no quotes | `#REF!` error | Wrap sheet name in single quotes: `'Sheet Name'!B5` |
| Cross-sheet ref to non-existent sheet | `#REF!` error | Check `workbook.xml` sheet list vs formula |
| Number stored as text (`t="s"`) | Left-aligned, can't sum | Remove `t` attribute from number cells |
| Year displayed as `2,024` | Readability issue | Use `s="11"` (numFmtId=1, format `0`) |
| Hardcoded Python result instead of formula | "Dead table" — won't update | Replace `<v>N</v>` with `<f>formula</f><v></v>` |
---
## Column Letter Reference
| Col # | Letter | Col # | Letter | Col # | Letter |
|-------|--------|-------|--------|-------|--------|
| 1 | A | 26 | Z | 27 | AA |
| 28 | AB | 52 | AZ | 53 | BA |
| 54 | BB | 78 | BZ | 79 | CA |
Python conversion (use when building formulas programmatically):
```python
def col_letter(n: int) -> str:
    """Convert 1-based column number to Excel letter (A, B, ..., Z, AA, AB, ...)."""
    result = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        result = chr(65 + rem) + result
    return result


def col_number(s: str) -> int:
    """Convert Excel column letter to 1-based number."""
    n = 0
    for c in s.upper():
        n = n * 26 + (ord(c) - 64)
    return n
```
---
## Typical Scenario Walkthroughs
### Scenario A — Three-Year Financial Model (Single Sheet)
Layout: rows 1-12 = Assumptions (blue inputs) / rows 14-30 = Model (black formulas).
```xml
<!-- sharedStrings.xml (excerpt) -->
<sst count="8" uniqueCount="8">
<si><t>Metric</t></si> <!-- 0 -->
<si><t>FY2023A</t></si> <!-- 1 -->
<si><t>FY2024E</t></si> <!-- 2 -->
<si><t>FY2025E</t></si> <!-- 3 -->
<si><t>Revenue Growth</t></si> <!-- 4 -->
<si><t>Gross Margin</t></si> <!-- 5 -->
<si><t>Revenue</t></si> <!-- 6 -->
<si><t>Gross Profit</t></si> <!-- 7 -->
</sst>
<!-- sheet1.xml (excerpt) -->
<sheetData>
<!-- Header -->
<row r="1">
<c r="A1" t="s" s="4"><v>0</v></c>
<c r="B1" t="s" s="4"><v>1</v></c>
<c r="C1" t="s" s="4"><v>2</v></c>
<c r="D1" t="s" s="4"><v>3</v></c>
</row>
<!-- Assumptions (rows 2-3) -->
<row r="2">
<c r="A2" t="s" s="1"><v>4</v></c> <!-- "Revenue Growth", blue -->
<c r="B2" s="7"><v>0</v></c> <!-- FY2023A: n/a, 0% placeholder -->
<c r="C2" s="7"><v>0.12</v></c> <!-- FY2024E: 12.0% input -->
<c r="D2" s="7"><v>0.15</v></c> <!-- FY2025E: 15.0% input -->
</row>
<row r="3">
<c r="A3" t="s" s="1"><v>5</v></c> <!-- "Gross Margin", blue -->
<c r="B3" s="7"><v>0.45</v></c>
<c r="C3" s="7"><v>0.46</v></c>
<c r="D3" s="7"><v>0.47</v></c>
</row>
<!-- Model (rows 14-15) -->
<row r="14">
<c r="A14" t="s" s="2"><v>6</v></c> <!-- "Revenue", black -->
<c r="B14" s="5"><v>85000000</v></c> <!-- actual, currency input -->
<c r="C14" s="6"><f>B14*(1+C2)</f><v></v></c>
<c r="D14" s="6"><f>C14*(1+D2)</f><v></v></c>
</row>
<row r="15">
<c r="A15" t="s" s="2"><v>7</v></c> <!-- "Gross Profit", black -->
<c r="B15" s="6"><f>B14*B3</f><v></v></c>
<c r="C15" s="6"><f>C14*C3</f><v></v></c>
<c r="D15" s="6"><f>D14*D3</f><v></v></c>
</row>
</sheetData>
```
### Scenario B — Data + Summary (Two Sheets)
The `Summary` sheet pulls from `Data` using cross-sheet formulas (green, `s="3"`):
```xml
<!-- Summary/sheet2.xml sheetData excerpt -->
<sheetData>
<row r="1">
<c r="A1" t="s" s="4"><v>0</v></c> <!-- "Metric" -->
<c r="B1" t="s" s="4"><v>1</v></c> <!-- "Value" -->
</row>
<row r="2">
<c r="A2" t="s" s="0"><v>2</v></c> <!-- "Total Revenue" -->
<c r="B2" s="3"><f>SUM(Data!C2:C10000)</f><v></v></c>
</row>
<row r="3">
<c r="A3" t="s" s="0"><v>3</v></c> <!-- "Deal Count" -->
<c r="B3" s="3"><f>COUNTA(Data!A2:A10000)</f><v></v></c>
</row>
<row r="4">
<c r="A4" t="s" s="0"><v>4</v></c> <!-- "Avg Deal Size" -->
<c r="B4" s="6"><f>IF(B3=0,0,B2/B3)</f><v></v></c> <!-- same-sheet formula: black (s="6"), not green -->
</row>
</sheetData>
```
### Scenario C — Multi-Department Consolidation
`Consolidated` sheet sums the same cells from multiple department sheets:
```xml
<!-- Consolidated/sheet4.xml: summing across Dept_Engineering and Dept_Marketing -->
<sheetData>
<row r="5">
<c r="A5" t="s" s="2"><v>0</v></c>
<!-- No spaces in sheet names → no quotes needed -->
<c r="B5" s="3"><f>Dept_Engineering!B5+Dept_Marketing!B5</f><v></v></c>
</row>
<row r="6">
<c r="A6" t="s" s="2"><v>1</v></c>
<c r="B6" s="3"><f>SUM(Dept_Engineering!B6,Dept_Marketing!B6)</f><v></v></c>
</row>
</sheetData>
```
---
## What You Must NOT Do
- Do NOT use openpyxl or any Python library to write the final xlsx file
- Do NOT hardcode any calculated value — use `<f>` formulas for every derived number
- Do NOT deliver without running `formula_check.py` first
- Do NOT set a cell's `s` attribute to a value >= `cellXfs count`
- Do NOT modify an existing `<xf>` entry in `styles.xml` — only append new ones
- Do NOT add a new sheet without updating all four sync points (workbook.xml,
workbook.xml.rels, [Content_Types].xml, actual .xml file)
- Do NOT assign new worksheet rIds that overlap with rId1, rId2, or rId3 (reserved
for sheet1, styles, sharedStrings in the template)


@@ -0,0 +1,684 @@
# Minimal-Invasive Editing of Existing xlsx
Make precise, surgical changes to existing xlsx files while preserving everything you do not touch: styles, macros, pivot tables, charts, sparklines, named ranges, data validation, conditional formatting, and all other embedded content.
---
## 1. When to Use This Path
Use the edit (unpack → XML edit → pack) path whenever the task involves **modifying an existing xlsx file**:
- Template filling — populating designated input cells with values or formulas
- Data updates — replacing outdated numbers, text, or dates in a live file
- Content corrections — fixing wrong values, broken formulas, or mistyped labels
- Adding new data rows to an existing table
- Renaming a sheet
- Applying a new style to specific cells
Do NOT use this path for creating a brand-new workbook from scratch. For that, see `create.md`.
---
## 2. Why openpyxl round-trip Is Forbidden for Existing Files
openpyxl `load_workbook()` followed by `workbook.save()` is a **destructive operation** on any file that contains advanced features. The library silently drops content it does not understand:
| Feature | openpyxl behavior | Consequence |
|---------|-------------------|-------------|
| VBA macros (`vbaProject.bin`) | Dropped entirely | All automation is lost; file saved as `.xlsx` not `.xlsm` |
| Pivot tables (`xl/pivotTables/`) | Dropped | Interactive analysis is destroyed |
| Slicers | Dropped | Filter UI is lost |
| Sparklines (`<sparklineGroups>`) | Dropped | In-cell mini-charts disappear |
| Chart formatting details | Partially lost | Series colors, custom axes may revert |
| Print area / page breaks | Sometimes lost | Print layout changes |
| Custom XML parts | Dropped | Third-party data bindings broken |
| Theme-linked colors | May be de-themed | Colors converted to absolute, breaking theme switching |
Even on a "plain" file without these features, openpyxl may normalize whitespace in XML that Excel relies on, alter namespace declarations, or reset `calcMode` flags.
**The rule is absolute: never open an existing file with openpyxl for the purpose of re-saving it.**
The direct XML edit approach is safe because you change only the nodes you touch. Everything else keeps its original content. Note that the unpack step pretty-prints XML, so untouched parts are content-equivalent rather than strictly byte-identical.
---
## 3. Standard Operating Procedure
### Step 1 — Unpack
```bash
python3 SKILL_DIR/scripts/xlsx_unpack.py input.xlsx /tmp/xlsx_work/
```
The script unzips the xlsx, pretty-prints every XML and `.rels` file, and prints a categorized inventory of key files plus a warning if high-risk content is detected (VBA, pivot tables, charts).
Read the printed output carefully before proceeding. If the script reports `xl/vbaProject.bin` or `xl/pivotTables/`, follow the constraints in Section 7.
### Step 2 — Reconnaissance
Map the structure before touching anything.
**Identify sheet names and their XML files:**
```
xl/workbook.xml → <sheet name="Revenue" sheetId="1" r:id="rId1"/>
xl/_rels/workbook.xml.rels → <Relationship Id="rId1" Target="worksheets/sheet1.xml"/>
```
The sheet named "Revenue" lives in `xl/worksheets/sheet1.xml`. Always resolve this mapping before editing a worksheet.
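The name-to-file lookup can also be scripted. The following is a minimal sketch, assuming the two XML files are passed in as text; `map_sheets_to_files` is a hypothetical helper name, not one of the skill's scripts.

```python
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def map_sheets_to_files(workbook_xml: str, rels_xml: str) -> dict:
    """Return {sheet name: part path} by joining workbook.xml with its .rels file."""
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter(f"{{{PKG_REL_NS}}}Relationship")
    }
    mapping = {}
    for sheet in ET.fromstring(workbook_xml).iter(f"{{{MAIN_NS}}}sheet"):
        target = targets[sheet.get(f"{{{DOC_REL_NS}}}id")]
        if target.startswith("/"):
            # A leading slash means the target is relative to the package root.
            mapping[sheet.get("name")] = target.lstrip("/")
        else:
            # Otherwise the target is relative to the xl/ folder.
            mapping[sheet.get("name")] = "xl/" + target
    return mapping
```

Read `xl/workbook.xml` and `xl/_rels/workbook.xml.rels` from the unpacked directory and pass their contents to the function; it only reads, so nothing in the working directory changes.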
**Understand the shared strings table:**
```bash
# Count existing entries in xl/sharedStrings.xml
grep -c "<si>" /tmp/xlsx_work/xl/sharedStrings.xml
```
Every text cell uses a zero-based index into this table. Know the current count before appending.
**Understand the styles table:**
```bash
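# Read the declared size of <cellXfs>
# (counting every "<xf " would also match the entries under <cellStyleXfs>)
grep -o '<cellXfs count="[0-9]*"' /tmp/xlsx_work/xl/styles.xml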
```
New style slots are appended after existing ones. The index of the first new slot = current count.
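A text search for `<xf ` can also match entries under `<cellStyleXfs>`, so counting the children of `<cellXfs>` specifically is more reliable. A minimal sketch, assuming the contents of `styles.xml` are passed in as text (`next_style_index` is a hypothetical helper name):

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def next_style_index(styles_xml: str) -> int:
    """Return the index the next appended <xf> in <cellXfs> would receive."""
    root = ET.fromstring(styles_xml)
    # Only direct children of <cellXfs> count; <cellStyleXfs> is a separate list.
    return len(root.findall("m:cellXfs/m:xf", NS))
```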
**Scan for high-risk XML regions in the target worksheet:**
Look for these elements in the target `sheet*.xml` before editing:
- `<mergeCell>` — merged cell ranges; row/column insertion shifts these
- `<conditionalFormatting>` — condition ranges; row/column insertion shifts these
- `<dataValidations>` — validation ranges; row/column insertion shifts these
- `<tableParts>` — table definitions; adding rows inside a table requires updating the table's `ref`, and adding columns also requires `<tableColumn>` updates
- `<sparklineGroups>` — sparklines; preserve without modification
### Step 3 — Map Intent to Minimal XML Changes
Before writing a single character, produce a written list of exactly which XML nodes change. This prevents scope creep.
| User intent | Files to change | Nodes to change |
|-------------|----------------|-----------------|
| Change a cell's numeric value | `xl/worksheets/sheetN.xml` | `<v>` inside target `<c>` |
| Change a cell's text | `xl/sharedStrings.xml` (append) + `xl/worksheets/sheetN.xml` | New `<si>`, update cell `<v>` index |
| Change a cell's formula | `xl/worksheets/sheetN.xml` | `<f>` text inside target `<c>` |
| Add a new data row at the bottom | `xl/worksheets/sheetN.xml` + possibly `xl/sharedStrings.xml` | Append `<row>` element |
| Apply a new style to cells | `xl/styles.xml` + `xl/worksheets/sheetN.xml` | Append `<xf>` in `<cellXfs>`, update `s` attribute on `<c>` |
| Rename a sheet | `xl/workbook.xml` | `name` attribute on `<sheet>` element |
| Rename a sheet (with cross-sheet formulas) | `xl/workbook.xml` + all `xl/worksheets/*.xml` | `name` attribute + `<f>` text referencing old name |
### Step 4 — Execute Changes
Use the Edit tool. Edit the minimum. Never rewrite whole files.
See Section 4 for precise XML patterns for each operation type.
### Step 5 — Cascade Check
After any change that shifts row or column positions, audit all affected XML regions. See Section 5.
### Step 6 — Pack and Validate
```bash
python3 SKILL_DIR/scripts/xlsx_pack.py /tmp/xlsx_work/ output.xlsx
python3 SKILL_DIR/scripts/formula_check.py output.xlsx
```
The pack script validates XML well-formedness before creating the ZIP. Fix any reported parse errors before packing. After packing, run `formula_check.py` to confirm no formula errors were introduced.
---
## 4. Precise XML Patterns for Common Edits
### 4.1 Changing a Numeric Cell Value
Find the `<c r="B5">` element in the worksheet XML and replace the `<v>` text.
**Before:**
```xml
<c r="B5">
<v>1000</v>
</c>
```
**After (new value 1500):**
```xml
<c r="B5">
<v>1500</v>
</c>
```
Rules:
- Do not add or remove the `s` attribute (style) unless explicitly changing the style.
- Do not add a `t` attribute — numbers omit `t` or use `t="n"`.
- Do not change the `r` attribute (cell reference).
---
### 4.2 Changing a Text Cell Value
Text cells reference the shared strings table by index (`t="s"`). You cannot edit the string in-place without affecting every other cell that uses the same index. The safe approach is to append a new entry.
**Before — shared strings file (`xl/sharedStrings.xml`):**
```xml
<sst count="4" uniqueCount="4">
<si><t>Revenue</t></si>
<si><t>Cost</t></si>
<si><t>Margin</t></si>
<si><t>Old Label</t></si>
</sst>
```
**After — append new string, increment counts:**
```xml
<sst count="5" uniqueCount="5">
<si><t>Revenue</t></si>
<si><t>Cost</t></si>
<si><t>Margin</t></si>
<si><t>Old Label</t></si>
<si><t>New Label</t></si>
</sst>
```
New string is at index 4 (zero-based).
**Before — cell in worksheet XML:**
```xml
<c r="A7" t="s">
<v>3</v>
</c>
```
**After — point to new index:**
```xml
<c r="A7" t="s">
<v>4</v>
</c>
```
Rules:
- Never modify or delete existing `<si>` entries. Only append.
- Both `count` and `uniqueCount` must be incremented together.
- If the new string contains `&`, `<`, or `>`, escape them: `&amp;`, `&lt;`, `&gt;`.
- If the string has leading or trailing spaces, add `xml:space="preserve"` to `<t>`:
```xml
<si><t xml:space="preserve"> indented text </t></si>
```
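The escaping and whitespace rules above can be applied mechanically before pasting a new entry into `sharedStrings.xml`. A minimal sketch; `make_si` is a hypothetical helper that only builds the `<si>` snippet and leaves the file edit and the count updates to you:

```python
from xml.sax.saxutils import escape


def make_si(text: str) -> str:
    """Build an <si> entry with XML escaping and whitespace preservation."""
    body = escape(text)  # escapes &, < and >
    # Leading or trailing whitespace is only kept if xml:space="preserve" is set.
    attr = ' xml:space="preserve"' if text != text.strip() else ""
    return f"<si><t{attr}>{body}</t></si>"
```

For example, `make_si("R&D <net>")` yields `<si><t>R&amp;D &lt;net&gt;</t></si>`.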
---
### 4.3 Changing a Formula
Formulas are stored in `<f>` elements **without a leading `=`** (unlike what you type in Excel's UI).
**Before:**
```xml
<c r="C10">
<f>SUM(C2:C9)</f>
<v>4800</v>
</c>
```
**After (extended range):**
```xml
<c r="C10">
<f>SUM(C2:C11)</f>
<v></v>
</c>
```
Rules:
- Clear `<v>` to an empty string when changing the formula. The cached value is now stale.
- Do not add `t="s"` or any type attribute to formula cells. The `t` attribute is absent or uses a result-type value, not a formula marker.
- Cross-sheet references use `SheetName!CellRef`. If the sheet name contains spaces, wrap in single quotes: `'Q1 Data'!B5`.
- The `<f>` text must not include the leading `=`.
**Before (converting a hardcoded value to a live formula):**
```xml
<c r="D15">
<v>95000</v>
</c>
```
**After:**
```xml
<c r="D15">
<f>SUM(D2:D14)</f>
<v></v>
</c>
```
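The cross-sheet quoting rule can be captured in a small helper. This is a conservative sketch, not an exhaustive implementation of Excel's naming rules: it quotes any name that is not a plain identifier (quoting a name that does not strictly need it is harmless) and doubles embedded apostrophes. `sheet_ref` is a hypothetical name.

```python
import re

_PLAIN_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def sheet_ref(sheet: str, cell: str) -> str:
    """Return the text for a cross-sheet reference, without a leading '='."""
    if _PLAIN_NAME.fullmatch(sheet):
        return f"{sheet}!{cell}"
    # Inside a quoted sheet name, an apostrophe is written twice.
    return "'" + sheet.replace("'", "''") + "'!" + cell
```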
---
### 4.4 Adding a New Data Row
Append after the last `<row>` element inside `<sheetData>`. Row numbers in OOXML are 1-based and must appear in ascending order.
**Before (last row is row 10):**
```xml
<row r="10">
<c r="A10" t="s"><v>3</v></c>
<c r="B10"><v>2023</v></c>
<c r="C10"><v>88000</v></c>
<c r="D10"><f>C10*1.1</f><v></v></c>
</row>
</sheetData>
```
**After (new row 11 appended):**
```xml
<row r="10">
<c r="A10" t="s"><v>3</v></c>
<c r="B10"><v>2023</v></c>
<c r="C10"><v>88000</v></c>
<c r="D10"><f>C10*1.1</f><v></v></c>
</row>
<row r="11">
<c r="A11" t="s"><v>4</v></c>
<c r="B11"><v>2024</v></c>
<c r="C11"><v>96000</v></c>
<c r="D11"><f>C11*1.1</f><v></v></c>
</row>
</sheetData>
```
Rules:
- Every `<c>` inside the row must have `r` set to the correct cell address (e.g., `A11`).
- Text cells need `t="s"` and a sharedStrings index in `<v>`. Numeric cells omit `t`.
- Formula cells use `<f>` and an empty `<v>`.
- Copy the `s` attribute from the row above if you want matching styles. Do not invent a style index that does not exist in `styles.xml`.
- If the sheet contains a `<dimension>` element (e.g., `<dimension ref="A1:D10"/>`), update it to include the new row: `<dimension ref="A1:D11"/>`.
- If the sheet contains a `<tableParts>` element referencing a table, update the table's `ref` attribute in the corresponding `xl/tables/tableN.xml` file.
---
### 4.5 Adding a New Column
Append new `<c>` elements to each existing `<row>` and, if present, update the `<cols>` section.
**Before (rows have columns A–C):**
```xml
<cols>
<col min="1" max="3" width="14" customWidth="1"/>
</cols>
<sheetData>
<row r="1">
<c r="A1" t="s"><v>0</v></c>
<c r="B1" t="s"><v>1</v></c>
<c r="C1" t="s"><v>2</v></c>
</row>
<row r="2">
<c r="A2"><v>100</v></c>
<c r="B2"><v>200</v></c>
<c r="C2"><v>300</v></c>
</row>
</sheetData>
```
**After (adding column D):**
```xml
<cols>
<col min="1" max="3" width="14" customWidth="1"/>
<col min="4" max="4" width="14" customWidth="1"/>
</cols>
<sheetData>
<row r="1">
<c r="A1" t="s"><v>0</v></c>
<c r="B1" t="s"><v>1</v></c>
<c r="C1" t="s"><v>2</v></c>
<c r="D1" t="s"><v>5</v></c>
</row>
<row r="2">
<c r="A2"><v>100</v></c>
<c r="B2"><v>200</v></c>
<c r="C2"><v>300</v></c>
<c r="D2"><f>A2+B2+C2</f><v></v></c>
</row>
</sheetData>
```
Rules:
- Adding a column at the end (after the last existing column) is safe — no existing formula references shift.
- Inserting a column in the middle shifts all columns to the right, which requires the same cascade updates as row insertion (see Section 5).
- Update the `<dimension>` element if present.
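The `<col min="4" max="4">` entry uses 1-based column numbers while cell addresses use letters, so a converter helps when computing the new `<cols>` entry or `<dimension>` value. A minimal sketch with hypothetical helper names:

```python
def col_to_index(letters: str) -> int:
    """Convert a column letter such as 'D' to its 1-based number (4)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def index_to_col(index: int) -> str:
    """Convert a 1-based column number such as 28 to its letters ('AB')."""
    letters = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters
```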
---
### 4.6 Modifying or Adding Styles
Styles use a multi-level indirect reference chain. Read `ooxml-cheatsheet.md` for the full chain. The key rule: **only append new entries, never modify existing ones**.
**Scenario:** Add a blue-font style (for hardcoded input cells) that doesn't yet exist.
**Step 1 — Check if a matching font already exists in `xl/styles.xml`:**
```xml
<!-- Look inside <fonts> for an existing blue font -->
<font>
<color rgb="000000FF"/>
<!-- other attributes -->
</font>
```
If found, note its index (zero-based position in the `<fonts>` list). If not found, append.
**Step 2 — Append the new font if needed:**
Before:
```xml
<fonts count="3">
<font>...</font> <!-- index 0 -->
<font>...</font> <!-- index 1 -->
<font>...</font> <!-- index 2 -->
</fonts>
```
After:
```xml
<fonts count="4">
<font>...</font> <!-- index 0 -->
<font>...</font> <!-- index 1 -->
<font>...</font> <!-- index 2 -->
<font>
<b/>
<sz val="11"/>
<color rgb="000000FF"/>
<name val="Calibri"/>
</font> <!-- index 3 (new) -->
</fonts>
```
**Step 3 — Append a new `<xf>` in `<cellXfs>`:**
Before:
```xml
<cellXfs count="5">
<xf .../> <!-- index 0 -->
<xf .../> <!-- index 1 -->
<xf .../> <!-- index 2 -->
<xf .../> <!-- index 3 -->
<xf .../> <!-- index 4 -->
</cellXfs>
```
After:
```xml
<cellXfs count="6">
<xf .../> <!-- index 0 -->
<xf .../> <!-- index 1 -->
<xf .../> <!-- index 2 -->
<xf .../> <!-- index 3 -->
<xf .../> <!-- index 4 -->
<xf numFmtId="0" fontId="3" fillId="0" borderId="0" xfId="0"
applyFont="1"/> <!-- index 5 (new) -->
</cellXfs>
```
**Step 4 — Apply to target cells:**
Before:
```xml
<c r="B3">
<v>0.08</v>
</c>
```
After:
```xml
<c r="B3" s="5">
<v>0.08</v>
</c>
```
Rules:
- Never delete or reorder existing entries in `<fonts>`, `<fills>`, `<borders>`, `<cellXfs>`.
- Always update the `count` attribute when appending.
- The new `cellXfs` index = the old `count` value before appending (zero-based: if count was 5, new index is 5).
- Custom `numFmt` IDs must be 164 or above. IDs 0–163 are built-in and must not be re-declared.
- If the desired style already exists elsewhere in the file (on a similar cell), reuse its `s` index rather than creating a duplicate.
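After applying styles, it is worth checking that no cell points past the end of `<cellXfs>`. A minimal sketch, assuming the worksheet and `styles.xml` contents are passed in as text; `cells_with_invalid_style` is a hypothetical helper, not part of the skill's scripts.

```python
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"m": MAIN_NS}


def cells_with_invalid_style(sheet_xml: str, styles_xml: str) -> list:
    """Return (cell address, s value) pairs whose s is >= the cellXfs length."""
    xf_count = len(ET.fromstring(styles_xml).findall("m:cellXfs/m:xf", NS))
    bad = []
    for cell in ET.fromstring(sheet_xml).iter(f"{{{MAIN_NS}}}c"):
        s = int(cell.get("s", "0"))  # a missing s attribute means style 0
        if s >= xf_count:
            bad.append((cell.get("r"), s))
    return bad
```

An empty list means every `s` attribute resolves to an existing style slot.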
---
### 4.7 Renaming a Sheet
**Only `xl/workbook.xml` needs to change** — unless cross-sheet formulas reference the old name.
**Before (`xl/workbook.xml`):**
```xml
<sheet name="Sheet1" sheetId="1" r:id="rId1"/>
```
**After:**
```xml
<sheet name="Revenue" sheetId="1" r:id="rId1"/>
```
**If any formula in any worksheet references the old name, update those too:**
Before (`xl/worksheets/sheet2.xml`):
```xml
<c r="B5"><f>Sheet1!C10</f><v></v></c>
```
After:
```xml
<c r="B5"><f>Revenue!C10</f><v></v></c>
```
If the new name contains spaces:
```xml
<c r="B5"><f>'Q1 Revenue'!C10</f><v></v></c>
```
Scan all worksheet XML files for the old name:
```bash
grep -r "Sheet1!" /tmp/xlsx_work/xl/worksheets/
```
Rules:
- The `.rels` file and `[Content_Types].xml` do NOT need to change — they reference the XML file path, not the sheet name.
- `sheetId` must not change; it is a stable internal identifier.
- Sheet names are case-sensitive in formula references.
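Rewriting the formula text by hand is error-prone when the old name appears both quoted and unquoted. The sketch below handles both forms in one pass; it is an illustration under simplifying assumptions (it does not skip string literals that happen to contain the old name), and `rename_sheet_in_formula` is a hypothetical helper.

```python
import re


def _quote(name: str) -> str:
    """Quote a sheet name unless it is a plain identifier."""
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        return name
    return "'" + name.replace("'", "''") + "'"


def rename_sheet_in_formula(formula: str, old: str, new: str) -> str:
    """Replace references to sheet `old` with `new` inside one <f> text."""
    quoted_old = "'" + old.replace("'", "''") + "'!"
    pattern = re.compile(
        re.escape(quoted_old)
        # Unquoted form: must not be the tail of a longer name such as DataSheet1.
        + r"|(?<![A-Za-z0-9_.'\]])" + re.escape(old) + "!"
    )
    new_prefix = _quote(new) + "!"
    return pattern.sub(lambda _m: new_prefix, formula)
```

Apply it to the text of each `<f>` element found by the `grep` scan, then make the edit with the Edit tool as usual.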
---
## 5. High-Risk Operations — Cascade Effects
### 5.1 Inserting a Row in the Middle
Inserting a row at position N shifts all rows from N downward. Every reference to those rows in every XML file must be updated.
**Files to check and update:**
| XML region | What to update | Example shift |
|------------|---------------|---------------|
| Worksheet `<row r="...">` attributes | Increment row number for all rows >= N | `r="7"` → `r="8"` |
| All `<c r="...">` within those rows | Increment row number in cell address | `r="A7"` → `r="A8"` |
| All `<f>` formula text in any sheet | Shift every row reference >= N (relative or `$`-absolute) | `B7` → `B8` |
| `<mergeCell ref="...">` | Shift start and end rows | `A7:C7` → `A8:C8` |
| `<conditionalFormatting sqref="...">` | Shift or expand range | `A5:D20` → `A5:D21` |
| `<dataValidation sqref="...">` | Shift range | `B6:B50` → `B7:B51` |
| `xl/charts/chartN.xml` data source ranges | Shift series ranges | `Sheet1!$B$5:$B$20` → `Sheet1!$B$6:$B$21` |
| `xl/pivotTables/*.xml` source ranges | Shift source data range | Handle with extreme care — see Section 7 |
| `<dimension ref="...">` | Expand to include new extent | `A1:D20` → `A1:D21` |
| `xl/tables/tableN.xml` `ref` attribute | Expand table boundary | `A1:D20` → `A1:D21` |
**Do not attempt row insertion manually in large or formula-heavy files.** Use the dedicated shift script instead:
```bash
# Insert 1 row at row 5: all rows 5 and below shift down by 1
python3 SKILL_DIR/scripts/xlsx_shift_rows.py /tmp/xlsx_work/ insert 5 1
# Delete 1 row at row 8: all rows 9 and below shift up by 1
python3 SKILL_DIR/scripts/xlsx_shift_rows.py /tmp/xlsx_work/ delete 8 1
```
The script updates in one pass: `<row r="...">` attributes, `<c r="...">` cell addresses, all `<f>` formula text across every worksheet, `<mergeCell>` ranges, `<conditionalFormatting sqref="...">`, `<dataValidation sqref="...">`, `<dimension ref="...">`, table `ref` attributes in `xl/tables/`, chart series ranges in `xl/charts/`, and pivot cache source ranges in `xl/pivotCaches/`.
**After running the shift script, always repack and validate:**
```bash
python3 SKILL_DIR/scripts/xlsx_pack.py /tmp/xlsx_work/ output.xlsx
python3 SKILL_DIR/scripts/formula_check.py output.xlsx
```
**What the script does NOT update (review manually):**
- Named ranges in `xl/workbook.xml` `<definedNames>` — check and update if they reference shifted rows.
- Structured table references (`Table[@Column]`) inside formulas.
- External workbook links in `xl/externalLinks/`.
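For the items reviewed manually, such as a `<definedName>` that points at a shifted row, the row arithmetic can be sketched as follows. This is an illustration for insertions only, under simplifying assumptions: it skips quoted sheet names and string literals, ignores whole-row references such as `5:5`, and is not the logic of `xlsx_shift_rows.py`. `shift_rows_for_insert` is a hypothetical name.

```python
import re

_REF = re.compile(
    r"'(?:[^']|'')*'"        # quoted sheet name: leave untouched
    r'|"(?:[^"]|"")*"'       # string literal: leave untouched
    # A1-style reference, optionally with $ markers; not a function or sheet name.
    r"|(?<![A-Za-z0-9_.$])(\$?[A-Z]{1,3})(\$?)(\d+)(?![A-Za-z0-9_!(])"
)


def shift_rows_for_insert(formula: str, at_row: int, count: int) -> str:
    """Shift row numbers >= at_row down by count, as after inserting rows."""
    def repl(match):
        if match.group(3) is None:  # quoted name or string literal matched
            return match.group(0)
        row = int(match.group(3))
        if row >= at_row:
            row += count
        return f"{match.group(1)}{match.group(2)}{row}"

    return _REF.sub(repl, formula)
```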
### 5.2 Inserting a Column in the Middle
Same cascade logic as row insertion, but for columns. Column references in formulas (`B`, `$C`, etc.) and in merged cell ranges, conditional formatting ranges, and chart data sources all need updating.
Column letter shifting is harder to automate safely. Prefer **appending columns at the end** whenever possible.
### 5.3 Deleting a Row or Column
Deletion is more dangerous than insertion because any formula that referenced a deleted row or column will become `#REF!`. Before deleting:
1. Search all `<f>` elements for references to the deleted range.
2. If any formula references a cell in the deleted row/column, do not delete — instead, either clear the row's data or consult the user.
3. After deletion, shift all references to rows/columns beyond the deletion point upward/leftward.
---
## 6. Template Filling — Identifying and Populating Input Cells
Templates designate certain cells as input zones. Common patterns to recognize them:
### 6.1 How Templates Signal Input Zones
| Signal | XML manifestation | What to look for |
|--------|-------------------|-----------------|
| Blue font color | `s` attribute pointing to a `cellXfs` entry with `fontId` → `<color rgb="000000FF"/>` | Check `styles.xml` to decode `s` values |
| Yellow fill (highlight) | `s` → `fillId` → `<fill><patternFill><fgColor rgb="00FFFF00"/>` | |
| Empty `<v>` element | `<c r="B5"><v></v></c>` or cell entirely absent from `<row>` | The cell has no value yet |
| Comment/annotation near cell | `xl/comments1.xml` with `ref="B5"` | Comments often label input fields |
| Named ranges | `xl/workbook.xml` `<definedName>` elements | Template may define `InputRevenue` etc. |
### 6.2 Filling a Template Cell
Do not change `s` attributes. Do not change `t` attributes unless you must change from empty to typed. Only change `<v>` or add `<f>`.
**Before (empty input cell with style preserved):**
```xml
<c r="C5" s="3">
<v></v>
</c>
```
**After (filled with a number, style unchanged):**
```xml
<c r="C5" s="3">
<v>125000</v>
</c>
```
**After (filled with text — requires shared string entry first):**
```xml
<!-- 1. Append to sharedStrings.xml: <si><t>North Region</t></si> at index 7 -->
<c r="C5" t="s" s="3">
<v>7</v>
</c>
```
**After (filled with a formula, preserving style):**
```xml
<c r="C5" s="3">
<f>Assumptions!D12</f>
<v></v>
</c>
```
### 6.3 Locating Input Zones Without Opening the File in Excel
After unpacking, decode the style index on suspected input cells to determine if they have the template's input color:
1. Note the `s` value on the cell (e.g., `s="4"`).
2. In `xl/styles.xml`, find `<cellXfs>` and look at the 5th entry (index 4).
3. Note its `fontId` (e.g., `fontId="2"`).
4. In `<fonts>`, look at the 3rd entry (index 2) and check for `<color rgb="000000FF"/>` (blue) or other input marker.
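The four lookup steps above can be sketched as a single function. It assumes the contents of `styles.xml` are passed in as text and that fonts use explicit `rgb` colors; theme-based colors would return `None`. `font_rgb_for_style` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def font_rgb_for_style(styles_xml: str, s: int):
    """Follow s -> cellXfs[s] -> fontId -> fonts[fontId] and return its rgb color."""
    root = ET.fromstring(styles_xml)
    xf = root.findall("m:cellXfs/m:xf", NS)[s]
    font = root.findall("m:fonts/m:font", NS)[int(xf.get("fontId", "0"))]
    color = font.find("m:color", NS)
    # No <color> element, or a theme color without rgb, gives None.
    return None if color is None else color.get("rgb")
```

A return value of `000000FF` on a cell without an `<f>` element suggests an input cell under the blue-font convention.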
If the template uses named ranges as input fields, read them from `xl/workbook.xml`:
```xml
<definedNames>
<definedName name="InputGrowthRate">Assumptions!$B$5</definedName>
<definedName name="InputDiscountRate">Assumptions!$B$6</definedName>
</definedNames>
```
Fill the target cells (`Assumptions!B5`, `Assumptions!B6`) directly.
### 6.4 Template Filling Rules
- Fill only cells the template designated as inputs. Do not fill cells that are formula-driven.
- Do not apply new styles when filling. The template's formatting is the deliverable.
- Do not add or remove rows inside the template's data area unless the template explicitly has an "append here" zone.
- After filling, verify that no formula errors were introduced: some templates have input-validation formulas that produce `#VALUE!` if the wrong data type is entered.
---
## 7. Files You Must Never Modify
### 7.1 Absolute no-touch list
| File / location | Why |
|-----------------|-----|
| `xl/vbaProject.bin` | Binary VBA bytecode. Any modification, even a single bit, corrupts the macro project and the macros fail to load. |
| `xl/pivotCaches/pivotCacheDefinition*.xml` | The cache definition ties the pivot table to its source data. Editing it without also updating the corresponding `pivotTable*.xml` will corrupt the pivot. The only permitted change is the source-range `ref` update described in Section 7.2. |
| `xl/pivotTables/*.xml` | Pivot table XML is tightly coupled with the cache definition and with internal state Excel rebuilds on load. Do not edit. If you shifted rows and the pivot's source range now points to wrong data, update only the `<cacheSource>` range in the cache definition, and only the `ref` attribute in the pivot table — no other changes. |
| `xl/slicers/*.xml` | Slicers are connected to specific cache IDs and pivot fields. Breaking these connections silently corrupts the file. |
| `xl/connections.xml` | External data connections. Editing breaks live data refresh. |
| `xl/externalLinks/` | External workbook links. The binary `.bin` files in here must not be modified. |
### 7.2 Conditionally safe files (update only specific attributes)
| File | What you may update | What to leave alone |
|------|--------------------|--------------------|
| `xl/charts/chartN.xml` | Data series range references (`<numRef><f>`) after a row/column shift | Chart type, formatting, layout |
| `xl/tables/tableN.xml` | `ref` attribute on `<table>` after adding rows | Column definitions, style info |
| `xl/pivotCaches/pivotCacheDefinition*.xml` | `ref` attribute on `<cacheSource><worksheetSource>` after shifting source data | All other content |
---
## 8. Validation After Every Edit
Never skip validation. Even a one-character change in a formula can cause cascading errors.
```bash
# Pack
python3 SKILL_DIR/scripts/xlsx_pack.py /tmp/xlsx_work/ output.xlsx
# Static formula validation (always run)
python3 SKILL_DIR/scripts/formula_check.py output.xlsx
# Dynamic validation (if LibreOffice available)
python3 SKILL_DIR/scripts/libreoffice_recalc.py output.xlsx /tmp/recalc.xlsx
python3 SKILL_DIR/scripts/formula_check.py /tmp/recalc.xlsx
```
If `formula_check.py` reports any error:
1. Unpack the output file again, so that you edit the packed version rather than a stale working directory.
2. Locate the reported cell in the worksheet XML.
3. Fix the `<f>` element.
4. Repack and re-validate.
Do not deliver the file until `formula_check.py` reports zero errors.
---
## 9. Absolute Rules Summary
| Rule | Rationale |
|------|-----------|
| Never use openpyxl `load_workbook` + `save` on an existing file | Round-trip destroys pivot tables, VBA, sparklines, slicers |
| Never delete or reorder existing `<si>` entries in sharedStrings | Breaks every cell referencing that index |
| Never delete or reorder existing `<xf>` entries in `<cellXfs>` | Breaks every cell using that style index |
| Never modify `vbaProject.bin` | Binary file; any change corrupts VBA |
| Never change `sheetId` when renaming a sheet | Internal ID is stable; changing it breaks relationships |
| Never skip post-edit validation | Leaves broken references undetected |
| Never edit more XML nodes than required | Extra changes risk introducing subtle corruption |
| Clear `<v>` to empty string when changing a formula | Prevents stale cached value from misleading downstream consumers |
| Append-only to sharedStrings | Existing indexes must remain valid |
| Append-only to styles collections | Existing style indexes must remain valid |

View File

@@ -0,0 +1,37 @@
# FIX — Repair Broken Formulas in an Existing xlsx
This is an EDIT task. You MUST preserve all original sheets and data. Never create a new workbook.
## Workflow
```bash
# Step 1: Identify errors
python3 SKILL_DIR/scripts/formula_check.py input.xlsx --json
# Step 2: Unpack
python3 SKILL_DIR/scripts/xlsx_unpack.py input.xlsx /tmp/xlsx_work/
# Step 3: Fix each broken <f> element in the worksheet XML using the Edit tool
# (see Error-to-Fix mapping below)
# Step 4: Pack and validate
python3 SKILL_DIR/scripts/xlsx_pack.py /tmp/xlsx_work/ output.xlsx
python3 SKILL_DIR/scripts/formula_check.py output.xlsx
```
## Error-to-Fix Mapping
| Error | Fix Strategy |
|-------|-------------|
| `#DIV/0!` | Wrap: `IFERROR(original_formula, "-")` |
| `#NAME?` | Fix misspelled function (e.g. `SUMM` → `SUM`) |
| `#REF!` | Reconstruct the broken reference |
| `#VALUE!` | Fix type mismatch |
For the full list of Excel error types and advanced diagnostics, see `validate.md`.
## Critical Rules
- The output MUST contain the same sheets as the input. Do NOT create a new workbook.
- Only modify the specific `<f>` elements that are broken — everything else must be untouched.
- After packing, always run `formula_check.py` to confirm all errors are resolved.

View File

@@ -0,0 +1,768 @@
# Financial Formatting & Output Standards — Complete Agent Guide
> This document is the complete reference manual for the agent when applying professional financial formatting to xlsx files. All operations target direct XML surgery on `xl/styles.xml` without using openpyxl. Every operational step provides ready-to-use XML snippets.
---
## 1. When to Use This Path
This document (FORMAT path) applies to the following two scenarios:
**Scenario A — Dedicated Formatting of an Existing File**
The user provides an existing xlsx file and requests that financial modeling formatting standards be applied or unified. The starting point is to unpack the file, audit the existing `styles.xml`, then append missing styles and batch-update cell `s` attributes. No cell values or formulas are modified.
**Scenario B — Applying Format Standards After CREATE/EDIT**
After completing data entry or formula writing, formatting is applied as the final step. At this point, `styles.xml` may come from the minimal_xlsx template (which pre-defines 13 style slots) or from a user file. In either case, follow the principle of "append only, never modify existing xf entries."
**Not applicable**: Reading or analyzing file contents only (use the READ path); modifying formulas or data (use the EDIT path).
---
## 2. Financial Format Semantic System
### 2.1 Font Color = Cell Role (Color = Role)
The primary convention of financial modeling: **font color encodes the cell's role, not decoration**. A reviewer can glance at colors to determine which cells are adjustable parameters and which are model-calculated results. This is an industry-wide convention (followed by investment banks, the Big Four, and corporate finance teams).
| Role | Font Color | AARRGGBB | Use Case |
|------|-----------|----------|----------|
| Hard-coded input / assumption | Blue | `000000FF` | Growth rates, discount rates, tax rates, and other user-modifiable parameters |
| Formula / calculated result | Black | `00000000` | All cells containing a `<f>` element |
| Same-workbook cross-sheet reference | Green | `00008000` | Cells whose formula contains `SheetName!` |
| External file link | Red | `00FF0000` | Cells whose formula contains `[FileName.xlsx]` (flagged as fragile links) |
| Label / text | Black (default) | theme color | Row labels, category headings |
| Key assumption requiring review | Blue font + yellow fill | Font `000000FF` / Fill `00FFFF00` | Provisional values, parameters pending confirmation |
**Decision tree**:
```
Does the cell contain a <f> element?
+-- Yes -> Does the formula contain [FileName]?
| +-- Yes -> Red (external link)
| +-- No -> Does the formula contain SheetName!?
| +-- Yes -> Green (cross-sheet reference)
| +-- No -> Black (same-sheet formula)
+-- No -> Is the value a user-adjustable parameter?
+-- Yes -> Blue (input/assumption)
+-- No -> Black default (label)
```
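The decision tree can be sketched as a function that returns the AARRGGBB font color from the table above. It uses simplified tests: any `[` is treated as an external link, so structured table references would need extra handling, and whether a value is a user-adjustable parameter must be supplied by the caller. `font_color_for_cell` is a hypothetical name.

```python
def font_color_for_cell(formula, is_adjustable_input=False) -> str:
    """Return the AARRGGBB font color for a cell based on its role."""
    if formula:
        if "[" in formula:   # external workbook link
            return "00FF0000"  # red
        if "!" in formula:   # cross-sheet reference in the same workbook
            return "00008000"  # green
        return "00000000"      # black: same-sheet formula
    # No formula: blue for adjustable inputs, black for labels and other values.
    return "000000FF" if is_adjustable_input else "00000000"
```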
**Strictly prohibited**: Blue font + `<f>` element coexisting (color role contradiction — must be corrected).
### 2.2 Number Format Matrix
| Data Type | formatCode | numFmtId | Display Example | Applicable Scenario |
|-----------|-----------|----------|-----------------|---------------------|
| Standard currency (whole dollars) | `$#,##0;($#,##0);"-"` | 164 | $1,234 / ($1,234) / - | P&L, balance sheet amount rows |
| Standard currency (with cents) | `$#,##0.00;($#,##0.00);"-"` | 169 | $1,234.56 / ($1,234.56) / - | Unit prices, detailed costs |
| Thousands (K) | `#,##0,"K"` | 171 | 1,234K | Simplified display for management reports |
| Millions (M) | `#,##0,,"M"` | 172 | 1M | Macro-level summary rows |
| Percentage (1 decimal) | `0.0%` | 165 | 12.5% | Growth rates, gross margins |
| Percentage (2 decimals) | `0.00%` | 170 | 12.50% | IRR, precise interest rates |
| Multiple / valuation multiplier | `0.0x` | 166 | 8.5x | EV/EBITDA, P/E |
| Integer (thousands separator) | `#,##0` | 167 | 12,345 | Employee count, unit quantities |
| Year | `0` | 1 (built-in, no declaration needed) | 2024 | Column header years, prevents 2,024 |
| Date | `m/d/yyyy` | 14 (built-in, no declaration needed) | 3/21/2026 | Timelines |
| General text | General | 0 (built-in, no declaration needed) | — | Label rows, cells with no format requirement |
numFmtId 169–172 are custom formats that must be appended in addition to the four formats (164–167) pre-defined in the minimal_xlsx template. When appending, assign IDs according to the rules (see Section 3.4).
**Built-in format IDs do not need to be declared in `<numFmts>`** (IDs 0–163 are built into Excel/LibreOffice; simply reference the numFmtId in `<xf>`):
| numFmtId | formatCode | Description |
|----------|-----------|-------------|
| 0 | General | General format |
| 1 | `0` | Integer, no thousands separator (use this ID for years) |
| 3 | `#,##0` | Thousands-separated integer (no decimals) |
| 9 | `0%` | Percentage integer |
| 10 | `0.00%` | Percentage with two decimals |
| 14 | `m/d/yyyy` | Short date |
### 2.3 Negative Number Display Standards
Financial reports have two mainstream conventions for negative numbers — choose one and **maintain consistency** throughout the entire workbook:
**Parenthetical style (investment banking standard, recommended for external deliverables)**
```
Positive: $1,234 Negative: ($1,234) Zero: -
formatCode: $#,##0;($#,##0);"-"
```
**Red minus sign style (suitable for internal operational analysis reports)**
```
Positive: $1,234 Negative: -$1,234 (red)
formatCode: $#,##0;[Red]-$#,##0;"-"
```
Rule: Once a style is determined, maintain it across the entire workbook. Do not mix two negative number display styles within the same workbook.
### 2.4 Zero Value Display Standards
In financial models, "0" and "no data" have different semantics and should be visually distinct:
| Scenario | Recommended Display | formatCode Third Segment |
|----------|-------------------|--------------------------|
| Sparse matrix (most rows have zero-value periods) | Dash `-` | `"-"` |
| Quantity counts (zero itself is meaningful) | `0` | `0` or omit |
| Placeholder row (explicitly empty) | Leave blank | Do not write to cell |
Four-segment format syntax: `positive format;negative format;zero value format;text format`
Zero as dash: `$#,##0;($#,##0);"-"`
Zero preserved as 0: `#,##0;(#,##0);0`
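To make the segment semantics concrete, the sketch below mimics how `$#,##0;($#,##0);"-"` picks a segment by the sign of the value. It is an illustration of the three segments only, not a general format-code renderer, and `render_whole_dollars` is a hypothetical name.

```python
from decimal import Decimal, ROUND_HALF_UP


def render_whole_dollars(value) -> str:
    """Mimic the display of the format $#,##0;($#,##0);"-" for one number."""
    amount = Decimal(str(value))
    if amount == 0:
        return "-"  # third segment: zero shown as a dash
    # Round half away from zero, then add thousands separators.
    digits = f"{int(abs(amount).quantize(Decimal('1'), rounding=ROUND_HALF_UP)):,}"
    # First segment for positives, second (parentheses) for negatives.
    return f"${digits}" if amount > 0 else f"(${digits})"
```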
---
## 3. styles.xml Surgical Operations
### 3.1 Auditing Existing Styles: Understanding the cellXfs Indirect Reference Chain
A cell's `s` attribute points to a position index (0-based) in `cellXfs`, and each `<xf>` entry in `cellXfs` references its respective definition libraries through `fontId`, `fillId`, `borderId`, and `numFmtId`.
Reference chain diagram:
```
Cell <c s="6">
| Look up cellXfs by 0-based index
cellXfs[6] -> numFmtId="164" fontId="2" fillId="0" borderId="0"
| | | |
numFmts fonts[2] fills[0] borders[0]
id=164 color=00000000 (no fill) (no border)
$#,##0... black
```
Audit steps:
**Step 1**: Read `<numFmts>` and record all declared custom formats and their IDs:
```xml
<numFmts count="4">
<numFmt numFmtId="164" formatCode="$#,##0;($#,##0);&quot;-&quot;"/>
<numFmt numFmtId="165" formatCode="0.0%"/>
<numFmt numFmtId="166" formatCode="0.0x"/>
<numFmt numFmtId="167" formatCode="#,##0"/>
</numFmts>
```
Record: current maximum custom numFmtId = 167, next available ID = 168.
**Step 2**: Read `<fonts>` and list each `<font>` by 0-based index with its color and style:
```
fontId=0 -> No explicit color (theme default black)
fontId=1 -> color rgb="000000FF" (blue, input role)
fontId=2 -> color rgb="00000000" (black, formula role)
fontId=3 -> color rgb="00008000" (green, cross-sheet reference role)
fontId=4 -> <b/> + color rgb="00000000" (bold black, header)
```
**Step 3**: Read `<fills>` and confirm that fills[0] and fills[1] are spec-mandated reserved entries (never delete):
```
fillId=0 -> patternType="none" (spec-mandated)
fillId=1 -> patternType="gray125" (spec-mandated)
fillId=2 -> Yellow highlight (if present)
```
**Step 4**: Read `<cellXfs>` and list each `<xf>` entry by 0-based index with its combination:
```
index 0 -> numFmtId=0, fontId=0, fillId=0 -> Default style
index 1 -> numFmtId=0, fontId=1, fillId=0 -> Blue font general (input)
index 5 -> numFmtId=164, fontId=1, fillId=0 -> Blue font currency (currency input)
index 6 -> numFmtId=164, fontId=2, fillId=0 -> Black font currency (currency formula)
...
```
**Step 5**: Verify that all count attributes match the actual number of elements (count mismatches will cause Excel to refuse to open the file).
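
Step 5 can be automated. The following is a minimal sketch using only the standard library; the function name `audit_counts` is hypothetical and is not one of this skill's scripts. It only reads the XML and reports mismatches, it does not modify anything.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def audit_counts(styles_xml: str) -> dict[str, tuple[int, int]]:
    """Return {collection: (declared count, actual children)} for mismatches only."""
    root = ET.fromstring(styles_xml)
    mismatches = {}
    for name in ("numFmts", "fonts", "fills", "borders", "cellXfs"):
        node = root.find(f"m:{name}", NS)
        if node is None:
            continue  # e.g. <numFmts> is absent when no custom formats exist
        declared = int(node.get("count", "-1"))
        actual = len(list(node))  # child elements only
        if declared != actual:
            mismatches[name] = (declared, actual)
    return mismatches
```

An empty result means every checked `count` attribute agrees with the number of child elements.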
### 3.2 Safely Appending New Styles (Golden Rule: Append Only, Never Modify Existing xf)
**Never modify existing `<xf>` entries**. Modifications will affect all cells that already reference that index, breaking existing formatting. Only append new entries at the end.
Complete atomic operation sequence for appending new styles (all 5 steps must be executed):
**Step 1**: Determine if a new `<numFmt>` is needed
Built-in formats (IDs 0-163) skip this step. Custom formats are appended to the end of `<numFmts>`:
```xml
<numFmts count="5"> <!-- count +1 -->
<!-- Keep existing entries unchanged -->
<numFmt numFmtId="164" formatCode="$#,##0;($#,##0);&quot;-&quot;"/>
<numFmt numFmtId="165" formatCode="0.0%"/>
<numFmt numFmtId="166" formatCode="0.0x"/>
<numFmt numFmtId="167" formatCode="#,##0"/>
<!-- Newly appended -->
<numFmt numFmtId="168" formatCode="$#,##0.00;($#,##0.00);&quot;-&quot;"/>
</numFmts>
```
**Step 2**: Determine if a new `<font>` is needed
Check whether the existing fonts already contain a matching color+style combination. If not, append to the end of `<fonts>`:
```xml
<fonts count="6"> <!-- count +1 -->
<!-- Keep existing entries unchanged -->
...
<!-- Newly appended: red font (external link role), new fontId = 5 -->
<font>
<sz val="11"/>
<name val="Calibri"/>
<color rgb="00FF0000"/>
</font>
</fonts>
```
New fontId = the count value before appending (when original count=5, new fontId=5).
**Step 3**: Determine if a new `<fill>` is needed
If a new background color is needed, append to the end of `<fills>` (note: fills[0] and fills[1] must never be modified):
```xml
<fills count="4"> <!-- count +1 -->
<fill><patternFill patternType="none"/></fill> <!-- 0: spec-mandated -->
<fill><patternFill patternType="gray125"/></fill> <!-- 1: spec-mandated -->
<fill> <!-- 2: yellow highlight -->
<patternFill patternType="solid">
<fgColor rgb="00FFFF00"/>
<bgColor indexed="64"/>
</patternFill>
</fill>
<!-- Newly appended: light gray fill (projection period distinction), new fillId = 3 -->
<fill>
<patternFill patternType="solid">
<fgColor rgb="00D3D3D3"/>
<bgColor indexed="64"/>
</patternFill>
</fill>
</fills>
```
**Step 4**: Append a new `<xf>` combination at the end of `<cellXfs>`
```xml
<cellXfs count="14"> <!-- count +1 -->
<!-- Keep existing entries 0-12 unchanged -->
...
<!-- Newly appended index=13: currency with cents formula (black font + numFmtId=168) -->
<xf numFmtId="168" fontId="2" fillId="0" borderId="0" xfId="0"
applyFont="1" applyNumberFormat="1"/>
</cellXfs>
```
New style index = the count value before appending (when original count=13, new index=13).
**Step 5**: Record the new style index; subsequently set the `s` attribute of corresponding cells in the sheet XML to this value.
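
The append-and-count pattern in Step 4 can be sketched with `xml.etree.ElementTree`. This is an illustration under assumptions, not the skill's own tooling: `append_xf` is a hypothetical helper, and ElementTree may drop or rename extension namespace prefixes when a real `styles.xml` is serialized, so text-level edits may be safer for production files.

```python
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
ET.register_namespace("", MAIN_NS)  # keep the default namespace unprefixed on output


def append_xf(root: ET.Element, num_fmt_id: int, font_id: int, fill_id: int = 0) -> int:
    """Append one <xf> to <cellXfs>, bump count, and return the new style index."""
    cell_xfs = root.find(f"{{{MAIN_NS}}}cellXfs")
    new_index = len(list(cell_xfs))  # index = number of entries before appending
    ET.SubElement(cell_xfs, f"{{{MAIN_NS}}}xf", {
        "numFmtId": str(num_fmt_id), "fontId": str(font_id),
        "fillId": str(fill_id), "borderId": "0", "xfId": "0",
        "applyFont": "1", "applyNumberFormat": "1",
    })
    cell_xfs.set("count", str(new_index + 1))  # existing entries are left untouched
    return new_index
```

The returned index is the value to write into the `s` attribute of the cells that need the new style.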
### 3.3 AARRGGBB Color Format Explanation
OOXML's `rgb` attribute uses **8-digit hexadecimal AARRGGBB** format (not HTML's 6-digit RRGGBB):
```
AA     RR   GG     BB
|      |    |      |
Alpha  Red  Green  Blue
```
- Alpha channel: `00` = fully opaque (normal use value); `FF` = fully transparent (invisible, never use this)
- Financial color standards always use `00` as the Alpha prefix
| Color | AARRGGBB | Corresponding Role |
|-------|----------|-------------------|
| Blue (input) | `000000FF` | Hard-coded assumptions |
| Black (formula) | `00000000` | Calculated results |
| Green (cross-sheet reference) | `00008000` | Same-workbook cross-sheet |
| Red (external link) | `00FF0000` | References to other files |
| Yellow (review-required fill) | `00FFFF00` | Key assumption highlight |
| Light gray (projection period fill) | `00D3D3D3` | Distinguishing historical vs. forecast periods |
| White | `00FFFFFF` | Pure white fill |
**Common mistake**: Mistakenly writing HTML format `#0000FF` as `FF0000FF` (Alpha=FF makes the color fully transparent and invisible). Correct format: `000000FF`.
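
A small conversion helper avoids hand-typing the prefix. The sketch below follows this guide's convention of a `00` Alpha prefix; the function name `html_to_aarrggbb` is hypothetical.

```python
import re


def html_to_aarrggbb(html_color: str, alpha: str = "00") -> str:
    """Convert '#RRGGBB' (or 'RRGGBB') to the 8-digit AARRGGBB form used in this guide."""
    match = re.fullmatch(r"#?([0-9A-Fa-f]{6})", html_color.strip())
    if match is None:
        raise ValueError(f"expected 6 hex digits, got {html_color!r}")
    return (alpha + match.group(1)).upper()
```

For example, `html_to_aarrggbb("#0000FF")` yields `000000FF`, matching the blue input color in the table above.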
### 3.4 numFmtId Assignment Rules
```
ID 0-163 -> Excel/LibreOffice built-in formats, no declaration needed in <numFmts>, reference directly in <xf>
ID 164+ -> Custom formats, must be explicitly declared as <numFmt> elements in <numFmts>
```
Rules for assigning new IDs:
1. Read all `numFmtId` attribute values in the current `<numFmts>`
2. Take the maximum value + 1 as the next custom format ID
3. Do not reuse existing IDs; do not skip numbers
The minimal_xlsx template pre-defines IDs: 164, 165, 166, 167. The next available ID is 168.
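
The assignment rule can be expressed as a short function. This is a regex-based sketch that works on the raw text of `styles.xml`; `next_num_fmt_id` is a hypothetical name.

```python
import re


def next_num_fmt_id(styles_xml: str) -> int:
    """Return the highest declared custom numFmtId + 1, never below 164."""
    # \b after "numFmt" excludes the <numFmts> container; <xf numFmtId=...> is not matched
    ids = [int(m) for m in re.findall(r'<numFmt\b[^>]*\bnumFmtId="(\d+)"', styles_xml)]
    return max(ids + [163]) + 1
```

For the template's declared IDs 164 to 167 this returns 168; for a file with no `<numFmt>` entries it returns 164.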
---
## 4. Pre-defined Style Index Complete Reference Table (13 Slots)
The following are the 13 style slots (cellXfs index 0-12) pre-defined in the minimal_xlsx template's `styles.xml`, which can be directly referenced in the cell `s` attribute in sheet XML:
| Index | Semantic Role | Font Color | Fill | numFmtId | Format Display | Typical Use |
|-------|--------------|------------|------|----------|---------------|-------------|
| **0** | Default style | Theme black | None | 0 | General | Cells requiring no special formatting |
| **1** | Input / assumption (general) | Blue `000000FF` | None | 0 | General | Text-type assumptions, flags |
| **2** | Formula / calculated result (general) | Black `00000000` | None | 0 | General | Text concatenation formulas, non-numeric calculations |
| **3** | Cross-sheet reference (general) | Green `00008000` | None | 0 | General | Values pulled from cross-sheet (general format) |
| **4** | Header (bold) | Bold black | None | 0 | General | Row/column headings |
| **5** | Currency input | Blue `000000FF` | None | 164 | $1,234 / ($1,234) / - | Amount inputs in the assumptions area |
| **6** | Currency formula | Black `00000000` | None | 164 | $1,234 / ($1,234) / - | Amount calculations in the model area (revenue, EBITDA) |
| **7** | Percentage input | Blue `000000FF` | None | 165 | 12.5% | Rate inputs in the assumptions area (growth rate, gross margin assumptions) |
| **8** | Percentage formula | Black `00000000` | None | 165 | 12.5% | Rate calculations in the model area (actual gross margin) |
| **9** | Integer (comma) input | Blue `000000FF` | None | 167 | 12,345 | Quantity inputs in the assumptions area (employee count) |
| **10** | Integer (comma) formula | Black `00000000` | None | 167 | 12,345 | Quantity calculations in the model area |
| **11** | Year input | Blue `000000FF` | None | 1 | 2024 | Column header years (no thousands separator) |
| **12** | Key assumption highlight | Blue `000000FF` | Yellow `00FFFF00` | 0 | General | Key parameters pending review or confirmation |
**Selection guide**:
- Determine "input" vs. "formula" -> Within the paired slots (1/2, 5/6, 7/8, 9/10), the odd index is the input (blue) style and the even index is the formula (black) style; slots 3, 4, 11, and 12 are unpaired
- Determine data type -> Choose the corresponding currency (5/6) / percentage (7/8) / integer (9/10) / year (11) slot
- Cross-sheet reference needing number format -> Append a new green + number format combination (see Section 5.4)
- Parameter pending review -> index 12
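
The selection guide can be encoded as a lookup table. The sketch below copies the indexes from the table in this section; the role and data-type labels (`"input"`, `"cross_sheet"`, and so on) and the function name `pick_style` are hypothetical names chosen for illustration.

```python
# (role, data type) -> pre-defined cellXfs index, copied from the 13-slot table
STYLE_SLOTS = {
    ("input", "general"): 1,   ("formula", "general"): 2,
    ("input", "currency"): 5,  ("formula", "currency"): 6,
    ("input", "percent"): 7,   ("formula", "percent"): 8,
    ("input", "integer"): 9,   ("formula", "integer"): 10,
    ("input", "year"): 11,
    ("cross_sheet", "general"): 3,
    ("header", "general"): 4,
    ("review", "general"): 12,
}


def pick_style(role: str, data_type: str = "general"):
    """Return the pre-defined style index, or None when a new style must be appended."""
    return STYLE_SLOTS.get((role, data_type))
```

A `None` result, for example for `("cross_sheet", "currency")`, signals that no pre-defined slot exists and a new `<xf>` has to be appended.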
---
## 5. Assumption Separation Principle: XML-Level Implementation
### 5.1 Structural Design
Assumption separation principle: **Input assumptions are centralized in a dedicated area (sheet or block); the model calculation area contains only formulas, no hard-coded values**.
Recommended structure:
```
Workbook sheet layout
sheet 1 "Assumptions" -> All blue-font cells (style 1/5/7/9/11/12)
sheet 2 "Model" -> Black or green-font cells (style 2/3/4/6/8/10), plus year headers (style 11)
```
Same-sheet zoning approach for simple models:
```
Rows 1-5: [Assumptions block - blue font]
Row 6: [Empty row separator]
Rows 7+: [Model block - black/green font formulas referencing assumptions area]
```
### 5.2 Assumptions Area XML Example
```xml
<!-- Assumptions sheet (sheet1.xml) example -->
<!-- Row 1: Block title -->
<row r="1">
<c r="A1" s="4" t="inlineStr"><is><t>Model Assumptions</t></is></c>
</row>
<!-- Row 2: Growth rate assumption - blue font percentage input, s="7" -->
<row r="2">
<c r="A2" t="inlineStr"><is><t>Revenue Growth Rate</t></is></c>
<c r="B2" s="7"><v>0.08</v></c>
</row>
<!-- Row 3: Gross margin assumption - blue font percentage input, s="7" -->
<row r="3">
<c r="A3" t="inlineStr"><is><t>Gross Margin</t></is></c>
<c r="B3" s="7"><v>0.65</v></c>
</row>
<!-- Row 4: Base revenue - blue font currency input, s="5" -->
<row r="4">
<c r="A4" t="inlineStr"><is><t>Base Revenue (Year 0)</t></is></c>
<c r="B4" s="5"><v>1000000</v></c>
</row>
<!-- Row 5: Key assumption (pending review) - blue font yellow fill, s="12" -->
<row r="5">
<c r="A5" t="inlineStr"><is><t>Terminal Growth Rate</t></is></c>
<c r="B5" s="12"><v>0.03</v></c>
</row>
```
### 5.3 Model Area XML Example (Referencing Assumptions Area)
```xml
<!-- Model sheet (sheet2.xml) example -->
<!-- Row 1: Column headers (years) - bold header, s="4"; year cells, s="11" -->
<row r="1">
<c r="A1" s="4" t="inlineStr"><is><t>Metric</t></is></c>
<c r="B1" s="11"><v>2024</v></c>
<c r="C1" s="11"><v>2025</v></c>
<c r="D1" s="11"><v>2026</v></c>
</row>
<!-- Row 2: Revenue row -->
<row r="2">
<c r="A2" t="inlineStr"><is><t>Revenue</t></is></c>
<!-- B2: Base year revenue, cross-sheet reference from Assumptions, green, s="3" (general format) -->
<!-- If currency format is needed, append new style s="13" (see Section 5.4) -->
<c r="B2" s="3"><f>Assumptions!B4</f><v></v></c>
<!-- C2, D2: Next year revenue = prior year * (1 + growth rate), black font currency formula, s="6" -->
<c r="C2" s="6"><f>B2*(1+Assumptions!B2)</f><v></v></c>
<c r="D2" s="6"><f>C2*(1+Assumptions!B2)</f><v></v></c>
</row>
<!-- Row 3: Gross profit row - black font currency formula, s="6" -->
<row r="3">
<c r="A3" t="inlineStr"><is><t>Gross Profit</t></is></c>
<c r="B3" s="6"><f>B2*Assumptions!B3</f><v></v></c>
<c r="C3" s="6"><f>C2*Assumptions!B3</f><v></v></c>
<c r="D3" s="6"><f>D2*Assumptions!B3</f><v></v></c>
</row>
<!-- Row 4: Gross margin row - black font percentage formula, s="8" -->
<row r="4">
<c r="A4" t="inlineStr"><is><t>Gross Margin %</t></is></c>
<c r="B4" s="8"><f>B3/B2</f><v></v></c>
<c r="C4" s="8"><f>C3/C2</f><v></v></c>
<c r="D4" s="8"><f>D3/D2</f><v></v></c>
</row>
```
### 5.4 Appending "Green + Number Format" Combinations
Pre-defined index 3 is green font + general format. If a cross-sheet reference involves a currency amount, a green style with a number format must be appended:
```xml
<!-- Append at the end of <cellXfs> in styles.xml (assuming current count=13, new index=13) -->
<!-- index 13: cross-sheet reference + currency format (green font + $#,##0) -->
<xf numFmtId="164" fontId="3" fillId="0" borderId="0" xfId="0"
applyFont="1" applyNumberFormat="1"/>
<!-- Update count to 14 -->
```
After appending, cross-sheet reference currency cells use `s="13"`.
---
## 6. Complete Operational Workflow
### 6.1 Workflow Overview
```
[Existing xlsx or file after CREATE/EDIT]
|
Step 1: Unpack (extract to temporary directory)
|
Step 2: Audit styles.xml (review existing styles, build index mapping table)
|
Step 3: Audit sheet XML (identify cells needing formatting and their semantic roles)
|
Step 4: Append missing styles (numFmt -> font -> fill -> xf, update counts)
|
Step 5: Batch-update the s attribute of each cell in the sheet XML
|
Step 6: XML validity + style reference integrity verification
|
Step 7: Pack (recompress as xlsx)
```
### 6.2 Step 1 — Unpack
```bash
python3 SKILL_DIR/scripts/xlsx_unpack.py input.xlsx /tmp/xlsx_fmt/
```
If the script is unavailable, unpack manually:
```bash
mkdir -p /tmp/xlsx_fmt && cp input.xlsx /tmp/xlsx_fmt/input.xlsx
cd /tmp/xlsx_fmt && unzip input.xlsx -d unpacked/
```
### 6.3 Step 2 — Audit styles.xml
Execute according to the method in Section 3.1. Quick check for minimal_xlsx template initial state:
- `<cellXfs count="13">` and `<numFmts count="4">` -> Template initial state, all 13 pre-defined slots can be used directly
- Otherwise -> A complete review of the existing index mapping is required
### 6.4 Step 3 — Audit Sheet XML, Build Formatting Plan
Read `xl/worksheets/sheet*.xml` and evaluate each cell:
1. Does it contain a `<f>` element (formula)? -> Requires black/green/red style
2. Is it a hard-coded numeric parameter? -> Requires blue style
3. Is the data type currency/percentage/integer/year? -> Select the corresponding number format slot
4. Is it a header? -> Bold style (index 4)
Build a formatting mapping table: `{cell coordinate: target style index}`
### 6.5 Step 4 — Append Styles
Execute according to the atomic operation sequence in Section 3.2. Update the corresponding count attribute immediately after appending each component.
### 6.6 Step 5 — Batch-Update Cell s Attributes
```xml
<!-- Before formatting: no style -->
<c r="B5"><v>0.08</v></c>
<!-- After formatting: growth rate assumption, blue font percentage, s="7" -->
<c r="B5" s="7"><v>0.08</v></c>
```
```xml
<!-- Before formatting: formula without style -->
<c r="C10"><f>B10*(1+Assumptions!B2)</f><v></v></c>
<!-- After formatting: currency formula, black font, s="6" -->
<c r="C10" s="6"><f>B10*(1+Assumptions!B2)</f><v></v></c>
```
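Row-level default styles (`s` plus `customFormat="1"` on `<row>`) apply only to cells that have no `<c>` element in that row. A `<c>` written without an `s` attribute falls back to style 0 rather than inheriting the row style, so every written cell still carries its own `s`:
```xml
<!-- Row default s=6 covers empty cells; each written cell sets s explicitly -->
<row r="5" s="6" customFormat="1">
<c r="A5" s="0" t="inlineStr"><is><t>Operating Income</t></is></c> <!-- Text label: default style -->
<c r="B5" s="6"><f>B3-B4</f><v></v></c> <!-- Explicit s=6; not inherited from the row -->
<c r="C5" s="6"><f>C3-C4</f><v></v></c>
</row>
```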
### 6.7 Step 6 — Verification
```bash
# XML validity verification is handled automatically by xlsx_pack.py, no need to manually run xmllint
# The pack script validates styles.xml and sheet XML legality before packaging; it aborts and reports on errors
# Style audit (optional, audit the entire unpacked directory after formatting is complete)
python3 SKILL_DIR/scripts/style_audit.py /tmp/xlsx_fmt/unpacked/
# Formula error static scan (must specify a single .xlsx file, does not accept directories)
# Pack first, then scan:
python3 SKILL_DIR/scripts/xlsx_pack.py /tmp/xlsx_fmt/unpacked/ /tmp/output.xlsx
python3 SKILL_DIR/scripts/formula_check.py /tmp/output.xlsx
```
Manual style reference integrity check:
```bash
# Find the maximum s attribute value in the sheet XML
grep -o 's="[0-9]*"' /tmp/xlsx_fmt/unpacked/xl/worksheets/sheet1.xml \
| grep -o '[0-9]*' | sort -n | tail -1
# Compare with the cellXfs count attribute (max s value must be < count)
grep 'cellXfs count' /tmp/xlsx_fmt/unpacked/xl/styles.xml
```
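
The same integrity check can be sketched in Python for environments without `grep`. The helper names below are hypothetical; the regexes operate on raw XML text and assume attributes are written with double quotes, as in the examples in this guide.

```python
import re


def max_style_index(sheet_xml: str) -> int:
    """Largest s="N" found on <c> or <row> elements; -1 when none is set."""
    found = re.findall(r'<(?:c|row)\b[^>]*?\ss="(\d+)"', sheet_xml)
    return max((int(n) for n in found), default=-1)


def cell_xfs_count(styles_xml: str) -> int:
    """Declared count attribute of <cellXfs>."""
    return int(re.search(r'<cellXfs\b[^>]*\bcount="(\d+)"', styles_xml).group(1))


def references_in_range(sheet_xml: str, styles_xml: str) -> bool:
    """True when every style reference is a valid cellXfs index."""
    return max_style_index(sheet_xml) < cell_xfs_count(styles_xml)
```

Run it once per worksheet file; a `False` result means at least one `s` value points past the end of `cellXfs`.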
### 6.8 Step 7 — Pack
```bash
python3 SKILL_DIR/scripts/xlsx_pack.py /tmp/xlsx_fmt/unpacked/ output.xlsx
```
If the script is unavailable, pack manually:
```bash
cd /tmp/xlsx_fmt/unpacked/
zip -r ../output.xlsx . -x "*.DS_Store"
```
---
## 7. Formatting Completeness Checklist
Verify each item before delivery:
### Color Role Consistency
- [ ] All numeric cells containing `<f>` elements: fontId corresponds to black (formula) or green (cross-sheet reference)
- [ ] All hard-coded numeric values that are user-adjustable parameters: fontId corresponds to blue (input)
- [ ] Cross-sheet references (formula contains `SheetName!`): fontId corresponds to green
- [ ] External file references (formula contains `[FileName.xlsx]`): fontId corresponds to red
- [ ] No cell simultaneously contains a `<f>` element and uses blue font (color role contradiction)
### Number Format Correctness
- [ ] Year columns: numFmtId="1" (`0` format), displays as 2024 not 2,024
- [ ] Currency rows: numFmtId="164" or variant, negative numbers display as ($1,234) not -$1,234
- [ ] Percentage rows: values stored as decimals (0.08 = 8%), format numFmtId="165", displays as 8.0%
- [ ] Zero values: displayed as `-` in sparse matrices rather than `0` (formatCode third segment contains `"-"`)
- [ ] Valuation multiple rows (EV/EBITDA, etc.): numFmtId="166" (`0.0x` format)
- [ ] Negative number display style is consistent throughout the entire workbook (parenthetical or red minus sign)
### styles.xml Structural Integrity
- [ ] `<numFmts count>` = actual number of `<numFmt>` elements
- [ ] `<fonts count>` = actual number of `<font>` elements
- [ ] `<fills count>` = actual number of `<fill>` elements (including spec-mandated fills[0] and fills[1])
- [ ] `<cellXfs count>` = actual number of `<xf>` elements
- [ ] fills[0] is `patternType="none"`, fills[1] is `patternType="gray125"` (spec-mandated)
- [ ] All `<xf>` referenced fontId / fillId / borderId are within the valid range of their respective collections
- [ ] All cell `s` attribute values < `cellXfs count` (no out-of-bounds references)
### Assumption Separation Verification
- [ ] No black-font numeric cells in the assumptions area/sheet (black numeric = formula, should not be in assumptions)
- [ ] No blue-font non-year numeric cells in the model area/sheet (blue numeric = hard-coded, should be in assumptions)
- [ ] Input parameters in the model area reference the assumptions area via formulas, not by directly copying values
### Formula and Format Linkage
- [ ] All cells with `<f>` elements have an explicit `s` attribute (must not use default style=0, whose font color is not explicitly black)
- [ ] SUM summary rows: style uses black font + corresponding number format (e.g., s="6" for currency summaries)
- [ ] Percentage formulas: values stored as decimals, format is `0.0%`; do not multiply values by 100 before applying percentage format
### Visual Hierarchy
- [ ] Header rows (years/metric names): style=4 (bold black)
- [ ] Summary rows (Total/EBITDA/Net Income): bold + corresponding number format (append style if needed)
- [ ] Unit description rows (e.g., "$ thousands"): use style=0 or style=2 (blue not needed)
---
## 8. Prohibited Actions (What You Must NOT Do)
- **Do not modify existing `<xf>` entries**: This will batch-change the style of all cells referencing that index
- **Do not delete fills[0] and fills[1]**: Required by OOXML specification; deletion causes file corruption
- **Do not modify cell values or formulas**: The FORMAT path only changes styles, not content
- **Do not use openpyxl for formatting**: openpyxl rewrites the entire styles.xml on save, losing unsupported features
- **Do not apply global override styles**: Do not cover the entire workbook with a single style; assign precisely by semantic role
- **Do not write FF in the Alpha channel**: `rgb="FF0000FF"` makes the color fully transparent; the correct format is `rgb="000000FF"`
---
## 9. Common Errors and Fixes
### Error 1: Year displays as 2,024
Cause: The year cell's `s` attribute uses a format with thousands separator (e.g., numFmtId="3" or numFmtId="167").
```xml
<!-- Incorrect -->
<c r="B1" s="9"><v>2024</v></c>
<!-- Fix: Change to s="11" (numFmtId="1", format 0) -->
<c r="B1" s="11"><v>2024</v></c>
```
### Error 2: Percentage displays as 800% (value was multiplied by 100)
Cause: 8% was stored as `<v>8</v>` instead of `<v>0.08</v>`. Excel's `%` format automatically multiplies the value by 100 for display.
```xml
<!-- Incorrect -->
<c r="B2" s="7"><v>8</v></c>
<!-- Fix: Value must be stored in decimal form -->
<c r="B2" s="7"><v>0.08</v></c>
```
### Error 3: File corruption after appending styles without updating count
Cause: A `<font>` or `<xf>` element was appended but the count attribute was not updated; Excel reads beyond bounds using the old count.
Fix: Update the corresponding count immediately after appending each element:
```xml
<!-- After appending the 6th font, count must be changed from 5 to 6 -->
<fonts count="6">
...
</fonts>
```
### Error 4: Blue font + formula (color role contradiction)
Cause: A formula cell mistakenly uses an input style (e.g., s="5" for currency input).
```xml
<!-- Incorrect: Formula cell uses blue input style -->
<c r="C5" s="5"><f>B5*1.08</f><v></v></c>
<!-- Fix: Change formula cell to corresponding black formula style (5->6, 7->8, 9->10) -->
<c r="C5" s="6"><f>B5*1.08</f><v></v></c>
```
### Error 5: AARRGGBB color missing Alpha (only 6 digits)
```xml
<!-- Incorrect: 6-digit format, behavior depends on implementation, usually causes wrong color -->
<color rgb="0000FF"/>
<!-- Fix: Always use 8-digit AARRGGBB, Alpha fixed at 00 -->
<color rgb="000000FF"/>
```
### Error 6: Modifying existing xf (affects all cells referencing that index)
Cause: Directly modifying attributes of the Nth `<xf>` in cellXfs, causing all cells with `s="N"` to be batch-changed.
Fix: Keep existing entries unchanged, append a new entry at the end, and only change the `s` attribute of cells that need the new style to the new index:
```xml
<!-- Incorrect: Modified the existing xf at index=6 -->
<xf numFmtId="164" fontId="2" fillId="0" borderId="0" xfId="0"
applyFont="1" applyNumberFormat="1" applyAlignment="1">
<alignment horizontal="right"/> <!-- New attribute added, affects ALL cells already using s="6" -->
</xf>
<!-- Fix: Append new index (when original count=13, new index=13), only change the s attribute of cells needing right alignment -->
<!-- Keep index=6 as-is -->
<xf numFmtId="164" fontId="2" fillId="0" borderId="0" xfId="0"
applyFont="1" applyNumberFormat="1" applyAlignment="1">
<alignment horizontal="right"/>
</xf> <!-- New index=13 -->
```
---
## 10. Financial Model Structure Conventions
### 10.1 Header Rows
- Bold font (corresponds to style index 4 in this skill's template)
- Year columns: use number format `0` (numFmtId="1", no thousands separator) to prevent 2024 from displaying as 2,024
- A unit description row may be added below headers: gray or italic text, e.g., "$ thousands" or "% of Revenue"
### 10.2 Row Type Standards
| Row Type | Style Recommendation | Example |
|----------|---------------------|---------|
| Category heading row | Bold, optionally with fill color | "Revenue" |
| Line item row | Normal style | "Product A", "Product B" |
| Subtotal row | Bold + top border | "Total Revenue" |
| Operating metric row | Normal style | "Gross Margin %" |
| Separator row | Empty row | (empty) |
### 10.3 Multi-Year Model Column Layout
```
Col A: Label column (width 28, left-aligned text, s="4" for headers or s="0" for labels)
Col B: FY2022 Actual (width 12, year header s="11", data cells styled by semantic role)
Col C: FY2023 Actual
Col D: FY2024E (forecast period - can use light gray fill fillId=3 to differentiate)
Col E: FY2025E
Col F: FY2026E
```
### 10.4 Cross-Sheet Reference Patterns
Complete XML example of parameters passing from assumptions sheet to model sheet:
```xml
<!-- Assumptions sheet, cell B5: 8% growth rate, blue percentage input -->
<c r="B5" s="7"><v>0.08</v></c>
<!-- Model sheet, cell C10: references assumption area growth rate, green percentage formula -->
<!-- Requires appending index=13: green + percentage format (fontId=3, numFmtId=165) -->
<c r="C10" s="13"><f>Assumptions!B5</f><v></v></c>
```
---
## 11. Assumption Categories
In the assumptions area (Assumptions sheet or assumptions block), organize assumptions in the following standard order for ease of review and maintenance:
1. **Revenue assumptions**: Growth rates, pricing, sales volume
2. **Cost assumptions**: Gross margin, fixed/variable cost ratios
3. **Working capital**: DSO (Days Sales Outstanding), DPO (Days Payable Outstanding), inventory days
4. **Capital expenditures (CapEx)**: As a percentage of revenue or absolute amounts
5. **Financing assumptions**: Interest rates, debt repayment schedules
6. **Tax and other**: Effective tax rate, depreciation & amortization (D&A)
---
## 12. Audit Trail Best Practices
- Use `s="12"` (blue font + yellow fill highlight) to mark cells requiring review or pending changes, making them immediately visible to reviewers
- In sensitivity analysis rows or a separate Sensitivity tab, show the impact of +/-1% changes in key assumptions on results
- **Do not hide rows containing assumptions**: Assumption rows must be visible to reviewers; do not use the `hidden="1"` attribute
- Note a "Last Updated" date at the top of the assumptions area or in a dedicated cell, recording the last modification time of the model
---
## 13. Pre-Delivery Checklist (Common Financial Model Checklist)
Before outputting the final file, confirm each item:
- [ ] Formula rows contain no hard-coded values (can use `formula_check.py` to scan the packaged `.xlsx` file)
- [ ] Year columns display as 2024 not 2,024 (numFmtId="1", format `0`)
- [ ] Negative numbers display as (1,234) not -1,234 (use parenthetical style for externally delivered financial reports)
- [ ] Zero values display as `-` in sparse rows rather than `0` (formatCode third segment is `"-"`)
- [ ] Growth rates and percentages are stored as decimals (0.08 = 8%), format is `0.0%`
- [ ] All cross-sheet reference cells use green font (style index 3 or an appended green + number format combination)
- [ ] Assumptions block and model block are clearly separated (different sheets or separated by empty rows within the same sheet)
- [ ] Summary rows use `SUM()` formulas, not manually hard-coded totals
- [ ] Balance verification: summary rows = sum of their respective line items (a check row can be added at the end of the model to verify)


@@ -0,0 +1,231 @@
# OOXML SpreadsheetML Cheat Sheet
Quick reference for XML manipulation of xlsx files.
---
## Package Structure
```
my_file.xlsx (ZIP archive)
├── [Content_Types].xml          ← declares MIME types for all files
├── _rels/
│   └── .rels                    ← root relationship: points to xl/workbook.xml
└── xl/
    ├── workbook.xml             ← sheet list, calc settings
    ├── styles.xml               ← ALL style definitions
    ├── sharedStrings.xml        ← ALL text strings (referenced by index)
    ├── _rels/
    │   └── workbook.xml.rels    ← maps r:id → worksheet/styles/sharedStrings files
    ├── worksheets/
    │   ├── sheet1.xml           ← Sheet 1 data
    │   ├── sheet2.xml           ← Sheet 2 data
    │   └── ...
    ├── charts/                  ← chart XML (if any)
    ├── pivotTables/             ← pivot table XML (if any)
    └── theme/
        └── theme1.xml           ← color/font theme
```
---
## Cell Reference Format
```
A1 → column A (1), row 1
B5 → column B (2), row 5
AA1 → column 27, row 1
```
Column letter ↔ number conversion:
```python
def col_letter(n):  # 1-based → letter
    r = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        r = chr(65 + rem) + r
    return r

def col_number(s):  # letter → 1-based
    n = 0
    for c in s.upper():
        n = n * 26 + (ord(c) - 64)
    return n
```
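
A full cell reference such as `AA10` usually needs to be split into column and row before these conversions are useful. The following sketch is one way to do it; `split_cell_ref` is a hypothetical helper that also tolerates `$` anchors.

```python
import re


def split_cell_ref(ref: str) -> tuple[int, int]:
    """Split an A1-style reference into (1-based column number, row number)."""
    match = re.fullmatch(r"\$?([A-Za-z]{1,3})\$?(\d+)", ref)
    if match is None:
        raise ValueError(f"not an A1-style reference: {ref!r}")
    letters, row = match.group(1).upper(), int(match.group(2))
    col = 0
    for ch in letters:  # same base-26 accumulation as col_number
        col = col * 26 + (ord(ch) - 64)
    return col, row
```

Ranges such as `D2:D100` are not handled here; split on `:` first and pass each end separately.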
---
## Cell XML Reference
### Data Types
| Type | `t` attr | XML Example | Value |
|------|---------|-------------|-------|
| Number | omit | `<c r="B2"><v>1000</v></c>` | 1000 |
| String (shared) | `s` | `<c r="A1" t="s"><v>0</v></c>` | sharedStrings[0] |
| String (inline) | `inlineStr` | `<c r="A1" t="inlineStr"><is><t>Hi</t></is></c>` | "Hi" |
| Boolean | `b` | `<c r="D1" t="b"><v>1</v></c>` | TRUE |
| Error | `e` | `<c r="E1" t="e"><v>#REF!</v></c>` | #REF! |
| Formula | omit | `<c r="B4"><f>SUM(B2:B3)</f><v></v></c>` | computed |
### Formula Types
```xml
<!-- Basic formula (no leading = in XML!) -->
<c r="B4"><f>SUM(B2:B3)</f><v></v></c>
<!-- Cross-sheet -->
<c r="C1"><f>Assumptions!B5</f><v></v></c>
<c r="C1"><f>'Sheet With Spaces'!B5</f><v></v></c>
<!-- Shared formula: D2:D100 all use B*C with relative row offset -->
<c r="D2"><f t="shared" ref="D2:D100" si="0">B2*C2</f><v></v></c>
<c r="D3"><f t="shared" si="0"/><v></v></c>
<!-- Array formula -->
<c r="E1"><f t="array" ref="E1:E5">SORT(A1:A5)</f><v></v></c>
```
---
## styles.xml Reference
### Indirect Reference Chain
```
Cell s="3"
cellXfs[3] → fontId="2",   fillId="0",   borderId="0",   numFmtId="165"
                 ↓              ↓              ↓                ↓
             fonts[2]       fills[0]      borders[0]      numFmts: id=165
             blue color     no fill       no border       "0.0%"
```
### Adding a New Style (step-by-step)
1. In `<numFmts>`: add `<numFmt numFmtId="168" formatCode="0.00%"/>`, update `count`
2. In `<fonts>`: add font entry, note its index
3. In `<cellXfs>`: append `<xf numFmtId="168" fontId="N" .../>`, update `count`
4. New style index = old `cellXfs count` value (before incrementing)
5. Apply to cells: `<c r="B5" s="NEW_INDEX">...</c>`
### Color Format
`AARRGGBB` — Alpha (always `00` for opaque) + Red + Green + Blue
```
000000FF → Blue
00000000 → Black
00008000 → Green (dark)
00FF0000 → Red
00FFFF00 → Yellow (for fills)
00FFFFFF → White
```
### Built-in numFmtIds (no declaration needed)
| ID | Format | Display |
|----|--------|---------|
| 0 | General | as-is |
| 1 | 0 | 2024 (use for years!) |
| 2 | 0.00 | 1000.00 |
| 3 | #,##0 | 1,000 |
| 4 | #,##0.00 | 1,000.00 |
| 9 | 0% | 15% |
| 10 | 0.00% | 15.25% |
| 14 | m/d/yyyy | 3/21/2026 |
---
## sharedStrings.xml Reference
```xml
<sst count="3" uniqueCount="3">
<si><t>Revenue</t></si> <!-- index 0 -->
<si><t>Cost</t></si> <!-- index 1 -->
<si><t>Margin</t></si> <!-- index 2 -->
</sst>
```
Text with leading/trailing spaces:
```xml
<si><t xml:space="preserve"> indented </t></si>
```
Special characters:
```xml
<si><t>R&amp;D Expenses</t></si> <!-- & must be &amp; -->
```
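
To resolve a `t="s"` cell, its `<v>` value is used as an index into the list of `<si>` entries. The sketch below builds that list with the standard library; `load_shared_strings` is a hypothetical name, and the sketch joins every `<t>` under an `<si>`, so phonetic runs are not filtered out.

```python
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def load_shared_strings(sst_xml: str) -> list[str]:
    """Return the text of each <si>, joining rich-text runs (<r><t>) when present."""
    root = ET.fromstring(sst_xml)
    return [
        "".join(t.text or "" for t in si.iter(f"{{{MAIN}}}t"))
        for si in root.findall(f"{{{MAIN}}}si")
    ]
```

A cell `<c r="A1" t="s"><v>0</v></c>` then displays `load_shared_strings(...)[0]`.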
---
## workbook.xml / .rels Sync
Every `<sheet>` in workbook.xml needs a matching `<Relationship>` in workbook.xml.rels:
```xml
<!-- workbook.xml -->
<!-- NOTE: rId numbering depends on what rIds are already in workbook.xml.rels.
The minimal template reserves rId1=sheet1, rId2=styles, rId3=sharedStrings.
When ADDING sheets to the template, start from rId4 to avoid conflicts.
The rId3 here is just a generic illustration — use the next available rId. -->
<sheet name="Summary" sheetId="3" r:id="rId3"/>
<!-- workbook.xml.rels -->
<Relationship Id="rId3"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet"
Target="worksheets/sheet3.xml"/>
```
And a matching `<Override>` in `[Content_Types].xml`:
```xml
<Override PartName="/xl/worksheets/sheet3.xml"
ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
```
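
The workbook-to-rels half of this sync can be checked mechanically. The following is a sketch with a hypothetical function name, `unmatched_sheets`; it does not check the `[Content_Types].xml` overrides.

```python
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def unmatched_sheets(workbook_xml: str, rels_xml: str) -> list[str]:
    """Names of <sheet> entries whose r:id has no <Relationship> in workbook.xml.rels."""
    rel_ids = {
        rel.get("Id")
        for rel in ET.fromstring(rels_xml).iter(f"{{{PKG_REL}}}Relationship")
    }
    sheets = ET.fromstring(workbook_xml).iter(f"{{{MAIN}}}sheet")
    return [s.get("name") for s in sheets if s.get(f"{{{DOC_REL}}}id") not in rel_ids]
```

An empty list means every sheet declared in `workbook.xml` has a relationship entry.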
---
## Column / Row Dimensions
```xml
<!-- Before <sheetData> -->
<cols>
<col min="1" max="1" width="28" customWidth="1"/> <!-- A: 28 chars -->
<col min="2" max="6" width="14" customWidth="1"/> <!-- B-F: 14 chars -->
</cols>
<!-- Row height on individual rows -->
<row r="1" ht="20" customHeight="1">
...
</row>
```
---
## Freeze Panes
Inside `<sheetView>`:
```xml
<!-- Freeze row 1 (header row stays visible) -->
<pane ySplit="1" topLeftCell="A2" activePane="bottomLeft" state="frozen"/>
<!-- Freeze column A -->
<pane xSplit="1" topLeftCell="B1" activePane="topRight" state="frozen"/>
<!-- Freeze both row 1 and column A -->
<pane xSplit="1" ySplit="1" topLeftCell="B2" activePane="bottomRight" state="frozen"/>
```
---
## 7 Excel Error Types (All Must Be Absent at Delivery)
| Error | Meaning | Detect in XML |
|-------|---------|---------------|
| `#REF!` | Invalid cell reference | `<c t="e"><v>#REF!</v></c>` |
| `#DIV/0!` | Divide by zero | `<c t="e"><v>#DIV/0!</v></c>` |
| `#VALUE!` | Wrong data type | `<c t="e"><v>#VALUE!</v></c>` |
| `#NAME?` | Unknown function/name | `<c t="e"><v>#NAME?</v></c>` |
| `#NULL!` | Empty intersection | `<c t="e"><v>#NULL!</v></c>` |
| `#NUM!` | Number out of range | `<c t="e"><v>#NUM!</v></c>` |
| `#N/A` | Value not found | `<c t="e"><v>#N/A</v></c>` |
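
The "Detect in XML" column translates into a short scan for `t="e"` cells. This sketch uses a hypothetical function name, `find_error_cells`, and only sees cached error values: a formula cell with an empty `<v>` will not be reported until the workbook has been recalculated.

```python
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def find_error_cells(sheet_xml: str) -> list[tuple[str, str]]:
    """Return (cell reference, error code) for every <c t="e"> in a worksheet."""
    hits = []
    for cell in ET.fromstring(sheet_xml).iter(f"{{{MAIN}}}c"):
        if cell.get("t") == "e":
            value = cell.find(f"{{{MAIN}}}v")
            hits.append((cell.get("r"), value.text if value is not None else ""))
    return hits
```

An empty list for every worksheet is the delivery condition stated in this section's heading.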

@@ -0,0 +1,97 @@
# Data Reading & Analysis Guide
> Reference for the READ path. Use `xlsx_reader.py` for structure discovery and data quality auditing,
> then pandas for custom analysis. **Never modify the source file.**
---
## When to Use This Path
The user asks to read, analyze, view, summarize, extract, or answer questions about an Excel/CSV file's contents,
without requiring file modification. If modification is needed, hand off to `edit.md`.
---
## Workflow
### Step 1 — Structure Discovery
Run `xlsx_reader.py` first. It handles format detection, encoding fallback, structure exploration, and data quality audit:
```bash
python3 SKILL_DIR/scripts/xlsx_reader.py input.xlsx # full report
python3 SKILL_DIR/scripts/xlsx_reader.py input.xlsx --sheet Sales # single sheet
python3 SKILL_DIR/scripts/xlsx_reader.py input.xlsx --quality # quality audit only
python3 SKILL_DIR/scripts/xlsx_reader.py input.xlsx --json # machine-readable
```
Supported formats: `.xlsx`, `.xlsm`, `.csv`, `.tsv`. The script tries multiple encodings for CSV (utf-8-sig, gbk, utf-8, latin-1).
### Step 2 — Custom Analysis with pandas
Load data and perform the analysis the user requests:
```python
import pandas as pd
sheets = pd.read_excel("input.xlsx", sheet_name=None)  # dict: sheet name -> DataFrame
df = next(iter(sheets.values()))  # first sheet as a DataFrame
# For CSV: pd.read_csv("input.csv")
```
**Header handling** (when the default `header=0` doesn't work):
| Situation | Code |
|-----------|------|
| Header on row 3 | `pd.read_excel(path, header=2)` |
| Multi-level merged header | `pd.read_excel(path, header=[0, 1])` |
| No header | `pd.read_excel(path, header=None)` |
**Analysis quick reference:**
| Scenario | Pattern |
|----------|---------|
| Descriptive stats | `df.describe()` or `df['Col'].agg(['sum', 'mean', 'min', 'max'])` |
| Group aggregation | `df.groupby('Region')['Revenue'].agg(Total='sum', Avg='mean')` |
| Top N | `df.groupby('Region')['Revenue'].sum().sort_values(ascending=False).head(5)` |
| Pivot table | `df.pivot_table(values='Revenue', index='Region', columns='Quarter', aggfunc='sum', margins=True)` |
| Time series | `df.set_index(pd.to_datetime(df['Date'])).resample('ME')['Revenue'].sum()` |
| Cross-sheet merge | `pd.merge(sales, customers, on='CustomerID', how='left', validate='m:1')` |
| Stack sheets | `pd.concat([df.assign(Source=name) for name, df in sheets.items()], ignore_index=True)` |
| Large files (>50MB) | `pd.read_excel(path, usecols=['Date', 'Revenue'])` or `pd.read_csv(path, chunksize=10000)` |
### Step 3 — Output
If the user specifies an output file path, write results to it (highest priority). Format the report as:
```
## Analysis Report: {filename}
### File Overview — format, sheets, row counts
### Data Quality — nulls, duplicates, mixed types (or "no issues")
### Key Findings — direct answer to the user's question
### Additional Notes — formula NaN, encoding issues, caveats
```
**Numeric display**: monetary `1,234,567.89`, percentage `12.3%`, multiples `8.5x`, counts as integers.
---
## Common Pitfalls
| Pitfall | Cause | Fix |
|---------|-------|-----|
| Formula cells read as NaN | `<v>` cache empty in freshly generated files | Inform user; suggest opening in Excel and re-saving; or use `libreoffice_recalc.py` |
| CSV encoding errors | Chinese Windows exports use GBK | `xlsx_reader.py` auto-tries multiple encodings; manually specify if all fail |
| Mixed types in column | Column has both numbers and text (e.g., "N/A") | `pd.to_numeric(df['Col'], errors='coerce')` — report unconvertible rows |
| Year shows as 2,024 | Thousands separator format applied to year | `df['Year'].astype(int).astype(str)` |
| Multi-level headers | Two-row header merged | `pd.read_excel(path, header=[0, 1])`, then flatten with `' - '.join()` |
| Row number mismatch | pandas 0-indexed vs Excel 1-indexed | `excel_row = pandas_index + 2` (+1 for 1-index, +1 for header) |
**Critical**: Never open with `data_only=True` then `save()` — this permanently destroys all formulas.
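The row-number pitfall in the table can be captured in a small helper. This is a sketch under stated assumptions: the DataFrame still has its default positional index, no rows were skipped or filtered, and `header` is the same 0-based value passed to `pd.read_excel`. The function name `excel_row` is illustrative.

```python
def excel_row(pandas_index: int, header: int = 0) -> int:
    """Map a 0-based pandas row position to the 1-based Excel row number.

    `header` is the 0-based header row given to pd.read_excel(header=...).
    """
    if pandas_index < 0 or header < 0:
        raise ValueError("pandas_index and header must be non-negative")
    # +1 converts 0-based to 1-based; header + 1 rows sit above the first data row.
    return pandas_index + header + 2
```

With the default `header=0` this reduces to `pandas_index + 2`, matching the table above.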
---
## Prohibitions
- Never modify the source file (no `save()`, no XML edits)
- Never report formula NaN as "data is zero" — explain it's a formula cache issue
- Never report pandas indices as Excel row numbers
- Never make speculative conclusions unsupported by the data

@@ -0,0 +1,772 @@
# Formula Validation & Recalculation Guide
Ensure every formula in an xlsx file is provably correct before delivery. A file that opens without visible errors is not a passing file — only a file that has cleared both validation tiers is a passing file.
---
## Foundational Rules
- **Never declare PASS without running `formula_check.py` first.** Visual inspection of a spreadsheet is not validation.
- **Tier 1 (static) is mandatory in every scenario.** Tier 2 (dynamic) is mandatory when LibreOffice is available. If it is unavailable, you must state this explicitly in the report — you may not silently skip it.
- **Never use openpyxl with `data_only=True` to check formula values.** Opening and saving a workbook in `data_only=True` mode permanently replaces all formulas with their last cached values. Formulas cannot be recovered afterward.
- **Auto-fix only deterministic errors.** Any fix that requires understanding business logic must be flagged for human review.
---
## Two-Tier Validation Architecture
```
Tier 1 — Static Validation (XML scan, no external tools)
├── Detect: all 7 Excel error types already cached in <v> elements
├── Detect: cross-sheet references pointing to nonexistent sheets
├── Detect: formula cells with t="e" attribute (error type marker)
└── Tool: formula_check.py + manual XML inspection
▼ (if LibreOffice is present)
Tier 2 — Dynamic Validation (LibreOffice headless recalculation)
├── Executes all formulas via the LibreOffice Calc engine
├── Populates <v> cache values with real computed results
├── Exposes runtime errors invisible before recalculation
└── Follow-up: re-run Tier 1 on the recalculated file
```
**Why two tiers?**
openpyxl and most other Python xlsx writers store formula strings (e.g. `=SUM(B2:B9)`) in `<f>` elements but do not evaluate them. A freshly generated file therefore has empty `<v>` cache elements for every formula cell. This means:
- Tier 1 can only catch errors that are already encoded in the XML — either as `t="e"` cells or as structurally broken cross-sheet references.
- Tier 2 uses LibreOffice as the actual calculation engine, runs every formula, fills `<v>` with real results, and surfaces runtime errors (`#DIV/0!`, `#N/A`, etc.) that can only appear after computation.
Neither tier alone is sufficient. Together they cover both the errors already present in the XML and the errors that only appear at calculation time.
---
## Tier 1 — Static Validation
Static validation requires no external tools. It works directly on the ZIP/XML structure of the xlsx file.
### Step 1: Run formula_check.py
**Standard (human-readable) output:**
```bash
python3 SKILL_DIR/scripts/formula_check.py /path/to/file.xlsx
```
**JSON output (for programmatic processing):**
```bash
python3 SKILL_DIR/scripts/formula_check.py /path/to/file.xlsx --json
```
**Single-sheet mode (faster for targeted checks):**
```bash
python3 SKILL_DIR/scripts/formula_check.py /path/to/file.xlsx --sheet Summary
```
**Summary mode (counts only, no per-cell detail):**
```bash
python3 SKILL_DIR/scripts/formula_check.py /path/to/file.xlsx --summary
```
Exit codes:
- `0` — no hard errors (PASS or PASS with heuristic warnings)
- `1` — hard errors detected, or file cannot be opened (FAIL)
#### What formula_check.py examines
The script opens the xlsx as a ZIP archive without using any Excel library. It reads `xl/workbook.xml` to enumerate sheet names and named ranges, reads `xl/_rels/workbook.xml.rels` to map each sheet to its XML file, then iterates every `<c>` element in every worksheet.
It performs five checks:
1. **Error-value detection**: If the cell has `t="e"`, its `<v>` element contains an Excel error string. The cell is recorded with its sheet name, cell reference (e.g. `C5`), the error value, and the formula text if present.
2. **Broken cross-sheet reference detection**: If the cell has an `<f>` element, the script extracts all sheet names referenced in the formula (both `SheetName!` and `'Sheet Name'!` syntax). Each name is compared against the list of sheets in `workbook.xml`. A mismatch is a broken reference.
3. **Unknown named-range detection (heuristic)**: Identifiers in formulas that are not function names, not cell references, and not found in `workbook.xml`'s `<definedNames>` are flagged as `unknown_name_ref` warnings. This is a heuristic — false positives are possible; always verify manually.
4. **Shared formula integrity**: Shared formula consumer cells (those with only `<f t="shared" si="N"/>`) are skipped for formula counting and cross-ref checks because they inherit the primary cell's formula. Only the primary cell (with `ref="..."` attribute and formula text) is checked and counted.
5. **Malformed error cells**: Cells with `t="e"` but no `<v>` child element are flagged as structural XML issues.
Hard errors (exit code 1): `error_value`, `broken_sheet_ref`, `malformed_error_cell`, `file_error`
Soft warnings (exit code 0): `unknown_name_ref` — must be verified manually but do not block delivery alone
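The broken cross-sheet reference check (check 2) can be sketched as follows. This is an illustration of the idea, not the script's actual implementation. It assumes formulas contain no string literals with `!` and no external-workbook references, and the names `referenced_sheets` and `missing_sheets` are hypothetical.

```python
import re

# Matches 'Sheet Name'! (quoted) or SheetName! (bare identifier).
_SHEET_REF = re.compile(r"'((?:[^']|'')+)'!|([^\W\d][\w.]*)!")


def referenced_sheets(formula: str) -> set[str]:
    """Return every sheet name referenced in a formula string."""
    names = set()
    for quoted, bare in _SHEET_REF.findall(formula):
        # Inside quotes, a literal apostrophe is written as two apostrophes.
        names.add(quoted.replace("''", "'") if quoted else bare)
    return names


def missing_sheets(formula: str, valid_sheets: list[str]) -> set[str]:
    """Return referenced sheet names that are absent from the workbook (exact match)."""
    return referenced_sheets(formula) - set(valid_sheets)
```

The comparison is an exact string match, so a name that differs only in case is reported as missing.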
#### Reading formula_check.py human-readable output
A clean file looks like this:
```
File : /tmp/budget_2024.xlsx
Sheets : Summary, Q1, Q2, Q3, Q4, Assumptions
Formulas checked : 312 distinct formula cells
Shared formula ranges : 4 ranges
Errors found : 0
PASS — No formula errors detected
```
A file with errors looks like this:
```
File : /tmp/budget_2024.xlsx
Sheets : Summary, Q1, Q2, Q3, Q4, Assumptions
Formulas checked : 312 distinct formula cells
Shared formula ranges : 4 ranges
Errors found : 4
── Error Details ──
[FAIL] [Summary!C12] contains #REF! (formula: Q1!A0/Q1!A1)
[FAIL] [Summary!D15] references missing sheet 'Q5'
Formula: Q5!D15
Valid sheets: ['Assumptions', 'Q1', 'Q2', 'Q3', 'Q4', 'Summary']
[FAIL] [Q1!F8] contains #DIV/0!
[WARN] [Q2!B10] uses unknown name 'GrowthAssumptions' (heuristic — verify manually)
Formula: SUM(GrowthAssumptions)
Defined names: ['RevenueRange', 'CostRange']
FAIL — 3 error(s) must be fixed before delivery
WARN — 1 heuristic warning(s) require manual review
```
Interpretation of each line:
- `[FAIL] [Summary!C12] contains #REF! (formula: Q1!A0/Q1!A1)` — The cell has `t="e"` and `<v>#REF!</v>`. The formula references row 0, which does not exist in Excel's 1-based system. This is an off-by-one error in a generated reference.
- `[FAIL] [Summary!D15] references missing sheet 'Q5'` — The formula contains `Q5!D15`, but no sheet named `Q5` exists in the workbook. The valid sheet list is provided for comparison.
- `[FAIL] [Q1!F8] contains #DIV/0!` — This cell's `<v>` is already an error value (the file was previously recalculated). The formula divided by zero.
- `[WARN] [Q2!B10] uses unknown name 'GrowthAssumptions'` — The identifier `GrowthAssumptions` appears in the formula but is not in `<definedNames>`. This may be a typo or a name that was accidentally omitted. It is a heuristic warning — verify manually. The warning alone does not block delivery.
#### Reading formula_check.py JSON output
```json
{
"file": "/tmp/budget_2024.xlsx",
"sheets_checked": ["Summary", "Q1", "Q2", "Q3", "Q4", "Assumptions"],
"formula_count": 312,
"shared_formula_ranges": 4,
"error_count": 4,
"errors": [
{
"type": "error_value",
"error": "#REF!",
"sheet": "Summary",
"cell": "C12",
"formula": "Q1!A0/Q1!A1"
},
{
"type": "broken_sheet_ref",
"sheet": "Summary",
"cell": "D15",
"formula": "Q5!D15",
"missing_sheet": "Q5",
"valid_sheets": ["Assumptions", "Q1", "Q2", "Q3", "Q4", "Summary"]
},
{
"type": "error_value",
"error": "#DIV/0!",
"sheet": "Q1",
"cell": "F8",
"formula": null
},
{
"type": "unknown_name_ref",
"sheet": "Q2",
"cell": "B10",
"formula": "SUM(GrowthAssumptions)",
"unknown_name": "GrowthAssumptions",
"defined_names": ["RevenueRange", "CostRange"],
"note": "Heuristic check — verify manually if this is a false positive"
}
]
}
```
Field reference:
| Field | Meaning |
|-------|---------|
| `type: "error_value"` | Cell has `t="e"` — an Excel error is stored in the `<v>` element |
| `type: "broken_sheet_ref"` | Formula references a sheet name not present in workbook.xml |
| `type: "unknown_name_ref"` | Formula references an identifier not in `<definedNames>` (heuristic, soft warning) |
| `type: "malformed_error_cell"` | Cell has `t="e"` but no `<v>` child — structural XML problem |
| `type: "file_error"` | The file could not be opened (bad ZIP, not found, etc.) |
| `sheet` | The sheet where the error was found |
| `cell` | Cell reference in A1 notation |
| `formula` | The full formula text from the `<f>` element (null if not present) |
| `error` | The error string from `<v>` (for `error_value` type) |
| `missing_sheet` | The sheet name extracted from the formula that does not exist |
| `valid_sheets` | All sheet names actually present in workbook.xml |
| `unknown_name` | The identifier that was not found in `<definedNames>` |
| `defined_names` | All named ranges actually present in workbook.xml |
| `shared_formula_ranges` | Count of shared formula definitions (top-level `<f t="shared" ref="...">` elements) |
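When consuming the JSON output programmatically, it helps to separate hard errors from heuristic warnings. A minimal sketch, assuming the report has the shape shown above; `split_findings` and `SAMPLE_REPORT` are illustrative names, and the set of hard types is copied from the list earlier in this guide.

```python
import json

# Types that cause exit code 1, per this guide.
HARD_ERROR_TYPES = {"error_value", "broken_sheet_ref", "malformed_error_cell", "file_error"}

# Hypothetical, abbreviated report for illustration.
SAMPLE_REPORT = json.loads("""
{"errors": [
  {"type": "error_value", "sheet": "Summary", "cell": "C12", "error": "#REF!"},
  {"type": "broken_sheet_ref", "sheet": "Summary", "cell": "D15", "missing_sheet": "Q5"},
  {"type": "unknown_name_ref", "sheet": "Q2", "cell": "B10", "unknown_name": "GrowthAssumptions"}
]}
""")


def split_findings(report: dict) -> tuple[list[dict], list[dict]]:
    """Split report["errors"] into (hard_errors, soft_warnings)."""
    entries = report.get("errors", [])
    hard = [e for e in entries if e.get("type") in HARD_ERROR_TYPES]
    soft = [e for e in entries if e.get("type") not in HARD_ERROR_TYPES]
    return hard, soft
```

Hard errors block delivery; soft warnings go into the manual-review section of the report.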
### Step 2: Manual XML inspection
When formula_check.py reports errors, unpack the file to inspect the raw XML:
```bash
python3 SKILL_DIR/scripts/xlsx_unpack.py /path/to/file.xlsx /tmp/xlsx_inspect/
```
Navigate to the worksheet file for the reported sheet. The sheet-to-file mapping is in `xl/_rels/workbook.xml.rels`. For example, if `rId1` maps to `worksheets/sheet1.xml`, then sheet1.xml is the file for the sheet with `r:id="rId1"` in `xl/workbook.xml`.
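The sheet-to-file lookup can be sketched in a few lines. This assumes the two XML parts have already been read into strings after unpacking; the function name `sheet_targets` is illustrative, and targets are returned exactly as written in the rels file (relative to `xl/`).

```python
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def sheet_targets(workbook_xml: str, rels_xml: str) -> dict[str, str]:
    """Map each sheet name in workbook.xml to its Target in workbook.xml.rels."""
    # rId -> Target, e.g. "rId1" -> "worksheets/sheet1.xml"
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter(f"{{{PKG_REL_NS}}}Relationship")
    }
    mapping = {}
    for sheet in ET.fromstring(workbook_xml).iter(f"{{{MAIN_NS}}}sheet"):
        rid = sheet.get(f"{{{DOC_REL_NS}}}id")  # the r:id attribute
        mapping[sheet.get("name")] = targets.get(rid)
    return mapping
```

A sheet whose `r:id` has no matching relationship maps to `None`, which is itself a structural problem worth reporting.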
For each reported error cell, locate the `<c r="CELLREF">` element and examine:
**For `error_value` errors:**
```xml
<!-- This is what an error cell looks like in XML -->
<c r="C12" t="e">
<f>Q1!C10/Q1!C11</f>
<v>#DIV/0!</v>
</c>
```
Ask:
- Is the `<f>` formula syntactically correct?
- Does the cell reference in the formula point to a row/column that exists?
- If it is a division, is it possible the denominator cell is empty or zero?
**For `broken_sheet_ref` errors:**
Check `xl/workbook.xml` for the actual sheet list:
```xml
<sheets>
<sheet name="Summary" sheetId="1" r:id="rId1"/>
<sheet name="Q1" sheetId="2" r:id="rId2"/>
<sheet name="Q2" sheetId="3" r:id="rId3"/>
</sheets>
```
formula_check.py compares sheet names case-sensitively, so it treats `q1` and `Q1` as different sheets even though Excel does not. Compare the name in the formula exactly against the names here.
### Step 3: Cross-sheet reference audit (multi-sheet workbooks)
For workbooks with 3 or more sheets, run a broader cross-reference audit after unpacking:
```bash
# Extract all formulas containing cross-sheet references
# (matches both plain <f> and attributed <f t="shared" ...> elements)
grep -oh '<f[ >][^<]*' /tmp/xlsx_inspect/xl/worksheets/*.xml | grep '!'
# List all actual sheet names from workbook.xml (ignores <definedName> entries)
grep -o '<sheet [^>]*>' /tmp/xlsx_inspect/xl/workbook.xml | grep -o 'name="[^"]*"'
```
Every sheet name appearing in formulas (in the form `SheetName!` or `'Sheet Name'!`) must appear in the workbook sheet list. If any do not match, that is a broken reference even if formula_check.py did not catch it (which can happen with shared formulas where only the primary cell is examined).
To check shared formulas specifically, look for `<f t="shared" ref="...">` elements:
```xml
<!-- Shared formula: defined on D2, applied to D2:D100 -->
<c r="D2"><f t="shared" ref="D2:D100" si="0">Q1!B2*C2</f><v></v></c>
<!-- Shared formula consumers: only si is present, no formula text -->
<c r="D3"><f t="shared" si="0"/><v></v></c>
```
formula_check.py reads the formula text from the primary cell (`D2` above). The referenced sheet `Q1` in that formula applies to the entire range `D2:D100`. If that sheet reference is broken, all 99 cells in the range are broken, even though the consumer cells contain only empty `<f>` elements.
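To report how many cells one shared-formula entry really affects, count the cells covered by its `ref` attribute. A minimal sketch, assuming `ref` is a plain A1-style cell or rectangular range; the helper names are illustrative.

```python
import re

_REF = re.compile(r"^\$?([A-Za-z]{1,3})\$?(\d+)(?::\$?([A-Za-z]{1,3})\$?(\d+))?$")


def column_index(letters: str) -> int:
    """Convert Excel column letters to a 1-based index (A -> 1, AA -> 27)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def shared_range_size(ref: str) -> int:
    """Return the number of cells covered by a ref such as "D2:D100"."""
    match = _REF.match(ref)
    if match is None:
        raise ValueError(f"unsupported ref: {ref!r}")
    col1, row1, col2, row2 = match.groups()
    if col2 is None:  # single-cell ref
        return 1
    width = abs(column_index(col2) - column_index(col1)) + 1
    height = abs(int(row2) - int(row1)) + 1
    return width * height
```

Multiplying each distinct shared-formula error by this size gives the logical error count for the report.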
---
## Tier 2 — Dynamic Validation (LibreOffice Headless)
### Check LibreOffice availability
```bash
# Check macOS (typical install location)
which soffice
/Applications/LibreOffice.app/Contents/MacOS/soffice --version
# Check Linux
which libreoffice || which soffice
libreoffice --version
```
If neither command returns a path, LibreOffice is not installed. Record "Tier 2: SKIPPED — LibreOffice not available" in the report and proceed to delivery with Tier 1 results only.
### Install LibreOffice (if permitted in the environment)
macOS:
```bash
brew install --cask libreoffice
```
Ubuntu/Debian:
```bash
sudo apt-get install -y libreoffice
```
### Run headless recalculation
Use the dedicated recalculation script. It handles binary discovery across macOS and Linux, works from a temporary copy of the input (preserving the original), and provides structured output and exit codes compatible with the validation pipeline.
```bash
# Check LibreOffice availability first
python3 SKILL_DIR/scripts/libreoffice_recalc.py --check
# Run recalculation (default timeout: 60s)
python3 SKILL_DIR/scripts/libreoffice_recalc.py /path/to/input.xlsx /tmp/recalculated.xlsx
# For large or complex files, extend the timeout
python3 SKILL_DIR/scripts/libreoffice_recalc.py /path/to/input.xlsx /tmp/recalculated.xlsx --timeout 120
```
Exit codes from `libreoffice_recalc.py`:
- `0` — recalculation succeeded, output file written
- `2` — LibreOffice not found (note as SKIPPED in report; not a hard failure)
- `1` — LibreOffice found but failed (timeout, crash, malformed file)
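The exit codes above can be mapped to report statuses in a wrapper. This is a sketch only: the status label `RECALCULATED` and the function names are assumptions for illustration, and `script` is whatever path `libreoffice_recalc.py` has in your environment.

```python
import subprocess
import sys


def tier2_status(returncode: int) -> str:
    """Translate a libreoffice_recalc.py exit code into a report status."""
    if returncode == 0:
        return "RECALCULATED"  # output written; re-run Tier 1 on it next
    if returncode == 2:
        return "SKIPPED"       # LibreOffice not found; not a hard failure
    return "FAIL"              # found but failed: timeout, crash, malformed file


def run_recalc(script: str, src: str, dst: str) -> str:
    """Run the recalculation script and return the mapped status."""
    result = subprocess.run([sys.executable, script, src, dst])
    return tier2_status(result.returncode)
```

A `RECALCULATED` status is not yet a Tier 2 PASS; that is decided by the Tier 1 re-run on the recalculated file.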
**What the script does internally:**
LibreOffice's `--convert-to xlsx` command opens the file using the full Calc engine with the `--infilter="Calc MS Excel 2007 XML"` filter, executes every formula, writes computed values into the `<v>` cache elements, and saves the output. This is the closest server-side equivalent of "open in Excel and press Save." The script also passes `--norestore` to prevent LibreOffice from attempting to restore previous sessions, which can cause hangs in automated environments.
**If the script times out (libreoffice_recalc.py exits with code 1 and "timed out" message):**
Record "Tier 2: TIMEOUT — LibreOffice did not complete within Ns" in the report. Do not retry in a loop. Investigate whether the file has circular references or extremely large data ranges.
### Re-run Tier 1 after recalculation
After LibreOffice recalculation, the `<v>` elements contain real computed values. Errors that were invisible before (because `<v>` was empty in a freshly generated file) now appear as `t="e"` cells with actual error strings.
```bash
python3 SKILL_DIR/scripts/formula_check.py /tmp/recalculated.xlsx
```
This second Tier 1 pass is the definitive runtime error check. Any errors it finds are real calculation failures that must be fixed.
---
## All 7 Error Types — Causes and Fix Strategies
### #REF! — Invalid Cell Reference
**What it means:** The formula references a cell, range, or sheet that no longer exists or never existed.
**Common causes in generated files:**
- Off-by-one error in row/column calculation (e.g., referencing row 0 which does not exist in Excel's 1-based system)
- Column letter computed incorrectly (e.g., column 64 maps to `BL`, not `BK`)
- Formula references a sheet that was never created or was renamed
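Column-letter mistakes usually come from mixing 0-based and 1-based indices. A minimal sketch of a 1-based conversion, with the illustrative name `column_letter`:

```python
def column_letter(index: int) -> str:
    """Convert a 1-based column index to Excel letters (1 -> A, 27 -> AA)."""
    if index < 1:
        raise ValueError("column index must be >= 1")
    letters = ""
    while index > 0:
        # Subtract 1 first: the letter system has no zero digit.
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters
```

Passing a 0-based index by mistake shifts every reference one column to the left, which is how `BK` ends up where `BL` was intended.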
**XML signature:**
```xml
<c r="D5" t="e">
<f>Sheet2!A0</f>
<v>#REF!</v>
</c>
```
**Fix — correct the reference:**
```xml
<c r="D5">
<f>Sheet2!A1</f>
<v></v>
</c>
```
Note: remove `t="e"` and clear `<v>` after correcting the formula. The error type marker belongs to the cached state, not the formula.
**Auto-fixable?** Only if the correct target can be determined with certainty from the surrounding context. Otherwise flag for human review.
---
### #DIV/0! — Division by Zero
**What it means:** The formula divides by a value that is zero or an empty cell (empty cells evaluate to 0 in arithmetic context).
**Common causes in generated files:**
- Percentage change formula `=(B2-B1)/B1` where `B1` is empty or zero
- Rate formula `=Value/Total` where the total row hasn't been populated yet
**XML signature:**
```xml
<c r="C8" t="e">
<f>B8/B7</f>
<v>#DIV/0!</v>
</c>
```
**Fix — wrap with IFERROR:**
```xml
<c r="C8">
<f>IFERROR(B8/B7,0)</f>
<v></v>
</c>
```
Alternative — explicit zero check:
```xml
<c r="C8">
<f>IF(B7=0,0,B8/B7)</f>
<v></v>
</c>
```
**Auto-fixable?** Yes. Wrapping with `IFERROR(...,0)` is safe for most financial formulas. If the business expectation is that the result should display as blank rather than zero, use `IFERROR(...,"")` instead.
---
### #VALUE! — Wrong Data Type
**What it means:** The formula attempts an arithmetic or logical operation on a value of the wrong type (e.g., adding a text string to a number).
**Common causes in generated files:**
- A cell intended to hold a number was written as a string type (`t="s"` or `t="inlineStr"`) instead of a numeric type
- A formula references a cell containing text (e.g., a unit label like "thousands") and treats it as a number
**XML signature:**
```xml
<c r="F3" t="e">
<f>E3+D3</f>
<v>#VALUE!</v>
</c>
```
**Fix — check source cells for incorrect type:**
If `D3` was incorrectly written as a string:
```xml
<!-- Wrong: numeric value stored as string -->
<c r="D3" t="inlineStr"><is><t>1000</t></is></c>
<!-- Correct: numeric value stored as number (t attribute omitted or "n") -->
<c r="D3"><v>1000</v></c>
```
Alternatively, wrap the formula with `VALUE()` conversion:
```xml
<c r="F3">
<f>VALUE(E3)+VALUE(D3)</f>
<v></v>
</c>
```
**Auto-fixable?** Partially. If the source cell type is visibly wrong (a number stored as string), fix the type. If the cause is ambiguous (the cell is supposed to contain text), flag for human review.
---
### #NAME? — Unrecognized Name
**What it means:** The formula contains an identifier that Excel does not recognize — either a misspelled function name, an undefined named range, or a function that is not available in the target Excel version.
**Common causes in generated files:**
- LLM writes a misspelled function name (for example `SUMIFF` instead of `SUMIF`), or uses a newer function such as `XLOOKUP` in a workbook targeting Excel 2010, which does not provide it
- Named range referenced in formula does not exist in `xl/workbook.xml`
**XML signature:**
```xml
<c r="B2" t="e">
<f>SUM(RevenuRange)</f>
<v>#NAME?</v>
</c>
```
**Fix — verify function name and named ranges:**
Check named ranges in `xl/workbook.xml`:
```xml
<definedNames>
<definedName name="RevenueRange">Sheet1!$B$2:$B$13</definedName>
</definedNames>
```
If the formula references `RevenuRange` (typo), correct it to `RevenueRange`:
```xml
<c r="B2">
<f>SUM(RevenueRange)</f>
<v></v>
</c>
```
**Auto-fixable?** Only if the correct name is unambiguous (e.g., a single close match exists). Otherwise flag for human review — function name fixes require understanding the intended calculation.
---
### #N/A — Value Not Available
**What it means:** A lookup function (VLOOKUP, HLOOKUP, MATCH, INDEX/MATCH, XLOOKUP) searched for a value that does not exist in the lookup table.
**Common causes in generated files:**
- Lookup key exists in the formula but the lookup table is empty or not yet populated
- Key format mismatch (text "2024" vs numeric 2024)
**XML signature:**
```xml
<c r="G5" t="e">
<f>VLOOKUP(F5,Assumptions!$A$2:$B$20,2,0)</f>
<v>#N/A</v>
</c>
```
**Fix — wrap with IFERROR for missing-match tolerance:**
```xml
<c r="G5">
<f>IFERROR(VLOOKUP(F5,Assumptions!$A$2:$B$20,2,0),0)</f>
<v></v>
</c>
```
**Auto-fixable?** Adding `IFERROR` is safe if a zero default is acceptable. If the lookup failure indicates a data integrity problem (the key should always be present), do not auto-fix — flag for human review.
---
### #NULL! — Empty Intersection
**What it means:** The space operator (which computes the intersection of two ranges) was applied to two ranges that do not intersect.
**Common causes in generated files:**
- Accidental space between two range references: `=SUM(A1:A5 C1:C5)` instead of `=SUM(A1:A5,C1:C5)`
- Rarely seen in typical financial models; usually indicates a formula generation error
**XML signature:**
```xml
<c r="H10" t="e">
<f>SUM(A1:A5 C1:C5)</f>
<v>#NULL!</v>
</c>
```
**Fix — replace space with comma (union) or colon (range):**
```xml
<!-- Union of two separate ranges -->
<c r="H10">
<f>SUM(A1:A5,C1:C5)</f>
<v></v>
</c>
```
**Auto-fixable?** Yes. The space operator is almost never intentional in generated formulas. Replacing with a comma is safe.
---
### #NUM! — Numeric Error
**What it means:** A formula produced a number that Excel cannot represent (overflow, underflow) or a mathematical operation that has no real-number result (square root of negative, LOG of zero or negative).
**Common causes in generated files:**
- IRR formula where the cash flow series has no convergent solution (for example, all cash flows have the same sign)
- `SQRT()` applied to a cell that can be negative
- Very large exponentiation
**XML signature:**
```xml
<c r="J15" t="e">
<f>IRR(B5:B15)</f>
<v>#NUM!</v>
</c>
```
**Fix — add a conditional guard:**
```xml
<c r="J15">
<f>IFERROR(IRR(B5:B15),"")</f>
<v></v>
</c>
```
For SQRT:
```xml
<c r="K5">
<f>IF(A5>=0,SQRT(A5),"")</f>
<v></v>
</c>
```
**Auto-fixable?** Partially. Wrapping with `IFERROR` suppresses the error display but does not fix the underlying calculation issue. Flag the cell for human review even after applying the IFERROR wrapper.
---
## Auto-Fix vs. Human Review Decision Matrix
| Error Type | Auto-Fix Safe? | Condition | Action |
|------------|---------------|-----------|--------|
| `#DIV/0!` | Yes | Always | Wrap with `IFERROR(formula,0)` |
| `#NULL!` | Yes | Always | Replace space operator with comma |
| `#REF!` | Yes | Only if correct target is unambiguous from context | Correct reference; otherwise flag |
| `#NAME?` | Yes | Only if typo has exactly one plausible correction | Fix name; otherwise flag |
| `#N/A` | Conditional | If a zero/blank default is business-acceptable | Add IFERROR wrapper; document assumption |
| `#VALUE!` | Conditional | Only if source cell type is clearly wrong | Fix type; otherwise flag |
| `#NUM!` | No | Always | Add IFERROR to suppress display, then flag |
| Broken sheet ref | Yes | Only if renamed sheet can be identified from workbook.xml | Correct name |
| Business logic errors | Never | Any case | Human review only |
**What counts as a business logic error (never auto-fix):**
- A formula that produces a wrong number but no Excel error (e.g., `=SUM(B2:B8)` when the intent was `=SUM(B2:B9)`)
- A formula where the IFERROR default value is meaningful (e.g., whether to use 0, blank, or a prior-period value)
- Any formula where fixing the error requires knowing what the formula was supposed to calculate
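The matrix can be encoded as data so that a fix routine never auto-fixes more than the policy allows. A sketch only: the policy labels and the `condition_met` flag are assumptions for illustration, and deciding whether a condition is met remains a judgment the caller has to make.

```python
# Policy per error type, following the decision matrix above.
AUTO_FIX_POLICY = {
    "#DIV/0!": "auto",
    "#NULL!": "auto",
    "#REF!": "conditional",
    "#NAME?": "conditional",
    "#N/A": "conditional",
    "#VALUE!": "conditional",
    "broken_sheet_ref": "conditional",
    "#NUM!": "flag",  # IFERROR may hide the display, but the cell is still flagged
}


def disposition(error: str, condition_met: bool = False) -> str:
    """Return "auto-fix" or "flag for human review" for an error type."""
    # Anything not in the table (e.g. business logic errors) is never auto-fixed.
    policy = AUTO_FIX_POLICY.get(error, "flag")
    if policy == "auto" or (policy == "conditional" and condition_met):
        return "auto-fix"
    return "flag for human review"
```

Defaulting unknown types to human review keeps the failure mode conservative.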
---
## Delivery Standard — Validation Report
Every validation task must produce a structured report. This report is the deliverable, regardless of whether errors were found.
### Required report format
```markdown
## Formula Validation Report
**File**: /path/to/filename.xlsx
**Date**: YYYY-MM-DD
**Sheets checked**: Sheet1, Sheet2, Sheet3
**Total formulas scanned**: N
---
### Tier 1 — Static Validation
**Status**: PASS / FAIL
**Tool**: formula_check.py (direct XML scan)
| Sheet | Cell | Error Type | Detail | Fix Applied |
|-------|------|-----------|--------|-------------|
| Summary | C12 | #REF! | Formula: Q1!A0 | Corrected to Q1!A1 |
| Summary | D15 | broken_sheet_ref | References missing sheet 'Q5' | Renamed to Q4 |
_(If no errors: "No errors detected.")_
---
### Tier 2 — Dynamic Validation
**Status**: PASS / FAIL / SKIPPED
**Tool**: LibreOffice headless (version X.Y.Z) / Not available
_(If SKIPPED: state the reason — LibreOffice not installed, timeout, etc.)_
| Sheet | Cell | Error Type | Detail | Fix Applied |
|-------|------|-----------|--------|-------------|
| Q1 | F8 | #DIV/0! | Formula: C8/C7 | Wrapped with IFERROR |
_(If no errors: "No runtime errors detected after recalculation.")_
---
### Summary
- **Total errors found**: N
- **Auto-fixed**: N (list types)
- **Flagged for human review**: N (list cells and reason)
- **Final status**: PASS (ready for delivery) / FAIL (blocked)
### Human Review Required
| Cell | Error | Reason Auto-Fix Not Applied |
|------|-------|----------------------------|
| Q2!B15 | #NUM! | IRR formula — business must confirm cash flow inputs |
```
### Minimum required fields
The report is invalid (and delivery is blocked) if any of these are missing:
- File path and date
- Which sheets were checked
- Total formula count
- Tier 1 status with explicit PASS/FAIL
- Tier 2 status with explicit PASS/FAIL/SKIPPED and reason if SKIPPED
- For every error: sheet, cell, error type, and disposition (fixed or flagged)
- Final delivery status
---
## Common Scenarios
### Scenario 1: Validate immediately after creating a new file
When `create.md` workflow produces a new xlsx, run validation before any delivery response.
```bash
# Step 1: Static check on the freshly written file
python3 SKILL_DIR/scripts/formula_check.py /path/to/output.xlsx
# Step 2: Dynamic check (if LibreOffice available)
python3 SKILL_DIR/scripts/libreoffice_recalc.py /path/to/output.xlsx /tmp/recalculated.xlsx
python3 SKILL_DIR/scripts/formula_check.py /tmp/recalculated.xlsx
```
Expected behavior on a freshly created file: Tier 1 will find zero `error_value` errors (because `<v>` elements are empty, not error-valued). It will find any broken cross-sheet references if sheet names were misspelled. Tier 2 will populate `<v>` and reveal runtime errors like `#DIV/0!`.
If Tier 2 reveals errors, fix them in the source XML (not the recalculated copy), repack, and re-run both tiers.
### Scenario 2: Validate after editing an existing file
When `edit.md` workflow modifies an existing xlsx, validate only the affected sheets if the edit was surgical. If the edit touched shared formulas or cross-sheet references, validate all sheets.
```bash
# Targeted static check — look at specific sheet
# (formula_check.py checks all sheets; examine only the relevant section of output)
python3 SKILL_DIR/scripts/formula_check.py /path/to/edited.xlsx --json \
| python3 -c "
import json, sys
r = json.load(sys.stdin)
for e in r['errors']:
    if e.get('sheet') in ['Summary', 'Q1']:
        print(e)
"
```
Always run Tier 2 after edits that modify formulas, even if Tier 1 passes. Edits to data ranges can cause previously-valid formulas to produce runtime errors.
### Scenario 3: User provides a file with suspected formula errors
When a user submits a file and reports wrong values or visible errors:
```bash
# Step 1: Static scan — find all error cells
python3 SKILL_DIR/scripts/formula_check.py /path/to/user_file.xlsx --json > /tmp/validation_results.json
# Step 2: Unpack for manual inspection
python3 SKILL_DIR/scripts/xlsx_unpack.py /path/to/user_file.xlsx /tmp/xlsx_inspect/
# Step 3: Dynamic recalculation
python3 SKILL_DIR/scripts/libreoffice_recalc.py /path/to/user_file.xlsx /tmp/user_file_recalc.xlsx
# Step 4: Re-validate recalculated file
python3 SKILL_DIR/scripts/formula_check.py /tmp/user_file_recalc.xlsx --json > /tmp/validation_after_recalc.json
# Step 5: Compare before and after
python3 - <<'EOF'
import json
before = json.load(open("/tmp/validation_results.json"))
after = json.load(open("/tmp/validation_after_recalc.json"))
print(f"Before recalc: {before['error_count']} errors")
print(f"After recalc: {after['error_count']} errors")
EOF
```
If errors appear only after recalculation (not in the original static scan), the formulas were syntactically correct but produce wrong results at runtime. These are runtime errors that require formula-level fixes, not XML-structure fixes.
If errors appear in both scans, they were already cached in `<v>` before recalculation — the file was previously opened by Excel/LibreOffice and the errors persisted.
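Comparing raw counts hides which errors are which. The two cases above can be separated by comparing the entries themselves. A minimal sketch, assuming both reports follow the JSON shape documented earlier; `classify_errors` is an illustrative name.

```python
def classify_errors(before: dict, after: dict) -> dict[str, list[tuple]]:
    """Separate errors that appear only after recalculation from cached ones."""

    def keys(report: dict) -> set[tuple]:
        return {
            (e.get("sheet"), e.get("cell"), e.get("type"), e.get("error"))
            for e in report.get("errors", [])
        }

    before_keys, after_keys = keys(before), keys(after)
    return {
        # Appear only after recalculation: runtime errors needing formula-level fixes.
        "runtime_only": sorted(after_keys - before_keys, key=str),
        # Present in both scans: already cached in <v> before recalculation.
        "already_cached": sorted(after_keys & before_keys, key=str),
    }
```

Each tuple is `(sheet, cell, type, error)`, which is enough to fill the per-error rows of the validation report.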
---
## Critical Pitfalls
**Pitfall 1: openpyxl `data_only=True` destroys formulas.**
Opening a workbook with `data_only=True` reads cached values instead of formulas. If you then save the workbook, all `<f>` elements are permanently removed and replaced with their last-cached values. Never use this mode for validation workflows.
**Pitfall 2: Empty `<v>` is not the same as a passing formula.**
A freshly generated file has empty `<v>` elements for all formula cells. formula_check.py will not report these as errors — they are not yet errors. They become errors only after recalculation if the calculated value is an error type. This is why Tier 2 is mandatory.
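To see how many formula cells are still unevaluated, one option is to list cells that have an `<f>` element but no cached `<v>` text. This is a sketch using only the standard library; the function name `uncached_formula_cells` and the sample XML are illustrative assumptions, not part of the skill's scripts.

```python
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

# Hypothetical worksheet fragment for demonstration.
SAMPLE = b"""<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1"><v>10</v></c>
      <c r="B1"><f>A1*2</f><v></v></c>
      <c r="C1"><f>A1+1</f><v>11</v></c>
      <c r="D1"><f>A1/0</f></c>
    </row>
  </sheetData>
</worksheet>"""


def uncached_formula_cells(sheet_xml: bytes) -> list:
    """Return refs of cells that have an <f> element but no cached <v> text."""
    root = ET.fromstring(sheet_xml)
    missing = []
    for cell in root.iter(f"{NS}c"):
        if cell.find(f"{NS}f") is None:
            continue  # plain value cell, nothing to recalculate
        v = cell.find(f"{NS}v")
        if v is None or not (v.text or "").strip():
            missing.append(cell.get("r", "?"))
    return missing


print(uncached_formula_cells(SAMPLE))  # ['B1', 'D1']
```

A non-empty result means the static scan has not yet seen those cells' values, so a Tier 1 pass says nothing about them.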
**Pitfall 3: Shared formula errors affect the entire range.**
If a shared formula's primary cell has a broken reference, every cell in the shared range (`ref="D2:D100"`) inherits that broken reference. The count of logical errors can be much larger than the count of distinct error entries in formula_check.py output. When fixing a broken shared formula, fix the primary cell's `<f t="shared" ref="...">` element; the consumers (`<f t="shared" si="N"/>`) automatically inherit the corrected formula.
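To estimate how many cells a single broken shared formula affects, the `ref` attribute can be expanded into a cell count. A small sketch, assuming plain A1-style rectangular refs; the helper names are hypothetical.

```python
import re


def col_to_index(col: str) -> int:
    """Convert a column label such as 'A' or 'AA' to a 1-based index."""
    n = 0
    for ch in col.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def cells_in_ref(ref: str) -> int:
    """Count the cells covered by a ref like 'D2:D100' (or a single cell like 'D2')."""
    parts = ref.replace("$", "").split(":")
    start = parts[0]
    end = parts[1] if len(parts) == 2 else parts[0]
    m1 = re.fullmatch(r"([A-Za-z]+)(\d+)", start)
    m2 = re.fullmatch(r"([A-Za-z]+)(\d+)", end)
    if not (m1 and m2):
        raise ValueError(f"Unsupported ref: {ref!r}")
    cols = abs(col_to_index(m2.group(1)) - col_to_index(m1.group(1))) + 1
    rows = abs(int(m2.group(2)) - int(m1.group(2))) + 1
    return cols * rows


print(cells_in_ref("D2:D100"))  # 99
```

One `broken_sheet_ref` entry on the primary cell of `ref="D2:D100"` therefore stands for 99 logically broken cells.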
**Pitfall 4: formula_check.py compares sheet names case-sensitively.**
Excel resolves sheet names case-insensitively, so `=q1!B5` and `=Q1!B5` point to the same cell when the workbook is opened. formula_check.py, however, compares sheet names as exact strings: a formula that uses a lowercase sheet name for a sheet stored in uppercase is flagged as a `broken_sheet_ref`. The fix is to match the exact case of the sheet's `name` attribute in `workbook.xml`.
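When a `broken_sheet_ref` entry turns out to be a case mismatch, the exact-case name can be looked up from the entry's `missing_sheet` and `valid_sheets` fields. A minimal sketch; `suggest_sheet_name` is a hypothetical helper.

```python
from __future__ import annotations


def suggest_sheet_name(ref_sheet: str, valid_sheets: list[str]) -> str | None:
    """Return the workbook's exact-case sheet name for a case-insensitive match, if any."""
    if ref_sheet in valid_sheets:
        return ref_sheet  # already an exact match
    folded = ref_sheet.casefold()
    for name in valid_sheets:
        if name.casefold() == folded:
            return name
    return None  # genuinely missing sheet, not a case problem


print(suggest_sheet_name("q1", ["Summary", "Q1"]))   # Q1
print(suggest_sheet_name("Q5", ["Summary", "Q1"]))   # None
```

A `None` result means the reference is broken for a reason other than letter case and needs a real fix.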
**Pitfall 5: `--convert-to xlsx` does not guarantee formula preservation.**
LibreOffice's conversion can occasionally alter certain formula types (array formulas, dynamic array functions like `SORT`, `UNIQUE`). After Tier 2, if the recalculated file shows formula changes unrelated to error fixing, do not deliver the recalculated file directly — use the original file with targeted XML fixes instead.
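One way to spot such alterations is to compare formula text cell by cell between the original and the recalculated file. The sketch below uses only the standard library and assumes both files keep the same worksheet part names (for example `xl/worksheets/sheet1.xml`). LibreOffice may rename parts or expand shared formulas, so treat hits as candidates for manual review. The function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def formulas_by_cell(xlsx_path: str) -> dict:
    """Map (worksheet part name, cell ref) -> formula text for every <f> with text."""
    found = {}
    with zipfile.ZipFile(xlsx_path) as z:
        for part in z.namelist():
            if not (part.startswith("xl/worksheets/") and part.endswith(".xml")):
                continue
            root = ET.fromstring(z.read(part))
            for cell in root.iter(f"{NS}c"):
                f = cell.find(f"{NS}f")
                if f is not None and f.text:
                    found[(part, cell.get("r", "?"))] = f.text
    return found


def formula_drift(original: str, recalculated: str) -> dict:
    """Return {(part, cell): (before, after)} for formulas whose text differs."""
    before = formulas_by_cell(original)
    after = formulas_by_cell(recalculated)
    return {
        key: (before.get(key), after.get(key))
        for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    }
```

An empty result from `formula_drift("/path/to/original.xlsx", "/tmp/recalc.xlsx")` suggests the conversion left formula text untouched; any entry is a reason to fall back to targeted XML fixes on the original.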


@@ -0,0 +1,422 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
"""
formula_check.py — Static formula validator for xlsx files.
Usage:
python3 formula_check.py <input.xlsx>
python3 formula_check.py <input.xlsx> --json # machine-readable output
python3 formula_check.py <input.xlsx> --report # standardized validation report (JSON)
python3 formula_check.py <input.xlsx> --report -o out # report to file
python3 formula_check.py <input.xlsx> --sheet Sales # limit to one sheet
python3 formula_check.py <input.xlsx> --summary # error counts only, no details
What it checks:
1. Error-value cells: <c t="e"><v>#REF!</v></c> — all 7 Excel error types
2. Broken cross-sheet references: formula references a sheet not in workbook.xml
3. Broken named-range references: formula references a name not in workbook.xml <definedNames>
4. Shared formulas: each primary cell's formula text goes through checks 2 and 3 and shared ranges are counted; consumer cells are skipped
5. Missing <v> on t="e" cells (malformed XML)
Checks NOT performed (require dynamic recalculation):
- Runtime errors that only appear after formulas execute (#DIV/0! on empty denominator, etc.)
-> Use libreoffice_recalc.py + re-run formula_check.py for dynamic validation
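Exit code:
0 — no errors found, or only heuristic unknown_name_ref warnings in human-readable mode
1 — errors detected (or file cannot be opened); with --json or --report, heuristic warnings also exit 1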
"""
import sys
import zipfile
import xml.etree.ElementTree as ET
import re
import json
# OOXML SpreadsheetML namespace
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NSP = f"{{{NS}}}"
# All 7 standard Excel formula error types
EXCEL_ERRORS = {"#REF!", "#DIV/0!", "#VALUE!", "#NAME?", "#NULL!", "#NUM!", "#N/A"}
# Excel built-in function names (subset of common ones) — used for #NAME? heuristic
# Full list: https://support.microsoft.com/en-us/office/excel-functions-alphabetical
_BUILTIN_FUNCTIONS = {
"ABS", "AND", "AVERAGE", "AVERAGEIF", "AVERAGEIFS", "CEILING", "CHOOSE",
"COUNTA", "COUNTIF", "COUNTIFS", "COUNT", "DATE", "EDATE", "EOMONTH",
"FALSE", "FILTER", "FIND", "FLOOR", "IF", "IFERROR", "IFNA", "IFS",
"INDEX", "INDIRECT", "INT", "IRR", "ISBLANK", "ISERROR", "ISNA", "ISNUMBER",
"LARGE", "LEFT", "LEN", "LOOKUP", "LOWER", "MATCH", "MAX", "MID", "MIN",
"MOD", "MONTH", "NETWORKDAYS", "NOT", "NOW", "NPV", "OFFSET", "OR",
"PMT", "PV", "RAND", "RANK", "RIGHT", "ROUND", "ROUNDDOWN", "ROUNDUP",
"ROW", "ROWS", "SEARCH", "SMALL", "SORT", "SQRT", "SUBSTITUTE", "SUM",
"SUMIF", "SUMIFS", "SUMPRODUCT", "TEXT", "TODAY", "TRANSPOSE", "TRIM",
"TRUE", "UNIQUE", "UPPER", "VALUE", "VLOOKUP", "HLOOKUP", "XLOOKUP",
"XMATCH", "XNPV", "XIRR", "YEAR", "YEARFRAC",
}
def get_sheet_names(z: zipfile.ZipFile) -> dict[str, str]:
"""Return dict of {r:id -> sheet_name} from workbook.xml."""
wb_xml = z.read("xl/workbook.xml")
wb = ET.fromstring(wb_xml)
rel_ns = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
sheets = {}
for sheet in wb.findall(f".//{NSP}sheet"):
name = sheet.get("name", "")
rid = sheet.get(f"{{{rel_ns}}}id", "")
sheets[rid] = name
return sheets
def get_defined_names(z: zipfile.ZipFile) -> set[str]:
"""Return set of named ranges defined in workbook.xml <definedNames>."""
wb_xml = z.read("xl/workbook.xml")
wb = ET.fromstring(wb_xml)
names = set()
for dn in wb.findall(f".//{NSP}definedName"):
n = dn.get("name", "")
if n:
names.add(n)
return names
def get_sheet_files(z: zipfile.ZipFile) -> dict[str, str]:
    """Return dict of {r:id -> xl/worksheets/sheetN.xml} from workbook.xml.rels."""
    rels_xml = z.read("xl/_rels/workbook.xml.rels")
    rels = ET.fromstring(rels_xml)
    mapping = {}
    for rel in rels:
        rid = rel.get("Id", "")
        target = rel.get("Target", "")
        if "worksheets" in target:
            if target.startswith("/"):
                # Absolute part name: "/xl/worksheets/sheet1.xml" -> "xl/worksheets/sheet1.xml"
                target = target.lstrip("/")
            elif not target.startswith("xl/"):
                # Relative to xl/: "worksheets/sheet1.xml" -> "xl/worksheets/sheet1.xml"
                target = "xl/" + target
            mapping[rid] = target
    return mapping
def extract_sheet_refs(formula: str) -> list[str]:
"""
Extract all sheet names referenced in a formula string.
Handles:
- 'Sheet Name'!A1 (quoted, may contain spaces)
- SheetName!A1 (unquoted, no spaces)
Returns a list of sheet name strings (may contain duplicates if the same
sheet is referenced multiple times in one formula).
"""
refs = []
# Quoted sheet names: 'Sheet Name'!
for m in re.finditer(r"'([^']+)'!", formula):
refs.append(m.group(1))
# Unquoted sheet names: SheetName! (not preceded by a single quote)
for m in re.finditer(r"(?<!')([A-Za-z_\u4e00-\u9fff][A-Za-z0-9_.·\u4e00-\u9fff]*)!", formula):
refs.append(m.group(1))
return refs
def extract_name_refs(formula: str) -> list[str]:
"""
Extract identifiers in a formula that could be named range references.
Heuristic: identifiers that:
- Are not preceded by a sheet reference (no "!" before them)
- Are not followed by "(" (which would make them function calls)
- Match the pattern of a name (letters/underscore start, alphanumeric/underscore body)
- Are not single-letter column references or row references
This is approximate. False positives are possible; false negatives are rare.
"""
names = []
# Remove quoted sheet references first to avoid false matches
formula_clean = re.sub(r"'[^']*'![A-Z$0-9:]+", "", formula)
formula_clean = re.sub(r"[A-Za-z_][A-Za-z0-9_.]*![A-Z$0-9:]+", "", formula_clean)
# Find identifiers not followed by "(" (not function calls)
for m in re.finditer(r"\b([A-Za-z_][A-Za-z0-9_]{2,})\b(?!\s*\()", formula_clean):
candidate = m.group(1)
# Exclude Excel cell references like A1, B10, AA100
if re.fullmatch(r"[A-Z]{1,3}[0-9]+", candidate):
continue
# Exclude built-in function names (they appear without parens sometimes in array formulas)
if candidate.upper() in _BUILTIN_FUNCTIONS:
continue
names.append(candidate)
return names
def check(xlsx_path: str, sheet_filter: str | None = None) -> dict:
"""
Run all static checks on the given xlsx file.
Args:
xlsx_path: path to the .xlsx file
sheet_filter: if provided, only check the sheet with this name
Returns:
A dict with keys:
file, sheets_checked, formula_count, shared_formula_ranges,
error_count, errors
"""
results = {
"file": xlsx_path,
"sheets_checked": [],
"formula_count": 0,
"shared_formula_ranges": 0, # number of shared formula definitions
"error_count": 0,
"errors": [],
}
try:
z = zipfile.ZipFile(xlsx_path, "r")
except (zipfile.BadZipFile, FileNotFoundError) as e:
results["errors"].append({"type": "file_error", "message": str(e)})
results["error_count"] = 1
return results
with z:
sheet_names = get_sheet_names(z)
sheet_files = get_sheet_files(z)
valid_sheet_names = set(sheet_names.values())
defined_names = get_defined_names(z)
for rid, sheet_name in sheet_names.items():
# Apply sheet filter if requested
if sheet_filter and sheet_name != sheet_filter:
continue
ws_file = sheet_files.get(rid)
if not ws_file or ws_file not in z.namelist():
continue
results["sheets_checked"].append(sheet_name)
ws_xml = z.read(ws_file)
ws = ET.fromstring(ws_xml)
# Track shared formula IDs seen on this sheet (si -> primary cell ref)
shared_primary: dict[str, str] = {}
for cell in ws.findall(f".//{NSP}c"):
cell_ref = cell.get("r", "?")
cell_type = cell.get("t", "n")
# ── Check 1: error-value cell ──────────────────────────────
if cell_type == "e":
v_elem = cell.find(f"{NSP}v")
if v_elem is None:
# Malformed: t="e" but no <v> — record as structural issue
results["errors"].append(
{
"type": "malformed_error_cell",
"sheet": sheet_name,
"cell": cell_ref,
"detail": "Cell has t='e' but no <v> child element",
}
)
results["error_count"] += 1
else:
error_val = v_elem.text or "#UNKNOWN"
f_elem = cell.find(f"{NSP}f")
results["errors"].append(
{
"type": "error_value",
"error": error_val,
"sheet": sheet_name,
"cell": cell_ref,
# Include formula text if present
"formula": f_elem.text if (f_elem is not None and f_elem.text) else None,
}
)
results["error_count"] += 1
# ── Check 2 & 3: formulas ──────────────────────────────────
f_elem = cell.find(f"{NSP}f")
if f_elem is None:
continue
f_type = f_elem.get("t", "") # "shared", "array", or "" for normal
f_si = f_elem.get("si") # shared formula group ID
# Count formulas:
# - Normal formulas: always count
# - Shared formula PRIMARY (has text + ref attribute): count once
# - Shared formula CONSUMER (si only, no text): do NOT count separately
# (they are covered by the primary's ref range)
if f_type == "shared" and f_elem.text is None:
# Consumer cell: skip formula counting and cross-ref checks
# (the primary cell already covers this formula)
continue
formula = f_elem.text or ""
if f_type == "shared" and f_elem.get("ref"):
results["shared_formula_ranges"] += 1
if f_si is not None:
shared_primary[f_si] = cell_ref
if formula:
results["formula_count"] += 1
# Check 2: cross-sheet references
for ref_sheet in extract_sheet_refs(formula):
if ref_sheet not in valid_sheet_names:
results["errors"].append(
{
"type": "broken_sheet_ref",
"sheet": sheet_name,
"cell": cell_ref,
"formula": formula,
"missing_sheet": ref_sheet,
"valid_sheets": sorted(valid_sheet_names),
}
)
results["error_count"] += 1
# Check 3: named range references
# Only flag if the name is not a built-in and not a sheet-prefixed ref
for name_ref in extract_name_refs(formula):
if name_ref not in defined_names:
results["errors"].append(
{
"type": "unknown_name_ref",
"sheet": sheet_name,
"cell": cell_ref,
"formula": formula,
"unknown_name": name_ref,
"defined_names": sorted(defined_names),
"note": "Heuristic check — verify manually if this is a false positive",
}
)
results["error_count"] += 1
return results
def build_report(results: dict) -> dict:
"""
Transform raw check() output into a standardized validation report.
Usage:
python3 formula_check.py <input.xlsx> --report # JSON report to stdout
python3 formula_check.py <input.xlsx> --report -o out # JSON report to file
"""
from collections import Counter
errors = results.get("errors", [])
error_types = [e.get("error", e.get("type", "unknown")) for e in errors]
return {
"status": "success" if results["error_count"] == 0 else "errors_found",
"file": results["file"],
"sheets_checked": results["sheets_checked"],
"total_formulas": results["formula_count"],
"total_errors": results["error_count"],
"shared_formula_ranges": results.get("shared_formula_ranges", 0),
"errors_by_type": dict(Counter(error_types)) if errors else {},
"errors": errors,
}
def main() -> None:
use_json = "--json" in sys.argv
use_report = "--report" in sys.argv
summary_only = "--summary" in sys.argv
output_file = None
sheet_filter = None
args_clean = []
i = 1
while i < len(sys.argv):
arg = sys.argv[i]
if arg == "--sheet" and i + 1 < len(sys.argv):
sheet_filter = sys.argv[i + 1]
i += 2
elif arg == "-o" and i + 1 < len(sys.argv):
output_file = sys.argv[i + 1]
i += 2
elif arg.startswith("--"):
i += 1 # skip flags already handled
else:
args_clean.append(arg)
i += 1
if not args_clean:
print("Usage: formula_check.py <input.xlsx> [--json] [--report [-o FILE]] [--sheet NAME] [--summary]")
sys.exit(1)
results = check(args_clean[0], sheet_filter=sheet_filter)
if use_report:
report = build_report(results)
output = json.dumps(report, indent=2, ensure_ascii=False)
if output_file:
with open(output_file, "w", encoding="utf-8") as f:
f.write(output + "\n")
else:
print(output)
sys.exit(1 if results["error_count"] > 0 else 0)
if use_json:
print(json.dumps(results, indent=2, ensure_ascii=False))
sys.exit(1 if results["error_count"] > 0 else 0)
# Human-readable output
sheets = ", ".join(results["sheets_checked"]) or "(none)"
if sheet_filter:
sheets = f"{sheet_filter} (filtered)"
print(f"File : {results['file']}")
print(f"Sheets : {sheets}")
print(f"Formulas checked : {results['formula_count']} distinct formula cells")
print(f"Shared formula ranges : {results['shared_formula_ranges']} ranges")
print(f"Errors found : {results['error_count']}")
if not summary_only and results["errors"]:
print("\n── Error Details ──")
for e in results["errors"]:
if e["type"] == "error_value":
formula_hint = f" (formula: {e['formula']})" if e.get("formula") else ""
print(f" [FAIL] [{e['sheet']}!{e['cell']}] contains {e['error']}{formula_hint}")
elif e["type"] == "broken_sheet_ref":
print(
f" [FAIL] [{e['sheet']}!{e['cell']}] references missing sheet "
f"'{e['missing_sheet']}'"
)
print(f" Formula: {e['formula']}")
print(f" Valid sheets: {e.get('valid_sheets', [])}")
elif e["type"] == "unknown_name_ref":
print(
f" [WARN] [{e['sheet']}!{e['cell']}] uses unknown name "
f"'{e['unknown_name']}' (heuristic — verify manually)"
)
print(f" Formula: {e['formula']}")
print(f" Defined names: {e.get('defined_names', [])}")
elif e["type"] == "malformed_error_cell":
print(f" [FAIL] [{e['sheet']}!{e['cell']}] malformed error cell: {e['detail']}")
elif e["type"] == "file_error":
print(f" [FAIL] File error: {e['message']}")
print()
if results["error_count"] == 0:
print("PASS — No formula errors detected")
else:
# Separate definitive failures from heuristic warnings
hard_errors = [e for e in results["errors"] if e["type"] != "unknown_name_ref"]
warnings = [e for e in results["errors"] if e["type"] == "unknown_name_ref"]
if hard_errors:
print(f"FAIL — {len(hard_errors)} error(s) must be fixed before delivery")
if warnings:
print(f"WARN — {len(warnings)} heuristic warning(s) require manual review")
sys.exit(1)
else:
# Only heuristic warnings — do not block delivery but alert
print(f"PASS with WARN — {len(warnings)} heuristic warning(s) require manual review")
# Exit 0: heuristic warnings alone do not block delivery
sys.exit(0)
if __name__ == "__main__":
main()


@@ -0,0 +1,248 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
"""
libreoffice_recalc.py — Tier 2 dynamic formula recalculation via LibreOffice headless.
Opens the xlsx file with the LibreOffice Calc engine, executes all formulas, writes
the computed values into the <v> cache elements, and saves the result. This is the
closest server-side equivalent of "open in Excel and save."
After recalculation, run formula_check.py on the output file to detect runtime errors
(#DIV/0!, #N/A, etc.) that only surface after actual computation.
Usage:
python3 libreoffice_recalc.py input.xlsx output.xlsx
python3 libreoffice_recalc.py input.xlsx output.xlsx --timeout 90
python3 libreoffice_recalc.py --check # check LibreOffice availability only
Exit codes:
0 — recalculation succeeded, output file written
1 — LibreOffice found but recalculation failed (timeout, crash, bad file)
2 — LibreOffice not found (Tier 2 unavailable — not a hard failure, note in report)
"""
import subprocess
import sys
import shutil
import os
import tempfile
import argparse
# ── LibreOffice discovery ───────────────────────────────────────────────────
def find_soffice() -> str | None:
"""
Locate the soffice (LibreOffice) binary.
Search order:
1. macOS application bundle (default install location)
2. PATH lookup for 'soffice'
3. PATH lookup for 'libreoffice' (common on Linux)
"""
candidates = [
"/Applications/LibreOffice.app/Contents/MacOS/soffice", # macOS
"soffice", # Linux / macOS if on PATH
"libreoffice", # alternative Linux name
]
for c in candidates:
# shutil.which handles PATH lookup; also check absolute paths directly
found = shutil.which(c)
if found:
return found
if os.path.isfile(c) and os.access(c, os.X_OK):
return c
return None
def get_libreoffice_version(soffice: str) -> str:
"""Return LibreOffice version string, or 'unknown' on failure."""
try:
result = subprocess.run(
[soffice, "--version"],
capture_output=True,
timeout=10,
)
return result.stdout.decode(errors="replace").strip()
except Exception:
return "unknown"
# ── Recalculation ───────────────────────────────────────────────────────────
def recalculate(
    input_path: str,
    output_path: str,
    timeout: int = 60,
) -> tuple[bool, str]:
    """
    Run LibreOffice headless recalculation on input_path, write result to output_path.
    Returns:
        (success: bool, message: str)
        The message explains what happened (success or failure reason).
    """
    soffice = find_soffice()
    if not soffice:
        return False, (
            "LibreOffice not found. Tier 2 validation is unavailable in this environment. "
            "Install LibreOffice to enable dynamic formula recalculation.\n"
            "  macOS: brew install --cask libreoffice\n"
            "  Linux: sudo apt-get install -y libreoffice"
        )
    version = get_libreoffice_version(soffice)
    # Work on a copy in a temp directory to avoid side effects on the source file.
    # LibreOffice writes the output using the same filename stem in --outdir, so the
    # output goes to a separate subdirectory. Otherwise the input copy itself would
    # satisfy the "output exists" check below even if nothing was written.
    with tempfile.TemporaryDirectory(prefix="xlsx_recalc_") as tmpdir:
        tmp_input = os.path.join(tmpdir, os.path.basename(input_path))
        shutil.copy(input_path, tmp_input)
        out_dir = os.path.join(tmpdir, "recalc_out")
        os.makedirs(out_dir)
        cmd = [
            soffice,
            "--headless",
            "--norestore",  # do not attempt to restore crashed sessions
            "--infilter=Calc MS Excel 2007 XML",
            "--convert-to", "xlsx",
            "--outdir", out_dir,
            tmp_input,
        ]
        try:
            result = subprocess.run(
                cmd,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, (
                f"LibreOffice timed out after {timeout}s. "
                "The file may be too large or contain constructs that cause LibreOffice to hang. "
                "Try increasing --timeout or simplify the file."
            )
        except FileNotFoundError:
            return False, f"LibreOffice binary not executable: {soffice}"
        if result.returncode != 0:
            stderr = result.stderr.decode(errors="replace").strip()
            stdout = result.stdout.decode(errors="replace").strip()
            return False, (
                f"LibreOffice exited with code {result.returncode}.\n"
                f"stderr: {stderr}\n"
                f"stdout: {stdout}"
            )
        # LibreOffice writes: <out_dir>/<stem>.xlsx
        stem = os.path.splitext(os.path.basename(tmp_input))[0]
        tmp_output = os.path.join(out_dir, stem + ".xlsx")
        if not os.path.isfile(tmp_output):
            # Fall back to any .xlsx file in out_dir (LibreOffice may name it differently)
            xlsx_files = [f for f in os.listdir(out_dir) if f.endswith(".xlsx")]
            if xlsx_files:
                tmp_output = os.path.join(out_dir, xlsx_files[0])
            else:
                stdout = result.stdout.decode(errors="replace").strip()
                # Wording avoids "not found", which main() maps to exit code 2.
                return False, (
                    f"LibreOffice exited with code 0 but wrote no output file to {out_dir}.\n"
                    f"stdout: {stdout}\n"
                    f"Files in output dir: {os.listdir(out_dir)}"
                )
        # Copy recalculated file to final destination
        os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True)
        shutil.copy(tmp_output, output_path)
    return True, f"Recalculation complete. LibreOffice {version}. Output: {output_path}"
# ── CLI ─────────────────────────────────────────────────────────────────────
def main() -> None:
parser = argparse.ArgumentParser(
description="LibreOffice headless formula recalculation for xlsx files.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Basic recalculation
python3 libreoffice_recalc.py report.xlsx report_recalc.xlsx
# With extended timeout for large files
python3 libreoffice_recalc.py big_model.xlsx big_model_recalc.xlsx --timeout 120
# Check if LibreOffice is available (useful in CI)
python3 libreoffice_recalc.py --check
# Full validation pipeline
python3 libreoffice_recalc.py input.xlsx /tmp/recalc.xlsx && \\
python3 formula_check.py /tmp/recalc.xlsx
""",
)
parser.add_argument("input", nargs="?", help="Input xlsx file path")
parser.add_argument("output", nargs="?", help="Output xlsx file path (recalculated)")
parser.add_argument(
"--timeout",
type=int,
default=60,
metavar="SECONDS",
help="Maximum time to wait for LibreOffice (default: 60)",
)
parser.add_argument(
"--check",
action="store_true",
help="Only check if LibreOffice is available, then exit",
)
args = parser.parse_args()
# ── --check mode ─────────────────────────────────────────────────────────
if args.check:
soffice = find_soffice()
if soffice:
version = get_libreoffice_version(soffice)
print(f"LibreOffice available: {soffice}")
print(f"Version: {version}")
sys.exit(0)
else:
print("LibreOffice NOT available.")
print("Tier 2 dynamic validation requires LibreOffice.")
print(" macOS: brew install --cask libreoffice")
print(" Linux: sudo apt-get install -y libreoffice")
sys.exit(2)
# ── Recalculation mode ────────────────────────────────────────────────────
if not args.input or not args.output:
parser.print_help()
sys.exit(1)
if not os.path.isfile(args.input):
print(f"ERROR: Input file not found: {args.input}")
sys.exit(1)
print(f"Input : {args.input}")
print(f"Output : {args.output}")
print(f"Timeout: {args.timeout}s")
print()
success, message = recalculate(args.input, args.output, timeout=args.timeout)
if success:
print(f"OK: {message}")
print()
print("Next step: run formula_check.py on the recalculated file to detect runtime errors:")
print(f" python3 formula_check.py {args.output}")
sys.exit(0)
else:
# Distinguish "not installed" (exit 2) from "failed" (exit 1)
if "not found" in message.lower() or "not available" in message.lower():
print(f"SKIP (Tier 2 unavailable): {message}")
sys.exit(2)
else:
print(f"ERROR: {message}")
sys.exit(1)
if __name__ == "__main__":
main()


@@ -0,0 +1,163 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
"""
shared_strings_builder.py — Generate a valid sharedStrings.xml from a list of strings.
Usage (strings as command-line arguments):
python3 shared_strings_builder.py "Revenue" "Cost" "Gross Profit" > sharedStrings.xml
Usage (strings from a file, one per line):
python3 shared_strings_builder.py --file strings.txt > sharedStrings.xml
Usage (print index table instead of XML, for reference):
python3 shared_strings_builder.py --index "Revenue" "Cost" "Gross Profit"
python3 shared_strings_builder.py --index --file strings.txt
Output format:
Valid xl/sharedStrings.xml written to stdout.
Redirect to the correct path:
python3 shared_strings_builder.py "A" "B" > /tmp/xlsx_work/xl/sharedStrings.xml
Notes:
- Strings are de-duplicated: identical strings appear only once in the table.
- The 'count' attribute equals the number of unique strings (appropriate for new files
where each string is used in exactly one cell). If a string appears in multiple cells,
manually increment 'count' by the number of extra references.
- Special characters (&, <, >) are automatically XML-escaped.
- Leading/trailing spaces are preserved with xml:space="preserve".
"""
import sys
import html
import argparse
HEADER = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
SST_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
def escape_text(s: str) -> tuple[str, bool]:
"""
Return (escaped_text, needs_preserve).
needs_preserve is True if the string has leading or trailing whitespace.
"""
escaped = html.escape(s, quote=False)
needs_preserve = s != s.strip()
return escaped, needs_preserve
def build_xml(strings: list[str]) -> str:
"""Build sharedStrings.xml content from a list of unique strings."""
n = len(strings)
lines = [
HEADER,
f'<sst xmlns="{SST_NS}" count="{n}" uniqueCount="{n}">',
]
for i, s in enumerate(strings):
escaped, preserve = escape_text(s)
if preserve:
lines.append(f' <si><t xml:space="preserve">{escaped}</t></si>'
f' <!-- index {i} -->')
else:
lines.append(f' <si><t>{escaped}</t></si> <!-- index {i} -->')
lines.append("</sst>")
return "\n".join(lines) + "\n"
def build_index_table(strings: list[str]) -> str:
"""Return a human-readable index table (for agent reference, not written to file)."""
lines = [
f"{'Index':<6} String",
"-" * 50,
]
for i, s in enumerate(strings):
lines.append(f"{i:<6} {s!r}")
lines.append("")
lines.append(
f"Total: {len(strings)} unique strings. "
"Use these indices in <c t=\"s\"><v>N</v></c> cells."
)
return "\n".join(lines) + "\n"
def deduplicate(strings: list[str]) -> list[str]:
"""Remove duplicates while preserving first-occurrence order."""
seen: set[str] = set()
result: list[str] = []
for s in strings:
if s not in seen:
seen.add(s)
result.append(s)
return result
def load_from_file(path: str) -> list[str]:
"""Read one string per non-empty line from a file."""
with open(path, encoding="utf-8") as f:
return [line.rstrip("\n") for line in f if line.strip()]
def main() -> None:
parser = argparse.ArgumentParser(
description="Generate xl/sharedStrings.xml from a list of strings.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=__doc__,
)
parser.add_argument(
"strings",
nargs="*",
metavar="STRING",
help="String values to include in the shared string table.",
)
parser.add_argument(
"--file",
"-f",
metavar="PATH",
help="Read strings from a file (one string per line) instead of arguments.",
)
parser.add_argument(
"--index",
action="store_true",
help="Print a human-readable index table instead of XML output.",
)
args = parser.parse_args()
if args.file:
try:
raw = load_from_file(args.file)
except FileNotFoundError:
print(f"ERROR: File not found: {args.file}", file=sys.stderr)
sys.exit(1)
except OSError as e:
print(f"ERROR: Cannot read file: {e}", file=sys.stderr)
sys.exit(1)
else:
raw = list(args.strings)
if not raw:
print(
"ERROR: No strings provided.\n"
"Usage: shared_strings_builder.py \"String1\" \"String2\" ...\n"
" or: shared_strings_builder.py --file strings.txt",
file=sys.stderr,
)
sys.exit(1)
strings = deduplicate(raw)
if len(strings) < len(raw):
removed = len(raw) - len(strings)
print(
f"Note: {removed} duplicate(s) removed. "
f"{len(strings)} unique strings in table.",
file=sys.stderr,
)
if args.index:
print(build_index_table(strings))
else:
print(build_xml(strings), end="")
if __name__ == "__main__":
main()


@@ -0,0 +1,575 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
"""
style_audit.py — Financial formatting compliance checker for xlsx files.
Audits an xlsx file (or an unpacked xlsx directory) and reports:
1. Style system integrity: count attributes match actual element counts
2. Color-role violations: formula cells with blue font, input cells with black font
3. Year-format violations: cells containing 4-digit years using comma-format
4. Percentage value violations: percentage-formatted cells with values > 1 (likely meant 0.08 not 8)
5. Style index out-of-range: s attribute exceeds cellXfs count
6. fills[0]/fills[1] presence check (OOXML spec requirement)
Usage:
python3 style_audit.py input.xlsx # audit a packed xlsx
python3 style_audit.py /tmp/xlsx_work/ # audit an unpacked directory
python3 style_audit.py input.xlsx --json # machine-readable output
python3 style_audit.py input.xlsx --summary # counts only, no detail
Exit code:
0 — no violations found
1 — violations detected (or file cannot be opened)
"""
import sys
import os
import zipfile
import xml.etree.ElementTree as ET
import json
import re
import tempfile
import shutil
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NSP = f"{{{NS}}}"
# Predefined style index semantics from minimal_xlsx template.
# Maps cellXfs index -> (role, font_color_expectation, numFmt_type)
# role: "input" = blue expected, "formula" = black/green expected, "header" = any, "any" = skip
TEMPLATE_SLOT_ROLES = {
0: ("any", None, None),
1: ("input", "blue", "general"),
2: ("formula", "black", "general"),
3: ("formula", "green", "general"),
4: ("any", None, "general"), # header
5: ("input", "blue", "currency"),
6: ("formula", "black", "currency"),
7: ("input", "blue", "percent"),
8: ("formula", "black", "percent"),
9: ("input", "blue", "integer"),
10: ("formula", "black", "integer"),
11: ("input", "blue", "year"),
12: ("input", "blue", "general"), # highlight
}
# AARRGGBB values for each role color
BLUE_RGB = "000000ff"
BLACK_RGB = "00000000"
GREEN_RGB = "00008000"
RED_RGB = "00ff0000"
# numFmtIds that represent percentage formats (built-in + common custom)
PERCENT_FMT_IDS = {9, 10, 165, 170}
# numFmtIds that use comma separator (would corrupt year display)
COMMA_FMT_IDS = {3, 4, 167, 168} # #,##0 style — 4-digit years would show as 2,024
def _parse_styles(styles_xml: bytes) -> dict:
"""Parse styles.xml and return structured data."""
root = ET.fromstring(styles_xml)
def find(tag):
return root.find(f"{NSP}{tag}")
# numFmts
num_fmts = {} # id -> formatCode
nf_elem = find("numFmts")
if nf_elem is not None:
declared_count = int(nf_elem.get("count", "0"))
actual_count = len(nf_elem)
for nf in nf_elem:
fid = int(nf.get("numFmtId", "0"))
num_fmts[fid] = nf.get("formatCode", "")
else:
declared_count = 0
actual_count = 0
# fonts — extract color and bold flag
fonts = []
fonts_elem = find("fonts")
fonts_declared = 0
if fonts_elem is not None:
fonts_declared = int(fonts_elem.get("count", "0"))
for font in fonts_elem:
color_elem = font.find(f"{NSP}color")
bold_elem = font.find(f"{NSP}b")
if color_elem is not None:
rgb = color_elem.get("rgb", "").lower()
theme = color_elem.get("theme")
else:
rgb = ""
theme = None
fonts.append({
"rgb": rgb,
"theme": theme,
"bold": bold_elem is not None,
})
# fills
fills = []
fills_elem = find("fills")
fills_declared = 0
if fills_elem is not None:
fills_declared = int(fills_elem.get("count", "0"))
for fill in fills_elem:
pf = fill.find(f"{NSP}patternFill")
pattern_type = pf.get("patternType", "") if pf is not None else ""
fills.append({"patternType": pattern_type})
# cellXfs
xfs = []
xfs_elem = find("cellXfs")
xfs_declared = 0
if xfs_elem is not None:
xfs_declared = int(xfs_elem.get("count", "0"))
for xf in xfs_elem:
xfs.append({
"numFmtId": int(xf.get("numFmtId", "0")),
"fontId": int(xf.get("fontId", "0")),
"fillId": int(xf.get("fillId", "0")),
"borderId": int(xf.get("borderId", "0")),
})
return {
"num_fmts": num_fmts,
"num_fmts_declared": declared_count,
"num_fmts_actual": actual_count,
"fonts": fonts,
"fonts_declared": fonts_declared,
"fonts_actual": len(fonts),
"fills": fills,
"fills_declared": fills_declared,
"fills_actual": len(fills),
"xfs": xfs,
"xfs_declared": xfs_declared,
"xfs_actual": len(xfs),
}
def _is_blue_font(font: dict) -> bool:
return font["rgb"] == BLUE_RGB
def _is_black_font(font: dict) -> bool:
return font["rgb"] == BLACK_RGB or (font["rgb"] == "" and font["theme"] is not None)
def _is_green_font(font: dict) -> bool:
return font["rgb"] == GREEN_RGB
def _fmt_is_percent(num_fmt_id: int, num_fmts: dict) -> bool:
if num_fmt_id in PERCENT_FMT_IDS:
return True
fmt_code = num_fmts.get(num_fmt_id, "")
return "%" in fmt_code
def _fmt_is_comma(num_fmt_id: int, num_fmts: dict) -> bool:
if num_fmt_id in COMMA_FMT_IDS:
return True
fmt_code = num_fmts.get(num_fmt_id, "")
# formatCode has comma separator if it contains #,##0 but not a trailing , (scale)
return "#,##" in fmt_code and not fmt_code.endswith(",") and not fmt_code.endswith(",\"M\"") and not fmt_code.endswith(",\"K\"")
def _looks_like_year(value_text: str) -> bool:
"""True if value is a 4-digit year between 1900 and 2100."""
try:
v = int(float(value_text))
return 1900 <= v <= 2100
except (ValueError, TypeError):
return False
def _audit(styles_xml: bytes, sheet_xmls: list[tuple[str, bytes]]) -> dict:
"""
Run all formatting compliance checks.
Args:
styles_xml: content of xl/styles.xml
sheet_xmls: list of (sheet_name, xml_bytes) for each worksheet
Returns:
dict with violations and summary
"""
results = {
"violations": [],
"warnings": [],
"summary": {},
}
v = results["violations"]
w = results["warnings"]
styles = _parse_styles(styles_xml)
fonts = styles["fonts"]
xfs = styles["xfs"]
num_fmts = styles["num_fmts"]
# ── Check A: count attribute integrity ──────────────────────────────────
if styles["fonts_declared"] != styles["fonts_actual"]:
v.append({
"type": "count_mismatch",
"element": "fonts",
"declared": styles["fonts_declared"],
"actual": styles["fonts_actual"],
"fix": f"Update <fonts count=\"{styles['fonts_actual']}\">",
})
if styles["fills_declared"] != styles["fills_actual"]:
v.append({
"type": "count_mismatch",
"element": "fills",
"declared": styles["fills_declared"],
"actual": styles["fills_actual"],
"fix": f"Update <fills count=\"{styles['fills_actual']}\">",
})
if styles["xfs_declared"] != styles["xfs_actual"]:
v.append({
"type": "count_mismatch",
"element": "cellXfs",
"declared": styles["xfs_declared"],
"actual": styles["xfs_actual"],
"fix": f"Update <cellXfs count=\"{styles['xfs_actual']}\">",
})
# ── Check B: fills[0] and fills[1] presence ──────────────────────────────
fills = styles["fills"]
if len(fills) < 2:
v.append({
"type": "missing_required_fills",
"detail": "fills[0] (none) and fills[1] (gray125) are required by OOXML spec",
"fix": "Prepend <fill><patternFill patternType='none'/></fill> and <fill><patternFill patternType='gray125'/></fill>",
})
else:
if fills[0].get("patternType") != "none":
v.append({
"type": "fills_0_corrupted",
"detail": f"fills[0] patternType='{fills[0].get('patternType')}', must be 'none'",
"fix": "Set fills[0] patternFill patternType to 'none'",
})
if fills[1].get("patternType") != "gray125":
v.append({
"type": "fills_1_corrupted",
"detail": f"fills[1] patternType='{fills[1].get('patternType')}', must be 'gray125'",
"fix": "Set fills[1] patternFill patternType to 'gray125'",
})
# ── Check C: per-cell style violations ───────────────────────────────────
total_cells = 0
formula_cells = 0
input_cells = 0
for sheet_name, sheet_xml in sheet_xmls:
ws = ET.fromstring(sheet_xml)
for cell in ws.findall(f".//{NSP}c"):
cell_ref = cell.get("r", "?")
s_attr = cell.get("s")
has_formula = cell.find(f"{NSP}f") is not None
v_elem = cell.find(f"{NSP}v")
value_text = v_elem.text if v_elem is not None else None
total_cells += 1
# Skip cells with no style
if s_attr is None:
continue
try:
s_idx = int(s_attr)
except ValueError:
continue
# Check C1: s index out of range
if s_idx >= len(xfs):
v.append({
"type": "style_index_out_of_range",
"sheet": sheet_name,
"cell": cell_ref,
"s": s_idx,
"cellXfs_count": len(xfs),
"fix": f"s={s_idx} exceeds cellXfs count={len(xfs)}; add missing <xf> entries or lower s value",
})
continue
xf = xfs[s_idx]
font_id = xf["fontId"]
num_fmt_id = xf["numFmtId"]
if font_id >= len(fonts):
v.append({
"type": "font_index_out_of_range",
"sheet": sheet_name,
"cell": cell_ref,
"fontId": font_id,
"fonts_count": len(fonts),
"fix": f"fontId={font_id} exceeds fonts count={len(fonts)}; add missing <font> entries",
})
continue
font = fonts[font_id]
                # Check C2: color-role violation (formula cell with blue font)
                if has_formula and _is_blue_font(font):
                    # formula_cells is counted once per cell at the end of the loop body.
                    f_elem = cell.find(f"{NSP}f")
                    # <f> can be empty (e.g. shared-formula followers), so guard against None
                    formula_text = (f_elem.text or "") if f_elem is not None else ""
                    v.append({
                        "type": "formula_cell_blue_font",
                        "sheet": sheet_name,
                        "cell": cell_ref,
                        "s": s_idx,
                        "formula": formula_text,
                        "fix": "Formula cells must use black font (formula) or green font (cross-sheet ref). "
                               "Use style index 2/6/8/10 (black) or 3/13 (green) instead.",
                    })
# Check C3: color-role violation — non-formula cell with explicit black
# (only flag if it looks like it should be an input — has a numeric value)
if (not has_formula and _is_black_font(font)
and value_text is not None
and not font.get("bold")
and num_fmt_id not in (0,) # skip general-format black (could be label)
):
try:
float(value_text)
# It's a numeric value with black font — possible missing blue input marker
w.append({
"type": "numeric_input_may_lack_blue",
"sheet": sheet_name,
"cell": cell_ref,
"s": s_idx,
"value": value_text,
"note": "Hardcoded numeric value has black font — if this is a user-editable "
"assumption, change to blue-font input style (e.g. s=1/5/7/9/11/12).",
})
except (ValueError, TypeError):
pass
# Check C4: year value with comma-formatted numFmt
if value_text and _looks_like_year(value_text) and _fmt_is_comma(num_fmt_id, num_fmts):
v.append({
"type": "year_with_comma_format",
"sheet": sheet_name,
"cell": cell_ref,
"s": s_idx,
"value": value_text,
"numFmtId": num_fmt_id,
"fix": "Year values must use numFmtId=1 (format '0') to display as 2024 not 2,024. "
"Use style index 11 or a custom xf with numFmtId=1.",
})
                # Check C5: percentage format with value > 1 (likely 8 instead of 0.08)
                if value_text and _fmt_is_percent(num_fmt_id, num_fmts):
                    try:
                        pct_val = float(value_text)
                    except (ValueError, TypeError):
                        pct_val = None
                    if pct_val is not None and pct_val > 1.0:
                        displayed = f"{pct_val * 100:.0f}%"
                        w.append({
                            "type": "percent_value_gt_1",
                            "sheet": sheet_name,
                            "cell": cell_ref,
                            "s": s_idx,
                            "value": value_text,
                            "displayed_as": displayed,
                            "note": f"Value {value_text} with percentage format displays as {displayed}. "
                                    f"If intended rate is ~{pct_val:.0f}%, store as {pct_val / 100:.4f} instead.",
                        })
if has_formula:
formula_cells += 1
elif value_text is not None:
input_cells += 1
results["summary"] = {
"total_cells_inspected": total_cells,
"formula_cells": formula_cells,
"input_cells": input_cells,
"violations": len(v),
"warnings": len(w),
}
return results
def _load_from_xlsx(xlsx_path: str) -> tuple[bytes, list[tuple[str, bytes]]]:
    """Load styles.xml and all sheet XMLs from a packed xlsx file."""
    rel_ns = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    with zipfile.ZipFile(xlsx_path, "r") as z:
        styles_xml = z.read("xl/styles.xml")
        wb = ET.fromstring(z.read("xl/workbook.xml"))
        rels = ET.fromstring(z.read("xl/_rels/workbook.xml.rels"))
        # Map relationship id -> sheet name (from workbook.xml)
        rid_to_name = {}
        for sheet in wb.findall(f".//{{{NS}}}sheet"):
            rid_to_name[sheet.get(f"{{{rel_ns}}}id", "")] = sheet.get("name", "")
        # Map relationship id -> zip member path (from workbook.xml.rels)
        rid_to_path = {}
        for rel in rels:
            target = rel.get("Target", "")
            if "worksheets" in target:
                # A leading "/" means the target is relative to the package root;
                # otherwise it is relative to xl/.
                if target.startswith("/"):
                    target = target.lstrip("/")
                elif not target.startswith("xl/"):
                    target = "xl/" + target
                rid_to_path[rel.get("Id", "")] = target
        names = set(z.namelist())
        sheet_xmls = []
        for rid, name in rid_to_name.items():
            path = rid_to_path.get(rid)
            if path and path in names:
                sheet_xmls.append((name, z.read(path)))
    return styles_xml, sheet_xmls
def _load_from_dir(unpacked_dir: str) -> tuple[bytes, list[tuple[str, bytes]]]:
    """Load styles.xml and all sheet XMLs from an unpacked directory."""
    rel_ns = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xl_dir = os.path.join(unpacked_dir, "xl")
    with open(os.path.join(xl_dir, "styles.xml"), "rb") as f:
        styles_xml = f.read()
    # ET.parse opens and closes the files itself, so no handles are left open
    wb = ET.parse(os.path.join(xl_dir, "workbook.xml")).getroot()
    rels = ET.parse(os.path.join(xl_dir, "_rels", "workbook.xml.rels")).getroot()
    # Map relationship id -> sheet name (from workbook.xml)
    rid_to_name = {}
    for sheet in wb.findall(f".//{{{NS}}}sheet"):
        rid_to_name[sheet.get(f"{{{rel_ns}}}id", "")] = sheet.get("name", "")
    # Map relationship id -> relationship target (from workbook.xml.rels)
    rid_to_path = {}
    for rel in rels:
        target = rel.get("Target", "")
        if "worksheets" in target:
            rid_to_path[rel.get("Id", "")] = target
    sheet_xmls = []
    for rid, name in rid_to_name.items():
        rel_path = rid_to_path.get(rid, "")
        if not rel_path:
            # No worksheet relationship for this sheet (e.g. a chartsheet): skip it
            continue
        # rel_path is usually "worksheets/sheet1.xml"; for any other form
        # (e.g. "/xl/worksheets/sheet1.xml") fall back to the file's basename.
        if rel_path.startswith("worksheets/"):
            full = os.path.join(xl_dir, rel_path)
        else:
            full = os.path.join(xl_dir, "worksheets", os.path.basename(rel_path))
        if os.path.isfile(full):
            with open(full, "rb") as f:
                sheet_xmls.append((name, f.read()))
    return styles_xml, sheet_xmls
def main() -> None:
use_json = "--json" in sys.argv
summary_only = "--summary" in sys.argv
args_clean = [a for a in sys.argv[1:] if not a.startswith("--")]
if not args_clean:
print("Usage: style_audit.py <input.xlsx | unpacked_dir/> [--json] [--summary]")
sys.exit(1)
target = args_clean[0]
    try:
        if os.path.isdir(target):
            styles_xml, sheet_xmls = _load_from_dir(target)
        elif target.lower().endswith((".xlsx", ".xlsm")):
            styles_xml, sheet_xmls = _load_from_xlsx(target)
        else:
            # Errors go to stderr so that --json output on stdout stays parseable
            print(f"ERROR: unrecognized target '{target}': must be an .xlsx/.xlsm file "
                  "or an unpacked directory", file=sys.stderr)
            sys.exit(1)
    except Exception as e:
        print(f"ERROR loading file: {e}", file=sys.stderr)
        sys.exit(1)
results = _audit(styles_xml, sheet_xmls)
if use_json:
print(json.dumps(results, indent=2, ensure_ascii=False))
sys.exit(1 if results["summary"]["violations"] > 0 else 0)
# Human-readable output
s = results["summary"]
print(f"Target : {target}")
print(f"Cells : {s['total_cells_inspected']} inspected "
f"({s['formula_cells']} formula, {s['input_cells']} input)")
print(f"Violations : {s['violations']}")
print(f"Warnings : {s['warnings']}")
if not summary_only:
if results["violations"]:
print("\n── Violations (must fix) ──")
for item in results["violations"]:
t = item["type"]
if t == "count_mismatch":
print(f" [FAIL] {item['element']} count mismatch: declared={item['declared']}, "
f"actual={item['actual']}")
print(f" Fix: {item['fix']}")
elif t == "missing_required_fills":
print(f" [FAIL] {item['detail']}")
print(f" Fix: {item['fix']}")
elif t in ("fills_0_corrupted", "fills_1_corrupted"):
print(f" [FAIL] {item['detail']}")
print(f" Fix: {item['fix']}")
elif t == "formula_cell_blue_font":
print(f" [FAIL] [{item['sheet']}!{item['cell']}] formula cell has blue font "
f"(role=input, but cell contains formula: {item.get('formula', '')})")
print(f" Fix: {item['fix']}")
elif t == "style_index_out_of_range":
print(f" [FAIL] [{item['sheet']}!{item['cell']}] s={item['s']} but "
f"cellXfs count={item['cellXfs_count']}")
print(f" Fix: {item['fix']}")
elif t == "font_index_out_of_range":
print(f" [FAIL] [{item['sheet']}!{item['cell']}] fontId={item['fontId']} but "
f"fonts count={item['fonts_count']}")
print(f" Fix: {item['fix']}")
elif t == "year_with_comma_format":
print(f" [FAIL] [{item['sheet']}!{item['cell']}] year value {item['value']} "
f"uses comma-format (numFmtId={item['numFmtId']}) — will display as "
f"{int(float(item['value'])):,}")
print(f" Fix: {item['fix']}")
else:
print(f" [FAIL] {item}")
if results["warnings"] and not summary_only:
print("\n── Warnings (review recommended) ──")
for item in results["warnings"]:
t = item["type"]
if t == "numeric_input_may_lack_blue":
print(f" [WARN] [{item['sheet']}!{item['cell']}] numeric value={item['value']} "
f"has black font — if user-editable assumption, use blue-font input style")
elif t == "percent_value_gt_1":
print(f" [WARN] [{item['sheet']}!{item['cell']}] percent-format cell has "
f"value={item['value']} (displays as {item['displayed_as']}) — "
f"likely should be stored as decimal (e.g. 0.08 for 8%)")
else:
print(f" [WARN] {item}")
print()
if s["violations"] == 0:
if s["warnings"] == 0:
print("PASS — Financial formatting is compliant")
else:
print(f"PASS with WARN — {s['warnings']} warning(s) need review")
else:
print(f"FAIL — {s['violations']} violation(s) must be fixed before delivery")
sys.exit(1)
if __name__ == "__main__":
main()
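
The count-integrity and required-fill checks (Checks A and B in `_audit`) can be tried in isolation on an in-memory fragment. The sketch below is an illustration only and is not part of `style_audit.py`: `count_mismatches` and `required_fills_ok` are hypothetical helper names, and the `styles.xml` fragment is made up for the example.

```python
import xml.etree.ElementTree as ET

NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NSP = f"{{{NS}}}"

# Hypothetical styles.xml fragment: <fonts> declares 3 entries but holds only 2.
STYLES_XML = f"""<styleSheet xmlns="{NS}">
  <fonts count="3"><font/><font/></fonts>
  <fills count="2">
    <fill><patternFill patternType="none"/></fill>
    <fill><patternFill patternType="gray125"/></fill>
  </fills>
  <cellXfs count="1"><xf numFmtId="0" fontId="0" fillId="0" borderId="0"/></cellXfs>
</styleSheet>"""


def count_mismatches(root: ET.Element) -> list[tuple[str, int, int]]:
    """Return (element, declared, actual) for each container whose count is wrong."""
    out = []
    for name in ("fonts", "fills", "cellXfs"):
        elem = root.find(f"{NSP}{name}")
        if elem is None:
            continue
        declared = int(elem.get("count", "0"))
        actual = len(elem)  # number of child elements
        if declared != actual:
            out.append((name, declared, actual))
    return out


def required_fills_ok(root: ET.Element) -> bool:
    """True if fills[0] is 'none' and fills[1] is 'gray125'."""
    fills = root.find(f"{NSP}fills")
    if fills is None or len(fills) < 2:
        return False
    types = []
    for fill in list(fills)[:2]:
        pf = fill.find(f"{NSP}patternFill")
        types.append(pf.get("patternType", "") if pf is not None else "")
    return types == ["none", "gray125"]


root = ET.fromstring(STYLES_XML)
print(count_mismatches(root))
print(required_fills_ok(root))
```

Comparing the declared `count` with the number of child elements is the same test the audit applies before it looks at any cell, since a wrong count or a missing reserved fill affects every style index that follows.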

@@ -0,0 +1,395 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
"""
xlsx_add_column.py — Add a new column to a worksheet in an unpacked xlsx.
Usage examples:
# Add a percentage column with formulas and number format
python3 xlsx_add_column.py /tmp/work/ --col G \\
--sheet "Budget FY2025" \\
--header "% of Total" \\
--formula '=F{row}/$F$10' --formula-rows 2:9 \\
--total-row 10 --total-formula '=SUM(G2:G9)' \\
--numfmt '0.0%'
What it does:
1. Adds header cell (copies style from previous column's header)
2. Adds formula cells for the specified row range
3. Adds a total formula cell if specified
4. Creates a new cell style with the given numfmt if needed
5. Updates sharedStrings.xml for header text
6. Updates dimension ref and column definitions
IMPORTANT: Run on an UNPACKED directory (from xlsx_unpack.py).
After running, repack with xlsx_pack.py.
"""
import argparse
import copy
import os
import re
import sys
import xml.dom.minidom
import xml.etree.ElementTree as ET
NS_SS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
ET.register_namespace('', NS_SS)
ET.register_namespace('r', NS_REL)
ET.register_namespace('xdr', 'http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing')
ET.register_namespace('x14', 'http://schemas.microsoft.com/office/spreadsheetml/2009/9/main')
ET.register_namespace('xr2', 'http://schemas.microsoft.com/office/spreadsheetml/2015/revision2')
ET.register_namespace('mc', 'http://schemas.openxmlformats.org/markup-compatibility/2006')
def _tag(local: str) -> str:
return f"{{{NS_SS}}}{local}"
def _write_tree(tree: ET.ElementTree, path: str) -> None:
tree.write(path, encoding="unicode", xml_declaration=False)
with open(path, "r", encoding="utf-8") as fh:
raw = fh.read()
try:
dom = xml.dom.minidom.parseString(raw.encode("utf-8"))
pretty = dom.toprettyxml(indent=" ", encoding="utf-8").decode("utf-8")
lines = [line for line in pretty.splitlines() if line.strip()]
with open(path, "w", encoding="utf-8") as fh:
fh.write("\n".join(lines) + "\n")
except Exception:
pass
def col_number(s: str) -> int:
n = 0
for c in s.upper():
n = n * 26 + (ord(c) - 64)
return n
def col_letter(n: int) -> str:
r = ""
while n > 0:
n, rem = divmod(n - 1, 26)
r = chr(65 + rem) + r
return r
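def find_ws_path(work_dir: str, sheet_name: str | None) -> str:
    """Resolve a sheet name (or the first sheet if None) to its worksheet XML path."""
    wb_tree = ET.parse(os.path.join(work_dir, "xl", "workbook.xml"))
    rid = None
    for sheet in wb_tree.getroot().iter(_tag("sheet")):
        if sheet_name is None or sheet.get("name") == sheet_name:
            rid = sheet.get(f"{{{NS_REL}}}id")
            break
    if rid is None:
        print(f"ERROR: Sheet not found: {sheet_name}")
        sys.exit(1)
    rels_tree = ET.parse(os.path.join(work_dir, "xl", "_rels", "workbook.xml.rels"))
    for rel in rels_tree.getroot():
        if rel.get("Id") == rid:
            target = rel.get("Target", "")
            # A leading "/" means the target is relative to the package root,
            # not to xl/; os.path.join would otherwise discard work_dir.
            if target.startswith("/"):
                return os.path.join(work_dir, *target.strip("/").split("/"))
            return os.path.join(work_dir, "xl", *target.split("/"))
    print(f"ERROR: Relationship not found: {rid}")
    sys.exit(1)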
def add_shared_string(work_dir: str, text: str) -> int:
ss_path = os.path.join(work_dir, "xl", "sharedStrings.xml")
tree = ET.parse(ss_path)
root = tree.getroot()
idx = 0
for si in root.findall(_tag("si")):
t_el = si.find(_tag("t"))
if t_el is not None and t_el.text == text:
return idx
idx += 1
si = ET.SubElement(root, _tag("si"))
t = ET.SubElement(si, _tag("t"))
t.set("{http://www.w3.org/XML/1998/namespace}space", "preserve")
t.text = text
root.set("count", str(int(root.get("count", "0")) + 1))
root.set("uniqueCount", str(int(root.get("uniqueCount", "0")) + 1))
_write_tree(tree, ss_path)
return idx
def get_cell_style(ws_tree: ET.ElementTree, col: str, row: int) -> int:
ref = f"{col}{row}"
for row_el in ws_tree.getroot().iter(_tag("row")):
if row_el.get("r") == str(row):
for c in row_el:
if c.get("r") == ref:
return int(c.get("s", "0"))
return 0
def ensure_numfmt_style(work_dir: str, ref_style_idx: int, numfmt_code: str) -> int:
"""Clone a cellXfs entry with the given numfmt. Returns new style index."""
styles_path = os.path.join(work_dir, "xl", "styles.xml")
tree = ET.parse(styles_path)
root = tree.getroot()
# Find or add numFmt
numfmts = root.find(_tag("numFmts"))
numfmt_id = None
if numfmts is not None:
for nf in numfmts:
if nf.get("formatCode") == numfmt_code:
numfmt_id = int(nf.get("numFmtId"))
break
if numfmt_id is None:
max_id = 163
if numfmts is not None:
for nf in numfmts:
max_id = max(max_id, int(nf.get("numFmtId", "0")))
else:
numfmts = ET.SubElement(root, _tag("numFmts"))
numfmts.set("count", "0")
root.remove(numfmts)
root.insert(0, numfmts)
numfmt_id = max_id + 1
nf = ET.SubElement(numfmts, _tag("numFmt"))
nf.set("numFmtId", str(numfmt_id))
nf.set("formatCode", numfmt_code)
numfmts.set("count", str(len(list(numfmts))))
# Find or create cellXfs entry
cellxfs = root.find(_tag("cellXfs"))
xf_list = list(cellxfs)
ref_xf = xf_list[min(ref_style_idx, len(xf_list) - 1)]
for i, xf in enumerate(xf_list):
if (xf.get("numFmtId") == str(numfmt_id) and
xf.get("fontId") == ref_xf.get("fontId") and
xf.get("fillId") == ref_xf.get("fillId") and
xf.get("borderId") == ref_xf.get("borderId")):
return i
new_xf = copy.deepcopy(ref_xf)
new_xf.set("numFmtId", str(numfmt_id))
new_xf.set("applyNumberFormat", "true")
cellxfs.append(new_xf)
cellxfs.set("count", str(len(list(cellxfs))))
_write_tree(tree, styles_path)
return len(list(cellxfs)) - 1
def _apply_border_to_row(work_dir: str, ws_path: str, ws_tree: ET.ElementTree,
                         ws_root: ET.Element, row_map: dict, border_row: int,
                         border_style: str, new_col: str) -> None:
    """Apply a top border to every existing cell in the specified row.

    Only cells already present in the row are restyled; no cells are created
    for empty columns. The cloned styles point at a border that has only a top
    edge, so any other edges the cell had are dropped.
    ws_path, ws_tree and ws_root are kept for call compatibility but are unused.
    """
    # Nothing to do if the row does not exist; check before touching styles.xml
    if border_row not in row_map:
        return
    row_el = row_map[border_row]
    styles_path = os.path.join(work_dir, "xl", "styles.xml")
    st_tree = ET.parse(styles_path)
    st_root = st_tree.getroot()
    # 1. Create a new border entry with the specified top style
    #    (children in schema order: left, right, top, bottom, diagonal)
    borders = st_root.find(_tag("borders"))
    new_border = ET.SubElement(borders, _tag("border"))
    for side in ("left", "right"):
        ET.SubElement(new_border, _tag(side))
    top_el = ET.SubElement(new_border, _tag("top"))
    top_el.set("style", border_style)
    ET.SubElement(new_border, _tag("bottom"))
    ET.SubElement(new_border, _tag("diagonal"))
    borders.set("count", str(len(borders)))
    new_border_id = len(borders) - 1
    # 2. For each existing style used in the row, create a clone with the new borderId
    cellxfs = st_root.find(_tag("cellXfs"))
    style_remap = {}  # old_style_idx -> new_style_idx
    for c in row_el:
        old_s = int(c.get("s", "0"))
        if old_s not in style_remap:
            ref_xf = cellxfs[min(old_s, len(cellxfs) - 1)]
            new_xf = copy.deepcopy(ref_xf)
            new_xf.set("borderId", str(new_border_id))
            new_xf.set("applyBorder", "true")
            cellxfs.append(new_xf)
            style_remap[old_s] = len(cellxfs) - 1
    cellxfs.set("count", str(len(cellxfs)))
    # 3. Apply remapped styles to all cells in the row
    for c in row_el:
        c.set("s", str(style_remap[int(c.get("s", "0"))]))
    _write_tree(st_tree, styles_path)
    print(f"  Applied {border_style} top border to the existing cells in row {border_row} "
          f"(up to column {new_col}, {len(style_remap)} style(s) cloned)")
def main() -> None:
parser = argparse.ArgumentParser(
description="Add a column to a worksheet in an unpacked xlsx")
parser.add_argument("work_dir", help="Unpacked xlsx working directory")
parser.add_argument("--col", required=True, help="Column letter (e.g., G)")
parser.add_argument("--sheet", default=None, help="Sheet name (default: first)")
parser.add_argument("--header", default=None, help="Header text for row 1")
parser.add_argument("--formula", default=None,
help="Formula template with {row} placeholder")
parser.add_argument("--formula-rows", default=None,
help="Row range for formulas (e.g., 2:9)")
parser.add_argument("--total-row", type=int, default=None,
help="Row number for total formula")
parser.add_argument("--total-formula", default=None,
help="Formula for total row")
parser.add_argument("--numfmt", default=None,
help="Number format for data/total cells (e.g., 0.0%%)")
parser.add_argument("--border-row", type=int, default=None,
help="Row to apply a top border to ALL cells (e.g., 10)")
parser.add_argument("--border-style", default="medium",
help="Border style: thin, medium, thick (default: medium)")
args = parser.parse_args()
col = args.col.upper()
prev_col = col_letter(col_number(col) - 1) if col_number(col) > 1 else "A"
ws_path = find_ws_path(args.work_dir, args.sheet)
ws_tree = ET.parse(ws_path)
changes = 0
print(f"Adding column {col} to {os.path.basename(ws_path)}")
# Resolve styles from previous column
header_style = get_cell_style(ws_tree, prev_col, 1) if args.header else 0
data_style = None
if args.formula_rows:
start_row = int(args.formula_rows.split(":")[0])
ref = get_cell_style(ws_tree, prev_col, start_row)
data_style = (ensure_numfmt_style(args.work_dir, ref, args.numfmt)
if args.numfmt else ref)
total_style = None
if args.total_row:
ref = get_cell_style(ws_tree, prev_col, args.total_row)
total_style = (ensure_numfmt_style(args.work_dir, ref, args.numfmt)
if args.numfmt else ref)
# Add header to sharedStrings
header_idx = add_shared_string(args.work_dir, args.header) if args.header else None
# Re-parse worksheet (sharedStrings write may have changed state)
ws_tree = ET.parse(ws_path)
root = ws_tree.getroot()
sheet_data = root.find(_tag("sheetData"))
row_map = {}
for row_el in sheet_data:
r = row_el.get("r")
if r:
row_map[int(r)] = row_el
# Add header cell
if args.header and 1 in row_map:
cell = ET.SubElement(row_map[1], _tag("c"))
cell.set("r", f"{col}1")
cell.set("s", str(header_style))
cell.set("t", "s")
v = ET.SubElement(cell, _tag("v"))
v.text = str(header_idx)
changes += 1
print(f" {col}1 = \"{args.header}\" (header, style={header_style})")
# Add formula cells
if args.formula and args.formula_rows:
start, end = map(int, args.formula_rows.split(":"))
for row_num in range(start, end + 1):
if row_num not in row_map:
row_el = ET.SubElement(sheet_data, _tag("row"))
row_el.set("r", str(row_num))
row_map[row_num] = row_el
formula_text = args.formula.replace("{row}", str(row_num))
formula_text = formula_text.lstrip("=")
cell = ET.SubElement(row_map[row_num], _tag("c"))
cell.set("r", f"{col}{row_num}")
if data_style is not None:
cell.set("s", str(data_style))
f_el = ET.SubElement(cell, _tag("f"))
f_el.text = formula_text
changes += 1
print(f" {col}{start}:{col}{end} = formulas (style={data_style})")
# Add total formula
if args.total_row and args.total_formula:
if args.total_row not in row_map:
row_el = ET.SubElement(sheet_data, _tag("row"))
row_el.set("r", str(args.total_row))
row_map[args.total_row] = row_el
total_f = args.total_formula.lstrip("=")
cell = ET.SubElement(row_map[args.total_row], _tag("c"))
cell.set("r", f"{col}{args.total_row}")
if total_style is not None:
cell.set("s", str(total_style))
f_el = ET.SubElement(cell, _tag("f"))
f_el.text = total_f
changes += 1
print(f" {col}{args.total_row} = ={total_f} (style={total_style})")
    # Update dimension
    for dim in root.iter(_tag("dimension")):
        old_ref = dim.get("ref", "")
        if ":" in old_ref:
            start_ref, end_ref = old_ref.split(":")
            end_col_str = re.match(r"([A-Z]+)", end_ref).group(1)
            end_row_str = re.search(r"(\d+)", end_ref).group(1)
            if col_number(col) > col_number(end_col_str):
                new_ref = f"{start_ref}:{col}{end_row_str}"
                dim.set("ref", new_ref)
                print(f"  Dimension: {old_ref} -> {new_ref}")
# Extend <cols> to cover new column
cols_el = root.find(_tag("cols"))
if cols_el is not None:
new_col_num = col_number(col)
covered = any(
int(c.get("min", "0")) <= new_col_num <= int(c.get("max", "0"))
for c in cols_el
)
if not covered:
prev_num = col_number(prev_col)
for c in cols_el:
if int(c.get("min", "0")) <= prev_num <= int(c.get("max", "0")):
new_col_def = copy.deepcopy(c)
new_col_def.set("min", str(new_col_num))
new_col_def.set("max", str(new_col_num))
cols_el.append(new_col_def)
print(f" Added <col> definition for column {col}")
break
# Apply border to entire row if requested
if args.border_row:
_apply_border_to_row(args.work_dir, ws_path, ws_tree, root,
row_map, args.border_row, args.border_style,
col)
_write_tree(ws_tree, ws_path)
print(f"\nDone. {changes} cells added.")
print(f"\nNext: python3 xlsx_pack.py {args.work_dir} output.xlsx")
if __name__ == "__main__":
main()
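
The column arithmetic the script relies on (letter/number conversion, `{row}` substitution, and widening the `<dimension>` ref) can be exercised without an unpacked workbook. This is a minimal sketch under that assumption: `col_number` and `col_letter` mirror the functions above, while `extend_dimension` is a hypothetical helper written only for this example.

```python
import re


def col_number(letters: str) -> int:
    """Convert a column letter to its 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def col_letter(n: int) -> str:
    """Inverse of col_number (1 -> A, 27 -> AA)."""
    out = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        out = chr(ord("A") + rem) + out
    return out


def extend_dimension(ref: str, new_col: str) -> str:
    """Widen a range such as 'A1:F10' so that it covers new_col (hypothetical helper)."""
    if ":" not in ref:
        return ref  # single-cell refs are left alone, as in the script
    start, end = ref.split(":")
    m = re.fullmatch(r"([A-Z]+)(\d+)", end)
    if m is None:
        return ref
    end_col, end_row = m.groups()
    if col_number(new_col) > col_number(end_col):
        return f"{start}:{new_col}{end_row}"
    return ref


print(col_number("AA"), col_letter(28))
print(extend_dimension("A1:F10", "G"))
# The formula template is expanded per row and stored without the leading "="
print("=F{row}/$F$10".replace("{row}", "2").lstrip("="))
```

Columns are treated as bijective base-26 numbers, which is why `col_letter` subtracts 1 before each `divmod`: there is no zero digit, so `Z` is 26 and `AA` follows it directly.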

@@ -0,0 +1,274 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
"""
xlsx_insert_row.py — Insert a new data row into a worksheet in an unpacked xlsx.
Usage examples:
# Insert "Utilities" row at position 6, copying styles from row 5
python3 xlsx_insert_row.py /tmp/work/ --at 6 \\
--sheet "Budget FY2025" \\
--text A=Utilities \\
--values B=3000 C=3000 D=3500 E=3500 \\
--formula 'F=SUM(B{row}:E{row})' \\
--copy-style-from 5
What it does:
1. Shifts all rows >= at down by 1 (calls xlsx_shift_rows.py)
2. Adds text values to sharedStrings.xml
3. Inserts new row with specified cells (text, numbers, formulas)
4. Copies cell styles from a reference row
5. Updates dimension ref
The shift operation automatically expands SUM formulas that span the
insertion point, so total-row formulas are updated without extra work.
IMPORTANT: Run on an UNPACKED directory (from xlsx_unpack.py).
After running, repack with xlsx_pack.py.
"""
import argparse
import os
import re
import subprocess
import sys
import xml.dom.minidom
import xml.etree.ElementTree as ET
NS_SS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
ET.register_namespace('', NS_SS)
ET.register_namespace('r', NS_REL)
ET.register_namespace('xdr', 'http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing')
ET.register_namespace('x14', 'http://schemas.microsoft.com/office/spreadsheetml/2009/9/main')
ET.register_namespace('xr2', 'http://schemas.microsoft.com/office/spreadsheetml/2015/revision2')
ET.register_namespace('mc', 'http://schemas.openxmlformats.org/markup-compatibility/2006')
def _tag(local: str) -> str:
return f"{{{NS_SS}}}{local}"
def _write_tree(tree: ET.ElementTree, path: str) -> None:
tree.write(path, encoding="unicode", xml_declaration=False)
with open(path, "r", encoding="utf-8") as fh:
raw = fh.read()
try:
dom = xml.dom.minidom.parseString(raw.encode("utf-8"))
pretty = dom.toprettyxml(indent=" ", encoding="utf-8").decode("utf-8")
lines = [line for line in pretty.splitlines() if line.strip()]
with open(path, "w", encoding="utf-8") as fh:
fh.write("\n".join(lines) + "\n")
except Exception:
pass
def col_number(s: str) -> int:
n = 0
for c in s.upper():
n = n * 26 + (ord(c) - 64)
return n
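def find_ws_path(work_dir: str, sheet_name: str | None) -> str:
    """Resolve a sheet name (or the first sheet if None) to its worksheet XML path."""
    wb_tree = ET.parse(os.path.join(work_dir, "xl", "workbook.xml"))
    rid = None
    for sheet in wb_tree.getroot().iter(_tag("sheet")):
        if sheet_name is None or sheet.get("name") == sheet_name:
            rid = sheet.get(f"{{{NS_REL}}}id")
            break
    if rid is None:
        print(f"ERROR: Sheet not found: {sheet_name}")
        sys.exit(1)
    rels_tree = ET.parse(os.path.join(work_dir, "xl", "_rels", "workbook.xml.rels"))
    for rel in rels_tree.getroot():
        if rel.get("Id") == rid:
            target = rel.get("Target", "")
            # A leading "/" means the target is relative to the package root,
            # not to xl/; os.path.join would otherwise discard work_dir.
            if target.startswith("/"):
                return os.path.join(work_dir, *target.strip("/").split("/"))
            return os.path.join(work_dir, "xl", *target.split("/"))
    print(f"ERROR: Relationship not found: {rid}")
    sys.exit(1)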
def add_shared_string(work_dir: str, text: str) -> int:
ss_path = os.path.join(work_dir, "xl", "sharedStrings.xml")
tree = ET.parse(ss_path)
root = tree.getroot()
idx = 0
for si in root.findall(_tag("si")):
t_el = si.find(_tag("t"))
if t_el is not None and t_el.text == text:
return idx
idx += 1
si = ET.SubElement(root, _tag("si"))
t = ET.SubElement(si, _tag("t"))
t.set("{http://www.w3.org/XML/1998/namespace}space", "preserve")
t.text = text
root.set("count", str(int(root.get("count", "0")) + 1))
root.set("uniqueCount", str(int(root.get("uniqueCount", "0")) + 1))
_write_tree(tree, ss_path)
return idx
def get_row_styles(ws_tree: ET.ElementTree, row_num: int) -> dict[str, int]:
"""Get {col_letter: style_index} for all cells in a row."""
styles = {}
for row_el in ws_tree.getroot().iter(_tag("row")):
if row_el.get("r") == str(row_num):
for c in row_el:
ref = c.get("r", "")
col_str = re.match(r"([A-Z]+)", ref)
if col_str:
styles[col_str.group(1)] = int(c.get("s", "0"))
break
return styles
def parse_kv(specs: list[str] | None) -> dict[str, str]:
if not specs:
return {}
result = {}
for spec in specs:
col, _, val = spec.partition("=")
result[col.upper()] = val
return result
def main() -> None:
parser = argparse.ArgumentParser(
description="Insert a new row into a worksheet in an unpacked xlsx")
parser.add_argument("work_dir", help="Unpacked xlsx working directory")
parser.add_argument("--at", type=int, required=True,
help="Row number to insert at (existing rows shift down)")
parser.add_argument("--sheet", default=None, help="Sheet name (default: first)")
parser.add_argument("--text", nargs="+", default=None,
help="Text cells: COL=VALUE (e.g., A=Utilities)")
parser.add_argument("--values", nargs="+", default=None,
help="Numeric cells: COL=VALUE (e.g., B=3000 C=3000)")
parser.add_argument("--formula", nargs="+", default=None,
help="Formula cells: COL=FORMULA with {row} (e.g., F=SUM(B{row}:E{row}))")
parser.add_argument("--copy-style-from", type=int, default=None,
help="Copy cell styles from this row number")
args = parser.parse_args()
at = args.at
text_cells = parse_kv(args.text)
num_cells = parse_kv(args.values)
formula_cells = parse_kv(args.formula)
# Step 1: Shift rows down using xlsx_shift_rows.py
script_dir = os.path.dirname(os.path.abspath(__file__))
shift_script = os.path.join(script_dir, "xlsx_shift_rows.py")
print(f"Step 1: Shifting rows >= {at} down by 1...")
result = subprocess.run(
[sys.executable, shift_script, args.work_dir, "insert", str(at), "1"],
capture_output=True, text=True,
)
if result.returncode != 0:
print(f"ERROR: shift_rows failed:\n{result.stderr}")
sys.exit(1)
print(result.stdout)
# Step 2: Resolve worksheet path and get reference styles
ws_path = find_ws_path(args.work_dir, args.sheet)
ws_tree = ET.parse(ws_path)
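    ref_styles = {}
    if args.copy_style_from is not None:
        # Assumes --copy-style-from uses the row numbering from before the insert:
        # rows at or below the insertion point moved down by one in Step 1.
        style_row = args.copy_style_from
        if style_row >= at:
            style_row += 1
        ref_styles = get_row_styles(ws_tree, style_row)
        print(f"Step 2: Copied styles from row {args.copy_style_from}: {ref_styles}")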
# Step 3: Add text values to sharedStrings
text_indices = {}
for col, text in text_cells.items():
text_indices[col] = add_shared_string(args.work_dir, text)
print(f" Added shared string: \"{text}\" → index {text_indices[col]}")
# Step 4: Re-parse worksheet and build new row
ws_tree = ET.parse(ws_path)
root = ws_tree.getroot()
sheet_data = root.find(_tag("sheetData"))
new_row = ET.Element(_tag("row"))
new_row.set("r", str(at))
all_cols = sorted(
set(list(text_cells) + list(num_cells) + list(formula_cells)),
key=col_number,
)
for col in all_cols:
cell = ET.SubElement(new_row, _tag("c"))
cell.set("r", f"{col}{at}")
if col in ref_styles:
cell.set("s", str(ref_styles[col]))
if col in text_cells:
cell.set("t", "s")
v = ET.SubElement(cell, _tag("v"))
v.text = str(text_indices[col])
elif col in num_cells:
# Omit t attribute for numbers — "n" is the default per OOXML spec
v = ET.SubElement(cell, _tag("v"))
v.text = str(num_cells[col])
        elif col in formula_cells:
            formula_text = formula_cells[col].replace("{row}", str(at)).lstrip("=")
            f_el = ET.SubElement(cell, _tag("f"))
            f_el.text = formula_text
            # The style was already copied from the reference row above, so a
            # formula column keeps whatever style that row uses for it
            # (e.g., black font for formulas vs blue font for inputs).
# Insert new row at the correct position in sheetData (sorted by row number)
insert_idx = 0
for i, row_el in enumerate(list(sheet_data)):
r = row_el.get("r")
if r and int(r) > at:
insert_idx = i
break
insert_idx = i + 1
sheet_data.insert(insert_idx, new_row)
print(f"\nStep 3: Inserted row {at} with {len(all_cols)} cells:")
for col in all_cols:
if col in text_cells:
print(f" {col}{at} = \"{text_cells[col]}\" (text)")
elif col in num_cells:
print(f" {col}{at} = {num_cells[col]} (number)")
elif col in formula_cells:
ftext = formula_cells[col].replace("{row}", str(at))
print(f" {col}{at} = {ftext} (formula)")
# Step 5: Update dimension
for dim in root.iter(_tag("dimension")):
old_ref = dim.get("ref", "")
if ":" in old_ref:
start_ref, end_ref = old_ref.split(":")
end_row = int(re.search(r"(\d+)", end_ref).group(1))
end_col = re.match(r"([A-Z]+)", end_ref).group(1)
# Dimension was already shifted by shift_rows, just verify
max_col = max(col_number(end_col), max(col_number(c) for c in all_cols))
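# Use the widest inserted column; `col` here would be a stale loop variable.
max_col_letter = end_col if col_number(end_col) >= max_col else max(all_cols, key=col_number)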
new_ref = f"{start_ref}:{max_col_letter}{end_row}"
if new_ref != old_ref:
dim.set("ref", new_ref)
print(f"\n Dimension: {old_ref} → {new_ref}")
_write_tree(ws_tree, ws_path)
print(f"\nDone. Row {at} inserted successfully.")
print(f"\nNext: python3 xlsx_pack.py {args.work_dir} output.xlsx")
if __name__ == "__main__":
main()
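
The row-placement step in the script above keeps `<sheetData>` ordered by row number. A minimal standalone sketch of that idea follows, assuming every existing row carries an `r` attribute. `insert_row_sorted` is a hypothetical helper name used only for illustration and is not part of the script.

```python
import xml.etree.ElementTree as ET

NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def insert_row_sorted(sheet_data: ET.Element, row_number: int) -> ET.Element:
    """Insert an empty <row r="N"> so rows stay ordered by r (hypothetical helper)."""
    new_row = ET.Element(f"{{{NS}}}row", {"r": str(row_number)})
    rows = list(sheet_data)
    # Default to appending; stop at the first existing row that comes later.
    insert_idx = len(rows)
    for i, row_el in enumerate(rows):
        if int(row_el.get("r", "0")) > row_number:
            insert_idx = i
            break
    sheet_data.insert(insert_idx, new_row)
    return new_row


sheet_data = ET.fromstring(
    f'<sheetData xmlns="{NS}"><row r="1"/><row r="2"/><row r="5"/></sheetData>'
)
insert_row_sorted(sheet_data, 3)
order = [row.get("r") for row in sheet_data]
print(order)  # ['1', '2', '3', '5']
```

Appending by default means an insert past the last existing row still lands at the end, which is the same outcome the loop in the script reaches through `insert_idx = i + 1`.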

@@ -0,0 +1,87 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
"""
xlsx_pack.py — Pack a working directory back into a valid xlsx file.
Usage:
python3 xlsx_pack.py <source_dir> <output.xlsx>
Requirements:
- source_dir must contain [Content_Types].xml at its root
- All XML files are re-validated for well-formedness before packing
The resulting xlsx is a valid ZIP archive with correct OOXML structure.
"""
import sys
import os
import zipfile
import xml.etree.ElementTree as ET
def validate_xml_files(source_dir: str) -> list[str]:
"""Return list of XML files that fail to parse."""
bad = []
for dirpath, _, filenames in os.walk(source_dir):
for fname in filenames:
if fname.endswith(".xml") or fname.endswith(".rels"):
fpath = os.path.join(dirpath, fname)
try:
ET.parse(fpath)
except ET.ParseError as e:
rel = os.path.relpath(fpath, source_dir)
bad.append(f"{rel}: {e}")
return bad
def pack(source_dir: str, xlsx_path: str) -> None:
if not os.path.isdir(source_dir):
print(f"ERROR: Directory not found: {source_dir}", file=sys.stderr)
sys.exit(1)
content_types = os.path.join(source_dir, "[Content_Types].xml")
if not os.path.isfile(content_types):
print(
f"ERROR: Missing [Content_Types].xml in {source_dir}\n"
" This file is required at the root of every valid xlsx package.",
file=sys.stderr,
)
sys.exit(1)
# Validate XML well-formedness before packing
print("Validating XML files...")
bad_files = validate_xml_files(source_dir)
if bad_files:
print("ERROR: The following files have XML parse errors:", file=sys.stderr)
for b in bad_files:
print(f" {b}", file=sys.stderr)
print(
"\nFix all XML errors before packing. "
"A malformed xlsx cannot be opened by Excel or LibreOffice.",
file=sys.stderr,
)
sys.exit(1)
print("✓ All XML files are well-formed")
# Count files to pack
file_count = sum(len(files) for _, _, files in os.walk(source_dir))
with zipfile.ZipFile(xlsx_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
for dirpath, _, filenames in os.walk(source_dir):
for fname in filenames:
fpath = os.path.join(dirpath, fname)
arcname = os.path.relpath(fpath, source_dir)
z.write(fpath, arcname)
size = os.path.getsize(xlsx_path)
print(f"Packed {file_count} files → '{xlsx_path}' ({size:,} bytes)")
print("\nNext step: run formula_check.py to validate formulas:")
print(f" python3 formula_check.py {xlsx_path}")
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: xlsx_pack.py <source_dir> <output.xlsx>")
sys.exit(1)
pack(sys.argv[1], sys.argv[2])
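
After packing, it can be useful to confirm that the archive opens and still carries the package-level parts. The sketch below is one way to do that with the standard library only. `check_package` is a hypothetical helper, and the list of required parts is an assumption based on the template files in this commit, not an exhaustive OOXML check.

```python
import os
import tempfile
import zipfile


def check_package(xlsx_path: str) -> list[str]:
    """Return a list of problems found in a packed xlsx (hypothetical helper)."""
    problems = []
    with zipfile.ZipFile(xlsx_path, "r") as z:
        names = set(z.namelist())
        # Assumed minimum: the two package-level parts used by this template.
        for required in ("[Content_Types].xml", "_rels/.rels"):
            if required not in names:
                problems.append(f"missing {required}")
        # testzip() returns the first member with a bad CRC, or None.
        corrupt = z.testzip()
        if corrupt is not None:
            problems.append(f"corrupt member {corrupt}")
    return problems


# Demo on a throwaway archive that deliberately lacks _rels/.rels
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.xlsx")
    with zipfile.ZipFile(demo, "w", compression=zipfile.ZIP_DEFLATED) as z:
        z.writestr("[Content_Types].xml", "<Types/>")
    result = check_package(demo)
print(result)  # ['missing _rels/.rels']
```

This only checks the container; formula validation is still the job of `formula_check.py`.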

@@ -0,0 +1,362 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
"""
xlsx_reader.py — Structure discovery and data analysis tool for Excel/CSV files.
Usage:
python3 xlsx_reader.py <file> # full structure report
python3 xlsx_reader.py <file> --sheet Sales # analyze one sheet
python3 xlsx_reader.py <file> --json # machine-readable output
python3 xlsx_reader.py <file> --quality # data quality audit only
Supports: .xlsx, .xlsm, .csv, .tsv
Does NOT modify the source file in any way.
Exit codes:
0 — success
1 — file not found / unsupported format / encoding failure
"""
import sys
import json
import argparse
from pathlib import Path
# ---------------------------------------------------------------------------
# Format detection and loading
# ---------------------------------------------------------------------------
def detect_and_load(file_path: str, sheet_name_filter: str | None = None) -> dict:
"""
Load file into {sheet_name: DataFrame} dict.
CSV/TSV files are mapped to a single-key dict using the file stem as key.
Raises ValueError for unsupported formats or encoding failures.
"""
try:
import pandas as pd
except ImportError:
raise RuntimeError(
"pandas is not installed. Run: pip install pandas openpyxl"
)
path = Path(file_path)
if not path.exists():
raise FileNotFoundError(f"File not found: {file_path}")
suffix = path.suffix.lower()
if suffix in (".xlsx", ".xlsm"):
target = sheet_name_filter if sheet_name_filter else None
result = pd.read_excel(file_path, sheet_name=target)
# pd.read_excel with sheet_name=None returns dict; with a name, returns DataFrame
if isinstance(result, dict):
return result
else:
return {sheet_name_filter: result}
elif suffix in (".csv", ".tsv"):
sep = "\t" if suffix == ".tsv" else ","
encodings = ["utf-8-sig", "gbk", "utf-8", "latin-1"]
last_error = None
for enc in encodings:
try:
df = pd.read_csv(file_path, sep=sep, encoding=enc)
df._reader_encoding = enc # attach metadata (non-standard, for reporting)
return {path.stem: df}
except Exception as e: # includes UnicodeDecodeError; try the next encoding
last_error = e
continue
raise ValueError(
f"Cannot decode {file_path}. Tried encodings: {encodings}. "
f"Last error: {last_error}"
)
elif suffix == ".xls":
raise ValueError(
".xls is a legacy binary format not supported by this tool. "
"Please open the file in Excel and save as .xlsx, then retry."
)
else:
raise ValueError(
f"Unsupported file format: {suffix}. "
"Supported formats: .xlsx, .xlsm, .csv, .tsv"
)
# ---------------------------------------------------------------------------
# Structure discovery
# ---------------------------------------------------------------------------
def explore_structure(sheets: dict) -> dict:
"""
Return a structured dict describing each sheet.
Keys: sheet_name -> {shape, columns, dtypes, null_counts, preview}
"""
result = {}
for sheet_name, df in sheets.items():
null_counts = df.isnull().sum()
null_info = {
col: {"count": int(cnt), "pct": round(cnt / max(len(df), 1) * 100, 1)}
for col, cnt in null_counts.items()
if cnt > 0
}
result[sheet_name] = {
"shape": {"rows": df.shape[0], "cols": df.shape[1]},
"columns": list(df.columns),
"dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
"null_columns": null_info,
"preview": df.head(5).to_dict(orient="records"),
}
return result
# ---------------------------------------------------------------------------
# Data quality audit
# ---------------------------------------------------------------------------
def audit_quality(sheets: dict) -> dict:
"""
Return data quality findings per sheet.
Checks: nulls, duplicates, mixed-type columns, potential year formatting issues.
"""
import pandas as pd
findings = {}
for sheet_name, df in sheets.items():
sheet_findings = []
# Null values
null_counts = df.isnull().sum()
for col, cnt in null_counts.items():
if cnt > 0:
pct = round(cnt / max(len(df), 1) * 100, 1)
sheet_findings.append({
"type": "null_values",
"column": col,
"count": int(cnt),
"pct": pct,
"note": f"Column '{col}' has {cnt} null values ({pct}%). "
"If this column contains Excel formulas, null values may "
"indicate that the formula cache has not been populated "
"(file was never opened in Excel after the formulas were written)."
})
# Duplicate rows
dup_count = int(df.duplicated().sum())
if dup_count > 0:
sheet_findings.append({
"type": "duplicate_rows",
"count": dup_count,
"note": f"{dup_count} fully duplicate rows found."
})
# Mixed-type object columns (numeric data stored as text)
for col in df.select_dtypes(include="object").columns:
numeric_converted = pd.to_numeric(df[col], errors="coerce")
convertible = int(numeric_converted.notna().sum())
non_null_total = int(df[col].notna().sum())
if 0 < convertible < non_null_total:
sheet_findings.append({
"type": "mixed_type",
"column": col,
"convertible_to_numeric": convertible,
"non_convertible": non_null_total - convertible,
"note": f"Column '{col}' appears to contain mixed types: "
f"{convertible} values can be parsed as numbers, "
f"{non_null_total - convertible} cannot. "
"Use pd.to_numeric(df[col], errors='coerce') to unify."
})
# Year column formatting (e.g., 2024.0 stored as float)
for col in df.select_dtypes(include="number").columns:
col_lower = str(col).lower()
# "年" is the Chinese character for "year" — detect year columns in CJK spreadsheets
if "year" in col_lower or "yr" in col_lower or "年" in col_lower:
if df[col].dropna().between(1900, 2200).all():
if df[col].dtype == float:
sheet_findings.append({
"type": "year_as_float",
"column": col,
"note": f"Column '{col}' appears to be a year column stored as float "
"(e.g., 2024.0). Convert with df[col].astype(int).astype(str) "
"to get clean year strings like '2024'."
})
# Outliers via IQR on numeric columns
for col in df.select_dtypes(include="number").columns:
series = df[col].dropna()
if len(series) < 4:
continue
Q1, Q3 = series.quantile(0.25), series.quantile(0.75)
IQR = Q3 - Q1
if IQR == 0:
continue
outlier_mask = (df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)
outlier_count = int(outlier_mask.sum())
if outlier_count > 0:
sheet_findings.append({
"type": "outliers_iqr",
"column": col,
"count": outlier_count,
"note": f"Column '{col}' has {outlier_count} potential outlier(s) "
f"(outside 1.5×IQR bounds: [{Q1 - 1.5*IQR:.2f}, {Q3 + 1.5*IQR:.2f}])."
})
findings[sheet_name] = sheet_findings
return findings
# ---------------------------------------------------------------------------
# Summary statistics
# ---------------------------------------------------------------------------
def compute_stats(sheets: dict) -> dict:
"""Compute descriptive statistics for numeric columns per sheet."""
stats = {}
for sheet_name, df in sheets.items():
numeric_df = df.select_dtypes(include="number")
if numeric_df.empty:
stats[sheet_name] = {}
continue
desc = numeric_df.describe().round(4)
stats[sheet_name] = desc.to_dict()
return stats
# ---------------------------------------------------------------------------
# Human-readable report rendering
# ---------------------------------------------------------------------------
def render_report(
file_path: str,
structure: dict,
quality: dict,
stats: dict,
) -> str:
lines = []
p = lines.append
p("=" * 60)
p(f"ANALYSIS REPORT: {Path(file_path).name}")
p("=" * 60)
# File overview
sheet_list = list(structure.keys())
total_rows = sum(s["shape"]["rows"] for s in structure.values())
p(f"\nSheets ({len(sheet_list)}): {', '.join(sheet_list)}")
p(f"Total rows across all sheets: {total_rows:,}")
for sheet_name, info in structure.items():
p(f"\n{'─' * 50}")
p(f"Sheet: {sheet_name}")
p(f"{'─' * 50}")
p(f" Size: {info['shape']['rows']:,} rows × {info['shape']['cols']} cols")
p(f" Columns: {info['columns']}")
# Data types
p("\n Column types:")
for col, dtype in info["dtypes"].items():
p(f" {col}: {dtype}")
# Nulls
if info["null_columns"]:
p("\n Null values (columns with nulls only):")
for col, null_info in info["null_columns"].items():
p(f" {col}: {null_info['count']} nulls ({null_info['pct']}%)")
else:
p("\n Null values: none")
# Stats
sheet_stats = stats.get(sheet_name, {})
if sheet_stats:
p("\n Numeric column statistics:")
numeric_cols = list(sheet_stats.keys())
# Show only first 6 to keep report readable
for col in numeric_cols[:6]:
col_stats = sheet_stats[col]
p(f" {col}:")
p(f" count={col_stats.get('count', 'N/A')} "
f"mean={col_stats.get('mean', 'N/A')} "
f"min={col_stats.get('min', 'N/A')} "
f"max={col_stats.get('max', 'N/A')}")
if len(numeric_cols) > 6:
p(f" ... and {len(numeric_cols) - 6} more numeric columns")
# Quality findings for this sheet
sheet_quality = quality.get(sheet_name, [])
if sheet_quality:
p(f"\n Data quality issues ({len(sheet_quality)} found):")
for finding in sheet_quality:
p(f" [{finding['type'].upper()}] {finding['note']}")
else:
p("\n Data quality: no issues found")
# Preview
if info["preview"]:
p("\n Preview (first 3 rows):")
import pandas as pd
preview_df = pd.DataFrame(info["preview"][:3])
for line in preview_df.to_string(index=False).splitlines():
p(f" {line}")
p("\n" + "=" * 60)
quality_issue_count = sum(len(v) for v in quality.values())
if quality_issue_count == 0:
p("RESULT: No data quality issues detected.")
else:
p(f"RESULT: {quality_issue_count} data quality issue(s) found. See details above.")
p("=" * 60)
return "\n".join(lines)
# ---------------------------------------------------------------------------
# CLI entry point
# ---------------------------------------------------------------------------
def main() -> None:
parser = argparse.ArgumentParser(
description="Read and analyze Excel/CSV files without modifying them."
)
parser.add_argument("file", help="Path to .xlsx, .xlsm, .csv, or .tsv file")
parser.add_argument("--sheet", help="Analyze a specific sheet only", default=None)
parser.add_argument(
"--json", action="store_true", help="Output machine-readable JSON"
)
parser.add_argument(
"--quality", action="store_true",
help="Run data quality audit only (skip stats)"
)
args = parser.parse_args()
try:
sheets = detect_and_load(args.file, sheet_name_filter=args.sheet)
except (FileNotFoundError, ValueError, RuntimeError) as e:
print(f"ERROR: {e}", file=sys.stderr)
sys.exit(1)
structure = explore_structure(sheets)
quality = audit_quality(sheets)
stats = {} if args.quality else compute_stats(sheets)
if args.json:
output = {
"file": args.file,
"structure": structure,
"quality": quality,
"stats": stats,
}
# Convert preview records to serializable form (handle non-JSON types)
print(json.dumps(output, indent=2, ensure_ascii=False, default=str))
else:
report = render_report(args.file, structure, quality, stats)
print(report)
if __name__ == "__main__":
main()
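
The outlier check in `audit_quality` uses the 1.5×IQR rule. The sketch below reproduces the arithmetic with the standard library so the bounds can be checked by hand. It assumes `statistics.quantiles` with `method="inclusive"`, which should agree with the linear interpolation pandas uses by default. `iqr_bounds` is a hypothetical name, not a function in the script.

```python
import statistics


def iqr_bounds(values: list[float]) -> tuple[float, float]:
    """Return (low, high) fences at 1.5 x IQR around the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


values = [10, 12, 11, 13, 12, 100]
low, high = iqr_bounds(values)
outliers = [v for v in values if v < low or v > high]
print(low, high, outliers)  # 9.0 15.0 [100]
```

Here Q1 is 11.25 and Q3 is 12.75, so the IQR is 1.5 and the fences sit 2.25 beyond each quartile.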

@@ -0,0 +1,396 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
"""
xlsx_shift_rows.py — Shift all row references in an unpacked xlsx working directory
after inserting or deleting rows.
Usage:
# Insert 2 rows at row 5 (rows 5+ shift down by 2)
python3 xlsx_shift_rows.py <work_dir> insert 5 2
# Delete 1 row at row 8 (rows 9+ shift up by 1)
python3 xlsx_shift_rows.py <work_dir> delete 8 1
What it updates in every XML file under <work_dir>:
- <row r="N"> attributes in worksheet sheetData
- <c r="XN"> cell address attributes in worksheet sheetData
- <f> formula text: row references, relative or absolute (e.g. B7, $B$7, $B7) in all sheets
- <mergeCell ref="A5:C7"> ranges
- <conditionalFormatting sqref="..."> ranges
- <dataValidation sqref="..."> ranges
- <dimension ref="A1:D20"> extent marker
- Table <table ref="A1:D20"> in xl/tables/*.xml
- Chart series <numRef><f> and <strRef><f> range references in xl/charts/*.xml
- PivotCache source <worksheetSource ref="..."> in xl/pivotCaches/*.xml
IMPORTANT: Run this script on the UNPACKED directory before repacking.
After running, repack with xlsx_pack.py and re-validate with formula_check.py.
Limitations:
- Named ranges in workbook.xml <definedNames> are NOT updated automatically.
Review them manually after running this script.
- Structured table references (Table[@Column]) are NOT updated.
- External workbook links in xl/externalLinks/ are NOT updated.
"""
import sys
import os
import re
import xml.etree.ElementTree as ET
import xml.dom.minidom
def col_letter(n: int) -> str:
"""Convert 1-based column number to Excel column letter(s)."""
r = ""
while n > 0:
n, rem = divmod(n - 1, 26)
r = chr(65 + rem) + r
return r
def col_number(s: str) -> int:
"""Convert Excel column letter(s) to 1-based column number."""
n = 0
for c in s.upper():
n = n * 26 + (ord(c) - 64)
return n
# ---------------------------------------------------------------------------
# Core shifting logic for formula strings
# ---------------------------------------------------------------------------
def _shift_refs(text: str, at: int, delta: int) -> str:
"""Shift cell references in a non-quoted formula fragment."""
def replacer(m: re.Match) -> str:
dollar_col = m.group(1) # "$" or ""
col_part = m.group(2) # e.g. "B" or "AB"
dollar_row = m.group(3) # "$" or ""
row_str = m.group(4) # e.g. "7"
row = int(row_str)
if row >= at:
row = max(1, row + delta)
return f"{dollar_col}{col_part}{dollar_row}{row}"
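# Guards: do not match inside a longer identifier, a function name such as
# LOG10(...), or an unquoted sheet prefix such as Q1!A5. Columns have at
# most 3 letters (XFD).
pattern = r'(?<![A-Za-z0-9_])(\$?)([A-Z]{1,3})(\$?)(\d+)(?![A-Za-z0-9_(!])'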
return re.sub(pattern, replacer, text)
def shift_formula(formula: str, at: int, delta: int) -> str:
"""
Shift row references >= `at` by `delta` in a formula string, whether the row
part is relative or absolute.
Handles:
B7 (relative col, relative row; shifts if row >= at)
$B$7 (absolute col, absolute row; shifts)
$B7 (absolute col, relative row; shifts)
B$7 (relative col, absolute row; shifts)
BUT NOT: B:B (whole-column reference — left as-is)
Skips content inside single-quoted sheet name prefixes to avoid
corrupting names like 'Budget FY2025' (where FY2025 is NOT a cell ref).
Does NOT handle:
- Named ranges
- Structured references (Table[@Col])
- R1C1 notation
"""
# Split on quoted sheet names: 'Sheet Name' portions are odd-indexed
segments = re.split(r"('[^']*(?:''[^']*)*')", formula)
result = []
for i, seg in enumerate(segments):
if i % 2 == 1:
result.append(seg)
else:
result.append(_shift_refs(seg, at, delta))
return "".join(result)
def shift_sqref(sqref: str, at: int, delta: int) -> str:
"""
Shift row references in a sqref string (space-separated cell/range addresses).
E.g. "A5:D20 B30" → shift rows >= 5 by delta.
"""
parts = sqref.split()
result = []
for part in parts:
if ':' in part:
left, right = part.split(':', 1)
left = shift_formula(left, at, delta)
right = shift_formula(right, at, delta)
result.append(f"{left}:{right}")
else:
result.append(shift_formula(part, at, delta))
return " ".join(result)
def shift_chart_range(text: str, at: int, delta: int) -> str:
"""
Shift row references inside a chart range formula like:
Sheet1!$B$5:$B$20
'Q1 Data'!$A$3:$A$15
"""
# Split on the "!" to preserve sheet name
if '!' not in text:
return text
bang = text.index('!')
sheet_part = text[:bang + 1]
range_part = text[bang + 1:]
return sheet_part + shift_formula(range_part, at, delta)
# ---------------------------------------------------------------------------
# XML file processors
# ---------------------------------------------------------------------------
NS_MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS_DRAWING = "http://schemas.openxmlformats.org/drawingml/2006/chartDrawing"
# Namespace map used by ElementTree for tag lookup
NSMAP = {"ss": NS_MAIN}
def _tag(local: str) -> str:
return f"{{{NS_MAIN}}}{local}"
def process_worksheet(path: str, at: int, delta: int) -> int:
"""Update row/cell references in a worksheet XML. Returns change count."""
tree = ET.parse(path)
root = tree.getroot()
changes = 0
# 1. <dimension ref="A1:D20">
for dim in root.iter(_tag("dimension")):
old = dim.get("ref", "")
new = shift_sqref(old, at, delta)
if new != old:
dim.set("ref", new)
changes += 1
# 2. <row r="N"> and <c r="XN"> inside sheetData
sheet_data = root.find(_tag("sheetData"))
if sheet_data is not None:
rows_to_reorder = []
for row_el in list(sheet_data):
r_str = row_el.get("r")
if r_str is None:
continue
r = int(r_str)
if r >= at:
new_r = max(1, r + delta)
row_el.set("r", str(new_r))
changes += 1
# Update each cell's r attribute
for cell_el in row_el:
cell_ref = cell_el.get("r", "")
if cell_ref:
new_ref = shift_formula(cell_ref, at, delta)
if new_ref != cell_ref:
cell_el.set("r", new_ref)
changes += 1
# Also update formulas in every row (formulas can reference any row)
for cell_el in row_el:
f_el = cell_el.find(_tag("f"))
if f_el is not None and f_el.text:
new_f = shift_formula(f_el.text, at, delta)
if new_f != f_el.text:
f_el.text = new_f
changes += 1
# 3. <mergeCell ref="A5:C7">
for mc in root.iter(_tag("mergeCell")):
old = mc.get("ref", "")
new = shift_sqref(old, at, delta)
if new != old:
mc.set("ref", new)
changes += 1
# 4. <conditionalFormatting sqref="...">
for cf in root.iter(_tag("conditionalFormatting")):
old = cf.get("sqref", "")
new = shift_sqref(old, at, delta)
if new != old:
cf.set("sqref", new)
changes += 1
# 5. <dataValidation sqref="...">
for dv in root.iter(_tag("dataValidation")):
old = dv.get("sqref", "")
new = shift_sqref(old, at, delta)
if new != old:
dv.set("sqref", new)
changes += 1
if changes > 0:
_write_tree(tree, path)
return changes
def process_chart(path: str, at: int, delta: int) -> int:
"""Update data range references in a chart XML."""
# Charts use DrawingML namespace; we look for <f> elements with range strings
with open(path, "r", encoding="utf-8") as fh:
content = fh.read()
# Pattern matches content of <f>Sheet1!$A$1:$A$10</f> style elements
def replace_f(m: re.Match) -> str:
tag_open = m.group(1)
inner = m.group(2)
tag_close = m.group(3)
new_inner = shift_chart_range(inner, at, delta)
return f"{tag_open}{new_inner}{tag_close}"
new_content = re.sub(r'(<(?:[^:>]+:)?f>)([^<]+)(</(?:[^:>]+:)?f>)',
replace_f, content)
changes = content != new_content
if changes:
with open(path, "w", encoding="utf-8") as fh:
fh.write(new_content)
return 1 if changes else 0
def process_table(path: str, at: int, delta: int) -> int:
"""Update the ref attribute on the <table> root element."""
tree = ET.parse(path)
root = tree.getroot()
# The root element IS the table
old = root.get("ref", "")
if not old:
return 0
new = shift_sqref(old, at, delta)
if new == old:
return 0
root.set("ref", new)
_write_tree(tree, path)
return 1
def process_pivot_cache(path: str, at: int, delta: int) -> int:
"""Update worksheetSource ref in a pivot cache definition."""
tree = ET.parse(path)
root = tree.getroot()
changes = 0
# Look for <worksheetSource ref="A1:D100" ...>
for ws in root.iter():
if ws.tag.endswith("}worksheetSource") or ws.tag == "worksheetSource":
old = ws.get("ref", "")
if old:
new = shift_sqref(old, at, delta)
if new != old:
ws.set("ref", new)
changes += 1
if changes:
_write_tree(tree, path)
return changes
def _write_tree(tree: ET.ElementTree, path: str) -> None:
"""Write ElementTree back to file with pretty-printing."""
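# Write explicit UTF-8 so the UTF-8 read below never depends on the
# platform's default text encoding.
tree.write(path, encoding="utf-8", xml_declaration=True)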
# Re-pretty-print for readability
with open(path, "r", encoding="utf-8") as fh:
raw = fh.read()
try:
dom = xml.dom.minidom.parseString(raw.encode("utf-8"))
pretty = dom.toprettyxml(indent=" ", encoding="utf-8").decode("utf-8")
lines = [line for line in pretty.splitlines() if line.strip()]
with open(path, "w", encoding="utf-8") as fh:
fh.write("\n".join(lines) + "\n")
except Exception:
pass # If pretty-print fails, leave the file as-is
# ---------------------------------------------------------------------------
# Main driver
# ---------------------------------------------------------------------------
def main() -> None:
if len(sys.argv) < 5:
print(__doc__)
sys.exit(1)
work_dir = sys.argv[1]
operation = sys.argv[2].lower()
at = int(sys.argv[3])
count = int(sys.argv[4])
if operation not in ("insert", "delete"):
print(f"ERROR: operation must be 'insert' or 'delete', got '{operation}'")
sys.exit(1)
if operation == "insert":
delta = count
else:
delta = -count
if not os.path.isdir(work_dir):
print(f"ERROR: Directory not found: {work_dir}")
sys.exit(1)
print(f"Operation : {operation} {count} row(s) at row {at} (delta={delta:+d})")
print(f"Work dir : {work_dir}")
print()
total_changes = 0
# Process all worksheets
ws_dir = os.path.join(work_dir, "xl", "worksheets")
if os.path.isdir(ws_dir):
for fname in sorted(os.listdir(ws_dir)):
if fname.endswith(".xml"):
fpath = os.path.join(ws_dir, fname)
n = process_worksheet(fpath, at, delta)
if n:
print(f" Updated {n:3d} references in xl/worksheets/{fname}")
total_changes += n
# Process all charts
charts_dir = os.path.join(work_dir, "xl", "charts")
if os.path.isdir(charts_dir):
for fname in sorted(os.listdir(charts_dir)):
if fname.endswith(".xml"):
fpath = os.path.join(charts_dir, fname)
n = process_chart(fpath, at, delta)
if n:
print(f" Updated chart ranges in xl/charts/{fname}")
total_changes += n
# Process all tables
tables_dir = os.path.join(work_dir, "xl", "tables")
if os.path.isdir(tables_dir):
for fname in sorted(os.listdir(tables_dir)):
if fname.endswith(".xml"):
fpath = os.path.join(tables_dir, fname)
n = process_table(fpath, at, delta)
if n:
print(f" Updated table ref in xl/tables/{fname}")
total_changes += n
# Process pivot cache definitions
cache_dir = os.path.join(work_dir, "xl", "pivotCaches")
if os.path.isdir(cache_dir):
for fname in sorted(os.listdir(cache_dir)):
if "Definition" in fname and fname.endswith(".xml"):
fpath = os.path.join(cache_dir, fname)
n = process_pivot_cache(fpath, at, delta)
if n:
print(f" Updated pivot source range in xl/pivotCaches/{fname}")
total_changes += n
print()
print(f"Total changes: {total_changes}")
print()
print("IMPORTANT: Review named ranges in xl/workbook.xml <definedNames> manually.")
print(" Structured table references (Table[@Col]) are NOT updated.")
print()
print("Next steps:")
print(" 1. Review the changes above")
print(f" 2. python3 xlsx_pack.py {work_dir} output.xlsx")
print(" 3. python3 formula_check.py output.xlsx")
if __name__ == "__main__":
main()
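
To make the shifting rule concrete, here is a condensed, self-contained sketch of the approach `shift_formula` takes: split out quoted sheet names, then bump row numbers at or below the insertion point in everything else. The cell pattern is deliberately simplified and the name `shift_rows_in_formula` is hypothetical, so treat it as an illustration of the technique and not as a drop-in replacement.

```python
import re

CELL_REF = re.compile(r"(\$?)([A-Z]+)(\$?)(\d+)")
QUOTED_SHEET = re.compile(r"('[^']*(?:''[^']*)*')")


def shift_rows_in_formula(formula: str, at: int, delta: int) -> str:
    """Shift every row number >= at by delta, leaving quoted sheet names alone."""

    def bump(m: re.Match) -> str:
        row = int(m.group(4))
        if row >= at:
            row = max(1, row + delta)
        return f"{m.group(1)}{m.group(2)}{m.group(3)}{row}"

    parts = QUOTED_SHEET.split(formula)
    # Odd-indexed parts are quoted sheet names and pass through untouched.
    return "".join(
        part if i % 2 == 1 else CELL_REF.sub(bump, part)
        for i, part in enumerate(parts)
    )


inserted = shift_rows_in_formula("SUM(B5:B10)", at=5, delta=2)
quoted = shift_rows_in_formula("'Budget FY2025'!B7+$C$3", at=5, delta=2)
deleted = shift_rows_in_formula("A10-A2", at=8, delta=-1)
whole_col = shift_rows_in_formula("SUM(B:B)", at=5, delta=2)
print(inserted)   # SUM(B7:B12)
print(quoted)     # 'Budget FY2025'!B9+$C$3
print(deleted)    # A9-A2
print(whole_col)  # SUM(B:B)
```

The second call shows why the quoted-name split matters: without it, `FY2025` inside the sheet name would be read as column `FY`, row 2025 and rewritten.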

@@ -0,0 +1,130 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
"""
xlsx_unpack.py — Unpack an xlsx file into a working directory for XML editing.
Usage:
python3 xlsx_unpack.py <input.xlsx> <output_dir>
What it does:
1. Unzips the xlsx (which is a ZIP archive)
2. Pretty-prints all XML and .rels files for readability
3. Prints a summary of key files to edit
"""
import sys
import zipfile
import os
import shutil
import xml.dom.minidom
def pretty_print_xml(content: bytes) -> str:
"""Pretty-print XML bytes. Returns original content on parse failure."""
try:
dom = xml.dom.minidom.parseString(content)
pretty = dom.toprettyxml(indent=" ", encoding="utf-8").decode("utf-8")
# Remove the extra blank lines toprettyxml adds
lines = [line for line in pretty.splitlines() if line.strip()]
return "\n".join(lines) + "\n"
except Exception:
return content.decode("utf-8", errors="replace")
def unpack(xlsx_path: str, output_dir: str) -> None:
if not os.path.isfile(xlsx_path):
print(f"ERROR: File not found: {xlsx_path}", file=sys.stderr)
sys.exit(1)
if not xlsx_path.lower().endswith((".xlsx", ".xlsm")):
print(f"WARNING: '{xlsx_path}' does not have an .xlsx/.xlsm extension", file=sys.stderr)
if os.path.exists(output_dir):
shutil.rmtree(output_dir)
os.makedirs(output_dir)
try:
with zipfile.ZipFile(xlsx_path, "r") as z:
# Validate member paths to prevent zip-slip (path traversal) attacks
for member in z.namelist():
member_path = os.path.realpath(os.path.join(output_dir, member))
if not member_path.startswith(os.path.realpath(output_dir) + os.sep) and member_path != os.path.realpath(output_dir):
print(f"ERROR: Zip entry '{member}' would escape target directory (path traversal blocked)", file=sys.stderr)
shutil.rmtree(output_dir, ignore_errors=True)
sys.exit(1)
z.extractall(output_dir)
except zipfile.BadZipFile:
shutil.rmtree(output_dir, ignore_errors=True)
print(f"ERROR: '{xlsx_path}' is not a valid ZIP/xlsx file", file=sys.stderr)
sys.exit(1)
# Pretty-print XML and .rels files
xml_count = 0
for dirpath, _, filenames in os.walk(output_dir):
for fname in filenames:
if fname.endswith(".xml") or fname.endswith(".rels"):
fpath = os.path.join(dirpath, fname)
with open(fpath, "rb") as f:
raw = f.read()
pretty = pretty_print_xml(raw)
with open(fpath, "w", encoding="utf-8") as f:
f.write(pretty)
xml_count += 1
print(f"Unpacked '{xlsx_path}' → '{output_dir}'")
print(f"Pretty-printed {xml_count} XML/rels files\n")
# Print key files grouped by category
categories = {
"Package root": ["[Content_Types].xml", "_rels/.rels"],
"Workbook": ["xl/workbook.xml", "xl/_rels/workbook.xml.rels"],
"Styles & Strings": ["xl/styles.xml", "xl/sharedStrings.xml"],
"Worksheets": [],
}
all_files = []
for dirpath, _, filenames in os.walk(output_dir):
for fname in filenames:
rel = os.path.relpath(os.path.join(dirpath, fname), output_dir)
all_files.append(rel)
# Collect worksheets
for rel in sorted(all_files):
if rel.startswith("xl/worksheets/") and rel.endswith(".xml"):
categories["Worksheets"].append(rel)
print("Key files to inspect/edit:")
for category, files in categories.items():
if not files:
continue
print(f"\n [{category}]")
for f in files:
full = os.path.join(output_dir, f)
if os.path.isfile(full):
size = os.path.getsize(full)
print(f" {f} ({size:,} bytes)")
else:
print(f" {f} (not found)")
# Warn about high-risk files present
risky = {
"xl/vbaProject.bin": "VBA macros — DO NOT modify",
"xl/pivotTables": "Pivot tables — update source ranges carefully if shifting rows",
"xl/charts": "Charts — update data ranges if shifting rows",
}
print("\n [High-risk content detected:]")
found_any = False
for path, warning in risky.items():
full = os.path.join(output_dir, path)
if os.path.exists(full):
print(f" ⚠️ {path}: {warning}")
found_any = True
if not found_any:
print(" ✓ None (safe to edit)")
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: xlsx_unpack.py <input.xlsx> <output_dir>")
sys.exit(1)
unpack(sys.argv[1], sys.argv[2])
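
The zip-slip guard in `unpack` compares resolved paths with `startswith`. An equivalent check can be sketched with `os.path.commonpath`, which some readers find easier to audit. `is_safe_member` is a hypothetical helper, and the sketch assumes all paths live on one drive (on Windows, `commonpath` raises `ValueError` across drives).

```python
import os


def is_safe_member(output_dir: str, member: str) -> bool:
    """True if extracting `member` would stay inside `output_dir`."""
    root = os.path.realpath(output_dir)
    target = os.path.realpath(os.path.join(root, member))
    # The resolved target must have the extraction root as its common prefix.
    return os.path.commonpath([root, target]) == root


safe = is_safe_member("work_dir", "xl/worksheets/sheet1.xml")
escaping = is_safe_member("work_dir", "../outside.txt")
print(safe, escaping)  # True False
```

As in the script, the check should run over every name in the archive before anything is extracted.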

@@ -0,0 +1,9 @@
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/xl/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>
<Override PartName="/xl/worksheets/sheet1.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
<Override PartName="/xl/styles.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.styles+xml"/>
<Override PartName="/xl/sharedStrings.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sharedStrings+xml"/>
</Types>

@@ -0,0 +1,6 @@
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
Target="xl/workbook.xml"/>
</Relationships>

@@ -0,0 +1,19 @@
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!--
workbook.xml.rels — Maps each sheet r:id to its worksheet XML file.
When adding a new sheet:
- Add: <Relationship Id="rId4" Type="...worksheet" Target="worksheets/sheet2.xml"/>
(pick an Id that is not already taken; rId2 and rId3 are used below)
- The Id must match the r:id in workbook.xml <sheet> element
-->
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet"
Target="worksheets/sheet1.xml"/>
<Relationship Id="rId2"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
Target="styles.xml"/>
<Relationship Id="rId3"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/sharedStrings"
Target="sharedStrings.xml"/>
</Relationships>

@@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!--
sharedStrings.xml — The shared string table.
In this template, all text values in cells use this table. Instead of storing text directly
in the cell, each text string is stored here once, and cells reference it
by 0-based index:
<c r="A1" t="s"><v>0</v></c> → first string in this table
To add strings:
1. Append a new <si><t>Your Text</t></si> element
2. Increment `uniqueCount` by one, and `count` by one for each cell that references the new string
3. Use the new string's 0-based index in the cell's <v> element
Special characters in text:
- & → &amp;
- < → &lt;
- > → &gt;
- Leading/trailing spaces → use xml:space="preserve": <t xml:space="preserve"> text </t>
count = total number of string references across the workbook
uniqueCount = number of unique strings in this table
(They may differ if some strings are used in multiple cells)
-->
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
count="0" uniqueCount="0">
<!-- Strings will be added here. Example:
<si><t>Revenue</t></si>
<si><t>Cost of Goods Sold</t></si>
<si><t>Gross Profit</t></si>
-->
</sst>
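
The "To add strings" steps in the comment above can be sketched in Python with `xml.etree.ElementTree`. This is an illustration under stated assumptions: `add_shared_string` is a hypothetical helper, it does not deduplicate existing strings, and it assumes exactly one cell will reference each new string when it bumps `count`.

```python
import xml.etree.ElementTree as ET

NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("", NS)  # keep the default namespace on output


def add_shared_string(sst: ET.Element, text: str) -> int:
    """Append text to the table and return its 0-based index (hypothetical helper)."""
    index = len(sst.findall(f"{{{NS}}}si"))
    si = ET.SubElement(sst, f"{{{NS}}}si")
    t = ET.SubElement(si, f"{{{NS}}}t")
    t.text = text
    if text != text.strip():
        # Leading/trailing spaces are only kept with xml:space="preserve".
        t.set(XML_SPACE, "preserve")
    sst.set("uniqueCount", str(int(sst.get("uniqueCount", "0")) + 1))
    # Assumption: one cell references the new string, so count grows by one.
    sst.set("count", str(int(sst.get("count", "0")) + 1))
    return index


sst = ET.fromstring(f'<sst xmlns="{NS}" count="0" uniqueCount="0"/>')
first = add_shared_string(sst, "Revenue")
second = add_shared_string(sst, "R&D < Sales")
xml_text = ET.tostring(sst, encoding="unicode")
print(first, second)  # 0 1
print(xml_text)
```

ElementTree escapes `&` and `<` on serialization, so the text is passed in unescaped. The returned index is what goes into the cell's `<v>` element alongside `t="s"`.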

@@ -0,0 +1,160 @@
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!--
styles.xml — The complete style system for this workbook.
Style index (cellXfs) lookup table:
┌───────┬─────────────────────────────────┬────────────────┬───────────────────┐
│ Index │ Semantic Role │ Font Color │ Number Format │
├───────┼─────────────────────────────────┼────────────────┼───────────────────┤
│ 0 │ Default │ Theme (black) │ General │
│ 1 │ Input / Assumption │ Blue 000000FF │ General │
│ 2 │ Formula / Computed result │ Black 00000000 │ General │
│ 3 │ Cross-sheet reference │ Green 00008000 │ General │
│ 4 │ Header (bold) │ Black bold │ General │
│ 5 │ Currency input │ Blue 000000FF │ $#,##0 (id=164) │
│ 6 │ Currency formula │ Black │ $#,##0 (id=164) │
│ 7 │ Percentage input │ Blue 000000FF │ 0.0% (id=165) │
│ 8 │ Percentage formula │ Black │ 0.0% (id=165) │
│ 9 │ Integer with commas input │ Blue 000000FF │ #,##0 (id=167) │
│ 10 │ Integer with commas formula │ Black │ #,##0 (id=167) │
│ 11 │ Year (no comma) — input │ Blue 000000FF │ 0 (id=1) │
│ 12 │ Key assumption (yellow bg) │ Blue 000000FF │ General + yellow │
└───────┴─────────────────────────────────┴────────────────┴───────────────────┘
To add a new style:
1. If needed, add a <numFmt> to <numFmts> with a new numFmtId >= 164 (increment max)
2. If needed, add a <font> to <fonts>
3. If needed, add a <fill> to <fills>
4. Append a new <xf> to <cellXfs>, combining fontId + fillId + numFmtId
5. Update the count attributes on <numFmts>, <fonts>, <fills>, <cellXfs>
6. The new xf's index = (old cellXfs count) — use this as the s attribute on cells
CRITICAL RULES:
- fills[0] and fills[1] are REQUIRED BY SPEC — never remove them
- Do NOT modify existing <xf> entries — only append new ones
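- AARRGGBB color format: first 2 hex digits = Alpha. Excel appears to ignore the alpha channel for font and fill colors, so the 00 prefix used here still renders as solid (FF is the conventional opaque value)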
-->
<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<!-- ── Number Formats ─────────────────────────────────────────────────── -->
<!-- Built-in IDs 0-163 need NOT be declared here. Custom formats: 164+ -->
<numFmts count="4">
<!-- 164: Standard currency — positive $1,234 / negative ($1,234) / zero - -->
<numFmt numFmtId="164" formatCode="$#,##0;($#,##0);&quot;-&quot;"/>
<!-- 165: Percentage with 1 decimal place -->
<numFmt numFmtId="165" formatCode="0.0%"/>
<!-- 166: Multiplier / ratio (e.g. 8.5x for EV/EBITDA) -->
<numFmt numFmtId="166" formatCode="0.0&quot;x&quot;"/>
<!-- 167: Integer with thousands separator, no decimals -->
<numFmt numFmtId="167" formatCode="#,##0"/>
</numFmts>
<!-- ── Fonts ──────────────────────────────────────────────────────────── -->
<fonts count="5">
<!-- 0: Default (theme color, no explicit color) -->
<font>
<sz val="11"/>
<name val="Calibri"/>
</font>
<!-- 1: Input / Assumption — Blue -->
<font>
<sz val="11"/>
<name val="Calibri"/>
<color rgb="000000FF"/>
</font>
<!-- 2: Formula / Computed result — Black (explicit) -->
<font>
<sz val="11"/>
<name val="Calibri"/>
<color rgb="00000000"/>
</font>
<!-- 3: Cross-sheet reference — Green -->
<font>
<sz val="11"/>
<name val="Calibri"/>
<color rgb="00008000"/>
</font>
<!-- 4: Header — Bold Black -->
<font>
<b/>
<sz val="11"/>
<name val="Calibri"/>
<color rgb="00000000"/>
</font>
</fonts>
<!-- ── Fills ──────────────────────────────────────────────────────────── -->
<!-- fills[0] and fills[1] are REQUIRED by OOXML spec — DO NOT REMOVE -->
<fills count="3">
<fill><patternFill patternType="none"/></fill>
<fill><patternFill patternType="gray125"/></fill>
<!-- 2: Yellow highlight for key assumptions requiring review -->
<fill>
<patternFill patternType="solid">
<fgColor rgb="00FFFF00"/>
<bgColor indexed="64"/>
</patternFill>
</fill>
</fills>
<!-- ── Borders ────────────────────────────────────────────────────────── -->
<borders count="1">
<!-- 0: No borders (default) -->
<border>
<left/>
<right/>
<top/>
<bottom/>
<diagonal/>
</border>
</borders>
<!-- ── Cell Style Xfs (base styles) ──────────────────────────────────── -->
<cellStyleXfs count="1">
<xf numFmtId="0" fontId="0" fillId="0" borderId="0"/>
</cellStyleXfs>
<!-- ── Cell Xfs (the actual style slots referenced by <c s="N">) ──────── -->
<cellXfs count="13">
<!-- 0: Default -->
<xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/>
<!-- 1: Input / Assumption (blue, general format) -->
<xf numFmtId="0" fontId="1" fillId="0" borderId="0" xfId="0" applyFont="1"/>
<!-- 2: Formula / Computed (black, general format) -->
<xf numFmtId="0" fontId="2" fillId="0" borderId="0" xfId="0" applyFont="1"/>
<!-- 3: Cross-sheet reference (green, general format) -->
<xf numFmtId="0" fontId="3" fillId="0" borderId="0" xfId="0" applyFont="1"/>
<!-- 4: Header (bold black) -->
<xf numFmtId="0" fontId="4" fillId="0" borderId="0" xfId="0" applyFont="1"/>
<!-- 5: Currency input (blue + $#,##0) -->
<xf numFmtId="164" fontId="1" fillId="0" borderId="0" xfId="0"
applyFont="1" applyNumberFormat="1"/>
<!-- 6: Currency formula (black + $#,##0) -->
<xf numFmtId="164" fontId="2" fillId="0" borderId="0" xfId="0"
applyFont="1" applyNumberFormat="1"/>
<!-- 7: Percentage input (blue + 0.0%) -->
<xf numFmtId="165" fontId="1" fillId="0" borderId="0" xfId="0"
applyFont="1" applyNumberFormat="1"/>
<!-- 8: Percentage formula (black + 0.0%) -->
<xf numFmtId="165" fontId="2" fillId="0" borderId="0" xfId="0"
applyFont="1" applyNumberFormat="1"/>
<!-- 9: Integer-with-commas input (blue + #,##0) -->
<xf numFmtId="167" fontId="1" fillId="0" borderId="0" xfId="0"
applyFont="1" applyNumberFormat="1"/>
<!-- 10: Integer-with-commas formula (black + #,##0) -->
<xf numFmtId="167" fontId="2" fillId="0" borderId="0" xfId="0"
applyFont="1" applyNumberFormat="1"/>
<!-- 11: Year column input (blue + plain integer "0", no comma) -->
<xf numFmtId="1" fontId="1" fillId="0" borderId="0" xfId="0"
applyFont="1" applyNumberFormat="1"/>
<!-- 12: Key assumption highlight (blue on yellow — needs human review) -->
<xf numFmtId="0" fontId="1" fillId="2" borderId="0" xfId="0"
applyFont="1" applyFill="1"/>
</cellXfs>
<!-- ── Named Cell Styles ─────────────────────────────────────────────── -->
<cellStyles count="1">
<cellStyle name="Normal" xfId="0" builtinId="0"/>
</cellStyles>
</styleSheet>
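
Steps 4 to 6 of "To add a new style" (append an `<xf>`, update the count, use the old count as the new index) can be sketched as below. This is a sketch under stated assumptions, not the skill's own tooling: the function name `append_cell_xf` is hypothetical, and it assumes the needed `<numFmt>`, `<font>` and `<fill>` entries already exist.

```python
import xml.etree.ElementTree as ET

NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def append_cell_xf(styles_root, num_fmt_id: int, font_id: int, fill_id: int = 0) -> int:
    """Append a new <xf> to <cellXfs>; return its index (the s attribute value).

    Hypothetical helper. Existing <xf> entries are never modified.
    """
    cell_xfs = styles_root.find(f"{{{NS}}}cellXfs")
    new_index = len(cell_xfs.findall(f"{{{NS}}}xf"))  # old count = new index

    attrs = {
        "numFmtId": str(num_fmt_id),
        "fontId": str(font_id),
        "fillId": str(fill_id),
        "borderId": "0",
        "xfId": "0",
    }
    # Mirror the template: set apply* flags only for non-default parts.
    if font_id:
        attrs["applyFont"] = "1"
    if num_fmt_id:
        attrs["applyNumberFormat"] = "1"
    if fill_id:
        attrs["applyFill"] = "1"

    ET.SubElement(cell_xfs, f"{{{NS}}}xf", attrs)
    cell_xfs.set("count", str(new_index + 1))
    return new_index
```

Applied to this template, which has 13 entries (indices 0 to 12), a call such as `append_cell_xf(root, num_fmt_id=166, font_id=2)` would return 13, a black multiplier style.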


@@ -0,0 +1,30 @@
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!--
workbook.xml — Defines the list of sheets.
To add a new sheet:
1. Add a <sheet> element below with a unique sheetId and r:id
2. Add a matching <Relationship> in xl/_rels/workbook.xml.rels
3. Add an <Override> in [Content_Types].xml
4. Create the xl/worksheets/sheetN.xml file
Sheet name rules:
- Max 31 characters
- Cannot contain: / \ ? * [ ] :
- Ampersand must be escaped as &amp; in XML
-->
<workbook
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
<fileVersion appName="xl" lastEdited="7" lowestEdited="7"/>
<workbookPr defaultThemeVersion="166925"/>
<bookViews>
<workbookView xWindow="0" yWindow="0" windowWidth="20140" windowHeight="10960"/>
</bookViews>
<sheets>
<!-- Add more <sheet> elements here for multi-sheet workbooks -->
<sheet name="Sheet1" sheetId="1" r:id="rId1"/>
</sheets>
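<!-- calcId records the calculation-engine version that last computed the workbook; Excel typically recalculates on open when its own engine is newer. Add fullCalcOnLoad="1" to force a full recalculation. -->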
<calcPr calcId="191029"/>
</workbook>
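
The sheet name rules in the comment above (31-character limit, forbidden characters, ampersand escaping) can be checked before a `<sheet>` element is written. A minimal sketch follows; `validate_sheet_name` and `sheet_element` are hypothetical helper names, and only the rules listed in the comment are enforced.

```python
from xml.sax.saxutils import quoteattr

FORBIDDEN = set("/\\?*[]:")  # characters the comment above disallows
MAX_LEN = 31


def validate_sheet_name(name: str) -> list[str]:
    """Return a list of rule violations (empty list means the name is valid)."""
    problems = []
    if len(name) > MAX_LEN:
        problems.append(f"name is {len(name)} characters; max is {MAX_LEN}")
    bad = sorted(FORBIDDEN & set(name))
    if bad:
        problems.append("forbidden characters: " + " ".join(bad))
    return problems


def sheet_element(name: str, sheet_id: int, rel_id: str) -> str:
    """Build a <sheet> element; quoteattr escapes & < > in the name."""
    problems = validate_sheet_name(name)
    if problems:
        raise ValueError("; ".join(problems))
    return f'<sheet name={quoteattr(name)} sheetId="{sheet_id}" r:id="{rel_id}"/>'
```

For example, `sheet_element("P&L", 2, "rId4")` yields a name attribute of `P&amp;L`, while `"In/Out"` is rejected with a `ValueError`.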


@@ -0,0 +1,70 @@
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!--
sheet1.xml — Worksheet data.
Cell anatomy:
<c r="A1" t="s" s="4"><v>0</v></c>
↑ ↑ ↑ ↑
address type style value (sharedStrings index for t="s")
Type values (t attribute):
s = shared string (text) — <v> contains index into sharedStrings.xml
inlineStr = inline string — use <is><t>text</t></is> instead of <v>
n (or omit)= number — <v> contains the raw number
b = boolean — <v> is 1 (TRUE) or 0 (FALSE)
e = error — <v> contains error string like #REF!
formula = any cell with <f>: <f> holds the formula (NO leading =), <v> is the cached result, and t describes that result (omit t for numbers, use t="str" for text results)
Formula cells:
<c r="B5" s="2"><f>SUM(B2:B4)</f><v></v></c>
Cross-sheet: <c r="C1" s="3"><f>Assumptions!B2</f><v></v></c>
With spaces: <c r="C1" s="3"><f>'Q1 Data'!B2</f><v></v></c>
Style index (s attribute) — pre-built in styles.xml:
0 = default
1 = input/assumption (blue font)
2 = formula/computed (black font)
3 = cross-sheet reference (green font)
4 = header bold
5 = currency input (blue + $#,##0 format)
See styles.xml for the full list and how to add more.
Row r attribute must be 1-based integer.
Column letters: A=1, Z=26, AA=27, AZ=52, BA=53, BZ=78, etc.
-->
<worksheet
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
<sheetViews>
<sheetView tabSelected="1" workbookViewId="0">
<!-- Freeze top row as header: -->
<!-- <pane ySplit="1" topLeftCell="A2" activePane="bottomLeft" state="frozen"/> -->
</sheetView>
</sheetViews>
<sheetFormatPr defaultRowHeight="15" x14ac:dyDescent="0.25"
xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac"/>
<!-- Column widths — uncomment and adjust as needed:
<cols>
<col min="1" max="1" width="24" customWidth="1"/>
<col min="2" max="10" width="14" customWidth="1"/>
</cols>
-->
<sheetData>
<!-- Replace this placeholder with actual data rows.
Example:
<row r="1">
<c r="A1" t="s" s="4"><v>0</v></c>
<c r="B1" t="s" s="4"><v>1</v></c>
</row>
<row r="2">
<c r="A2" t="s" s="1"><v>2</v></c>
<c r="B2" s="1"><v>1000</v></c>
</row>
<row r="3">
<c r="A3" t="s" s="2"><v>3</v></c>
<c r="B3" s="2"><f>B2*1.1</f><v></v></c>
</row>
-->
</sheetData>
<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>
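
The column-letter mapping and the formula-cell shape described in the comment at the top of this file can be sketched in a few lines. This is an illustrative sketch; `column_letters`, `column_index` and `formula_cell` are hypothetical helpers, not functions shipped with the skill.

```python
from xml.sax.saxutils import escape


def column_letters(index: int) -> str:
    """Convert a 1-based column number to letters (1 -> A, 27 -> AA)."""
    if index < 1:
        raise ValueError("column index must be >= 1")
    letters = ""
    while index:
        # Bijective base-26: shift to 0-based before each division.
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def column_index(letters: str) -> int:
    """Convert column letters back to a 1-based number (AZ -> 52)."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def formula_cell(ref: str, formula: str, style: int = 2) -> str:
    """Build a formula cell: no leading '=', XML-escaped, empty cached <v>."""
    body = escape(formula.lstrip("="))
    return f'<c r="{ref}" s="{style}"><f>{body}</f><v></v></c>'
```

Escaping matters for comparisons: a formula such as `IF(A1<B1,1,0)` has to be written with `&lt;` inside `<f>`.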