# Data Reading & Analysis Guide > Reference for the READ path. Use `xlsx_reader.py` for structure discovery and data quality auditing, > then pandas for custom analysis. **Never modify the source file.** --- ## When to Use This Path The user asks to read, analyze, view, summarize, extract, or answer questions about an Excel/CSV file's contents, without requiring file modification. If modification is needed, hand off to `edit.md`. --- ## Workflow ### Step 1 — Structure Discovery Run `xlsx_reader.py` first. It handles format detection, encoding fallback, structure exploration, and data quality audit: ```bash python3 SKILL_DIR/scripts/xlsx_reader.py input.xlsx # full report python3 SKILL_DIR/scripts/xlsx_reader.py input.xlsx --sheet Sales # single sheet python3 SKILL_DIR/scripts/xlsx_reader.py input.xlsx --quality # quality audit only python3 SKILL_DIR/scripts/xlsx_reader.py input.xlsx --json # machine-readable ``` Supported formats: `.xlsx`, `.xlsm`, `.csv`, `.tsv`. The script tries multiple encodings for CSV (utf-8-sig, gbk, utf-8, latin-1). ### Step 2 — Custom Analysis with pandas Load data and perform the analysis the user requests: ```python import pandas as pd df = pd.read_excel("input.xlsx", sheet_name=None) # dict of all sheets # For CSV: pd.read_csv("input.csv") ``` **Header handling** (when the default `header=0` doesn't work): | Situation | Code | |-----------|------| | Header on row 3 | `pd.read_excel(path, header=2)` | | Multi-level merged header | `pd.read_excel(path, header=[0, 1])` | | No header | `pd.read_excel(path, header=None)` | **Analysis quick reference:** | Scenario | Pattern | |----------|---------| | Descriptive stats | `df.describe()` or `df['Col'].agg(['sum', 'mean', 'min', 'max'])` | | Group aggregation | `df.groupby('Region')['Revenue'].agg(Total='sum', Avg='mean')` | | Top N | `df.groupby('Region')['Revenue'].sum().sort_values(ascending=False).head(5)` | | Pivot table | `df.pivot_table(values='Revenue', index='Region', columns='Quarter', aggfunc='sum', margins=True)` | | Time series | `df.set_index(pd.to_datetime(df['Date'])).resample('ME')['Revenue'].sum()` | | Cross-sheet merge | `pd.merge(sales, customers, on='CustomerID', how='left', validate='m:1')` | | Stack sheets | `pd.concat([df.assign(Source=name) for name, df in sheets.items()], ignore_index=True)` | | Large files (>50MB) | `pd.read_excel(path, usecols=['Date', 'Revenue'])` or `pd.read_csv(path, chunksize=10000)` | ### Step 3 — Output If the user specifies an output file path, write results to it (highest priority). Format the report as: ``` ## Analysis Report: {filename} ### File Overview — format, sheets, row counts ### Data Quality — nulls, duplicates, mixed types (or "no issues") ### Key Findings — direct answer to the user's question ### Additional Notes — formula NaN, encoding issues, caveats ``` **Numeric display**: monetary `1,234,567.89`, percentage `12.3%`, multiples `8.5x`, counts as integers. --- ## Common Pitfalls | Pitfall | Cause | Fix | |---------|-------|-----| | Formula cells read as NaN | `` cache empty in freshly generated files | Inform user; suggest opening in Excel and re-saving; or use `libreoffice_recalc.py` | | CSV encoding errors | Chinese Windows exports use GBK | `xlsx_reader.py` auto-tries multiple encodings; manually specify if all fail | | Mixed types in column | Column has both numbers and text (e.g., "N/A") | `pd.to_numeric(df['Col'], errors='coerce')` — report unconvertible rows | | Year shows as 2,024 | Thousands separator format applied to year | `df['Year'].astype(int).astype(str)` | | Multi-level headers | Two-row header merged | `pd.read_excel(path, header=[0, 1])`, then flatten with `' - '.join()` | | Row number mismatch | pandas 0-indexed vs Excel 1-indexed | `excel_row = pandas_index + 2` (+1 for 1-index, +1 for header) | **Critical**: Never open with `data_only=True` then `save()` — this permanently destroys all formulas. --- ## Prohibitions - Never modify the source file (no `save()`, no XML edits) - Never report formula NaN as "data is zero" — explain it's a formula cache issue - Never report pandas indices as Excel row numbers - Never make speculative conclusions unsupported by the data