# AGENTS.md — pptx-image-compress Guidelines for AI agents and contributors working in this codebase. --- ## Project Overview Single-file Python CLI tool (`pptx_image_compress.py`) that compresses images inside `.pptx` files using the external binary `caesiumclt`. Supports single- file and batch modes, multi-threaded processing, and CSV logging. **Entry point:** `pptx_image_compress.py` → `main()` **Tests:** `test_pptx_image_compress.py` (stdlib `unittest`, run via `pytest`) **External dependency:** `caesiumclt` must be on `PATH` --- ## Running Tests ```bash python -m pytest test_pptx_image_compress.py -v ``` All 5 tests must pass before any change is considered complete. Never remove or weaken an existing test. Always add a test for new behaviour. --- ## Code Readability - **One responsibility per function.** If a function does more than one thing, split it. - **Descriptive names.** Avoid single-letter variables outside of short loops. Prefer `img_path` over `p`, `result` over `r`. - **Type-annotate every function signature** — parameters and return type. Use `Optional[X]` / `X | None` consistently (the codebase uses both; prefer `X | None` for new code on Python 3.10+). - **Constants at module level**, UPPER_SNAKE_CASE. Never hardcode magic values inline (e.g. file extensions, prefix strings, bar lengths). - **Section comments** (`# --- Section ---`) are used to separate logical blocks. Keep them and add new ones when introducing a new logical group. - **German UI strings are intentional** (progress output, error messages shown to the end-user). Keep them in German. Internal code identifiers stay in English. - **No dead code.** Remove commented-out blocks and unused functions before committing. --- ## Testability - **Inject external dependencies via callable parameters.** The `compressor` parameter on `process_image_file` and `process_single_deck` is the canonical pattern — always use it for any new external-process call. - **Never call `shutil.which` or `subprocess` directly inside a function under test.** Route through an injectable or mockable seam. - **Tests use `tempfile.TemporaryDirectory`** for isolation. Every test must clean up after itself — rely on the context manager, not `tearDown`. - **Do not test private implementation details.** Test observable behaviour: return values, file contents, log output. - **One assertion focus per test.** A test named `test_X` should assert exactly what `X` does, with a minimal setup. - **Use `fake_compressor` pattern** (as seen in existing tests) to decouple image-compression logic from the real `caesiumclt` binary in all unit tests. --- ## Performance - **Thread pool sizing:** outer thread count is controlled by `-t/--threads` (default 16). When `threads > 1`, each `caesiumclt` subprocess is launched with `--threads 1` to prevent CPU over-subscription. Do not change this without benchmarking. - **Scratch directories are per-image** (`img_{idx:06d}` sub-dirs) to avoid filename collisions across threads without locking. - **`Lock` scope must be minimal.** Only counter increments and `log_lines` appends are inside the lock — never I/O or subprocess calls. - **Avoid redundant filesystem walks.** `build_image_slide_index` is called once per deck, not per image. Keep it that way. - **`zip_dir_to_pptx` collects all files before writing** so `[Content_Types].xml` can be placed first. Do not revert this to a streaming walk. --- ## Architecture ### Current state Single-file design (`pptx_image_compress.py`) is intentional for zero-install distribution. It is acceptable as long as the file stays under ~700 lines. ### Target layout (clean architecture — migrate when the file grows) When the project needs to grow, extract to a package following these layers. Dependencies must only point **inward** (CLI → Application → Domain ← Infrastructure implements Domain ports). ``` pptx_compress/ ├── __init__.py ├── __main__.py # python -m pptx_compress entry point │ ├── domain/ # innermost — zero external imports │ ├── __init__.py │ ├── models.py # DeckResult, ImageProcessResult (dataclasses) │ ├── constants.py # ALLOWED_EXT, TEMP_PREFIX, defaults │ └── ports.py # Compressor Protocol (typing.Protocol), SlideIndex ABC │ ├── application/ # orchestration — imports domain only │ ├── __init__.py │ ├── compress_deck.py # process_single_deck() use-case │ └── batch.py # batch loop, overall summary logic │ ├── infrastructure/ # implements domain ports — imports domain + stdlib/3rd-party │ ├── __init__.py │ ├── caesium_adapter.py # compress_with_caesium() (caesiumclt subprocess) │ ├── pptx_reader.py # discover_images(), build_image_slide_index() │ ├── pptx_writer.py # zip_dir_to_pptx() │ └── temp_manager.py # cleanup_old_temps(), TEMP_PREFIX lifecycle │ └── cli/ # outermost — imports application only ├── __init__.py ├── args.py # argparse definition, expand_inputs(), collect_from_dir() └── output.py # print_progress(), format_duration(), human_mb/kb ``` ### Layer rules | Layer | May import | Must NOT import | |---|---|---| | `domain` | stdlib only | everything else | | `application` | `domain` | `infrastructure`, `cli` | | `infrastructure` | `domain`, stdlib, 3rd-party | `application`, `cli` | | `cli` | `application`, `domain.models` | `infrastructure` directly | ### Key architectural decisions - **`Compressor` is a `typing.Protocol`** (in `domain/ports.py`), not a bare `Callable`. This makes the contract explicit and IDE-checkable without creating an import cycle: ```python class Compressor(Protocol): def __call__( self, original: Path, out_dir: Path, threads: int | None, quality: int, min_savings: str, ) -> Path | None: ... ``` - **`DeckResult` and `ImageProcessResult` live in `domain/models.py`** — they are pure data, no logic, no I/O. - **`compress_deck.py` receives a `Compressor` instance via constructor or parameter** — never imports `caesium_adapter` directly. This is what makes the use-case fully unit-testable with a `fake_compressor`. - **`main()` (in `cli/args.py`) owns argument parsing only.** It resolves paths, builds the `Compressor` adapter, and calls `application.compress_deck` or `application.batch`. No processing logic belongs there. - **`expand_inputs` / `collect_from_dir` live in `cli/args.py`** — path resolution is a CLI concern. All layers below receive `Path` objects. - **Temp directory lifecycle belongs in `infrastructure/temp_manager.py`.** Always use `TEMP_PREFIX` so orphaned dirs from crashed runs are recoverable. ### Migration guide (single file → package) 1. Create the `pptx_compress/` directory. 2. Move dataclasses and constants to `domain/`. 3. Move `compress_with_caesium` → `infrastructure/caesium_adapter.py`. 4. Move PPTX read/write helpers → `infrastructure/pptx_reader.py` and `pptx_writer.py`. 5. Move `process_image_file` + `process_single_deck` → `application/compress_deck.py`. 6. Move `main()` + input helpers → `cli/args.py`. 7. Add `__main__.py` with `from pptx_compress.cli.args import main; main()`. 8. Update `test_pptx_image_compress.py` imports accordingly — test logic does not need to change because the public API surface is identical. ### Refactoring plan (aligned with this AGENTS.md) - Keep the same layer direction: `cli` → `application` → `domain`; only `infrastructure` implements domain ports. - Add dedicated raster/vector implementations behind domain ports, not in CLI: - `domain/ports.py`: `RasterCompressor`, `VectorCompressor` protocols (or one `Compressor` protocol + typed strategies) - `infrastructure/caesium_adapter.py`: raster implementation - `infrastructure/svg_polish_adapter.py`: vector implementation - Add routing in `application` (not `infrastructure`): - `application/compress_deck.py`: `CompressorRouter` decides by extension - no direct `subprocess` / external library calls in `application` - Split image workflow into explicit application steps: - `compress_step` - `optimal_format_step` (PNG → JPEG optimization step; not a fallback) - `replace_step` (atomic replace via `.tmp` + `Path.replace()`) - Centralize PPTX metadata handling in infrastructure modules: - keep relationship/content-type updates in `infrastructure/pptx_reader.py` and/or `infrastructure/pptx_writer.py` - `application` only orchestrates and passes domain models - Introduce configuration object in `domain/constants.py` or a dedicated domain config model; avoid new magic values in `application`. - Preserve public behaviour and CLI surface during migration; refactor in small commits with green tests after each step. ### Suggested commit sequence 1. Extract domain models/constants/ports unchanged. 2. Extract caesium adapter + add svg_polish adapter seam. 3. Introduce router in `application` with extension-based dispatch. 4. Refactor image processing into `compress_step` + `optimal_format_step` + `replace_step`. 5. Extract PPTX metadata update helpers to infrastructure modules. 6. Move CLI parsing/output concerns into `cli/` only. 7. Remove dead monolith code paths and keep tests passing. --- ## Security - **Never pass unsanitised user input directly to `subprocess`.** The `compress_with_caesium` function builds the command as a list (not a shell string). Keep it that way — do not use `shell=True`. - **Validate file extensions before compression.** `compress_with_caesium` checks `ext not in ALLOWED_EXT` and returns `None` for unrecognised types. Do not bypass or widen this check without explicit justification. - **Validate input paths early.** `process_single_deck` checks that the input exists and has a `.pptx` suffix before doing any filesystem work. - **Temp files are written atomically.** Image replacement uses a `.tmp` intermediate and `Path.replace()` (atomic rename) — do not change this to a direct overwrite. - **`capture_output=True`** is set on all subprocess calls so that stdout/stderr from `caesiumclt` cannot interfere with or inject into the tool's own output. - **Do not log file contents**, only metadata (name, size, slide references). The CSV log must never contain image binary data or path information outside the output directory. - **`ignore_errors=True` on `shutil.rmtree`** is acceptable for temp cleanup only. Never suppress errors on writes to the output PPTX or its log file.