237 lines
11 KiB
Markdown
237 lines
11 KiB
Markdown
# AGENTS.md — pptx-image-compress
|
|
|
|
Guidelines for AI agents and contributors working in this codebase.
|
|
|
|
---
|
|
|
|
## Project Overview
|
|
|
|
Single-file Python CLI tool (`pptx_image_compress.py`) that compresses images
|
|
inside `.pptx` files using the external binary `caesiumclt`. Supports single-
|
|
file and batch modes, multi-threaded processing, and CSV logging.
|
|
|
|
**Entry point:** `pptx_image_compress.py` → `main()`
|
|
**Tests:** `test_pptx_image_compress.py` (stdlib `unittest`, run via `pytest`)
|
|
**External dependency:** `caesiumclt` must be on `PATH`
|
|
|
|
---
|
|
|
|
## Running Tests
|
|
|
|
```bash
|
|
python -m pytest test_pptx_image_compress.py -v
|
|
```
|
|
|
|
All 5 tests must pass before any change is considered complete.
|
|
Never remove or weaken an existing test. Always add a test for new behaviour.
|
|
|
|
---
|
|
|
|
## Code Readability
|
|
|
|
- **One responsibility per function.** If a function does more than one thing,
|
|
split it.
|
|
- **Descriptive names.** Avoid single-letter variables outside of short loops.
|
|
Prefer `img_path` over `p`, `result` over `r`.
|
|
- **Type-annotate every function signature** — parameters and return type.
|
|
Use `Optional[X]` / `X | None` consistently (the codebase uses both; prefer
|
|
`X | None` for new code on Python 3.10+).
|
|
- **Constants at module level**, UPPER_SNAKE_CASE. Never hardcode magic values
|
|
inline (e.g. file extensions, prefix strings, bar lengths).
|
|
- **Section comments** (`# --- Section ---`) are used to separate logical
|
|
blocks. Keep them and add new ones when introducing a new logical group.
|
|
- **German UI strings are intentional** (progress output, error messages shown
|
|
to the end-user). Keep them in German. Internal code identifiers stay in
|
|
English.
|
|
- **No dead code.** Remove commented-out blocks and unused functions before
|
|
committing.
|
|
|
|
---
|
|
|
|
## Testability
|
|
|
|
- **Inject external dependencies via callable parameters.** The `compressor`
|
|
parameter on `process_image_file` and `process_single_deck` is the canonical
|
|
pattern — always use it for any new external-process call.
|
|
- **Never call `shutil.which` or `subprocess` directly inside a function under
|
|
test.** Route through an injectable or mockable seam.
|
|
- **Tests use `tempfile.TemporaryDirectory`** for isolation. Every test must
|
|
clean up after itself — rely on the context manager, not `tearDown`.
|
|
- **Do not test private implementation details.** Test observable behaviour:
|
|
return values, file contents, log output.
|
|
- **One assertion focus per test.** A test named `test_X` should assert exactly
|
|
what `X` does, with a minimal setup.
|
|
- **Use `fake_compressor` pattern** (as seen in existing tests) to decouple
|
|
image-compression logic from the real `caesiumclt` binary in all unit tests.
|
|
|
|
---
|
|
|
|
## Performance
|
|
|
|
- **Thread pool sizing:** outer thread count is controlled by `-t/--threads`
|
|
(default 16). When `threads > 1`, each `caesiumclt` subprocess is launched
|
|
with `--threads 1` to prevent CPU over-subscription. Do not change this
|
|
without benchmarking.
|
|
- **Scratch directories are per-image** (`img_{idx:06d}` sub-dirs) to avoid
|
|
filename collisions across threads without locking.
|
|
- **`Lock` scope must be minimal.** Only counter increments and `log_lines`
|
|
appends are inside the lock — never I/O or subprocess calls.
|
|
- **Avoid redundant filesystem walks.** `build_image_slide_index` is called
|
|
once per deck, not per image. Keep it that way.
|
|
- **`zip_dir_to_pptx` collects all files before writing** so `[Content_Types].xml`
|
|
can be placed first. Do not revert this to a streaming walk.
|
|
|
|
---
|
|
|
|
## Architecture
|
|
|
|
### Current state
|
|
|
|
Single-file design (`pptx_image_compress.py`) is intentional for zero-install
|
|
distribution. It is acceptable as long as the file stays under ~700 lines.
|
|
|
|
### Target layout (clean architecture — migrate when the file grows)
|
|
|
|
When the project needs to grow, extract to a package following these layers.
|
|
Dependencies must only point **inward** (CLI → Application → Domain ←
|
|
Infrastructure implements Domain ports).
|
|
|
|
```
|
|
pptx_compress/
|
|
├── __init__.py
|
|
├── __main__.py # python -m pptx_compress entry point
|
|
│
|
|
├── domain/ # innermost — zero external imports
|
|
│ ├── __init__.py
|
|
│ ├── models.py # DeckResult, ImageProcessResult (dataclasses)
|
|
│ ├── constants.py # ALLOWED_EXT, TEMP_PREFIX, defaults
|
|
│ └── ports.py # Compressor Protocol (typing.Protocol), SlideIndex ABC
|
|
│
|
|
├── application/ # orchestration — imports domain only
|
|
│ ├── __init__.py
|
|
│ ├── compress_deck.py # process_single_deck() use-case
|
|
│ └── batch.py # batch loop, overall summary logic
|
|
│
|
|
├── infrastructure/ # implements domain ports — imports domain + stdlib/3rd-party
|
|
│ ├── __init__.py
|
|
│ ├── caesium_adapter.py # compress_with_caesium() (caesiumclt subprocess)
|
|
│ ├── pptx_reader.py # discover_images(), build_image_slide_index()
|
|
│ ├── pptx_writer.py # zip_dir_to_pptx()
|
|
│ └── temp_manager.py # cleanup_old_temps(), TEMP_PREFIX lifecycle
|
|
│
|
|
└── cli/ # outermost — imports application only
|
|
├── __init__.py
|
|
├── args.py # argparse definition, expand_inputs(), collect_from_dir()
|
|
└── output.py # print_progress(), format_duration(), human_mb/kb
|
|
```
|
|
|
|
### Layer rules
|
|
|
|
| Layer | May import | Must NOT import |
|
|
|---|---|---|
|
|
| `domain` | stdlib only | everything else |
|
|
| `application` | `domain` | `infrastructure`, `cli` |
|
|
| `infrastructure` | `domain`, stdlib, 3rd-party | `application`, `cli` |
|
|
| `cli` | `application`, `domain.models` | `infrastructure` directly |
|
|
|
|
### Key architectural decisions
|
|
|
|
- **`Compressor` is a `typing.Protocol`** (in `domain/ports.py`), not a bare
|
|
`Callable`. This makes the contract explicit and IDE-checkable without
|
|
creating an import cycle:
|
|
```python
|
|
class Compressor(Protocol):
|
|
def __call__(
|
|
self,
|
|
original: Path,
|
|
out_dir: Path,
|
|
threads: int | None,
|
|
quality: int,
|
|
min_savings: str,
|
|
) -> Path | None: ...
|
|
```
|
|
- **`DeckResult` and `ImageProcessResult` live in `domain/models.py`** — they
|
|
are pure data, no logic, no I/O.
|
|
- **`compress_deck.py` receives a `Compressor` instance via constructor or
|
|
parameter** — never imports `caesium_adapter` directly. This is what makes
|
|
the use-case fully unit-testable with a `fake_compressor`.
|
|
- **`main()` (in `cli/args.py`) owns argument parsing only.** It resolves
|
|
paths, builds the `Compressor` adapter, and calls `application.compress_deck`
|
|
or `application.batch`. No processing logic belongs there.
|
|
- **`expand_inputs` / `collect_from_dir` live in `cli/args.py`** — path
|
|
resolution is a CLI concern. All layers below receive `Path` objects.
|
|
- **Temp directory lifecycle belongs in `infrastructure/temp_manager.py`.**
|
|
Always use `TEMP_PREFIX` so orphaned dirs from crashed runs are recoverable.
|
|
|
|
### Migration guide (single file → package)
|
|
|
|
1. Create the `pptx_compress/` directory.
|
|
2. Move dataclasses and constants to `domain/`.
|
|
3. Move `compress_with_caesium` → `infrastructure/caesium_adapter.py`.
|
|
4. Move PPTX read/write helpers → `infrastructure/pptx_reader.py` and
|
|
`pptx_writer.py`.
|
|
5. Move `process_image_file` + `process_single_deck` → `application/compress_deck.py`.
|
|
6. Move `main()` + input helpers → `cli/args.py`.
|
|
7. Add `__main__.py` with `from pptx_compress.cli.args import main; main()`.
|
|
8. Update `test_pptx_image_compress.py` imports accordingly — test logic does
|
|
not need to change because the public API surface is identical.
|
|
|
|
### Refactoring plan (aligned with this AGENTS.md)
|
|
|
|
- Keep the same layer direction: `cli` → `application` → `domain`; only
|
|
`infrastructure` implements domain ports.
|
|
- Add dedicated raster/vector implementations behind domain ports, not in CLI:
|
|
- `domain/ports.py`: `RasterCompressor`, `VectorCompressor` protocols
|
|
(or one `Compressor` protocol + typed strategies)
|
|
- `infrastructure/caesium_adapter.py`: raster implementation
|
|
- `infrastructure/svg_polish_adapter.py`: vector implementation
|
|
- Add routing in `application` (not `infrastructure`):
|
|
- `application/compress_deck.py`: `CompressorRouter` decides by extension
|
|
- no direct `subprocess` / external library calls in `application`
|
|
- Split image workflow into explicit application steps:
|
|
- `compress_step`
|
|
- `optimal_format_step` (PNG → JPEG optimization step; not a fallback)
|
|
- `replace_step` (atomic replace via `.tmp` + `Path.replace()`)
|
|
- Centralize PPTX metadata handling in infrastructure modules:
|
|
- keep relationship/content-type updates in `infrastructure/pptx_reader.py`
|
|
and/or `infrastructure/pptx_writer.py`
|
|
- `application` only orchestrates and passes domain models
|
|
- Introduce configuration object in `domain/constants.py` or a dedicated
|
|
domain config model; avoid new magic values in `application`.
|
|
- Preserve public behaviour and CLI surface during migration; refactor in
|
|
small commits with green tests after each step.
|
|
|
|
### Suggested commit sequence
|
|
|
|
1. Extract domain models/constants/ports unchanged.
|
|
2. Extract caesium adapter + add svg_polish adapter seam.
|
|
3. Introduce router in `application` with extension-based dispatch.
|
|
4. Refactor image processing into `compress_step` + `optimal_format_step` +
|
|
`replace_step`.
|
|
5. Extract PPTX metadata update helpers to infrastructure modules.
|
|
6. Move CLI parsing/output concerns into `cli/` only.
|
|
7. Remove dead monolith code paths and keep tests passing.
|
|
|
|
---
|
|
|
|
## Security
|
|
|
|
- **Never pass unsanitised user input directly to `subprocess`.** The
|
|
`compress_with_caesium` function builds the command as a list (not a shell
|
|
string). Keep it that way — do not use `shell=True`.
|
|
- **Validate file extensions before compression.** `compress_with_caesium`
|
|
checks `ext not in ALLOWED_EXT` and returns `None` for unrecognised types.
|
|
Do not bypass or widen this check without explicit justification.
|
|
- **Validate input paths early.** `process_single_deck` checks that the input
|
|
exists and has a `.pptx` suffix before doing any filesystem work.
|
|
- **Temp files are written atomically.** Image replacement uses a `.tmp`
|
|
intermediate and `Path.replace()` (atomic rename) — do not change this to a
|
|
direct overwrite.
|
|
- **`capture_output=True`** is set on all subprocess calls so that stdout/stderr
|
|
from `caesiumclt` cannot interfere with or inject into the tool's own output.
|
|
- **Do not log file contents**, only metadata (name, size, slide references).
|
|
The CSV log must never contain image binary data or path information outside
|
|
the output directory.
|
|
- **`ignore_errors=True` on `shutil.rmtree`** is acceptable for temp cleanup
|
|
only. Never suppress errors on writes to the output PPTX or its log file.
|