build.bat: added upx minify for caesiumclt.exe

This commit is contained in:
2026-06-17 11:34:32 +02:00
parent bb1cf98aba
commit 697ed6dc84
3 changed files with 539 additions and 13 deletions
+236
View File
@@ -0,0 +1,236 @@
# AGENTS.md — pptx-image-compress
Guidelines for AI agents and contributors working in this codebase.
---
## Project Overview
Single-file Python CLI tool (`pptx_image_compress.py`) that compresses images
inside `.pptx` files using the external binary `caesiumclt`. Supports single-
file and batch modes, multi-threaded processing, and CSV logging.
**Entry point:** `pptx_image_compress.py``main()`
**Tests:** `test_pptx_image_compress.py` (stdlib `unittest`, run via `pytest`)
**External dependency:** `caesiumclt` must be on `PATH`
---
## Running Tests
```bash
python -m pytest test_pptx_image_compress.py -v
```
All 5 tests must pass before any change is considered complete.
Never remove or weaken an existing test. Always add a test for new behaviour.
---
## Code Readability
- **One responsibility per function.** If a function does more than one thing,
split it.
- **Descriptive names.** Avoid single-letter variables outside of short loops.
Prefer `img_path` over `p`, `result` over `r`.
- **Type-annotate every function signature** — parameters and return type.
Use `Optional[X]` / `X | None` consistently (the codebase uses both; prefer
`X | None` for new code on Python 3.10+).
- **Constants at module level**, UPPER_SNAKE_CASE. Never hardcode magic values
inline (e.g. file extensions, prefix strings, bar lengths).
- **Section comments** (`# --- Section ---`) are used to separate logical
blocks. Keep them and add new ones when introducing a new logical group.
- **German UI strings are intentional** (progress output, error messages shown
to the end-user). Keep them in German. Internal code identifiers stay in
English.
- **No dead code.** Remove commented-out blocks and unused functions before
committing.
---
## Testability
- **Inject external dependencies via callable parameters.** The `compressor`
parameter on `process_image_file` and `process_single_deck` is the canonical
pattern — always use it for any new external-process call.
- **Never call `shutil.which` or `subprocess` directly inside a function under
test.** Route through an injectable or mockable seam.
- **Tests use `tempfile.TemporaryDirectory`** for isolation. Every test must
clean up after itself — rely on the context manager, not `tearDown`.
- **Do not test private implementation details.** Test observable behaviour:
return values, file contents, log output.
- **One assertion focus per test.** A test named `test_X` should assert exactly
what `X` does, with a minimal setup.
- **Use `fake_compressor` pattern** (as seen in existing tests) to decouple
image-compression logic from the real `caesiumclt` binary in all unit tests.
---
## Performance
- **Thread pool sizing:** outer thread count is controlled by `-t/--threads`
(default 16). When `threads > 1`, each `caesiumclt` subprocess is launched
with `--threads 1` to prevent CPU over-subscription. Do not change this
without benchmarking.
- **Scratch directories are per-image** (`img_{idx:06d}` sub-dirs) to avoid
filename collisions across threads without locking.
- **`Lock` scope must be minimal.** Only counter increments and `log_lines`
appends are inside the lock — never I/O or subprocess calls.
- **Avoid redundant filesystem walks.** `build_image_slide_index` is called
once per deck, not per image. Keep it that way.
- **`zip_dir_to_pptx` collects all files before writing** so `[Content_Types].xml`
can be placed first. Do not revert this to a streaming walk.
---
## Architecture
### Current state
Single-file design (`pptx_image_compress.py`) is intentional for zero-install
distribution. It is acceptable as long as the file stays under ~700 lines.
### Target layout (clean architecture — migrate when the file grows)
When the project needs to grow, extract to a package following these layers.
Dependencies must only point **inward** (CLI → Application → Domain ←
Infrastructure implements Domain ports).
```
pptx_compress/
├── __init__.py
├── __main__.py # python -m pptx_compress entry point
├── domain/ # innermost — zero external imports
│ ├── __init__.py
│ ├── models.py # DeckResult, ImageProcessResult (dataclasses)
│ ├── constants.py # ALLOWED_EXT, TEMP_PREFIX, defaults
│ └── ports.py # Compressor Protocol (typing.Protocol), SlideIndex ABC
├── application/ # orchestration — imports domain only
│ ├── __init__.py
│ ├── compress_deck.py # process_single_deck() use-case
│ └── batch.py # batch loop, overall summary logic
├── infrastructure/ # implements domain ports — imports domain + stdlib/3rd-party
│ ├── __init__.py
│ ├── caesium_adapter.py # compress_with_caesium() (caesiumclt subprocess)
│ ├── pptx_reader.py # discover_images(), build_image_slide_index()
│ ├── pptx_writer.py # zip_dir_to_pptx()
│ └── temp_manager.py # cleanup_old_temps(), TEMP_PREFIX lifecycle
└── cli/ # outermost — imports application only
├── __init__.py
├── args.py # argparse definition, expand_inputs(), collect_from_dir()
└── output.py # print_progress(), format_duration(), human_mb/kb
```
### Layer rules
| Layer | May import | Must NOT import |
|---|---|---|
| `domain` | stdlib only | everything else |
| `application` | `domain` | `infrastructure`, `cli` |
| `infrastructure` | `domain`, stdlib, 3rd-party | `application`, `cli` |
| `cli` | `application`, `domain.models` | `infrastructure` directly |
### Key architectural decisions
- **`Compressor` is a `typing.Protocol`** (in `domain/ports.py`), not a bare
`Callable`. This makes the contract explicit and IDE-checkable without
creating an import cycle:
```python
class Compressor(Protocol):
def __call__(
self,
original: Path,
out_dir: Path,
threads: int | None,
quality: int,
min_savings: str,
) -> Path | None: ...
```
- **`DeckResult` and `ImageProcessResult` live in `domain/models.py`** — they
are pure data, no logic, no I/O.
- **`compress_deck.py` receives a `Compressor` instance via constructor or
parameter** — never imports `caesium_adapter` directly. This is what makes
the use-case fully unit-testable with a `fake_compressor`.
- **`main()` (in `cli/args.py`) owns argument parsing only.** It resolves
paths, builds the `Compressor` adapter, and calls `application.compress_deck`
or `application.batch`. No processing logic belongs there.
- **`expand_inputs` / `collect_from_dir` live in `cli/args.py`** — path
resolution is a CLI concern. All layers below receive `Path` objects.
- **Temp directory lifecycle belongs in `infrastructure/temp_manager.py`.**
Always use `TEMP_PREFIX` so orphaned dirs from crashed runs are recoverable.
### Migration guide (single file → package)
1. Create the `pptx_compress/` directory.
2. Move dataclasses and constants to `domain/`.
3. Move `compress_with_caesium` → `infrastructure/caesium_adapter.py`.
4. Move PPTX read/write helpers → `infrastructure/pptx_reader.py` and
`pptx_writer.py`.
5. Move `process_image_file` + `process_single_deck` → `application/compress_deck.py`.
6. Move `main()` + input helpers → `cli/args.py`.
7. Add `__main__.py` with `from pptx_compress.cli.args import main; main()`.
8. Update `test_pptx_image_compress.py` imports accordingly — test logic does
not need to change because the public API surface is identical.
### Refactoring plan (aligned with this AGENTS.md)
- Keep the same layer direction: `cli` → `application` → `domain`; only
`infrastructure` implements domain ports.
- Add dedicated raster/vector implementations behind domain ports, not in CLI:
- `domain/ports.py`: `RasterCompressor`, `VectorCompressor` protocols
(or one `Compressor` protocol + typed strategies)
- `infrastructure/caesium_adapter.py`: raster implementation
- `infrastructure/svg_polish_adapter.py`: vector implementation
- Add routing in `application` (not `infrastructure`):
- `application/compress_deck.py`: `CompressorRouter` decides by extension
- no direct `subprocess` / external library calls in `application`
- Split image workflow into explicit application steps:
- `compress_step`
- `optimal_format_step` (PNG → JPEG optimization step; not a fallback)
- `replace_step` (atomic replace via `.tmp` + `Path.replace()`)
- Centralize PPTX metadata handling in infrastructure modules:
- keep relationship/content-type updates in `infrastructure/pptx_reader.py`
and/or `infrastructure/pptx_writer.py`
- `application` only orchestrates and passes domain models
- Introduce configuration object in `domain/constants.py` or a dedicated
domain config model; avoid new magic values in `application`.
- Preserve public behaviour and CLI surface during migration; refactor in
small commits with green tests after each step.
### Suggested commit sequence
1. Extract domain models/constants/ports unchanged.
2. Extract caesium adapter + add svg_polish adapter seam.
3. Introduce router in `application` with extension-based dispatch.
4. Refactor image processing into `compress_step` + `optimal_format_step` +
`replace_step`.
5. Extract PPTX metadata update helpers to infrastructure modules.
6. Move CLI parsing/output concerns into `cli/` only.
7. Remove dead monolith code paths and keep tests passing.
---
## Security
- **Never pass unsanitised user input directly to `subprocess`.** The
`compress_with_caesium` function builds the command as a list (not a shell
string). Keep it that way — do not use `shell=True`.
- **Validate file extensions before compression.** `compress_with_caesium`
checks `ext not in ALLOWED_EXT` and returns `None` for unrecognised types.
Do not bypass or widen this check without explicit justification.
- **Validate input paths early.** `process_single_deck` checks that the input
exists and has a `.pptx` suffix before doing any filesystem work.
- **Temp files are written atomically.** Image replacement uses a `.tmp`
intermediate and `Path.replace()` (atomic rename) — do not change this to a
direct overwrite.
- **`capture_output=True`** is set on all subprocess calls so that stdout/stderr
from `caesiumclt` cannot interfere with or inject into the tool's own output.
- **Do not log file contents**, only metadata (name, size, slide references).
The CSV log must never contain image binary data or path information outside
the output directory.
- **`ignore_errors=True` on `shutil.rmtree`** is acceptable for temp cleanup
only. Never suppress errors on writes to the output PPTX or its log file.