build.bat: added upx minify for caesiumclt.exe
This commit is contained in:
@@ -0,0 +1,236 @@
|
||||
# AGENTS.md — pptx-image-compress
|
||||
|
||||
Guidelines for AI agents and contributors working in this codebase.
|
||||
|
||||
---
|
||||
|
||||
## Project Overview
|
||||
|
||||
Single-file Python CLI tool (`pptx_image_compress.py`) that compresses images
|
||||
inside `.pptx` files using the external binary `caesiumclt`. Supports single-
|
||||
file and batch modes, multi-threaded processing, and CSV logging.
|
||||
|
||||
**Entry point:** `pptx_image_compress.py` → `main()`
|
||||
**Tests:** `test_pptx_image_compress.py` (stdlib `unittest`, run via `pytest`)
|
||||
**External dependency:** `caesiumclt` must be on `PATH`
|
||||
|
||||
---
|
||||
|
||||
## Running Tests
|
||||
|
||||
```bash
|
||||
python -m pytest test_pptx_image_compress.py -v
|
||||
```
|
||||
|
||||
All 5 tests must pass before any change is considered complete.
|
||||
Never remove or weaken an existing test. Always add a test for new behaviour.
|
||||
|
||||
---
|
||||
|
||||
## Code Readability
|
||||
|
||||
- **One responsibility per function.** If a function does more than one thing,
|
||||
split it.
|
||||
- **Descriptive names.** Avoid single-letter variables outside of short loops.
|
||||
Prefer `img_path` over `p`, `result` over `r`.
|
||||
- **Type-annotate every function signature** — parameters and return type.
|
||||
Use `Optional[X]` / `X | None` consistently (the codebase uses both; prefer
|
||||
`X | None` for new code on Python 3.10+).
|
||||
- **Constants at module level**, UPPER_SNAKE_CASE. Never hardcode magic values
|
||||
inline (e.g. file extensions, prefix strings, bar lengths).
|
||||
- **Section comments** (`# --- Section ---`) are used to separate logical
|
||||
blocks. Keep them and add new ones when introducing a new logical group.
|
||||
- **German UI strings are intentional** (progress output, error messages shown
|
||||
to the end-user). Keep them in German. Internal code identifiers stay in
|
||||
English.
|
||||
- **No dead code.** Remove commented-out blocks and unused functions before
|
||||
committing.
|
||||
|
||||
---
|
||||
|
||||
## Testability
|
||||
|
||||
- **Inject external dependencies via callable parameters.** The `compressor`
|
||||
parameter on `process_image_file` and `process_single_deck` is the canonical
|
||||
pattern — always use it for any new external-process call.
|
||||
- **Never call `shutil.which` or `subprocess` directly inside a function under
|
||||
test.** Route through an injectable or mockable seam.
|
||||
- **Tests use `tempfile.TemporaryDirectory`** for isolation. Every test must
|
||||
clean up after itself — rely on the context manager, not `tearDown`.
|
||||
- **Do not test private implementation details.** Test observable behaviour:
|
||||
return values, file contents, log output.
|
||||
- **One assertion focus per test.** A test named `test_X` should assert exactly
|
||||
what `X` does, with a minimal setup.
|
||||
- **Use `fake_compressor` pattern** (as seen in existing tests) to decouple
|
||||
image-compression logic from the real `caesiumclt` binary in all unit tests.
|
||||
|
||||
---
|
||||
|
||||
## Performance
|
||||
|
||||
- **Thread pool sizing:** outer thread count is controlled by `-t/--threads`
|
||||
(default 16). When `threads > 1`, each `caesiumclt` subprocess is launched
|
||||
with `--threads 1` to prevent CPU over-subscription. Do not change this
|
||||
without benchmarking.
|
||||
- **Scratch directories are per-image** (`img_{idx:06d}` sub-dirs) to avoid
|
||||
filename collisions across threads without locking.
|
||||
- **`Lock` scope must be minimal.** Only counter increments and `log_lines`
|
||||
appends are inside the lock — never I/O or subprocess calls.
|
||||
- **Avoid redundant filesystem walks.** `build_image_slide_index` is called
|
||||
once per deck, not per image. Keep it that way.
|
||||
- **`zip_dir_to_pptx` collects all files before writing** so `[Content_Types].xml`
|
||||
can be placed first. Do not revert this to a streaming walk.
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
|
||||
### Current state
|
||||
|
||||
Single-file design (`pptx_image_compress.py`) is intentional for zero-install
|
||||
distribution. It is acceptable as long as the file stays under ~700 lines.
|
||||
|
||||
### Target layout (clean architecture — migrate when the file grows)
|
||||
|
||||
When the project needs to grow, extract to a package following these layers.
|
||||
Dependencies must only point **inward** (CLI → Application → Domain ←
|
||||
Infrastructure implements Domain ports).
|
||||
|
||||
```
|
||||
pptx_compress/
|
||||
├── __init__.py
|
||||
├── __main__.py # python -m pptx_compress entry point
|
||||
│
|
||||
├── domain/ # innermost — zero external imports
|
||||
│ ├── __init__.py
|
||||
│ ├── models.py # DeckResult, ImageProcessResult (dataclasses)
|
||||
│ ├── constants.py # ALLOWED_EXT, TEMP_PREFIX, defaults
|
||||
│ └── ports.py # Compressor Protocol (typing.Protocol), SlideIndex ABC
|
||||
│
|
||||
├── application/ # orchestration — imports domain only
|
||||
│ ├── __init__.py
|
||||
│ ├── compress_deck.py # process_single_deck() use-case
|
||||
│ └── batch.py # batch loop, overall summary logic
|
||||
│
|
||||
├── infrastructure/ # implements domain ports — imports domain + stdlib/3rd-party
|
||||
│ ├── __init__.py
|
||||
│ ├── caesium_adapter.py # compress_with_caesium() (caesiumclt subprocess)
|
||||
│ ├── pptx_reader.py # discover_images(), build_image_slide_index()
|
||||
│ ├── pptx_writer.py # zip_dir_to_pptx()
|
||||
│ └── temp_manager.py # cleanup_old_temps(), TEMP_PREFIX lifecycle
|
||||
│
|
||||
└── cli/ # outermost — imports application only
|
||||
├── __init__.py
|
||||
├── args.py # argparse definition, expand_inputs(), collect_from_dir()
|
||||
└── output.py # print_progress(), format_duration(), human_mb/kb
|
||||
```
|
||||
|
||||
### Layer rules
|
||||
|
||||
| Layer | May import | Must NOT import |
|
||||
|---|---|---|
|
||||
| `domain` | stdlib only | everything else |
|
||||
| `application` | `domain` | `infrastructure`, `cli` |
|
||||
| `infrastructure` | `domain`, stdlib, 3rd-party | `application`, `cli` |
|
||||
| `cli` | `application`, `domain.models` | `infrastructure` directly |
|
||||
|
||||
### Key architectural decisions
|
||||
|
||||
- **`Compressor` is a `typing.Protocol`** (in `domain/ports.py`), not a bare
|
||||
`Callable`. This makes the contract explicit and IDE-checkable without
|
||||
creating an import cycle:
|
||||
```python
|
||||
class Compressor(Protocol):
|
||||
def __call__(
|
||||
self,
|
||||
original: Path,
|
||||
out_dir: Path,
|
||||
threads: int | None,
|
||||
quality: int,
|
||||
min_savings: str,
|
||||
) -> Path | None: ...
|
||||
```
|
||||
- **`DeckResult` and `ImageProcessResult` live in `domain/models.py`** — they
|
||||
are pure data, no logic, no I/O.
|
||||
- **`compress_deck.py` receives a `Compressor` instance via constructor or
|
||||
parameter** — never imports `caesium_adapter` directly. This is what makes
|
||||
the use-case fully unit-testable with a `fake_compressor`.
|
||||
- **`main()` (in `cli/args.py`) owns argument parsing only.** It resolves
|
||||
paths, builds the `Compressor` adapter, and calls `application.compress_deck`
|
||||
or `application.batch`. No processing logic belongs there.
|
||||
- **`expand_inputs` / `collect_from_dir` live in `cli/args.py`** — path
|
||||
resolution is a CLI concern. All layers below receive `Path` objects.
|
||||
- **Temp directory lifecycle belongs in `infrastructure/temp_manager.py`.**
|
||||
Always use `TEMP_PREFIX` so orphaned dirs from crashed runs are recoverable.
|
||||
|
||||
### Migration guide (single file → package)
|
||||
|
||||
1. Create the `pptx_compress/` directory.
|
||||
2. Move dataclasses and constants to `domain/`.
|
||||
3. Move `compress_with_caesium` → `infrastructure/caesium_adapter.py`.
|
||||
4. Move PPTX read/write helpers → `infrastructure/pptx_reader.py` and
|
||||
`pptx_writer.py`.
|
||||
5. Move `process_image_file` + `process_single_deck` → `application/compress_deck.py`.
|
||||
6. Move `main()` + input helpers → `cli/args.py`.
|
||||
7. Add `__main__.py` with `from pptx_compress.cli.args import main; main()`.
|
||||
8. Update `test_pptx_image_compress.py` imports accordingly — test logic does
|
||||
not need to change because the public API surface is identical.
|
||||
|
||||
### Refactoring plan (aligned with this AGENTS.md)
|
||||
|
||||
- Keep the same layer direction: `cli` → `application` → `domain`; only
|
||||
`infrastructure` implements domain ports.
|
||||
- Add dedicated raster/vector implementations behind domain ports, not in CLI:
|
||||
- `domain/ports.py`: `RasterCompressor`, `VectorCompressor` protocols
|
||||
(or one `Compressor` protocol + typed strategies)
|
||||
- `infrastructure/caesium_adapter.py`: raster implementation
|
||||
- `infrastructure/svg_polish_adapter.py`: vector implementation
|
||||
- Add routing in `application` (not `infrastructure`):
|
||||
- `application/compress_deck.py`: `CompressorRouter` decides by extension
|
||||
- no direct `subprocess` / external library calls in `application`
|
||||
- Split image workflow into explicit application steps:
|
||||
- `compress_step`
|
||||
- `optimal_format_step` (PNG → JPEG optimization step; not a fallback)
|
||||
- `replace_step` (atomic replace via `.tmp` + `Path.replace()`)
|
||||
- Centralize PPTX metadata handling in infrastructure modules:
|
||||
- keep relationship/content-type updates in `infrastructure/pptx_reader.py`
|
||||
and/or `infrastructure/pptx_writer.py`
|
||||
- `application` only orchestrates and passes domain models
|
||||
- Introduce configuration object in `domain/constants.py` or a dedicated
|
||||
domain config model; avoid new magic values in `application`.
|
||||
- Preserve public behaviour and CLI surface during migration; refactor in
|
||||
small commits with green tests after each step.
|
||||
|
||||
### Suggested commit sequence
|
||||
|
||||
1. Extract domain models/constants/ports unchanged.
|
||||
2. Extract caesium adapter + add svg_polish adapter seam.
|
||||
3. Introduce router in `application` with extension-based dispatch.
|
||||
4. Refactor image processing into `compress_step` + `optimal_format_step` +
|
||||
`replace_step`.
|
||||
5. Extract PPTX metadata update helpers to infrastructure modules.
|
||||
6. Move CLI parsing/output concerns into `cli/` only.
|
||||
7. Remove dead monolith code paths and keep tests passing.
|
||||
|
||||
---
|
||||
|
||||
## Security
|
||||
|
||||
- **Never pass unsanitised user input directly to `subprocess`.** The
|
||||
`compress_with_caesium` function builds the command as a list (not a shell
|
||||
string). Keep it that way — do not use `shell=True`.
|
||||
- **Validate file extensions before compression.** `compress_with_caesium`
|
||||
checks `ext not in ALLOWED_EXT` and returns `None` for unrecognised types.
|
||||
Do not bypass or widen this check without explicit justification.
|
||||
- **Validate input paths early.** `process_single_deck` checks that the input
|
||||
exists and has a `.pptx` suffix before doing any filesystem work.
|
||||
- **Temp files are written atomically.** Image replacement uses a `.tmp`
|
||||
intermediate and `Path.replace()` (atomic rename) — do not change this to a
|
||||
direct overwrite.
|
||||
- **`capture_output=True`** is set on all subprocess calls so that stdout/stderr
|
||||
from `caesiumclt` cannot interfere with or inject into the tool's own output.
|
||||
- **Do not log file contents**, only metadata (name, size, slide references).
|
||||
The CSV log must never contain image binary data or path information outside
|
||||
the output directory.
|
||||
- **`ignore_errors=True` on `shutil.rmtree`** is acceptable for temp cleanup
|
||||
only. Never suppress errors on writes to the output PPTX or its log file.
|
||||
Reference in New Issue
Block a user