AGENTS.md — pptx-image-compress
Guidelines for AI agents and contributors working in this codebase.
Project Overview
Single-file Python CLI tool (pptx_image_compress.py) that compresses images
inside .pptx files using the external binary caesiumclt. Supports single-
file and batch modes, multi-threaded processing, and CSV logging.
Entry point: pptx_image_compress.py → main()
Tests: test_pptx_image_compress.py (stdlib unittest, run via pytest)
External dependency: caesiumclt must be on PATH
Running Tests
python -m pytest test_pptx_image_compress.py -v
All 5 tests must pass before any change is considered complete.
Never remove or weaken an existing test. Always add a test for new behaviour.
Code Readability
- One responsibility per function. If a function does more than one thing, split it.
- Descriptive names. Avoid single-letter variables outside of short loops. Prefer `img_path` over `p`, `result` over `r`.
- Type-annotate every function signature: parameters and return type. Use `Optional[X]` / `X | None` consistently (the codebase uses both; prefer `X | None` for new code on Python 3.10+).
- Constants at module level, UPPER_SNAKE_CASE. Never hardcode magic values inline (e.g. file extensions, prefix strings, bar lengths).
- Section comments (`# --- Section ---`) are used to separate logical blocks. Keep them and add new ones when introducing a new logical group.
- German UI strings are intentional (progress output, error messages shown to the end-user). Keep them in German. Internal code identifiers stay in English.
- No dead code. Remove commented-out blocks and unused functions before committing.
Testability
- Inject external dependencies via callable parameters. The `compressor` parameter on `process_image_file` and `process_single_deck` is the canonical pattern; always use it for any new external-process call.
- Never call `shutil.which` or `subprocess` directly inside a function under test. Route through an injectable or mockable seam.
- Tests use `tempfile.TemporaryDirectory` for isolation. Every test must clean up after itself: rely on the context manager, not `tearDown`.
- Do not test private implementation details. Test observable behaviour: return values, file contents, log output.
- One assertion focus per test. A test named `test_X` should assert exactly what `X` does, with minimal setup.
- Use the `fake_compressor` pattern (as seen in existing tests) to decouple image-compression logic from the real `caesiumclt` binary in all unit tests.
Performance
- Thread pool sizing: outer thread count is controlled by `-t` / `--threads` (default 16). When `threads > 1`, each `caesiumclt` subprocess is launched with `--threads 1` to prevent CPU over-subscription. Do not change this without benchmarking.
- Scratch directories are per-image (`img_{idx:06d}` sub-dirs) to avoid filename collisions across threads without locking.
- Lock scope must be minimal. Only counter increments and `log_lines` appends are inside the lock, never I/O or subprocess calls.
- Avoid redundant filesystem walks. `build_image_slide_index` is called once per deck, not per image. Keep it that way.
- `zip_dir_to_pptx` collects all files before writing so `[Content_Types].xml` can be placed first. Do not revert this to a streaming walk.
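The minimal-lock-scope rule can be illustrated with a small sketch. The names `SharedProgress`, `slow_work`, `process_one`, and `process_all` are hypothetical and do not come from the codebase; `slow_work` merely stands in for the real I/O and subprocess work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class SharedProgress:
    """State shared between worker threads (hypothetical name)."""
    lock: threading.Lock = field(default_factory=threading.Lock)
    done_count: int = 0
    log_lines: list[str] = field(default_factory=list)


def slow_work(image_name: str) -> str:
    """Stand-in for file I/O and the compressor subprocess call."""
    return f"{image_name};ok"


def process_one(image_name: str, progress: SharedProgress) -> None:
    log_line = slow_work(image_name)  # expensive part runs without the lock
    with progress.lock:  # only cheap bookkeeping happens under the lock
        progress.done_count += 1
        progress.log_lines.append(log_line)


def process_all(image_names: list[str], threads: int) -> SharedProgress:
    progress = SharedProgress()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() drains the iterator so worker exceptions surface here.
        list(pool.map(lambda name: process_one(name, progress), image_names))
    return progress
```

Holding the lock only for the counter increment and the list append keeps threads from serialising on the slow part of the work.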
Architecture
Current state
Single-file design (pptx_image_compress.py) is intentional for zero-install
distribution. It is acceptable as long as the file stays under ~700 lines.
Target layout (clean architecture — migrate when the file grows)
When the project needs to grow, extract to a package following these layers. Dependencies must only point inward (CLI → Application → Domain ← Infrastructure implements Domain ports).
pptx_compress/
├── __init__.py
├── __main__.py # python -m pptx_compress entry point
│
├── domain/ # innermost — zero external imports
│ ├── __init__.py
│ ├── models.py # DeckResult, ImageProcessResult (dataclasses)
│ ├── constants.py # ALLOWED_EXT, TEMP_PREFIX, defaults
│ └── ports.py # Compressor Protocol (typing.Protocol), SlideIndex ABC
│
├── application/ # orchestration — imports domain only
│ ├── __init__.py
│ ├── compress_deck.py # process_single_deck() use-case
│ └── batch.py # batch loop, overall summary logic
│
├── infrastructure/ # implements domain ports — imports domain + stdlib/3rd-party
│ ├── __init__.py
│ ├── caesium_adapter.py # compress_with_caesium() (caesiumclt subprocess)
│ ├── pptx_reader.py # discover_images(), build_image_slide_index()
│ ├── pptx_writer.py # zip_dir_to_pptx()
│ └── temp_manager.py # cleanup_old_temps(), TEMP_PREFIX lifecycle
│
└── cli/ # outermost — imports application only
    ├── __init__.py
    ├── args.py # argparse definition, expand_inputs(), collect_from_dir()
    └── output.py # print_progress(), format_duration(), human_mb/kb
Layer rules
| Layer | May import | Must NOT import |
|---|---|---|
| `domain` | stdlib only | everything else |
| `application` | `domain` | `infrastructure`, `cli` |
| `infrastructure` | `domain`, stdlib, 3rd-party | `application`, `cli` |
| `cli` | `application`, `domain.models` | `infrastructure` directly |
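One way to keep the layer rules from drifting is a small import check. The sketch below is an assumption, not part of the project: `find_layer_violations` and `FORBIDDEN_LAYERS` are hypothetical names, and the check covers only absolute imports between `pptx_compress` layers (it does not detect third-party imports inside `domain`).

```python
import ast

PACKAGE_NAME = "pptx_compress"

# Layers each layer must not import, transcribed from the table above.
FORBIDDEN_LAYERS: dict[str, set[str]] = {
    "domain": {"application", "infrastructure", "cli"},
    "application": {"infrastructure", "cli"},
    "infrastructure": {"application", "cli"},
    "cli": {"infrastructure"},
}


def imported_modules(source: str) -> list[str]:
    """Return dotted module names of all absolute imports in the source."""
    modules: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.append(node.module)
    return modules


def find_layer_violations(source: str, layer: str) -> list[str]:
    """List imports in the source that the given layer must not use."""
    violations: list[str] = []
    for module in imported_modules(source):
        parts = module.split(".")
        is_package_import = len(parts) >= 2 and parts[0] == PACKAGE_NAME
        if is_package_import and parts[1] in FORBIDDEN_LAYERS[layer]:
            violations.append(module)
    return violations
```

A test could read every file under a layer directory and assert that `find_layer_violations` returns an empty list for each.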
Key architectural decisions
- `Compressor` is a `typing.Protocol` (in `domain/ports.py`), not a bare `Callable`. This makes the contract explicit and IDE-checkable without creating an import cycle:

  ```python
  from pathlib import Path
  from typing import Protocol


  class Compressor(Protocol):
      def __call__(
          self,
          original: Path,
          out_dir: Path,
          threads: int | None,
          quality: int,
          min_savings: str,
      ) -> Path | None: ...
  ```

- `DeckResult` and `ImageProcessResult` live in `domain/models.py`. They are pure data: no logic, no I/O.
- `compress_deck.py` receives a `Compressor` instance via constructor or parameter and never imports `caesium_adapter` directly. This is what makes the use-case fully unit-testable with a `fake_compressor`.
- `main()` (in `cli/args.py`) owns argument parsing and wiring only. It resolves paths, builds the `Compressor` adapter, and calls `application.compress_deck` or `application.batch`. No processing logic belongs there.
- `expand_inputs` / `collect_from_dir` live in `cli/args.py` because path resolution is a CLI concern. All layers below receive `Path` objects.
- Temp directory lifecycle belongs in `infrastructure/temp_manager.py`. Always use `TEMP_PREFIX` so orphaned dirs from crashed runs are recoverable.
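The reason for the shared prefix is that a later run can find and remove leftovers by name alone. A minimal sketch, assuming a made-up prefix value and a hypothetical signature for `cleanup_old_temps` (the real function may take different arguments); `make_work_dir` is likewise a hypothetical helper:

```python
import shutil
import tempfile
from pathlib import Path

# Assumed value for illustration; the real constant may differ.
TEMP_PREFIX = "pptx_compress_tmp_"


def make_work_dir(parent: Path) -> Path:
    """Create a scratch directory whose name starts with TEMP_PREFIX."""
    return Path(tempfile.mkdtemp(prefix=TEMP_PREFIX, dir=parent))


def cleanup_old_temps(parent: Path) -> int:
    """Remove leftover scratch directories; return how many were found."""
    removed = 0
    # list() snapshots the entries before anything is deleted.
    for entry in list(parent.iterdir()):
        if entry.is_dir() and entry.name.startswith(TEMP_PREFIX):
            # ignore_errors is acceptable here: temp cleanup only.
            shutil.rmtree(entry, ignore_errors=True)
            removed += 1
    return removed
```

Directories without the prefix are left untouched, so the cleanup cannot remove user data that happens to sit next to the scratch dirs.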
Migration guide (single file → package)
- Create the `pptx_compress/` directory.
- Move dataclasses and constants to `domain/`.
- Move `compress_with_caesium` → `infrastructure/caesium_adapter.py`.
- Move PPTX read/write helpers → `infrastructure/pptx_reader.py` and `pptx_writer.py`.
- Move `process_image_file` + `process_single_deck` → `application/compress_deck.py`.
- Move `main()` + input helpers → `cli/args.py`.
- Add `__main__.py` with `from pptx_compress.cli.args import main; main()`.
- Update `test_pptx_image_compress.py` imports accordingly; test logic does not need to change because the public API surface is identical.
Refactoring plan (aligned with this AGENTS.md)
- Keep the same layer direction: `cli` → `application` → `domain`; only `infrastructure` implements domain ports.
- Add dedicated raster/vector implementations behind domain ports, not in CLI:
  - `domain/ports.py`: `RasterCompressor`, `VectorCompressor` protocols (or one `Compressor` protocol + typed strategies)
  - `infrastructure/caesium_adapter.py`: raster implementation
  - `infrastructure/svg_polish_adapter.py`: vector implementation
- Add routing in `application` (not `infrastructure`):
  - `application/compress_deck.py`: `CompressorRouter` decides by extension
  - no direct `subprocess` / external library calls in `application`
- Split image workflow into explicit application steps:
  - `compress_step`
  - `optimal_format_step` (PNG → JPEG optimization step; not a fallback)
  - `replace_step` (atomic replace via `.tmp` + `Path.replace()`)
- Centralize PPTX metadata handling in infrastructure modules:
  - keep relationship/content-type updates in `infrastructure/pptx_reader.py` and/or `infrastructure/pptx_writer.py`
  - `application` only orchestrates and passes domain models
- Introduce configuration object in `domain/constants.py` or a dedicated domain config model; avoid new magic values in `application`.
- Preserve public behaviour and CLI surface during migration; refactor in small commits with green tests after each step.
Suggested commit sequence
- Extract domain models/constants/ports unchanged.
- Extract caesium adapter + add svg_polish adapter seam.
- Introduce router in `application` with extension-based dispatch.
- Refactor image processing into `compress_step` + `optimal_format_step` + `replace_step`.
- Extract PPTX metadata update helpers to infrastructure modules.
- Move CLI parsing/output concerns into `cli/` only.
- Remove dead monolith code paths and keep tests passing.
Security
- Never pass unsanitised user input directly to `subprocess`. The `compress_with_caesium` function builds the command as a list (not a shell string). Keep it that way; do not use `shell=True`.
- Validate file extensions before compression. `compress_with_caesium` checks `ext not in ALLOWED_EXT` and returns `None` for unrecognised types. Do not bypass or widen this check without explicit justification.
- Validate input paths early. `process_single_deck` checks that the input exists and has a `.pptx` suffix before doing any filesystem work.
- Temp files are written atomically. Image replacement uses a `.tmp` intermediate and `Path.replace()` (atomic rename); do not change this to a direct overwrite.
- `capture_output=True` is set on all subprocess calls so that stdout/stderr from `caesiumclt` cannot interfere with or inject into the tool's own output.
- Do not log file contents, only metadata (name, size, slide references). The CSV log must never contain image binary data or path information outside the output directory.
- `ignore_errors=True` on `shutil.rmtree` is acceptable for temp cleanup only. Never suppress errors on writes to the output PPTX or its log file.
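Two of the rules above, the `.tmp` + `Path.replace()` swap and the list-form subprocess call, can be sketched in isolation. The helper names `replace_atomically` and `echo_argument` are hypothetical, and the Python interpreter stands in for `caesiumclt` so that no real flags have to be assumed. The rename is atomic on POSIX when both paths are on the same filesystem, which holds here because the `.tmp` file sits next to the target.

```python
import subprocess
import sys
from pathlib import Path


def replace_atomically(target: Path, new_bytes: bytes) -> None:
    """Write next to the target, then swap it in with a single rename."""
    tmp_path = target.with_name(target.name + ".tmp")
    tmp_path.write_bytes(new_bytes)
    tmp_path.replace(target)  # readers see the old or the new file, never a partial one


def echo_argument(argument: str) -> str:
    """Run a child process in list form; no shell ever parses the argument."""
    completed = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", argument],
        capture_output=True,  # child output is collected, not printed
        text=True,
        check=True,
    )
    return completed.stdout.strip()
```

With the list form, a file name such as `deck; echo injected.png` reaches the child as one literal argument instead of being split into two shell commands.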