AGENTS.md — pptx-image-compress

Guidelines for AI agents and contributors working in this codebase.


Project Overview

Single-file Python CLI tool (pptx_image_compress.py) that compresses images inside .pptx files using the external binary caesiumclt. Supports single-file and batch modes, multi-threaded processing, and CSV logging.

Entry point: pptx_image_compress.py → main()
Tests: test_pptx_image_compress.py (stdlib unittest, run via pytest)
External dependency: caesiumclt must be on PATH


Running Tests

python -m pytest test_pptx_image_compress.py -v

All 5 tests must pass before any change is considered complete.
Never remove or weaken an existing test. Always add a test for new behaviour.


Code Readability

  • One responsibility per function. If a function does more than one thing, split it.
  • Descriptive names. Avoid single-letter variables outside of short loops. Prefer img_path over p, result over r.
  • Type-annotate every function signature, both parameters and return type. The codebase currently mixes Optional[X] and X | None; prefer X | None for new code (Python 3.10+) and do not mix the two styles within one function.
  • Constants at module level, UPPER_SNAKE_CASE. Never hardcode magic values inline (e.g. file extensions, prefix strings, bar lengths).
  • Section comments (# --- Section ---) are used to separate logical blocks. Keep them and add new ones when introducing a new logical group.
  • German UI strings are intentional (progress output, error messages shown to the end-user). Keep them in German. Internal code identifiers stay in English.
  • No dead code. Remove commented-out blocks and unused functions before committing.

Testability

  • Inject external dependencies via callable parameters. The compressor parameter on process_image_file and process_single_deck is the canonical pattern — always use it for any new external-process call.
  • Never call shutil.which or subprocess directly inside a function under test. Route through an injectable or mockable seam.
  • Tests use tempfile.TemporaryDirectory for isolation. Every test must clean up after itself — rely on the context manager, not tearDown.
  • Do not test private implementation details. Test observable behaviour: return values, file contents, log output.
  • One assertion focus per test. A test named test_X should assert exactly what X does, with a minimal setup.
  • Use the fake_compressor pattern (as seen in existing tests) to decouple image-compression logic from the real caesiumclt binary in all unit tests.
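
The injection pattern described above might look roughly like this in a test. This is a minimal sketch, not the project's actual code: process_image_sketch is a hypothetical stand-in for process_image_file (whose real signature may differ), the argument values 1, 80 and "5%" are arbitrary placeholders, and fake_compressor simply follows the five-parameter Compressor signature given under "Key architectural decisions".

```python
from __future__ import annotations

import tempfile
import unittest
from pathlib import Path
from typing import Callable


def process_image_sketch(
    img_path: Path,
    scratch_dir: Path,
    compressor: Callable[..., Path | None],
) -> bool:
    """Hypothetical stand-in for process_image_file: True if the image was replaced."""
    # Placeholder arguments: threads=1, quality=80, min_savings="5%".
    compressed = compressor(img_path, scratch_dir, 1, 80, "5%")
    if compressed is None:
        return False
    # Write to a .tmp sibling first, then rename over the original.
    tmp_path = img_path.with_name(img_path.name + ".tmp")
    tmp_path.write_bytes(compressed.read_bytes())
    tmp_path.replace(img_path)
    return True


def fake_compressor(
    original: Path,
    out_dir: Path,
    threads: int | None,
    quality: int,
    min_savings: str,
) -> Path | None:
    """Pretend to compress: write a file half the size of the original into out_dir."""
    data = original.read_bytes()
    result = out_dir / original.name
    result.write_bytes(data[: len(data) // 2])
    return result


class ProcessImageSketchTest(unittest.TestCase):
    def test_image_is_replaced_when_compressor_returns_a_file(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            img_path = Path(tmp) / "image1.png"
            img_path.write_bytes(b"x" * 100)
            scratch_dir = Path(tmp) / "scratch"
            scratch_dir.mkdir()

            replaced = process_image_sketch(img_path, scratch_dir, fake_compressor)

            self.assertTrue(replaced)
            self.assertEqual(img_path.stat().st_size, 50)

    def test_image_is_kept_when_compressor_returns_none(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            img_path = Path(tmp) / "image1.png"
            img_path.write_bytes(b"x" * 100)

            replaced = process_image_sketch(img_path, Path(tmp), lambda *args: None)

            self.assertFalse(replaced)
            self.assertEqual(img_path.stat().st_size, 100)
```

Because the compressor is an ordinary parameter, neither test needs caesiumclt on PATH, and each one cleans up through the TemporaryDirectory context manager.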

Performance

  • Thread pool sizing: outer thread count is controlled by -t/--threads (default 16). When threads > 1, each caesiumclt subprocess is launched with --threads 1 to prevent CPU over-subscription. Do not change this without benchmarking.
  • Scratch directories are per-image (img_{idx:06d} sub-dirs) to avoid filename collisions across threads without locking.
  • Lock scope must be minimal. Only counter increments and log_lines appends are inside the lock — never I/O or subprocess calls.
  • Avoid redundant filesystem walks. build_image_slide_index is called once per deck, not per image. Keep it that way.
  • zip_dir_to_pptx collects all files before writing so [Content_Types].xml can be placed first. Do not revert this to a streaming walk.
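
The scratch-directory and lock-scope rules above can be illustrated with a small sketch. It is an assumption-laden illustration rather than the tool's real worker: process_all_sketch and its log-line format are hypothetical, and a plain file copy stands in for the caesiumclt subprocess call.

```python
from __future__ import annotations

import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_all_sketch(
    image_paths: list[Path],
    scratch_root: Path,
    threads: int,
) -> tuple[int, list[str]]:
    """Hypothetical sketch: process images in parallel, sharing only a counter and a log list."""
    lock = threading.Lock()
    processed_count = 0
    log_lines: list[str] = []

    def worker(idx: int, img_path: Path) -> None:
        nonlocal processed_count
        # Per-image scratch dir: no two threads ever write to the same directory.
        scratch_dir = scratch_root / f"img_{idx:06d}"
        scratch_dir.mkdir(parents=True, exist_ok=True)

        # Slow work (file I/O here; the subprocess call in the real tool) stays outside the lock.
        size_bytes = img_path.stat().st_size
        (scratch_dir / img_path.name).write_bytes(img_path.read_bytes())
        line = f"{img_path.name},{size_bytes}"

        # Only the shared counter and the log list are touched under the lock.
        with lock:
            processed_count += 1
            log_lines.append(line)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(worker, idx, path) for idx, path in enumerate(image_paths)]
        for future in futures:
            future.result()  # re-raise any exception from a worker

    return processed_count, log_lines
```

Log lines arrive in completion order, not input order, so any caller that needs a stable order should sort them after the pool has finished.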

Architecture

Current state

Single-file design (pptx_image_compress.py) is intentional for zero-install distribution. It is acceptable as long as the file stays under ~700 lines.

Target layout (clean architecture — migrate when the file grows)

When the project needs to grow, extract to a package following these layers. Dependencies must only point inward (CLI → Application → Domain ← Infrastructure implements Domain ports).

pptx_compress/
├── __init__.py
├── __main__.py                  # python -m pptx_compress entry point
│
├── domain/                      # innermost — zero external imports
│   ├── __init__.py
│   ├── models.py                # DeckResult, ImageProcessResult (dataclasses)
│   ├── constants.py             # ALLOWED_EXT, TEMP_PREFIX, defaults
│   └── ports.py                 # Compressor Protocol (typing.Protocol), SlideIndex ABC
│
├── application/                 # orchestration — imports domain only
│   ├── __init__.py
│   ├── compress_deck.py         # process_single_deck() use-case
│   └── batch.py                 # batch loop, overall summary logic
│
├── infrastructure/              # implements domain ports — imports domain + stdlib/3rd-party
│   ├── __init__.py
│   ├── caesium_adapter.py       # compress_with_caesium() (caesiumclt subprocess)
│   ├── pptx_reader.py           # discover_images(), build_image_slide_index()
│   ├── pptx_writer.py           # zip_dir_to_pptx()
│   └── temp_manager.py          # cleanup_old_temps(), TEMP_PREFIX lifecycle
│
└── cli/                         # outermost — imports application only
    ├── __init__.py
    ├── args.py                  # argparse definition, expand_inputs(), collect_from_dir()
    └── output.py                # print_progress(), format_duration(), human_mb/kb

Layer rules

Layer            May import                    Must NOT import
domain           stdlib only                   everything else
application      domain                        infrastructure, cli
infrastructure   domain, stdlib, 3rd-party     application, cli
cli              application, domain.models    infrastructure directly
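
The layer rules above could be checked mechanically once the package exists. The following is only a sketch under stated assumptions: find_layer_violations and FORBIDDEN_IMPORTS are hypothetical names, the check covers absolute imports inside pptx_compress only (relative imports and the "stdlib only" rule for domain are not covered), and it inspects source text rather than importing anything.

```python
from __future__ import annotations

import ast

PACKAGE_NAME = "pptx_compress"

# Sibling layers that each layer must not import, taken from the layer rules.
FORBIDDEN_IMPORTS = {
    "domain": {"application", "infrastructure", "cli"},
    "application": {"infrastructure", "cli"},
    "infrastructure": {"application", "cli"},
    "cli": {"infrastructure"},
}


def find_layer_violations(layer: str, source: str) -> list[str]:
    """Return the forbidden intra-package imports found in source, which belongs to layer."""
    violations: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Join module and name so "from pptx_compress import cli" is caught too.
            imported = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue  # relative imports are out of scope for this sketch
        for dotted_name in imported:
            parts = dotted_name.split(".")
            if (
                len(parts) >= 2
                and parts[0] == PACKAGE_NAME
                and parts[1] in FORBIDDEN_IMPORTS[layer]
            ):
                violations.append(dotted_name)
    return violations
```

A test could read every module under application/ and assert that find_layer_violations("application", source) returns an empty list, which would turn the import rules into a failing test instead of a review comment.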

Key architectural decisions

  • Compressor is a typing.Protocol (in domain/ports.py), not a bare Callable. This makes the contract explicit and IDE-checkable without creating an import cycle:
    class Compressor(Protocol):
        def __call__(
            self,
            original: Path,
            out_dir: Path,
            threads: int | None,
            quality: int,
            min_savings: str,
        ) -> Path | None: ...
    
  • DeckResult and ImageProcessResult live in domain/models.py — they are pure data, no logic, no I/O.
  • compress_deck.py receives a Compressor instance via constructor or parameter — never imports caesium_adapter directly. This is what makes the use-case fully unit-testable with a fake_compressor.
  • main() (in cli/args.py) owns argument parsing and wiring only. It resolves paths, builds the Compressor adapter, and calls application.compress_deck or application.batch. No processing logic belongs there.
  • expand_inputs / collect_from_dir live in cli/args.py — path resolution is a CLI concern. All layers below receive Path objects.
  • Temp directory lifecycle belongs in infrastructure/temp_manager.py. Always use TEMP_PREFIX so orphaned dirs from crashed runs are recoverable.
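
The temp-directory rule above might be sketched as follows. The names and values are assumptions for illustration: the TEMP_PREFIX string shown here is invented (the real constant is defined in the codebase), and create_work_dir and cleanup_old_temps_sketch are hypothetical helpers, not the signatures of the existing cleanup_old_temps.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path

# Assumed value for illustration only; the real TEMP_PREFIX lives in the codebase.
TEMP_PREFIX = "pptx_compress_tmp_"


def create_work_dir(parent: Path) -> Path:
    """Create a fresh working directory whose name starts with TEMP_PREFIX."""
    return Path(tempfile.mkdtemp(prefix=TEMP_PREFIX, dir=parent))


def cleanup_old_temps_sketch(parent: Path) -> int:
    """Remove leftover TEMP_PREFIX directories in parent and return how many were found."""
    found = 0
    for entry in parent.iterdir():
        # Match on the prefix only, so unrelated directories are never touched.
        if entry.is_dir() and entry.name.startswith(TEMP_PREFIX):
            shutil.rmtree(entry, ignore_errors=True)  # acceptable for temp cleanup only
            found += 1
    return found
```

Because every working directory shares one prefix, a later run can recognise and remove directories left behind by a crashed run without keeping any extra state.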

Migration guide (single file → package)

  1. Create the pptx_compress/ directory.
  2. Move dataclasses and constants to domain/.
  3. Move compress_with_caesium → infrastructure/caesium_adapter.py.
  4. Move PPTX read/write helpers → infrastructure/pptx_reader.py and pptx_writer.py.
  5. Move process_image_file + process_single_deck → application/compress_deck.py.
  6. Move main() + input helpers → cli/args.py.
  7. Add __main__.py with from pptx_compress.cli.args import main; main().
  8. Update test_pptx_image_compress.py imports accordingly — test logic does not need to change because the public API surface is identical.

Refactoring plan (aligned with this AGENTS.md)

  • Keep the same layer direction: cli → application → domain; only infrastructure implements domain ports.
  • Add dedicated raster/vector implementations behind domain ports, not in CLI:
    • domain/ports.py: RasterCompressor, VectorCompressor protocols (or one Compressor protocol + typed strategies)
    • infrastructure/caesium_adapter.py: raster implementation
    • infrastructure/svg_polish_adapter.py: vector implementation
  • Add routing in application (not infrastructure):
    • application/compress_deck.py: CompressorRouter decides by extension
    • no direct subprocess / external library calls in application
  • Split image workflow into explicit application steps:
    • compress_step
    • optimal_format_step (PNG → JPEG optimization step; not a fallback)
    • replace_step (atomic replace via .tmp + Path.replace())
  • Centralize PPTX metadata handling in infrastructure modules:
    • keep relationship/content-type updates in infrastructure/pptx_reader.py and/or infrastructure/pptx_writer.py
    • application only orchestrates and passes domain models
  • Introduce a configuration object in domain/constants.py or a dedicated domain config model; avoid new magic values in application.
  • Preserve public behaviour and CLI surface during migration; refactor in small commits with green tests after each step.
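
The extension-based routing in the plan above could take a shape like the following. This is a sketch, not a settled design: the RASTER_EXT and VECTOR_EXT sets are assumed for illustration (the real ALLOWED_EXT is defined in the codebase), and CompressorRouter is shown as one possible form that itself satisfies the Compressor protocol, so callers would not need to know that routing happens.

```python
from __future__ import annotations

from pathlib import Path
from typing import Protocol


class Compressor(Protocol):
    def __call__(
        self,
        original: Path,
        out_dir: Path,
        threads: int | None,
        quality: int,
        min_savings: str,
    ) -> Path | None: ...


# Assumed extension sets for illustration; not the project's actual constants.
RASTER_EXT = frozenset({".png", ".jpg", ".jpeg"})
VECTOR_EXT = frozenset({".svg"})


class CompressorRouter:
    """Hypothetical router: picks the raster or vector compressor by file extension."""

    def __init__(self, raster: Compressor, vector: Compressor) -> None:
        self._raster = raster
        self._vector = vector

    def __call__(
        self,
        original: Path,
        out_dir: Path,
        threads: int | None,
        quality: int,
        min_savings: str,
    ) -> Path | None:
        ext = original.suffix.lower()
        if ext in RASTER_EXT:
            return self._raster(original, out_dir, threads, quality, min_savings)
        if ext in VECTOR_EXT:
            return self._vector(original, out_dir, threads, quality, min_savings)
        return None  # unrecognised type: leave the image untouched
```

The router receives both adapters through its constructor, so application code never imports an infrastructure module and tests can pass two fakes.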

Suggested commit sequence

  1. Extract domain models/constants/ports unchanged.
  2. Extract caesium adapter + add svg_polish adapter seam.
  3. Introduce router in application with extension-based dispatch.
  4. Refactor image processing into compress_step + optimal_format_step + replace_step.
  5. Extract PPTX metadata update helpers to infrastructure modules.
  6. Move CLI parsing/output concerns into cli/ only.
  7. Remove dead monolith code paths and keep tests passing.

Security

  • Never pass unsanitised user input directly to subprocess. The compress_with_caesium function builds the command as a list (not a shell string). Keep it that way — do not use shell=True.
  • Validate file extensions before compression. compress_with_caesium checks ext not in ALLOWED_EXT and returns None for unrecognised types. Do not bypass or widen this check without explicit justification.
  • Validate input paths early. process_single_deck checks that the input exists and has a .pptx suffix before doing any filesystem work.
  • Temp files are written atomically. Image replacement uses a .tmp intermediate and Path.replace() (atomic rename) — do not change this to a direct overwrite.
  • capture_output=True is set on all subprocess calls so that stdout/stderr from caesiumclt cannot interfere with or inject into the tool's own output.
  • Do not log file contents, only metadata (name, size, slide references). The CSV log must never contain image binary data or path information outside the output directory.
  • ignore_errors=True on shutil.rmtree is acceptable for temp cleanup only. Never suppress errors on writes to the output PPTX or its log file.
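
Two of the rules above, list-form subprocess calls and replacement through a .tmp file, can be sketched in a few lines. The helper names run_tool_sketch and replace_atomically are hypothetical, and no caesiumclt flags are shown because the exact command line is defined in compress_with_caesium, not here.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path


def run_tool_sketch(command: list[str]) -> subprocess.CompletedProcess:
    """Run an external command given as an argument list, never through a shell."""
    # Each list element becomes exactly one argv entry, so spaces or shell
    # metacharacters in a file name are passed through literally.
    return subprocess.run(command, capture_output=True, text=True, check=False)


def replace_atomically(target: Path, new_bytes: bytes) -> None:
    """Write new_bytes next to target, then rename the temp file over it."""
    tmp_path = target.with_name(target.name + ".tmp")
    tmp_path.write_bytes(new_bytes)
    # Path.replace wraps os.replace; within one directory a reader sees either
    # the old file or the new one, never a half-written file.
    tmp_path.replace(target)


if __name__ == "__main__":
    # A hostile-looking file name is printed back unchanged instead of being executed.
    echo_argument = [sys.executable, "-c", "import sys; print(sys.argv[1])", "a; echo injected.png"]
    print(run_tool_sketch(echo_argument).stdout.strip())
```

With shell=True the same file name would be handed to a shell for interpretation, which is why the list form must stay.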