feat: expose OCR/native signal on ExtractionResult (the value already exists internally, just isn't surfaced)

## Summary

`ExtractionResult` tells callers *what* text came out, but not *how* it was produced. Downstream code that needs to distinguish OCR'd text from natively extracted text (UI badges, telemetry, per-document quality routing) gets no signal from the extractor, so it has to either run two extraction passes and compare the results or inspect the content heuristically.

The frustrating part is that the signal **already exists inside the extractor**; it just isn't propagated out.

## The signal already exists

In [`crates/kreuzberg/src/extractors/pdf/mod.rs:553`](https://github.com/kreuzberg-dev/kreuzberg/blob/v4.8.6/crates/kreuzberg/src/extractors/pdf/mod.rs#L553) the PDF extractor computes:

```rust
let (text, used_ocr) = if config.effective_disable_ocr() {
    (native_text, false)
} else if config.force_ocr {
    // ...
    (ocr_text, true)
} else if let Some(ref ocr_pages) = config.force_ocr_pages {
    // mixed native + per-page OCR
    (mixed, true)
} else if let Some(ocr_config) = config.ocr.as_ref() {
    // per-page threshold evaluation
    // ...
}
```

`used_ocr` is already consumed internally at [line 696](https://github.com/kreuzberg-dev/kreuzberg/blob/v4.8.6/crates/kreuzberg/src/extractors/pdf/mod.rs#L696) (`use_structured_doc = !used_ocr && pre_rendered_doc.is_some()`) — it just never reaches `ExtractionResult`.
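
To make the branch precedence easier to scan, here is a minimal Python model of the excerpt. It is only a sketch: `PdfOcrSettings` and `used_ocr_from_excerpt` are hypothetical names that stand in for the Rust config, and the function returns `None` for every path the excerpt elides (the per-page threshold branch and whatever follows it) rather than guessing at it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PdfOcrSettings:
    """Hypothetical stand-in for the config fields the excerpt reads."""
    disable_ocr: bool = False  # models config.effective_disable_ocr()
    force_ocr: bool = False    # models config.force_ocr
    force_ocr_pages: Optional[Sequence[int]] = None  # models config.force_ocr_pages


def used_ocr_from_excerpt(cfg: PdfOcrSettings) -> Optional[bool]:
    """Return the flag the excerpt assigns, or None where the excerpt is elided."""
    if cfg.disable_ocr:
        return False  # (native_text, false)
    if cfg.force_ocr:
        return True   # (ocr_text, true)
    if cfg.force_ocr_pages is not None:
        return True   # (mixed, true): partial OCR still reports plain True
    # The per-page threshold branch is not shown in the excerpt.
    return None
```

Note that `disable_ocr` wins over `force_ocr`, and that the `force_ocr_pages` path reports the same value as full OCR.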

## What I observe on 4.8.6

```python
import asyncio

from kreuzberg import ExtractionConfig, OcrConfig, extract_file


async def main() -> None:
    # Native pass (scanned PDF, no text layer extracted)
    r1 = await extract_file("scan.pdf", config=ExtractionConfig(ocr=None))
    # Forced OCR on the same file
    r2 = await extract_file("scan.pdf", config=ExtractionConfig(
        ocr=OcrConfig(backend="tesseract", language="eng"), force_ocr=True
    ))

    print(r1.metadata.keys() == r2.metadata.keys())  # True: identical key sets
    # No ocr_used, no extraction_method, no marker of any kind.
    # r1.pages and r2.pages are both None for PDFs (only images get pages[] via #723).


asyncio.run(main())
```

## Proposed API

Three shapes, in order of preference — happy to go with whichever you prefer:

1. **`extraction_method: Option<ExtractionMethod>` enum** with variants `Native`, `Ocr`, `Mixed`. Handles the `force_ocr_pages` case cleanly (today that path returns `(mixed, true)` and collapses partial OCR into full OCR — `Mixed` would preserve that info). Extensible for future VLM / hybrid paths.
2. **`ocr_used: Option<bool>`**. Simplest, loses the `Mixed` distinction.
3. **Per-page `pages[i].extraction_method`**. Most precise, but depends on `pages[]` getting populated for PDFs first (currently `None`; [#723](https://github.com/kreuzberg-dev/kreuzberg/pull/723) only did images). Could be a follow-up after (1) or (2).
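
To show how the three shapes relate, here is a rough Python-side sketch. Nothing in it exists in kreuzberg today: the `ExtractionMethod` enum and the helpers `ocr_used` and `summarize_pages` are hypothetical names used only for illustration. The point is that shape 2 can be derived from shape 1, and shape 1 from shape 3, so starting with the enum does not close off the other options.

```python
from enum import Enum
from typing import Iterable, Optional


class ExtractionMethod(Enum):
    """Hypothetical Python mirror of the proposed Rust enum."""
    NATIVE = "native"
    OCR = "ocr"
    MIXED = "mixed"


def ocr_used(method: Optional[ExtractionMethod]) -> Optional[bool]:
    """Shape 2 derived from shape 1: MIXED collapses to True."""
    if method is None:
        return None
    return method is not ExtractionMethod.NATIVE


def summarize_pages(per_page: Iterable[ExtractionMethod]) -> Optional[ExtractionMethod]:
    """Shape 1 derived from shape 3: fold per-page values into one document-level value."""
    seen = set(per_page)
    if not seen:
        return None  # no per-page information available
    if seen == {ExtractionMethod.NATIVE}:
        return ExtractionMethod.NATIVE
    if seen == {ExtractionMethod.OCR}:
        return ExtractionMethod.OCR
    return ExtractionMethod.MIXED
```

A downstream consumer could then branch on one field, for example showing an "OCR" badge whenever `ocr_used(result_method)` is `True`, where `result_method` is whatever value the extractor would eventually expose.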

Happy to send a PR once you've picked a shape — just drop a comment.

## Environment

Observed on `kreuzberg==4.8.6`, `python:3.14.3-slim` (aarch64).
