Summary
ExtractionResult tells callers what text came out, but not how. Downstream code that wants to differentiate OCR'd text from native extraction (UI badges, telemetry, per-doc quality routing) has no upstream signal and has to either run two extraction passes and compare, or heuristically inspect content.
This is frustrating because the signal already exists inside the extractor; it is just not propagated out.
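To make the cost of the workaround concrete, here is a rough sketch of the "run two passes and compare" heuristic that downstream code ends up writing today. The function name looks_like_ocr and the min_native_chars threshold are my own illustrative choices, not anything kreuzberg provides. It assumes the caller has already run a native-only pass and a normal pass and holds both text strings.

```python
def looks_like_ocr(native_text: str, final_text: str, min_native_chars: int = 20) -> bool:
    """Guess whether final_text involved OCR by comparing it to a native-only pass.

    Hypothetical helper: the threshold is arbitrary and the result is only a guess.
    """
    native = native_text.strip()
    final = final_text.strip()
    if not final:
        # Nothing was extracted at all, so there is nothing to attribute to OCR.
        return False
    if len(native) < min_native_chars:
        # The native pass found (almost) no text layer, so the text must be OCR output.
        return True
    # Both passes produced text; any difference suggests OCR contributed something.
    return native != final


print(looks_like_ocr("", "Invoice 42 total due"))
print(looks_like_ocr("Hello world, this is native text.", "Hello world, this is native text."))
```

This doubles extraction time and still cannot tell full OCR from partial OCR, which is why an upstream marker would be preferable.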
The signal already exists
In crates/kreuzberg/src/extractors/pdf/mod.rs:553, the PDF extractor computes:
let (text, used_ocr) = if config.effective_disable_ocr() {
    (native_text, false)
} else if config.force_ocr {
    // ...
    (ocr_text, true)
} else if let Some(ref ocr_pages) = config.force_ocr_pages {
    // mixed native + per-page OCR
    (mixed, true)
} else if let Some(ocr_config) = config.ocr.as_ref() {
    // per-page threshold evaluation
    // ...
}
used_ocr is already consumed internally at line 696 (use_structured_doc = !used_ocr && pre_rendered_doc.is_some()) — it just never reaches ExtractionResult.
What I observe on 4.8.6
# Native (scanned PDF, no text layer extracted)
r1 = await extract_file("scan.pdf", config=ExtractionConfig(ocr=None))
# Forced OCR on the same file
r2 = await extract_file("scan.pdf", config=ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng"), force_ocr=True
))
r1.metadata.keys() == r2.metadata.keys() # True — identical
# No ocr_used, no extraction_method, no marker of any kind.
# r1.pages and r2.pages are both None for PDFs (only images get pages[] via #723).
Proposed API
Three shapes, in order of preference — happy to go with whichever you prefer:
1. extraction_method: Option<ExtractionMethod>, an enum with variants Native, Ocr, Mixed. Handles the force_ocr_pages case cleanly (today that path returns (mixed, true) and collapses partial OCR into full OCR; Mixed would preserve that info). Extensible for future VLM / hybrid paths.
2. ocr_used: Option<bool>. Simplest, but loses the Mixed distinction.
3. Per-page pages[i].extraction_method. Most precise, but depends on pages[] getting populated for PDFs first (currently None; #723 only did images). Could be a follow-up after (1) or (2).
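As a rough illustration of the enum shape from the caller's side, here is a minimal Python sketch. Everything in it is hypothetical: the ExtractionMethod class only mirrors the proposed Rust enum, and method_from_pages and badge_for are made-up helpers showing how a three-valued marker could be derived from page counts and consumed by a UI badge. None of it is existing kreuzberg API.

```python
from enum import Enum
from typing import Optional


class ExtractionMethod(Enum):
    """Hypothetical Python mirror of the proposed Rust enum."""
    NATIVE = "native"
    OCR = "ocr"
    MIXED = "mixed"


def method_from_pages(ocr_pages: int, total_pages: int) -> ExtractionMethod:
    """Derive the marker from how many pages went through OCR (assumed inputs)."""
    if ocr_pages <= 0:
        return ExtractionMethod.NATIVE
    if ocr_pages >= total_pages:
        return ExtractionMethod.OCR
    return ExtractionMethod.MIXED


def badge_for(method: Optional[ExtractionMethod]) -> str:
    """Map the marker to a UI label; None models extractors that do not report it."""
    if method is None:
        return "unknown"
    labels = {
        ExtractionMethod.NATIVE: "text layer",
        ExtractionMethod.OCR: "OCR",
        ExtractionMethod.MIXED: "partial OCR",
    }
    return labels[method]


print(badge_for(method_from_pages(2, 5)))
```

With ocr_used: Option<bool> instead, the MIXED branch would have to fold into OCR, which is the information loss described in (2).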
Happy to send a PR once you've picked a shape — just drop a comment.
Environment
Observed on kreuzberg==4.8.6, python:3.14.3-slim (aarch64).