feat: expose OCR/native signal on ExtractionResult (the value already exists internally, just isn't surfaced) #751

@dergachoff

Summary

ExtractionResult tells callers what text came out, but not how. Downstream code that wants to differentiate OCR'd text from native extraction (UI badges, telemetry, per-doc quality routing) has no upstream signal and has to either run two extraction passes and compare, or heuristically inspect content.
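For illustration, the "run two passes and compare" workaround looks roughly like the sketch below. `guess_used_ocr` and the `min_chars` threshold are hypothetical names made up for this example, not part of kreuzberg. The helper only compares the text from a native pass against the text from a forced-OCR pass.

```python
def guess_used_ocr(native_text: str, ocr_text: str, min_chars: int = 20) -> bool:
    """Guess whether a document needed OCR by comparing two extraction passes.

    Hypothetical downstream heuristic: if the native pass produced almost no
    text while the OCR pass produced a meaningful amount, assume the document
    is a scan. The threshold is arbitrary and will misfire on short documents.
    """
    native_len = len(native_text.strip())
    ocr_len = len(ocr_text.strip())
    return native_len < min_chars and ocr_len >= min_chars
```

This costs a second extraction pass per document and is still only a guess, which is why an upstream signal would be preferable.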

What makes this frustrating is that the signal already exists inside the extractor; it is just not propagated out.

The signal already exists

In crates/kreuzberg/src/extractors/pdf/mod.rs:553 the PDF extractor computes:

let (text, used_ocr) = if config.effective_disable_ocr() {
    (native_text, false)
} else if config.force_ocr {
    // ...
    (ocr_text, true)
} else if let Some(ref ocr_pages) = config.force_ocr_pages {
    // mixed native + per-page OCR
    (mixed, true)
} else if let Some(ocr_config) = config.ocr.as_ref() {
    // per-page threshold evaluation
    // ...
}

used_ocr is already consumed internally at line 696 (use_structured_doc = !used_ocr && pre_rendered_doc.is_some()) — it just never reaches ExtractionResult.

What I observe on 4.8.6

from kreuzberg import ExtractionConfig, OcrConfig, extract_file

# Run inside an async function (or an async REPL).
# Native (scanned PDF, no text layer extracted)
r1 = await extract_file("scan.pdf", config=ExtractionConfig(ocr=None))
# Forced OCR on the same file
r2 = await extract_file("scan.pdf", config=ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng"), force_ocr=True
))

r1.metadata.keys() == r2.metadata.keys()  # True: identical
# No ocr_used, no extraction_method, no marker of any kind.
# r1.pages and r2.pages are both None for PDFs (only images get pages[] via #723).

Proposed API

Three possible shapes, in my order of preference:

  1. extraction_method: Option<ExtractionMethod> enum with variants Native, Ocr, Mixed. Handles the force_ocr_pages case cleanly (today that path returns (mixed, true) and collapses partial OCR into full OCR — Mixed would preserve that info). Extensible for future VLM / hybrid paths.
  2. ocr_used: Option<bool>. Simplest, loses the Mixed distinction.
  3. Per-page pages[i].extraction_method. Most precise, but depends on pages[] getting populated for PDFs first (currently None; #723 only did images). Could be a follow-up after (1) or (2).
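To make the difference between the shapes concrete, here is a minimal Python sketch of how they relate. Everything in it is an assumption for illustration: `ExtractionMethod` mirrors the proposed enum, while `summarize_pages` and `to_ocr_used` are hypothetical helpers, not existing kreuzberg API. It takes per-page OCR flags (shape 3), derives the document-level enum (shape 1), and then shows how the boolean (shape 2) loses the Mixed case.

```python
from enum import Enum
from typing import Optional, Sequence


class ExtractionMethod(Enum):
    """Hypothetical Python mirror of the proposed ExtractionMethod enum."""
    NATIVE = "native"
    OCR = "ocr"
    MIXED = "mixed"


def summarize_pages(page_used_ocr: Sequence[bool]) -> Optional[ExtractionMethod]:
    """Collapse per-page OCR flags (shape 3) into a document-level value (shape 1)."""
    if not page_used_ocr:
        return None  # no pages, nothing to report
    if all(page_used_ocr):
        return ExtractionMethod.OCR
    if any(page_used_ocr):
        return ExtractionMethod.MIXED
    return ExtractionMethod.NATIVE


def to_ocr_used(method: Optional[ExtractionMethod]) -> Optional[bool]:
    """Collapse the enum into the boolean shape (2); MIXED and OCR both become True."""
    if method is None:
        return None
    return method is not ExtractionMethod.NATIVE
```

With flags `[False, True, False]` the enum reports `MIXED`, while the boolean reports `True`, the same as a fully OCR'd document. That is the information shape (2) would drop.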

Happy to send a PR once you've picked a shape — just drop a comment.

Environment

Observed on kreuzberg==4.8.6, python:3.14.3-slim (aarch64).

Metadata

Labels: enhancement (New feature or request)

Status: In Progress