bug: Unreliable image link extraction. #762

@whplh

I have a simple scenario:

  • I create a document (tested with Google Docs and LibreOffice)
  • I add some text and insert an image
  • I export/download the document as a PDF

Expected behaviour:
When I extract images and content from the PDF with kreuzberg (to Markdown), the Markdown file should contain a link to the image, e.g. ![](image_3.jpeg).

Actual behaviour:
No image link appears in the Markdown output.
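
To make the expected vs. actual difference easy to check, here is a minimal sketch that scans Markdown text for image links. It does not call kreuzberg at all: the helper name `find_image_links` and the two sample strings are assumptions for illustration, standing in for whatever Markdown the extraction returns.

```python
import re

# Matches Markdown image syntax such as ![](image_3.jpeg)
# or ![alt text](path.png "optional title").
IMAGE_LINK = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)(?:\s+"[^"]*")?\s*\)')


def find_image_links(markdown: str) -> list[str]:
    """Return the target of every Markdown image link found in the text."""
    return IMAGE_LINK.findall(markdown)


# Hypothetical samples: what I expect to get vs. what I actually get.
expected_output = "Some text\n\n![](image_3.jpeg)\n"
actual_output = "Some text\n"

print(find_image_links(expected_output))
print(find_image_links(actual_output))
```

An empty list for the extracted Markdown, while images were extracted from the same PDF, is the symptom described here.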

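The scenario is easy to reproduce, e.g. with this config:

from kreuzberg import ExtractionConfig, ImageExtractionConfig

# Request Markdown output and enable image extraction
config = ExtractionConfig(
    output_format="markdown",
    images=ImageExtractionConfig(
        extract_images=True,
    ),
)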

I tried various text editors, images, text variations and kreuzberg configs, but had no luck.
So I think it's a bug. (My kreuzberg version: v4.9.2, in Python.)

Labels: bug
Status: In Review