
Commit 5664e93

Naapperas, aninibread, ask-bonk[bot], elithrar, and mvvmm authored
Add conversion options for AI.toMarkdown (#28762)
* Add conversion options for AI.toMarkdown
* copy and changelog fixes
* Fix style, imports, types, links

  Co-authored-by: elithrar <elithrar@users.noreply.github.com>
* Address issues
* `Blob([...])` breaks Babel; use valid JS

  Co-authored-by: mvvmm <mvvmm@users.noreply.github.com>

---------

Co-authored-by: Anni Wang <anni@cloudflare.com>
Co-authored-by: ask-bonk[bot] <ask-bonk[bot]@users.noreply.github.com>
Co-authored-by: elithrar <elithrar@users.noreply.github.com>
Co-authored-by: mvvmm <mvvmm@users.noreply.github.com>
1 parent a6f1c6e commit 5664e93

9 files changed

Lines changed: 510 additions & 213 deletions

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
1+
---
2+
title: New conversion options for Markdown Conversion
3+
description: Control how images, HTML, and PDFs are processed when converting to Markdown
4+
date: 2026-03-04
5+
---
6+
7+
import { Badge, MetaInfo, Render, TypeScriptExample } from "~/components";
8+
9+
You can now customize how the [Markdown Conversion](/workers-ai/features/markdown-conversion/) service processes different file types by passing a `conversionOptions` object.
10+
11+
Available options:
12+
13+
- **Images**: Set the language for AI-generated image descriptions
14+
- **HTML**: Use CSS selectors to extract specific content, or provide a hostname to resolve relative links
15+
- **PDF**: Exclude metadata from the output
16+
17+
Use the [`env.AI`](/workers-ai/features/markdown-conversion/usage/binding/) binding:
18+
19+
<TypeScriptExample>
20+
21+
22+
```typescript
await env.AI.toMarkdown(
  { name: "page.html", blob: new Blob([html]) },
  {
    conversionOptions: {
      html: { cssSelector: "article.content" },
      image: { descriptionLanguage: "es" },
    },
  },
);
```
33+
34+
</TypeScriptExample>
35+
36+
Or call the REST API:
37+
38+
```bash
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/tomarkdown \
  -H 'Authorization: Bearer {API_TOKEN}' \
  -F 'files=@index.html' \
  -F 'conversionOptions={"html": {"cssSelector": "article.content"}}'
```
44+
45+
For more details, refer to [Conversion Options](/workers-ai/features/markdown-conversion/conversion-options/).

src/content/docs/workers-ai/features/markdown-conversion.mdx

Lines changed: 0 additions & 213 deletions
This file was deleted.
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
1+
---
2+
title: Conversion Options
3+
pcx_content_type: reference
4+
sidebar:
5+
order: 5
6+
---
7+
8+
By default, the `toMarkdown` service extracts text content from your files. To control how specific file types are converted, pass conversion options to the service.
9+
10+
Options are organized by file type and are all optional.
11+
12+
## Available options
13+
14+
### Images
15+
16+
```typescript
{
  image?: {
    descriptionLanguage?: 'en' | 'it' | 'de' | 'es' | 'fr' | 'pt';
  }
}
```
23+
24+
- `descriptionLanguage`: controls the language of the AI-generated image descriptions.
25+
26+
:::caution
27+
28+
This option works on a _best-effort_ basis: the resulting text is not guaranteed to be in the requested language.
29+
30+
:::
31+
32+
### HTML
33+
34+
```typescript
{
  html?: {
    hostname?: string;
    cssSelector?: string;
  }
}
```
42+
43+
- `hostname`: string to use as a host when resolving relative links inside the HTML.
44+
45+
- `cssSelector`: string containing a CSS selector pattern to pick specific elements from your HTML. Refer to [how HTML is processed](/workers-ai/features/markdown-conversion/how-it-works/#html) for more details.
46+
47+
### PDF
48+
49+
```typescript
{
  pdf?: {
    metadata?: boolean;
  }
}
```
56+
57+
- `metadata`: Previously, converted PDF files always included metadata information. This option allows you to opt out of this behavior.
58+
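The three fragments above can be read as one options shape. The following is a minimal sketch of that combined shape, plus a guard for language codes that come from untrusted input. The names `ConversionOptions`, `DescriptionLanguage`, `isDescriptionLanguage`, and `optionsForLanguage` are illustrative assumptions, not types or helpers exported by Workers AI.

```typescript
// Language codes listed for `descriptionLanguage` above.
type DescriptionLanguage = "en" | "it" | "de" | "es" | "fr" | "pt";

// The per-file-type fragments above merged into a single shape.
// Illustrative name; not an official exported type.
interface ConversionOptions {
  image?: { descriptionLanguage?: DescriptionLanguage };
  html?: { hostname?: string; cssSelector?: string };
  pdf?: { metadata?: boolean };
}

const SUPPORTED_LANGUAGES: readonly string[] = ["en", "it", "de", "es", "fr", "pt"];

// Hypothetical helper: narrows an arbitrary string (for example, a query
// parameter) to one of the supported language codes.
function isDescriptionLanguage(value: string): value is DescriptionLanguage {
  return SUPPORTED_LANGUAGES.includes(value);
}

// Hypothetical helper: builds options for a requested language and leaves
// the image section out entirely when the language is not supported.
function optionsForLanguage(language: string): ConversionOptions {
  return isDescriptionLanguage(language)
    ? { image: { descriptionLanguage: language } }
    : {};
}
```

Because every field is optional, an empty object is a valid value and leaves the default behavior unchanged.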
59+
## Examples
60+
61+
### Binding
62+
63+
To configure custom options, pass a `conversionOptions` object inside the second argument of the binding call, like this:
64+
65+
```typescript
await env.AI.toMarkdown(..., {
  conversionOptions: {
    html: { ... },
    pdf: { ... },
    ...
  }
})
```
74+
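As a more concrete sketch, the helper below fills in the placeholders for a single HTML page. The `MarkdownConverter` interface and the `convertArticle` function are hypothetical names introduced for illustration; the interface only mirrors the call shape shown in this documentation, and the return type is left as `unknown` because the response shape is not described here.

```typescript
// Minimal structural type for the part of the binding used in this sketch.
// This is an assumption; in a real Worker, prefer the types generated for
// your `env.AI` binding.
interface MarkdownConverter {
  toMarkdown(
    file: { name: string; blob: Blob },
    options?: {
      conversionOptions?: {
        html?: { hostname?: string; cssSelector?: string };
        pdf?: { metadata?: boolean };
      };
    },
  ): Promise<unknown>;
}

// Hypothetical helper: converts one HTML page, keeping only elements that
// match `article.content` and passing `hostname` for relative links.
async function convertArticle(
  ai: MarkdownConverter,
  html: string,
  hostname: string,
): Promise<unknown> {
  return ai.toMarkdown(
    { name: "page.html", blob: new Blob([html]) },
    {
      conversionOptions: {
        html: { cssSelector: "article.content", hostname },
      },
    },
  );
}
```

Inside a Worker, this would be called as `await convertArticle(env.AI, html, "example.com")`, where `example.com` is a placeholder hostname.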
75+
### REST API
76+
77+
Because the REST API uses file uploads, the request's `Content-Type` is `multipart/form-data`. Pass the options as an additional `conversionOptions` form field whose value is the JSON-stringified options object:
78+
79+
```bash
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/tomarkdown \
  -X POST \
  -H 'Authorization: Bearer {API_TOKEN}' \
  ...
  -F 'conversionOptions={ "html": { ... }, ... }'
```
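The same request can be sketched in TypeScript with the standard `FormData` and `fetch` APIs. The form field names `files` and `conversionOptions` and the endpoint URL come from the curl examples; the helper names `buildToMarkdownForm` and `convertViaRest` are hypothetical, and the response is returned unparsed because its shape is not described here.

```typescript
// Hypothetical helper: builds the multipart body for the REST endpoint.
function buildToMarkdownForm(
  file: Blob,
  filename: string,
  conversionOptions: Record<string, unknown>,
): FormData {
  const form = new FormData();
  form.append("files", file, filename);
  // The options object travels as a JSON string in its own form field.
  form.append("conversionOptions", JSON.stringify(conversionOptions));
  return form;
}

// Hypothetical helper: sends the form. `Content-Type` is not set manually,
// because `fetch` adds the multipart boundary when the body is a `FormData`.
async function convertViaRest(
  accountId: string,
  apiToken: string,
  form: FormData,
): Promise<Response> {
  const url = `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/tomarkdown`;
  return fetch(url, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}` },
    body: form,
  });
}
```

Building the form separately from sending it makes the serialized options easy to inspect before any network call is made.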
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
1+
---
2+
title: How it works
3+
pcx_content_type: concept
4+
sidebar:
5+
order: 4
6+
---
7+
8+
## Pre-processing
9+
10+
Before converting a file to Markdown, we run some cleanup tasks that depend on the type of file you are converting.
11+
12+
### HTML
13+
14+
When we detect an HTML file, we process its content in the following steps before converting it:
15+
16+
- Some elements are ignored, including `script` and `style` tags.
17+
- Meta tags are extracted. These include `title`, `description`, `og:title`, `og:description` and `og:image`.
18+
- [JSON-LD](https://json-ld.org/) content is extracted, if it exists. It is appended to the end of the converted Markdown.
19+
- The base URL for resolving relative links is extracted from the `<base>` element<sup>1</sup>, if one exists. In line with the spec, only the first instance of the base URL is used.
20+
- If the `cssSelector` option is:
21+
  - present, then only those elements that match the selector are kept for further processing;
22+
  - missing, then elements such as `<header>`, `<footer>` and `<head>` are removed from the text.
23+
- If a base URL was obtained previously, relative links in the remaining HTML are resolved to fully qualified URLs.
24+
25+
<sup>1</sup> The host can also be set per request, using the HTML conversion options. Refer to [Conversion Options](/workers-ai/features/markdown-conversion/conversion-options/#html) for more details.
29+
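The link-resolution step can be approximated with the standard WHATWG `URL` constructor, which resolves a link against a base URL. This is only a sketch of the idea, not the service's implementation: `resolveLinks` is a hypothetical helper, and how the `hostname` option is turned into a base URL (for example, which scheme is assumed) is not specified here.

```typescript
// Resolves each link against a base URL using the standard URL API.
// Absolute links pass through unchanged; links that cannot be parsed are
// kept as they are.
function resolveLinks(hrefs: string[], baseUrl: string): string[] {
  return hrefs.map((href) => {
    try {
      return new URL(href, baseUrl).href;
    } catch {
      return href;
    }
  });
}

// Example with a placeholder base URL.
const resolved = resolveLinks(
  ["/pricing", "guide.html", "../about", "https://other.example/x"],
  "https://example.com/docs/",
);
// resolved is:
// [
//   "https://example.com/pricing",
//   "https://example.com/docs/guide.html",
//   "https://example.com/about",
//   "https://other.example/x",
// ]
```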
30+
### Images
31+
32+
Images take a bit more work to prepare for conversion.
33+
34+
As a first step, we detect the image type. If it is an SVG (Scalable Vector Graphics) file, we convert it into a raster format so that the Workers AI models we use do not fail on it. In this case, SVGs are converted into PNGs internally.
35+
36+
Afterwards:
37+
38+
- We try to determine the image's dimensions. If successful, we check whether the image is "too big". An image is "too big" if its width is greater than 1280px or its height is greater than 720px.
39+
- If the image is too big, we try to resize it to fit within those dimensions. If resizing fails, we try to use the original image data instead.
40+
- The image is sent to an **object-detection model**. Specifically, we use the [`@cf/facebook/detr-resnet-50`](/workers-ai/models/detr-resnet-50/) model from Workers AI.
41+
- If any objects were detected in the previous step, they are appended to a prompt that is used to instruct an **image-to-text model** on how to describe the image.
42+
- If a preferred conversion language is specified in the request's conversion options, the previous prompt is enriched with a directive for the model to output the content in the desired language. Refer to [Conversion Options](/workers-ai/features/markdown-conversion/conversion-options/#images) for more details.
43+
- The final prompt is sent, along with the image data, to the [`@cf/google/gemma-3-12b-it`](/workers-ai/models/gemma-3-12b-it/) model, also from Workers AI.
44+
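The size check above can be sketched as follows. The 1280px and 720px limits come from the list; the resizing strategy does not. This sketch assumes the aspect ratio is preserved and the image is scaled down to fit inside both limits, which may differ from what the service actually does. The names `Size`, `isTooBig`, and `fitWithinLimits` are hypothetical.

```typescript
const MAX_WIDTH = 1280;
const MAX_HEIGHT = 720;

interface Size {
  width: number;
  height: number;
}

// An image is "too big" if either dimension exceeds its limit.
function isTooBig({ width, height }: Size): boolean {
  return width > MAX_WIDTH || height > MAX_HEIGHT;
}

// Assumption: scale down uniformly so both dimensions fit the limits.
// Images that are already small enough are returned unchanged.
function fitWithinLimits(size: Size): Size {
  if (!isTooBig(size)) return size;
  const scale = Math.min(MAX_WIDTH / size.width, MAX_HEIGHT / size.height);
  return {
    width: Math.max(1, Math.round(size.width * scale)),
    height: Math.max(1, Math.round(size.height * scale)),
  };
}
```

Under this assumption, a 2560x1440 image becomes 1280x720, and a 1000x1440 image becomes 500x720 because its height is the limiting dimension.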
45+
### PDFs
46+
47+
- Metadata is extracted. This can be removed from the final result. Refer to [Conversion Options](/workers-ai/features/markdown-conversion/conversion-options/#pdf) for more details.
48+
- Each page is parsed in sequence.
49+
- We try to obtain a `StructTree` object from the PDF file. This data structure is a tree of tagged elements that make up the PDF contents, as specified by [ISO 14289 (PDF/UA)](https://www.iso.org/standard/64599.html).
50+
- If none is obtained, we extract the text of the page _as-is_ and return it.
51+
- If we manage to obtain a `StructTree`, we traverse its nodes to build a semantic Markdown representation of its contents.
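To illustrate the idea of traversing a tree of tagged elements, here is a minimal sketch. The `StructNode` type is a simplified, hypothetical stand-in for a structure tree and is not the data structure the service uses. The role names (`H1` to `H6`, `P`, `L`, `LI`) are standard PDF structure types, but the mapping to Markdown shown here is an assumption.

```typescript
// Simplified, hypothetical structure-tree node: either a tagged element
// with children or a leaf holding text.
type StructNode =
  | { role: string; children: StructNode[] }
  | { text: string };

// Concatenates all text found under a node.
function collectText(node: StructNode): string {
  if ("text" in node) return node.text;
  return node.children.map(collectText).join("");
}

// Renders a node as Markdown based on its role.
function renderNode(node: StructNode): string {
  if ("text" in node) return node.text;
  const heading = /^H([1-6])$/.exec(node.role);
  if (heading) return `${"#".repeat(Number(heading[1]))} ${collectText(node)}\n\n`;
  if (node.role === "P") return `${collectText(node)}\n\n`;
  if (node.role === "LI") return `- ${collectText(node)}\n`;
  if (node.role === "L") return `${node.children.map(renderNode).join("")}\n`;
  // Other roles are treated as containers: render children in order.
  return node.children.map(renderNode).join("");
}

const sampleTree: StructNode = {
  role: "Document",
  children: [
    { role: "H1", children: [{ text: "Title" }] },
    { role: "P", children: [{ text: "Hello " }, { text: "world." }] },
    {
      role: "L",
      children: [
        { role: "LI", children: [{ text: "one" }] },
        { role: "LI", children: [{ text: "two" }] },
      ],
    },
  ],
};

const sampleMarkdown = renderNode(sampleTree);
// sampleMarkdown === "# Title\n\nHello world.\n\n- one\n- two\n\n"
```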
