PDF Parser

Extract text from PDF documents and convert it into structured Markdown. Optionally split content into chunks for AI processing, embeddings, knowledge bases, and downstream workflow automation.

Workflow Node · Document Processing · 150 MB Max

What is PDF Parser?

PDF Parser is a document processing node that retrieves a PDF from a provided URL, extracts its textual content, and converts it into Markdown. For larger documents, chunking can be enabled to split the extracted content into smaller sections.

These chunks can then be used for vector databases, retrieval-augmented generation (RAG), semantic search, AI agents, and other document-processing workflows. The node also returns metadata such as word count, chunk count, and source URL.

What PDF Parser lets you do

Extract text from PDF documents via a public URL
Convert PDF content into Markdown format
Prepare documents for AI and LLM workflows
Split large documents into manageable chunks
Create searchable document repositories
Analyze document size through word count metadata
Process PDF files up to 150 MB

Prerequisites

Before configuring PDF Parser, ensure that:

The PDF file is accessible through a public URL
A PDF Parser connection has been configured
The PDF file size does not exceed 150 MB
Any upstream workflow step providing the PDF URL has been configured correctly
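
The URL and size prerequisites above can be checked before the node runs. The following is a minimal sketch, not part of PDF Parser itself: the function name `preflight_problems` is hypothetical, and reading "150 MB" as 150 × 1024 × 1024 bytes is an assumption (the node may use a decimal definition). It only inspects the shape of the URL and an optional, already-known file size; it cannot confirm that the URL is publicly reachable.

```python
from typing import List, Optional
from urllib.parse import urlparse

# Assumption: "150 MB" is interpreted as 150 * 1024 * 1024 bytes.
MAX_PDF_BYTES = 150 * 1024 * 1024


def preflight_problems(pdf_url: str, size_bytes: Optional[int] = None) -> List[str]:
    """Return a list of problems found; an empty list means the inputs look usable."""
    problems = []
    parsed = urlparse(pdf_url)
    if parsed.scheme not in ("http", "https"):
        problems.append("URL must use http or https")
    if not parsed.netloc:
        problems.append("URL has no host")
    # The size check only runs when the caller already knows the file size.
    if size_bytes is not None and size_bytes > MAX_PDF_BYTES:
        problems.append("file exceeds the 150 MB limit")
    return problems


print(preflight_problems("https://example.com/user-guide.pdf"))
print(preflight_problems("ftp://example.com/a.pdf"))
```

A check like this could sit in an upstream step so that malformed URLs are caught before the parser is invoked.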

Understanding PDF Parsing

PDF Parser performs three core operations in sequence:

1 Retrieve → 2 Extract → 3 Transform

Retrieve — Downloads the PDF from the specified URL.
Extract — Extracts textual content from the PDF file.
Transform — Converts extracted content into Markdown and optionally generates chunks for AI processing.
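
The ordering of the three stages can be pictured as a simple function composition. This is an illustrative sketch only: `run_pipeline` and the three stand-in stage functions are hypothetical, no real download or PDF parsing takes place, and counting words by splitting on whitespace is an assumption about how `word_count` might be derived.

```python
from typing import Callable


def run_pipeline(
    url: str,
    retrieve: Callable[[str], bytes],
    extract: Callable[[bytes], str],
    transform: Callable[[str], dict],
) -> dict:
    """Run the three stages in order; an exception in any stage stops the run."""
    pdf_bytes = retrieve(url)       # 1. Retrieve
    raw_text = extract(pdf_bytes)   # 2. Extract
    result = transform(raw_text)    # 3. Transform
    result["source_url"] = url
    return result


# Stand-in stages for illustration; they do not touch the network or a real PDF.
def fake_retrieve(url: str) -> bytes:
    return b"Hello PDF"


def fake_extract(data: bytes) -> str:
    return data.decode("utf-8")


def fake_transform(text: str) -> dict:
    # Assumption: words are counted by splitting on whitespace.
    return {"parsed_content": text, "word_count": len(text.split())}


result = run_pipeline(
    "https://example.com/user-guide.pdf", fake_retrieve, fake_extract, fake_transform
)
print(result)
```

Because each stage feeds the next, a failure while retrieving the file means extraction and transformation never run.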

Configuration

Connection REQUIRED

Select an existing PDF Parser connection. If no connection exists, create one before configuring the node.

PDF URL REQUIRED

The publicly accessible URL of the PDF file to parse. Supports static URLs, dynamic field mappings, webhook inputs, and upstream workflow outputs.

https://example.com/user-guide.pdf

Output Type MARKDOWN

Defines the format of the extracted content. Markdown is currently the only supported format.

Enable Chunking OPTIONAL

Determines whether extracted content should be split into chunks.

Value	Behavior
False	Returns complete parsed content only
True	Splits content into chunks and returns chunk metadata

Chunk Size WHEN CHUNKING ENABLED

Defines the maximum number of characters in each generated chunk. Smaller values generate more chunks; larger values generate fewer.

Recommended values

500–1000 characters for granular AI retrieval
1000–2000 characters for larger context windows
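
To see how Chunk Size affects the number of chunks, consider a naive fixed-size splitter. This is a sketch under assumptions, not the node's documented algorithm: the source only states that Chunk Size is the maximum number of characters per chunk, so the real splitter may break at paragraph or sentence boundaries instead. The function name `split_into_chunks` is hypothetical.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1 character")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "a" * 2500  # stand-in for 2,500 characters of parsed content

for size in (500, 1000, 2000):
    chunks = split_into_chunks(sample, size)
    print(size, len(chunks), [len(c) for c in chunks])
# 500 5 [500, 500, 500, 500, 500]
# 1000 3 [1000, 1000, 500]
# 2000 2 [2000, 500]
```

With this naive approach the chunk count is the content length divided by the chunk size, rounded up, which matches the rule that smaller values generate more chunks.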

Output Fields

Field	Type	Description
parsed_content	String	Complete extracted PDF content in Markdown format
chunks	Array<String>	Generated content chunks when chunking is enabled; empty array when disabled
chunk_count	Integer	Total number of chunks generated from the document
word_count	Integer	Total number of words extracted from the PDF
source_url	String	The original PDF URL that was processed
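
A downstream step can verify that an output carries the fields and types listed in the table above. The sketch below is hypothetical: the helper `output_problems` and the sample values are made up for illustration, and the way the platform actually serializes the output is not specified in this document.

```python
from typing import List

# Field names and types taken from the Output Fields table.
EXPECTED_FIELDS = {
    "parsed_content": str,
    "chunks": list,
    "chunk_count": int,
    "word_count": int,
    "source_url": str,
}


def output_problems(output: dict) -> List[str]:
    """Return a description of every missing or wrongly typed field."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in output:
            problems.append(f"missing field: {field}")
        elif not isinstance(output[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


# Illustrative values only; not taken from a real run.
sample_output = {
    "parsed_content": "# User Guide\n\nWelcome.",
    "chunks": ["# User Guide", "Welcome."],
    "chunk_count": 2,
    "word_count": 4,
    "source_url": "https://example.com/user-guide.pdf",
}

print(output_problems(sample_output))
```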

Step-by-step Guide

Add the PDF Parser node

Open your workflow on the canvas.
Click the + icon.
Search for and select PDF Parser.

Select a connection

Choose an existing PDF Parser connection. If required, create a new connection before proceeding.

Configure the PDF URL

Provide a PDF URL directly or map a field from an upstream node.

{{webhook.pdf_url}}

Select Output Type

Select Markdown.

Configure chunking OPTIONAL

Return complete content only

Enable Chunking = False

Generate chunks

Enable Chunking = True

Chunk Size = 1000

Save and test

Click Continue and save the workflow.
Run a test using a valid PDF URL.
Verify parsed_content, chunks, word_count, and source_url in the output.

Things to Know

Scenario	Behavior
Chunking disabled	`chunks` returns an empty array
Chunking enabled	`chunks` contains generated text sections
Small PDF	May generate only a single chunk
Large PDF	Generates multiple chunks
Invalid PDF URL	Processing fails with an error
Empty PDF	`parsed_content` may be empty
PDF exceeds 150 MB	Processing is rejected
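
Since `chunks` is empty when chunking is disabled and `parsed_content` may be empty for an empty PDF, downstream logic usually needs a fallback. The following is one possible way to handle those cases; the function name `texts_for_downstream` is hypothetical and the output is assumed to arrive as a dictionary keyed by the documented field names.

```python
from typing import List


def texts_for_downstream(output: dict) -> List[str]:
    """Prefer chunks; fall back to the full content; return nothing for an empty PDF."""
    chunks = output.get("chunks") or []
    if chunks:
        return list(chunks)
    content = output.get("parsed_content", "")
    # Whitespace-only content is treated as empty.
    return [content] if content.strip() else []


print(texts_for_downstream({"parsed_content": "Full text", "chunks": ["Full", "text"]}))
print(texts_for_downstream({"parsed_content": "Full text", "chunks": []}))
print(texts_for_downstream({"parsed_content": "", "chunks": []}))
```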

Limits at a glance

Restriction	Limit
Maximum PDF size	150 MB
Output formats	Markdown
Supported URL type	Publicly accessible URLs only
Chunking	Optional
Minimum chunk size	1 character

Examples

Extract a product manual

Document processing · Chunking enabled · Searchable content

EXTRACTION

Convert a product manual into searchable Markdown content with individual sections available as chunks.

Configuration

PDF URL = https://example.com/manual.pdf

Output Type = Markdown

Enable Chunking = True

Chunk Size = 1000

✓ Complete document in parsed_content

✓ Individual sections in chunks

✓ Metadata via chunk_count and word_count

Build an AI knowledge base

RAG · Vector database · Embeddings

AI / RAG

Prepare documentation for AI-powered search and retrieval by chunking PDFs into embeddings-ready sections.

Trigger → PDF Parser → Embeddings → Vector DB
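
The hand-off from PDF Parser to the Embeddings and Vector DB steps can be sketched as turning each chunk into a record. Everything here is an assumption made for illustration: `build_records`, the record layout, the ID scheme, and `toy_embed` are hypothetical, and a real workflow would call an actual embedding model and follow the schema of the chosen vector database.

```python
from typing import Callable, List


def build_records(output: dict, embed: Callable[[str], List[float]]) -> List[dict]:
    """Pair every chunk with a vector and enough metadata to trace it back to its PDF."""
    records = []
    for index, chunk in enumerate(output["chunks"]):
        records.append({
            "id": f"{output['source_url']}#chunk-{index}",
            "text": chunk,
            "vector": embed(chunk),
            "metadata": {"source_url": output["source_url"], "chunk_index": index},
        })
    return records


def toy_embed(text: str) -> List[float]:
    # Placeholder embedding: a real model would return a high-dimensional vector.
    return [float(len(text))]


parser_output = {
    "chunks": ["Install the device.", "Press the power button."],
    "source_url": "https://example.com/manual.pdf",
}
records = build_records(parser_output, toy_embed)
print(records[0]["id"], records[0]["vector"])
```

Keeping `source_url` and the chunk index in the metadata lets a retrieval result be traced back to the document it came from.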

Process legal documents

Contract extraction · Review workflow · Storage

LEGAL

Extract contract text and pass the content to downstream review and approval workflows.

Trigger → PDF Parser → Review Process → Storage

Parse invoices

Finance · Data extraction · Accounting platform

FINANCE

Extract invoice content from PDFs before sending data into accounting systems.

Trigger → PDF Parser → Data Extraction → Accounting Platform
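
As a rough illustration of the Data Extraction step, fields can be pulled from `parsed_content` with regular expressions. This sketch assumes a hypothetical invoice layout with "Invoice Number:" and "Total:" labels; real invoices vary widely, and the function name `extract_invoice_fields` and the sample text are invented for the example.

```python
import re


def extract_invoice_fields(markdown: str) -> dict:
    """Pull an invoice number and total from Markdown text, or None when not found."""
    number = re.search(
        r"Invoice\s+(?:Number|No\.?)\s*[:#]?\s*([A-Za-z0-9-]+)", markdown, re.IGNORECASE
    )
    # \b keeps "Subtotal" from being mistaken for "Total".
    total = re.search(
        r"\bTotal\s*:?\s*\$?\s*([0-9][0-9,]*(?:\.[0-9]+)?)", markdown, re.IGNORECASE
    )
    return {
        "invoice_number": number.group(1) if number else None,
        "total": float(total.group(1).replace(",", "")) if total else None,
    }


# Hypothetical parsed_content for an invoice PDF.
sample_invoice = "# Invoice\n\nInvoice Number: INV-1042\n\nSubtotal: $1,000.00\n\nTotal: $1,250.00\n"
print(extract_invoice_fields(sample_invoice))
```

Pattern matching like this is brittle, so a production workflow would likely validate the extracted values before sending them to an accounting system.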

Frequently Asked Questions

Input & Configuration

What file types are supported?

PDF Parser currently supports PDF documents accessible through public URLs.

Can I use dynamic field mappings for the PDF URL?

Yes. The PDF URL field supports dynamic mappings from upstream nodes.

{{webhook.pdf_url}}

Can I process large PDF documents?

Yes. PDF Parser supports files up to 150 MB. Files exceeding this limit are rejected at processing time.

Chunking & Output

Is chunking required?

No. Chunking is optional. When disabled, the complete content is returned through parsed_content.

What chunk size should I use?

500–1000 characters for granular AI retrieval
1000–2000 characters for larger context windows

What format is the extracted content returned in?

Markdown. The complete extracted content is returned through parsed_content.

What does word_count represent?

word_count returns the total number of words extracted from the document.

Behavior & Limits

What happens if the PDF URL is invalid?

The action fails and returns an error indicating that the document could not be retrieved or processed.

Does PDF Parser modify the original PDF?

No. The original PDF remains unchanged. PDF Parser only reads and extracts content.

Ready to process documents?

Add the PDF Parser node to your workflow and convert PDF documents into structured, AI-ready content for automation, search, and knowledge management.
