PDF Parser
Extract text from PDF documents and convert it into structured Markdown. Optionally split content into chunks for AI processing, embeddings, knowledge bases, and downstream workflow automation.
What is PDF Parser?
PDF Parser is a document processing node that retrieves a PDF from a provided URL, extracts its textual content, and converts it into Markdown. For larger documents, chunking can be enabled to split the extracted content into smaller sections.
These chunks can then be used for vector databases, retrieval-augmented generation (RAG), semantic search, AI agents, and other document-processing workflows. The node also returns metadata such as word count, chunk count, and source URL.
- Extract text from PDF documents via public URL
- Convert PDF content into Markdown format
- Prepare documents for AI and LLM workflows
- Split large documents into manageable chunks
- Create searchable document repositories
- Analyze document size through word count metadata
- Process PDF files up to 150 MB
Prerequisites
- The PDF file is accessible through a public URL
- A PDF Parser connection has been configured
- The PDF file size does not exceed 150 MB
- Any upstream workflow step providing the PDF URL has been configured correctly
Understanding PDF Parsing
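PDF Parser performs three core operations in sequence: it retrieves the PDF from the provided URL, extracts the text, and converts that text to Markdown. When chunking is enabled, the Markdown is then split into chunks.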
Configuration
Select an existing PDF Parser connection. If no connection exists, create one before configuring the node.
The publicly accessible URL of the PDF file to parse. Supports static URLs, dynamic field mappings, webhook inputs, and upstream workflow outputs.
Defines the format of the extracted content. Markdown is currently the only supported format.
Determines whether extracted content should be split into chunks.
| Value | Behavior |
| False | Returns complete parsed content only |
| True | Splits content into chunks and returns chunk metadata |
Defines the maximum number of characters in each generated chunk. Smaller values generate more chunks; larger values generate fewer.
- 500–1000 characters for granular AI retrieval
- 1000–2000 characters for larger context windows
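To make the effect of the chunk size concrete, here is a minimal sketch of character-based chunking. It is not the node's actual algorithm, which is not documented here. The function name `chunk_text` and the preference for breaking at whitespace are assumptions made for illustration only.

```python
def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Illustrative assumption: break at whitespace where possible and fall
    back to a hard cut when a single word is longer than max_chars.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")

    chunks: list[str] = []
    start = 0
    n = len(text)
    while True:
        # Skip whitespace so that no chunk starts with a blank.
        while start < n and text[start].isspace():
            start += 1
        if start >= n:
            break

        end = min(start + max_chars, n)
        if end < n:
            # Walk back to the nearest whitespace to avoid splitting a word.
            for i in range(end - 1, start, -1):
                if text[i].isspace():
                    end = i
                    break

        chunks.append(text[start:end].rstrip())
        start = end
    return chunks


sample = "alpha beta gamma delta"
small = chunk_text(sample, 11)   # smaller limit, more chunks
large = chunk_text(sample, 100)  # larger limit, a single chunk
```

With the same input, a smaller limit yields more chunks and a larger limit yields fewer, which is the trade-off described above.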
Output Fields
| Field | Type | Description |
| parsed_content | String | Complete extracted PDF content in Markdown format |
| chunks | Array<String> | Generated content chunks when chunking is enabled; empty array when disabled |
| chunk_count | Integer | Total number of chunks generated from the document |
| word_count | Integer | Total number of words extracted from the PDF |
| source_url | String | The original PDF URL that was processed |
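For downstream code, the output fields above could be modeled roughly as follows. This is a sketch rather than an official schema: the class name `PdfParserOutput` and the helper `check_output` are hypothetical, and the check that `chunk_count` equals the length of `chunks` is an assumption about how the two fields relate.

```python
from typing import TypedDict


class PdfParserOutput(TypedDict):
    """Hypothetical typed view of the node output described in the table."""
    parsed_content: str
    chunks: list[str]
    chunk_count: int
    word_count: int
    source_url: str


def check_output(result: PdfParserOutput) -> list[str]:
    """Return a list of consistency problems; an empty list means none found."""
    problems: list[str] = []
    # Assumption: chunk_count mirrors the number of entries in chunks.
    if result["chunk_count"] != len(result["chunks"]):
        problems.append("chunk_count does not match the number of chunks")
    if not result["parsed_content"].strip():
        problems.append("parsed_content is empty")
    if result["word_count"] < 0:
        problems.append("word_count is negative")
    return problems


# Made-up sample output with chunking disabled (chunks is an empty array).
sample: PdfParserOutput = {
    "parsed_content": "# Report\n\nHello world",
    "chunks": [],
    "chunk_count": 0,
    "word_count": 3,
    "source_url": "https://example.com/report.pdf",
}
problems = check_output(sample)
```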
Step-by-step Guide
- Open your workflow on the canvas.
- Click the + icon.
- Search for and select PDF Parser.
Choose an existing PDF Parser connection. If required, create a new connection before proceeding.
Provide a PDF URL directly or map a field from an upstream node.
Select Markdown.
- Click Continue and save the workflow.
- Run a test using a valid PDF URL.
- Verify parsed_content, chunks, word_count, and source_url in the output.
Things to Know
| Scenario | Behavior |
| Chunking disabled | chunks returns an empty array |
| Chunking enabled | chunks contains generated text sections |
| Small PDF | May generate only a single chunk |
| Large PDF | Generates multiple chunks |
| Invalid PDF URL | Processing fails with an error |
| Empty PDF | parsed_content may be empty |
| PDF exceeds 150 MB | Processing is rejected |
Limits at a glance
| Restriction | Limit |
| Maximum PDF size | 150 MB |
| Output formats | Markdown |
| Supported URL type | Publicly accessible URLs only |
| Chunking | Optional |
| Minimum chunk size | 1 character |
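These limits can be checked in an upstream step before the node runs. The sketch below is one possible pre-flight check, not part of PDF Parser itself. The function name, its parameters, and the reading of 150 MB as 150 × 1024 × 1024 bytes are assumptions. It only validates the shape of the URL and cannot confirm that the URL is publicly reachable.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumption: "150 MB" is interpreted as 150 * 1024 * 1024 bytes.
MAX_PDF_BYTES = 150 * 1024 * 1024


def validate_request(
    pdf_url: str,
    size_bytes: Optional[int] = None,
    chunking: bool = False,
    chunk_size: Optional[int] = None,
) -> list[str]:
    """Return human-readable problems with a planned parse request."""
    errors: list[str] = []

    parsed = urlparse(pdf_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append("PDF URL must be an absolute http(s) URL")

    # The size is optional because it is not always known in advance.
    if size_bytes is not None and size_bytes > MAX_PDF_BYTES:
        errors.append("PDF exceeds the 150 MB limit")

    if chunking and (chunk_size is None or chunk_size < 1):
        errors.append("chunk size must be at least 1 character")

    return errors
```

An empty list means the request passes these basic checks. The node can still fail at processing time, for example when the URL does not point to a retrievable PDF.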
Examples
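As one illustration of downstream use, the sketch below turns a node output into records suitable for an embedding or vector-store step. The record layout (`id`, `text`, `source_url`, `chunk_index`) and the function name `to_records` are hypothetical and should be adapted to the target system. When chunking is disabled and `chunks` is empty, the sketch falls back to the full `parsed_content` as a single record.

```python
def to_records(result: dict) -> list[dict]:
    """Build one record per chunk, keeping the source URL as metadata."""
    # Fall back to the complete content when no chunks were generated.
    sections = result["chunks"] or [result["parsed_content"]]
    return [
        {
            "id": f"{result['source_url']}#chunk-{index}",
            "text": section,
            "source_url": result["source_url"],
            "chunk_index": index,
        }
        for index, section in enumerate(sections)
    ]


# Made-up node output with chunking enabled.
output = {
    "parsed_content": "# Title\n\nFirst part. Second part.",
    "chunks": ["# Title\n\nFirst part.", "Second part."],
    "chunk_count": 2,
    "word_count": 6,
    "source_url": "https://example.com/guide.pdf",
}
records = to_records(output)
```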
Frequently Asked Questions
Input & Configuration
What file types are supported?
PDF Parser currently supports PDF documents accessible through public URLs.
Can I use dynamic field mappings for the PDF URL?
Yes. The PDF URL field supports dynamic mappings from upstream nodes.
Can I process large PDF documents?
Yes. PDF Parser supports files up to 150 MB. Files exceeding this limit are rejected at processing time.
Chunking & Output
Is chunking required?
No. Chunking is optional. When disabled, the complete content is returned through parsed_content.
What chunk size should I use?
- 500–1000 characters for granular AI retrieval
- 1000–2000 characters for larger context windows
What format is the extracted content returned in?
Markdown. The complete extracted content is returned through parsed_content.
What does word_count represent?
word_count returns the total number of words extracted from the document.
Behavior & Limits
What happens if the PDF URL is invalid?
The action fails and returns an error indicating that the document could not be retrieved or processed.
Does PDF Parser modify the original PDF?
No. The original PDF remains unchanged. PDF Parser only reads and extracts content.