Konnectify

PDF Parser

Extract text from PDF documents and convert it into structured Markdown. Optionally split content into chunks for AI processing, embeddings, knowledge bases, and downstream workflow automation.

Workflow Node · Document Processing · 150 MB Max

What is PDF Parser?

PDF Parser is a document processing node that retrieves a PDF from a provided URL, extracts its textual content, and converts it into Markdown. For larger documents, chunking can be enabled to split the extracted content into smaller sections.

These chunks can then be used for vector databases, retrieval-augmented generation (RAG), semantic search, AI agents, and other document-processing workflows. The node also returns metadata such as word count, chunk count, and source URL.

What PDF Parser lets you do
  • Extract text from PDF documents via a public URL
  • Convert PDF content into Markdown format
  • Prepare documents for AI and LLM workflows
  • Split large documents into manageable chunks
  • Create searchable document repositories
  • Analyze document size through word count metadata
  • Process PDF files up to 150 MB

Prerequisites

Before configuring PDF Parser, ensure that:
  • The PDF file is accessible through a public URL
  • A PDF Parser connection has been configured
  • The PDF file size does not exceed 150 MB
  • Any upstream workflow step providing the PDF URL has been configured correctly
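The URL and size prerequisites can be checked before the node runs. The following is a minimal sketch, not part of the product: `precheck_pdf` and `MAX_PDF_BYTES` are hypothetical names, the file size is assumed to be known already (for example from an upstream step), and 150 MB is assumed to mean 150 × 1024 × 1024 bytes. It cannot confirm that the URL is publicly reachable.

```python
from urllib.parse import urlparse

# Assumption: 1 MB = 1024 * 1024 bytes. The article only states "150 MB".
MAX_PDF_BYTES = 150 * 1024 * 1024


def precheck_pdf(url: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    parsed = urlparse(url)
    # A public URL must at least be an absolute http(s) URL with a host.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("URL must be an absolute http(s) URL")
    if size_bytes > MAX_PDF_BYTES:
        problems.append("PDF exceeds the 150 MB limit")
    return problems


print(precheck_pdf("https://example.com/user-guide.pdf", 2_000_000))
print(precheck_pdf("user-guide.pdf", 200 * 1024 * 1024))
```

The first call reports no problems; the second reports both a malformed URL and an oversized file.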

Understanding PDF Parsing

PDF Parser performs three core operations in sequence:

  1. Retrieve: Downloads the PDF from the specified URL.
  2. Extract: Extracts textual content from the PDF file.
  3. Transform: Converts extracted content into Markdown and optionally generates chunks for AI processing.
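The sequence can be pictured as three functions called one after another, where a failure in an earlier stage stops the later ones. This is only an illustrative sketch: `parse_pdf` and the `fake_*` stages are hypothetical stand-ins and do not reflect how the node is implemented internally.

```python
from typing import Callable


def parse_pdf(
    url: str,
    retrieve: Callable[[str], bytes],
    extract: Callable[[bytes], str],
    to_markdown: Callable[[str], str],
) -> dict:
    """Run the three stages in order; an exception in any stage aborts the rest."""
    pdf_bytes = retrieve(url)          # 1. Retrieve
    raw_text = extract(pdf_bytes)      # 2. Extract
    markdown = to_markdown(raw_text)   # 3. Transform
    return {"parsed_content": markdown, "source_url": url}


# Stand-in stages so the sketch runs without network access or a PDF library.
def fake_retrieve(url: str) -> bytes:
    return b"User Guide\nInstall the app."


def fake_extract(data: bytes) -> str:
    return data.decode("utf-8")


def fake_to_markdown(text: str) -> str:
    title, _, body = text.partition("\n")
    return f"# {title}\n\n{body}"


result = parse_pdf(
    "https://example.com/user-guide.pdf",
    fake_retrieve,
    fake_extract,
    fake_to_markdown,
)
print(result["parsed_content"])
```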

Configuration

Connection REQUIRED

Select an existing PDF Parser connection. If no connection exists, create one before configuring the node.

PDF URL REQUIRED

The publicly accessible URL of the PDF file to parse. Supports static URLs, dynamic field mappings, webhook inputs, and upstream workflow outputs.

https://example.com/user-guide.pdf

Output Type MARKDOWN

Defines the format of the extracted content. Markdown is currently the only supported value, so extracted PDF content is always returned as Markdown.

Enable Chunking OPTIONAL

Determines whether extracted content should be split into chunks.

| Value | Behavior |
| --- | --- |
| False | Returns complete parsed content only |
| True | Splits content into chunks and returns chunk metadata |

Chunk Size WHEN CHUNKING ENABLED

Defines the maximum number of characters in each generated chunk. Smaller values generate more chunks; larger values generate fewer.

Recommended values
  • 500–1000 characters for granular AI retrieval
  • 1000–2000 characters for larger context windows
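To see how chunk size affects the number of chunks, here is a minimal sketch that splits text at fixed character boundaries. It is an assumption for illustration only: the article does not say how the node picks split points, and `chunk_text` is a hypothetical helper, not the node's algorithm.

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into pieces of at most chunk_size characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "x" * 4500  # stand-in for 4,500 characters of parsed content
for size in (500, 1000, 2000):
    print(size, len(chunk_text(sample, size)))
```

With 4,500 characters of input, sizes of 500, 1000, and 2000 yield 9, 5, and 3 chunks respectively, and the last chunk in each case may be shorter than the others.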

Output Fields

| Field | Type | Description |
| --- | --- | --- |
| parsed_content | String | Complete extracted PDF content in Markdown format |
| chunks | Array<String> | Generated content chunks when chunking is enabled; empty array when disabled |
| chunk_count | Integer | Total number of chunks generated from the document |
| word_count | Integer | Total number of words extracted from the PDF |
| source_url | String | The original PDF URL that was processed |
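For downstream code, the output can be modeled as a typed dictionary. The sketch below mirrors the five documented fields; the class name `PdfParserOutput` is hypothetical and every value in `example_output` is made up for illustration rather than taken from a real run.

```python
from typing import TypedDict


class PdfParserOutput(TypedDict):
    parsed_content: str   # complete Markdown content
    chunks: list[str]     # empty list when chunking is disabled
    chunk_count: int
    word_count: int
    source_url: str


# Illustrative values only, shaped like a run with chunking enabled.
example_output: PdfParserOutput = {
    "parsed_content": "# User Guide\n\nInstall the app.",
    "chunks": ["# User Guide", "Install the app."],
    "chunk_count": 2,
    "word_count": 5,
    "source_url": "https://example.com/user-guide.pdf",
}

print(sorted(example_output))
```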

Step-by-step Guide

Step 1: Add the PDF Parser node
  1. Open your workflow on the canvas.
  2. Click the + icon.
  3. Search for and select PDF Parser.

Step 2: Select a connection

Choose an existing PDF Parser connection. If required, create a new connection before proceeding.

Step 3: Configure the PDF URL

Provide a PDF URL directly or map a field from an upstream node.

{{webhook.pdf_url}}

Step 4: Select Output Type

Select Markdown.

Step 5: Configure chunking OPTIONAL

Return complete content only:
Enable Chunking = False

Generate chunks:
Enable Chunking = True
Chunk Size = 1000

Step 6: Save and test
  1. Click Continue and save the workflow.
  2. Run a test using a valid PDF URL.
  3. Verify parsed_content, chunks, word_count, and source_url in the output.
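If you copy the test output into a script, the verification step can be automated. This is a rough sketch under the assumption that the output is available as a Python dictionary with the documented field names; `verify_output` is a hypothetical helper, not a Konnectify feature.

```python
def verify_output(output: dict, expected_url: str, chunking_enabled: bool) -> list[str]:
    """Return a list of issues found in a test-run output; empty means it looks right."""
    issues = []
    for field in ("parsed_content", "chunks", "word_count", "source_url"):
        if field not in output:
            issues.append(f"missing field: {field}")
    if issues:
        return issues  # skip value checks when fields are missing
    if output["source_url"] != expected_url:
        issues.append("source_url does not match the URL under test")
    if not chunking_enabled and output["chunks"] != []:
        issues.append("chunks should be an empty array when chunking is disabled")
    if chunking_enabled and output["parsed_content"] and not output["chunks"]:
        issues.append("chunking is enabled but no chunks were returned")
    return issues


sample = {
    "parsed_content": "# Guide",
    "chunks": [],
    "word_count": 1,
    "source_url": "https://example.com/user-guide.pdf",
}
print(verify_output(sample, "https://example.com/user-guide.pdf", chunking_enabled=False))
```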

Things to Know

| Scenario | Behavior |
| --- | --- |
| Chunking disabled | chunks returns an empty array |
| Chunking enabled | chunks contains generated text sections |
| Small PDF | May generate only a single chunk |
| Large PDF | Generates multiple chunks |
| Invalid PDF URL | Processing fails with an error |
| Empty PDF | parsed_content may be empty |
| PDF exceeds 150 MB | Processing is rejected |
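A downstream step usually needs to cope with the first few scenarios above without special-casing each one. The sketch below shows one possible approach, assuming the output arrives as a dictionary with the documented field names; `texts_for_downstream` is a hypothetical helper.

```python
def texts_for_downstream(output: dict) -> list[str]:
    """Pick the text units to pass on, whatever the chunking setting was."""
    chunks = output.get("chunks") or []
    if chunks:
        return chunks  # chunking enabled: use the generated sections
    content = output.get("parsed_content") or ""
    # Chunking disabled: fall back to the full content, unless the PDF was empty.
    return [content] if content.strip() else []


print(texts_for_downstream({"parsed_content": "# Guide", "chunks": []}))
print(texts_for_downstream({"parsed_content": "", "chunks": []}))
```

The first call returns the full content as a single item; the second returns an empty list, which lets later steps skip empty PDFs.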

Limits at a glance

| Restriction | Limit |
| --- | --- |
| Maximum PDF size | 150 MB |
| Output formats | Markdown |
| Supported URL type | Publicly accessible URLs only |
| Chunking | Optional |
| Minimum chunk size | 1 character |

Examples

Extract a product manual
Document processing · Chunking enabled · Searchable content
EXTRACTION

Convert a product manual into searchable Markdown content with individual sections available as chunks.

Configuration
PDF URL: https://example.com/manual.pdf
Output Type: Markdown
Enable Chunking: True
Chunk Size: 1000

✓ Complete document in parsed_content
✓ Individual sections in chunks
✓ Metadata via chunk_count and word_count

Build an AI knowledge base
RAG · Vector database · Embeddings
AI / RAG

Prepare documentation for AI-powered search and retrieval by chunking PDFs into embeddings-ready sections.

Trigger → PDF Parser → Embeddings → Vector DB

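Between the PDF Parser and Embeddings steps, each chunk typically needs a stable ID and some metadata so it can be traced back to its source. The following is a minimal sketch of that preparation step. The record shape (`id`, `text`, `metadata`) and the helper `build_chunk_records` are assumptions for illustration; the embedding model and vector database are left out because the article does not name any.

```python
import hashlib


def build_chunk_records(output: dict) -> list[dict]:
    """Turn PDF Parser chunks into records ready for an embedding step."""
    # A short hash of the source URL keeps IDs stable across re-runs of the same PDF.
    url_hash = hashlib.sha256(output["source_url"].encode("utf-8")).hexdigest()[:12]
    return [
        {
            "id": f"{url_hash}-{index}",
            "text": chunk,
            "metadata": {"source_url": output["source_url"], "chunk_index": index},
        }
        for index, chunk in enumerate(output["chunks"])
    ]


records = build_chunk_records(
    {"source_url": "https://example.com/manual.pdf", "chunks": ["Intro", "Setup"]}
)
print(len(records), records[1]["metadata"]["chunk_index"])
```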
Process legal documents
Contract extraction · Review workflow · Storage
LEGAL

Extract contract text and pass the content to downstream review and approval workflows.

Trigger → PDF Parser → Review Process → Storage

Parse invoices
Finance · Data extraction · Accounting platform
FINANCE

Extract invoice content from PDFs before sending data into accounting systems.

Trigger → PDF Parser → Data Extraction → Accounting Platform

Frequently Asked Questions

Input & Configuration

What file types are supported?

PDF Parser currently supports PDF documents accessible through public URLs.

Can I use dynamic field mappings for the PDF URL?

Yes. The PDF URL field supports dynamic mappings from upstream nodes.

{{webhook.pdf_url}}

Can I process large PDF documents?

Yes. PDF Parser supports files up to 150 MB. Files exceeding this limit are rejected at processing time.

Chunking & Output

Is chunking required?

No. Chunking is optional. When disabled, the complete content is returned through parsed_content.

What chunk size should I use?
  • 500–1000 characters for granular AI retrieval
  • 1000–2000 characters for larger context windows

What format is the extracted content returned in?

Markdown. The complete extracted content is returned through parsed_content.

What does word_count represent?

word_count returns the total number of words extracted from the document.

Behavior & Limits

What happens if the PDF URL is invalid?

The action fails and returns an error indicating that the document could not be retrieved or processed.

Does PDF Parser modify the original PDF?

No. The original PDF remains unchanged. PDF Parser only reads and extracts content.

Ready to process documents?

Add the PDF Parser node to your workflow and convert PDF documents into structured, AI-ready content for automation, search, and knowledge management.

