Converting PDFs to Markdown with Marker & Pandoc

Set up with Marker and Pandoc

Posted Apr 13, 2025

By phruit


Marker → GitHub

Let’s get you set up with Marker, a powerful open-source tool for converting PDFs to Markdown. Here’s a step-by-step guide to help you install and use it:

🛠️ Step 1: Install Dependencies

1. Install Python (Version 3.10 or later)

Windows/macOS: Download and install from the official Python website.

Linux: Use your package manager. For example, on Debian/Ubuntu:

  
sudo apt update
sudo apt install python3 python3-pip
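
If you're not sure which Python you ended up with, a quick check like the one below can confirm it meets the 3.10 minimum. This is only a small sketch; the helper name python_is_new_enough is made up for illustration.

```python
import sys

# Minimum version named in this step.
MIN_VERSION = (3, 10)

def python_is_new_enough(version=None, minimum=MIN_VERSION):
    """Return True when the (major, minor) part of version meets the minimum."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= minimum

current = ".".join(str(part) for part in sys.version_info[:3])
status = "OK" if python_is_new_enough() else "too old for Marker"
print(f"Python {current}: {status}")
```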

2. Install PyTorch

Visit the PyTorch installation page and follow the instructions tailored to your system configuration.

For CPU-only installation:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

📦 Step 2: Install Marker

Once Python and PyTorch are set up, install Marker using pip:

pip install marker-pdf

✅ Step 3: Use `marker_single`

marker_single "E:/mydocs/my.pdf"

This command will:
- Process the full PDF (no page limit)
- Output the .md, .json, and images to a conversion_results folder inside the virtual environment you're currently using, like E:/marker-env/.venv/Lib/site-packages/conversion_results/my.

🛠️ Optional flags you can use

  
--languages en
--max_retries 3
--max_table_rows 30

(You can check the full list with marker_single --help)
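
If you'd rather call Marker from Python than type the flags each time, here is a minimal sketch that builds the same command line. The helpers build_marker_command and run_marker are hypothetical names, and the sketch does not validate flag names, so it only works for flags your installed marker_single actually accepts.

```python
import subprocess

def build_marker_command(pdf_path, **flags):
    """Build the argument list for marker_single.

    Keyword arguments become "--name value" pairs, e.g. languages="en".
    Flag names are passed through unchecked; see marker_single --help.
    """
    command = ["marker_single", str(pdf_path)]
    for name, value in flags.items():
        command += [f"--{name}", str(value)]
    return command

def run_marker(pdf_path, **flags):
    """Run marker_single and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(build_marker_command(pdf_path, **flags), check=True)

print(build_marker_command("E:/mydocs/my.pdf", languages="en", max_retries=3))
```

Calling run_marker("E:/mydocs/my.pdf", languages="en") inside the activated environment would then start the actual conversion.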

✅ My Example: `CPU-only installation` in `virtual environment` on `E:/`

Two Qs to Clarify Before Installation

🔍 What does “CPU-only” mean?

Most machine learning libraries like PyTorch come in two flavors:

CUDA version (uses GPU for acceleration - needs NVIDIA GPU + drivers)
CPU-only version (runs on your processor, slower but works everywhere)

If you don't have a CUDA-enabled NVIDIA GPU, the CPU-only install is the safest and most compatible option. 💻✅
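
To see which flavor you actually have, you can ask PyTorch itself. The sketch below uses torch.cuda.is_available(), and the wrapper name describe_torch_backend is just an illustration. Note that a False result can mean either a CPU-only build or a CUDA build that can't find a usable GPU.

```python
import importlib.util

def describe_torch_backend():
    """Report whether PyTorch is installed and whether it can use a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch  # imported lazily so the check also runs without PyTorch
    if torch.cuda.is_available():
        return f"PyTorch {torch.__version__}: CUDA GPU available"
    # Either a CPU-only build, or no usable NVIDIA GPU/driver was found.
    return f"PyTorch {torch.__version__}: running on CPU"

print(describe_torch_backend())
```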

💾 Can you install it to `E:`?

Python and pip don’t install packages to a specific drive like E:\ by default - they install to the Python environment you’re using. But yes, you can control the installation location with a virtual environment stored anywhere you like (like on E:).

🔧 How to install Marker on `E:\` safely (step-by-step)

1. Open CMD or PowerShell

  
E:
mkdir marker-env
cd marker-env
python -m venv .venv
.\.venv\Scripts\activate

💡 You’re now inside a virtual environment on E:. Alternatively, you can activate the virtual environment of an existing repo in the same way.
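
If you want to double-check that the activation worked, Python can tell you where it is running from. A small sketch, with in_virtual_environment as an illustrative name:

```python
import sys

def in_virtual_environment():
    """Return True when this interpreter runs inside a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("virtual environment active:", in_virtual_environment())
print("packages install under:", sys.prefix)
```

With the environment above activated, the second line should point somewhere under E:\marker-env.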

2. Install Marker (CPU-only style)

Now just run:

  
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install marker-pdf

This installs PyTorch (CPU-only) and Marker right into your E:\marker-env.

3. Test it

Try it on a small PDF file (about 5 pages):

marker_single "E:/mydocs/my.pdf"

That should spit out clean Markdown + any extracted images and metadata, all saved to your E:/marker-env/.venv/Lib/site-packages/conversion_results/my folder.

Process the PDF Manually Before Running `Marker`

Okay, so now we know how to use Marker to process a small file. But once we want to process an entire PDF book, things can get tricky.

First things first: MemoryError! If we let Marker process the whole PDF at once, it generally uses more memory than the system has, and the conversion fails.

So here’s the plan, based on the input & output file paths:

A script, slice_pdf.py, automatically slices the PDF into 10-page chunks and saves each one to a local folder:

  
 from PyPDF2 import PdfReader, PdfWriter
 import os

 # 💾 Path to your big PDF file
 input_pdf = r"E:\mydocs\my.pdf"

 # 📁 Folder to save sliced PDFs
 output_folder = r"E:\mydocs\my\sliced"
 os.makedirs(output_folder, exist_ok=True)

 # 🔍 Load the original file
 reader = PdfReader(input_pdf)

 # 🔪 Split every 10 pages
 slice_size = 10
 for start in range(0, len(reader.pages), slice_size):
     writer = PdfWriter()
     end = min(start + slice_size, len(reader.pages))
        
     for i in range(start, end):
         writer.add_page(reader.pages[i])
        
     out_path = os.path.join(
         output_folder,
         f"my_{start+1:03d}-{end:03d}.pdf"
     )
        
     with open(out_path, "wb") as f:
         writer.write(f)

 print("✅ Done slicing!")
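
To preview which files the slicing loop will write before touching a real PDF, the same range arithmetic can be run on its own. This is a sketch that mirrors the loop above; chunk_names is a made-up helper, and 25 is just an example page count.

```python
def chunk_names(total_pages, slice_size=10, prefix="my"):
    """List the chunk file names that the slicing loop would write."""
    names = []
    for start in range(0, total_pages, slice_size):
        end = min(start + slice_size, total_pages)
        # Pages are 0-based in the loop but 1-based in the file name.
        names.append(f"{prefix}_{start + 1:03d}-{end:03d}.pdf")
    return names

print(chunk_names(25))
# ['my_001-010.pdf', 'my_011-020.pdf', 'my_021-025.pdf']
```

The last chunk is simply shorter when the page count isn't a multiple of 10.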

Then we’ll write a marker_batch_run.bat file that automates the following steps:

Loop through the chunks and run marker_single on each PDF
Move all generated folders to the output path

Here’s the script marker_batch_run.bat:

  
 @echo off
 setlocal enabledelayedexpansion

 REM 🔧 Customize these paths
 set "pdf_dir=E:\mydocs\my\sliced"
 set "out_dir=E:\mydocs\my\markdown_output"
 set "marker_out_dir=E:\marker-env\.venv\Lib\site-packages\conversion_results"

 REM 🧙‍♂️ Activate virtual environment
 call E:\marker-env\.venv\Scripts\activate.bat

 REM 📁 Make output folder if missing
 if not exist "!out_dir!" mkdir "!out_dir!"

 echo 🔁 Starting batch conversion...

 for %%F in ("%pdf_dir%\*.pdf") do (
     echo 🧾 Converting: %%~nxF
     marker_single "%%F"

     set "filename=%%~nF"
     set "source_folder=!marker_out_dir!\!filename!"
     set "target_folder=!out_dir!\!filename!"

     if exist "!source_folder!" (
         echo 🚚 Moving output folder: !source_folder! → !target_folder!
         move /Y "!source_folder!" "!target_folder!" >nul
     ) else (
         echo ⚠️  Output not found for %%~nxF — check Marker logs.
     )
 )

 echo ✅ All conversions complete!
 pause
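
If you're not on Windows, or just prefer to stay in Python, the same loop can be sketched with subprocess and shutil. This is a rough equivalent, not a drop-in replacement: convert_chunks is a hypothetical name, it must run inside the activated environment, and like the .bat file it assumes Marker writes a folder named after each PDF into marker_out_dir.

```python
import shutil
import subprocess
from pathlib import Path

def convert_chunks(pdf_dir, out_dir, marker_out_dir, run=subprocess.run):
    """Run marker_single on every chunk and move each result folder to out_dir.

    Returns the names of the chunks whose output folder was not found.
    """
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    marker_out_dir = Path(marker_out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        print(f"Converting: {pdf.name}")
        run(["marker_single", str(pdf)], check=False)
        # Same assumption as the .bat file: one folder per PDF, named after it.
        source = marker_out_dir / pdf.stem
        if source.is_dir():
            shutil.move(str(source), str(out_dir / pdf.stem))
        else:
            missing.append(pdf.name)
    return missing
```

Calling convert_chunks(r"E:\mydocs\my\sliced", r"E:\mydocs\my\markdown_output", r"E:\marker-env\.venv\Lib\site-packages\conversion_results") would mirror the paths used in the batch file, and the returned list tells you which chunks to re-check.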

Merge the Markdown Files

By now, all the `pdf chunks` should be converted to `markdown`.

It’s time to merge them and extract all the images so they can be embedded in the complete .md file. Write a Python script like the one below, which will:

Scan your markdown_output/ folder
Sort the chunk folders (e.g., my_001-010) in correct numeric order
For each folder, find the .md file
Append each file’s content to one big merged file (e.g., my_001-471.md), with a --- separator before each chunk
Copy & rename the *.jpeg files to my_001.jpeg, my_002.jpeg…
Use pprint to print all the image paths

🧐 Attention: The images may need an extra manual check. The script renames the copied images but doesn’t update the image links inside the merged .md file, so it can be a boring job if you’ve got a ton of them to process. 💦

Script For Merge & Extract `extract_merge.py`

  
import os
import glob
import shutil
from pprint import pprint

# 👇 Change this to your actual path
base_dir = r"E:\mydocs\my\markdown_output"
output_md_path = os.path.join(base_dir, "my_001-471.md")
image_output_dir = base_dir  # Output all .jpeg files here

# Get folders like my_001-010
folders = [f for f in os.listdir(base_dir) if os.path.isdir(os.path.join(base_dir, f)) and f.startswith("my_")]

# Sort them by the first number (e.g., 001, 011...)
def folder_sort_key(name):
    parts = name.replace("my_", "").split("-")
    return int(parts[0])

folders.sort(key=folder_sort_key)

# Merge markdowns and extract images
image_counter = 1
with open(output_md_path, "w", encoding="utf-8") as outfile:
    for folder in folders:
        folder_path = os.path.join(base_dir, folder)

        # 🔗 Merge .md file
        md_files = glob.glob(os.path.join(folder_path, "*.md"))
        if md_files:
            with open(md_files[0], "r", encoding="utf-8") as infile:
                outfile.write("\n\n---\n\n")
                outfile.write(infile.read())

        # 🖼 Copy & rename images
        image_files = sorted(glob.glob(os.path.join(folder_path, "*.jpeg")))
        pprint(image_files)
        for img in image_files:
            new_name = f"my_{image_counter:03d}.jpeg"
            new_path = os.path.join(image_output_dir, new_name)
            shutil.copy(img, new_path)
            image_counter += 1

print(f"✅ Merge complete: {output_md_path}")
print(f"✅ Total images extracted: {image_counter - 1}")
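
One way to cut down the manual image work is to rewrite the links while merging. The sketch below assumes the chunk Markdown uses standard image links like ![alt](name.jpeg); relink_images is a hypothetical helper and the file names in the sample are made up. It handles one chunk at a time, because different chunks may reuse the same original image names.

```python
import re

# Matches a Markdown image such as ![alt](some/picture.jpeg)
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")

def relink_images(markdown_text, renamed):
    """Point image links at their new file names.

    renamed maps an original file name to the name it was copied to,
    for the images of ONE chunk folder.
    """
    def swap(match):
        alt, target = match.group(1), match.group(2)
        # Compare on the bare file name, ignoring any folder prefix.
        old_name = target.replace("\\", "/").rsplit("/", 1)[-1]
        return f"![{alt}]({renamed.get(old_name, target)})"
    return IMAGE_LINK.sub(swap, markdown_text)

sample = "Intro\n\n![](figure_a.jpeg)\n\n![chart](figure_b.jpeg)"
print(relink_images(sample, {"figure_a.jpeg": "my_001.jpeg"}))
```

Links without an entry in the mapping are left untouched, so a partial mapping is safe to apply.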

It’s ALL DONE NOW! 🎊

Script Order Summary

slice_pdf.py
marker_batch_run.bat
extract_merge.py

Pandoc - Converting documents between formats like PDF, EPUB, Markdown, and more

Let’s get started with Pandoc, a tool for converting documents between formats like PDF, EPUB, Markdown, and more. Here’s a step-by-step guide to help you master it:

💻 Step 1: Install Pandoc

🪟 Windows

Download Installer: Visit the Pandoc Downloads page and download the Windows installer.
Run Installer: Double-click the downloaded .msi file and follow the installation prompts.

🍎 macOS

Download Package: Go to the Pandoc Downloads page and download the macOS package.
Install: Open the downloaded .pkg file and follow the installation instructions.

🐧 Linux (Debian/Ubuntu)

Open your terminal and run:

sudo apt update
sudo apt install pandoc

🧪 Step 2: Verify Installation

After installation, confirm that Pandoc is installed correctly:

pandoc --version

You should see the installed version of Pandoc displayed.
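
The same check can be done from a script, which is handy before starting a batch of conversions. A minimal sketch using only the standard library; pandoc_version is an illustrative name.

```python
import shutil
import subprocess

def pandoc_version():
    """Return the first line of `pandoc --version`, or None if Pandoc is not on PATH."""
    executable = shutil.which("pandoc")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""

print(pandoc_version() or "Pandoc was not found on PATH")
```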

🔄 Step 3: Basic Conversions

Pandoc can convert files between various formats. Here are some common examples:

📄 Convert Markdown to HTML

pandoc input.md -o output.html

📚 Convert Markdown to PDF

Note: Converting to PDF requires a LaTeX engine (e.g., TeX Live, MiKTeX).

pandoc input.md -o output.pdf

📘 Convert Markdown to EPUB

pandoc input.md -o output.epub

📝 Convert Word Document to Markdown

pandoc input.docx -o output.md

📄 Convert PDF to Markdown

Pandoc doesn’t support PDF as an input format directly. For converting PDFs to Markdown, consider using tools like Marker or other PDF to Markdown converters.

⚙️ Step 4: Useful Options

-s or --standalone: Produce a standalone document (with header and footer).
-f or --from: Specify input format (e.g., markdown, html).
-t or --to: Specify output format (e.g., html, epub). For PDF, give the output file a .pdf extension instead of using -t pdf.
-o: Specify output file name.

Example:

  
pandoc -s -f markdown -t html -o output.html input.md
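
If you call Pandoc from a script, the options above map neatly onto an argument list. Here is a small sketch; pandoc_command and convert are made-up helper names, and only the four options listed in this step are covered.

```python
import subprocess

def pandoc_command(source, output, from_format=None, to_format=None, standalone=False):
    """Build a pandoc argument list from the options described above."""
    command = ["pandoc"]
    if standalone:
        command.append("-s")
    if from_format:
        command += ["-f", from_format]
    if to_format:
        command += ["-t", to_format]
    command += ["-o", str(output), str(source)]
    return command

def convert(source, output, **options):
    """Run pandoc and raise CalledProcessError if the conversion fails."""
    return subprocess.run(pandoc_command(source, output, **options), check=True)

print(pandoc_command("input.md", "output.html", "markdown", "html", standalone=True))
# ['pandoc', '-s', '-f', 'markdown', '-t', 'html', '-o', 'output.html', 'input.md']
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces.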

📚 Additional Resources

Official Documentation: For more detailed information, visit the Pandoc User’s Guide.
Tutorials:
- Getting Started with Pandoc
- How to Use Pandoc - An Open Source Tool for Technical Writers

This post is licensed under CC BY 4.0 by the author.