Converting PDFs to Markdown with Marker & Pandoc
Set up with Marker and Pandoc
Marker - GitHub
Let's get you set up with Marker, a powerful open-source tool for converting PDFs to Markdown. Here's a step-by-step guide to help you install and use it:
Step 1: Install Dependencies
1. Install Python (Version 3.10 or later)
- Windows/macOS: Download and install from the official Python website.
- Linux: Use your package manager. For example, on Debian/Ubuntu:
sudo apt update
sudo apt install python3 python3-pip
2. Install PyTorch
- Visit the PyTorch installation page and follow the instructions tailored to your system configuration.
For a CPU-only installation, point pip at the CPU wheel index:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Step 2: Install Marker
Once Python and PyTorch are set up, install Marker using pip:
pip install marker-pdf
Step 3: Use marker_single
marker_single "E:/mydocs/my.pdf"
- This command will:
  - Process the full PDF (no page limit)
  - Write the .md file, the .json file, and the images into the virtual environment that is currently active, for example E:/marker-env/.venv/Lib/site-packages/conversion_results/my.
Optional flags you can use
--languages en
--max_retries 3
--max_table_rows 30
(You can check the full list with marker_single --help.)
My Example: CPU-only installation in a virtual environment on E:/
Two Qs to Clarify Before Installation
What does "CPU-only" mean?
Most machine learning libraries like PyTorch come in two flavors:
- CUDA version (uses GPU for acceleration - needs NVIDIA GPU + drivers)
- CPU-only version (runs on your processor, slower but works everywhere)
If you are not using a GPU (for example, a CUDA-enabled NVIDIA card), the CPU-only install is the safest and most compatible option.
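To see which flavor you ended up with, a small check like the one below can help. It is only a sketch: the helper name describe_torch_device is made up for this post, and it relies on nothing beyond torch.cuda.is_available() and torch.cuda.get_device_name().

```python
def describe_torch_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch  # only importable after PyTorch is installed
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # A CUDA build found a usable NVIDIA GPU and driver.
        return f"CUDA GPU available: {torch.cuda.get_device_name(0)}"
    # Either the CPU-only build is installed or no usable GPU was found.
    return "CPU only"


print(describe_torch_device())
```

With the CPU-only install described here, this should report "CPU only".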
Can you install it to E:?
Python and pip don't install packages to a specific drive like E:\ by default - they install to the Python environment you're using. But yes, you can control the installation location with a virtual environment stored anywhere you like (like on E:).
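If you are ever unsure which environment pip is about to install into, Python can tell you. This is a minimal sketch using only the standard library; the two helper names are made up for illustration.

```python
import sys
import sysconfig


def in_virtualenv():
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def package_install_dir():
    # "purelib" is the site-packages directory used for pure-Python packages.
    return sysconfig.get_paths()["purelib"]


print("Virtual environment active:", in_virtualenv())
print("Packages are installed into:", package_install_dir())
```

Run it with the interpreter you plan to use for Marker. If the second line shows a path on E:, your packages will land there.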
How to install Marker on E:\ safely (step-by-step)
1. Open CMD or PowerShell
E:
mkdir marker-env
cd marker-env
python -m venv .venv
.\.venv\Scripts\activate
You're now inside a virtual environment on E:. Alternatively, you can activate the virtual environment of an existing repo in the same way.
2. Install Marker (CPU-only style)
Now just run:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install marker-pdf
This installs PyTorch (CPU-only) and Marker right into your E:\marker-env.
3. Test it
Try running it on a small PDF file (like 5 pages):
marker_single "E:/mydocs/my.pdf"
That should spit out clean Markdown plus any extracted images and metadata, all saved to your E:/marker-env/.venv/Lib/site-packages/conversion_results/my folder.
Process the PDF Manually Before Running Marker
Okay, so now we know how to use Marker to process a small file. But once we want to process an entire PDF book, things can get tricky.
First things first: MemoryError! Generally, if we let Marker process the whole PDF at once, it needs more memory than the system has, and the run eventually fails.
So here's the plan, based on the input and output file paths:
A script automatically slices the PDF into 10-page chunks and saves each one to a local folder
slice_pdf.py:
from PyPDF2 import PdfReader, PdfWriter
import os

# Path to your big PDF file
input_pdf = r"E:\mydocs\my.pdf"

# Folder to save sliced PDFs (a sibling folder, not a path under the my.pdf file)
output_folder = r"E:\mydocs\my\sliced"
os.makedirs(output_folder, exist_ok=True)

# Load the original file
reader = PdfReader(input_pdf)

# Split every 10 pages
slice_size = 10
for start in range(0, len(reader.pages), slice_size):
    writer = PdfWriter()
    end = min(start + slice_size, len(reader.pages))

    # Copy pages start..end-1 (0-based) into the new chunk
    for i in range(start, end):
        writer.add_page(reader.pages[i])

    # Name chunks with 1-based, zero-padded page numbers, e.g. my_001-010.pdf
    out_path = os.path.join(
        output_folder,
        f"my_{start+1:03d}-{end:03d}.pdf"
    )
    with open(out_path, "wb") as f:
        writer.write(f)

print("Done slicing!")
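The page arithmetic in the slicing loop is easy to check on its own, without touching a real PDF. The sketch below pulls it out into two helper functions (chunk_ranges and chunk_name are names invented here, not part of PyPDF2) and prints the file names a 25-page document would produce.

```python
def chunk_ranges(total_pages, slice_size=10):
    """Yield (start, end) page index pairs; start is 0-based, end is exclusive."""
    for start in range(0, total_pages, slice_size):
        yield start, min(start + slice_size, total_pages)


def chunk_name(start, end, prefix="my"):
    # File names use 1-based page numbers, zero-padded to three digits.
    return f"{prefix}_{start + 1:03d}-{end:03d}.pdf"


for start, end in chunk_ranges(25):
    print(chunk_name(start, end))
# my_001-010.pdf
# my_011-020.pdf
# my_021-025.pdf
```

The last chunk is simply shorter when the page count is not a multiple of 10.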
Then we'll write a marker_batch_run.bat file that automates the following steps:
- Loop through the chunks and run marker_single on each PDF
- Move all generated folders to the output path
Here's the script marker_batch_run.bat:
@echo off
setlocal enabledelayedexpansion

REM Customize these paths
set "pdf_dir=E:\mydocs\my\sliced"
set "out_dir=E:\mydocs\my\markdown_output"
set "marker_out_dir=E:\marker-env\.venv\Lib\site-packages\conversion_results"

REM Activate virtual environment
call E:\marker-env\.venv\Scripts\activate.bat

REM Make output folder if missing
if not exist "!out_dir!" mkdir "!out_dir!"

echo Starting batch conversion...

for %%F in ("%pdf_dir%\*.pdf") do (
    echo Converting: %%~nxF
    marker_single "%%F"

    set "filename=%%~nF"
    set "source_folder=!marker_out_dir!\!filename!"
    set "target_folder=!out_dir!\!filename!"

    if exist "!source_folder!" (
        echo Moving output folder: !source_folder! to !target_folder!
        move /Y "!source_folder!" "!target_folder!" >nul
    ) else (
        echo Output not found for %%~nxF, check Marker logs.
    )
)

echo All conversions complete!
pause
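If you'd rather not use a .bat file (for example on macOS or Linux), the same loop can be sketched in Python. This is an assumption-laden sketch, not a replacement for the batch script: convert_chunks is a name made up here, it assumes marker_single is on PATH in the active environment, and it assumes Marker writes each result into a folder named after the PDF, as described above. The run parameter exists only so the function can be tried without Marker installed.

```python
import shutil
import subprocess
from pathlib import Path


def convert_chunks(pdf_dir, marker_out_dir, out_dir, run=subprocess.run):
    """Run marker_single on every PDF in pdf_dir and collect the results.

    Returns the file names of chunks whose output folder was not found.
    """
    pdf_dir, marker_out_dir, out_dir = map(Path, (pdf_dir, marker_out_dir, out_dir))
    out_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # Same call as the batch file: marker_single "<chunk>.pdf"
        run(["marker_single", str(pdf)], check=False)
        source = marker_out_dir / pdf.stem
        if source.is_dir():
            shutil.move(str(source), str(out_dir / pdf.stem))
        else:
            missing.append(pdf.name)
    return missing


# Example call with the paths used in this post (commented out on purpose):
# missing = convert_chunks(
#     r"E:\mydocs\my\sliced",
#     r"E:\marker-env\.venv\Lib\site-packages\conversion_results",
#     r"E:\mydocs\my\markdown_output",
# )
# print("Chunks without output:", missing)
```

Returning the list of missing chunks makes it easy to re-run only the ones that failed.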
Merge the Markdown Files
By now, all the PDF chunks should have been converted to Markdown.
It's time to merge them and extract all the images so they can be embedded in the complete .md file. Write a Python script like the one below, which will:
- Scan your markdown_output/ folder
- Sort the folders in correct numeric order
- For each folder (e.g., my_001-010), find the .md file
- Append each file's content to one big merged file (e.g., my_001-471.md), with a separator before each chunk
- Copy and rename *.jpeg files like my_001.jpeg, my_002.jpeg, ...
- Use pprint to print all the image paths
Attention: The images may take extra time to check manually, because the script renames the copies but does not update the image links inside the merged .md file. It would be a boring job if you've got a ton of them to process.
Script for Merge & Extract: extract_merge.py
import os
import glob
import shutil
from pprint import pprint

# Change this to your actual path
base_dir = r"E:\mydocs\my\markdown_output"
output_md_path = os.path.join(base_dir, "my_001-471.md")
image_output_dir = base_dir  # Output all .jpeg files here

# Get folders like my_001-010
folders = [
    f for f in os.listdir(base_dir)
    if os.path.isdir(os.path.join(base_dir, f)) and f.startswith("my_")
]

# Sort them by the first number (e.g., 001, 011...)
def folder_sort_key(name):
    parts = name.replace("my_", "").split("-")
    return int(parts[0])

folders.sort(key=folder_sort_key)

# Merge markdowns and extract images
image_counter = 1
with open(output_md_path, "w", encoding="utf-8") as outfile:
    for folder in folders:
        folder_path = os.path.join(base_dir, folder)

        # Merge .md file (a "---" separator goes before each chunk)
        md_files = glob.glob(os.path.join(folder_path, "*.md"))
        if md_files:
            with open(md_files[0], "r", encoding="utf-8") as infile:
                outfile.write("\n\n---\n\n")
                outfile.write(infile.read())

        # Copy & rename images
        image_files = sorted(glob.glob(os.path.join(folder_path, "*.jpeg")))
        pprint(image_files)
        for img in image_files:
            new_name = f"my_{image_counter:03d}.jpeg"
            new_path = os.path.join(image_output_dir, new_name)
            shutil.copy(img, new_path)
            image_counter += 1

print(f"Merge complete: {output_md_path}")
print(f"Total images extracted: {image_counter - 1}")
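Because the copied images get new names, the links inside the merged Markdown still point at the old ones. If you record an old-name to new-name mapping while copying, part of the manual checking can be automated. The sketch below is one possible approach, not something extract_merge.py does: rewrite_image_links is a made-up helper, the sample file name _page_0_Picture_1.jpeg is hypothetical, and it assumes the links use the standard Markdown image syntax ![alt](path).

```python
import re

# Matches standard Markdown image syntax: ![alt text](path)
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_image_links(markdown_text, rename_map):
    """Replace image paths found in rename_map; leave all other links untouched."""
    def replace(match):
        alt, path = match.group(1), match.group(2)
        return f"![{alt}]({rename_map.get(path, path)})"
    return IMAGE_LINK.sub(replace, markdown_text)


sample = "Intro\n\n![](_page_0_Picture_1.jpeg)\n\n![logo](other.png)"
print(rewrite_image_links(sample, {"_page_0_Picture_1.jpeg": "my_001.jpeg"}))
```

Note that different chunks may reuse the same original image names, so the mapping would need to be applied per chunk, before the chunk's text is appended to the merged file.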
It's ALL DONE NOW!
Script Order Summary
slice_pdf.py
marker_batch_run.bat
extract_merge.py
Pandoc - Converting documents between formats like PDF, EPUB, Markdown, and more
Let's get started with Pandoc, a tool for converting documents between formats like PDF, EPUB, Markdown, and more. Here's a step-by-step guide to help you master it:
Step 1: Install Pandoc
Windows
- Download Installer: Visit the Pandoc Downloads page and download the Windows installer.
- Run Installer: Double-click the downloaded .msi file and follow the installation prompts.
macOS
- Download Package: Go to the Pandoc Downloads page and download the macOS package.
- Install: Open the downloaded .pkg file and follow the installation instructions.
Linux (Debian/Ubuntu)
Open your terminal and run:
sudo apt update
sudo apt install pandoc
Step 2: Verify Installation
After installation, confirm that Pandoc is installed correctly:
pandoc --version
You should see the installed version of Pandoc displayed.
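The same check can be done from a script, which is handy before starting a batch of conversions. A minimal sketch using only the standard library; pandoc_version is a helper name invented here.

```python
import shutil
import subprocess


def pandoc_version():
    """Return the first line of `pandoc --version`, or None if pandoc is not found."""
    exe = shutil.which("pandoc")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


print(pandoc_version() or "pandoc was not found on PATH")
```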
Step 3: Basic Conversions
Pandoc can convert files between various formats. Here are some common examples:
Convert Markdown to HTML
pandoc input.md -o output.html
Convert Markdown to PDF
Note: Converting to PDF requires a LaTeX distribution (e.g., TeX Live, MiKTeX).
pandoc input.md -o output.pdf
Convert Markdown to EPUB
pandoc input.md -o output.epub
Convert Word Document to Markdown
pandoc input.docx -o output.md
Convert PDF to Markdown
Pandoc doesn't support PDF as an input format. For converting PDFs to Markdown, consider using tools like Marker or other PDF to Markdown converters.
Step 4: Useful Options
- -s or --standalone: Produce a standalone document (with header and footer).
- -f or --from: Specify input format (e.g., markdown, html).
- -t or --to: Specify output format (e.g., pdf, epub).
- -o: Specify output file name.
Example:
pandoc -s -f markdown -t html -o output.html input.md
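When you call Pandoc from a script, it helps to build the argument list in one place. The sketch below only uses the options listed above (-s, -f, -t, -o); pandoc_command is a made-up helper, and the actual conversion is left commented out so the snippet runs even without Pandoc installed.

```python
import subprocess


def pandoc_command(src, dst, from_format=None, to_format=None, standalone=False):
    """Build the argument list for a pandoc conversion."""
    cmd = ["pandoc"]
    if standalone:
        cmd.append("-s")
    if from_format:
        cmd += ["-f", from_format]
    if to_format:
        cmd += ["-t", to_format]
    cmd += ["-o", dst, src]
    return cmd


cmd = pandoc_command("input.md", "output.html", "markdown", "html", standalone=True)
print(" ".join(cmd))
# pandoc -s -f markdown -t html -o output.html input.md

# To actually run the conversion (requires pandoc on PATH):
# subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids quoting problems with paths that contain spaces.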
Additional Resources
- Official Documentation: For more detailed information, visit the Pandoc User's Guide.
- Tutorials: