Converting PDFs to Markdown with Marker & Pandoc

Set up with Marker and Pandoc

Marker → GitHub

Let's get you set up with Marker, a powerful open-source tool for converting PDFs to Markdown. Here's a step-by-step guide to help you install and use it:

🛠️ Step 1: Install Dependencies

1. Install Python (Version 3.10 or later)

  • Windows/macOS: Download and install from the official Python website.
  • Linux: Use your package manager. For example, on Debian/Ubuntu:

    sudo apt update
    sudo apt install python3 python3-pip
    

2. Install PyTorch

  • Visit the PyTorch installation page and follow the instructions tailored to your system configuration.
  • For CPU-only installation:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
    

📦 Step 2: Install Marker

Once Python and PyTorch are set up, install Marker using pip:

pip install marker-pdf

✅ Step 3: Use marker_single

marker_single "E:/mydocs/my.pdf"
  • This command will:
    • Process the full PDF (no page limit)
    • Write the .md file, a .json metadata file, and any extracted images to a conversion_results folder inside the active virtual environment, for example E:/marker-env/.venv/Lib/site-packages/conversion_results/my.
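
If you would rather trigger the conversion from Python, here is a minimal sketch that wraps the same command. It assumes marker_single is on PATH (that is, the environment where marker-pdf was installed is active) and passes only the PDF path, with no extra flags. The helper names build_marker_command and convert_pdf are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_marker_command(pdf_path: str) -> list[str]:
    """Build the same command shown above as an argument list."""
    return ["marker_single", str(Path(pdf_path))]


def convert_pdf(pdf_path: str) -> int:
    """Run marker_single on one PDF and return its exit code."""
    if shutil.which("marker_single") is None:
        raise RuntimeError(
            "marker_single not found; activate the environment "
            "where marker-pdf is installed"
        )
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(pdf_path)
    # check=False: let the caller decide what a non-zero exit code means
    return subprocess.run(build_marker_command(pdf_path), check=False).returncode
```

Calling convert_pdf("E:/mydocs/my.pdf") should behave like typing the command in the terminal; passing the arguments as a list avoids quoting problems with spaces in paths.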

🛠️ Optional flags you can use

--languages en
--max_retries 3
--max_table_rows 30

(The available flags vary between Marker versions; check the full list with marker_single --help.)

✅ My Example: CPU-only installation in virtual environment on E:/

Two Qs to Clarify Before Installation

🔍 What does “CPU-only” mean?

Most machine learning libraries like PyTorch come in two flavors:

  • CUDA version (uses GPU for acceleration - needs NVIDIA GPU + drivers)
  • CPU-only version (runs on your processor, slower but works everywhere)

If you don't have a CUDA-capable NVIDIA GPU, the CPU-only install is the safest and most compatible option. 💻✅

💾 Can you install it to E:?

Python and pip don't install packages to a specific drive like E:\ by default; they install into whichever Python environment you're using. But yes, you can control the installation location by creating a virtual environment anywhere you like (for example on E:).
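
To see which environment pip will actually install into, a quick check like the sketch below can help. It relies only on the standard library; in_virtual_env is just a name picked for this example.

```python
import sys


def in_virtual_env() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Environment root:", sys.prefix)
print("Inside a virtual environment:", in_virtual_env())
```

If the environment root printed here is not the folder you expect (say, something under E:\marker-env), the virtual environment is probably not activated in the current shell.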

🔧 How to install Marker on E:\ safely (step-by-step)

1. Open CMD or PowerShell
E:
mkdir marker-env
cd marker-env
python -m venv .venv
.\.venv\Scripts\activate

💡 You're now inside a virtual environment on E:. Alternatively, you can activate the virtual environment of an existing repo in the same way.

2. Install Marker (CPU-only style)

Now just run:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install marker-pdf

This installs PyTorch (CPU-only) and Marker right into your E:\marker-env.
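
As a sanity check, the sketch below reports which PyTorch build ended up in the environment. It uses only torch.__version__ and torch.cuda.is_available(); the describe_torch helper is a made-up name, and it returns a message instead of crashing when PyTorch is missing.

```python
import importlib.util


def describe_torch() -> str:
    """Summarise the installed PyTorch build without failing if it is missing."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch  # imported lazily so the check above can run first

    device = "cuda" if torch.cuda.is_available() else "cpu"
    return f"PyTorch {torch.__version__}, available device: {device}"


print(describe_torch())
```

With a CPU-only install you would expect the device to be reported as cpu.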

3. Test it

Try it on a small PDF file (around 5 pages):

marker_single "E:/mydocs/my.pdf"

That should spit out clean Markdown + any extracted images and metadata, all saved to your E:/marker-env/.venv/Lib/site-packages/conversion_results/my folder.


Process the PDF Manually Before Running Marker

Okay, so now we know how to use Marker to process a small file. But once we want to process an entire PDF book, things can get tricky.

First things first: MemoryError! If we feed Marker the whole PDF at once, it can use more memory than the system has, and the run fails.
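
The usual way around this is to process the book in fixed-size page ranges. Here is a minimal sketch of that arithmetic, assuming 10-page chunks and a my_ file-name prefix; both function names are hypothetical helpers, not part of Marker.

```python
def chunk_ranges(total_pages: int, slice_size: int = 10) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) pairs covering the document."""
    if total_pages < 0 or slice_size < 1:
        raise ValueError("total_pages must be >= 0 and slice_size >= 1")
    return [
        (start + 1, min(start + slice_size, total_pages))
        for start in range(0, total_pages, slice_size)
    ]


def chunk_name(first: int, last: int, prefix: str = "my") -> str:
    """Zero-padded file name so that alphabetical order equals page order."""
    return f"{prefix}_{first:03d}-{last:03d}.pdf"


for first, last in chunk_ranges(25):
    print(chunk_name(first, last))
```

The last chunk is simply shorter when the page count is not a multiple of the slice size. Zero-padding to three digits keeps the names sortable for books of up to 999 pages.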

So here's the plan, based on the file input & output paths:

  1. A script, slice_pdf.py, automatically slices the PDF into 10-page chunks and saves each one to a local folder:

     from PyPDF2 import PdfReader, PdfWriter
     import os

     # 💾 Path to your big PDF file
     input_pdf = r"E:\mydocs\my.pdf"

     # 📁 Folder to save sliced PDFs (must not sit "inside" the PDF file path)
     output_folder = r"E:\mydocs\my\sliced"
     os.makedirs(output_folder, exist_ok=True)

     # 🔍 Load the original file
     reader = PdfReader(input_pdf)
     total_pages = len(reader.pages)

     # 🔪 Split every 10 pages
     slice_size = 10
     for start in range(0, total_pages, slice_size):
         writer = PdfWriter()
         end = min(start + slice_size, total_pages)

         # Copy pages start..end-1 (0-based) into the new chunk
         for i in range(start, end):
             writer.add_page(reader.pages[i])

         # File names use 1-based page numbers, e.g. my_001-010.pdf
         out_path = os.path.join(
             output_folder,
             f"my_{start+1:03d}-{end:03d}.pdf"
         )

         with open(out_path, "wb") as f:
             writer.write(f)

     print("✅ Done slicing!")
    
  2. Then we'll write a marker_batch_run.bat file that automates the following steps:

    • Loop through the chunks and run marker_single on each PDF
    • Move each generated folder to the output path

    Here's the script marker_batch_run.bat:

     @echo off
     setlocal enabledelayedexpansion

     REM 🔧 Customize these paths
     set "pdf_dir=E:\mydocs\my\sliced"
     set "out_dir=E:\mydocs\my\markdown_output"
     set "marker_out_dir=E:\marker-env\.venv\Lib\site-packages\conversion_results"

     REM 🧙‍♂️ Activate virtual environment
     call E:\marker-env\.venv\Scripts\activate.bat

     REM 📁 Make output folder if missing
     if not exist "!out_dir!" mkdir "!out_dir!"

     echo 🔁 Starting batch conversion...

     for %%F in ("%pdf_dir%\*.pdf") do (
         echo 🧾 Converting: %%~nxF
         marker_single "%%F"

         set "filename=%%~nF"
         set "source_folder=!marker_out_dir!\!filename!"
         set "target_folder=!out_dir!\!filename!"

         if exist "!source_folder!" (
             echo 🚚 Moving output folder: !source_folder! → !target_folder!
             move /Y "!source_folder!" "!target_folder!" >nul
         ) else (
             echo ⚠️  Output not found for %%~nxF - check Marker logs.
         )
     )

     echo ✅ All conversions complete!
     pause
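
If you are not on Windows, or simply prefer Python, the same loop can be sketched as below. This is an alternative to the batch file, not a replacement the original workflow depends on. It assumes the same three folders and that marker_single is on PATH. run_marker and run_batch are hypothetical names, and the converter is passed in as a parameter so the loop can be tried out without Marker installed.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable


def run_marker(pdf: Path) -> None:
    """Default converter: call the marker_single CLI on one file."""
    subprocess.run(["marker_single", str(pdf)], check=False)


def run_batch(
    pdf_dir: Path,
    out_dir: Path,
    marker_out_dir: Path,
    convert: Callable[[Path], None] = run_marker,
) -> list[str]:
    """Convert every chunk, move its result folder, return chunks with no output."""
    out_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        convert(pdf)
        # Marker's result folder is named after the PDF without its extension
        source = marker_out_dir / pdf.stem
        if source.is_dir():
            shutil.move(str(source), str(out_dir / pdf.stem))
        else:
            missing.append(pdf.name)
    return missing
```

A call such as run_batch(Path(r"E:\mydocs\my\sliced"), Path(r"E:\mydocs\my\markdown_output"), Path(r"E:\marker-env\.venv\Lib\site-packages\conversion_results")) mirrors the batch script, and the returned list tells you which chunks need a second look.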
    

Merge the Markdown Files

By now all the PDF chunks should be converted to Markdown.

It's time to merge them and extract all the images so they can be embedded in the complete .md file. Write a Python script like the one below, which will:

  1. Scan your markdown_output/ folder

  2. For each folder (e.g., my_001-010), find the .md file

  3. Append each file's content to one big merged file (e.g., my_001-471.md)

  4. Sort the folders in correct numeric order and add a --- separator before each chunk's content

  5. Copy & rename *.jpeg files like my_001.jpeg, my_002.jpeg, …

  6. Use pprint to print all the image paths

🧐 Attention: The images may take extra time to check manually, because the script does not update the image links inside the Markdown. It gets boring if you have a ton of them to process. 💦

Script for Merge & Extract: extract_merge.py

import os
import glob
import shutil
from pprint import pprint

# 👇 Change this to your actual path
base_dir = r"E:\mydocs\my\markdown_output"
output_md_path = os.path.join(base_dir, "my_001-471.md")
image_output_dir = base_dir  # Output all .jpeg files here

# Get folders like my_001-010
folders = [f for f in os.listdir(base_dir) if os.path.isdir(os.path.join(base_dir, f)) and f.startswith("my_")]

# Sort them by the first number (e.g., 001, 011...)
def folder_sort_key(name):
    parts = name.replace("my_", "").split("-")
    return int(parts[0])

folders.sort(key=folder_sort_key)

# Merge markdowns and extract images
image_counter = 1
with open(output_md_path, "w", encoding="utf-8") as outfile:
    for folder in folders:
        folder_path = os.path.join(base_dir, folder)

        # 🔗 Merge .md file
        md_files = glob.glob(os.path.join(folder_path, "*.md"))
        if md_files:
            with open(md_files[0], "r", encoding="utf-8") as infile:
                outfile.write("\n\n---\n\n")
                outfile.write(infile.read())

        # 🖼 Copy & rename images
        image_files = sorted(glob.glob(os.path.join(folder_path, "*.jpeg")))
        pprint(image_files)
        for img in image_files:
            new_name = f"my_{image_counter:03d}.jpeg"
            new_path = os.path.join(image_output_dir, new_name)
            shutil.copy(img, new_path)
            image_counter += 1

print(f"✅ Merge complete: {output_md_path}")
print(f"✅ Total images extracted: {image_counter - 1}")
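
The script above copies the images under new names but leaves the links in the Markdown untouched. One possible way to cut down the manual checking is sketched below. It assumes the Markdown uses plain inline image links of the form ![alt](file.jpeg) and that you have kept a mapping from each original file name to its new name. The rewrite_image_links helper and the file name in the usage note are made up for illustration.

```python
import re

# Matches inline image links: ![alt](target) where target has no spaces
IMAGE_LINK = re.compile(r"(!\[[^\]]*\]\()([^)\s]+)(\))")


def rewrite_image_links(markdown: str, renamed: dict[str, str]) -> str:
    """Replace image targets found in `renamed`; leave all other links untouched."""

    def swap(match: re.Match) -> str:
        target = match.group(2)
        return match.group(1) + renamed.get(target, target) + match.group(3)

    return IMAGE_LINK.sub(swap, markdown)
```

For example, rewrite_image_links(text, {"picture_1.jpeg": "my_001.jpeg"}) would point that one link at the renamed copy. Since the same original file name could occur in more than one chunk, it is probably safest to apply the mapping to each chunk's Markdown before it is appended to the merged file.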

It's ALL DONE NOW! 🎊

Script Order Summary

  1. slice_pdf.py
  2. marker_batch_run.bat
  3. extract_merge.py

Pandoc - Converting documents between formats like PDF, EPUB, Markdown, and more

Let's get started with Pandoc, a tool for converting documents between formats like PDF, EPUB, Markdown, and more. Here's a step-by-step guide to help you master it:


💻 Step 1: Install Pandoc

🪟 Windows

  1. Download the installer: Visit the Pandoc Downloads page and download the Windows installer.
  2. Run the installer: Double-click the downloaded .msi file and follow the installation prompts.

🍎 macOS

  1. Download the package: Go to the Pandoc Downloads page and download the macOS package.
  2. Install: Open the downloaded .pkg file and follow the installation instructions.

🐧 Linux (Debian/Ubuntu)

Open your terminal and run:

sudo apt update
sudo apt install pandoc

🧪 Step 2: Verify Installation

After installation, confirm that Pandoc is installed correctly:

pandoc --version

You should see the installed version of Pandoc displayed.
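
The same check can be done from a script, which is handy before starting a long conversion job. This is a small sketch using only the standard library; pandoc_version is a made-up helper name.

```python
import shutil
import subprocess
from typing import Optional


def pandoc_version() -> Optional[str]:
    """Return the first line of `pandoc --version`, or None if pandoc is not on PATH."""
    exe = shutil.which("pandoc")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


print(pandoc_version() or "pandoc not found on PATH")
```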


🔄 Step 3: Basic Conversions

Pandoc can convert files between various formats. Here are some common examples:

📄 Convert Markdown to HTML

pandoc input.md -o output.html

📚 Convert Markdown to PDF

Note: Converting to PDF requires a LaTeX installation (e.g., TeX Live, MiKTeX).

pandoc input.md -o output.pdf
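
If the PDF conversion fails, a missing LaTeX engine is a common cause. The sketch below looks for a few widely used engines on PATH and, if one is found, adds Pandoc's --pdf-engine option to the command. The candidate list and both helper names are assumptions made for this example.

```python
import shutil
from typing import Optional

# Commonly installed LaTeX engines; adjust to whatever your system provides
CANDIDATE_ENGINES = ("pdflatex", "xelatex", "lualatex")


def find_pdf_engine(candidates=CANDIDATE_ENGINES) -> Optional[str]:
    """Return the first engine found on PATH, or None if there is none."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None


def pdf_command(src: str, dst: str, engine: Optional[str]) -> list[str]:
    """Build the Markdown-to-PDF command, naming the engine only when one is given."""
    cmd = ["pandoc", src, "-o", dst]
    if engine:
        cmd.append(f"--pdf-engine={engine}")
    return cmd


print(pdf_command("input.md", "output.pdf", find_pdf_engine()))
```

When find_pdf_engine() returns None, installing a LaTeX distribution first is likely the quickest fix.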

📘 Convert Markdown to EPUB

pandoc input.md -o output.epub

📝 Convert Word Document to Markdown

pandoc input.docx -o output.md

📄 Convert PDF to Markdown

Pandoc doesn't support PDF as an input format directly. For converting PDFs to Markdown, consider using tools like Marker or other PDF to Markdown converters.


⚙️ Step 4: Useful Options

  • -s or --standalone: Produce a standalone document (with header and footer).
  • -f or --from: Specify input format (e.g., markdown, html).
  • -t or --to: Specify output format (e.g., html, epub). For PDF, give the output file a .pdf extension instead.
  • -o: Specify output file name.

Example:

pandoc -s -f markdown -t html -o output.html input.md
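
When you call Pandoc from a script, the same options can be assembled programmatically. This is a minimal sketch that covers only the four options listed above; pandoc_command is a hypothetical helper, and running the result is left to subprocess.run.

```python
from typing import Optional


def pandoc_command(
    src: str,
    dst: str,
    from_format: Optional[str] = None,
    to_format: Optional[str] = None,
    standalone: bool = False,
) -> list[str]:
    """Build a pandoc argument list from the options described above."""
    cmd = ["pandoc"]
    if standalone:
        cmd.append("-s")
    if from_format:
        cmd += ["-f", from_format]
    if to_format:
        cmd += ["-t", to_format]
    # Output file first, input file last, matching the example command
    cmd += ["-o", dst, src]
    return cmd


print(" ".join(pandoc_command("input.md", "output.html", "markdown", "html", True)))
```

Leaving from_format and to_format unset lets Pandoc infer the formats from the file extensions, which is what the shorter commands in Step 3 rely on.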

This post is licensed under CC BY 4.0 by the author.