
Python Script | PDF to Word with Accurate Table and Layout Reproduction

2026-06-29 23:00:17

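In everyday office work, academic writing, and document organizing, almost everyone eventually needs to convert a PDF to Word. Most conversion tools on the market share the same weaknesses: scrambled tables, lost formatting, distorted fonts, and misplaced images, so you still end up spending a lot of time fixing the layout by hand.

Today's post shares a high-fidelity Python conversion script aimed squarely at format loss in PDF→Word conversion. It reproduces tables, text, paragraphs, and images in one step, supports batch processing, runs entirely on your own machine so your files stay private, and is simple enough for beginners to use right away.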

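I. Feature Highlights (Well Ahead of Ordinary Converters)

Ordinary online converters either charge a fee or scramble the formatting. This script is completely free and tuned for PDF conversion. The table below summarizes what it does:

Feature	Description
Automatic scanned-PDF detection	Decides whether a PDF is a scan from how little text can be extracted from its first pages, so you don't have to sort files by hand
OCR text recognition	Adds a text layer to scanned PDFs automatically; supports both Chinese and English
Table and layout retention	Uses the pdf2docx engine to reproduce table and paragraph positions
Batch conversion	Processes a whole folder, optionally including subfolders, and converts every PDF in one go
Retry on failure	Retries failed conversions automatically and can record failures in an error log
Single-file / directory modes	Switch freely between the two to suit different scenarios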
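
The automatic detection in the first row comes down to a simple heuristic: sample the first few pages and count how many characters can be extracted. Here is a minimal sketch of that idea. `looks_scanned` is a hypothetical helper name, and the page texts are passed in as plain strings so the example runs without any PDF library (the script itself reads them with PyMuPDF); the 60-character threshold is the script's default.

```python
from typing import Iterable


def looks_scanned(page_texts: Iterable[str], min_text_chars: int = 60) -> bool:
    """Return True when the sampled pages hold almost no extractable text."""
    texts = list(page_texts)
    if not texts:
        # An empty document is not treated as a scan.
        return False
    total_chars = sum(len(text.strip()) for text in texts)
    return total_chars < min_text_chars


# An image-only PDF typically yields empty strings for its pages.
print(looks_scanned(["", "  ", ""]))
# A text PDF yields plenty of characters on its first pages.
print(looks_scanned(["Quarterly report " * 10, "Table 1", ""]))
```

A threshold this low means a scan with a short stamped header can still be classified as a text PDF, which is why a force-OCR switch is useful as a manual override.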

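II. Beginner Tutorial (Works Even on a Brand-New Computer)

If you have never installed Python, don't panic! Follow the 6 steps below. Everything is copy and paste, and each critical step carries a ⚠️ reminder.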

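Step 1: Install Python

Open a browser and go to the Python website: https://www.python.org/downloads/
Click the yellow [Download Python] button (any version from 3.8 up works)
Double-click the downloaded installer to run it
⚠️ Critical step: at the bottom of the installer window, be sure to check [Add Python to PATH] (recent installers label it [Add python.exe to PATH]), then click [Install Now]
Wait for the installation to finish, then close the window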
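
To confirm the installation worked, you can ask Python for its own version. The sketch below is only a convenience check, not part of the conversion script; `python_is_new_enough` is a made-up helper name, and 3.8 is the minimum stated above.

```python
import sys

MIN_VERSION = (3, 8)  # minimum version stated in this tutorial


def python_is_new_enough(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Compare the (major, minor) part of a version against the minimum."""
    return tuple(version_info[:2]) >= minimum


print("Python", sys.version.split()[0])
print("OK" if python_is_new_enough() else "Too old: install 3.8 or newer")
```

If typing `python` in a Command Prompt window reports that the command is not found, the PATH checkbox was most likely left unchecked during installation.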

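Step 2: Install the Dependencies

Press [Win + R], type cmd, and press Enter to open a Command Prompt window. Then paste and run the following commands one at a time:

pip install pdf2docx pymupdf
pip install ocrmypdf

💡 ocrmypdf is optional but strongly recommended: it calls Tesseract to recognize scanned PDFs. Before installing ocrmypdf, you also need to install two extra tools:
Tesseract OCR: download from https://github.com/UB-Mannheim/tesseract/wiki
Ghostscript: download from https://ghostscript.com/releases/gsdnld.html
Because the script's default OCR language is chi_sim+eng, make sure the Simplified Chinese language data is installed along with Tesseract. After installing both tools, restart your computer.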
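
Before moving on, it can help to check that everything from this step is actually visible to Python. The following is a rough sketch rather than part of the script: the helper names are made up, `fitz` is the import name of pymupdf, and the Ghostscript executable names (`gs`, `gswin64c`, `gswin32c`) are the usual ones on Linux/macOS and Windows.

```python
import importlib.util
import shutil

PYTHON_MODULES = ["pdf2docx", "fitz"]  # fitz is the import name of pymupdf
OPTIONAL_COMMANDS = ["ocrmypdf", "tesseract"]
GHOSTSCRIPT_NAMES = ["gs", "gswin64c", "gswin32c"]


def module_available(name: str) -> bool:
    """True if `import name` would find an installed top-level module."""
    return importlib.util.find_spec(name) is not None


def first_command_found(names):
    """Return the first command name that exists in PATH, or None."""
    for name in names:
        if shutil.which(name):
            return name
    return None


for module in PYTHON_MODULES:
    print(f"{module}: {'found' if module_available(module) else 'MISSING'}")
for command in OPTIONAL_COMMANDS:
    status = "found" if shutil.which(command) else "not found (only needed for scanned PDFs)"
    print(f"{command}: {status}")
ghostscript = first_command_found(GHOSTSCRIPT_NAMES)
print(f"ghostscript: {ghostscript or 'not found (only needed for scanned PDFs)'}")
```

Anything reported as MISSING needs the matching `pip install` from above; the optional tools only matter if you plan to convert scanned PDFs.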

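Step 3: Save the Script File

On the desktop, create a new text document (right-click → New → Text Document)
Paste the complete script code (see the end of this article) into it
Click [File] → [Save As] and change the file name to [pdf_to_word.py]
⚠️ Note: set the save type to [All Files] so that the extension is .py (not .txt)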

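Step 4: Configure the Default Folder (Optional, Recommended for Beginners)

Open the [pdf_to_word.py] file you just saved in Notepad and find this line:

FIXED_BATCH_DIR = Path(r"D:\pdf\input")

Change the path inside the quotes to the folder that holds your PDF files. For example, if your PDFs are in a folder called [My PDF Files] on the desktop, change it to:

FIXED_BATCH_DIR = Path(r"C:\Users\YourUserName\Desktop\My PDF Files")

💡 Once this is set, you can double-click the script to batch-convert directly, with no need to type a command each time!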
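
Two things commonly go wrong with this setting: dropping the `r` in front of the quotes (without it, Python treats backslash sequences such as `\U` in `C:\Users` as escapes and refuses to run the file), and pointing at a folder that does not exist. Here is a small sketch for checking the path before converting anything. `count_pdfs` is a hypothetical helper, and the path is only the tutorial's example value.

```python
from pathlib import Path


def count_pdfs(folder: Path) -> int:
    """Count the .pdf files directly inside folder (subfolders are ignored)."""
    if not folder.is_dir():
        raise ValueError(f"Folder does not exist: {folder}")
    return sum(1 for _ in folder.glob("*.pdf"))


# Example value from the tutorial; replace it with your own folder.
FIXED_BATCH_DIR = Path(r"D:\pdf\input")

try:
    print(f"{count_pdfs(FIXED_BATCH_DIR)} PDF file(s) found in {FIXED_BATCH_DIR}")
except ValueError as exc:
    print(exc)
```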

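Step 5: Run the Script (5 Usages)

🔹 Simplest usage (recommended for beginners)

Put your PDF files in the folder you configured, then double-click the pdf_to_word.py script. It converts every PDF in that folder and saves the generated Word files in the same directory. Subfolders are only included when you pass --recursive on the command line.

🔹 Command-line usage (more flexible)

For more control, switch to the folder containing the script in Command Prompt, then pick one of the commands below:

Usage 1: Convert a single file

python pdf_to_word.py -i document.pdf

Usage 2: Convert a single file to a specified output location

python pdf_to_word.py -i document.pdf -o D:\converted\output.docx

Usage 3: Batch-convert an entire folder

python pdf_to_word.py -i D:\MyPDFs

Usage 4: Recursive conversion (including subfolders)

python pdf_to_word.py -i D:\MyPDFs --recursive

Usage 5: Force OCR on a scanned PDF

python pdf_to_word.py -i scan.pdf --force-ocr --ocr-lang chi_sim+eng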
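
The `--ocr-lang` value is a list of Tesseract language codes joined with `+`, and OCR fails if one of them is not installed. The sketch below checks this by parsing the output of `tesseract --list-langs`. It is an illustrative add-on, not part of the script: the helper names are made up, and the parsing assumes the usual output format of one header line followed by one language code per line.

```python
import shutil
import subprocess
from typing import List, Set


def parse_list_langs(output: str) -> Set[str]:
    """Parse the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header such as 'List of available languages ...'.
    return {line for line in lines if not line.lower().startswith("list of")}


def missing_languages(ocr_lang: str, installed: Set[str]) -> List[str]:
    """Return the parts of an 'a+b' language string that are not installed."""
    return [lang for lang in ocr_lang.split("+") if lang not in installed]


if shutil.which("tesseract"):
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True
    )
    # Some versions print the list to stderr, so look at both streams.
    installed = parse_list_langs(result.stdout + "\n" + result.stderr)
    print("Missing:", missing_languages("chi_sim+eng", installed) or "none")
else:
    print("tesseract not found in PATH")
```

If `chi_sim` shows up as missing, install the Simplified Chinese language data for Tesseract, or pass `--ocr-lang eng` for English-only documents.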

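Step 6: Collect the Results

When the script finishes, the Word documents appear in the same folder as the PDFs (unless you specified an output location). Open them to check: tables and layout should be faithfully reproduced, and the documents are ready to edit. If any file failed, the details are recorded in pdf_to_word_failures.csv.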
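
After a large batch, it is quicker to let Python list the PDFs that did not get a Word file than to check by eye. This is a minimal sketch with a made-up helper name; it assumes the default behavior described above, where each .docx is written next to its PDF with the same name.

```python
import tempfile
from pathlib import Path
from typing import List


def pdfs_missing_docx(folder: Path, recursive: bool = False) -> List[Path]:
    """List PDFs in folder that have no same-name .docx beside them."""
    pattern = "**/*.pdf" if recursive else "*.pdf"
    return sorted(
        pdf for pdf in folder.glob(pattern) if not pdf.with_suffix(".docx").exists()
    )


# Demo on a throwaway folder: a.pdf was "converted", b.pdf was not.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    for name in ("a.pdf", "a.docx", "b.pdf"):
        (demo / name).touch()
    print([p.name for p in pdfs_missing_docx(demo)])  # prints ['b.pdf']
```

To use it on real files, call `pdfs_missing_docx(Path(r"D:\pdf\input"))` with your own folder, adding `recursive=True` if you converted subfolders too.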

#!/usr/bin/env python3
"""
PDF to Word converter with better layout/table retention.

Features:
1) Automatically detect scanned PDFs.
2) Run OCR for scanned PDFs (via ocrmypdf) to add a text layer.
3) Convert to .docx using pdf2docx to preserve tables/paragraph layout.
4) Support single-file and batch directory conversion.

Install:
    pip install pdf2docx pymupdf
    # Optional but recommended for scanned PDFs:
    # 1) Install Tesseract OCR on your system.
    # 2) Install Ghostscript on your system.
    # 3) pip install ocrmypdf

Examples:
    python pdf_to_word.py
    python pdf_to_word.py -i input.pdf -o output.docx
    python pdf_to_word.py -i ./pdfs -o ./docx --recursive
    python pdf_to_word.py -i scan.pdf --force-ocr --ocr-lang chi_sim+eng
    python pdf_to_word.py -i ./pdfs -o ./docx --recursive --retries 2 --error-log ./failed.csv
"""

from __future__ import annotations

import argparse
import csv
import shutil
import subprocess
import sys
import tempfile
import time
from datetime import datetime
from pathlib import Path
from typing import Dict, Iterable, List, Optional, Tuple

import fitz  # pymupdf
from pdf2docx import Converter

# ====== Fixed batch folder configuration ======
# Edit this path, then run `python pdf_to_word.py` (or double-click the script)
# to batch-convert. Input and output share one directory: a same-name .docx is
# written next to each PDF. Subfolders are scanned only with --recursive.
FIXED_BATCH_DIR = Path(r"D:\pdf\input")


def find_pdf_files(input_path: Path, recursive: bool) -> List[Path]:
    """Collect the list of PDF files to process.

    - If the input is a file: only .pdf is accepted
    - If the input is a directory: `recursive` decides whether subfolders are scanned
    """
    if input_path.is_file():
        if input_path.suffix.lower() != ".pdf":
            raise ValueError(f"Input file is not a PDF: {input_path}")
        return [input_path]

    if not input_path.is_dir():
        raise ValueError(f"Input path does not exist: {input_path}")

    pattern = "**/*.pdf" if recursive else "*.pdf"
    return sorted(input_path.glob(pattern))


def is_scanned_pdf(pdf_path: Path, sample_pages: int = 3, min_text_chars: int = 60) -> bool:
    """
    Heuristic:
    - If first N pages have very little extractable text, treat as scanned PDF.
    """
    doc = fitz.open(str(pdf_path))
    try:
        pages = min(sample_pages, doc.page_count)
        if pages <= 0:
            return False

        total_chars = 0
        for i in range(pages):
            total_chars += len(doc.load_page(i).get_text("text").strip())
        return total_chars < min_text_chars
    finally:
        doc.close()


def require_command(name: str) -> None:
    """Check that an external command is available in PATH."""
    if shutil.which(name) is None:
        raise RuntimeError(
            f"Required command not found: {name}. "
            f"Please install it and ensure it is in PATH."
        )


def run_ocr(input_pdf: Path, output_pdf: Path, ocr_lang: str, force: bool = False) -> None:
    """Call ocrmypdf to add a text layer to a scanned PDF for later layout/table recognition."""
    require_command("ocrmypdf")
    # ocrmypdf accepts only one of --force-ocr, --skip-text and --redo-ocr,
    # so choose a single mode instead of combining them.
    mode_flag = "--force-ocr" if force else "--skip-text"
    cmd = [
        "ocrmypdf",
        mode_flag,
        "-l",
        ocr_lang,
        str(input_pdf),
        str(output_pdf),
    ]
    print(f"[OCR] {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        msg = result.stderr.strip() or result.stdout.strip()
        raise RuntimeError(f"OCR failed for {input_pdf}:\n{msg}")


def convert_pdf_to_docx(input_pdf: Path, output_docx: Path, start: int = 0, end: Optional[int] = None) -> None:
    """Run the core PDF-to-DOCX conversion."""
    output_docx.parent.mkdir(parents=True, exist_ok=True)
    cv = Converter(str(input_pdf))
    try:
        cv.convert(str(output_docx), start=start, end=end)
    finally:
        cv.close()


def build_output_path(src_pdf: Path, input_root: Path, output_root: Path) -> Path:
    """Build the .docx path from the input root and the output root."""
    if input_root.is_file():
        return output_root
    rel = src_pdf.relative_to(input_root)
    return output_root / rel.with_suffix(".docx")


def convert_one(
    pdf_file: Path,
    output_docx: Path,
    force_ocr: bool,
    ocr_lang: str,
    start_page: int,
    end_page: Optional[int],
) -> None:
    """Convert a single PDF.

    Pipeline:
    1) Detect whether the PDF is a scan
    2) Run OCR when needed
    3) Run the pdf2docx conversion
    """
    scanned = is_scanned_pdf(pdf_file)
    need_ocr = force_ocr or scanned

    print(f"\n[INFO] Processing: {pdf_file}")
    print(f"[INFO] Scanned detection: {'YES' if scanned else 'NO'}")
    print(f"[INFO] OCR step: {'ENABLED' if need_ocr else 'SKIPPED'}")

    with tempfile.TemporaryDirectory(prefix="pdf2docx_") as tmp_dir:
        source_pdf = pdf_file
        if need_ocr:
            ocr_pdf = Path(tmp_dir) / f"{pdf_file.stem}.ocr.pdf"
            run_ocr(pdf_file, ocr_pdf, ocr_lang=ocr_lang, force=force_ocr)
            source_pdf = ocr_pdf

        convert_pdf_to_docx(
            input_pdf=source_pdf,
            output_docx=output_docx,
            start=max(0, start_page),
            end=end_page,
        )

    print(f"[OK] Output: {output_docx}")


def convert_with_retry(
    pdf_file: Path,
    output_docx: Path,
    force_ocr: bool,
    ocr_lang: str,
    start_page: int,
    end_page: Optional[int],
    retries: int,
    retry_delay: float,
) -> Tuple[bool, int, Optional[str]]:
    """Convert a single file with retries.

    Returns:
    - whether the conversion succeeded
    - the number of attempts actually made
    - the error message (None on success)
    """
    # `retries` is the number of retries after a failure; total attempts = retries + 1
    max_attempts = max(1, retries + 1)
    last_error: Optional[str] = None

    for attempt in range(1, max_attempts + 1):
        try:
            if attempt > 1:
                print(f"[RETRY] {pdf_file.name} attempt {attempt}/{max_attempts}")
            convert_one(
                pdf_file=pdf_file,
                output_docx=output_docx,
                force_ocr=force_ocr,
                ocr_lang=ocr_lang,
                start_page=start_page,
                end_page=end_page,
            )
            return True, attempt, None
        except Exception as exc:  # noqa: BLE001
            last_error = str(exc)
            # Wait before retrying to ride out transient environment hiccups
            if attempt < max_attempts and retry_delay > 0:
                time.sleep(retry_delay)

    return False, max_attempts, last_error


def write_failure_csv(failures: List[Dict[str, str]], error_log_path: Path) -> None:
    """Write failed tasks to a CSV for later review and reprocessing."""
    error_log_path.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = [
        "timestamp",
        "input_pdf",
        "output_docx",
        "attempts",
        "error",
    ]
    with error_log_path.open("w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(failures)


def parse_args(argv: Optional[Iterable[str]] = None) -> argparse.Namespace:
    """Parse command-line arguments."""
    parser = argparse.ArgumentParser(
        description="Convert PDF to Word with OCR support for scanned PDFs."
    )
    parser.add_argument(
        "-i",
        "--input",
        required=False,
        default=None,
        help=(
            "Input PDF file or directory. "
            "If omitted, the script uses FIXED_BATCH_DIR."
        ),
    )
    parser.add_argument(
        "-o",
        "--output",
        required=False,
        help=(
            "Output .docx file (single input) or output directory (batch input). "
            "Default: same folder as input."
        ),
    )
    parser.add_argument(
        "--recursive",
        action="store_true",
        help="Recursively scan input directory for PDFs.",
    )
    parser.add_argument(
        "--force-ocr",
        action="store_true",
        help="Force OCR even if PDF already has extractable text.",
    )
    parser.add_argument(
        "--ocr-lang",
        default="chi_sim+eng",
        help="OCR language for ocrmypdf (default: chi_sim+eng).",
    )
    parser.add_argument(
        "--start-page",
        type=int,
        default=0,
        help="Start page index for conversion (0-based, default: 0).",
    )
    parser.add_argument(
        "--end-page",
        type=int,
        default=None,
        help="End page index for conversion (0-based, exclusive).",
    )
    parser.add_argument(
        "--retries",
        type=int,
        default=2,
        help="Retry count after a failure (default: 2).",
    )
    parser.add_argument(
        "--retry-delay",
        type=float,
        default=1.0,
        help="Delay seconds between retries (default: 1.0).",
    )
    parser.add_argument(
        "--error-log",
        default=None,
        help=(
            "CSV path for failed tasks. "
            "Default: single file -> same dir 'pdf_to_word_failures.csv'; "
            "batch -> output dir 'pdf_to_word_failures.csv'."
        ),
    )
    return parser.parse_args(argv)


def main(argv: Optional[Iterable[str]] = None) -> int:
    """Entry point: supports single-file mode and batch directory mode."""
    args = parse_args(argv)

    use_fixed_folder_mode = args.input is None
    input_path = (
        FIXED_BATCH_DIR.expanduser().resolve()
        if use_fixed_folder_mode
        else Path(args.input).expanduser().resolve()
    )

    try:
        pdf_files = find_pdf_files(input_path, recursive=args.recursive)
    except ValueError as exc:
        print(f"[ERROR] {exc}")
        return 2

    if not pdf_files:
        print(f"[WARN] No PDF files found in: {input_path}")
        return 0

    if input_path.is_file():
        # Single-file mode: the output must be a .docx (or defaults to the same name)
        if args.output:
            out_path = Path(args.output).expanduser().resolve()
            if out_path.suffix.lower() != ".docx":
                print("[ERROR] For single input file, output must end with .docx")
                return 2
            output_path = out_path
        else:
            output_path = input_path.with_suffix(".docx")

        ok, attempts, err = convert_with_retry(
            pdf_file=pdf_files[0],
            output_docx=output_path,
            force_ocr=args.force_ocr,
            ocr_lang=args.ocr_lang,
            start_page=args.start_page,
            end_page=args.end_page,
            retries=args.retries,
            retry_delay=args.retry_delay,
        )
        if ok:
            return 0

        # A single-file failure is also written to the CSV to keep one log format
        error_log_path = (
            Path(args.error_log).expanduser().resolve()
            if args.error_log
            else output_path.parent / "pdf_to_word_failures.csv"
        )
        failures = [
            {
                "timestamp": datetime.now().isoformat(timespec="seconds"),
                "input_pdf": str(pdf_files[0]),
                "output_docx": str(output_path),
                "attempts": str(attempts),
                "error": err or "Unknown error",
            }
        ]
        write_failure_csv(failures, error_log_path)
        print(f"[ERROR] {pdf_files[0]}: {err}")
        print(f"[LOG] Failure CSV: {error_log_path}")
        return 1

    output_root = (
        Path(args.output).expanduser().resolve()
        if args.output
        else input_path
    )
    output_root.mkdir(parents=True, exist_ok=True)

    # Batch mode: collect failed tasks and write the CSV once at the end
    failed = 0
    failure_records: List[Dict[str, str]] = []
    for pdf_file in pdf_files:
        output_docx = build_output_path(
            src_pdf=pdf_file,
            input_root=input_path,
            output_root=output_root,
        )
        ok, attempts, err = convert_with_retry(
            pdf_file=pdf_file,
            output_docx=output_docx,
            force_ocr=args.force_ocr,
            ocr_lang=args.ocr_lang,
            start_page=args.start_page,
            end_page=args.end_page,
            retries=args.retries,
            retry_delay=args.retry_delay,
        )
        if not ok:
            failed += 1
            print(f"[ERROR] {pdf_file}: {err}")
            failure_records.append(
                {
                    "timestamp": datetime.now().isoformat(timespec="seconds"),
                    "input_pdf": str(pdf_file),
                    "output_docx": str(output_docx),
                    "attempts": str(attempts),
                    "error": err or "Unknown error",
                }
            )

    total = len(pdf_files)
    success = total - failed
    print(f"\n[DONE] Total: {total}, Success: {success}, Failed: {failed}")

    if failure_records:
        # Prefer the user-specified path; otherwise write to the output directory
        error_log_path = (
            Path(args.error_log).expanduser().resolve()
            if args.error_log
            else output_root / "pdf_to_word_failures.csv"
        )
        write_failure_csv(failure_records, error_log_path)
        print(f"[LOG] Failure CSV: {error_log_path}")

    return 1 if failed else 0


if __name__ == "__main__":
    sys.exit(main())
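
When a batch leaves some files unconverted, the failure CSV written by `write_failure_csv` tells you which ones to look at. Here is a small sketch for pulling the failed inputs back out. `failed_inputs` is a hypothetical helper, and the sample row is made up purely to show the column layout.

```python
import csv
import io
from typing import List, TextIO


def failed_inputs(stream: TextIO) -> List[str]:
    """Return the input_pdf column of a failure CSV, in file order."""
    return [row["input_pdf"] for row in csv.DictReader(stream)]


# Demo with an in-memory CSV; the column names match write_failure_csv.
sample = io.StringIO(
    "timestamp,input_pdf,output_docx,attempts,error\n"
    "2024-01-01T12:00:00,a.pdf,a.docx,3,OCR failed\n"
)
print(failed_inputs(sample))  # prints ['a.pdf']

# For a real log, open it with the same encoding the script writes:
# with open("pdf_to_word_failures.csv", newline="", encoding="utf-8-sig") as f:
#     print(failed_inputs(f))
```

The `utf-8-sig` encoding matters here: the script writes a byte-order mark so the file opens cleanly in Excel, and reading with plain `utf-8` would leave that mark glued to the first column name.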

