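For the tutorial video and a demonstration of what the code produces, see: https://www.bilibili.com/video/BV1Ab6mBgEM5/
If you need the original Jupyter notebook or the sample images and documents used here, feel free to contact me.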
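Working with Word documents in Python is a common and practical task. This article shows how to use several mainstream Python libraries to create, modify, and process Word documents, covering the full workflow from basic operations to advanced features.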
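Before starting, install the following Python libraries: python-docx, docxtpl, docxcompose, and lxml.
Install them with pip:

    pip install python-docx docxtpl docxcompose lxml

Or with uv:

    uv add python-docx docxtpl docxcompose lxml

Creating a document with python-docx is straightforward: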

    from docx import Document

    doc = Document()
    doc.add_paragraph("Python-docx是一个用于创建")
    doc.save("文件1.docx")

The default font handles Chinese text poorly, so the East Asian font has to be set separately:

    from docx.oxml.ns import qn

    def set_chinese_font(run, zh_font_name="宋体", en_font_name="Times New Roman"):
        # Font used for Latin text
        run.font.name = en_font_name
        # Font used for East Asian (Chinese) text
        run._element.rPr.rFonts.set(qn("w:eastAsia"), zh_font_name)

    doc = Document()
    paragraph = doc.add_paragraph()
    run = paragraph.add_run('这是一段设置了中文字体的文本。')
    set_chinese_font(run)
    doc.save("文件1.docx")

Note: the file must not be open in another program when you save it; otherwise saving raises PermissionError.
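The PermissionError mentioned above can be handled instead of letting a long script crash. The following is only a sketch: the helper name `save_with_fallback` and the "name (1).docx" naming scheme are assumptions made for illustration, not part of python-docx. It works with any object that has a `save(path)` method, such as a python-docx `Document`.

```python
from pathlib import Path


def save_with_fallback(doc, path, max_tries=5):
    """Save `doc` to `path`; if the file is locked, try 'name (1).docx', 'name (2).docx', ...

    `doc` is any object with a save(path) method (for example a python-docx Document).
    Returns the Path that was actually written.
    """
    target = Path(path)
    for attempt in range(max_tries + 1):
        if attempt == 0:
            candidate = target
        else:
            # Hypothetical naming scheme: append a counter before the extension
            candidate = target.with_name(f"{target.stem} ({attempt}){target.suffix}")
        try:
            doc.save(str(candidate))
            return candidate
        except PermissionError:
            # The file is probably open elsewhere; try the next name
            continue
    raise PermissionError(f"could not save {target} or any fallback name")
```

With a real document, `save_with_fallback(doc, "文件1.docx")` would write to "文件1 (1).docx" if "文件1.docx" is currently open in Word.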
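To work with an existing document, open it by path:

    doc = Document('example.docx')

Reading its paragraphs and tables:

    # Iterate over the first three paragraphs
    for para in doc.paragraphs[:3]:
        print(para)       # the Paragraph object itself
        print(para.text)  # the paragraph's text
        print()

    # Iterate over every cell of every table
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                print(cell.text)

Adding headings (the second argument is the heading level):

    doc.add_heading("1.1 Transformer整体工作流程", 2)
    doc.add_heading("Transformer整体架构", 3)

Note: the document must contain the matching heading styles; otherwise an error is raised.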
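Adding a body paragraph:

    text = """Transformer 模型由编码器(Encoder)和解码器(Decoder)组成。..."""
    paragraph1 = doc.add_paragraph(text)

First-line indent of 2 characters:

    paragraph_format = paragraph1.paragraph_format
    # Setting the indent to 0 creates the w:ind element that is edited on the next line
    paragraph_format.first_line_indent = 0
    # w:firstLineChars is measured in hundredths of a character, so 200 means 2 characters
    paragraph_format.element.pPr.ind.set(qn("w:firstLineChars"), '200')

First-line indent of a fixed distance: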

    from docx.shared import Pt

    paragraph1.paragraph_format.first_line_indent = Pt(10)

Paragraph alignment:

    from docx.enum.text import WD_ALIGN_PARAGRAPH

    paragraph1.alignment = WD_ALIGN_PARAGRAPH.LEFT

Deleting a paragraph:

    p = paragraph1._element
    p.getparent().remove(p)

Splitting the text on newlines into separate paragraphs, each with a 2-character first-line indent:

    for one_paragraph_text in text.split("\n"):
        temp_paragraph = doc.add_paragraph(one_paragraph_text)
        paragraph_format = temp_paragraph.paragraph_format
        paragraph_format.first_line_indent = 0
        paragraph_format.element.pPr.ind.set(qn("w:firstLineChars"), "200")

Line spacing, paragraph spacing, and indents:

    from docx.shared import Pt

    para_format = temp_paragraph.paragraph_format
    para_format.line_spacing = Pt(18)   # line spacing (fixed value)
    para_format.space_before = Pt(3)    # space before the paragraph
    para_format.space_after = Pt(0)     # space after the paragraph
    para_format.right_indent = Pt(20)   # right indent
    para_format.left_indent = Pt(0)     # left indent

Character formatting on runs:

    from docx.shared import RGBColor, Pt

    # Bold text
    temp_paragraph.add_run('加粗文本').bold = True

    # Red, 14 pt, bold, italic, underlined text
    run = temp_paragraph.add_run('红色斜体文本')
    run.font.color.rgb = RGBColor(255, 0, 0)  # red
    run.font.size = Pt(14)                    # 14 pt
    run.bold = True                           # bold
    run.italic = True                         # italic
    run.underline = True                      # underline

    # Subscript and superscript
    run2 = temp_paragraph.add_run("1")
    run2.font.subscript = True    # subscript
    run3 = temp_paragraph.add_run("2")
    run3.font.superscript = True  # superscript

Creating a table and writing to its cells:

    table = doc.add_table(rows=4, cols=5)
    # Apply a predefined style; the style must exist in the document, otherwise a KeyError is raised
    table.style = "Grid Table 1 Light"

    # Option 1: address a cell directly
    cell = table.cell(0, 1)
    cell.text = "parrot, possibly dead"

    # Option 2: get the cells through a row
    row = table.rows[1]
    cells = row.cells
    cells[0].text = "Foo bar to you."
    cells[1].text = "And a hearty foo bar to you too sir!"

Listing the table styles available in the document:

    from docx.enum.style import WD_STYLE_TYPE

    styles = doc.styles
    for s in styles:
        if s.type == WD_STYLE_TYPE.TABLE:
            print(s.name)

Adding and deleting rows:

    # Add a row
    row = table.add_row()

    # Delete a row
    def remove_row(table, row):
        tbl = table._tbl
        tr = row._tr
        tbl.remove(tr)

    row = table.rows[-1]
    remove_row(table, row)

Filling in data:

    # Option 1: add the rows one at a time
    items = (
        (7, "1024", "Plush kittens"),
        (3, "2042", "Furbees"),
        (1, "1288", "French Poodle Collars, Deluxe"),
    )
    for item in items:
        cells = table.add_row().cells
        cells[0].text = str(item[0])
        cells[1].text = item[1]
        cells[2].text = item[2]

    # Option 2: fill every cell in bulk
    for row in table.rows:
        for cell in row.cells:
            cell.text = "数据单元"

Merging cells and adjusting the layout:

    table.cell(0, 0).merge(table.cell(1, 1))  # merge across rows and columns

    # Let the table width adjust automatically
    table.autofit = True

    # Set a row height
    from docx.shared import Cm
    table.rows[0].height = Cm(0.93)

    # Change the table font size (this modifies the table's style)
    table.style.font.size = Pt(15)

    # Set cell alignment
    from docx.enum.table import WD_ALIGN_VERTICAL
    from docx.enum.text import WD_ALIGN_PARAGRAPH

    cell = table.cell(0, 0)
    cell.paragraphs[0].paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
    cell.vertical_alignment = WD_ALIGN_VERTICAL.CENTER

Copying a table:

    from copy import deepcopy

    # Copy the underlying XML element and insert it after a new empty paragraph
    table_copy = deepcopy(doc.tables[0]._tbl)
    para1 = doc.add_paragraph()
    para1._p.addnext(table_copy)

Inserting pictures:

    from io import BytesIO
    import base64
    from docx.shared import Inches

    # Plain insertion
    doc.add_picture('图片1.png')
    doc.add_picture('图片2.png', width=Inches(2.5), height=Inches(2))

    # Insertion from base64 data
    with open("图片2base64.txt") as f:
        picture2_base64 = f.read()
    img2_buf = base64.b64decode(picture2_base64)
    doc.add_picture(BytesIO(img2_buf))

    # Two pictures side by side (both in the same run)
    run = doc.add_paragraph().add_run()
    run.add_picture("图片1.png", width=Inches(2.5), height=Inches(2))
    run.add_picture("图片1.png", width=Inches(2.5), height=Inches(2))

    doc.add_page_break()

Styles:

    # Modify an existing style
    doc.styles["Normal"].font.size = Pt(14)
    doc.styles['Normal'].font.name = 'Arial'
    doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), '楷体')

    # Create a custom paragraph style
    from docx.enum.style import WD_STYLE_TYPE

    UserStyle1 = doc.styles.add_style('UserStyle1', WD_STYLE_TYPE.PARAGRAPH)
    UserStyle1.font.size = Pt(40)
    UserStyle1.font.color.rgb = RGBColor(0xff, 0xde, 0x00)
    UserStyle1.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
    UserStyle1.font.name = '微软雅黑'
    UserStyle1._element.rPr.rFonts.set(qn('w:eastAsia'), '微软雅黑')

    # Use the custom style
    doc.add_paragraph('自定义段落样式', style=UserStyle1)

docxtpl turns a Word document into a template so that data can be filled in automatically.
First create a Word template that contains placeholders; each placeholder is wrapped in double curly braces {{ }}.
Then prepare the data and render the template:

    from docxtpl import DocxTemplate, InlineImage, RichText

    tpl = DocxTemplate("docxexample.docx")
    text = """Transformer 模型由编码器(Encoder)和解码器(Decoder)组成..."""
    picture1 = InlineImage(tpl, image_descriptor="图片1.png")
    # Assumption: picture2 is used in the context below but was never defined
    # in the original snippet; the second sample image is used here.
    picture2 = InlineImage(tpl, image_descriptor="图片2.png")

    # Prepare the data
    paragraphs1 = [
        "步骤1:输入表示(Input Representation)",
        "步骤2:编码器处理(Encoder Processing)",
        "步骤3:解码器处理(Decoder Processing)",
    ]
    paragraphs2 = [
        {"step": 1, "text": "输入向量(词嵌入+位置编码)进入编码器层。"},
        {"step": 2, "text": "自注意力子层。"},
        {"step": 3, "text": "前馈网络子层。"},
    ]
    table_data = [
        {"character": "并行计算", "description": "编码器可并行处理整个序列(与RNN不同)"},
        {"character": "自注意力", "description": "每个词直接关联所有词,捕获长距离依赖"},
        {"character": "位置编码", "description": "为无顺序的注意力机制注入位置信息"},
    ]
    alerts = [
        {
            "date": "2015-03-10",
            "desc": RichText("Very critical alert", color="FF0000", bold=True),
            "type": "CRITICAL",
            "bg": "FF0000",
        },
        # ... other entries
    ]

    # Render the template
    context = {
        "title": "Transformer",
        "text_body": text,
        "picture1": picture1,
        "picture2": picture2,
        "paragraphs1": paragraphs1,
        "paragraphs2": paragraphs2,
        "runs": paragraphs1,
        "display_paragraph": True,
        "table1": table_data,
        "table2": table_data,
        "alerts": alerts,
    }
    tpl.render(context)
    tpl.save("文件3.docx")

python-docx has no public API for cell borders, so they are set by editing the underlying XML:

    from docx.oxml import OxmlElement
    from docx.oxml.ns import qn

    def set_cell_border(cell, **kwargs):
        tc = cell._tc
        tcPr = tc.get_or_add_tcPr()

        # Reuse the existing w:tcBorders element or create one
        tcBorders = tcPr.first_child_found_in("w:tcBorders")
        if tcBorders is None:
            tcBorders = OxmlElement("w:tcBorders")
            tcPr.append(tcBorders)

        for edge in ("left", "top", "right", "bottom", "insideH", "insideV"):
            edge_data = kwargs.get(edge)
            if edge_data:
                tag = "w:{}".format(edge)
                element = tcBorders.find(qn(tag))
                if element is None:
                    element = OxmlElement(tag)
                    tcBorders.append(element)
                for key in ["sz", "val", "color", "space", "shadow"]:
                    if key in edge_data:
                        element.set(qn("w:{}".format(key)), str(edge_data[key]))

    # Usage example; `table` is a python-docx table, e.g. one returned by doc.add_table()
    set_cell_border(
        table.cell(0, 0),
        top={"sz": 4, "val": "single", "color": "000000", "space": "0"},
        bottom={"sz": 4, "val": "single", "color": "000000", "space": "0"},
        left={"sz": 4, "val": "single", "color": "000000", "space": "0"},
        right={"sz": 4, "val": "single", "color": "000000", "space": "0"},
    )

Adding a hyperlink:

    def add_hyperlink(paragraph, url, text):
        # Register the URL as an external relationship of the document part
        part = paragraph.part
        r_id = part.relate_to(
            url,
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink",
            is_external=True,
        )
        hyperlink = OxmlElement("w:hyperlink")
        hyperlink.set(qn("r:id"), r_id)
        run = OxmlElement("w:r")
        run_text = OxmlElement("w:t")
        run_text.text = text
        run.append(run_text)
        hyperlink.append(run)
        paragraph._p.append(hyperlink)

    p = doc.add_paragraph("点击访问: ")
    add_hyperlink(p, "https://www.baidu.com", "示例链接")

Extracting the images from a document (a .docx file is a zip archive):

    import os
    import zipfile
    from xml.etree.ElementTree import fromstring

    from docx.opc.constants import RELATIONSHIP_TYPE as RT

    def extract_images(docx_path, output_dir):
        os.makedirs(output_dir, exist_ok=True)
        with zipfile.ZipFile(docx_path) as z:
            try:
                doc_rels = z.read('word/_rels/document.xml.rels').decode('utf-8')
            except KeyError:
                return []

            # Collect (relationship id, target path) for every image relationship
            root = fromstring(doc_rels)
            rels = []
            for child in root:
                if 'Type' in child.attrib and child.attrib['Type'] == RT.IMAGE:
                    rels.append((child.attrib['Id'], child.attrib['Target']))

            images = []
            for rel_id, target in rels:
                try:
                    image_data = z.read('word/' + target)
                    image_name = target.split('/')[-1]
                    with open(f"{output_dir}/{image_name}", 'wb') as f:
                        f.write(image_data)
                    images.append(image_name)
                except KeyError:
                    continue
            return images

    print(extract_images("Transformer原理纯享版.docx", "pictures"))

Inserting a floating picture placed "behind text":

    from docx.oxml import parse_xml, register_element_cls
    from docx.oxml.ns import nsdecls
    from docx.oxml.shape import CT_Picture
    from docx.oxml.xmlchemy import BaseOxmlElement, OneAndOnlyOne

    # Changing behindDoc="1" to "0" below places the picture "in front of text" instead
    # refer to docx.oxml.shape.CT_Inline
    class CT_Anchor(BaseOxmlElement):
        """
        ``<wp:anchor>`` element, container for a floating image.
        """
        extent = OneAndOnlyOne('wp:extent')
        docPr = OneAndOnlyOne('wp:docPr')
        graphic = OneAndOnlyOne('a:graphic')

        @classmethod
        def new(cls, cx, cy, shape_id, pic, pos_x, pos_y):
            """
            Return a new ``<wp:anchor>`` element populated with the values
            passed as parameters.
            """
            anchor = parse_xml(cls._anchor_xml(pos_x, pos_y))
            anchor.extent.cx = cx
            anchor.extent.cy = cy
            anchor.docPr.id = shape_id
            anchor.docPr.name = 'Picture %d' % shape_id
            anchor.graphic.graphicData.uri = (
                'http://schemas.openxmlformats.org/drawingml/2006/picture'
            )
            anchor.graphic.graphicData._insert_pic(pic)
            return anchor

        @classmethod
        def new_pic_anchor(cls, shape_id, rId, filename, cx, cy, pos_x, pos_y):
            """
            Return a new `wp:anchor` element containing the `pic:pic` element
            specified by the argument values.
            """
            pic_id = 0  # Word doesn't seem to use this, but does not omit it
            pic = CT_Picture.new(pic_id, filename, rId, cx, cy)
            # cls.new() already inserts `pic` into the anchor
            return cls.new(cx, cy, shape_id, pic, pos_x, pos_y)

        @classmethod
        def _anchor_xml(cls, pos_x, pos_y):
            return (
                '<wp:anchor distT="0" distB="0" distL="0" distR="0" simplePos="0" relativeHeight="0" \n'
                '           behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" \n'
                '           %s>\n'
                '  <wp:simplePos x="0" y="0"/>\n'
                '  <wp:positionH relativeFrom="page">\n'
                '    <wp:posOffset>%d</wp:posOffset>\n'
                '  </wp:positionH>\n'
                '  <wp:positionV relativeFrom="page">\n'
                '    <wp:posOffset>%d</wp:posOffset>\n'
                '  </wp:positionV>\n'
                '  <wp:extent cx="914400" cy="914400"/>\n'
                '  <wp:wrapNone/>\n'
                '  <wp:docPr id="666" name="unnamed"/>\n'
                '  <wp:cNvGraphicFramePr>\n'
                '    <a:graphicFrameLocks noChangeAspect="1"/>\n'
                '  </wp:cNvGraphicFramePr>\n'
                '  <a:graphic>\n'
                '    <a:graphicData uri="URI not set"/>\n'
                '  </a:graphic>\n'
                '</wp:anchor>' % (
                    nsdecls('wp', 'a', 'pic', 'r'),
                    int(pos_x),
                    int(pos_y)
                )
            )

    # refer to docx.parts.story.BaseStoryPart.new_pic_inline
    def new_pic_anchor(part, image_descriptor, width, height, pos_x, pos_y):
        """Return a newly-created `wp:anchor` element.

        The element contains the image specified by *image_descriptor* and is
        scaled based on the values of *width* and *height*.
        """
        rId, image = part.get_or_add_image(image_descriptor)
        cx, cy = image.scaled_dimensions(width, height)
        shape_id, filename = part.next_id, image.filename
        return CT_Anchor.new_pic_anchor(shape_id, rId, filename, cx, cy, pos_x, pos_y)

    # refer to docx.text.run.add_picture
    def add_float_picture(p, image_path_or_stream, width=None, height=None, pos_x=0, pos_y=0):
        """Add a floating picture at the fixed position `pos_x`, `pos_y`,
        measured from the top-left corner of the page.
        """
        run = p.add_run()
        anchor = new_pic_anchor(run.part, image_path_or_stream, width, height, pos_x, pos_y)
        run._r.add_drawing(anchor)

    # refer to docx.oxml.__init__.py
    register_element_cls('wp:anchor', CT_Anchor)

    document = Document()

    # Add a floating picture
    p = document.add_paragraph()
    add_float_picture(p, '图片1.png')

    # Add text
    p.add_run('Hello World ' * 50)

    document.save('文件2.docx')

Splitting a section into two columns:

    # https://www.cnblogs.com/dancesir/p/17788854.html
    section = doc.sections[0]
    sectPr = section._sectPr
    cols = sectPr.xpath('./w:cols')[0]
    cols.set(qn('w:num'), '2')

Headers:

    # Ordinary header
    doc = Document('Transformer原理纯享版.docx')
    doc.sections[0].header.paragraphs[0].text = "这是第1节页眉"

    # Different headers for odd and even pages
    doc.settings.odd_and_even_pages_header_footer = True
    doc.sections[0].even_page_header.paragraphs[0].text = "这是偶数页页眉"
    doc.sections[0].header.paragraphs[0].text = "这是奇数页页眉"

    # Different header for the first page
    doc.sections[0].different_first_page_header_footer = True
    doc.sections[0].first_page_header.paragraphs[0].text = "这是首页页眉"

Table of contents:

    # Insert a table of contents (the field is not updated automatically)
    paragraph = doc.paragraphs[0].insert_paragraph_before()
    run = paragraph.add_run()
    fldChar = OxmlElement('w:fldChar')
    fldChar.set(qn('w:fldCharType'), 'begin')
    instrText = OxmlElement('w:instrText')
    instrText.set(qn('xml:space'), 'preserve')
    instrText.text = r'TOC \o "1-3" \h \z \u'
    fldChar2 = OxmlElement('w:fldChar')
    fldChar2.set(qn('w:fldCharType'), 'separate')
    fldChar3 = OxmlElement('w:t')
    fldChar3.text = "Right-click to update field."
    fldChar2.append(fldChar3)
    fldChar4 = OxmlElement('w:fldChar')
    fldChar4.set(qn('w:fldCharType'), 'end')

    r_element = run._r
    r_element.append(fldChar)
    r_element.append(instrText)
    r_element.append(fldChar2)
    r_element.append(fldChar4)

    # Ask Word to update the fields (including the TOC) when the document is opened
    from lxml import etree

    name_space = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    update_name_space = "%supdateFields" % name_space
    val_name_space = "%sval" % name_space
    element_update_field_obj = etree.SubElement(doc.settings.element, update_name_space)
    element_update_field_obj.set(val_name_space, "true")

Merging documents with docxcompose:

    from docxcompose.composer import Composer

    master = Document("文件1.docx")
    composer = Composer(master)
    doc1 = Document("文件2.docx")
    composer.append(doc1)
    doc2 = Document("文件3.docx")
    composer.append(doc2)
    composer.save("combined.docx")

Note: when documents are merged, the later documents follow the formatting of the first document.
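This article covered the full workflow for handling Word documents in Python, including: creating and opening documents; formatting paragraphs, runs, and tables; inserting pictures and defining styles; rendering templates with docxtpl; lower-level XML techniques (cell borders, hyperlinks, image extraction, floating pictures, columns, headers, and tables of contents); and merging documents with docxcompose.
These techniques apply to automated report generation, batch document processing, contract template filling, and similar scenarios, where they can replace a large amount of manual editing.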
I have now gone over the whole tutorial in broad strokes, so from here on only this article will be kept up to date:
[1] 深入解析Python-docx库:轻松玩转Word文档自动化: https://blog.csdn.net/PolarisRisingWar/article/details/147332412
[2] docxtpl / DocxTemplate:用Python 3渲染docx文档模版: https://blog.csdn.net/PolarisRisingWar/article/details/148251794
[3] 利用python-docx批量处理Word文件——表格(二)样式控制: https://www.cnblogs.com/xtfge/p/9949054.html
[4] 使用python-docx解析word文档,需要提取完整的目录层级、和每个标题下的内容,以及图片 - CSDN文库: https://wenku.csdn.net/answer/4fp7ee9rx6
[5] python-docx 处理导出word有段前距离段后距离的问题 - CSDN博客: https://blog.csdn.net/rufengzizai521/article/details/89372113
[6] 【python-docx】文本操作(段落、run、标题、首行缩进、段前段后、多倍行距、对齐方式) - CSDN博客: https://blog.csdn.net/qq_39147299/article/details/125179590
[7] python-docx样式 - CSDN博客: https://blog.csdn.net/2201_75415299/article/details/153977729
[8] Python读写word文档(.docx) python-docx的使用 - CSDN博客: https://blog.csdn.net/hfy1237/article/details/143891588
[9] Python-docx库-常用操作篇 - CSDN博客: https://blog.csdn.net/cxyxx12/article/details/133386785
[10] Python中的文档处理神器:深度解析python-docx库 - CSDN博客: https://blog.csdn.net/bagell/article/details/134827150
[11] 【笔记】Python-docx写文档时逐字符设置字体与上下标 - CSDN博客: https://blog.csdn.net/qq_41035145/article/details/138685183
[12] python table 怎么设置字号 python设置word表格字体 - 51CTO博客: https://blog.51cto.com/u_13259/9855268
[13] 关于python docx包中,如何对Word自身表格实现复制,并且粘贴到原docx文档中?(已解决) | Python 技术论坛: https://learnku.com/python/t/52624
[14] ms word - In python-docx how do I delete a table row? - Stack Overflow: https://stackoverflow.com/questions/55545494/in-python-docx-how-do-i-delete-a-table-row
[15] KeyError: u"no style with name 'Table Grid'"; python 无法创建word表格 - CSDN博客: https://blog.csdn.net/hoojou/article/details/86224463
[16] python docx-template - 知乎: https://zhuanlan.zhihu.com/p/366902690
