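Author: 还怪好嘞 | Column: Python全栈修炼之路 | Tags: Python, strings, Unicode, regular expressions, encoding
Preface
Strings are among the most frequently used data types in programming, yet in Python a string is far more than "a piece of text". From encoding fundamentals to regular expressions, and from immutability to string interning, a solid understanding of strings helps you write more efficient and more robust code. This article works from the basics up to the advanced topics of Python strings.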
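Part 1: Core Concepts in Detail
1.1 Creating Strings
Python offers several ways to create a string, each suited to different situations:

    # 1. Single and double quotes (fully equivalent)
    s1 = 'Hello Python'
    s2 = "Hello Python"

    # 2. Triple quotes: multi-line strings
    s3 = '''This is a
    multi-line string'''
    s4 = """This is also
    a multi-line string"""

    # 3. Raw strings: the r prefix leaves backslashes unescaped
    path = r"C:\Users\Admin\Documents"  # without r you would have to write C:\\Users...
    regex = r"\d+\.\d+"                 # common for regular expressions

    # 4. Converting between bytes and str
    b = b"hello"            # bytes
    s = b.decode('utf-8')   # decode to str
    b2 = s.encode('utf-8')  # encode to bytes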
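| Syntax | Description |
|---|---|
| 'text' | Single-quoted string |
| "text" | Double-quoted string, equivalent to single quotes |
| '''text''' | Triple-quoted string, may span multiple lines |
| r"text" | Raw string, backslashes are not treated as escapes |
| f"{var}" | Formatted string literal (f-string), embeds expressions |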
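1.2 Indexing and Slicing
Python strings support flexible indexing and slicing:

    text = "Python全栈修炼"

    # Indexing (zero-based)
    print(text[0])     # P
    print(text[-1])    # 炼 (last character)

    # Slicing [start:end:step]
    print(text[0:6])   # Python (index 6 is excluded)
    print(text[6:])    # 全栈修炼
    print(text[:6])    # Python
    print(text[::2])   # Pto全修 (step of 2)
    print(text[::-1])  # 炼修栈全nohtyP (reversed)

    # Handy slicing idioms
    text[:-1]   # "Python全栈修" (drop the last character)
    text[::2]   # "Pto全修" (every other character)
    text[::-1]  # "炼修栈全nohtyP" (string reversal, a classic interview question)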
How slicing maps onto indexes:

    String:          P    y    t    h    o    n    全   栈   修   炼
    Positive index:  0    1    2    3    4    5    6    7    8    9
    Negative index: -10  -9   -8   -7   -6   -5   -4   -3   -2   -1

    Slice [1:5] → indexes 1 through 4: "ytho"
    Slice [::2] → step of 2: "Pto全修"
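The slicing rules above have two properties that are easy to check by hand: slice bounds are clamped to the string's length, while a single out-of-range index raises `IndexError`. The sketch below also uses a `slice` object to name a reusable range; the variable names are illustrative and not part of the original article.

```python
text = "Python全栈修炼"

# Out-of-range slice bounds are clamped to the string length.
clamped = text[6:100]

# A single out-of-range index raises IndexError instead.
try:
    text[100]
    index_ok = True
except IndexError:
    index_ok = False

# A slice object gives a reusable, named range.
first_word = slice(0, 6)
english = text[first_word]

# slice.indices() resolves defaults into concrete (start, stop, step)
# values for a sequence of the given length.
resolved = slice(None, None, -1).indices(len(text))

print(clamped, index_ok, english, resolved)
```

Clamping is what makes idioms such as `text[:50]` safe on short strings.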
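1.3 Common Methods at a Glance
Python strings come with a rich set of methods, grouped here by purpose:
1.3.1 Searching and Replacing

    text = "Hello Python, Python is great!"

    # Searching
    print(text.find("Python"))    # 6 (index of the first occurrence; -1 if absent)
    print(text.rfind("Python"))   # 14 (index of the last occurrence)
    print(text.count("Python"))   # 2 (number of occurrences)
    print("Python" in text)       # True (membership test)

    # Replacing
    print(text.replace("Python", "Java", 1))  # replace only the first occurrence
    print(text.replace("Python", "Java"))     # replace every occurrence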
1.3.2 Case Conversion

    s = "python FULL-STACK"
    print(s.upper())       # PYTHON FULL-STACK
    print(s.lower())       # python full-stack
    print(s.title())       # Python Full-Stack (first letter of each word capitalized)
    print(s.capitalize())  # Python full-stack (first letter capitalized, rest lowercased)
    print(s.swapcase())    # PYTHON full-stack (case swapped)
1.3.3 Predicate Methods

    # Character-class checks
    "123".isdigit()            # True (digits only)
    "abc".isalpha()            # True (letters only)
    "abc123".isalnum()         # True (letters or digits)
    " ".isspace()              # True (whitespace only)
    "hello".islower()          # True (all lowercase)
    "HELLO".isupper()          # True (all uppercase)
    "Hello World".istitle()    # True (title case)

    # Prefix and suffix checks
    "test.txt".startswith("test")  # True
    "test.txt".endswith(".txt")    # True
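1.3.4 Splitting and Joining

    # Splitting
    csv = "Zhang San,25,Beijing,engineer"
    parts = csv.split(",")  # ['Zhang San', '25', 'Beijing', 'engineer']

    # Limiting the number of splits
    "a,b,c,d".split(",", 2)  # ['a', 'b', 'c,d']

    # Splitting into lines
    lines = """Line one
    Line two
    Line three""".splitlines()  # ['Line one', 'Line two', 'Line three']

    # Joining
    words = ["Python", "is", "awesome"]
    " ".join(words)  # "Python is awesome"

    # Note: join is more efficient than repeated + concatenation
    # Slow:  result = ""
    #        for s in strings: result += s
    # Fast:  result = "".join(strings)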
1.3.5 Trimming and Padding

    # Trimming whitespace
    "  hello  ".strip()   # "hello"   (both ends)
    "  hello  ".lstrip()  # "hello  " (left end)
    "  hello  ".rstrip()  # "  hello" (right end)

    # Trimming specific characters
    "###Python###".strip("#")  # "Python"

    # Padding and alignment
    "42".zfill(5)         # "00042"      (zero padding)
    "hi".center(10, "-")  # "----hi----" (centered)
    "hi".ljust(10, "-")   # "hi--------" (left-aligned)
    "hi".rjust(10, "-")   # "--------hi" (right-aligned)
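One detail of trimming is easy to misread: the argument to `strip()`, `lstrip()` and `rstrip()` is treated as a set of characters, not as a literal prefix or suffix. The sketch below contrasts that with `removesuffix()`, assuming Python 3.9 or later; the file name is a made-up example.

```python
name = "report.txt"

# rstrip(".txt") removes any trailing '.', 't' or 'x' characters,
# so it also eats the final 't' of "report".
stripped = name.rstrip(".txt")

# removesuffix() (Python 3.9+) removes the exact substring only.
exact = name.removesuffix(".txt")

# The padding methods have format-spec equivalents: fill, align, width.
centered = f"{'hi':-^10}"

print(stripped, exact, centered)
```

When the goal is to drop a known extension or prefix, `removesuffix()` and `removeprefix()` are the safer choice.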
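1.4 String Formatting
Python has several string formatting mechanisms, listed here from oldest to newest:
1.4.1 %-Formatting (legacy, not recommended)

    name = "Alice"
    age = 25
    print("My name is %s and I am %d years old" % (name, age))
    # My name is Alice and I am 25 years old

    # Common placeholders
    # %s - string
    # %d - integer
    # %f - float (%.2f keeps 2 decimal places)
    # %x - hexadecimal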
1.4.2 The str.format() Method

    # Positional arguments
    "{} + {} = {}".format(1, 2, 3)  # "1 + 2 = 3"

    # Explicit indexes
    "{1} {0} {1}".format("A", "B")  # "B A B"

    # Keyword arguments
    "{name} is {age} years old".format(name="Zhang San", age=25)

    # Format specifications
    "{:.2f}".format(3.14159)  # "3.14"       (2 decimal places)
    "{:>10}".format("hi")     # "        hi" (right-aligned, width 10)
    "{:0>5}".format(42)       # "00042"      (zero padding)
    "{:,}".format(1000000)    # "1,000,000"  (thousands separator)
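1.4.3 f-strings (recommended, Python 3.6+)

    name = "Alice"
    age = 25

    # Basic usage
    print(f"My name is {name} and I am {age} years old")

    # Evaluating expressions
    a, b = 3, 4
    print(f"{a} + {b} = {a + b}")  # "3 + 4 = 7"

    # Calling functions
    print(f"Name length: {len(name)}")

    # Format specifications
    pi = 3.14159265359
    print(f"π ≈ {pi:.2f}")    # "π ≈ 3.14"
    print(f"π ≈ {pi:10.4f}")  # "π ≈     3.1416" (width 10, 4 decimal places)

    # Thousands separator and percentages
    salary = 15000
    print(f"Monthly salary: {salary:,}")  # "Monthly salary: 15,000"
    rate = 0.8567
    print(f"Completion rate: {rate:.1%}")  # "Completion rate: 85.7%"

    # Date formatting
    from datetime import datetime
    now = datetime.now()
    print(f"Current time: {now:%Y-%m-%d %H:%M:%S}")

    # Debugging aid (Python 3.8+)
    x = 10
    print(f"{x=}")        # "x=10" (prints both the expression and its value)
    print(f"{x * 2 = }")  # "x * 2 = 20"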
Comparison of the three approaches:
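To make the comparison concrete, here is a minimal sketch that builds the same output with each of the three mechanisms. The values `name` and `score` are invented for illustration.

```python
name, score = "Alice", 92.456

# 1. %-formatting: placeholders are filled from a tuple.
old_style = "%s scored %.1f" % (name, score)

# 2. str.format(): placeholders are filled from arguments.
format_style = "{} scored {:.1f}".format(name, score)

# 3. f-string: expressions are embedded directly in the literal.
fstring_style = f"{name} scored {score:.1f}"

print(old_style)
print(old_style == format_style == fstring_style)
```

All three share largely the same format specifications (`.1f` here). The f-string keeps each value next to its placeholder, which is the main readability argument for preferring it in new code.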
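1.5 Getting Started with Regular Expressions
Regular expressions are a powerful tool for working with strings. Python supports them through the re module:

    import re

    # Common metacharacters
    # .      any character except a newline
    # \d     a digit (0-9; str patterns also match other Unicode digits)
    # \w     a word character (letters, digits, underscore; Unicode-aware for str patterns)
    # \s     whitespace
    # ^      start of the string
    # $      end of the string
    # *      zero or more
    # +      one or more
    # ?      zero or one
    # {n,m}  between n and m repetitions
    # []     character set
    # |      alternation
    # ()     group

    # 1. re.search(): find the first match
    pattern = r"\d{3}-\d{4}-\d{4}"
    text = "My phone is 138-1234-5678, backup 139-8765-4321"
    match = re.search(pattern, text)
    if match:
        print(match.group())  # "138-1234-5678"

    # 2. re.findall(): find all matches
    phones = re.findall(r"\d{3}-\d{4}-\d{4}", text)
    print(phones)  # ['138-1234-5678', '139-8765-4321']

    # 3. re.match(): match only at the start of the string
    if re.match(r"https://", "https://example.com"):
        print("This is an HTTPS link")

    # 4. re.sub(): substitution
    # Mask the middle four digits of a mobile number
    text = "Contact: 13812345678"
    hidden = re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", text)
    print(hidden)  # "Contact: 138****5678"

    # 5. re.split(): split on several delimiters
    text = "apple,banana;orange|grape"
    fruits = re.split(r"[,;|]", text)
    print(fruits)  # ['apple', 'banana', 'orange', 'grape']

    # 6. Precompiling a pattern (convenient when it is reused)
    phone_pattern = re.compile(r"1[3-9]\d{9}")
    phones = phone_pattern.findall("13812345678 and 13987654321")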
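Regular expression cheat sheet:
| Pattern | Meaning |
|---|---|
| \d+ | One or more digits |
| \w{3,6} | Three to six word characters |
| ^[A-Z] | Starts with an uppercase ASCII letter |
| \d{4}-\d{2}-\d{2} | A date-like string such as 2024-03-15 |
| \bword\b | The whole word "word" |
| (?i)pattern | Case-insensitive match of "pattern" |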
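The patterns in the cheat sheet can be tried against a single sample string. This is only a sketch: the sample text and variable names are invented for illustration.

```python
import re

sample = "Order A12 shipped on 2024-03-15; Word count: 42 words."

# \d+ : every run of digits
digits = re.findall(r"\d+", sample)

# \d{4}-\d{2}-\d{2} : date-like strings
dates = re.findall(r"\d{4}-\d{2}-\d{2}", sample)

# ^[A-Z] : does the text start with an uppercase letter?
starts_upper = bool(re.match(r"^[A-Z]", sample))

# \bword\b : the whole word "word", case-sensitive
# ("Word" and "words" do not match)
whole_word = re.findall(r"\bword\b", sample)

# (?i) : the same pattern, case-insensitive
any_case = re.findall(r"(?i)\bword\b", sample)

# \w{3,6} with fullmatch(): validate that an entire string fits
valid_name = bool(re.fullmatch(r"\w{3,6}", "user_1"))
too_short = bool(re.fullmatch(r"\w{3,6}", "ab"))

print(digits, dates, starts_upper, whole_word, any_case, valid_name, too_short)
```

Using `re.fullmatch()` for validation avoids having to add `^` and `$` anchors by hand.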
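Part 2: Under the Hood
2.1 String Immutability
Core concept: Python strings are immutable objects. Once created, a string cannot be changed.

    s = "hello"
    # s[0] = "H"  # TypeError: 'str' object does not support item assignment

    # "Modifying" a string really means creating a new one
    s = "H" + s[1:]  # creates the new string "Hello"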
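Why are strings designed to be immutable?
- Safety: a string can be passed around and shared without any risk that another piece of code changes it.
- Hashability: the value never changes, so the hash never changes, and strings can serve as dictionary keys and set members.
- Performance: immutable objects can be cached and shared, for example through interning.
- Simplicity: there is no need to reason about who else might be modifying the same string.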
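Memory diagram:

    s ──→ "hello"  (address 0x1000)
            ↓  run s = "H" + s[1:]
    s ──→ "Hello"  (address 0x2000, a new object)

    The "hello" object is untouched. It stays in memory until nothing references it and it is reclaimed.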
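The diagram above can be checked directly. The sketch below keeps a second reference to the original object so that both strings can be inspected after the rebinding; the names `original` and `visits` are illustrative.

```python
original = "hello"
s = original

# "Modifying" s builds a new object and rebinds the name.
s = "H" + s[1:]
same_object = s is original

# In-place item assignment is rejected outright.
try:
    original[0] = "H"
    error = ""
except TypeError as exc:
    error = str(exc)

# Because the value can never change, the hash is stable and the
# string is safe to use as a dictionary key.
visits = {original: 1}
visits["hello"] += 1

print(original, s, same_object, error, visits)
```

The last two lines show the practical payoff of immutability: a separately written `"hello"` literal finds the same dictionary entry, because lookup depends only on value and hash.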
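2.2 Unicode and UTF-8
How character sets evolved:

    ASCII (1963)         → 1 byte, 128 characters, English only
        ↓
    GB2312/GBK (China)   → 2 bytes per Chinese character
        ↓
    Unicode (1991)       → one unified character set, a unique code point for every character
        ↓
    UTF-8 (an encoding)  → variable length (1 to 4 bytes), ASCII-compatible, the most widely used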
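The "variable length" property of UTF-8 is easy to observe by encoding one character from each width class. This is a small sketch with sample characters chosen for illustration; `bytes.hex()` with a separator argument assumes Python 3.8 or later.

```python
samples = ["A", "é", "中", "😀"]

# Number of UTF-8 bytes needed for each character.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the code point, the character, and its encoded bytes.
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")

# The same Chinese character takes 2 bytes in GBK but 3 in UTF-8.
gbk_width = len("中".encode("gbk"))
print(gbk_width)
```

ASCII characters keep their single-byte values, which is why UTF-8 is backward compatible with ASCII text.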
The Python 3 string model:

    # In Python 3, str is a sequence of Unicode characters
    s = "Python全栈"
    print(len(s))  # 8 (characters, not bytes)

    # Encoding: str → bytes
    utf8_bytes = s.encode('utf-8')
    print(len(utf8_bytes))  # 12 (6 ASCII bytes + 2 Chinese characters × 3 bytes each)

    # Decoding: bytes → str
    s2 = utf8_bytes.decode('utf-8')

    # Inspecting Unicode code points
    print(ord('A'))     # 65
    print(ord('中'))    # 20013
    print(chr(20013))   # 中
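Solutions to common encoding problems:

    # Scenario 1: garbled text when reading a file
    # Wrong: relies on the platform's default encoding
    with open('file.txt', 'r') as f:
        content = f.read()

    # Right: state the encoding explicitly
    with open('file.txt', 'r', encoding='utf-8') as f:
        content = f.read()

    # Scenario 2: the file's encoding is unknown
    import chardet  # third-party package (pip install chardet)

    with open('unknown.txt', 'rb') as f:
        raw = f.read()
    result = chardet.detect(raw)
    print(result)  # e.g. {'encoding': 'GB2312', 'confidence': 0.99}
    content = raw.decode(result['encoding'])

    # Scenario 3: handling a BOM (byte order mark)
    # A "UTF-8 with BOM" file starts with \ufeff
    with open('file.txt', 'r', encoding='utf-8-sig') as f:
        content = f.read()  # the BOM is stripped automatically

    # Scenario 4: data in mixed encodings
    def safe_decode(data, encodings=('utf-8', 'gbk', 'latin-1')):
        """Try several encodings in turn."""
        for enc in encodings:
            try:
                return data.decode(enc)
            except UnicodeDecodeError:
                continue
        # latin-1 accepts any byte sequence, so this line is reached
        # only when the caller passes a list without it
        return data.decode('utf-8', errors='replace')  # last resort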
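2.3 String Interning
Concept: CPython automatically interns (caches) certain strings so that equal strings share a single object in memory. The results below are CPython implementation details.

    import sys

    # Automatically interned
    a = "hello"
    b = "hello"
    print(a is b)  # True (short identifier-like literals are interned)

    # Forcing interning
    c = sys.intern("hello world! " * 100)
    d = sys.intern("hello world! " * 100)
    print(c is d)  # True

    # Not interned: strings built at run time
    a = "hello world"
    prefix = "hello "
    b = prefix + "world"
    print(a is b)  # False (run-time concatenation is not interned)
    print(a == b)  # True (compare values with ==)

    # Compile-time constant folding does share the object
    a = "hello" + "world"  # folded into one constant by the compiler
    b = "helloworld"
    print(a is b)  # True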
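Interning rules (CPython implementation details that can differ between versions):
- String literals that look like identifiers (ASCII letters, digits, underscores) are interned at compile time.
- Strings built at run time are not interned unless they are passed to sys.intern().
- Never rely on `is` to compare string values. Use == instead.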
Part 3: Hands-On Projects
3.1 Project 1: A Sensitive-Word Filter

    import re
    from typing import List, Optional, Tuple


    class SensitiveWordFilter:
        """Sensitive-word filter built on one compiled regular expression."""

        def __init__(self, sensitive_words: Optional[List[str]] = None):
            # Copy the list so add_words() never mutates the caller's list
            self.sensitive_words = list(sensitive_words or [])
            self.pattern = None
            self._compile_pattern()

        def _compile_pattern(self):
            """Compile all words into a single alternation pattern."""
            if not self.sensitive_words:
                return
            # Escape special characters and join the words with |
            escaped = [re.escape(word) for word in self.sensitive_words]
            pattern_str = "|".join(escaped)
            self.pattern = re.compile(pattern_str, re.IGNORECASE)

        def add_words(self, words: List[str]):
            """Add sensitive words and rebuild the pattern."""
            self.sensitive_words.extend(words)
            self._compile_pattern()

        def check(self, text: str) -> Tuple[bool, List[str]]:
            """Report whether the text contains sensitive words, and which ones."""
            if not self.pattern:
                return False, []
            matches = self.pattern.findall(text)
            return len(matches) > 0, matches

        def filter(self, text: str, replace_char: str = "*") -> str:
            """Mask every sensitive word with replace_char."""
            if not self.pattern:
                return text

            def replace_match(match):
                return replace_char * len(match.group())

            return self.pattern.sub(replace_match, text)

        def filter_with_stats(self, text: str) -> dict:
            """Filter the text and return statistics about the matches."""
            has_sensitive, found_words = self.check(text)
            filtered_text = self.filter(text)
            # Count occurrences of each word, ignoring case
            word_count = {}
            for word in found_words:
                word_lower = word.lower()
                word_count[word_lower] = word_count.get(word_lower, 0) + 1
            return {
                "original": text,
                "filtered": filtered_text,
                "has_sensitive": has_sensitive,
                "found_words": list(set(found_words)),
                "word_count": word_count,
                "total_matches": len(found_words),
            }


    # Usage example
    if __name__ == "__main__":
        # Create the filter (named word_filter to avoid shadowing the built-in filter)
        sensitive_words = ["violence", "porn", "gambling", "fraud", "drugs"]
        word_filter = SensitiveWordFilter(sensitive_words)

        # Test text
        test_text = "This site contains porn and gambling content, plus Violence!"

        # Check
        has_sensitive, words = word_filter.check(test_text)
        print(f"Contains sensitive words: {has_sensitive}")
        print(f"Words found: {words}")

        # Filter
        filtered = word_filter.filter(test_text)
        print(f"Filtered: {filtered}")

        # Detailed statistics
        result = word_filter.filter_with_stats(test_text)
        print("\nDetailed result:")
        for key, value in result.items():
            print(f"  {key}: {value}")
3.2 Project 2: A Text Statistics Tool

    import re
    from collections import Counter
    from typing import Dict, List, Tuple


    class TextAnalyzer:
        """Text analyzer that computes a range of text statistics."""

        def __init__(self, text: str):
            self.text = text
            self.lines = text.splitlines()

        def basic_stats(self) -> Dict:
            """Basic statistics."""
            total_chars = len(self.text)  # characters, including whitespace
            # Characters, excluding spaces and newlines
            chars_no_space = len(self.text.replace(" ", "").replace("\n", ""))
            line_count = len(self.lines)
            words = self.text.split()  # split on whitespace
            word_count = len(words)
            return {
                "total_chars": total_chars,
                "chars_no_space": chars_no_space,
                "line_count": line_count,
                "word_count": word_count,
                "avg_chars_per_line": round(total_chars / line_count, 2) if line_count else 0,
                "avg_words_per_line": round(word_count / line_count, 2) if line_count else 0,
            }

        def chinese_stats(self) -> Dict:
            """Statistics for Chinese text."""
            # Chinese characters (CJK Unified Ideographs)
            chinese_chars = re.findall(r'[\u4e00-\u9fff]', self.text)
            # Full-width Chinese punctuation
            chinese_punct = re.findall(r'[,。!?、;:“”‘’()【】《》]', self.text)
            # English words
            english_words = re.findall(r'[a-zA-Z]+', self.text)
            # Numbers
            numbers = re.findall(r'\d+', self.text)
            return {
                "chinese_chars": len(chinese_chars),
                "chinese_punct": len(chinese_punct),
                "english_words": len(english_words),
                "numbers": len(numbers),
                "unique_chinese": len(set(chinese_chars)),
            }

        def word_frequency(self, top_n: int = 10) -> List[Tuple[str, int]]:
            """Word frequency."""
            # Tokenize: Chinese per character, English per word
            words = re.findall(r'[\u4e00-\u9fff]|[a-zA-Z]+', self.text.lower())
            # Optional: drop single Chinese characters. Because Chinese is
            # tokenized per character, this leaves only the English words.
            words = [w for w in words if len(w) > 1 or not re.match(r'[\u4e00-\u9fff]', w)]
            return Counter(words).most_common(top_n)

        def sentence_analysis(self) -> Dict:
            """Sentence analysis."""
            # Split on Chinese and English full stops, question marks, exclamation marks
            sentences = re.split(r'[。!?.!?]+', self.text)
            sentences = [s.strip() for s in sentences if s.strip()]
            if not sentences:
                return {"sentence_count": 0}
            sentence_lengths = [len(s) for s in sentences]
            longest = sentences[sentence_lengths.index(max(sentence_lengths))]
            return {
                "sentence_count": len(sentences),
                "avg_sentence_length": round(sum(sentence_lengths) / len(sentences), 2),
                "max_sentence_length": max(sentence_lengths),
                "min_sentence_length": min(sentence_lengths),
                "longest_sentence": longest[:50] + "...",
            }

        def generate_report(self) -> str:
            """Build the full report."""
            basic = self.basic_stats()
            chinese = self.chinese_stats()
            freq = self.word_frequency(5)
            sentence = self.sentence_analysis()
            report = f"""
    ╔════════════════════════════════════════════════╗
    ║             TEXT STATISTICS REPORT             ║
    ╚════════════════════════════════════════════════╝
    [Basic statistics]
      Characters (with spaces):    {basic['total_chars']}
      Characters (without spaces): {basic['chars_no_space']}
      Lines:                       {basic['line_count']}
      Words:                       {basic['word_count']}
      Average characters per line: {basic['avg_chars_per_line']}
    [Chinese statistics]
      Chinese characters:          {chinese['chinese_chars']}
      Chinese punctuation:         {chinese['chinese_punct']}
      Distinct Chinese characters: {chinese['unique_chinese']}
      English words:               {chinese['english_words']}
      Numbers:                     {chinese['numbers']}
    [Sentence analysis]
      Sentences:                   {sentence['sentence_count']}
      Average sentence length:     {sentence.get('avg_sentence_length', 0)} characters
      Longest sentence:            {sentence.get('longest_sentence', 'N/A')}
    [Top 5 frequent words]
    """
            for word, count in freq:
                bar = "█" * count
                report += f"  {word:10}{count:3}{bar}\n"
            return report


    # Usage example
    text = """Python是一种解释型、面向对象、动态数据类型的高级程序设计语言。
    Python由Guido van Rossum于1989年底发明,第一个公开发行版发行于1991年。
    Python源代码遵循GPL协议。
    Python语法简洁清晰,特色之一是强制用空白符缩进。"""
    analyzer = TextAnalyzer(text)
    print(analyzer.generate_report())
3.3 Project 3: A CSV Parser

    from typing import Dict, List


    class SimpleCSVParser:
        """A simple CSV parser with support for quoting and escapes."""

        def __init__(self, delimiter: str = ",", quotechar: str = '"'):
            self.delimiter = delimiter
            self.quotechar = quotechar

        def parse_line(self, line: str) -> List[str]:
            """Parse a single CSV line."""
            fields = []
            field = ""
            in_quotes = False
            i = 0
            while i < len(line):
                char = line[i]
                if char == self.quotechar:
                    if in_quotes and i + 1 < len(line) and line[i + 1] == self.quotechar:
                        # A doubled quote "" inside quotes is one literal quote
                        field += self.quotechar
                        i += 2
                        continue
                    else:
                        in_quotes = not in_quotes
                elif char == self.delimiter and not in_quotes:
                    fields.append(field)
                    field = ""
                elif char == "\\" and i + 1 < len(line):
                    # Backslash escape: take the next character literally
                    field += line[i + 1]
                    i += 1
                else:
                    field += char
                i += 1
            fields.append(field)
            return fields

        def parse(self, csv_text: str) -> List[List[str]]:
            """Parse CSV text. Quoted fields that contain newlines are not supported."""
            lines = csv_text.splitlines()
            return [self.parse_line(line) for line in lines if line.strip()]

        def parse_dict(self, csv_text: str) -> List[Dict[str, str]]:
            """Parse into a list of dicts, using the first row as the header."""
            lines = self.parse(csv_text)
            if not lines:
                return []
            headers = lines[0]
            return [dict(zip(headers, row)) for row in lines[1:]]

        def generate(self, data: List[List[str]]) -> str:
            """Generate CSV text."""
            lines = []
            for row in data:
                fields = []
                for field in row:
                    # Fields containing the delimiter, a newline, or a quote need quoting
                    if self.delimiter in field or "\n" in field or self.quotechar in field:
                        # Escape quotes by doubling them
                        escaped = field.replace(self.quotechar, self.quotechar * 2)
                        field = f'{self.quotechar}{escaped}{self.quotechar}'
                    fields.append(field)
                lines.append(self.delimiter.join(fields))
            return "\n".join(lines)


    # Usage example
    csv_data = '''name,age,city,note
    Zhang San,25,Beijing,"Loves coding, likes Python"
    Li Si,30,Shanghai," says ""Hello"""
    Wang Wu,28,Shenzhen,none'''
    parser = SimpleCSVParser()

    # Parse into lists
    print("=== As lists ===")
    for row in parser.parse(csv_data):
        print(row)

    # Parse into dicts
    print("\n=== As dicts ===")
    for record in parser.parse_dict(csv_data):
        print(record)

    # Generate CSV
    print("\n=== Generated CSV ===")
    data = [
        ["Product", "Price", "Stock"],
        ["iPhone", "5999", "100"],
        ["MacBook", "12999", "50"],
    ]
    print(parser.generate(data))
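The hand-written parser is a useful exercise in string handling. For comparison, here is a minimal sketch of the same job done with the standard library's `csv` module, which also copes with quoted fields that span several lines. The sample data is invented for illustration.

```python
import csv
from io import StringIO

csv_text = (
    'name,note\n'
    'Alice,"likes ""Python"", really"\n'
    'Bob,"line one\nline two"\n'
)

# csv.reader handles doubled quotes and newlines inside quoted fields.
rows = list(csv.reader(StringIO(csv_text)))

# csv.DictReader uses the first row as the header.
records = list(csv.DictReader(StringIO(csv_text)))

# csv.writer quotes and escapes fields only where necessary.
buffer = StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
round_trip = list(csv.reader(StringIO(buffer.getvalue())))

print(rows)
print(records[0]["note"])
```

In production code the `csv` module is usually the safer choice. The project above remains valuable for understanding the state machine that such a parser needs.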
Part 4: Common Pitfalls and Fixes
4.1 The String Concatenation Performance Pitfall

    # Pitfall: concatenating with + inside a loop (slow)
    def bad_concat(items):
        result = ""
        for item in items:
            result += str(item)  # may build a new string on every iteration
        return result

    # Right: use join (fast)
    def good_concat(items):
        return "".join(str(item) for item in items)

    # Performance comparison
    import timeit

    items = list(range(1000))
    bad_time = timeit.timeit(lambda: bad_concat(items), number=100)
    good_time = timeit.timeit(lambda: good_concat(items), number=100)
    print(f"Slow version: {bad_time:.4f}s")
    print(f"Fast version: {good_time:.4f}s")
    print(f"Speed-up: {bad_time/good_time:.1f}x")
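4.2 Encoding Pitfalls

    # Pitfall 1: mixing str and bytes
    s = "hello"
    b = b"hello"
    # s + b  # TypeError!

    # Right: convert to one type first
    s + b.decode('utf-8')  # "hellohello"
    s.encode('utf-8') + b  # b"hellohello"

    # Pitfall 2: the default encoding
    import locale
    import sys
    print(sys.getdefaultencoding())            # utf-8 (used by str.encode() and bytes.decode())
    print(locale.getpreferredencoding(False))  # what open() uses when encoding is omitted
    # On Windows this may be a legacy code page such as GBK, which produces garbled text
    # Fix: always pass encoding explicitly

    # Pitfall 3: invalid byte sequences
    data = b"hello\xff\xfeworld"  # contains bytes that are not valid UTF-8

    # Wrong: decoding directly raises an error
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as e:
        print(f"Decode error: {e}")

    # Right: use the errors parameter
    data.decode('utf-8', errors='ignore')            # drop the invalid bytes
    data.decode('utf-8', errors='replace')           # replace them with U+FFFD (�)
    data.decode('utf-8', errors='backslashreplace')  # show them as escapes such as \xff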
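4.3 Regular Expression Pitfalls

    # Pitfall 1: greedy matching
    import re

    text = "<div>内容1</div><div>内容2</div>"

    # Greedy (default): match as much as possible
    print(re.findall(r"<div>.*</div>", text))
    # ['<div>内容1</div><div>内容2</div>']

    # Non-greedy: match as little as possible
    print(re.findall(r"<div>.*?</div>", text))
    # ['<div>内容1</div>', '<div>内容2</div>']

    # Pitfall 2: forgetting to escape
    pattern = "www.example.com"  # each . matches any character!
    # Right:
    pattern = r"www\.example\.com"

    # Pitfall 3: compilation overhead
    texts = ["abc123", "no digits here", "42"]

    # Passing the pattern string inside a loop: re caches compiled patterns,
    # but every call still pays for a cache lookup
    for text in texts:
        re.search(r"\d+", text)

    # Better: compile once and reuse the pattern object
    pattern = re.compile(r"\d+")
    for text in texts:
        pattern.search(text)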
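When the text to match comes from data rather than from a hand-written pattern, `re.escape()` applies the escaping from pitfall 2 automatically. The sketch below shows the difference on the domain used above; the test string is invented for illustration.

```python
import re

domain = "www.example.com"

# Unescaped: each "." matches any character.
loose = re.compile(domain)

# Escaped: re.escape() turns "." into a literal dot.
strict = re.compile(re.escape(domain))

lookalike = "wwwXexampleYcom"

loose_hit = loose.fullmatch(lookalike) is not None
strict_hit = strict.fullmatch(lookalike) is not None
strict_real = strict.fullmatch(domain) is not None

print(loose_hit, strict_hit, strict_real)
```

This is the same technique the sensitive-word filter in Part 3 uses before joining its words with `|`.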
4.4 Immutability Pitfalls

    # Pitfall: assuming the string was modified
    def add_suffix(name):
        name + "_backup"  # result is discarded, the original string is unchanged!
        return name

    # Right
    def add_suffix_fixed(name):
        return name + "_backup"

    # Pitfall: "modifying" strings inside a list
    names = ["file1", "file2"]
    for name in names:
        name = name + ".txt"  # only rebinds the loop variable
    print(names)  # ['file1', 'file2']

    # Right
    names = [name + ".txt" for name in names]
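Part 5: Chapter Summary
Key points in review
| Topic | Key points |
|---|---|
| Creating strings | Quotes, triple quotes, raw strings, bytes conversion |
| Indexing and slicing | [start:end:step] |
| Common methods | find/replace, split/join, strip, case conversion |
| Formatting | %-formatting, str.format(), f-strings |
| Regular expressions | search, findall, match, sub, split, compile |
| Internals | Immutability, Unicode and UTF-8, interning |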
Best-practice checklist
- Use f-strings for string formatting (Python 3.6+)
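Part 6: Exercises
Basic exercises
Word-order reversal: write a function that reverses the order of the words in a string while leaving each word intact

    reverse_words("Hello Python World")  # "World Python Hello"

Palindrome check: decide whether a string is a palindrome, ignoring case and non-letter characters

    is_palindrome("A man, a plan, a canal: Panama")  # True

Formatted output: use f-strings to print a table

    # Output:
    # | Name   | Age  | Score |
    # |--------|------|-------|
    # | 张三   | 20   | 85.5  |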
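Intermediate exercises
Regex extraction: extract every link from an HTML document

    extract_links(html_text)  # returns [(url, text), ...]

Template engine: implement simple string template substitution

    template = "Hello, {{name}}! You have {{count}} messages."
    render(template, {"name": "Alice", "count": 5})
    # "Hello, Alice! You have 5 messages."

Encoding conversion tool: convert the encoding of a batch of files

    convert_encoding(src_files, from_enc="gbk", to_enc="utf-8")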
Challenge exercises
Markdown parser: implement a simple Markdown-to-HTML converter
Log analyzer: analyze Apache/Nginx logs
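Coming next: Part 04, "Lists and Tuples: The Twin Stars of Ordered Collections", takes a close look at Python's most commonly used sequence types, including how dynamic arrays work, list comprehensions, and the advantages of tuple immutability.
This article is Part 03 of the Python全栈修炼之路 series, which is updated regularly. Follow the column for more.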