Master Class | Measuring Urban Polycentricity from LandScan Data with Python (Wang Qiao's Version)

for k, v in ANALYSIS_PARAMS.items():
    print(f" {k}: {v}")

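On the choice of dist_nb = 1000: after Albers projection, the LandScan raster has a resolution of roughly 820 m. Edge-adjacent cells (up, down, left, right) are therefore about 820 m apart, while diagonal cells are about 820 × √2 ≈ 1160 m apart. A threshold of 1000 m falls between these two distances, so it selects exactly the edge-adjacent cells and does not mistakenly include the diagonal neighbors.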
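The threshold can be checked on an idealized square grid. This is a minimal sketch using only the standard library: the 820 m spacing comes from the text, while the 3 × 3 grid and the helper name `neighbors_within` are assumptions made for illustration.

```python
import math

CELL = 820.0        # approximate cell size after Albers projection (from the text)
THRESHOLD = 1000.0  # the dist_nb setting

# Idealized 3 x 3 grid of cell centers; the center cell sits at (0, 0).
grid = [(i * CELL, j * CELL) for i in (-1, 0, 1) for j in (-1, 0, 1)]


def neighbors_within(origin, points, threshold):
    """Return the points whose distance to origin lies in (0, threshold]."""
    ox, oy = origin
    return [
        p for p in points
        if 0 < math.hypot(p[0] - ox, p[1] - oy) <= threshold
    ]


nbrs = neighbors_within((0.0, 0.0), grid, THRESHOLD)
diagonal = math.hypot(CELL, CELL)

print(len(nbrs))           # number of neighbors kept for the center cell
print(round(diagonal, 1))  # diagonal distance in meters
```

Only the four edge-adjacent cells fall inside the threshold, because the diagonal distance exceeds 1000 m.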

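2. Creating and Managing a Python Virtual Environment with reticulate

When calling Python from R through the reticulate package, the best practice is to create a dedicated Python virtual environment for the project. This isolates the required dependencies and avoids version conflicts with a system-wide Python such as Anaconda.

Important (avoiding the "already initialized" error): reticulate can bind to Python only once per R session. As soon as any {python} chunk runs, the Python interpreter is locked, and a later call to use_virtualenv() fails with:
ERROR: The requested version of Python cannot be used, as another version has already been initialized.
The virtual environment must therefore be activated before any {python} chunk runs. This document handles that in the setup chunk by calling Sys.setenv(RETICULATE_PYTHON = ...) to fix the Python path in advance; this environment variable has the highest priority when reticulate selects a Python.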

2.1 Installing reticulate (first time only)

# Set the CRAN mirror (R is non-interactive during knit and will not pick one automatically)
options(repos = c(CRAN = "https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))

# Install only if not yet installed, to avoid reinstalling on every knit
if (!requireNamespace("reticulate", quietly = TRUE)) {
  install.packages("reticulate")
  message("reticulate installed.")
} else {
  message("reticulate is already installed, version: ", packageVersion("reticulate"))
}

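2.2 How the virtual environment is initialized (already done in the setup chunk)

The setup chunk of this document (run hidden) contains the following logic:

library(reticulate)

.venv_name <- ".venv"
.venv_python <- virtualenv_python(.venv_name)

# Create the virtual environment automatically if it does not exist
if (!file.exists(.venv_python)) {
  virtualenv_create(.venv_name)
  .venv_python <- virtualenv_python(.venv_name)
}

# Lock Python through the environment variable (highest priority, before any {python} chunk)
Sys.setenv(RETICULATE_PYTHON = .venv_python)
use_virtualenv(.venv_name, required = TRUE)

The key point is that by the time knitr processes the first {python} chunk, reticulate already knows from the RETICULATE_PYTHON environment variable that it should use .venv, so it will not fall back to Anaconda.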
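Once the environment is bound, a {python} chunk can confirm which interpreter is actually in use. The following is a minimal sketch using only the standard library; the helper name `in_virtualenv` is an assumption made for illustration and is not part of reticulate.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Python executable:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the printed executable path does not point into the project's virtual environment, the binding most likely happened before RETICULATE_PYTHON was set.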

3. Environment Setup and Data Preparation

3.1 Loading Python packages

import os
import warnings

import numpy as np
import pandas as pd
import geopandas as gpd
import rasterio
from rasterio.mask import mask
from shapely.geometry import Point
from scipy.spatial import KDTree
from scipy.stats import norm
from libpysal.weights import DistanceBand, lag_spatial
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

warnings.filterwarnings("ignore")

# Set a font with Chinese glyphs (LXGW WenKai); skipped if the font file is absent
font_path = os.path.expanduser("~/Library/Fonts/LXGWWenKai-Regular.ttf")
if os.path.exists(font_path):
    fm.fontManager.addfont(font_path)
    prop = fm.FontProperties(fname=font_path)
    plt.rcParams["font.family"] = prop.get_name()
plt.rcParams["axes.unicode_minus"] = False

3.2 Global paths and parameters

from pathlib import Path

# Equal-area projection (Albers, suitable for computations over China)
MY_CRS = (
    "+proj=aea +lat_0=0 +lon_0=105 +lat_1=25 +lat_2=47"
    " +x_0=0 +y_0=0 +ellps=krass +units=m +no_defs"
)

BASE_DIR = Path(".")

PATH = {
    "city_shp": BASE_DIR / "2021行政区划" / "市.shp",
    "pop_tif": BASE_DIR / "pop-tif2",
}

Preprocessing note: if your LandScan tif files are still in WGS84 coordinates, reproject them to MY_CRS first:

import glob
import os

import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling

os.makedirs("pop-tif2", exist_ok=True)

for src_path in sorted(glob.glob("pop-tif/*.tif")):
    dst_path = src_path.replace("pop-tif", "pop-tif2")
    with rasterio.open(src_path) as src:
        # Compute the output grid in the target CRS
        transform, width, height = calculate_default_transform(
            src.crs, MY_CRS, src.width, src.height, *src.bounds
        )
        kwargs = src.meta.copy()
        kwargs.update({
            "crs": MY_CRS,
            "transform": transform,
            "width": width,
            "height": height,
        })
        with rasterio.open(dst_path, "w", **kwargs) as dst:
            for band in range(1, src.count + 1):
                reproject(
                    source=rasterio.band(src, band),
                    destination=rasterio.band(dst, band),
                    src_transform=src.transform,
                    src_crs=src.crs,
                    dst_transform=transform,
                    dst_crs=MY_CRS,
                    resampling=Resampling.nearest,
                )

4. Single-City Demonstration (Beijing, 2020)

4.1 Reading the city boundary and the population raster

# Read the administrative boundary of Beijing
city_demo = gpd.read_file(PATH["city_shp"])
city_demo["geometry"] = city_demo["geometry"].buffer(0)  # repair invalid geometries
city_demo = city_demo[~city_demo.is_empty]
city_demo = city_demo.to_crs(MY_CRS)
city_demo = city_demo[city_demo["市"] == "北京市"]
city_demo

4.2 Clipping and raster-to-point conversion
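The idea behind raster-to-point conversion can be sketched without any GIS library: each cell (row, col) of a north-up raster maps to the coordinates of its center through the raster's affine transform. The origin, cell size, and values below are made up for illustration, and `cells_to_points` is a hypothetical helper, not the function used in this tutorial.

```python
def cells_to_points(values, x_min, y_max, cell, nodata=None):
    """Convert a 2-D list of raster values into (x, y, value) cell-center points.

    Assumes a north-up raster whose upper-left corner is (x_min, y_max)
    and whose cells are squares with side length `cell`.
    """
    points = []
    for row, line in enumerate(values):
        for col, v in enumerate(line):
            if v == nodata:
                continue  # skip cells without data
            x = x_min + (col + 0.5) * cell  # x grows with the column index
            y = y_max - (row + 0.5) * cell  # y shrinks as the row index grows
            points.append((x, y, v))
    return points


raster = [
    [10, 20],
    [-1, 40],  # -1 marks nodata in this toy example
]
pts = cells_to_points(raster, x_min=0.0, y_max=1640.0, cell=820.0, nodata=-1)
for p in pts:
    print(p)
```

In a real workflow the transform would come from the clipped raster itself rather than from hand-written constants.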

4.3 Building the spatial weights matrix

# Extract coordinates
coords = np.array([(p.x, p.y) for p in pop_points.geometry])

# Build distance-based adjacency (edge-adjacent cells, 1000 m)
w = DistanceBand(coords, threshold=ANALYSIS_PARAMS["dist_nb"], binary=True, alpha=-1.0)
w.transform = "r"  # row standardization

print(f"Number of cells: {w.n}")
print(f"Mean number of neighbors: {w.mean_neighbors:.2f}")

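Parameter notes
binary=True: binary weights; neighbors get weight 1 and non-neighbors get 0.
alpha=-1.0: distance-decay exponent; -1 means weights of 1/distance, and it takes effect only when binary=False.
transform = "r": row-standardized weights, equivalent to style = "W" in R.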
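To make the effect of row standardization concrete, here is a small pure-Python sketch. The three-cell neighbor structure and the population values are invented for illustration. With binary weights divided by each row's sum, the spatial lag of a cell is simply the mean of its neighbors' values.

```python
# Binary neighbor lists for three cells arranged in a row: 0 - 1 - 2
neighbors = {0: [1], 1: [0, 2], 2: [1]}
pop = [100.0, 200.0, 600.0]

# Row standardization: each neighbor of cell i gets weight 1 / (number of neighbors of i),
# so the weights in every row sum to 1.
weights = {i: [1.0 / len(nb)] * len(nb) for i, nb in neighbors.items()}

# Spatial lag: weighted sum of the neighbors' values.
lag = [
    sum(w * pop[j] for w, j in zip(weights[i], neighbors[i]))
    for i in sorted(neighbors)
]
print(lag)
```

The middle cell's lag is the average of its two neighbors, while each end cell's lag equals the value of its single neighbor.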

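4.4 Local Moran's I test

Local Moran's I is a tool for locating spatial clusters in which high values are surrounded by high values, or low values by low values. Its sole purpose here is to find cells that have a high population and are also surrounded by high-population cells. These HH clusters are the candidate city centers.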
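For reference, the quantities computed by the function in this section can be written out as follows. The formulas are transcribed from that code; the symbols $W_i$ and $W_i^{(2)}$ are notation introduced here for the row sums of weights and squared weights.

```latex
\begin{aligned}
z_i &= y_i - \bar{y}, \qquad
m_2 = \frac{1}{n}\sum_{k=1}^{n} z_k^2, \qquad
W_i = \sum_{j} w_{ij}, \qquad
W_i^{(2)} = \sum_{j} w_{ij}^2 \\[4pt]
I_i &= \frac{z_i}{m_2} \sum_{j} w_{ij} z_j \\[4pt]
\mathrm{E}[I_i] &= -\frac{z_i^2 \, W_i}{(n-1)\, m_2} \\[4pt]
\mathrm{Var}[I_i] &= \left(\frac{z_i}{m_2}\right)^{2} \frac{n}{n-2}
  \left(W_i^{(2)} - \frac{W_i^2}{n-1}\right)
  \left(m_2 - \frac{z_i^2}{n-1}\right) \\[4pt]
Z_i &= \frac{I_i - \mathrm{E}[I_i]}{\sqrt{\mathrm{Var}[I_i]}}, \qquad
p_i = 2\,\bigl(1 - \Phi(|Z_i|)\bigr)
\end{aligned}
```

Here $\Phi$ is the standard normal distribution function, so $p_i$ is a two-sided p-value.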

# Own implementation of the conditional normal approximation
# (reproduces the source of R spdep::localmoran(conditional=TRUE))
def local_moran_conditional_p(y, w):
    """
    Compute LISA p-values with the conditional normal approximation
    used by R spdep::localmoran(conditional=TRUE).
    """
    y = np.asarray(y, dtype=np.float64)
    n = len(y)
    z = y - y.mean()          # centered variable
    m2 = np.sum(z ** 2) / n   # variance as with mlvar=TRUE

    # Ii = (z / m2) * lag(z)
    z_lag = lag_spatial(w, z)
    Ii = (z / m2) * z_lag

    # Sum of neighbor weights (Wi) and of squared weights (Wi2) for each cell
    Wi = np.zeros(n)
    Wi2 = np.zeros(n)
    for i in range(n):
        if len(w.neighbors[i]) > 0:
            weights_i = np.array(w.weights[i])
            Wi[i] = np.sum(weights_i)
            Wi2[i] = np.sum(weights_i ** 2)

    # Conditional expectation: E[Ii] = -(z_i^2 * Wi) / ((n-1) * m2)
    E_Ii = -(z ** 2 * Wi) / ((n - 1) * m2)

    # Conditional variance (Sokal 1998; conditional=TRUE branch of the R source)
    zi_over_m2 = z / m2
    var_Ii = (zi_over_m2 ** 2) * (n / (n - 2)) * (Wi2 - Wi ** 2 / (n - 1)) * (m2 - z ** 2 / (n - 1))
    var_Ii = np.maximum(var_Ii, 0.0)

    # z-scores and two-sided p-values
    z_score = np.where(var_Ii > 0, (Ii - E_Ii) / np.sqrt(var_Ii), 0.0)
    p_val = 2.0 * norm.sf(np.abs(z_score))
    return p_val, Ii

# Compute local Moran's I (conditional normal approximation)
p_val, Ii = local_moran_conditional_p(pop_points["pop"].values, w)

# The result contains Ii (local Moran's I) and p_val (conditional normal approximation p-value)
result_df = pd.DataFrame({"Ii": Ii, "p_val": p_val})

4.5 Classifying spatial types (LISA quadrants)

# Median population
pop_med = pop_points["pop"].median()

# Spatial lag
pop_lag = lag_spatial(w, pop_points["pop"].values)

# Classification (using the conditional normal approximation p-values)
sig = p_val < ANALYSIS_PARAMS["sig_level"]
self_type = np.where(pop_points["pop"].values > pop_med, "H", "L")
nb_type = np.where(pop_lag > pop_med, "H", "L")
cluster_type = np.where(
    sig & (self_type == "H") & (nb_type == "H"), "HH", "Other"
)

pop_points = pop_points.copy()
pop_points["p_value"] = p_val
pop_points["sig"] = sig
pop_points["pop_med"] = pop_med
pop_points["pop_lag"] = pop_lag
pop_points["self_type"] = self_type
pop_points["nb_type"] = nb_type
pop_points["cluster_type"] = cluster_type

pop_points["cluster_type"].value_counts()

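Only the HH type matters here (high own population, high neighboring population, and statistically significant). These cells represent genuine dense population clusters and are the candidate locations for city centers.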
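The three conditions can be seen at work on a handful of invented cells. In this sketch the population values, neighbor averages, and significance flags are all made up, and `classify` is a hypothetical helper; only the rule itself (above the median, lag above the median, significant) follows the text.

```python
from statistics import median

pop = [5, 8, 120, 150, 130, 7]
pop_lag = [6, 60, 80, 125, 78, 130]                    # invented neighbor averages
significant = [False, False, True, True, False, True]  # invented test outcomes

med = median(pop)


def classify(value, lag, sig):
    """Return "HH" only for significant cells whose value and lag both exceed the median."""
    if sig and value > med and lag > med:
        return "HH"
    return "Other"


labels = [classify(v, l, s) for v, l, s in zip(pop, pop_lag, significant)]
print(med, labels)
```

The fifth cell is high and has high neighbors but is not significant, and the last cell is significant but low itself, so neither is labeled HH.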

4.6 Extracting HH cells and clustering them into centers

# Select HH cells
hh_points = pop_points[pop_points["cluster_type"] == "HH"].reset_index(drop=True)
print(f"Number of HH cells: {len(hh_points)}")

# Cluster HH cells (Union-Find to identify connected components)
if len(hh_points) >= ANALYSIS_PARAMS["min_cells"]:
    coords_hh = np.array([(p.x, p.y) for p in hh_points.geometry])
    n = len(coords_hh)
    parent = list(range(n))

    def find(x):
        # Walk up to the root, shortening the path along the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry

    # Link every pair of HH cells within the neighbor distance
    tree = KDTree(coords_hh)
    pairs = tree.query_pairs(r=ANALYSIS_PARAMS["dist_nb"])
    for i, j in pairs:
        union(i, j)

    # Relabel the roots as consecutive cluster IDs starting from 1
    clusters = [find(i) for i in range(n)]
    unique_clusters = sorted(set(clusters))
    cluster_map = {c: idx + 1 for idx, c in enumerate(unique_clusters)}
    hh_points["cluster_id"] = [cluster_map[c] for c in clusters]
    print(f"Number of clusters: {hh_points['cluster_id'].nunique()}")
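The same connected-component logic can be tried on a toy example. The sketch below swaps KDTree.query_pairs for a brute-force pair search so that it runs with the standard library alone; the five coordinates are invented, and `label_components` is a hypothetical helper name.

```python
import math
from itertools import combinations


def label_components(coords, radius):
    """Give the same ID to points connected through chains of pairs within `radius`."""
    parent = list(range(len(coords)))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry

    # Brute-force stand-in for KDTree.query_pairs (fine for small inputs only).
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= radius:
            union(i, j)

    roots = [find(i) for i in range(len(coords))]
    ids = {root: k + 1 for k, root in enumerate(sorted(set(roots)))}
    return [ids[r] for r in roots]


# Three edge-adjacent cells in a row, then two adjacent cells far away.
coords = [(0, 0), (820, 0), (1640, 0), (8200, 0), (8200, 820)]
labels = label_components(coords, radius=1000)
print(labels)
```

The first and third points are 1640 m apart, beyond the radius, yet they end up in one cluster because both are linked to the second point. This chaining is what turns pairwise adjacency into contiguous centers.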

4.7 Filtering valid centers and computing the polycentricity index

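4.8 Visualization

The results above are drawn as a two-panel map: the left panel shows the city's population grid cells, and the right panel shows the HH clusters and centers identified by LISA.

fig, axes = plt.subplots(1, 2, figsize=(20, 9))

# Left panel: population cells (styling arguments are assumed)
pop_points.plot(ax=axes[0], column="pop", markersize=1, legend=True)
axes[0].set_title("Beijing 2020: population grid cells", fontsize=18)
axes[0].set_xlabel("X (m)", fontsize=14)
axes[0].set_ylabel("Y (m)", fontsize=14)
axes[0].tick_params(labelsize=12)

# Right panel: LISA classification
if len(hh_points) > 0 and "cluster_id" in hh_points.columns:
    # Non-HH cells
    other_points = pop_points[pop_points["cluster_type"] != "HH"]
    if len(other_points) > 0:
        other_points.plot(ax=axes[1], color="lightgray", markersize=1)
    # HH cells colored by cluster (styling arguments are assumed)
    hh_points.plot(ax=axes[1], column="cluster_id", categorical=True, markersize=2)
    axes[1].set_title("Beijing 2020: HH clusters and identified centers", fontsize=18)
else:
    pop_points.plot(
        column="cluster_type",
        ax=axes[1],
        markersize=1,
        legend=True,
    )
    axes[1].set_title("Beijing 2020: LISA classification", fontsize=18)

axes[1].set_xlabel("X (m)", fontsize=14)
axes[1].set_ylabel("Y (m)", fontsize=14)
axes[1].tick_params(labelsize=12)

plt.tight_layout()
plt.show()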

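[Figure: Beijing, 2020. Left: population grid cells. Right: HH clusters and centers identified by LISA.]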

5. Batch Parallel Computation (All Cities Nationwide × All Years)

5.1 Wrapping the core computation in a function

5.2 Indexing the yearly files and running the batch computation

# Iterate over all cities serially (for parallel runs, use the standalone batch_calculate.py script)
os.makedirs("res", exist_ok=True)

# Loop over all cities
for _, row in city_all.iterrows():
    code = row["市代码"]
    city_name = row["市"]
    output_file = f"res/{code}.csv"

    # Resume from checkpoint: skip cities that already have results
    if os.path.exists(output_file):
        print(f" Skipping {city_name} ({code}, already computed)")
        continue

    print(f"Computing: {city_name} ({code})")
    results = []
    city_geom = city_all[city_all["市代码"] == code].iloc[[0]]

    for year, pop_file in pop_files.items():
        res = calculate_polycentric(city_geom, pop_file)
        if res is not None:
            results.append(res)
        else:
            print(f" {year}: returned None")

    if results:
        pd.DataFrame(results).to_csv(output_file, index=False)
        print(f" {city_name} done, {len(results)} records saved")
    else:
        print(f" {city_name}: no valid result for any year")

print("======== All computations finished. Results are saved in the res/ folder ========")

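6. Improved Algorithm: Precomputing the Spatial Weights Matrix

6.1 Why is an improvement possible?

In the batch computation above, the spatial weights matrix is rebuilt for every year of the same city. In fact, the matrix depends only on where the grid cells fall inside the city boundary, not on the year. As long as the administrative boundary is fixed (this project uses the 2021 boundaries throughout) and every year's raster shares the same grid, the weights matrix of a city is identical across all years.

The weights matrices of all cities can therefore be computed and saved once, then simply loaded when each year's data is processed, which removes a large amount of repeated computation.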

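For a dataset with N cities and T years:

Method	Number of weights-matrix computations
Original method	N × T
Improved method	N (precomputed once)

With T = 25 (2000 to 2024), the improved method reduces the weights-matrix computation to 1/25 of the original.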
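The saving can be illustrated with a toy counter. In this sketch, `build_weights` and `compute_index` are stand-ins invented for illustration and do no spatial work; the optional `lw` argument mirrors the idea of passing in a precomputed weights object so that the build step is skipped.

```python
counter = {"builds": 0}


def build_weights(city):
    """Stand-in for an expensive weights-matrix construction."""
    counter["builds"] += 1
    return f"weights-for-{city}"


def compute_index(city, year, lw=None):
    """Stand-in for the per-year computation; builds weights only when none are passed."""
    if lw is None:
        lw = build_weights(city)
    return (city, year, lw)


cities = ["A", "B", "C"]   # N = 3
years = range(2000, 2025)  # T = 25

# Original approach: weights are rebuilt for every city-year pair.
counter["builds"] = 0
for c in cities:
    for y in years:
        compute_index(c, y)
naive_calls = counter["builds"]

# Improved approach: build once per city and reuse across years.
counter["builds"] = 0
for c in cities:
    lw = build_weights(c)
    for y in years:
        compute_index(c, y, lw=lw)
cached_calls = counter["builds"]

print(naive_calls, cached_calls)
```

With three cities and 25 years, the build step runs N × T = 75 times in the first loop and N = 3 times in the second.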

6.2 Precomputing and saving the weights matrices of all cities

import os
import pickle

# Create the directory for the saved weights matrices
os.makedirs("lwres", exist_ok=True)

# Compute the weights matrices of all cities serially
# (under reticulate, ProcessPoolExecutor cannot pickle functions defined in __main__,
# so the Rmd uses a serial loop; for parallel runs use the standalone batch_calculate.py script)
for _, row in city_all.iterrows():
    code = row["市代码"]
    city_name = row["市"]
    lw_path = f"lwres/{code}.pkl"

    if os.path.exists(lw_path):
        print(f" Skipping {city_name} ({code}, weights matrix already exists)")
        continue

    # Use the already loaded city_all instead of reading the shp again
    city_geom = city_all[city_all["市代码"] == code].iloc[[0]]

    # Use the raster of any single year (here 2020) to fix the cell locations
    points = clip_raster_to_city(
        str(PATH["pop_tif"] / "2020.tif"),
        city_geom
    )
    if len(points) == 0:
        print(f" {city_name} ({code}) has no valid cells, skipping")
        continue

    w = build_spatial_weights(points)

    # Save the weights matrix as a pickle file
    with open(lw_path, "wb") as f:
        pickle.dump(w, f)
    print(f" {city_name} ({code}) weights matrix saved")

print("======== Weights matrices for all cities computed and saved in lwres/ ========")
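The skip-if-exists caching pattern used above can be tried in isolation. This sketch works in a temporary directory and pickles a plain dictionary in place of a real weights object; `load_or_build` is a hypothetical helper, and the file name is a made-up city code.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Load a pickled object from `path`, or build it, save it, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f), "loaded"
    obj = build()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj, "built"


def toy_weights():
    """Stand-in for a weights object: neighbor lists of three cells."""
    return {0: [1], 1: [0, 2], 2: [1]}


with tempfile.TemporaryDirectory() as tmp:
    lw_path = os.path.join(tmp, "000000.pkl")  # made-up city code
    first, status1 = load_or_build(lw_path, toy_weights)
    second, status2 = load_or_build(lw_path, toy_weights)

print(status1, status2, first == second)
```

Pickle files should only be loaded from trusted sources, and a pickled library object may fail to load after that library is upgraded, so it is worth keeping the script that regenerates the cache.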

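6.3 Improved computation function using the precomputed weights matrix

# Improved core function: accepts an externally supplied lw instead of recomputing it internally
# See calculate_polycentric(city_geom, pop_file, lw=...) in calculate_polycentric.py
# When the lw argument is passed, the function skips the weights-matrix construction step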

6.4 Running the full batch with the improved algorithm

6.5 Merging all results and saving

# Merge all CSV results
dfs = []
for f in sorted(glob.glob("resa/*.csv")):
    try:
        dfs.append(pd.read_csv(f))
    except Exception as e:
        print(f" Error reading {f}: {e}")

if len(dfs) == 0:
    print("⚠️ No result files found; run the computation steps above first")
else:
    df1 = pd.concat(dfs, ignore_index=True)
    df1 = df1.sort_values(["city_code", "year"]).reset_index(drop=True)

    # Save in dta format
    # version=118 (Stata 14+) supports UTF-8, so Chinese characters are written correctly
    variable_labels = {
        "city_code": "行政区划代码",
        "center_count": "有效中心数量",
        "main_pop": "主中心人口",
        "sub_pop": "次中心总人口",
        "polycentric_index": "多中心指数",
    }
    df1.to_stata(
        "sample_polycentric_index_wangqiao.dta",
        write_index=False,
        version=118,
        variable_labels=variable_labels,
    )