Python Web Scraping in Practice: Collecting and Analyzing Weather Data for Quanzhou - 代码聚汇网

第三世界的妖孽

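1. Project Overview and Core Approach

Last year, while working on a climate data analysis project, I needed a full year of weather data for Quanzhou. The weather data APIs I tried were either expensive or incomplete, so I decided to scrape the data directly from a public weather website with Python and visualize it myself. This approach costs nothing and gives access to the raw, first-hand data.

The project has three core stages:

Data collection: fetch and parse pages with requests + lxml
Data storage: persist the data both as CSV and in SQLite
Data analysis: build multi-dimensional visualizations with pandas + pyecharts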

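2. Data Collection in Detail

2.1 Analyzing the Target Site

The source is the "历史天气网" site (lishi.tianqi.com). For Quanzhou, each monthly data page follows this URL pattern:

https://lishi.tianqi.com/quanzhou/[YYYYMM].html

Here YYYYMM is the year and month, so 202201 means January 2022.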
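
The URL pattern above can be expanded into one URL per month. The following is a minimal sketch; the names BASE_URL and monthly_urls are illustrative and do not come from the original project.

```python
# Template for the monthly history page; {ym} is filled with YYYYMM.
BASE_URL = "https://lishi.tianqi.com/quanzhou/{ym}.html"


def monthly_urls(year: int) -> list[str]:
    """Return the 12 monthly history-page URLs for the given year."""
    return [BASE_URL.format(ym=f"{year}{month:02d}") for month in range(1, 13)]


urls = monthly_urls(2022)
print(urls[0])    # first month of the year
print(len(urls))  # one URL per month
```

Zero-padding the month with `:02d` matters here: without it, January would become 20221 instead of 202201.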

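Tip: check the target site's robots.txt before you choose it. When I checked, this site did not declare a ban on crawlers, but you should still throttle your requests; leave at least 1 second between them.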
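
The robots.txt check can be automated with the standard library's urllib.robotparser. The sketch below parses a made-up rule set so that it runs offline. SAMPLE_ROBOTS and the user agent name "my-weather-bot" are assumptions for illustration and say nothing about what the real site's robots.txt contains.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for demonstration only, NOT the site's actual robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

page = "https://lishi.tianqi.com/quanzhou/202201.html"
allowed = parser.can_fetch("my-weather-bot", page)
blocked = parser.can_fetch("my-weather-bot", "https://lishi.tianqi.com/private/x")
delay = parser.crawl_delay("my-weather-bot")  # None when no Crawl-delay is declared

print(allowed, blocked, delay)
```

Against a live site you would call `parser.set_url(".../robots.txt")` followed by `parser.read()` instead of `parse()`, then use the reported crawl delay (if any) as a lower bound for the pause between requests.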

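2.2 Key Parsing Techniques

HTML parsing uses the etree module from lxml. The core calls are:

from lxml import etree

# resp is the requests.Response object returned by requests.get(...)
resp_html = etree.HTML(resp.text)  # parse the response text into an element tree
resp_list = resp_html.xpath("//ul[@class='thrui']/li")  # locate the per-day nodes with XPath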

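XPath selector tips:

// matches nodes at any depth; at the start of an expression it searches the whole document
[@class='xxx'] filters by the class attribute (the full attribute value must match exactly)
/ separates path steps, each step selecting direct children
text() selects a node's text content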
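
To try these selectors without hitting the network, here is a small sketch on a made-up HTML fragment. It deliberately swaps lxml for the standard library's xml.etree.ElementTree so that it runs with no extra install. ElementTree supports only a subset of XPath: paths must start with `.` and there is no `text()` step, so the sketch reads `.text` instead. The fragment and its values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed fragment that imitates the list structure described above.
SAMPLE = """
<html><body>
  <ul class="thrui">
    <li><div>2022-01-01 星期六</div><div>18℃</div><div>11℃</div><div>多云</div></li>
    <li><div>2022-01-02 星期日</div><div>19℃</div><div>12℃</div><div>晴</div></li>
  </ul>
  <ul class="other"><li><div>ignored</div></li></ul>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# .// searches at any depth; the predicate keeps only the ul with class="thrui".
rows = root.findall(".//ul[@class='thrui']/li")

# div[1] is the first div child (XPath positions start at 1, not 0).
first_date = rows[0].find("./div[1]").text.split()[0]
conditions = [li.find("./div[4]").text for li in rows]

print(len(rows), first_date, conditions)
```

With lxml the same expressions work without the leading dot at document level, and `text()` can be appended to return strings directly.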

2.3 The Complete Scraping Routine

import requests
from lxml import etree

def get_weather(url):
    """Fetch one monthly page and return a list of per-day weather dicts."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    resp = requests.get(url, headers=headers)
    resp.encoding = 'utf-8'  # set the encoding explicitly to avoid garbled text
    
    weather_info = []
    resp_html = etree.HTML(resp.text)
    days = resp_html.xpath("//ul[@class='thrui']/li")
    
    for day in days:
        # keep only the first whitespace-separated token of the first cell (the date)
        date = day.xpath("./div[1]/text()")[0].split()[0]
        high_temp = day.xpath("./div[2]/text()")[0].replace('℃', '')
        low_temp = day.xpath("./div[3]/text()")[0].replace('℃', '')
        condition = day.xpath("./div[4]/text()")[0]
        
        weather_info.append({
            'date': date,
            'high': high_temp,
            'low': low_temp,
            'weather': condition
        })
    
    return weather_info
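
The field lookups in get_weather index `[0]` directly, which raises IndexError when a cell is missing. One possible way to cope with missing data is sketched below. The helpers first_or_none and parse_temp are hypothetical names introduced here and are not part of the original code.

```python
import re
from typing import Optional, Sequence


def first_or_none(values: Sequence[str]) -> Optional[str]:
    """Return the first string of an XPath result list, stripped, or None if empty."""
    return values[0].strip() if values else None


def parse_temp(text: Optional[str]) -> Optional[float]:
    """Extract the number from strings such as '18℃' or '-3℃'; None when absent."""
    if not text:
        return None
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


print(parse_temp(first_or_none(["18℃"])))  # a normal cell
print(parse_temp(first_or_none([])))       # a missing cell
```

Inside the loop, `parse_temp(first_or_none(day.xpath("./div[2]/text()")))` would then yield a float or None instead of raising, and rows with None values can be filtered or filled later during cleaning.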

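Note: a real project needs error handling, such as retrying failed requests and coping with missing fields. I usually harden the request inside get_weather like this:

try:
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # raise an exception on 4xx/5xx responses
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
    return None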
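
The snippet above gives up after the first failure. A retry with exponential backoff could look like the following sketch. The retry helper is a hypothetical, generic wrapper (it is not a requests feature), and the demo uses a simulated flaky function and a recording stand-in for sleep so that it runs instantly and offline.

```python
import time


def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep, exceptions=(Exception,)):
    """Call func() until it succeeds or attempts run out, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)


# Demo: a function that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


delays = []  # records the requested pauses instead of actually sleeping
result = retry(flaky, attempts=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

For the scraper, the wrapped call would be something like `retry(lambda: requests.get(url, headers=headers, timeout=10), exceptions=(requests.exceptions.RequestException,))`, so that only network errors trigger a retry.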

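3. Data Storage

3.1 Saving to CSV

import csv

def save_to_csv(data, filename):
    # data is a list of months, each month a list of per-day dicts from get_weather
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        # header: date, high temperature, low temperature, weather
        # (kept in Chinese because the analysis code reads the columns by these names)
        writer.writerow(['日期', '最高气温', '最低气温', '天气'])
        for month in data:
            for day in month:
                writer.writerow([day['date'], day['high'], day['low'], day['weather']])

CSV is simple to work with but slow to query, so it suits small datasets.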
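
To make the expected input shape concrete, here is a small round trip: nested month lists are written in the same layout as save_to_csv and read back with csv.DictReader. The two rows are invented sample values, and the file goes to a temporary directory.

```python
import csv
import os
import tempfile

# Invented sample: a list of months, each a list of per-day dicts.
months = [
    [{'date': '2022-01-01', 'high': '18', 'low': '11', 'weather': '多云'}],
    [{'date': '2022-02-01', 'high': '16', 'low': '10', 'weather': '小雨'}],
]

path = os.path.join(tempfile.mkdtemp(), 'weather.csv')

# Write: one header row, then one row per day across all months.
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['日期', '最高气温', '最低气温', '天气'])
    for month in months:
        for day in month:
            writer.writerow([day['date'], day['high'], day['low'], day['weather']])

# Read back: DictReader keys each row by the header names.
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

print(len(rows), rows[0]['日期'], rows[1]['天气'])
```

Passing `newline=''` on both open calls follows the csv module's recommendation and avoids blank lines between rows on Windows.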

3.2 SQLite

For scenarios that need more complex queries, I recommend SQLite:

import sqlite3

def init_db():
    conn = sqlite3.connect('weather.db')
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS weather (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            date TEXT NOT NULL,
            high_temp REAL,
            low_temp REAL,
            condition TEXT,
            UNIQUE(date)  -- prevent duplicate inserts (SQL comments use --, not #)
        )
    ''')
    conn.commit()
    conn.close()

def save_to_db(data):
    # data is a list of months, each month a list of per-day dicts from get_weather
    conn = sqlite3.connect('weather.db')
    cursor = conn.cursor()
    
    for month in data:
        for day in month:
            try:
                cursor.execute('''
                    INSERT INTO weather (date, high_temp, low_temp, condition)
                    VALUES (?, ?, ?, ?)
                ''', (day['date'], day['high'], day['low'], day['weather']))
            except sqlite3.IntegrityError:
                # the UNIQUE(date) constraint rejected a row that is already stored
                print(f"Row already exists: {day['date']}")
    
    conn.commit()
    conn.close()

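Advantages of the database approach:

Supports complex queries (for example, filtering by temperature range)
Better data consistency (the UNIQUE constraint rejects duplicate dates)
Easier to extend for later analysis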
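
As an illustration of the first two points, the sketch below runs a temperature-range query against an in-memory database with the same columns. The rows are invented sample values, and `INSERT OR IGNORE` is shown as an alternative to catching IntegrityError; it is not what the original save_to_db does.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway database for the demo
conn.execute('''
    CREATE TABLE weather (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        date TEXT NOT NULL UNIQUE,
        high_temp REAL,
        low_temp REAL,
        condition TEXT
    )
''')

rows = [
    ('2022-07-01', 33, 26, '晴'),
    ('2022-07-02', 29, 25, '雷阵雨'),
    ('2022-01-15', 17, 10, '多云'),
    ('2022-07-01', 33, 26, '晴'),  # duplicate date, silently skipped by OR IGNORE
]
conn.executemany(
    'INSERT OR IGNORE INTO weather (date, high_temp, low_temp, condition) '
    'VALUES (?, ?, ?, ?)',
    rows,
)

# Range query: days whose high temperature lies between 30 and 40 degrees.
hot_days = conn.execute(
    'SELECT date, high_temp FROM weather '
    'WHERE high_temp BETWEEN ? AND ? ORDER BY date',
    (30, 40),
).fetchall()
total = conn.execute('SELECT COUNT(*) FROM weather').fetchone()[0]
conn.close()

print(hot_days, total)
```

The same query against a CSV file would require loading and filtering every row in Python, which is the query cost the CSV section refers to.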

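4. Data Analysis and Visualization

4.1 Preprocessing

import pandas as pd

# Load the CSV data; the column names are the Chinese headers written by save_to_csv
df = pd.read_csv('weather.csv', parse_dates=['日期'])

# Add a month column
df['month'] = df['日期'].dt.month

# Count how often each weather condition occurs in each month
weather_freq = df.groupby(['month', '天气']).size().unstack().fillna(0)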
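
To see what the groupby/unstack step produces, here is the same pipeline on a tiny hand-made DataFrame (four invented rows). It assumes pandas is installed.

```python
import pandas as pd

# Invented sample data with the same column names as the CSV.
df = pd.DataFrame({
    '日期': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03', '2022-02-01']),
    '天气': ['晴', '多云', '晴', '小雨'],
})
df['month'] = df['日期'].dt.month

# size() counts rows per (month, weather) pair; unstack() pivots the weather
# labels into columns; fillna(0) replaces the pairs that never occurred.
weather_freq = df.groupby(['month', '天气']).size().unstack().fillna(0)
print(weather_freq)
```

The result has one row per month and one column per weather label. Because the missing pairs pass through NaN before being filled, the counts come out as floats; `unstack(fill_value=0)` is an alternative that keeps them as integers.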

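4.2 Animated Bar Chart

Use pyecharts to build a timeline chart that cycles through the months:

from pyecharts import options as opts
from pyecharts.charts import Bar, Timeline

timeline = Timeline()
for month in range(1, 13):
    # weather counts for this month, sorted ascending
    month_data = weather_freq.loc[month].sort_values()
    bar = (
        Bar()
        .add_xaxis(month_data.index.tolist())
        .add_yaxis("Days", month_data.values.tolist())
        .reversal_axis()  # swap the axes to get horizontal bars
        .set_global_opts(
            title_opts=opts.TitleOpts(title=f"Quanzhou weather distribution, 2022-{month:02d}"),
            xaxis_opts=opts.AxisOpts(name="Days"),
            yaxis_opts=opts.AxisOpts(name="Weather type")
        )
    )
    timeline.add(bar, f"Month {month}")

timeline.render("weather_bar.html")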

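4.3 Temperature Trends

from pyecharts import options as opts
from pyecharts.charts import Line

# Compute the monthly mean, max and min of the daily high and low temperatures
monthly_stats = df.groupby('month').agg({
    '最高气温': ['mean', 'max', 'min'],
    '最低气温': ['mean', 'max', 'min']
})

# Plot the temperature lines
line = (
    Line()
    # category-axis labels should be strings; integer labels can render incorrectly
    .add_xaxis([str(m) for m in monthly_stats.index])
    .add_yaxis("Mean daily high", monthly_stats['最高气温']['mean'].round(1).tolist())
    .add_yaxis("Mean daily low", monthly_stats['最低气温']['mean'].round(1).tolist())
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Quanzhou monthly mean temperatures, 2022"),
        yaxis_opts=opts.AxisOpts(name="Temperature (℃)"),
        tooltip_opts=opts.TooltipOpts(trigger="axis")
    )
)
line.render("temperature_trend.html")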

4.4 Weather Word Cloud

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join all weather descriptions into one space-separated string
text = ' '.join(df['天气'].dropna())

# Build the word cloud
wc = WordCloud(
    font_path='msyh.ttc',  # Chinese text needs a font with CJK glyphs; this is Microsoft YaHei on Windows
    width=800,
    height=400,
    background_color='white'
).generate(text)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.savefig('weather_wordcloud.png', dpi=300, bbox_inches='tight')

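5. Optimization and Lessons Learned

5.1 Scraper Tuning

Request headers: besides User-Agent, consider sending Referer and similar headers so that the request resembles a normal browser visit

headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Referer': 'https://lishi.tianqi.com/',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}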

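Proxy pool: for large-scale crawls, consider routing requests through proxies

# proxy_ip:port is a placeholder for a real proxy address
proxies = {
    'http': 'http://proxy_ip:port',
    # the key is the scheme of the target URL; the proxy itself is usually
    # reached over plain HTTP, so use https:// here only if it speaks TLS
    'https': 'http://proxy_ip:port'
}
resp = requests.get(url, headers=headers, proxies=proxies)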

Rate limiting: use time.sleep between requests to avoid getting blocked

import time
import random

time.sleep(random.uniform(1, 3))  # wait a random 1 to 3 seconds
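
A bare sleep always waits the full interval, even when parsing already took a while. One possible refinement is a small throttle that only sleeps for whatever is left of the gap. The Throttle class below is a hypothetical helper, and the demo injects a frozen clock and a recording sleep function so that it finishes instantly.

```python
import random
import time


class Throttle:
    """Enforce a randomized minimum gap between consecutive calls to wait()."""

    def __init__(self, min_delay=1.0, max_delay=3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            gap = random.uniform(self.min_delay, self.max_delay)
            remaining = gap - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


# Demo with a frozen clock; pauses are recorded instead of actually slept.
slept = []
throttle = Throttle(clock=lambda: 0.0, sleep=slept.append)
throttle.wait()  # first call: nothing to wait for
throttle.wait()  # second call: sleeps for the full random gap
print(slept)
```

In the scraper, a single `Throttle()` instance would be created once and `throttle.wait()` called immediately before each `requests.get`.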

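5.2 Practical Analysis Tips

Data cleaning: handle outliers and missing values

# Convert the temperature columns to numbers; unparseable values become NaN
df['最高气温'] = pd.to_numeric(df['最高气温'], errors='coerce')
df['最低气温'] = pd.to_numeric(df['最低气温'], errors='coerce')

# Forward-fill missing values from the previous day
# (fillna(method='ffill') is deprecated in recent pandas releases)
df.ffill(inplace=True)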

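Richer charts: use seaborn for more expressive plots

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.boxplot(x='month', y='最高气温', data=df)  # distribution of daily highs per month
plt.title('Quanzhou daily high temperature by month, 2022')
plt.savefig('temp_boxplot.png')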

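Interactive analysis: use Jupyter Notebook to test ideas quickly

# Display charts inline (IPython magic, works only inside Jupyter)
%matplotlib inline
df.groupby('天气').size().plot.pie(autopct='%1.1f%%')  # share of each weather condition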

6. Suggested Project Layout

A well-organized scraping project could be laid out as follows:

weather_analysis/
├── spiders/
│   ├── __init__.py
│   ├── weather_spider.py  # core scraping logic
│   └── utils.py          # shared helper functions
├── data/
│   ├── raw/              # raw data
│   └── processed/        # processed data
├── analysis/
│   ├── visualization.py  # visualization code
│   └── stats.py          # statistical analysis
├── config.py             # configuration
├── requirements.txt      # dependency list
└── README.md             # project description

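List the dependencies in requirements.txt. The word cloud and box plot examples additionally need matplotlib and seaborn, which this list does not pin:

requests==2.28.1
lxml==4.9.1
pandas==1.5.0
pyecharts==1.9.1
wordcloud==1.8.2.2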

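7. Legal and Ethical Considerations

Respect robots.txt: check the target site's crawler policy before scraping
Limit request frequency: avoid putting excessive load on the target server
Data usage limits: mind the copyright on the data and avoid commercial use
Privacy: do not scrape personal information or other sensitive data
Storage security: back up important data and consider encrypting it at rest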

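The most pleasant surprise of this project was seeing how clearly Quanzhou's rainfall concentrates between May and September. That pattern has many practical uses, such as planning trips or scheduling farm work. The exercise showed me how much useful insight comes from combining data scraping with analysis.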