I recently wrapped up a rather interesting hands-on project: a data crawling and visualization system for Fanqie Novel. The system addresses data collection and analysis needs in the web-fiction space: an automated crawler scrapes book information, author data and reading metrics from the Fanqie Novel platform, and after cleaning and processing, the results are presented as multi-dimensional charts. As a full-stack project spanning backend development, data collection and the front end, it uses the classic Spring Boot + Vue combination. Below I break down the whole implementation.
The core value of the system lies in three things: first, stable data collection from a novel site with dynamic anti-crawling mechanisms; second, a complete data-processing pipeline that turns raw HTML into structured metrics; and third, an interactive visualization interface that lets operations staff grasp hotness trends, genre distribution and other key signals at a glance. It should be especially useful to content-platform operators, web-fiction researchers, and fellow developers learning full-stack development.
The whole system uses a front-end/back-end separated architecture, a choice settled on after weighing the trade-offs: Spring Boot on the backend and Vue.js on the frontend.

For production, the deployment sits behind Nginx in front of a primary/backup pair of backend services:
```nginx
# Example Nginx configuration
upstream backend {
    server 127.0.0.1:8080 weight=5;
    server 127.0.0.1:8081 backup;
}

server {
    listen 80;
    server_name analytics.example.com;

    location /api {
        proxy_pass http://backend;
        proxy_set_header Host $host;
    }

    location / {
        root /var/www/vue-dist;
        try_files $uri $uri/ /index.html;
    }
}
```
For storage, MySQL serves as the primary database for structured metadata, while Redis caches query results for popular books. One thing to watch: whether you use JPA or MyBatis, the connection pool parameters need sensible values:
```yaml
# application.yml snippet
spring:
  datasource:
    hikari:
      maximum-pool-size: 10
      connection-timeout: 30000
      idle-timeout: 600000
  redis:
    lettuce:
      pool:
        max-active: 8
        max-idle: 8
```
As a mainstream platform, Fanqie Novel has a fairly mature set of anti-crawling measures. We keep the crawler stable with a combination of strategies; the first piece is randomizing request headers:
```java
// Utility class that generates randomized request headers
public class HeaderGenerator {

    private static final String[] USER_AGENTS = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3...)",
        "Mozilla/5.0 (Linux; Android 8.0.0...)",
    };

    public static Map<String, String> randomHeaders() {
        Map<String, String> headers = new HashMap<>();
        // Pick a random User-Agent to reduce fingerprinting
        headers.put("User-Agent", USER_AGENTS[ThreadLocalRandom.current().nextInt(USER_AGENTS.length)]);
        headers.put("Accept-Language", "zh-CN,zh;q=0.9");
        headers.put("Referer", "https://fanqienovel.com/");
        return headers;
    }
}
```
```java
// Maintains the proxy pool and periodically weeds out dead proxies
public class ProxyManager {

    private final List<Proxy> availableProxies = new CopyOnWriteArrayList<>();
    private final ScheduledExecutorService checkerThread;

    public ProxyManager() {
        // Pull the initial proxy list from the provider's API
        refreshProxies();
        // Re-check proxy availability every 10 minutes
        checkerThread = Executors.newSingleThreadScheduledExecutor();
        checkerThread.scheduleAtFixedRate(this::healthCheck, 10, 10, TimeUnit.MINUTES);
    }

    public Proxy getRandomProxy() {
        if (availableProxies.isEmpty()) {
            throw new IllegalStateException("No usable proxy in the pool");
        }
        return availableProxies.get(ThreadLocalRandom.current().nextInt(availableProxies.size()));
    }

    private void refreshProxies() {
        // Fetch a fresh proxy list from the provider and replace the pool (omitted here)
    }

    private void healthCheck() {
        // Probe each proxy and remove the ones that fail (omitted here)
    }
}
```
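Putting the two pieces together, each request goes out with randomized headers, a proxy drawn from the pool, and a short random delay between requests. A minimal sketch using the JDK's `HttpURLConnection`; the `PageFetcher` class and its timeout values are illustrative assumptions, not the project's actual fetcher:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class PageFetcher {

    private final ProxyManager proxyManager = new ProxyManager();

    // Fetch one page with randomized headers, a rotating proxy and a polite random delay
    public String fetch(String url) throws IOException, InterruptedException {
        // Sleep 1-3 seconds so request timing doesn't look machine-generated
        Thread.sleep(ThreadLocalRandom.current().nextLong(1000, 3000));

        Proxy proxy = proxyManager.getRandomProxy();
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection(proxy);
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(10000);
        for (Map.Entry<String, String> header : HeaderGenerator.randomHeaders().entrySet()) {
            conn.setRequestProperty(header.getKey(), header.getValue());
        }

        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```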
To raise crawling throughput, the crawler runs as a distributed master/worker setup: the master schedules and dispatches tasks, and the worker nodes pull them and do the actual fetching.
The key task-dispatching code on the master:
```java
// Runs on the master node: moves pending books into the shared crawl queue
@Scheduled(fixedDelay = 5000)
public void scheduleTasks() {
    // Fetch up to 100 book IDs that are still waiting to be crawled
    List<BookTask> tasks = bookMapper.selectPendingTasks(100);
    // Push each task onto the Redis list consumed by the workers
    tasks.forEach(task -> {
        String taskJson = JSON.toJSONString(task);
        redisTemplate.opsForList().rightPush("crawl_queue", taskJson);
        bookMapper.updateTaskStatus(task.getId(), "QUEUED");
    });
}
```
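The worker side isn't shown in the original; a minimal sketch of how a worker node could block on the same Redis list, deserialize each task and hand it to the crawl service. The `CrawlWorker` class and the `crawlService.crawlBook(...)` call are illustrative assumptions:

```java
import com.alibaba.fastjson.JSON;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

import java.util.concurrent.TimeUnit;

// Runs on every worker node: pulls tasks from the shared queue and crawls them one by one
@Component
public class CrawlWorker implements CommandLineRunner {

    @Autowired
    private StringRedisTemplate redisTemplate;

    @Autowired
    private CrawlService crawlService;   // assumed service that performs the actual fetch/parse

    @Override
    public void run(String... args) {
        while (!Thread.currentThread().isInterrupted()) {
            // Blocking pop with a timeout so the loop stays responsive to shutdown
            String taskJson = redisTemplate.opsForList()
                    .leftPop("crawl_queue", 5, TimeUnit.SECONDS);
            if (taskJson == null) {
                continue;   // queue was empty during this poll window
            }
            BookTask task = JSON.parseObject(taskJson, BookTask.class);
            crawlService.crawlBook(task);   // hypothetical method name
        }
    }
}
```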
The raw HTML has to pass through several processing layers before it is usable for analysis:
```java
// Strip the markup from the crawled page and collapse runs of whitespace
String cleanText = Jsoup.parse(rawHtml)
        .select("div.content")
        .text()
        .replaceAll("\\s+", " ");
```
```xml
<!-- pom.xml dependency: HanLP for Chinese word segmentation -->
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.8.4</version>
</dependency>
```
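The original doesn't show how HanLP is invoked at this stage; as a rough sketch, segmenting titles and intro text and keeping only noun-like terms for later keyword statistics could look like this (the `extractNouns` helper is purely illustrative):

```java
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;
import java.util.stream.Collectors;

public class SegmentationDemo {

    // Segment a piece of text and keep only noun-like words for keyword counting
    static List<String> extractNouns(String text) {
        List<Term> terms = HanLP.segment(text);              // default tokenizer in the portable jar
        return terms.stream()
                .filter(term -> term.nature.startsWith("n")) // n, nr, ns, nz... are noun tags
                .map(term -> term.word)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(extractNouns("都市修仙题材的网络小说近期热度上升"));
    }
}
```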
```sql
-- Hotness score formula: weighted sum of four normalized engagement metrics
UPDATE book_stats
SET hot_score = (
    0.4 * read_count     / max_read_count +
    0.3 * comment_count  / max_comment_count +
    0.2 * reward_amount  / max_reward_amount +
    0.1 * bookmark_count / max_bookmark_count
) * 100;
```
Daily data refreshes are driven by Spring's @Scheduled:
```java
@Scheduled(cron = "0 0 3 * * ?") // runs every day at 03:00
public void dailyUpdate() {
    log.info("Starting the daily data update job");
    try {
        crawlService.incrementalCrawl();      // crawl only books updated since the last run
        dataService.calculateDailyMetrics();  // recompute hot_score and daily aggregates
        backupService.createSnapshot();       // snapshot the data before the next cycle
    } catch (Exception e) {
        alertService.sendAlert("Daily job failed", e.getMessage());
    }
}
```
On the front end, the dynamic charts are built with Vue-ECharts. A few practical tips:
```javascript
// Keep charts responsive: re-render on window resize and clean up the listener
mounted() {
  window.addEventListener('resize', this.handleResize)
  this.initChart()
},
beforeDestroy() {
  // avoid leaking the listener when the component is destroyed
  window.removeEventListener('resize', this.handleResize)
},
methods: {
  handleResize() {
    this.chart && this.chart.resize()
  }
}
```
```javascript
// Large time series: let ECharts downsample and render incrementally
option = {
  dataset: {
    source: rawData,
    dimensions: ['date', 'value']
  },
  series: {
    type: 'line',
    sampling: 'lttb',   // largest-triangle-three-buckets downsampling
    progressive: 1000   // incremental rendering in chunks of 1000 points
  }
}
```
```javascript
// Dual y-axis line chart: read counts on the left axis, year-over-year change on the right
const option = {
  tooltip: {
    trigger: 'axis',
    formatter: params => {
      return `${params[0].axisValue}<br/>
              Reads: ${params[0].data.toLocaleString()}<br/>
              YoY: ${params[1].data}%`
    }
  },
  xAxis: { type: 'category' },
  yAxis: [{ type: 'value' }, {
    type: 'value',
    axisLabel: { formatter: '{value}%' }
  }],
  series: [
    { type: 'line', showSymbol: false },
    { type: 'line', yAxisIndex: 1 }
  ]
}
```
```javascript
// Genre distribution as a Nightingale (rose) chart
const option = {
  series: [{
    type: 'pie',
    radius: ['30%', '70%'],
    roseType: 'radius',
    label: {
      formatter: '{b}: {c} ({d}%)'
    },
    data: genreData
  }]
}
```
A multi-level cache keeps response times down:
```java
// Level 1: in-process cache via Spring Cache, keyed by the ranking type
@Cacheable(value = "bookRank", key = "#type")
public List<BookVO> getRankingList(String type) {
    return bookMapper.selectRanking(type);
}
```
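For the annotation above to act as the local level of the cache, a CacheManager has to be registered; the original doesn't say which implementation backs it, so this is a sketch under the assumption that Caffeine is used:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.caffeine.CaffeineCacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.util.concurrent.TimeUnit;

@Configuration
@EnableCaching
public class CacheConfig {

    @Bean
    public CacheManager cacheManager() {
        // Small, short-lived local cache so ranking queries survive request bursts
        CaffeineCacheManager manager = new CaffeineCacheManager("bookRank");
        manager.setCaffeine(Caffeine.newBuilder()
                .maximumSize(1_000)
                .expireAfterWrite(5, TimeUnit.MINUTES));
        return manager;
    }
}
```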
```java
// Level 2: Redis cache for individual book detail pages
public BookDetail getBookDetail(Long bookId) {
    String cacheKey = "book:" + bookId;
    String cached = redisTemplate.opsForValue().get(cacheKey);
    if (cached != null) {
        return JSON.parseObject(cached, BookDetail.class);
    }
    // Cache miss: load from MySQL and keep the result for 30 minutes
    BookDetail detail = bookMapper.selectDetail(bookId);
    redisTemplate.opsForValue().set(cacheKey,
            JSON.toJSONString(detail), 30, TimeUnit.MINUTES);
    return detail;
}
```
On the MySQL side, several optimizations were applied:
```sql
-- Composite index so per-book, per-day lookups can use the index
ALTER TABLE chapter_stats
    ADD INDEX idx_book_date (book_id, stat_date);
```
```sql
-- Wasteful: fetches every column
SELECT * FROM books WHERE category = 'fantasy' LIMIT 100;

-- Better: fetch only the columns the page needs
SELECT id, title, author FROM books
WHERE category = 'fantasy' LIMIT 100;
```
```java
// Read logs are sharded into monthly tables; the month suffix is filled in at query time
@TableName("read_log_#{#month}")
public class ReadLog {
    @TableId(type = IdType.AUTO)
    private Long id;
    private Long userId;
    private Long bookId;
    // other fields...
}
```
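MyBatis-Plus does not expand the `#{#month}` placeholder in `@TableName` by itself; the usual way to get a month-suffixed table name is a dynamic table-name interceptor. A sketch of that wiring, assuming MyBatis-Plus 3.4+ is on the classpath:

```java
import com.baomidou.mybatisplus.extension.plugins.MybatisPlusInterceptor;
import com.baomidou.mybatisplus.extension.plugins.inner.DynamicTableNameInnerInterceptor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

@Configuration
public class ShardingConfig {

    @Bean
    public MybatisPlusInterceptor mybatisPlusInterceptor() {
        // Rewrite "read_log" to "read_log_yyyyMM" whenever a statement touches it
        DynamicTableNameInnerInterceptor dynamic = new DynamicTableNameInnerInterceptor();
        dynamic.setTableNameHandler((sql, tableName) ->
                "read_log".equals(tableName)
                        ? tableName + "_" + YearMonth.now().format(DateTimeFormatter.ofPattern("yyyyMM"))
                        : tableName);

        MybatisPlusInterceptor interceptor = new MybatisPlusInterceptor();
        interceptor.addInnerInterceptor(dynamic);
        return interceptor;
    }
}
```

With this approach the entity would map to the plain `read_log` name and the interceptor appends the month suffix.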
```mermaid
sequenceDiagram
    participant Client
    participant Server
    Client->>Server: Login request (username/password)
    Server-->>Client: JWT token
    Client->>Server: API call with the token attached
    Server->>Server: Validate the token
    Server-->>Client: Requested data
```
And the corresponding filter implementation:
```java
public class JwtFilter extends OncePerRequestFilter {

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        // Strip the "Bearer " prefix if the client sends one
        String token = request.getHeader("Authorization");
        if (token != null && token.startsWith("Bearer ")) {
            token = token.substring(7);
        }
        try {
            Claims claims = Jwts.parser()
                    .setSigningKey(secretKey)
                    .parseClaimsJws(token)
                    .getBody();
            String username = claims.getSubject();
            // Put the authenticated user into the security context here
        } catch (Exception e) {
            // Invalid or expired token: reject the request and stop the chain
            response.sendError(HttpStatus.UNAUTHORIZED.value());
            return;
        }
        chain.doFilter(request, response);
    }
}
```
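The first two steps of the diagram (login and token issuance) aren't shown in the original; a minimal sketch of the issuing side using the same pre-0.10 jjwt API as the filter above (class name, method name and the two-hour lifetime are illustrative assumptions):

```java
import io.jsonwebtoken.Jwts;
import io.jsonwebtoken.SignatureAlgorithm;

import java.util.Date;

public class TokenIssuer {

    private final String secretKey;

    public TokenIssuer(String secretKey) {
        this.secretKey = secretKey;
    }

    // Called after the username/password check succeeds
    public String issueToken(String username) {
        Date now = new Date();
        Date expiry = new Date(now.getTime() + 2 * 60 * 60 * 1000L); // assumed 2-hour validity
        return Jwts.builder()
                .setSubject(username)
                .setIssuedAt(now)
                .setExpiration(expiry)
                .signWith(SignatureAlgorithm.HS256, secretKey)
                .compact();
    }
}
```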
Sensitive fields are encrypted with AES before being stored:

```java
public class CryptoUtil {

    private static final String ALGORITHM = "AES/CBC/PKCS5Padding";

    // The key must be 16/24/32 bytes long for AES-128/192/256
    public static String encrypt(String input, String key) throws Exception {
        Cipher cipher = Cipher.getInstance(ALGORITHM);
        // Fresh random IV for every encryption
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        cipher.init(Cipher.ENCRYPT_MODE,
                new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "AES"),
                new IvParameterSpec(iv));
        byte[] encrypted = cipher.doFinal(input.getBytes(StandardCharsets.UTF_8));
        // Store the IV alongside the ciphertext so it can be decrypted later
        return Base64.getEncoder().encodeToString(iv) + ":" +
               Base64.getEncoder().encodeToString(encrypted);
    }
}
```
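Decryption isn't shown in the original; for completeness, a sketch that reverses the `iv:ciphertext` format produced above (the class name is illustrative):

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class CryptoUtilDecrypt {

    private static final String ALGORITHM = "AES/CBC/PKCS5Padding";

    // Counterpart to encrypt(): split off the IV, then decrypt the remainder
    public static String decrypt(String stored, String key) throws Exception {
        String[] parts = stored.split(":", 2);
        byte[] iv = Base64.getDecoder().decode(parts[0]);
        byte[] ciphertext = Base64.getDecoder().decode(parts[1]);

        Cipher cipher = Cipher.getInstance(ALGORITHM);
        cipher.init(Cipher.DECRYPT_MODE,
                new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "AES"),
                new IvParameterSpec(iv));
        return new String(cipher.doFinal(ciphertext), StandardCharsets.UTF_8);
    }
}
```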
Request logs are masked through an AOP aspect:

```java
@Aspect
@Component
public class LogMaskAspect {

    // Masks sensitive fields in request bodies before controllers log them
    @Around("@annotation(org.springframework.web.bind.annotation.PostMapping)")
    public Object maskSensitiveData(ProceedingJoinPoint pjp) throws Throwable {
        Object[] args = pjp.getArgs();
        // Replace phone numbers, IDs, etc. with masked values (implementation omitted)
        maskFields(args);
        return pjp.proceed(args);
    }
}
```
Symptom: the crawler suddenly stopped working after a few hours, with nothing in the error log.
Diagnosis: the HTTP connection pool was being exhausted because response entities were never released back to the pool.
Fix:
```java
// Try-with-resources plus EntityUtils.consume() so the connection returns to the pool
try (CloseableHttpResponse response = httpClient.execute(request)) {
    HttpEntity entity = response.getEntity();
    String result = EntityUtils.toString(entity);
    EntityUtils.consume(entity); // crucial: releases the underlying connection
    return result;
}
```
Symptom: the service ran into an OutOfMemoryError after roughly a week of uptime.
Diagnostic tools:
Fix:
```java
// Initialize the dictionary once, statically, instead of on every call
public class TextAnalyzer {

    private static final PerceptronLexicalAnalyzer ANALYZER;

    static {
        try {
            ANALYZER = new PerceptronLexicalAnalyzer();
            ANALYZER.enableCustomDictionary(true);
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static List<String> extractKeywords(String text) {
        return ANALYZER.seg(text)
                .stream()
                .filter(term -> term.nature.startsWith("n")) // keep noun-like terms
                .map(term -> term.word)
                .collect(Collectors.toList());
    }
}
```
There is still room to extend the system in a few directions:
```java
// Sketch of real-time hotness: aggregate read events in 5-minute windows with Flink
DataStream<ReadEvent> events = env
        .addSource(kafkaReadEventSource);   // a Kafka source of read events; wiring omitted
events.keyBy(ReadEvent::getBookId)
        .window(TumblingEventTimeWindows.of(Time.minutes(5)))
        .aggregate(new ReadCounter());      // AggregateFunction producing per-book counts
```
```javascript
// Front-end event tracking example
trackEvent('chart_click', {
  chart_type: 'genre_distribution',
  filter_condition: this.currentFilter
})
```
```xml
<!-- Report template snippet -->
<jasperReport>
    <field name="bookName" class="java.lang.String"/>
    <detail>
        <band height="20">
            <textField>
                <textFieldExpression><![CDATA[$F{bookName}]]></textFieldExpression>
            </textField>
        </band>
    </detail>
</jasperReport>
```
From technology selection to going live, this project took three months, and the biggest takeaway was end-to-end experience in building a complex system. Dealing with anti-crawling measures and performance tuning in particular yielded plenty of practical tricks. If you're planning something similar, focus on the stability of data collection and the interactivity of the visualizations; those two areas usually decide whether the project succeeds.