In today's big-data landscape, Spark's in-memory computing and rich ecosystem of components have made it the framework of choice for enterprise data processing, and the fully distributed deployment mode is the key configuration for unlocking its real power. This tutorial walks you step by step through building a genuinely distributed Spark test cluster and verifying its core functionality.
Note: this tutorial assumes basic Linux skills and at least three interconnected physical or virtual machines (a minimum of 1 Master + 2 Workers).
Hardware recommendations: at least 8 CPU cores and 16 GB of RAM per Worker (the values assumed by the spark-env.sh below); any reasonably sized Master will do for a test cluster.
Software requirements: an apt-based Linux distribution (e.g. Ubuntu), OpenJDK 8, Hadoop 3.3.4, Spark 3.3.0, and Python 3, all installed in the steps that follow.
Run the following on all nodes:
# Set up hostname resolution (example for a 3-node cluster)
sudo tee -a /etc/hosts <<EOF
192.168.1.101 spark-master
192.168.1.102 spark-worker1
192.168.1.103 spark-worker2
EOF
# Install the required tools
sudo apt update && sudo apt install -y \
  openjdk-8-jdk \
  ssh \
  pdsh \
  python3-pip
# Configure passwordless SSH login
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
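The keypair above only lets a node SSH into itself. For start-all.sh and pdsh to work, the Master must also reach each Worker without a password. A minimal sketch, assuming the same user account exists on all nodes:
# Run on the Master: push its public key to every Worker
ssh-copy-id spark-worker1
ssh-copy-id spark-worker2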
Install Hadoop on the Master (for a truly distributed HDFS, repeat the installation and configuration on the Workers and list their hostnames in etc/hadoop/workers, since start-dfs.sh only starts DataNodes on the hosts named there):
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -xzf hadoop-3.3.4.tar.gz -C /opt/
mv /opt/hadoop-3.3.4 /opt/hadoop
Configure the key files:
etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://spark-master:9000</value>
  </property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
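Before formatting the NameNode, Hadoop's binaries need to be on the PATH and the daemons need JAVA_HOME set explicitly in etc/hadoop/hadoop-env.sh. A minimal sketch, assuming the JVM path installed by Ubuntu's openjdk-8-jdk package:
# Append to ~/.bashrc (or an /etc/profile.d script) and re-source it
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# The HDFS daemons read JAVA_HOME from hadoop-env.sh, not from the shell
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> /opt/hadoop/etc/hadoop/hadoop-env.sh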
Start HDFS:
# Format the NameNode (first run only)
hdfs namenode -format
# Start the services
start-dfs.sh
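To confirm HDFS came up, jps should list a NameNode on the Master, and dfsadmin should report the live DataNodes:
jps
hdfs dfsadmin -report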
Install Spark on all nodes:
wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
tar -xzf spark-3.3.0-bin-hadoop3.tgz -C /opt/
mv /opt/spark-3.3.0-bin-hadoop3 /opt/spark
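Putting Spark on the PATH on every node saves spelling out the full paths used below; a minimal sketch:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin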
Master node configuration:
conf/spark-env.sh:
export SPARK_MASTER_HOST=spark-master
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=16g
Workers list:
conf/workers:
spark-worker1
spark-worker2
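SPARK_WORKER_CORES and SPARK_WORKER_MEMORY take effect on the Worker daemons themselves, so the Master's conf files must be copied out to the Workers. A minimal sketch using scp (rsync works just as well):
for w in spark-worker1 spark-worker2; do
  scp /opt/spark/conf/spark-env.sh /opt/spark/conf/workers $w:/opt/spark/conf/
done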
Start the cluster:
/opt/spark/sbin/start-all.sh
Check the cluster status:
# View the Master web UI (port 8080 by default)
curl http://spark-master:8080
# Check the Worker status
ssh spark-worker1 "jps"
Run a sample job:
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://spark-master:7077 \
  /opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar \
  1000
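On success the driver prints an estimate of pi to stdout; filtering the output makes the check explicit (a sketch; Spark's log noise goes to stderr):
spark-submit --class org.apache.spark.examples.SparkPi \
  --master spark://spark-master:7077 \
  /opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar 1000 \
  2>/dev/null | grep "Pi is roughly"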
Generate test data with the TPC-H kit:
# Generate a 10 GB dataset
wget https://github.com/gregrahn/tpch-kit/archive/refs/tags/v2.18.0.tar.gz
tar -xzf v2.18.0.tar.gz
cd tpch-kit-2.18.0/dbgen
make
./dbgen -s 10 -f
hdfs dfs -mkdir /tpch
hdfs dfs -put *.tbl /tpch
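A quick sanity check that the tables landed in HDFS at roughly the expected size:
hdfs dfs -ls /tpch
hdfs dfs -du -h /tpch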
Run a distributed query (the script below is submitted to the cluster as shown after the listing):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("TPC-H Q1") \
    .getOrCreate()
# dbgen writes headerless '|'-delimited files, so Spark names the columns _c0, _c1, ...
lineitem = spark.read.csv(
    "hdfs://spark-master:9000/tpch/lineitem.tbl",
    sep="|", inferSchema=True)
# Restore the TPC-H names of the positional columns we need
# (l_quantity = _c4, l_extendedprice = _c5, l_returnflag = _c8)
lineitem = lineitem.select(
    F.col("_c4").alias("l_quantity"),
    F.col("_c5").alias("l_extendedprice"),
    F.col("_c8").alias("l_returnflag"))
result = lineitem.groupBy("l_returnflag") \
    .agg({"l_quantity": "sum", "l_extendedprice": "avg"}) \
    .collect()
print(result)
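To run the query on the cluster rather than locally, submit the script with the standalone master URL (the filename tpch_q1.py is illustrative):
spark-submit --master spark://spark-master:7077 tpch_q1.py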
Resource utilization (note: port 4040 is the web UI of the running driver, so these endpoints only respond while an application is active):
# Live monitoring
watch -n 1 "curl -s http://spark-master:4040/metrics/json | jq '.gauges'"
Job latency analysis:
# Fetch per-application runtimes via the driver's REST API
import requests
metrics = requests.get("http://spark-master:4040/api/v1/applications").json()
for app in metrics:
    print(app['name'], app['attempts'][0]['duration'])
Problem 1: Workers fail to register with the Master
# Confirm network connectivity
pdsh -w spark-worker[1-2] "ping -c 3 spark-master"
# Check that the Master port is open
nc -zv spark-master 7077
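If connectivity and the port look fine, the Worker's own log usually names the cause; a hedged look at the default log location (the exact filename pattern may vary):
ssh spark-worker1 "tail -n 50 /opt/spark/logs/*Worker*.out"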
Problem 2: Jobs time out
# Add these options to your usual spark-submit invocation
spark-submit \
  --conf spark.network.timeout=600s \
  --conf spark.executor.heartbeatInterval=60s \
  ...
Enable Spark internal authentication (strictly speaking, spark.authenticate turns on shared-secret authentication, not Kerberos):
# spark-defaults.conf
spark.authenticate true
spark.authenticate.secret your_secret_key
Network isolation:
# Restrict access with firewall rules: allow the cluster subnet, drop everything else
iptables -A INPUT -p tcp --dport 7077 -s 192.168.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 7077 -j DROP
Typical configuration example (spark-defaults.conf):
# Memory tuning
spark.executor.memoryOverhead=2g
spark.memory.fraction=0.8
# Parallelism
spark.default.parallelism=200
spark.sql.shuffle.partitions=200
For real deployments, test dynamic resource allocation by adding the following options to your spark-submit command (see the note after the listing):
spark-submit --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  ...
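In standalone mode, dynamic allocation also requires the external shuffle service on every Worker. A hedged sketch, assuming the Worker daemons pick up spark-defaults.conf when launched via the sbin scripts:
# Run on the Master: append the setting on every node, then restart the cluster
for h in spark-master spark-worker1 spark-worker2; do
  ssh $h 'echo "spark.shuffle.service.enabled true" >> /opt/spark/conf/spark-defaults.conf'
done
/opt/spark/sbin/stop-all.sh && /opt/spark/sbin/start-all.sh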