Before building the Hadoop cluster, we need a clean Ubuntu 22.04 environment. The complete initialization steps are:
```bash
# Update system packages (consider switching to a nearby mirror first)
sudo apt update && sudo apt upgrade -y
# Install the Docker engine
sudo apt install docker.io -y
# Start and enable the Docker service
sudo systemctl start docker
sudo systemctl enable docker
# Verify the Docker installation
docker --version
```
Note: if installation is slow, switch the apt sources first. Users in mainland China can use the Tsinghua or Aliyun mirrors.
To keep the environment tidy, create a dedicated directory for the Hadoop files:

```bash
mkdir ~/hadoop-docker && cd ~/hadoop-docker
```
Download the Hadoop 3.3.0 binary package (hadoop-3.3.0.tar.gz) from the Apache website and place it in the working directory. We download it manually rather than fetching it inside the Dockerfile, mainly because the local file is cached across image rebuilds, it avoids flaky in-container downloads, and it lets us verify the archive before it enters the image.
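Since the tarball enters the image as-is, it is worth checking its integrity first. A minimal sketch of a checksum helper (the function name is ours; take the real SHA-512 digest from the Apache downloads page):

```bash
#!/bin/bash
# Verify a downloaded tarball against an expected SHA-512 digest
# before baking it into the Docker image.
verify_tarball() {
    local file="$1" expected="$2"
    local actual
    actual=$(sha512sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK"
    else
        echo "MISMATCH"
        return 1
    fi
}

# Usage (digest is a placeholder -- copy the real one from apache.org):
# verify_tarball hadoop-3.3.0.tar.gz "<sha512 digest from the downloads page>"
```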
The Hadoop cluster needs stable communication between containers, so we create a bridge network:

```bash
docker network create \
  --driver bridge \
  --subnet=172.19.0.0/16 \
  hadoop-net
```

Key parameters:

- `--subnet=172.19.0.0/16`: the subnet range, chosen to avoid conflicts with Docker's default networks
- `hadoop-net`: the network name; all cluster containers will attach to it

Verify that the network was created:

```bash
docker network inspect hadoop-net
```
For production environments, the following IP allocation scheme is recommended:
| Node type | Hostname | IP address | Port mappings |
|---|---|---|---|
| Master | master | 172.19.0.2 | 9870:9870, 8088:8088, etc. |
| Worker1 | worker01 | 172.19.0.3 | - |
| Worker2 | worker02 | 172.19.0.4 | - |
Write an entrypoint.sh script to handle initialization when the container starts:

```bash
#!/bin/bash
# Start the SSH service
service ssh start
# Generate an SSH key pair if one does not exist yet
if [ ! -f ~/.ssh/id_rsa ]; then
    mkdir -p ~/.ssh   # ~/.ssh may not exist yet in a fresh image; ssh-keygen fails without it
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys
fi
# Keep the container running
tail -f /dev/null
```
Make it executable:

```bash
chmod +x entrypoint.sh
```
The complete Dockerfile:

```dockerfile
FROM ubuntu:22.04
# Non-interactive mode for package installation
ENV DEBIAN_FRONTEND=noninteractive
# Install base tools
RUN apt update && apt install -y \
    openssh-server \
    openjdk-11-jdk \
    wget \
    vim \
    net-tools \
    iputils-ping \
    dnsutils
# Configure SSH
RUN mkdir /var/run/sshd
RUN echo 'root:root' | chpasswd
RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN sed -i 's/#PasswordAuthentication yes/PasswordAuthentication yes/' /etc/ssh/sshd_config
# Copy the local Hadoop tarball
COPY ./hadoop-3.3.0.tar.gz /tmp/
# Install Hadoop
RUN tar -xzvf /tmp/hadoop-3.3.0.tar.gz -C /usr/local/ \
    && rm /tmp/hadoop-3.3.0.tar.gz \
    && mv /usr/local/hadoop-3.3.0 /usr/local/hadoop \
    && chown -R root:root /usr/local/hadoop
# Environment variables
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV HADOOP_HOME=/usr/local/hadoop
ENV HADOOP_MAPRED_HOME=/usr/local/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# Create data directories
RUN mkdir -p /usr/local/hadoop/namenode_dir \
             /usr/local/hadoop/datanode_dir \
             /usr/local/hadoop/tmp
# Expose ports (9870/8088/9000 are the Hadoop 3 ports; 50070/50010 are legacy)
EXPOSE 22 9870 8088 9000 50070 50010
# Startup script
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```
Run the build:

```bash
docker build -t hadoop-base .
```

The build typically takes 5-10 minutes, depending on network speed and system performance. Once it succeeds, verify with:

```bash
docker images | grep hadoop-base
```
Start the Master node container:

```bash
docker run -itd \
  --name master \
  --hostname master \
  --net hadoop-net \
  --ip 172.19.0.2 \
  -p 9870:9870 \
  -p 8088:8088 \
  -p 9000:9000 \
  hadoop-base
```

Port mappings:

- 9870: NameNode web UI
- 8088: ResourceManager web UI
- 9000: HDFS NameNode RPC port
Start the two Worker nodes:

```bash
docker run -itd \
  --name worker01 \
  --hostname worker01 \
  --net hadoop-net \
  --ip 172.19.0.3 \
  hadoop-base

docker run -itd \
  --name worker02 \
  --hostname worker02 \
  --net hadoop-net \
  --ip 172.19.0.4 \
  hadoop-base
```
Check that all containers are running:

```bash
docker ps -a --filter "name=master|worker01|worker02"
```
Configure /etc/hosts in every container so the nodes can reach each other by hostname:

```bash
# hosts entries on the Master node
docker exec master bash -c "echo '172.19.0.2 master' >> /etc/hosts"
docker exec master bash -c "echo '172.19.0.3 worker01' >> /etc/hosts"
docker exec master bash -c "echo '172.19.0.4 worker02' >> /etc/hosts"
# hosts entries on worker01
docker exec worker01 bash -c "echo '172.19.0.2 master' >> /etc/hosts"
docker exec worker01 bash -c "echo '172.19.0.3 worker01' >> /etc/hosts"
docker exec worker01 bash -c "echo '172.19.0.4 worker02' >> /etc/hosts"
# hosts entries on worker02
docker exec worker02 bash -c "echo '172.19.0.2 master' >> /etc/hosts"
docker exec worker02 bash -c "echo '172.19.0.3 worker01' >> /etc/hosts"
docker exec worker02 bash -c "echo '172.19.0.4 worker02' >> /etc/hosts"
```
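The nine commands above can be collapsed into a single loop. A small sketch with the same hostnames and IPs; the `gen_hosts` helper is ours, not part of Docker:

```bash
#!/bin/bash
# Emit the cluster's /etc/hosts lines once, then append them in every container.
gen_hosts() {
    printf '%s\n' \
        "172.19.0.2 master" \
        "172.19.0.3 worker01" \
        "172.19.0.4 worker02"
}

# Usage (requires the three containers to be running):
# for node in master worker01 worker02; do
#     gen_hosts | docker exec -i "$node" bash -c 'cat >> /etc/hosts'
# done
```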
Enter the Master container:

```bash
docker exec -it master bash
```

Inside the container, run:

```bash
# Copy the public key to every node
ssh-copy-id master
ssh-copy-id worker01
ssh-copy-id worker02
# Test passwordless login
ssh worker01 hostname  # should print worker01
ssh worker02 hostname  # should print worker02
```
Common issue: if `ssh-copy-id` fails, the SSH service configuration is the likely cause. To fix it:

- Enter the affected worker container:

```bash
docker exec -it worker01 bash
```

- Reset the root password:

```bash
echo 'root:root' | chpasswd
```

- Fix the SSH configuration:

```bash
sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
sed -i 's/#PasswordAuthentication yes/PasswordAuthentication yes/' /etc/ssh/sshd_config
```

- Restart the SSH service:

```bash
service ssh restart
```
Create a directory for the configuration files on the Master node:

```bash
mkdir -p /usr/local/hadoop/etc/hadoop/config
cd /usr/local/hadoop/etc/hadoop/config
```
core-site.xml:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>
```
hdfs-site.xml (note: the old `dfs.permissions` key is deprecated; current Hadoop versions use `dfs.permissions.enabled`):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/namenode_dir</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/datanode_dir</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>
```
mapred-site.xml:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
</configuration>
```
yarn-site.xml:

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>yarn.application.classpath</name>
    <value>
      $HADOOP_HOME/etc/hadoop,
      $HADOOP_HOME/share/hadoop/common/*,
      $HADOOP_HOME/share/hadoop/common/lib/*,
      $HADOOP_HOME/share/hadoop/hdfs/*,
      $HADOOP_HOME/share/hadoop/hdfs/lib/*,
      $HADOOP_HOME/share/hadoop/mapreduce/*,
      $HADOOP_HOME/share/hadoop/mapreduce/lib/*,
      $HADOOP_HOME/share/hadoop/yarn/*,
      $HADOOP_HOME/share/hadoop/yarn/lib/*
    </value>
  </property>
</configuration>
```
workers:

```plaintext
worker01
worker02
```
Edit the hadoop-env.sh file:

```bash
vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
```

Append the following:

```bash
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
```
Sync the configuration to all Worker nodes:

```bash
for worker in worker01 worker02; do
  scp -r $HADOOP_HOME/etc/hadoop/config/* $worker:$HADOOP_HOME/etc/hadoop/
  scp $HADOOP_HOME/etc/hadoop/hadoop-env.sh $worker:$HADOOP_HOME/etc/hadoop/
done
```
Copy the files from the config directory into Hadoop's main configuration directory:

```bash
cd /usr/local/hadoop/etc/hadoop/
cp -f config/* ./
```
Format the NameNode:

```bash
hdfs namenode -format
```

Success indicator: the output contains "Storage directory has been successfully formatted".

Note: if formatting fails, clean up the old metadata first:

```bash
rm -rf /usr/local/hadoop/namenode_dir/*
rm -rf /usr/local/hadoop/datanode_dir/*
rm -rf /usr/local/hadoop/tmp/*
```
Start HDFS:

```bash
start-dfs.sh
```

Verify the HDFS services:

```bash
jps
```

The Master node should show NameNode and SecondaryNameNode; each Worker node should show DataNode.
Start YARN:

```bash
start-yarn.sh
```

Verify the YARN services:

```bash
jps
```

The Master node should additionally show ResourceManager; each Worker node should additionally show NodeManager.
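Instead of eyeballing `jps` output on every node, a tiny helper can verify that the expected daemons are present. A sketch (the `check_daemons` function is ours; feed it the captured output of `jps` or `ssh <node> jps`):

```bash
#!/bin/bash
# Check that a jps listing contains every required daemon name.
check_daemons() {
    local jps_out="$1"; shift
    local daemon
    for daemon in "$@"; do
        if ! printf '%s\n' "$jps_out" | grep -qw "$daemon"; then
            echo "missing: $daemon"
            return 1
        fi
    done
    echo "all daemons running"
}

# Usage on the Master node:
# check_daemons "$(jps)" NameNode SecondaryNameNode ResourceManager
# check_daemons "$(ssh worker01 jps)" DataNode NodeManager
```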
Create an input directory in HDFS:

```bash
hdfs dfs -mkdir -p /user/root/input
```

Upload a test file:

```bash
echo "Hello Hadoop World" > test.txt
hdfs dfs -put test.txt /user/root/input
```

Run the WordCount example:

```bash
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar \
  wordcount \
  /user/root/input \
  /user/root/output
```
If you hit classpath problems, use the full command:

```bash
hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar \
  wordcount \
  -Dmapreduce.application.classpath=$(hadoop classpath) \
  -Dyarn.app.mapreduce.am.env="HADOOP_MAPRED_HOME=/usr/local/hadoop" \
  -Dmapreduce.map.env="HADOOP_MAPRED_HOME=/usr/local/hadoop" \
  -Dmapreduce.reduce.env="HADOOP_MAPRED_HOME=/usr/local/hadoop" \
  /user/root/input \
  /user/root/output
```
View the result:

```bash
hdfs dfs -cat /user/root/output/part-r-00000
```

Expected output:

```text
Hadoop	1
Hello	1
World	1
```
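For larger inputs it is handy to sanity-check the output totals rather than reading every line. A small sketch using awk (the `total_count` helper is ours; wordcount emits one `word<TAB>count` line per word):

```bash
#!/bin/bash
# Sum the count column of wordcount output read from stdin.
total_count() {
    awk -F'\t' '{ sum += $2 } END { print sum + 0 }'
}

# Usage:
# hdfs dfs -cat /user/root/output/part-r-00000 | total_count
# The total should equal the number of words in the input files.
```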
The current setup stores data inside the containers, which is not recommended for production. Persist the data in one of the following ways.

Mount Docker volumes for the critical directories:

```bash
docker run -itd \
  --name master \
  --hostname master \
  --net hadoop-net \
  --ip 172.19.0.2 \
  -p 9870:9870 \
  -p 8088:8088 \
  -p 9000:9000 \
  -v hadoop_namenode:/usr/local/hadoop/namenode_dir \
  -v hadoop_datanode:/usr/local/hadoop/datanode_dir \
  -v hadoop_tmp:/usr/local/hadoop/tmp \
  hadoop-base
```

Or bind-mount host directories directly:

```bash
mkdir -p /data/hadoop/{namenode,datanode,tmp}
docker run -itd \
  ... \
  -v /data/hadoop/namenode:/usr/local/hadoop/namenode_dir \
  -v /data/hadoop/datanode:/usr/local/hadoop/datanode_dir \
  -v /data/hadoop/tmp:/usr/local/hadoop/tmp \
  hadoop-base
```
Adjusting Hadoop memory settings:

Tune the following parameters according to the available hardware:

```xml
<!-- yarn-site.xml -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- adjust to the actual amount of RAM -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
```
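A rough sizing rule (our heuristic, not an official Hadoop formula): give YARN roughly 80% of a node's RAM, leaving the rest for the OS and the DataNode, and make each map container a fraction of that. A sketch:

```bash
#!/bin/bash
# Heuristic YARN memory sizing from total node RAM, all values in MB.
# These helpers are illustrative assumptions, not Hadoop defaults.
yarn_memory_mb() {
    local total_mb="$1"
    echo $(( total_mb * 8 / 10 ))   # ~80% of RAM for YARN containers
}
map_memory_mb() {
    echo $(( $(yarn_memory_mb "$1") / 4 ))   # one map task = 1/4 of YARN memory
}

# Example for a 10 GiB node, matching the values used above:
# yarn_memory_mb 10240  -> 8192 (yarn.nodemanager.resource.memory-mb)
# map_memory_mb 10240   -> 2048 (mapreduce.map.memory.mb)
```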
Web UI access:

- NameNode: http://<host>:9870
- ResourceManager: http://<host>:8088

Log inspection: daemon logs are written to `$HADOOP_HOME/logs` on each node, e.g. `tail -f /usr/local/hadoop/logs/hadoop-root-namenode-master.log`.

Common maintenance commands:

```bash
# Stop the cluster
stop-yarn.sh
stop-dfs.sh
# Single-daemon maintenance (hadoop-daemon.sh/yarn-daemon.sh are deprecated in Hadoop 3)
hdfs --daemon stop datanode
yarn --daemon stop nodemanager
```
Symptom: `ssh worker01` still prompts for a password.

Solutions:

1. Fix key and directory permissions:

```bash
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh
```

2. Check the SSH server configuration:

```bash
grep -E 'PermitRootLogin|PasswordAuthentication' /etc/ssh/sshd_config
```

Expected output:

```text
PermitRootLogin yes
PasswordAuthentication yes
```

3. Restart the SSH service:

```bash
service ssh restart
```
Symptom: `jps` does not show a DataNode process.

Solutions:

1. Check the DataNode log:

```bash
cat /usr/local/hadoop/logs/hadoop-root-datanode-*.log
```

2. If the log reports a clusterID mismatch (typically caused by reformatting the NameNode), clear the data directories and reformat. Warning: this destroys all HDFS data.

```bash
rm -rf /usr/local/hadoop/namenode_dir/*
rm -rf /usr/local/hadoop/datanode_dir/*
rm -rf /usr/local/hadoop/tmp/*
hdfs namenode -format
start-dfs.sh
```
Symptom: running the WordCount example fails with a classpath error.

Solutions:

1. Pass the classpath and environment explicitly:

```bash
hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar \
  wordcount \
  -Dmapreduce.application.classpath=$(hadoop classpath) \
  -Dyarn.app.mapreduce.am.env="HADOOP_MAPRED_HOME=/usr/local/hadoop" \
  -Dmapreduce.map.env="HADOOP_MAPRED_HOME=/usr/local/hadoop" \
  -Dmapreduce.reduce.env="HADOOP_MAPRED_HOME=/usr/local/hadoop" \
  /user/root/input \
  /user/root/output
```

2. Or export the environment variable on every node:

```bash
export HADOOP_MAPRED_HOME=/usr/local/hadoop
```
Symptom: the web UI on port 9870 or 8088 cannot be reached from a browser.

Solutions:

1. Confirm the port mappings:

```bash
docker ps --filter "name=master"
```

The output should include 9870:9870 and 8088:8088 mappings.

2. Check the host firewall:

```bash
sudo ufw status
```

If the firewall is enabled, allow the ports:

```bash
sudo ufw allow 9870/tcp
sudo ufw allow 8088/tcp
```

3. Confirm the services are running:

```bash
docker exec master jps
```

NameNode and ResourceManager should both be listed.

To scale the cluster out, start a new Worker container and join it to the cluster:
Start the new Worker:

```bash
docker run -itd \
  --name worker03 \
  --hostname worker03 \
  --net hadoop-net \
  --ip 172.19.0.5 \
  hadoop-base
```

Update the hosts file on every node:

```bash
for node in master worker01 worker02 worker03; do
  docker exec $node bash -c "echo '172.19.0.5 worker03' >> /etc/hosts"
done
```

Set up passwordless SSH (run on the Master node):

```bash
ssh-copy-id worker03
```

Add the node to the Master's workers file:

```bash
echo "worker03" >> /usr/local/hadoop/etc/hadoop/workers
```

Distribute the configuration to the new node:

```bash
scp -r $HADOOP_HOME/etc/hadoop/* worker03:$HADOOP_HOME/etc/hadoop/
```

Start the daemons on the new node, then tell the NameNode to re-read its node lists (`-refreshNodes` by itself does not start any daemons):

```bash
ssh worker03 "hdfs --daemon start datanode && yarn --daemon start nodemanager"
hdfs dfsadmin -refreshNodes
```
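When workers are added repeatedly, the next free IP can be computed instead of hard-coded. A minimal sketch, assuming the 172.19.0.0/16 subnet from above, the master-at-.2 convention, and fewer than 254 nodes (the `next_ip` helper is ours):

```bash
#!/bin/bash
# Compute the next node IP from the number of containers already in the
# network: master is .2, worker01 is .3, ... -> node N+1 gets .(N+2).
next_ip() {
    local existing="$1"   # containers currently attached to hadoop-net
    echo "172.19.0.$(( existing + 2 ))"
}

# Usage:
# count=$(docker network inspect hadoop-net --format '{{len .Containers}}')
# next_ip "$count"
```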
To simplify management, the whole cluster can be defined in a docker-compose.yml file:

```yaml
version: '3'
services:
  master:
    image: hadoop-base
    container_name: master
    hostname: master
    networks:
      hadoop-net:
        ipv4_address: 172.19.0.2
    ports:
      - "9870:9870"
      - "8088:8088"
      - "9000:9000"
    volumes:
      - hadoop_namenode:/usr/local/hadoop/namenode_dir
      - hadoop_tmp:/usr/local/hadoop/tmp
  worker01:
    image: hadoop-base
    container_name: worker01
    hostname: worker01
    networks:
      hadoop-net:
        ipv4_address: 172.19.0.3
    volumes:
      - hadoop_datanode01:/usr/local/hadoop/datanode_dir
  worker02:
    image: hadoop-base
    container_name: worker02
    hostname: worker02
    networks:
      hadoop-net:
        ipv4_address: 172.19.0.4
    volumes:
      - hadoop_datanode02:/usr/local/hadoop/datanode_dir
networks:
  hadoop-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.19.0.0/16
volumes:
  hadoop_namenode:
  hadoop_datanode01:
  hadoop_datanode02:
  hadoop_tmp:
```

Start the cluster with:

```bash
docker-compose up -d
```
A Hadoop cluster can be integrated with other big-data components to form a complete ecosystem:

Hive (data warehouse):

```dockerfile
# Add to the Dockerfile
RUN wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz \
    && tar -xzf apache-hive-3.1.3-bin.tar.gz -C /usr/local/ \
    && mv /usr/local/apache-hive-3.1.3-bin /usr/local/hive \
    && rm apache-hive-3.1.3-bin.tar.gz
ENV HIVE_HOME=/usr/local/hive
ENV PATH=$PATH:$HIVE_HOME/bin
```

Spark (distributed compute engine):

```dockerfile
# Add to the Dockerfile
RUN wget https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz \
    && tar -xzf spark-3.3.0-bin-hadoop3.tgz -C /usr/local/ \
    && mv /usr/local/spark-3.3.0-bin-hadoop3 /usr/local/spark \
    && rm spark-3.3.0-bin-hadoop3.tgz
ENV SPARK_HOME=/usr/local/spark
ENV PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```

HBase (distributed NoSQL database):

```dockerfile
# Add to the Dockerfile
RUN wget https://downloads.apache.org/hbase/2.4.12/hbase-2.4.12-bin.tar.gz \
    && tar -xzf hbase-2.4.12-bin.tar.gz -C /usr/local/ \
    && mv /usr/local/hbase-2.4.12 /usr/local/hbase \
    && rm hbase-2.4.12-bin.tar.gz
ENV HBASE_HOME=/usr/local/hbase
ENV PATH=$PATH:$HBASE_HOME/bin
```
Change the default password:

```bash
echo 'root:YourStrongPassword' | chpasswd
```

Restrict SSH access by tightening sshd_config (note: these patterns match the lines as modified during setup, which no longer carry a leading `#`):

```bash
sed -i 's/PermitRootLogin yes/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config
sed -i 's/PasswordAuthentication yes/PasswordAuthentication no/' /etc/ssh/sshd_config
service ssh restart
```

Enable the firewall:

```bash
apt install ufw -y
ufw allow 22/tcp
ufw allow 9870/tcp
ufw allow 8088/tcp
ufw enable
```
Enable Kerberos authentication by adding to core-site.xml:

```xml
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
```

Enable access control by adding to hdfs-site.xml:

```xml
<property>
  <name>dfs.permissions.enabled</name>
  <value>true</value>
</property>
```

Enable SSL encryption in core-site.xml:

```xml
<property>
  <name>hadoop.ssl.enabled</name>
  <value>true</value>
</property>
```
Prometheus + Grafana: Hadoop daemons expose their metrics over JMX, which can be scraped into Prometheus (for example via the JMX exporter) and visualized with Grafana dashboards.

Built-in Hadoop metrics: every daemon also serves its metrics as JSON at an HTTP `/jmx` endpoint, e.g. http://master:9870/jmx on the NameNode.
HDFS tuning:

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>100</value> <!-- default is 10; increase under high concurrency -->
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>30</value> <!-- default is 3 -->
</property>
```

YARN tuning:

```xml
<!-- yarn-site.xml -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value> <!-- minimum memory per container -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value> <!-- adjust to the actual CPU core count -->
</property>
```

MapReduce tuning:

```xml
<!-- mapred-site.xml -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>512</value> <!-- default is 100 MB -->
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value> <!-- memory per map task -->
</property>
```
NameNode metadata backup:

```bash
hdfs dfsadmin -fetchImage /backup/namenode/fsimage_$(date +%Y%m%d)
```

Periodically export the HDFS directory structure:

```bash
hdfs dfs -ls -R / > /backup/hdfs_structure_$(date +%Y%m%d).txt
```

NameNode recovery: restore the most recent fsimage backup into the directory configured as `dfs.namenode.name.dir` and restart the NameNode.

DataNode failure handling: HDFS automatically re-replicates the blocks of a failed DataNode; after replacing the node, rebalance the cluster:

```bash
hdfs balancer -threshold 10
```
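To take a DataNode out of service gracefully rather than just killing its container, list it in an exclude file and refresh the NameNode. A sketch; the exclude-file path is our choice and must match whatever you configure as `dfs.hosts.exclude` in hdfs-site.xml:

```bash
#!/bin/bash
# Add a host to the HDFS exclude file (idempotently). Afterwards,
# `hdfs dfsadmin -refreshNodes` starts decommissioning the node.
exclude_node() {
    local host="$1"
    local file="${EXCLUDE_FILE:-/usr/local/hadoop/etc/hadoop/dfs.exclude}"
    touch "$file"
    grep -qxF "$host" "$file" || echo "$host" >> "$file"
}

# Usage (on the NameNode):
# exclude_node worker02
# hdfs dfsadmin -refreshNodes   # node shows as "Decommissioning" in the 9870 UI
```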
Multi-stage build:

```dockerfile
FROM ubuntu:22.04 AS builder
RUN apt update && apt install -y wget
RUN wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz

FROM ubuntu:22.04
COPY --from=builder /hadoop-3.3.0.tar.gz /tmp/
RUN tar -xzf /tmp/hadoop-3.3.0.tar.gz -C /usr/local/
```

Clean caches:

```dockerfile
RUN apt update && apt install -y \
    openssh-server \
    openjdk-11-jdk \
    && apt clean \
    && rm -rf /var/lib/apt/lists/*
```
Set resource limits when starting containers:

```bash
docker run -itd \
  --name master \
  --hostname master \
  --net hadoop-net \
  --ip 172.19.0.2 \
  --memory 8g \
  --cpus 2 \
  -p 9870:9870 \
  -p 8088:8088 \
  -p 9000:9000 \
  hadoop-base
```

Add a health check to the Dockerfile (note: curl is not in the base image's package list above, so install it alongside the other tools):

```dockerfile
HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:9870/ || exit 1
```
When deploying a Hadoop cluster in practice, several points deserve special attention:

Networking: make sure all nodes can reach each other, in particular that hostname resolution and passwordless SSH work. The cluster cannot function without them.

Data persistence: production deployments must use named volumes or bind mounts so that restarting a container does not lose data.

Resource allocation: size the YARN and MapReduce memory parameters to the actual hardware to avoid both waste and starvation.

Version compatibility: Hadoop ecosystem components have inter-version compatibility constraints; stick to combinations that are known to work together.

Security hardening: even in a test environment, follow the principle of least privilege and avoid running services as root.

A practical tip: after the first successful deployment, commit the configured container as a new base image for fast redeployment:

```bash
docker commit master hadoop-base:configured
```

The next deployment can then start from this pre-configured image, saving a lot of setup time.