MapReduce Data Processing in Practice on the 头哥 Practice Platform

KXZDQ

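1. Setting Up a Hadoop Environment from Scratch

When I first encountered Hadoop, its sprawling ecosystem intimidated me. Setting it up turned out to be less complicated than I expected. Here are the detailed steps I followed on the 头哥 practice platform, so you can avoid the pitfalls I ran into.

Start with a clean Linux environment. I used Ubuntu 18.04 LTS, which is stable and well supported by the community. Once the system is installed, the first step is to set up Java: Hadoop is written in Java, so a JDK is required. I recommend OpenJDK 8, which the Hadoop 3.2.x line supports:

sudo apt update
sudo apt install openjdk-8-jdk -y
java -version  # verify the installation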

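Next, download the Hadoop binary package. I used version 3.2.3, a stable release of the 3.2 line. After downloading, extract it into /usr/local:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
sudo tar -xzvf hadoop-3.2.3.tar.gz -C /usr/local
sudo mv /usr/local/hadoop-3.2.3 /usr/local/hadoop
sudo chown -R "$USER" /usr/local/hadoop  # let the current user write logs and data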

Environment variables are an easy place to make mistakes. Edit ~/.bashrc and add the following:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

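Save the file and run source ~/.bashrc to apply the changes. Next, edit Hadoop's core configuration files, all of which live in $HADOOP_HOME/etc/hadoop. Also set JAVA_HOME in hadoop-env.sh in that directory: the start scripts launch the daemons over SSH, where variables exported in ~/.bashrc are typically not picked up.

core-site.xml - HDFS address (fs.defaultFS, for example hdfs://localhost:9000 on a single node) and temporary directory (hadoop.tmp.dir)
hdfs-site.xml - replication factor (dfs.replication, 1 on a single node) and data directories (dfs.namenode.name.dir, dfs.datanode.data.dir)
mapred-site.xml - MapReduce framework (mapreduce.framework.name, set to yarn)
yarn-site.xml - YARN resource management (yarn.nodemanager.aux-services, set to mapreduce_shuffle)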

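Once the configuration is done, format the NameNode to initialize HDFS. Do this only once, because reformatting wipes the existing HDFS metadata:

hdfs namenode -format

Finally, start the Hadoop cluster. The start scripts connect over SSH, so passwordless SSH to localhost must be set up first:

start-dfs.sh
start-yarn.sh

Verify that the cluster is running:

jps  # should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager
hdfs dfsadmin -report  # show cluster status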

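2. Student Score Analysis Project

Now let's work through a complete student score analysis project. It applies the core ideas of MapReduce in three steps: data preparation, processing in the Map phase, and aggregation in the Reduce phase.

First, prepare the test data. Create a file named students.txt with one student name and one score per line:

张三 85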
李四 92
王五 78
张三 90
李四 88
赵六 95

Upload the data to HDFS:

hadoop fs -mkdir -p /user/test/input
hadoop fs -put students.txt /user/test/input

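Next, write the MapReduce program. The idea: the Mapper reads each line and emits a <student name, score> key-value pair, and the Reducer finds each student's highest score. The complete code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxScore {
    // Emits <name, score> for every "name score" input line.
    public static class TokenizerMapper 
        extends Mapper<LongWritable, Text, Text, IntWritable> {
        
        private Text name = new Text();
        private IntWritable score = new IntWritable();
        
        @Override
        public void map(LongWritable key, Text value, Context context
                       ) throws IOException, InterruptedException {
            // Split on any run of whitespace; skip blank or malformed lines.
            String[] parts = value.toString().trim().split("\\s+");
            if (parts.length != 2) return;
            name.set(parts[0]);
            score.set(Integer.parseInt(parts[1]));
            context.write(name, score);
        }
    }
    
    // Emits <name, highest score> for each student.
    public static class IntMaxReducer 
        extends Reducer<Text, IntWritable, Text, IntWritable> {
        
        private IntWritable result = new IntWritable();
        
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                          Context context
                          ) throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable val : values) {
                max = Math.max(max, val.get());
            }
            result.set(max);
            context.write(key, result);
        }
    }
    
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "max score");
        job.setJarByClass(MaxScore.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as a combiner to pre-aggregate map output.
        job.setCombinerClass(IntMaxReducer.class);
        job.setReducerClass(IntMaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}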
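
One detail in the driver deserves a note: IntMaxReducer is registered as both the combiner and the reducer. That works here because taking a maximum is associative and commutative, so the maximum of per-task maxima equals the maximum over all values. The plain-Java sketch below (no Hadoop dependency; the class name CombinerCheck and the way the scores are split across two map tasks are illustrative assumptions) checks this property, and shows that the same shortcut would give a wrong answer for an average.

```java
public class CombinerCheck {
    // Maximum of the given values, mirroring the loop in IntMaxReducer.
    public static int max(int... values) {
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Arithmetic mean of the given values (assumes at least one value).
    public static double mean(double... values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Hypothetical: one student's scores 85, 90, 70, where the first two
        // were seen by one map task and the last one by another.
        int direct = max(85, 90, 70);
        int combined = max(max(85, 90), max(70));
        System.out.println("max survives combining: " + (direct == combined));

        double directMean = mean(85, 90, 70);
        double combinedMean = mean(mean(85, 90), mean(70));
        System.out.println("mean survives combining: " + (directMean == combinedMean));
    }
}
```

For an average, a combiner would have to emit a (sum, count) pair instead of a partial mean, which means it could no longer be the same class as the reducer.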

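Package the program as maxscore.jar and submit it to the Hadoop cluster. The output directory must not exist yet, otherwise the job fails:

hadoop jar maxscore.jar MaxScore /user/test/input /user/test/output

View the result:

hadoop fs -cat /user/test/output/part-r-00000

The output should list each student's highest score. Hadoop separates key and value with a tab by default, shown here as a space:

张三 90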
李四 92
王五 78
赵六 95
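
To check the expected result without a cluster, the same map, group and reduce steps can be imitated in plain Java. This is only a local sketch under stated assumptions: the class MaxScoreLocal and its method maxScores are made-up names, and a TreeMap stands in for Hadoop's shuffle and sort, which in reality compares the serialized Text keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MaxScoreLocal {
    public static Map<String, Integer> maxScores(List<String> lines) {
        // "Map" and "shuffle": parse each line and group the scores by name.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) continue; // skip blank or malformed lines
            grouped.computeIfAbsent(parts[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(parts[1]));
        }

        // "Reduce": keep one maximum per name.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int max = Integer.MIN_VALUE;
            for (int score : e.getValue()) {
                max = Math.max(max, score);
            }
            result.put(e.getKey(), max);
        }
        return result;
    }

    public static void main(String[] args) {
        // The sample rows from students.txt.
        List<String> lines = Arrays.asList(
            "张三 85", "李四 92", "王五 78", "张三 90", "李四 88", "赵六 95");
        for (Map.Entry<String, Integer> e : maxScores(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```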

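3. Merging Files and Removing Duplicates

Real projects often need to merge several data sources and drop duplicate records. The following example shows how.

Suppose there are two student information files, file1.txt and file2.txt.

Contents of file1.txt:

1001 张三 男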
1002 李四 女
1003 王五 男

Contents of file2.txt:

1002 李四 女
1004 赵六 男
1005 钱七 女

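The goal is to merge the two files and remove duplicate records. The Mapper splits each line into the student ID (the key) and the rest of the record (the value). The shuffle then brings all records with the same ID into one reduce call, where a Set collapses identical records. Note that two records sharing an ID but differing in name or gender would both be kept, so this approach treats a record as a duplicate only when the whole line matches.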
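
The grouping-then-Set idea can be tried locally before involving Hadoop. The sketch below is a plain-Java imitation, not the job itself: MergeDedupLocal and mergeDedup are hypothetical names, and a TreeMap of sets stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class MergeDedupLocal {
    public static List<String> mergeDedup(List<String> file1, List<String> file2) {
        List<String> all = new ArrayList<>(file1);
        all.addAll(file2);

        // "Map" and "shuffle": key = student ID, value = distinct record bodies.
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (String line : all) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length != 2) continue; // skip blank or malformed lines
            grouped.computeIfAbsent(parts[0], k -> new LinkedHashSet<>()).add(parts[1]);
        }

        // "Reduce": emit each distinct record once per ID.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
            for (String info : e.getValue()) {
                out.add(e.getKey() + " " + info);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The sample rows from file1.txt and file2.txt.
        List<String> file1 = Arrays.asList("1001 张三 男", "1002 李四 女", "1003 王五 男");
        List<String> file2 = Arrays.asList("1002 李四 女", "1004 赵六 男", "1005 钱七 女");
        for (String record : mergeDedup(file1, file2)) {
            System.out.println(record);
        }
    }
}
```

If the requirement were one record per ID regardless of content, the reduce step would instead emit only one value per key and would need a rule for which one wins.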

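The complete code:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeDedup {
    // Emits <student ID, rest of the record> for every input line.
    public static class Map 
        extends Mapper<Object, Text, Text, Text> {
        
        private Text studentId = new Text();
        private Text studentInfo = new Text();
        
        @Override
        public void map(Object key, Text value, Context context
                      ) throws IOException, InterruptedException {
            // Split into at most two parts: the ID and everything after it.
            String[] parts = value.toString().trim().split("\\s+", 2);
            if (parts.length != 2) return; // skip blank or malformed lines
            studentId.set(parts[0]);
            studentInfo.set(parts[1]);
            context.write(studentId, studentInfo);
        }
    }
    
    // Emits each distinct record once per student ID.
    public static class Reduce 
        extends Reducer<Text, Text, Text, Text> {
        
        @Override
        public void reduce(Text key, Iterable<Text> values,
                          Context context
                         ) throws IOException, InterruptedException {
            // The Set drops identical records that share this ID.
            Set<String> uniqueRecords = new HashSet<>();
            for (Text val : values) {
                uniqueRecords.add(val.toString());
            }
            
            // Write the distinct records.
            for (String record : uniqueRecords) {
                context.write(key, new Text(record));
            }
        }
    }
    
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "merge and dedup");
        job.setJarByClass(MergeDedup.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}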

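Package the program as mergededup.jar and upload both files to /user/tmp/input on HDFS (with hadoop fs -mkdir -p and hadoop fs -put, as in the previous section). Then run the job and view the result:

hadoop jar mergededup.jar MergeDedup /user/tmp/input /user/tmp/output
hadoop fs -cat /user/tmp/output/part-r-00000

The output should be the merged records with duplicates removed:

1001 张三 男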
1002 李四 女
1003 王五 男
1004 赵六 男
1005 钱七 女

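4. Data Join Analysis

The last example is a join: we use MapReduce to derive grandparent-grandchild relationships from parent-child records. It shows how MapReduce can join a table with itself.

Suppose there is a file child-parent.txt with one child and one parent per line:

张三 张伟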
李四 张伟
王五 李强
赵六 王五
钱七 赵六

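The goal is to find all grandchild-grandparent pairs. The Mapper emits each record twice: once keyed by the parent (value tagged 1, carrying the child) and once keyed by the child (value tagged 2, carrying the parent). For each person, the Reducer then collects that person's children and parents and pairs every child with every parent, which yields the grandchild-grandparent relationships.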
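
The join can be sketched in plain Java before writing the Hadoop job. This is a local illustration only: the class GrandparentJoin and the method grandPairs are hypothetical names, and two maps play the role of the tagged values that the shuffle would deliver to each reduce call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GrandparentJoin {
    public static List<String> grandPairs(List<String> lines) {
        // "Map" and "shuffle": index every record under both of its people.
        Map<String, List<String>> childrenOf = new TreeMap<>(); // tag 1 values
        Map<String, List<String>> parentsOf = new TreeMap<>();  // tag 2 values
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) continue; // skip blank or malformed lines
            String child = parts[0];
            String parent = parts[1];
            childrenOf.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
            parentsOf.computeIfAbsent(child, k -> new ArrayList<>()).add(parent);
        }

        // "Reduce": for each person with both children and parents,
        // pair every child (grandchild) with every parent (grandparent).
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : childrenOf.entrySet()) {
            List<String> parents = parentsOf.get(e.getKey());
            if (parents == null) continue; // this person's parents are unknown
            for (String grandchild : e.getValue()) {
                for (String grandparent : parents) {
                    out.add(grandchild + " " + grandparent);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The sample rows from child-parent.txt.
        List<String> lines = Arrays.asList(
            "张三 张伟", "李四 张伟", "王五 李强", "赵六 王五", "钱七 赵六");
        for (String pair : grandPairs(lines)) {
            System.out.println(pair);
        }
    }
}
```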

The complete code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FamilyTree {
    public static class Map 
        extends Mapper<Object, Text, Text, Text> {
        
        @Override
        public void map(Object key, Text value, Context context
                      ) throws IOException, InterruptedException {
            String[] relations = value.toString().trim().split("\\s+");
            if (relations.length != 2) return; // skip blank or malformed lines
            
            String child = relations[0];
            String parent = relations[1];
            
            // Left side of the join: keyed by parent, tag 1 carries the child.
            context.write(new Text(parent), new Text("1:" + child));
            // Right side of the join: keyed by child, tag 2 carries the parent.
            context.write(new Text(child), new Text("2:" + parent));
        }
    }
    
    public static class Reduce 
        extends Reducer<Text, Text, Text, Text> {
        
        @Override
        public void reduce(Text key, Iterable<Text> values,
                          Context context
                         ) throws IOException, InterruptedException {
            // Children and parents of the person named by this key.
            List<String> children = new ArrayList<>();
            List<String> parents = new ArrayList<>();
            
            for (Text val : values) {
                String[] parts = val.toString().split(":", 2);
                if (parts[0].equals("1")) {
                    children.add(parts[1]);
                } else {
                    parents.add(parts[1]);
                }
            }
            
            // Join: each child of this person is a grandchild of each parent.
            for (String child : children) {
                for (String parent : parents) {
                    context.write(new Text(child), new Text(parent));
                }
            }
        }
    }
    
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "family tree");
        job.setJarByClass(FamilyTree.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

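Package the program as familytree.jar and upload child-parent.txt to /user/reduce/input on HDFS. Then run the job and view the result:

hadoop jar familytree.jar FamilyTree /user/reduce/input /user/reduce/output
hadoop fs -cat /user/reduce/output/part-r-00000

Each output line is a grandchild followed by a grandparent. 张三 and 李四 do not appear, because the data lists no parent for 张伟:

赵六 李强
钱七 王五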

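This example shows how MapReduce can join related records. Deeper relationships, such as great-grandparents, can be derived by chaining a second MapReduce job that joins this output with the original data.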
