在运行 Hadoop MapReduce 作业时获取文件名/文件数据作为 Map 的键/值输入

Getting Filename/FileData as key/value input for Map when running a Hadoop MapReduce Job(在运行 Hadoop MapReduce 作业时获取文件名/文件数据作为 Map 的键/值输入)
本文介绍了在运行 Hadoop MapReduce 作业时获取文件名/文件数据作为 Map 的键/值输入的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着跟版网的小编来一起学习吧!

问题描述

限时送ChatGPT账号..

我解决了这个问题 如何在运行 Hadoop MapReduce 作业时获取文件名/文件内容作为 MAP 的键/值输入? 在这里.虽然它解释了这个概念,但我无法成功地将其转换为代码.

I went through the question How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job? here. Though it explains the concept, I am unable to successfully transform it to code.

基本上,我希望文件名作为键,文件数据作为值.为此,我按照上述问题中的建议编写了一个自定义 RecordReader .但是我不明白如何在这个类中获取文件名作为键.另外,在编写自定义 FileInputFormat 类时,我无法理解如何返回我之前编写的自定义 RecordReader.

Basically, I want the file name as key and the file data as value. For that I wrote a custom RecordReader as recommended in the aforementioned question. But I couldn't understand how to get the file name as the key in this class. Also, while writing the custom FileInputFormat class, I couldn't understand how to return the custom RecordReader I wrote previously.

RecordReader 代码为:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class CustomRecordReader extends RecordReader<Text, Text> {

    private static final String LINE_SEPARATOR = System.getProperty("line.separator");

    private StringBuffer valueBuffer = new StringBuffer("");
    private Text key = new Text();
    private Text value = new Text();
    private RecordReader<Text, Text> recordReader;

    public SPDRecordReader(RecordReader<Text, Text> recordReader) {
        this.recordReader = recordReader;
    }

    @Override
    public void close() throws IOException {
        recordReader.close();
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return recordReader.getProgress();
    }

    @Override
    public void initialize(InputSplit arg0, TaskAttemptContext arg1)
            throws IOException, InterruptedException {
        recordReader.initialize(arg0, arg1);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {

        if (valueBuffer.equals("")) {
            while (recordReader.nextKeyValue()) {
                valueBuffer.append(recordReader.getCurrentValue());
                valueBuffer.append(LINE_SEPARATOR);
            }
            value.set(valueBuffer.toString());
            return true;
        }
        return false;
    }

}

而不完整的FileInputFormat类是:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CustomFileInputFormat extends FileInputFormat<Text, Text> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit arg0, JobConf arg1,
            Reporter arg2) throws IOException {
        return null;
    }
}

推荐答案

在你的 CustomRecordReader 类中有这个代码.

Have this code in your CustomRecordReader class.

private LineRecordReader lineReader;

private String fileName;

public CustomRecordReader(JobConf job, FileSplit split) throws IOException {
    lineReader = new LineRecordReader(job, split);
    fileName = split.getPath().getName();
}

public boolean next(Text key, Text value) throws IOException {
    // get the next line
    if (!lineReader.next(key, value)) {
        return false;
    }    

    key.set(fileName);
    value.set(value);

    return true;
}

public Text createKey() {
    return new Text("");
}

public Text createValue() {
    return new Text("");
}

删除 SPDRecordReader 构造函数(这是一个错误).

Remove SPDRecordReader constructor (It is an error).

并在您的 CustomFileInputFormat 类中包含此代码

And have this code in your CustomFileInputFormat class

public RecordReader<Text, Text> getRecordReader(
  InputSplit input, JobConf job, Reporter reporter)
  throws IOException {

    reporter.setStatus(input.toString());
    return new CustomRecordReader(job, (FileSplit)input);
}

这篇关于在运行 Hadoop MapReduce 作业时获取文件名/文件数据作为 Map 的键/值输入的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持跟版网!

本站部分内容来源互联网,如果有图片或者内容侵犯了您的权益,请联系我们,我们会在确认后第一时间进行删除!

相关文档推荐

How to send data to COM PORT using JAVA?(如何使用 JAVA 向 COM PORT 发送数据?)
How to make a report page direction to change to quot;rtlquot;?(如何使报表页面方向更改为“rtl?)
Use cyrillic .properties file in eclipse project(在 Eclipse 项目中使用西里尔文 .properties 文件)
Is there any way to detect an RTL language in Java?(有没有办法在 Java 中检测 RTL 语言?)
How to load resource bundle messages from DB in Java?(如何在 Java 中从 DB 加载资源包消息?)
How do I change the default locale settings in Java to make them consistent?(如何更改 Java 中的默认语言环境设置以使其保持一致?)