Chen Debra
Efficient Management of MapReduce Tasks with DolphinScheduler

MapReduce is a programming model for processing and generating large datasets, designed for parallel computation over terabyte-scale data. This article provides a detailed overview of how DolphinScheduler handles MapReduce tasks, covering the differences between GenericOptionsParser and raw args, a complete explanation of the hadoop jar command parameters, MapReduce code examples, and instructions for configuring and running MapReduce tasks in DolphinScheduler.

Differences between GenericOptionsParser and args

GenericOptionsParser is used as follows:

GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
String[] remainingArgs = optionParser.getRemainingArgs();

A look at the source code of GenericOptionsParser shows that it works in two steps:

  1. Constructor:

    public GenericOptionsParser(Configuration conf, String[] args) 
          throws IOException {
        this(conf, new Options(), args); 
    }
    
  2. Parsing Options:

    private boolean parseGeneralOptions(Options opts, String[] args) throws IOException {
        opts = buildGeneralOptions(opts);
        CommandLineParser parser = new GnuParser();
        boolean parsed = false;
        try {
            commandLine = parser.parse(opts, preProcessForWindows(args), true);
            processGeneralOptions(commandLine);
            parsed = true;
        } catch (ParseException e) {
            LOG.warn("options parsing failed: " + e.getMessage());
            HelpFormatter formatter = new HelpFormatter();
            formatter.printHelp("general options are: ", opts);
        }
        return parsed;
    }
    

    GenericOptionsParser recognizes options such as -fs, -jt, -D, -libjars, -files, -archives, and -tokenCacheFile and places their values into Hadoop’s Configuration.

With plain args, by contrast, these options (-fs, -jt, -D, etc.) must be parsed manually.
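To make the difference concrete, here is a minimal driver sketch (the class name and job wiring are illustrative; it uses the WCMapper shown later in this article) that lets GenericOptionsParser consume the generic options and then reads only the remaining positional arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Generic options (-D, -files, -libjars, ...) are applied to conf here;
        // only the leftover positional arguments (input/output paths) remain.
        String[] remainingArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WCMapper.class);
        // A reducer would be wired in the same way with job.setReducerClass(...)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(remainingArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(remainingArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}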

Complete Hadoop jar Command Parameters

hadoop jar wordcount.jar org.myorg.WordCount \
    -fs hdfs://namenode.example.com:8020 \
    -jt resourcemanager.example.com:8032 \
    -D mapreduce.job.queuename=default \
    -libjars /path/to/dependency1.jar,/path/to/dependency2.jar \
    -files /path/to/file1.txt,/path/to/file2.txt \
    -archives /path/to/archive1.zip,/path/to/archive2.tar.gz \
    -tokenCacheFile /path/to/credential.file \
    /input /output

This command:

  1. Submits the job to the specified HDFS.
  2. Uses the specified YARN ResourceManager.
  3. Queues the job under default.
  4. Adds dependencies and files.
  5. Distributes archives and credentials.
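Note that these generic options are only honored if the main class actually parses them, either via GenericOptionsParser as shown earlier or, more idiomatically, via ToolRunner. A minimal sketch of what a main class like org.myorg.WordCount might look like (names and wiring illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains everything set via -fs, -jt, -D, etc.
        Configuration conf = getConf();
        // ... build and submit the Job as usual, using args[0]/args[1] as input/output ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner applies GenericOptionsParser internally before calling run()
        System.exit(ToolRunner.run(new Configuration(), new WordCount(), args));
    }
}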

MapReduce Examples

Classic WordCount Example

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line on whitespace and emit a (word, 1) pair per token
        String[] fields = value.toString().split("\\s+");
        for (String field : fields) {
            word.set(field);
            context.write(word, one);
        }
    }
}

// Other classes omitted for brevity
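The remaining classes are omitted above; for completeness, a minimal reducer to pair with WCMapper could look like this (a sketch, not the article's original code):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all counts emitted for this word by the mappers
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}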

File Distribution Example

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfigMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final List<String> whiteList = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        URI[] files = context.getCacheFiles();
        if (files != null && files.length > 0) {
            // Files distributed via -files (or Job#addCacheFile) are symlinked
            // into the task's working directory, so they can be opened by name
            try (BufferedReader reader = new BufferedReader(new FileReader("white.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    whiteList.add(line);
                }
            }
        }
    }
}
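For this to work, white.txt has to be shipped with the job, either via -files /path/to/white.txt on the hadoop jar command line, or programmatically in the driver (a sketch; the HDFS path is illustrative):

// In the driver: ship white.txt with the job; the '#white.txt' fragment
// sets the symlink name in the task's working directory
job.addCacheFile(new URI("hdfs:///config/white.txt#white.txt"));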

Using MapReduce with DolphinScheduler

Setting up the Yarn test Queue

In capacity-scheduler.xml:

<property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default, test</value>
</property>
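Defining the queue list alone is not enough: the CapacityScheduler also requires a capacity for every queue, and sibling capacities must add up to 100. A minimal sketch (the 70/30 split is just an example):

<property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>70</value>
</property>
<property>
    <name>yarn.scheduler.capacity.root.test.capacity</name>
    <value>30</value>
</property>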

Refresh with yarn rmadmin -refreshQueues.

Execution Results

[Screenshot: MapReduce task execution results in DolphinScheduler]

Yarn Job

[Screenshot: the corresponding job in the Yarn web UI]

Source Code Analysis

org.apache.dolphinscheduler.plugin.task.mr.MapReduceArgsUtils#buildArgs

String others = param.getOthers();
// If the queue is not specified via the -D mapreduce.job.queuename option,
// the queue name is taken from the "Yarn Queue" field on the page.
if (StringUtils.isEmpty(others) || !others.contains(MR_YARN_QUEUE)) {
    String yarnQueue = param.getYarnQueue();
    if (StringUtils.isNotEmpty(yarnQueue)) {
        args.add(String.format("%s%s=%s", D, MR_YARN_QUEUE, yarnQueue));
    }
}

// "others" holds the optional parameters field from the page,
// where -conf, -archives, -files, -libjars, and -D can be specified.
if (StringUtils.isNotEmpty(others)) {
    args.add(others);
}
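For example, if the task's "Yarn Queue" field is set to test and the optional parameters field does not already contain mapreduce.job.queuename, buildArgs appends a flag of the following form to the generated hadoop jar command line (illustrative):

-Dmapreduce.job.queuename=test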
