A faster way to count the number of lines in a file or directory using the MapReduce framework

On this site you can see one way to count the number of lines in a file.
That version emits a count of one for each record in each map task. So if one map task holds 10,000 lines, 10,000 values are passed to the reducer. With more than one mapper, each of them writes that many intermediate records.
Let's reduce the intermediate writes.

Below is an optimized way to count the number of lines in a file or directory.
The changes are in:
1. Mapper
Instead of emitting a one for each record, we increment a line counter in map() and emit it once, in the cleanup() phase.
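import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineCntMapper extends
        Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text keyEmit = new Text("Total Lines");
    private final IntWritable valEmit = new IntWritable();
    // Lines seen so far by this map task
    private int partialSum = 0;

    @Override
    public void map(LongWritable key, Text value, Context context) {
        // Count the line locally; nothing is written to the context here
        partialSum++;
    }

    @Override
    public void cleanup(Context context)
            throws IOException, InterruptedException {
        // Runs once per map task, after the last record: emit the task's total.
        // Exceptions propagate so that a failed write fails the task
        // instead of exiting the JVM with a success status.
        valEmit.set(partialSum);
        context.write(keyEmit, valEmit);
    }
}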

So if we have 5 map tasks, we emit only 5 intermediate key-value pairs.
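
To make the difference in intermediate output concrete, here is a small plain-Java simulation with no Hadoop dependency. It is only a sketch: the class name IntermediatePairCount, the helper methods, and the figures (5 map tasks, 10,000 lines per split) are made up for illustration and are not part of the job itself.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of the two emission strategies (hypothetical helper class).
class IntermediatePairCount {

    // Per-record emission: every line produces one ("Total Lines", 1) pair.
    static List<Integer> emitPerRecord(int linesInSplit) {
        List<Integer> emitted = new ArrayList<>();
        for (int i = 0; i < linesInSplit; i++) {
            emitted.add(1);
        }
        return emitted;
    }

    // Per-task emission: lines are counted locally and a single pair is
    // emitted at the end, mirroring what cleanup() does in LineCntMapper.
    static List<Integer> emitPerTask(int linesInSplit) {
        int partialSum = 0;
        for (int i = 0; i < linesInSplit; i++) {
            partialSum++;
        }
        List<Integer> emitted = new ArrayList<>();
        emitted.add(partialSum);
        return emitted;
    }

    public static void main(String[] args) {
        int mapTasks = 5;            // assumed number of map tasks
        int linesPerSplit = 10_000;  // assumed lines per input split

        int perRecordPairs = 0;
        int perTaskPairs = 0;
        long totalLines = 0;
        for (int t = 0; t < mapTasks; t++) {
            perRecordPairs += emitPerRecord(linesPerSplit).size();
            List<Integer> out = emitPerTask(linesPerSplit);
            perTaskPairs += out.size();
            totalLines += out.get(0);
        }
        System.out.println("per-record pairs: " + perRecordPairs); // 50000
        System.out.println("per-task pairs:   " + perTaskPairs);   // 5
        System.out.println("total lines:      " + totalLines);     // 50000
    }
}
```

Both strategies arrive at the same total; only the number of intermediate pairs that must be written and shuffled changes.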

2. Driver
In the driver, we also register a combiner:
job.setMapperClass(LineCntMapper.class);
job.setCombinerClass(LineCntReducer.class);
job.setReducerClass(LineCntReducer.class);

The combiner does nothing more than the reducer does, so we can reuse the reducer class as the combiner.
The reducer doesn't need any change.
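
The reason the reducer can double as the combiner is that summing gives the same answer however the partial sums are grouped. The plain-Java sketch below illustrates this, assuming the reducer simply adds up the values for the key. The class CombinerSketch, the sum() helper, and the sample counts are hypothetical and stand in for the real Hadoop classes.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of why a summing reducer is safe to reuse as a combiner.
class CombinerSketch {

    // Shared logic for combiner and reducer: add up every count for the key.
    static int sum(List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed partial sums emitted by five map tasks
        List<Integer> partialSums = Arrays.asList(10_000, 9_500, 10_200, 8_300, 10_000);

        // Reducer alone: sums all partial sums directly
        int withoutCombiner = sum(partialSums);

        // Combiner first: pre-sums arbitrary groups, then the reducer sums the results
        int groupA = sum(partialSums.subList(0, 2));
        int groupB = sum(partialSums.subList(2, 5));
        int withCombiner = sum(Arrays.asList(groupA, groupB));

        System.out.println(withoutCombiner == withCombiner); // true
    }
}
```

An operation that is not associative and commutative, such as an average, could not be reused as a combiner this way.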

If you run this code, you should get the result faster than with the previously mentioned code on this site, because far fewer intermediate records are written and shuffled.

Working code is here

Happy Hadooping........

