An Example: Let’s use mrjob in Python to run a simple MapReduce job. The program counts the characters, words, and lines in a text document.
First, pick some text and save it to a text file. Here, I copied the definition of MapReduce from Wikipedia (https://en.wikipedia.org/wiki/MapReduce) and saved it as ‘MapReduce_wiki.txt’.
Then, define the mapper and reducer functions with mrjob to count the characters, words, and lines in ‘MapReduce_wiki.txt’.
Code (Installation_instruction_examples.py):
# -*- coding: utf-8 -*-
"""
From:
https://mrjob.readthedocs.io/en/latest/guides/quickstart.html#writing-your-first-job
Description: This is a simple example to count the numbers of chars, words, and lines.
"""
from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)  # count num of characters
        yield "words", len(line.split())  # count num of words
        yield "lines", 1  # count num of lines: each line adds 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()  # main program to call/run MRWordFrequencyCount
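To see what the mapper and reducer do before running the real job, here is a minimal sketch that imitates the map, group-by-key, and reduce phases in plain Python, without mrjob. The helper name run_local and the two sample lines are made up for illustration; mrjob itself handles the grouping step internally.

```python
from collections import defaultdict


def mapper(line):
    """Emit the same (key, value) pairs as the mrjob mapper above."""
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1


def reducer(key, values):
    """Sum all values that share a key, like the mrjob reducer."""
    yield key, sum(values)


def run_local(lines):
    """Hypothetical helper: map every line, group values by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)  # the "shuffle" step: collect values per key

    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    # Two made-up input lines, not the real contents of MapReduce_wiki.txt.
    sample = ["MapReduce is a programming model", "for processing big data sets"]
    print(run_local(sample))  # {'chars': 60, 'lines': 2, 'words': 10}
```

Each input line produces three pairs, and the reducer is called once per key with all the values collected for that key.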
Finally, run the job. There are two options:
Option 1 - Testing locally on your computer: open a command prompt (cmd.exe) and change the current working directory to the folder that holds your Python script ‘Installation_instruction_examples.py’ and the input file ‘MapReduce_wiki.txt’. For example, mine are stored in ‘E:\My work(laptop)\Comp6210\Python’. Then run ‘python Installation_instruction_example2.py MapReduce_wiki.txt >output_instruction_example2.txt’. If you saved the script under a different name from the one in the command, adjust the command to match.
Option 2 - Running on VirtualBox: ‘sudo python3.5 Installation_instruction_example2.py MapReduce_wiki.txt >output_instruction_example2.txt’
(Note that the default version of Python in VirtualBox is 2.6.6, which is too old to support mrjob, so you need to install a newer version of Python. I installed Python 3.5 in VirtualBox to support mrjob.)
First, start VirtualBox and copy the Python script ‘Installation_instruction_examples.py’ and the input file ‘MapReduce_wiki.txt’ into VirtualBox. To do this, click ‘Machine’ and choose ‘File Manager’.
Enter the user name and password (both are ‘cloudera’), then click the ‘Create Session’ button.
Right-click and select ‘Open in Terminal’.
Enter the following commands in the terminal one by one to install Python 3.5.
python --version
wget https://www.python.org/ftp/python/3.5.2/Python-3.5.2.tgz
tar -xvzf Python-3.5.2.tgz
cd Python-3.5.2
./configure --prefix=/usr/local
sudo make altinstall
python3.5 --version
Then install mrjob with the pip3.5 that belongs to Python 3.5. The command is ‘sudo pip3.5 install mrjob’.
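A quick way to confirm that the new interpreter and mrjob are in place is a small check script run with python3.5. This is only a sketch; the function names python_is_new_enough and module_is_installed are made up for illustration, and the 3.5 minimum simply mirrors the version installed above.

```python
import importlib.util
import sys


def python_is_new_enough(minimum=(3, 5)):
    """Return True if the running interpreter is at least the given (major, minor)."""
    return sys.version_info[:2] >= minimum


def module_is_installed(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Python >= 3.5:", python_is_new_enough())
    print("mrjob installed:", module_is_installed("mrjob"))
```

If the last line prints False, the pip command was probably run for a different Python installation than the one executing the script.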
Python 3.5 and mrjob are now installed in VirtualBox. Next, change the current working directory to the desktop folder where ‘Installation_instruction_examples.py’ and the input file ‘MapReduce_wiki.txt’ are stored. The command is ‘cd /home/cloudera/Desktop’.
Finally, enter the execution command:
‘sudo python3.5 Installation_instruction_example2.py MapReduce_wiki.txt >output_instruction_example2.txt’
Output: Open the output file ‘output_instruction_example2.txt’ to see the results.
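The output file can also be read back in Python. The sketch below assumes mrjob's default output format, where each line holds a JSON-encoded key, a tab, and a JSON-encoded value. The function name parse_mrjob_output is made up for illustration, and the numbers in the sample are invented, not the real counts for ‘MapReduce_wiki.txt’.

```python
import json


def parse_mrjob_output(text):
    """Turn lines of the form  "key"<TAB>value  into a dictionary."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key_json, value_json = line.split("\t", 1)
        counts[json.loads(key_json)] = json.loads(value_json)
    return counts


if __name__ == "__main__":
    # Made-up sample in the assumed format; replace with the contents of
    # output_instruction_example2.txt to read real results.
    sample = '"chars"\t60\n"lines"\t2\n"words"\t10\n'
    print(parse_mrjob_output(sample))
```

To use it on the real file, read the file's text with open(...).read() and pass it to the function.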
If you want to learn more about mrjob, visit https://mrjob.readthedocs.io/en/latest/
Run map-reduce jobs on the Hadoop environment
You need to install Java and Hadoop.
The main steps are:
(1) Install Java
(2) Download Hadoop binaries
(3) Set up environment variables
(4) Configure Hadoop cluster
(5) Format name node
(6) Start Hadoop services
For Windows operating systems, here is one set of installation instructions:
(1) https://kontext.tech/column/hadoop/377/latest-hadoop-321-installation-on-windows-10-step-by-step-guide
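Once the steps above are done, a short script can confirm that the java and hadoop commands are reachable from the command line. This is only a sketch: the function name find_tools is made up, and JAVA_HOME and HADOOP_HOME are assumed here to be the environment variables set in step (3); your installation guide may use others.

```python
import os
import shutil


def find_tools(names=("java", "hadoop")):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(name, "->", path or "not found on PATH")
    # Assumed variable names; check the installation guide you followed.
    print("JAVA_HOME =", os.environ.get("JAVA_HOME", "(not set)"))
    print("HADOOP_HOME =", os.environ.get("HADOOP_HOME", "(not set)"))
```

If either command is reported as not found, revisit the environment variable step before trying to run a job on Hadoop.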
After installation, you can run example 2 in the Hadoop environment by adding the ‘-r hadoop’ option. The command is
‘python Installation_instruction_example2.py -r hadoop MapReduce_wiki.txt >output_instruction_example2.txt’