HW1

1. It is perfectly fine to slightly modify the code so that it runs on Python 3.x instead of 2.x. If you do, be sure to include a short note documenting the change you made to get it to run (it should be very minor). Do not use Jupyter notebooks, R, or any other language. You must fill in the .py file to receive a grade.
3. For EVERY unique map reduce job, you have to clear out the global data structures! Otherwise, you’re persisting information from one MR job to the next, and you will get wrong results. The following may help to clarify: (a) a map reduce JOB is run over a set of documents (the ‘corpus’), since you’re doing word count over a corpus; (b) when computing the execution times of map and reduce, make sure to take this into account. Specifically, you will have to ADD (not average) the execution times of the mapper over the documents in a corpus (e.g., if the mapper takes 3 seconds to process doc_1, 2 seconds for doc_2 and 4 seconds for doc_3, the total time taken by the mapper is 9 seconds); the same applies to the reducer. You are, of course, generating corpora ten times for each document length (say, a length of 10). This will give you ten mapper times and ten reducer times for that length. The ten mapper times are averaged into one number, and the ten reducer times into another; these two averages are your ‘data point’ for that length, along with error bars. You can use any tool you want for visualization (Excel works well enough, but you could use matplotlib or other external tools). Attach the plots to your submission.
4. Keep your submission as organized as possible, but be sure to include all the supporting data you can, including the plots and the raw timing data you used to produce them. As in real life, the HW doesn’t have a single right or wrong answer. We are more interested in seeing the effort you made and good reasoning for the decisions or assumptions you found yourself making in the context of the assignment.
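
The timing procedure in item 3 can be sketched roughly as follows. This is only an illustration, not the provided HW code: the names intermediate, results, clear_globals, mapper, reducer, time_job and summarize are all hypothetical stand-ins, and using the sample standard deviation for the error bars is an assumption (the assignment does not prescribe a particular error measure).

```python
import statistics
import time
from collections import defaultdict

# Hypothetical stand-ins for the global data structures in the .py file.
intermediate = defaultdict(list)  # word -> list of partial counts
results = {}                      # word -> final count


def clear_globals():
    """Reset global state so nothing leaks from one MR job into the next."""
    intermediate.clear()
    results.clear()


def mapper(doc):
    """Hypothetical word-count mapper: emit a 1 for every word in the document."""
    for word in doc.split():
        intermediate[word].append(1)


def reducer(word, counts):
    """Hypothetical word-count reducer: sum the partial counts for one word."""
    results[word] = sum(counts)


def time_job(corpus):
    """Run one MR job over a corpus; return (total map time, total reduce time)."""
    clear_globals()  # one job = one corpus, so start from a clean state
    map_time = 0.0
    for doc in corpus:
        start = time.perf_counter()
        mapper(doc)
        map_time += time.perf_counter() - start  # ADD per-document times
    reduce_time = 0.0
    for word, counts in intermediate.items():
        start = time.perf_counter()
        reducer(word, counts)
        reduce_time += time.perf_counter() - start
    return map_time, reduce_time


def summarize(times):
    """Average the per-corpus totals; the sample std dev is an assumed error bar."""
    return statistics.mean(times), statistics.stdev(times)
```

For one document length, you would call time_job once on each of the ten generated corpora, collect the ten map totals and the ten reduce totals in two separate lists, and pass each list to summarize to get one data point (mean plus error bar) per list.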