11 Nov 2015

Python North East - Introduction to Apache Spark

This month’s Python North East talk will be an introduction to Apache Spark, given by yours truly.

The presentation slides for the talk can be found here.

Code Dojo

Get Spark

After the talk there will be chance to have a play with Spark yourself. Firstly you need to download the Spark binaries and unpack them to a folder on your local machine.

I also have a few USB keys with the binaries on, so wave your hand if the network is slow.

Run the REPL

If you want to run an interactive PySpark interpreter, run the following command inside the Spark directory:

$ ./bin/pyspark --master local[2]

The --master flag lets you specify the master node of a cluster; the local option instead runs Spark on your own machine with the specified number of worker threads. Be careful not to set this number too high (certainly no higher than the number of CPU cores you have available) or you will freeze your machine.

If you want to import packages that are not in the standard library (or NumPy), you can pass the .egg, .zip or .py files as an argument. You can then use them as a standard import:

$ ./bin/pyspark --master local[2] --py-files code.py
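As an illustration, a hypothetical code.py (the filename and function here are my own, not part of Spark) might hold a helper you want available on the workers:

```python
# code.py -- shipped alongside the job with --py-files
def tokenize(line):
    # Simple lower-case, whitespace tokeniser shared by the
    # driver and the workers
    return line.lower().split()
```

In the shell you could then run from code import tokenize and use it inside your RDD transformations.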

Run a script

You can write your scripts and submit them to Spark using the spark-submit script in the bin folder. This works similarly to the PySpark shell:

$ ./bin/spark-submit --master local[2] --py-files dependencies.py path-to-script/myscript.py args

Remember to add the following import to your script:

from pyspark import SparkContext

The spark-submit script will handle the dependencies for this automatically. In your scripts you also need to create a SparkContext object yourself (unlike in the PySpark shell, where one is provided for you as sc):

sc = SparkContext("local[2]", "MyAppName")

Supplying the master address in the script is optional; if you do, you don’t need to use the --master flag in the spark-submit call.
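As an aside, any args given after the script name on the spark-submit command line arrive in the script via sys.argv. A minimal sketch of that handling (the function and file names here are my own illustration, not part of Spark):

```python
def parse_args(argv):
    # argv[0] is the script path; the entries after it are the args
    # given at the end of the spark-submit command line
    # (in a real script you would call this as parse_args(sys.argv))
    if len(argv) < 2:
        raise SystemExit("usage: myscript.py <input-path>")
    return argv[1]

input_path = parse_args(["myscript.py", "shakespeare.txt"])
# input_path is now "shakespeare.txt"; the SparkContext would be
# created next, e.g. sc = SparkContext("local[2]", "MyAppName")
```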

Word Count

The easiest way to get up and running with Spark is to try the word count example. You can run this via the shell or copy and paste the code into a script and use spark-submit:

text_file = sc.textFile("path/to/textfile.txt")

words = text_file.flatMap(lambda line: line.split(" "))

word_tuples = words.map(lambda word: (word, 1))

counts = word_tuples.reduceByKey(lambda a, b: a + b)

#The .collect() method will return the RDD as a list of tuples
#Only use this if you know the resulting list will fit into the memory on your machine!
results = counts.collect()

A good text to try this on is the Complete Works of Shakespeare, freely available as a text file from Project Gutenberg.

Obviously the code above doesn’t deal with punctuation and the like, and only gives you a raw word count. The power of Python can then be used to do cool things with this “big” data!
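For example, a small normalising function (a sketch; the name and regex are my own) could replace the lambda in the flatMap step above:

```python
import re

def normalise(line):
    # Lower-case the line and keep only runs of letters/apostrophes,
    # dropping punctuation and digits
    return re.findall(r"[a-z']+", line.lower())

# In the word count this would become: words = text_file.flatMap(normalise)
normalise("To be, or not to be: that is the question.")
# -> ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
```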

Spark Streaming Word Count

An easy way to see Spark Streaming in action is to extend the word count example to count incoming words to a port on your machine:

from __future__ import print_function

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

#Set the time window in seconds
time_window = 1

sc = SparkContext(master="local[2]", appName="PythonStreamingNetworkWordCount")
ssc = StreamingContext(sc, time_window)

#Create a DStream of the lines arriving on the specified port
lines = ssc.socketTextStream("localhost", 9999)

counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)

counts.pprint()

ssc.start()
ssc.awaitTermination()

To provide input to the streaming job, you need to first run a Netcat server:

$ nc -lk 9999

and then run the Spark script using spark-submit. Now type away to your heart’s content in the Netcat terminal and watch the word counts pop up in the Spark terminal.
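To see what the flatMap/map/reduceByKey chain produces for a single one-second batch, the same logic can be sketched in plain Python (no Spark required; the function name is my own):

```python
from collections import Counter

def count_batch(lines):
    # Mirrors flatMap(split) -> map((word, 1)) -> reduceByKey(a + b)
    # for the lines received in one micro-batch
    words = [word for line in lines for word in line.split(" ")]
    return Counter(words)

counts = count_batch(["to be or", "not to be"])
# counts["to"] == 2, counts["be"] == 2, counts["or"] == 1
```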

Obviously this basic script can be extended to do all kinds of interesting things!

03 Mar 2015

Dual Booting Ubuntu and Windows on Surface Pro 3

As part of my PhD I was generously given a brand new Microsoft Surface Pro 3. Unfortunately, as I have been an Ubuntu user for the better part of a decade, using Windows again was a less than optimal experience. Having read many blog posts about the Surface Pro 3 hardware not being supported by the current Ubuntu kernel, I resolved to wait and battle on with Windows 8.1. However, after a severe bout of yak shaving I managed to get the latest 15.04 beta 2 working (mostly).

Back the hell up!

Before you attempt this with your own shiny Surface Pro 3 (SP3) I highly, strongly, emphatically recommend that you grab an 8 GB USB thumb drive and make a Windows recovery disk (via Search > “Create a recovery drive”). The SP3 does have its own recovery partition, but to get the dual boot working you are going to be messing with partition tables, and it is a really good idea to have a standalone recovery option in case everything goes to hell. Also, as always, back up your personal files. They should be safe through the repartitioning, but you never know. You have been warned!

Get bootin’

I set up the dual boot by following this very informative, and well illustrated, blog post from David Elner. However, there were a few key differences.

  1. I used the latest 15.04 Beta 2 build of Ubuntu Gnome instead of the vanilla 14.10 Ubuntu David used. 15.04 is based on the 3.18 kernel and I was hoping this would provide some performance improvements. I prefer Gnome as a desktop environment and it also has multi-touch features for tablets. This is just a preference, and when Ubuntu Touch is released for general hardware that may take over as the default.
  2. I did not install the wifi and keyboard drivers, as the ones David used were designed for the 3.17 kernel. However, I found that with 15.04 the wifi and keyboard work fine. The touch pad doesn’t work, but as it is possibly the worst touch pad I have ever used I didn’t see that as much of an issue.
  3. I did not install rEFInd or mess about with any of the EFI signing. It is not required to make the dual boot work, and I am fine with the red screen of doom that awaits any soul who dares disable secure boot!
  4. I changed the default boot OS in grub so that if I boot the SP3 without a keyboard attached it will boot into Windows. This is because, at the moment, Gnome’s tablet support is a bit basic (but it will be getting better).

The end result

So after all of the above I have a working Ubuntu install on my SP3.

So that’s it. Other users have reported fun and games with Windows overwriting grub, and have resorted to putting their boot partition on a MicroSD card. I haven’t had this problem yet, but it is early days. In the near future I will attempt to run the latest mainline kernel builds (3.19 and 4.0) to see if they get the touchpad working, but at the moment I am very happy with the outcome and my SP3 is now a viable work computer (even if it is still a bit of a toasterfridge).

01 Dec 2014

Install Microsoft Azure tools on Ubuntu 14.04

As part of my MRes I am using Microsoft’s Azure cloud platform. I was pleasantly surprised to see that Microsoft provides Linux support for Azure and SDKs for a lot of programming languages. However, I hit a few bumps getting the command line tools installed due to some quirks with Ubuntu.

The easiest way to get the Azure CLI is to use the Node.js package manager npm:

$ sudo apt-get install nodejs
$ sudo apt-get install npm

This should all work fine and then you can simply install the Azure CLI using the command below:

$ npm install azure-cli -g

Now, on other Linux distros this may work fine. However, on Ubuntu 14.04 there is an issue: the Azure CLI invokes Node.js using the node command, while on Ubuntu the binary is called nodejs. As a result you get the error below when you call azure:

/usr/bin/env: node: No such file or directory

To fix this, use the command below to tell Ubuntu that when any program calls node it really means nodejs:

$ sudo update-alternatives --install /usr/bin/node nodejs /usr/bin/nodejs 100

You should now be able to call azure and all its functions!