The presentation slides for the talk can be found here.
After the talk there will be a chance to have a play with Spark yourself. First, you need to download the Spark binaries and unpack them to a folder on your local machine.
I also have a few USB keys with the binaries on, so wave your hand if the network is slow.
If you want to run an interactive PySpark interpreter, run the following command inside the Spark directory:
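Assuming the standard layout of the Spark download (a bin directory at the top level), the command looks something like this; four worker threads is just an example:

```shell
# Launch the interactive PySpark shell with 4 local worker threads
./bin/pyspark --master "local[4]"
```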
The --master flag lets you specify the master node of the cluster; the local option instead runs Spark on your own machine with the specified number of worker threads. Be careful not to set this number too high (certainly no higher than the number of CPU cores you have available) or you will freeze your machine.
If you want to be able to import packages not in the standard library (or NumPy), then you can include the .egg, .zip or .py files as an argument. You can then use them as a standard import:
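For example, with a dependency packaged as mylib.egg (a placeholder name), you could ship it to the workers via the --py-files option:

```shell
# Make mylib.egg importable on the driver and workers
./bin/pyspark --master "local[4]" --py-files mylib.egg
```

Inside the shell you can then write import mylib as usual.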
You can write your scripts and submit them to Spark using the spark-submit script in the bin folder. This works similarly to the PySpark shell:
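For instance, submitting a script called wordcount.py (a placeholder file name) to a local cluster might look like:

```shell
# Run a standalone PySpark script locally with 4 worker threads
./bin/spark-submit --master "local[4]" wordcount.py
```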
Remember to add the following import to your script:
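The import in question is the SparkContext class from the pyspark package:

```python
# Needed in standalone scripts; the shell imports this for you
from pyspark import SparkContext
```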
The spark-submit script will handle the dependencies for this automatically. For your scripts you also need to create a SparkContext object yourself (unlike in the PySpark shell, where this is provided for you):
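A minimal sketch, in which the master address and application name are arbitrary choices:

```python
from pyspark import SparkContext

# Create the context yourself in a standalone script; in the
# PySpark shell this object already exists as `sc`.
sc = SparkContext(master="local[2]", appName="MyApp")
```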
Supplying the master address in the script is optional; if you do, you don't need to use the --master flag in the spark-submit call.
The easiest way to get up and running with Spark is to try the word count example. You can run this via the shell or copy and paste the code into a script and use spark-submit:
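A minimal version of the word count might look like this (the input file name is a placeholder; if you paste it into the shell, drop the SparkContext lines, since sc already exists there):

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "WordCount")

# Read a text file, split each line into words, pair each word
# with a 1, then sum the counts per word
text = sc.textFile("input.txt")
counts = (text.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

# Bring the results back to the driver and print them
for word, count in counts.collect():
    print(word, count)
```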
Obviously the code above doesn't deal with punctuation, capitalisation and so on, and only gives you a word count. The power of Python can then be used to do cool things with this “big” data!
An easy way to see Spark Streaming in action is to extend the word count example to count incoming words to a port on your machine:
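A sketch of the streaming version, assuming localhost port 9999 (an arbitrary choice) and one-second batches:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 1)  # process the stream in 1-second batches

# Count the words arriving on localhost port 9999
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print the counts for each batch

ssc.start()
ssc.awaitTermination()
```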
To provide input to the streaming job, you need to first run a Netcat server:
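For example, listening on port 9999 (an arbitrary choice; it just has to match the port your streaming script reads from):

```shell
# -l listens for connections, -k keeps listening after a client disconnects
nc -lk 9999
```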
and then run the Spark script using spark-submit. Now type away to your heart's content in the Netcat terminal and watch the word counts pop up in the Spark terminal.
Obviously this basic script can be extended to do all kinds of interesting things!