A crash course about the Pythonsphere

Overview:

Pythonsphere

Python
NumPy
Scikit-learn
Matplotlib
SciPy
Pandas
Other stuff

Exercises

Python exercises
NumPy exercises
Scikit-learn exercises
Matplotlib exercises

The Pythonsphere

Python, together with its scientific packages (NumPy, Matplotlib, SciPy, Scikit-learn, etc.), is considered one of the best scientific programming languages; many would argue it is the best. It is heavily used in the ML community, and it is free and portable.

To install Python and all the required packages, you can use the Anaconda installer.

Python

Python is a general-purpose, dynamically typed, interpreted programming language. As such, it is easy to learn and can be used to build complex systems quickly.
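
As a small illustration of what "dynamically typed" means in practice, the sketch below rebinds one name to values of different types and calls one function on several types. The names `value` and `double` are made up for this example.

```python
# Names are not bound to a type: the same variable can hold an int, then a str.
value = 42
print(type(value).__name__)    # int
value = "forty-two"
print(type(value).__name__)    # str


def double(x):
    """Work on any object that supports `*` with an integer."""
    return x * 2


doubled_int = double(21)        # 42
doubled_str = double("ab")      # "abab"
doubled_list = double([1, 2])   # [1, 2, 1, 2]
print(doubled_int, doubled_str, doubled_list)
```

Type errors are only detected when the offending line actually runs, which is part of the price paid for this flexibility.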

On the other hand, and by its very nature, pure Python is slow. Fortunately, it interfaces easily with C. As a consequence, most heavy number crunching is delegated to such compiled languages, which gives both ease of use and good performance.

More resources about Python

NumPy

NumPy is the fundamental package for scientific computing. It includes

a powerful N-dimensional array object;
sophisticated (broadcasting) functions;
tools to integrate C/C++ and Fortran code;
useful linear algebra, Fourier transform, and random number capabilities.

Together with SciPy and Matplotlib, it can serve as a replacement for MATLAB.
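
To make the feature list above more concrete, here is a minimal sketch that touches the array object, an elementwise function, linear algebra, random numbers and the Fourier transform. The arrays and variable names are invented for illustration.

```python
import numpy as np

# N-dimensional array object: a 2x3 array of floats.
M = np.array([[1.0, 4.0, 9.0],
              [16.0, 25.0, 36.0]])

# Elementwise (vectorized) function applied to the whole array at once.
roots = np.sqrt(M)

# Linear algebra: solve the 2x2 system S @ x = b.
S = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(S, b)           # [1., 2.]

# Random numbers from a seeded generator, for reproducibility.
rng = np.random.default_rng(0)
samples = rng.normal(size=5)

# Fourier transform of a constant signal: everything lands in frequency zero.
spectrum = np.fft.fft(np.ones(4))   # [4, 0, 0, 0]

print(roots, x, samples, spectrum, sep="\n")
```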

More resources about NumPy

Scikit-learn

Scikit-learn (sklearn for short) is the most widely used general-purpose machine learning library in Python. It is built on top of NumPy, SciPy and Matplotlib.

Aside from classification, regression and clustering algorithms, it offers tons of utilities useful in the world of machine learning. Want to do something ML-related? Check first whether it is already implemented in sklearn.
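
As a hedged illustration of the common fit/predict pattern, the sketch below clusters two synthetic, well-separated blobs with `KMeans`. The data and variable names are made up, and the parameter values are arbitrary choices for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic 2D blobs, far apart so the clustering is unambiguous.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))
points = np.vstack([blob_a, blob_b])

# Estimators are configured at construction, then trained with fit().
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(points)

# predict() assigns each sample to one of the learned clusters.
labels = model.predict(points)
print(labels)
```

Most scikit-learn estimators follow the same construct, fit, predict sequence, so switching algorithms usually means changing only the constructor line.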

Matplotlib

Matplotlib is for making plots. Once more, it is built on NumPy.
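
A minimal plotting sketch follows. It assumes a non-interactive setting, so it selects the `Agg` backend and writes the figure to an in-memory PNG instead of opening a window. The curve and labels are arbitrary.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
from matplotlib import pyplot as plt
import numpy as np

t = np.linspace(0.0, 1.0, 200)

fig, ax = plt.subplots()
ax.plot(t, t ** 2, color="green", label="t squared")
ax.set_xlabel("t")
ax.set_ylabel("value")
ax.legend()

# Render the figure as PNG bytes; use fig.savefig("plot.png") to write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```

In an interactive session you would typically skip the backend line and call `plt.show()` instead of saving.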

SciPy

SciPy is the scientific library of Python. It builds on NumPy and adds what NumPy does not already provide:

integration;
optimization;
signal processing;
statistics;
etc.
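
A short sketch of three of the areas listed above (integration, optimization and statistics), using toy problems whose answers are known in advance. The problems themselves are chosen only for illustration.

```python
import numpy as np
from scipy import integrate, optimize, stats

# Integration: the integral of sin over [0, pi] is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimize (x - 3)^2, whose minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# Statistics: cumulative distribution function of the standard normal at 0.
half = stats.norm.cdf(0.0)

print(area, result.x, half)
```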

Pandas

Pandas is a library to analyze data. When doing machine learning, it is most useful for loading data and computing summary statistics about it.
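
The sketch below shows the kind of summary statistics meant here, on a tiny made-up table. In practice the data would usually be loaded from a file, for example with `pd.read_csv`.

```python
import pandas as pd

# A tiny, invented dataset with two numeric columns and one categorical column.
df = pd.DataFrame({
    "height_cm": [170.0, 165.0, 180.0, 175.0],
    "weight_kg": [65.0, 58.0, 82.0, 74.0],
    "label": ["a", "b", "a", "b"],
})

summary = df.describe()                # count, mean, std, min, quartiles, max
mean_height = df["height_cm"].mean()   # 172.5
mean_by_label = df.groupby("label")["weight_kg"].mean()   # a: 73.5, b: 66.0

print(summary)
print(mean_height)
print(mean_by_label)
```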

Other stuff

Note that there are other important packages for doing ML (and/or big data analysis):

TensorFlow: a package from Google to do numerical computations.
Keras: a package for deep learning which runs on top of TensorFlow.
OpenCV / scikit-image: libraries for image processing.
PySpark: a Python API for Spark.
etc.

Exercises

Notes:

There are two incompatible major versions of Python: 2.7 and 3.x. Since 2.7 is no longer maintained (it reached end of life in January 2020), you are encouraged to use the latest 3.x release.
You can use Python to run scripts (python my_script.py) or in interactive mode. If you have installed Python with Anaconda, you can use ipython for a better interactive experience (use x? or help(x) to get information about x with IPython).
A good tutorial for the following exercises is http://cs231n.github.io/python-numpy-tutorial/#python. But you can use any resources you would like.

Pure Python exercises

Make a hello world program.
Make an even(n) function which returns whether the natural number given as input is even or not.
Make a single-variable linear function class LinearMap. The constructor takes the slope m and an optional intercept p. When an instance is called (see the __call__ method) on a real x, return x*m + p.
Using the format method of the string class, make a function which takes as input a natural number and returns a string which is the concatenation of the string "iML_" and the given integer.
Using a for loop, the range and zip functions, print the pairs ("i",0), ("M",1) and ("L", 2).
Using a list comprehension, range, your even function and your LinearMap class, generate the list of all the 3n+7 for even n less than or equal to 100.
Do the same by replacing the even function with the step argument of the range function.
Make a dictionary whose keys are the first 20 integers greater than 5 and whose associated values are their squares.
Print the sum of the key-value pairs of the previous dictionary (see the items dictionary method).
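
The exercises above rely on a handful of language features. The sketch below shows each of them on unrelated toy examples, so it is a reminder of the syntax rather than a solution. The class `Power` and all variable names are hypothetical.

```python
class Power:
    """Callable object that raises its argument to a fixed exponent."""

    def __init__(self, exponent=2):
        # Optional constructor argument with a default value.
        self.exponent = exponent

    def __call__(self, x):
        # Makes instances usable like functions: cube(2) calls this method.
        return x ** self.exponent


cube = Power(3)
print(cube(2))  # 8

# str.format inserts values into a template string.
greeting = "Hello, {}! You are visitor number {}.".format("Ada", 7)

# zip pairs up elements; range(start, stop, step) generates integers.
pairs = list(zip("abc", range(0, 30, 10)))   # [('a', 0), ('b', 10), ('c', 20)]

# List comprehension with a filter.
odd_cubes = [cube(n) for n in range(6) if n % 2 == 1]   # [1, 27, 125]

# Dictionary comprehension, then iteration over key-value pairs with items().
lengths = {word: len(word) for word in ["numpy", "scipy", "pandas"]}
for word, length in lengths.items():
    print(word, length)
```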

Guess then try

Can you guess what the following two blocks will print?

        for x in "foo":
            print(x)

        for x in "foo",:
            print(x)

Want to go further? (ignore the C-style braces).

NumPy exercises

Import numpy by using

import numpy as np

For reference, the following array will be called A:

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

Basics

Create a 3x3 identity array I
1. manually,
2. by modifying in place a 3x3 zero array,
3. by using the np.eye function.
Use the np.arange and np.reshape functions to create the 3x3 array A.
Recreate array A manually.
Display the number of dimensions, the shape and the data type of A.
Add the I and A arrays from the previous exercises.
Multiply A by I.
Take the elementwise exponential of I.
Add the squares of the sine and the cosine of A elementwise.
Compute the minimum, average and maximum value of A.
Generate a 100x3 Gaussian matrix with mean 10 and std 2. Compute the mean and std across the first dimension.
Create a Python function to compute the dot product of two arrays. Measure the time it takes for 1D arrays of size 1000. Guess how much faster it can be done with the np.dot operation. Measure it.
- Either use the %timeit magic command in IPython, or do it manually with the time.time() function.
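
For the manual route, a possible timing pattern is sketched below on an unrelated toy function. Note two substitutions relative to the hint above: it uses `time.perf_counter()` rather than `time.time()`, because the former is a clock intended for measuring short durations, and it uses the standard `timeit` module as a script-friendly counterpart to the `%timeit` magic. The function `sum_of_squares` is invented for this example.

```python
import time
import timeit


def sum_of_squares(n):
    """Pure Python loop used as the code under test."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Manual timing of a single run.
start = time.perf_counter()
sum_of_squares(100_000)
elapsed = time.perf_counter() - start
print("one run: {:.6f} s".format(elapsed))

# timeit repeats the call to average out noise (here 20 calls in total).
total_time = timeit.timeit(lambda: sum_of_squares(100_000), number=20)
print("average over 20 runs: {:.6f} s".format(total_time / 20))
```

A single measurement can vary a lot from run to run, which is why averaging over several calls gives a more trustworthy figure.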

Indexing and slicing

Indexing refers to accessing a given element in the array. Slicing is the operation of taking a subarray.
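
To illustrate the difference on something other than A, the sketch below applies the main access patterns to a small 1D array `v` and a 3x2 array `T`, both made up for this example.

```python
import numpy as np

v = np.array([10, 11, 12, 13, 14, 15])

third = v[2]            # indexing: a single element -> 12
last = v[-1]            # negative indices count from the end -> 15
middle = v[1:4]         # slicing: start included, stop excluded -> [11, 12, 13]
every_other = v[::2]    # slicing with a step -> [10, 12, 14]
picked = v[[0, 0, 5]]   # indexing with a list of integers -> [10, 10, 15]
large = v[v > 13]       # indexing with a boolean mask -> [14, 15]

T = np.array([[1, 2],
              [3, 4],
              [5, 6]])

corner = T[2, 1]        # one index per dimension -> 6
first_row = T[0, :]     # an index and a slice can be mixed -> [1, 2]
```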

Extract the second column of A.
From array A, use indexing to create the following array (i.e. remove the first row)
```
array([[3, 4, 5],
       [6, 7, 8]])
```
From array A, use indexing to create the following array (i.e. reverse the second dimension)
```
array([[2, 1, 0],
       [5, 4, 3],
       [8, 7, 6]])
```
From array A, use fancy indexing (indexing with an array of integers) to create the following array
```
array([[3, 4, 5],
       [0, 1, 2],
       [6, 7, 8],
       [3, 4, 5]])
```
Guess then try: what is the output of A[(A < 2) | (A > 4)]?
Guess then try: is the array A modified when doing
```
B = A[:2, 1]
B += 100
```

Broadcasting

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.

NumPy reference
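
As a concrete sketch of the rule quoted above: shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or when one of them is 1. The arrays below are invented for illustration and are deliberately not the ones used in the exercises.

```python
import numpy as np

grid = np.zeros((2, 3))
row = np.array([1.0, 2.0, 3.0])        # shape (3,)
column = np.array([[10.0], [20.0]])    # shape (2, 1)

# (2, 3) + (3,): the row is repeated along the first axis.
plus_row = grid + row         # [[1, 2, 3], [1, 2, 3]]

# (2, 3) + (2, 1): the column is repeated along the second axis.
plus_column = grid + column   # [[10, 10, 10], [20, 20, 20]]

# (2, 1) + (3,): both operands are stretched to shape (2, 3).
outer_sum = column + row      # [[11, 12, 13], [21, 22, 23]]

print(plus_row, plus_column, outer_sum, sep="\n")
```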

Guess then try: what are the outputs of

A + 1

A + np.array([1, 10])

A + np.array([1, 10, 100])

Standardize a 2D array (subtract the mean and divide by the standard deviation).
Create a 10x10 array with 1 on the border and 0 inside.

Supplementary

Generate a pseudo image array with the following command: np.random.uniform(-5, 270, size=(48, 48, 3)). Limit the range of values to [0, 255] and recast it as an integer array.
Given a 1D array, negate all elements which are between 3 and 8, in place and in a single line.
Consider a random 10x2 matrix representing Cartesian coordinates and convert them to polar coordinates.
Find the tuple of indices of the minimum value of a random 50x50x50 array.

Scikit-learn exercises

Run the following code

from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
digits = load_digits()
X, y = digits.data, digits.target

Using numpy, compute the class proportions.
Divide the dataset into train/test sets. Use 70% of the samples as training set.
Instantiate a decision tree classifier.
Using the fit method, fit the decision tree on the training set.
Using the predict method, classify the test set.
Using the accuracy_score function, compute the accuracy.

Matplotlib exercises

If you are using IPython, you might need to use the magic command %matplotlib. Import Matplotlib's plotting interface via

from matplotlib import pyplot as plt

Do the following on the same figure:

Using np.linspace, generate a vector x of a thousand linearly spaced real numbers from 0 to 2*np.pi.
Compute y, the elementwise cosine of x.
Generate a hundred points in the [0, 2pi] x [-1, 1] rectangle.
Plot the cosine function in black.
Scatter-plot the points. Those above the cosine should be in red, those under in blue.