A crash course about the Pythonsphere

Overview:

The Pythonsphere

Python, together with its scientific packages (NumPy, Matplotlib, SciPy, Scikit-learn, etc.), is considered one of the best scientific programming languages (the best, many would argue). It is heavily used in the ML community, and it is free and portable.

To install Python and all the required packages, you can use the Anaconda installer.

Python

The Python programming language is a general-purpose, dynamically typed, interpreted language. As such, it is easy to learn and can be used to build complex systems quickly.

On the other hand, and by its very nature, pure Python is slow. Fortunately, it interfaces easily with C. As a consequence, heavy number-crunching is usually delegated to such a compiled language, which gives both ease of use and good performance.
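As a rough illustration of this gap, the sketch below times an explicit Python loop against the built-in sum, which is implemented in C in CPython (the reference interpreter, assumed here). The helper name python_sum and the list size are arbitrary choices for the example, and the timings will vary from machine to machine.

```python
import timeit

def python_sum(values):
    """Add the numbers one by one in an explicit, interpreted Python loop."""
    total = 0
    for v in values:
        total += v
    return total

values = list(range(100_000))

# Both approaches must agree on the result.
assert python_sum(values) == sum(values)

# Time 20 runs of each; in CPython the built-in sum runs in compiled C code.
loop_time = timeit.timeit(lambda: python_sum(values), number=20)
builtin_time = timeit.timeit(lambda: sum(values), number=20)
print(f"explicit loop: {loop_time:.4f} s, built-in sum: {builtin_time:.4f} s")
```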

NumPy

NumPy is the fundamental package for scientific computing. It includes

  • a powerful N-dimensional array object;
  • sophisticated (broadcasting) functions;
  • tools for integrating C/C++ and Fortran code;
  • useful linear algebra, Fourier transform, and random number capabilities.
Together with SciPy and Matplotlib, it can serve as a replacement for MATLAB.
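A minimal sketch of some of the capabilities listed above. The matrix, the vector and the seed are arbitrary values chosen for illustration, and np.random.default_rng assumes a reasonably recent NumPy (1.17 or later).

```python
import numpy as np

# An N-dimensional array: here a 2x2 matrix and a vector of length 2.
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# An elementwise operation applied to the whole array, without a Python loop.
squares = M ** 2

# Linear algebra: solve the system M @ x = b.
x = np.linalg.solve(M, b)

# Fourier transform of a short constant signal.
spectrum = np.fft.fft(np.ones(4))

# Random numbers from a seeded generator, for reproducibility.
rng = np.random.default_rng(0)
samples = rng.normal(size=5)

print(squares)
print(x)
print(spectrum)
print(samples.shape)
```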

Scikit-learn

Scikit-learn (sklearn for short) is the most widely used multipurpose machine learning library in Python. It is built on top of NumPy, SciPy and Matplotlib.

Aside from classification, regression and clustering algorithms, it offers many utilities that are useful in machine learning. Want to do something ML-related? Check whether it is already implemented in sklearn.

Matplotlib

Matplotlib is for making plots. Once more, it is built on NumPy.

SciPy

SciPy is the scientific library of Python. It builds on NumPy to introduce what it does not already provide:

  • integration;
  • optimization;
  • signal processing;
  • statistics;
  • etc.

Pandas

Pandas is a library to analyze data. When doing machine learning, it is most useful for loading data and computing summary statistics about it.

Other stuff

Note that there are other important packages for doing ML (and/or big-data analysis):
  • TensorFlow: a package from Google for numerical computation.
  • Keras: a package for deep learning which runs on top of TensorFlow.
  • OpenCV / scikit-image: libraries for image processing.
  • PySpark: a Python API for Spark.
  • etc.

Exercises

Notes:

Pure Python exercises

  1. Make a hello world program.
  2. Make an even(n) function which returns whether the natural number given as input is even or not.
  3. Make a single-variable linear function class LinearMap. The constructor takes the slope m and an optional intercept p. When an instance is called (see the __call__ method) on a real x, it returns x*m + p.
  4. Using the format method of the string class, make a function which takes as input a natural number and returns a string which is the concatenation of the string "iML_" and the given integer.
  5. Using a for loop, the range and zip functions, print the pairs ("i", 0), ("M", 1) and ("L", 2).
  6. Using list comprehension, range, your even function and your LinearMap class, generate the list of all the 3n+7 for even n less than or equal to 100.
  7. Do the same by replacing the even function with the step argument of the range function.
  8. Make a dictionary whose keys are the first 20 integers greater than 5 and whose associated values are their squares.
  9. Print the sum of the key-value pairs of the previous dictionary (see the items dictionary method).
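As a reminder of the building blocks mentioned in these exercises, here is a short sketch on unrelated data. The class Power and all the values are hypothetical, chosen only for illustration; the exercises themselves are left to you.

```python
class Power:
    """Hypothetical example class: raises its input to a fixed exponent."""

    def __init__(self, exponent=2):
        # exponent is optional thanks to its default value
        self.exponent = exponent

    def __call__(self, x):
        # Defining __call__ makes instances usable like functions.
        return x ** self.exponent

cube = Power(3)
print(cube(2))  # 8

# str.format substitutes its arguments into the {} placeholders.
label = "{}-{}".format("run", 7)
print(label)  # run-7

# zip pairs up elements; range(start, stop, step) advances by step.
pairs = list(zip("ab", range(10, 30, 10)))
print(pairs)  # [('a', 10), ('b', 20)]

# A dict comprehension, then iteration over (key, value) pairs with items().
cubes = {n: cube(n) for n in range(1, 4)}
for key, value in cubes.items():
    print(key, value)
```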

Guess then try

Can you guess what the following two blocks will print?
        for x in "foo":
            print(x)
        for x in "foo",:
            print(x)

Want to go further? (ignore the C-style braces).

NumPy exercises

Import numpy by using
import numpy as np

For reference, the following array will be called A:

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

Basics

  1. Create a 3x3 identity array I
    1. manually,
    2. by modifying in place a 3x3 zero array,
    3. by using the np.eye function.
  2. Use the np.arange and np.reshape functions to create the 3x3 array A.
  3. Recreate array A manually.
  4. Display the number of dimensions, the shape and the data type of A.
  5. Add the I and A arrays from the previous exercises.
  6. Multiply A by I.
  7. Take the elementwise exponential of I.
  8. Add the square of the sine and cosine of A elementwise.
  9. Compute the minimum, average and maximum value of A.
  10. Generate a 100x3 Gaussian matrix with mean 10 and standard deviation 2. Compute the mean and standard deviation along the first dimension.
  11. Create a Python function to compute the dot product of two arrays. Measure the time it takes for 1D arrays of size 1000. Guess how much faster it can be done with NumPy's dot operation. Measure it.
    • Either use the %timeit magic command in IPython, or do it manually with the time.time() function.
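Manual timing with time.time() could look like the sketch below. It is shown on a different operation (elementwise absolute value) so as not to give the exercise away; time_call and python_abs are hypothetical helper names, and the number of repeats is arbitrary.

```python
import time
import numpy as np

def time_call(func, *args, repeats=100):
    """Hypothetical helper: average wall-clock seconds per call of func(*args)."""
    start = time.time()
    for _ in range(repeats):
        func(*args)
    return (time.time() - start) / repeats

def python_abs(values):
    """Absolute value of each element, computed in a pure-Python loop."""
    return [abs(v) for v in values]

data = np.linspace(-1.0, 1.0, 1000)

loop_seconds = time_call(python_abs, data)
numpy_seconds = time_call(np.abs, data)
print(f"Python loop: {loop_seconds:.2e} s, np.abs: {numpy_seconds:.2e} s")
```

Averaging over many repeats smooths out the limited resolution of time.time(); time.perf_counter() is a higher-resolution alternative from the same module.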

Indexing and slicing

Indexing refers to accessing a given element in the array. Slicing is the operation of taking a subarray.
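A brief sketch of both operations, on an array C that is unrelated to the exercises (the name and the values are chosen only for illustration):

```python
import numpy as np

C = np.arange(10, 22).reshape(3, 4)
# C is
# [[10 11 12 13]
#  [14 15 16 17]
#  [18 19 20 21]]

# Indexing: one index per dimension selects a single element.
element = C[1, 2]          # row 1, column 2

# Slicing: start:stop:step along each dimension selects a subarray.
first_row = C[0]           # the whole first row
last_column = C[:, -1]     # every row, last column
block = C[1:, :2]          # rows 1 and 2, columns 0 and 1
every_other_row = C[::2]   # rows 0 and 2

print(element)
print(first_row)
print(last_column)
print(block)
print(every_other_row)
```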
  1. Extract the second column of A.
  2. From array A, use slicing to create the following array (i.e. remove the first row)
    array([[3, 4, 5],
           [6, 7, 8]])
  3. From array A, use slicing to create the following array (i.e. reverse the second dimension)
    array([[2, 1, 0],
           [5, 4, 3],
           [8, 7, 6]])
  4. From array A, use fancy indexing (indexing with an array of integers) to create the following array
    array([[3, 4, 5],
           [0, 1, 2],
           [6, 7, 8],
           [3, 4, 5]])
  5. Guess then try: what is the output of A[(A < 2) | (A > 4)]?
  6. Guess then try: is the array A modified when doing
    B = A[:2, 1]
    B += 100
           

Broadcasting

The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.
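A small sketch of the idea, using arrays that are unrelated to A (the names M, row and col and their values are chosen only for illustration):

```python
import numpy as np

M = np.zeros((2, 3))                # shape (2, 3)
row = np.array([1.0, 2.0, 3.0])     # shape (3,)
col = np.array([[10.0], [20.0]])    # shape (2, 1)

# row is stretched along the first dimension: it is added to each row of M.
m_plus_row = M + row

# col is stretched along the second dimension: it is added to each column of M.
m_plus_col = M + col

# Two small arrays can also broadcast against each other: (3,) and (2, 1) give (2, 3).
outer_sum = row + col

print(m_plus_row)
print(m_plus_col)
print(outer_sum)
print(np.broadcast(row, col).shape)
```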

Guess then try: what are the outputs of

A + 1
A + np.array([1, 10])
A + np.array([1, 10, 100])
  1. Standardize a 2D array (subtract the mean and divide by the standard deviation).
  2. Create a 10x10 array with 1 on the border and 0 inside.

Supplementary

  1. Generate a pseudo image array with the following command: np.random.uniform(-5, 270, size=(48, 48, 3)). Limit the range of values to [0, 255] and recast it as an integer array.
  2. Given a 1D array, negate all elements which are between 3 and 8, in place and in a single line.
  3. Consider a random 10x2 matrix representing Cartesian coordinates. Convert them to polar coordinates.
  4. Find the tuple of indices of the minimum value of a random 50x50x50 array.

Scikit-learn exercises

Run the following code
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
digits = load_digits()
X, y = digits.data, digits.target
  1. Using numpy, compute the class proportions.
  2. Divide the dataset into train/test sets. Use 70% of the samples as training set.
  3. Instantiate a decision tree classifier.
  4. Using the fit method, fit the decision tree on the training set.
  5. Using the predict method, classify the test set.
  6. Using the accuracy_score function, compute the accuracy.

Matplotlib

If you are using IPython, you might need to use the magic command %matplotlib. Import it via

from matplotlib import pyplot as plt

Do the following on the same figure:

  • Using np.linspace, generate a vector x of a thousand linearly spaced real numbers from 0 to 2*np.pi.
  • Compute y, the elementwise cosine of x.
  • Generate a hundred random points in the [0, 2pi] x [-1, 1] rectangle.
  • Plot the cosine function in black.
  • Scatter-plot the points. Those above the cosine should be in red, those under in blue.
Last modified on October 18 2017 10:44