A crash course about the Pythonsphere
Overview:
The Pythonsphere
Python, together with its scientific packages (NumPy, Matplotlib, SciPy, Scikit-learn, etc.), is considered one of the best scientific programming languages (the best, many would vigorously argue). It is heavily used in the ML community, and it is free and portable.
To install Python and all the required packages, you can use the Anaconda installer.
Python
The Python programming language is a general-purpose, dynamically typed, interpreted language. As such, it is easy to learn and can be used to build complex systems quickly.
On the other hand, and because of its very nature, pure Python is slow. Fortunately, it interfaces easily with C. As a consequence, heavy number crunching is usually delegated to such a compiled language, which gives both ease of use and good performance.
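As a rough illustration of this division of labour, the sketch below computes the same sum of squares twice: once with an interpreted Python loop, and once with a NumPy call whose inner loop runs in compiled code. The function names and the input size are illustrative assumptions, and NumPy itself is only introduced in the next section.

```python
import numpy as np


def sum_of_squares_python(values):
    """Sum of squares with an explicit, interpreted Python loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_numpy(values):
    """Same computation; the loop runs inside NumPy's compiled code."""
    arr = np.asarray(values, dtype=float)
    return float(np.sum(arr * arr))


data = [0.5 * i for i in range(1000)]
print(sum_of_squares_python(data))
print(sum_of_squares_numpy(data))
```

Both calls return the same value; only the place where the loop executes differs.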
NumPy
NumPy is the fundamental package for scientific computing. It includes
- a powerful N-dimensional array object;
- sophisticated (broadcasting) functions;
- tools for integrating C/C++ and Fortran code;
- useful linear algebra, Fourier transform, and random number capabilities.
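A minimal sketch touching on the items listed above (the array object, broadcasting, linear algebra, Fourier transforms and random numbers) could look as follows. All array contents and variable names are arbitrary choices made for illustration.

```python
import numpy as np

# N-dimensional array object: a 2x3 array of floats.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(a.ndim, a.shape, a.dtype)

# Broadcasting: the 1D row is added to each row of `a`.
row = np.array([10.0, 20.0, 30.0])
shifted = a + row

# Linear algebra: solve the 2x2 system M @ x = b.
M = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(M, b)

# Fourier transform: a constant signal has all its energy in bin 0.
spectrum = np.fft.fft(np.ones(4))

# Random numbers: a seeded generator makes the draw reproducible.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=5)
print(shifted, x, spectrum, samples, sep="\n")
```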
Scikit-learn
Scikit-learn (sklearn for short) is the most widely used multipurpose machine learning library in Python. It is built on top of NumPy, SciPy and Matplotlib.
Aside from classification, regression and clustering algorithms, it offers tons of utilities useful in the world of machine learning. Want to do something ML-related? Check first whether it is already implemented in sklearn.
Matplotlib
Matplotlib is for making plots. Once more, it is built on NumPy.
SciPy
SciPy is the scientific library of Python. It builds on NumPy and adds what NumPy does not already provide:
- integration;
- optimization;
- signal processing;
- statistics;
- etc.
Pandas
Pandas is a library for data analysis. When doing machine learning, it is most useful for loading data and computing summary statistics about it.
Other stuff
Note that there are other important packages for doing ML (and/or big data analysis):
- TensorFlow: a package from Google for numerical computation.
- Keras: a deep learning package which runs on top of TensorFlow.
- OpenCV / Scikit-image: libraries for image processing.
- PySpark: a Python API for Spark.
- etc.
Exercises
Notes:
- There are two incompatible versions of Python in use: 2.7 and 3.5+. Since 2.7 is due to disappear in the near future, you are encouraged to use the latest one.
- You can use Python to run scripts (python my_script.py) or in interactive mode. If you have installed Python with Anaconda, you can use ipython for a better interactive experience (use x? or help(x) to get information about x with IPython).
- A good tutorial for the following exercises is http://cs231n.github.io/python-numpy-tutorial/#python. But you can use any resources you would like.
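To make the incompatibility between the two versions mentioned in the notes concrete, here is a minimal sketch of two well-known Python 3 behaviours that differ from Python 2.7. It assumes a Python 3 interpreter.

```python
import sys

# In Python 3, print is a function (it was a statement in Python 2.7).
print("running Python", sys.version_info.major)

# In Python 3, / is true division even between integers, and // is
# floor division (Python 2.7 floored the result of int / int).
true_div = 7 / 2
floor_div = 7 // 2
print(true_div, floor_div)
```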
Pure Python exercises
- Make a hello world program.
- Make an even(n) function which returns whether the natural number given as input is even or not.
- Make a single-variable linear function class LinearMap. The constructor takes the slope m and an optional intercept p. When calling an instance of the class (see the __call__ method) on a real x, return x*m + p.
- Using the format method of the string class, make a function which takes a natural number as input and returns the string formed by concatenating the string "iML_" and the given integer.
- Using a for loop and the range and zip functions, print the pairs ("i", 0), ("M", 1) and ("L", 2).
- Using a list comprehension, range, your even function and your LinearMap class, generate the list of all the 3n+7 for even n less than or equal to 100.
- Do the same by replacing the even function with the step argument of the range function.
- Make a dictionary whose keys are the first 20 integers greater than 5 and whose associated values are their squares.
- Print the sum of the key-value pairs of the previous dictionary (see the items dictionary method).
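The exercises above rely on a handful of language features. The sketch below illustrates them on unrelated toy examples, so it is a syntax reminder rather than a solution. The Greeter class and all variable names are hypothetical.

```python
class Greeter:
    """Callable object: an instance can be called like a function."""

    def __init__(self, greeting, punctuation="!"):  # optional argument
        self.greeting = greeting
        self.punctuation = punctuation

    def __call__(self, name):
        # str.format substitutes its arguments for the {} placeholders.
        return "{} {}{}".format(self.greeting, name, self.punctuation)


hello = Greeter("Hello")
print(hello("world"))

# zip pairs up the elements of several iterables;
# range accepts (start, stop, step).
pairs = list(zip("abc", range(10, 40, 10)))
for letter, number in pairs:
    print(letter, number)

# List comprehension with a filter.
cubes = [k ** 3 for k in range(10) if k % 3 == 0]

# Dictionary comprehension, then iteration over key-value pairs.
lengths = {word: len(word) for word in ("numpy", "scipy", "pandas")}
for word, length in lengths.items():
    print(word, length)
```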
Guess then try
Can you guess what the following two blocks will print?
for x in "foo":
    print(x)

for x in "foo",:
    print(x)
Want to go further? (ignore the C-style braces).
NumPy exercises
Import numpy by using
import numpy as np
For reference, the following array will be called A:
array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
Basics
- Create a 3x3 identity array I
  - manually,
  - by modifying in place a 3x3 zero array,
  - by using the np.eye function.
- Use the np.arange and np.reshape functions to create the 3x3 array A.
- Recreate array A manually.
- Display the number of dimensions, the shape and the data type of A.
- Add the I and A arrays from the previous exercises.
- Multiply A by I.
- Take the elementwise exponential of I.
- Add the square of the sine and the square of the cosine of A elementwise.
- Compute the minimum, average and maximum value of A.
- Generate a 100x3 Gaussian matrix with mean 10 and std 2. Compute the mean and std across the first dimension.
- Create a Python function to compute the dot product of two arrays. Measure the time it takes for 1D arrays of size 1000. Guess how much faster it can be done with the dot operation. Measure it.
  - Either use the %timeit magic command in IPython, or do it manually with the time.time() function.
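For the manual timing route mentioned in the last exercise, a minimal sketch could look as follows. The helper name time_call, the repeat count and the toy workload are assumptions made for illustration; the dot product itself is deliberately left out.

```python
import time


def time_call(func, *args, repeat=100):
    """Return the average wall-clock time (in seconds) of func(*args)."""
    start = time.time()
    for _ in range(repeat):
        func(*args)
    elapsed = time.time() - start
    return elapsed / repeat


def slow_sum(n):
    """Toy workload: sum the first n integers with an explicit loop."""
    total = 0
    for i in range(n):
        total += i
    return total


print("explicit loop:", time_call(slow_sum, 1000))
print("built-in sum: ", time_call(sum, range(1000)))
```

Averaging over several repetitions matters here because a single call on a small input is too short to measure reliably with time.time().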
Indexing and slicing
Indexing refers to accessing a given element of the array. Slicing is the operation of taking a subarray.
- Extract the second column of A.
- From array A, use indexing to create the following array (i.e. remove the first row):
  array([[3, 4, 5], [6, 7, 8]])
- From array A, use indexing to create the following array (i.e. reverse the second dimension):
  array([[2, 1, 0], [5, 4, 3], [8, 7, 6]])
- From array A, use fancy indexing (indexing with an array of integers) to create the following array:
  array([[3, 4, 5], [0, 1, 2], [6, 7, 8], [3, 4, 5]])
- Guess then try: what is the output of A[(A < 2) | (A > 4)]?
- Guess then try: is the array A modified when doing
  B = A[:2, 1]
  B += 100
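To illustrate the difference between indexing and slicing stated at the top of this section, here is a minimal sketch on a separate 2x4 array M. The array is hypothetical and was chosen so as not to give away the exercises on A.

```python
import numpy as np

M = np.array([[10, 11, 12, 13],
              [20, 21, 22, 23]])

# Indexing: one integer per dimension selects a single element.
element = M[1, 2]            # row 1, column 2

# Slicing: start:stop:step ranges select a subarray.
first_row = M[0, :]          # every column of row 0
middle_cols = M[:, 1:3]      # columns 1 and 2 of every row
every_other = M[:, ::2]      # columns 0 and 2 of every row

print(element)
print(first_row.shape, middle_cols.shape, every_other.shape)
```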
Broadcasting
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.
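As a small illustration of the rule described above, the sketch below adds a column of shape (3, 1) to a row of shape (4,). The arrays are arbitrary examples and are unrelated to A.

```python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])        # shape (4,)

# Shapes are compared starting from the last dimension. (3, 1) and (4,)
# are compatible because, in each position, the sizes are equal, one of
# them is 1, or one of them is missing.
table = col + row                   # result has shape (3, 4)
print(table.shape)
print(table)
```

Each entry of the result is the sum of one element of col and one element of row, without either array being explicitly tiled in Python.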
Guess then try: what are the outputs of
A + 1
A + np.array([1, 10])
A + np.array([1, 10, 100])
- Standardize a 2D array (subtract the mean and divide by the standard deviation).
- Create a 10x10 array with 1 on the border and 0 inside.
Supplementary
- Generate a pseudo-image array with the following command:
  np.random.uniform(-5, 270, size=(48, 48, 3))
  Limit the range of values to [0, 255] and recast it as an integer array.
- Given a 1D array, negate all elements which are between 3 and 8, in place and in a single line.
- Consider a random 10x2 matrix representing Cartesian coordinates; convert them to polar coordinates.
- Find the tuple of indices of the minimum value of a random 50x50x50 array.
Scikit-learn exercises
Run the following code
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = load_digits()
X, y = digits.data, digits.target
- Using numpy, compute the class proportions.
- Divide the dataset into train/test sets. Use 70% of the samples as training set.
- Instantiate a decision tree classifier.
- Using the fit method, fit the decision tree on the training set.
- Using the predict method, classify the test set.
- Using the accuracy_score function, compute the accuracy.
Matplotlib
If you are using IPython, you might need to use the magic command %matplotlib. Import Matplotlib via
from matplotlib import pyplot as plt
Do the following on the same figure:
- Using np.linspace, generate a vector x of a thousand linearly spaced real numbers from 0 to 2*np.pi.
- Compute y, the elementwise cosine of x.
- Generate a hundred points in the [0, 2pi] x [-1, 1] rectangle.
- Plot the cosine function in black.
- Scatter-plot the points. Those above the cosine should be in red, those under in blue.