ELEN0062 - Introduction to machine learning (iML)

Random ML quote

It’s easy to lie with statistics. It’s hard to tell the truth without statistics.

Andrejs Dunkels

Information

Course study program
Theoretical course's webpage
The reference book's webpage (free download)

Schedule

Installation	19 Sep. 2018	Python, NumPy, SciPy, Scikit-learn installation with Anaconda
TD	26 Sep. 2018	Python presentation: a crash course about the Pythonsphere. If you have remaining questions regarding the exercises, you can email me.
Assignment	03 Oct. 2018	First assignment: Statement, Code, Scikit-learn exercise, Cheat sheet. On how to present results: A few thoughts, A more thorough tour
Q&A	10 Oct. 2018	Question/answer session regarding the first assignment (come freely to my office).
Q&A	17 Oct. 2018	Question/answer session regarding the first assignment (come freely to my office).
Assignment Q&A	24 Oct. 2018	Second assignment (Antonio Sutera is the reference TA for this assignment)
Deadline	31 Oct. 2018	Don't forget to submit your first assignment
Q&A	07 Nov. 2018	Question/answer session regarding the second assignment.
Q&A	14 Nov. 2018	Question/answer session regarding the second assignment.
Feedback Project	21 Nov. 2018	Feedback on the first assignment; Assignment 3 (challenge)
Deadline	25 Nov. 2018	Don't forget to submit your second assignment.
Deadline	27 Nov. 2018	[Setup] Find a group, register for the third assignment, register on Kaggle, download the data, make the toy submission.
Deadline	15 Dec. 2018	End of challenge
Deadline	17 Dec. 2018	Don't forget to submit your report regarding the challenge.
Deadline	TBA	Presentations

Assignments

First assignment

Installing Anaconda

There are many ways to install Python on a computer and get all the libraries needed. One quick way is to install Anaconda, which comes with all the libraries we will need.

Get the Anaconda installer for your operating system. Make sure you install a Python 3.5+ version.
Open a Python console:

From a Unix command line: python
Or open the Spyder IDE, which comes with Anaconda

ipython

Run the following commands:

  
import numpy as np
import pandas as pd
import sklearn
import scipy

print(np.__version__)
print(pd.__version__)
print(sklearn.__version__)
print(scipy.__version__)

If there is no error, the installation went fine.
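
The Python 3.5+ requirement mentioned above can be checked from the same console. Here is a minimal sketch using only the standard library; the helper name `python_is_recent_enough` is ours and is not part of the course material.

```python
import sys


def python_is_recent_enough(minimum=(3, 5)):
    """Return True if the running interpreter is at least version `minimum`."""
    return sys.version_info[:2] >= minimum


# Full version string of the interpreter that is actually running.
print(sys.version)
# Should print True for the Python 3.5+ interpreter requested above.
print(python_is_recent_enough())
```

If the second line prints False, the console is probably using another Python installation than the one shipped with Anaconda.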

Third assignment: the challenge

The third project is organized in the form of a challenge, where you will compete against each other. This year, the challenge is about movie rating: you must predict the rating that a given user would give to a given movie. All the relevant information can be found on the Kaggle platform, which hosts the challenge.

The project is divided into four parts. All the deadlines can be found in the schedule section above.

Setup for the project
- Create an account on the Kaggle platform. Use your real name so that we can identify you.
- Use the link given in the mail to enter the challenge. If you did not receive the email, contact us.
- Form groups of three and register them on the submission platform.
- Test the toy example.
Propose the best model you can before the competition deadline.
Submit an archive on the submission platform in tar.gz format, containing a report that describes the different steps of your approach and your main results along with your source code. Use the same ids as for the Kaggle platform. The report must contain the following information:
- A detailed description of all the approaches that you have used to win the challenge, including the feature engineering you performed.
- A detailed description of your hyper-parameter optimization approach and your model validation technique.
- A table summarizing the performance of your different approaches, containing for each approach at least its name, the validation score, and the scores on the public and private leaderboards.
- Any complementary information or figures that you want to mention.
Briefly present your approach to the rest of the class. (More information coming soon)
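
For the hyper-parameter optimization and model validation items of the report, one possible workflow (among many) is a cross-validated grid search with a separate hold-out set. The sketch below rests on assumptions: the data is synthetic, and the `Ridge` model, the `alpha` grid and the mean squared error metric are placeholders. The real data and evaluation metric are those given on Kaggle.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): the real features come from the Kaggle data.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Keep a hold-out set that is never used during the hyper-parameter search.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 5-fold cross-validation over a small grid of regularization strengths.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]
# scikit-learn maximizes scores, so the MSE is stored with a negative sign.
validation_mse = -search.best_score_
# The refitted best model is evaluated once on the untouched hold-out set.
test_mse = np.mean((search.predict(X_test) - y_test) ** 2)

print(best_alpha, validation_mse, test_mse)
```

The validation score is the kind of number that goes in the summary table next to the leaderboard scores. A large gap between the two usually hints at overfitting the validation procedure.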

Have fun!

Cheat sheet for ML in Python

Check out DataCamp for more.

Supplementary material

Here is a very short list of supplementary material related to the field of machine learning. I tend to update this section when I come across interesting material, but if you feel you need more on some topic, do not hesitate to ask!

Machine learning in general

There is a lot of accessible online material on machine learning:

Andrew Ng's online course (Stanford): The most popular online course on ML. Archived from Coursera.
Pedro Domingos' online course (Washington).
Reza Shadmehr (Baltimore) and his slides.
Jeffrey Ullman's course on mining massive datasets (Stanford), based on his reference book. Not everything is related to the course though.

Linear regression

The geometry of Least Squares (1 variable)

Note that ANOVA is a special case of linear models where the input variables are dummy one-hot class variables. Consequently, the basis vectors of the column space are orthogonal and the problem reduces to many independent one-variable least-squares problems.
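
The remark above can be checked numerically. The sketch below uses made-up toy data (three groups of two observations, with no intercept column, which is an assumption of this example). It shows that the one-hot columns are orthogonal, and that the full least-squares solution matches both the one-variable projections and the group means.

```python
import numpy as np

# Toy data (made up for illustration): 3 groups, 6 observations.
groups = np.array([0, 0, 1, 1, 2, 2])
y = np.array([1.0, 3.0, 10.0, 12.0, 5.0, 7.0])

# One-hot design matrix without intercept: column k is 1 for members of group k.
X = np.eye(3)[groups]

# The columns are orthogonal, so X^T X is diagonal (it holds the group sizes).
gram = X.T @ X

# Full least-squares solution over all three columns at once.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same coefficients from three independent one-variable projections
# <x_k, y> / <x_k, x_k>.
beta_1d = np.array([X[:, k] @ y / (X[:, k] @ X[:, k]) for k in range(3)])

# Both coincide with the per-group means.
group_means = np.array([y[groups == k].mean() for k in range(3)])

print(gram)
print(beta, beta_1d, group_means)
```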

Artificial neural networks

There have been three waves of hype around artificial neural networks (ANNs). The first was about perceptrons in the 60s, until it was shown that a single perceptron cannot solve the XOR problem. The second started with the discovery of backpropagation, but it soon became clear that large and/or deep neural nets were very hard to train. We are in the midst of the third one right now with "deep learning": neural nets with several (many) hidden layers. As a consequence, the internet is bursting with resources on the topic, from the simplest models (multi-layer perceptrons) to the most advanced architectures (such as GANs), by way of more classical ones (such as ConvNets and LSTMs).
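
The XOR limitation can be illustrated in a few lines. In the sketch below, a coarse grid search over the weights of a single threshold unit never gets more than 3 of the 4 XOR points right, while a two-layer network with hand-picked weights (an OR unit and an AND unit feeding an output unit) gets all four. The grid, the weights and the function names are our own choices for illustration. The grid search is a demonstration, not a proof.

```python
import itertools

import numpy as np

# The four XOR input points and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])


def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)


def single_unit(X, w1, w2, b):
    """A single perceptron: one linear threshold unit."""
    return step(w1 * X[:, 0] + w2 * X[:, 1] + b)


# Best number of correctly classified points over a coarse grid of weights.
grid = np.linspace(-2, 2, 17)
best_single = max(
    int((single_unit(X, w1, w2, b) == y_xor).sum())
    for w1, w2, b in itertools.product(grid, repeat=3)
)


def two_layer(X):
    """Hand-built network: XOR(x1, x2) = OR(x1, x2) AND NOT AND(x1, x2)."""
    h_or = step(X[:, 0] + X[:, 1] - 0.5)
    h_and = step(X[:, 0] + X[:, 1] - 1.5)
    return step(h_or - h_and - 0.5)


print(best_single)   # 3: no single unit on the grid gets all four points right
print(two_layer(X))  # [0 1 1 0]
```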

Misc.

There are many YouTube channels about ML. Here are a few:

sentdex: A bit of everything
Derek Kane: A bit of everything
Welch Labs: A few videos about Neural Nets
Two Minute Papers: Many of the papers covered relate to (applications of) ML
Siraj Raval (this guy is crazy)
Introductory online course on ML (covers linear/logistic regression, decision trees/random forests, basics of neural networks and clustering).

Pre-requisites

Machine learning requires a solid background in maths, especially in linear algebra, (advanced) probability theory and (multivariable) calculus. There are even more resources on those than on deep learning. Here is a short selection, which emphasizes intuition.

Linear algebra

3Blue1Brown series on linear algebra
If you prefer paper (or PDF): Practical Linear Algebra: A Geometry Toolbox, 2nd Edition, by Gerald Farin and Dianne Hansford. A K Peters/CRC Press (2004)

Calculus

3Blue1Brown series on calculus. Sadly, it does not go on to multivariable calculus.
Khan Academy series
If you prefer paper (or PDF): Calculus: Concepts and Contexts, 4th Edition, by James Stewart. (Also available in French)