Introduction to Data Science Lectures

Here is the material for a course I will be giving as part of a Master's programme in Data Science and AI.


Lecture Plan

Below is the lecture plan, with links to the corresponding material.


Introduction 🎒

Here we collect the introductory topics for this course. In this lecture, we review some basic concepts of statistics, such as probability distributions and hypothesis testing.
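As a taste of the statistics refresher, a hypothesis test takes only a few lines. The sketch below uses SciPy and synthetic data, both my own assumptions rather than the lecture's actual material:

```python
import numpy as np
from scipy import stats

# Synthetic example (invented data): two samples drawn from normal
# distributions whose true means differ by 0.5.
rng = np.random.default_rng(5)
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=0.5, scale=1.0, size=200)

# Two-sample t-test: the null hypothesis is that the two means are equal.
t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p rejects the null
```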

Then, we illustrate the general workflow of a Machine Learning model, explaining the main categories of automated learning, i.e.:

  • supervised learning
  • unsupervised learning
  • reinforcement learning

Furthermore, we explore the main differences between the predictive tasks of classification and regression.

Then, we show a first example of a predictive model: linear regression. Finally, we also focus on the general lifecycle of a Data Science model.
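That first example can be sketched in a few lines. Here is a minimal version using scikit-learn on synthetic data (library and data are my assumptions, not necessarily what the lecture uses):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (invented for illustration): a noisy line y = 3x + 1.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=50)

# Fit ordinary least squares and read off the recovered coefficients.
model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # close to the true slope 3 and intercept 1
```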

Pandas and data visualisation 🐼

In this section we introduce pandas, one of the most popular tools for data analysis. Furthermore, we focus on visualisation and simple data manipulation to extract insights from data.

In particular, we play with pandas dataframes, filters and matplotlib plots.
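A tiny example of the kind of manipulation we mean, on a dataframe invented for illustration:

```python
import pandas as pd

# A toy dataframe, invented for illustration.
df = pd.DataFrame({
    "city": ["Rome", "Milan", "Turin", "Rome"],
    "temp": [18.5, 12.0, 10.5, 21.0],
})

# Boolean filter: keep only the rows where temp exceeds 15 degrees.
warm = df[df["temp"] > 15]

# Group-by aggregation: mean temperature per city.
means = df.groupby("city")["temp"].mean()
print(means)
```

Calling `df.plot()` on top of such a dataframe hands the data to matplotlib for a quick chart.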

Data pipeline and data treatment 📊

This module is about data pipelines, data preprocessing and techniques to encode non-numerical features. In particular, we build our first data pipeline, guided by Exploratory Data Analysis, extracting features suitable to feed a machine learning model.
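Such a pipeline might look like the following sketch, which one-hot encodes a categorical column and scales a numerical one before fitting a classifier. Both the use of scikit-learn and the toy data are assumptions of mine:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy dataset with one categorical and one numerical feature (invented).
df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "blue", "green"],
    "size": [1.0, 2.5, 0.5, 3.0, 2.0, 2.8],
    "label": [0, 1, 0, 1, 1, 1],
})

# Encode the non-numerical feature, scale the numerical one.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(), ["colour"]),
    ("num", StandardScaler(), ["size"]),
])

# Chain preprocessing and model into a single pipeline object.
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
pipe.fit(df[["colour", "size"]], df["label"])
preds = pipe.predict(df[["colour", "size"]])
```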

Supervised classifier 🦾

In this module we explore, through several examples, how to solve a supervised classification task. We present logistic regression and decision tree models; furthermore, we focus on the problem of predicting probabilities.
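The two models and the probability-prediction idea can be sketched as follows (scikit-learn and the iris dataset are my assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# The classic iris dataset, bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Two supervised classifiers for the same task.
logreg = LogisticRegression(max_iter=1000).fit(X, y)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# predict_proba returns one probability per class; each row sums to 1.
proba = logreg.predict_proba(X[:1])
print(proba)
```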

Supervised regressor 🦿

We revisit linear regression from a different point of view. Then, we introduce polynomial regression and move on to predicting continuous numerical values.
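Polynomial regression is just linear regression on polynomial features, as in this sketch (scikit-learn and the synthetic quadratic data are my assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (invented): y = 0.5 x^2 - x + noise.
rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(0, 0.2, size=60)

# Expand x into [1, x, x^2], then fit an ordinary linear model on top.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(X, y)
r2 = poly.score(X, y)  # R^2 close to 1 on this quadratic signal
```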

Regularisation 🌊

At this stage, our Machine Learning knowledge is mature enough to ask how to measure model performance and how to solve the issues that may arise. We introduce the crucial concepts of underfitting and overfitting, and how to tackle these problems when they afflict our models.
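One classic remedy for overfitting is an L2 penalty (ridge regression). A minimal sketch, assuming scikit-learn and invented data with far fewer samples than an unregularised model would like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Few samples, many features: a recipe for overfitting (data invented).
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 15))
y = X[:, 0] + rng.normal(0, 0.1, size=20)  # only the first feature matters

ols = LinearRegression().fit(X, y)          # no penalty: fits the noise
ridge = Ridge(alpha=10.0).fit(X, y)         # L2 penalty with strength alpha

# The penalty shrinks the coefficient vector toward zero.
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```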

Software and Tools 🔧

This lecture is aimed at introducing some powerful tools.

Time Series ⏳

We illustrate methods to perform time series analysis and decomposition. We also mention techniques for time series forecasting.
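The idea of an additive decomposition can be sketched by hand with pandas (statsmodels' `seasonal_decompose` does this out of the box; the series below is invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (invented): trend + yearly seasonality + noise.
rng = np.random.default_rng(3)
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
trend = np.linspace(10, 20, 48)
season = 3 * np.sin(2 * np.pi * np.arange(48) / 12)
ts = pd.Series(trend + season + rng.normal(0, 0.3, size=48), index=idx)

# Additive decomposition: a centred 12-month moving average estimates the
# trend; averaging the detrended values per calendar month estimates the
# seasonal component. The leftover is the residual.
est_trend = ts.rolling(window=12, center=True).mean()
detrended = ts - est_trend
est_season = detrended.groupby(detrended.index.month).transform("mean")
residual = detrended - est_season
```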

Unsupervised learning 🤖

In this lecture, we drop the labels from our data and try to recognise patterns in them. We will find that unsupervised learning can uncover surprisingly rich structure without any supervision.
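A first taste of this: k-means clustering finds groups in unlabelled points. The sketch assumes scikit-learn and uses two synthetic blobs invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of points; no labels are given to the algorithm.
rng = np.random.default_rng(4)
a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
X = np.vstack([a, b])

# k-means partitions the points into two clusters on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```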

Ensemble methods 🌳🌳🌳

In this last lecture, we focus on techniques that combine many models in order to make predictions more robust and reliable. In particular, we present random forests, bagging and boosting techniques, and the well-known XGBoost.
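A random forest is perhaps the simplest of these to try out. A minimal sketch, assuming scikit-learn and one of its bundled datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# The breast-cancer dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A random forest averages many decorrelated decision trees (bagging plus
# random feature subsets), which makes predictions more robust than a
# single tree's.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)
```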

Final project 🖼️

The proposed final project applies what we have seen throughout the lectures.

The idea is to build a whole pipeline from data collection to prediction.