Laboratory Session 1: Machine Learning Landscape¶
Introduction¶
In this session (Machine Learning Landscape), students will explore the Jupyter environment for data analysis and data-science programming.
You will learn the basics of the NumPy, Pandas, Matplotlib and Sklearn modules by developing a linear model that estimates life satisfaction from GDP per capita. The model is trained on data collected from the OECD and GDP datasets. The parameters obtained after training are then used to compare, by visual inspection, the model's behaviour against the real data. The objectives of this session are listed below.
General Objective: Develop a linear regression model using the fit function of the Sklearn module to estimate each country's level of life satisfaction as a function of its GDP.
Specific Objectives:
- Learn the basics of the Jupyter platform and the commands for inserting, copying, deleting and evaluating cells
- Learn how to read CSV files
- Become familiar with accessing data by column name and with manipulating, editing and filtering data using Pandas structures
- Write a Python function that reads the OECD and GDP datasets and builds a dataset containing only "Life Satisfaction" and "GDP per capita"
- Train a linear regression model
- Compare the model with the real data and predict values for new instances
The Data: Life Satisfaction and GDP per capita¶
The next sections explore two datasets, one from each of the following sources: 1) the Organisation for Economic Co-operation and Development (OECD) and 2) the International Monetary Fund (IMF).
Life satisfaction data description¶
This dataset was obtained from the OECD's website at: http://stats.oecd.org/index.aspx?DataSetCode=BLI
Int64Index: 3292 entries, 0 to 3291
Data columns (total 17 columns):
"LOCATION" 3292 non-null object
Country 3292 non-null object
INDICATOR 3292 non-null object
Indicator 3292 non-null object
MEASURE 3292 non-null object
Measure 3292 non-null object
INEQUALITY 3292 non-null object
Inequality 3292 non-null object
Unit Code 3292 non-null object
Unit 3292 non-null object
PowerCode Code 3292 non-null int64
PowerCode 3292 non-null object
Reference Period Code 0 non-null float64
Reference Period 0 non-null float64
Value 3292 non-null float64
Flag Codes 1120 non-null object
Flags 1120 non-null object
dtypes: float64(3), int64(1), object(13)
memory usage: 462.9+ KB
Example using Python Pandas¶
>>> life_sat = pd.read_csv("oecd_bli_2015.csv", thousands=',')
>>> life_sat_total = life_sat[life_sat["INEQUALITY"]=="TOT"]
>>> life_sat_total = life_sat_total.pivot(index="Country", columns="Indicator", values="Value")
>>> life_sat_total.info()
<class 'pandas.core.frame.DataFrame'>
Index: 37 entries, Australia to United States
Data columns (total 24 columns):
Air pollution 37 non-null float64
Assault rate 37 non-null float64
Consultation on rule-making 37 non-null float64
Dwellings without basic facilities 37 non-null float64
Educational attainment 37 non-null float64
Employees working very long hours 37 non-null float64
Employment rate 37 non-null float64
Homicide rate 37 non-null float64
Household net adjusted disposable income 37 non-null float64
Household net financial wealth 37 non-null float64
Housing expenditure 37 non-null float64
Job security 37 non-null float64
Life expectancy 37 non-null float64
Life satisfaction 37 non-null float64
Long-term unemployment rate 37 non-null float64
Personal earnings 37 non-null float64
Quality of support network 37 non-null float64
Rooms per person 37 non-null float64
Self-reported health 37 non-null float64
Student skills 37 non-null float64
Time devoted to leisure and personal care 37 non-null float64
Voter turnout 37 non-null float64
Water quality 37 non-null float64
Years in education 37 non-null float64
dtypes: float64(24)
memory usage: 7.2+ KB
# Load the data
import numpy as np
import pandas as pd
url = "https://raw.githubusercontent.com/machine-learning-course-uac/1-ml-landscape/main/oecd_bli_2015.csv"
oecd_bli = pd.read_csv(url, thousands=',')
oecd_bli.info()
oecd_bli
life_sat_total = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
life_sat_total.info()
new = life_sat_total[life_sat_total["Country"]=='Austria']
new.info()
life_sat_pivoted = life_sat_total.pivot(index="Country", columns="Indicator", values="Value")
life_sat_pivoted
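The filter-then-pivot step above can be tried on a tiny table first. The following is a minimal sketch on made-up rows (the country names, the values and the "MN" code are illustrative assumptions, not taken from the OECD file). It shows how rows in long format become one row per country and one column per indicator.

```python
import pandas as pd

# Made-up long-format rows mimicking the OECD layout (illustrative values only)
toy = pd.DataFrame({
    "Country":    ["A", "A", "A", "B", "B", "B"],
    "Indicator":  ["Life satisfaction", "Employment rate", "Life satisfaction",
                   "Life satisfaction", "Employment rate", "Life satisfaction"],
    "INEQUALITY": ["TOT", "TOT", "MN", "TOT", "TOT", "MN"],
    "Value":      [7.0, 72.0, 6.9, 6.0, 65.0, 5.8],
})

# Keep only the rows tagged "TOT" (assumed here to mean the total population)
toy_total = toy[toy["INEQUALITY"] == "TOT"]

# Reshape: one row per country, one column per indicator
toy_pivoted = toy_total.pivot(index="Country", columns="Indicator", values="Value")
print(toy_pivoted)
```

Without the filter, this toy table would hold two "Life satisfaction" values for each country, and pivot would raise an error because it cannot decide which one to place in the cell.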
GDP per capita¶
Gross domestic product (GDP) per capita is an economic metric that expresses a country's economic output per person. Economists use GDP per capita as an indicator of how prosperous a country is.
GDP per capita is calculated by dividing a nation's GDP by its population. Countries with a higher GDP per capita tend to be industrialized, developed countries.
Thus, GDP per capita measures the economic output of a nation per person.
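As a quick worked example of that division, here is a sketch with hypothetical figures (both numbers are invented for illustration and do not describe any real country).

```python
# Hypothetical figures, for illustration only
gdp_usd = 500_000_000_000   # total GDP: 500 billion USD
population = 10_000_000     # 10 million inhabitants

# GDP per capita = total GDP / population
gdp_per_capita_usd = gdp_usd / population
print(gdp_per_capita_usd)
```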
The dataset was obtained from the IMF's website at: http://goo.gl/j1MSKe
Data description¶
Int64Index: 190 entries, 0 to 189
Data columns (total 7 columns):
Country 190 non-null object
Subject Descriptor 189 non-null object
Units 189 non-null object
Scale 189 non-null object
Country/Series-specific Notes 188 non-null object
2015 187 non-null float64
Estimates Start After 188 non-null float64
dtypes: float64(2), object(5)
memory usage: 11.9+ KB
Example using Python Pandas¶
>>> gdp_per_capita = pd.read_csv(
... datapath+"gdp_per_capita.csv", thousands=',', delimiter='\t',
... encoding='latin1', na_values="n/a", index_col="Country")
...
>>> gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
url2 = "https://raw.githubusercontent.com/machine-learning-course-uac/1-ml-landscape/main/gdp_per_capita.csv"
gdp_per_capita = pd.read_csv(url2,thousands=',',delimiter='\t', encoding='latin1', na_values="n/a")
gdp_per_capita
gdp_per_capita.rename(columns={"2015": "GDP"}, inplace=True)
gdp_per_capita
gdp_per_capita.set_index("Country", inplace=True)
gdp_per_capita
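The effect of the read_csv options used above can be checked without downloading anything. The sketch below parses a small made-up tab-separated text (the country names and figures are invented) with the same delimiter, thousands and na_values arguments, then applies the same rename and set_index steps.

```python
import io
import pandas as pd

# Made-up tab-separated text standing in for the IMF file
raw = (
    "Country\t2015\n"
    "Aland\t1,234.5\n"
    "Borduria\tn/a\n"
    "Carpathia\t43,210.0\n"
)

# thousands="," strips the digit-group separator; na_values marks "n/a" as missing
toy_gdp = pd.read_csv(io.StringIO(raw), delimiter="\t", thousands=",", na_values="n/a")

# Give the year column a descriptive name and index the table by country
toy_gdp.rename(columns={"2015": "GDP"}, inplace=True)
toy_gdp.set_index("Country", inplace=True)
print(toy_gdp)
```

After these steps the "GDP" column is numeric, with a missing value where the text said "n/a", and rows can be looked up by country name with toy_gdp.loc.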
Putting the whole process in a function¶
def CountryStats(oecd, gdp):
    # YOUR CODE HERE
    return country_stats[["GDP", 'Life satisfaction']].iloc[keep_indices]
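One step the function has to perform is combining the two per-country tables. The sketch below shows one possible way to do this on made-up data. The helper name merge_on_country and the toy values are hypothetical, and this is not the course's reference solution. The keep_indices selection is not defined in the text, so it is left out here.

```python
import pandas as pd

# Made-up per-country tables, both indexed by country name
toy_life = pd.DataFrame(
    {"Life satisfaction": [7.0, 6.0, 5.5]},
    index=pd.Index(["A", "B", "C"], name="Country"),
)
toy_gdp = pd.DataFrame(
    {"GDP": [50000.0, 30000.0, 12000.0, 900.0]},
    index=pd.Index(["A", "B", "C", "D"], name="Country"),
)

def merge_on_country(life, gdp):
    """Hypothetical helper: join both tables on their index and sort by GDP."""
    # The default inner join keeps only countries present in both tables
    merged = pd.merge(left=life, right=gdp, left_index=True, right_index=True)
    return merged.sort_values(by="GDP")[["GDP", "Life satisfaction"]]

toy_stats = merge_on_country(toy_life, toy_gdp)
print(toy_stats)
```

Country "D" has no life-satisfaction entry, so the inner join drops it from the result.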
import mluac as ml
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
url2 = "https://raw.githubusercontent.com/machine-learning-course-uac/1-ml-landscape/main/gdp_per_capita.csv"
gdp_per_capita = pd.read_csv(url2,thousands=',',delimiter='\t', encoding='latin1', na_values="n/a")
url = "https://raw.githubusercontent.com/machine-learning-course-uac/1-ml-landscape/main/oecd_bli_2015.csv"
oecd_bli = pd.read_csv(url, thousands=',')
cs = ml.prepare_country_stats(oecd_bli, gdp_per_capita)
cs
Exploring the data¶
X = np.c_[cs["GDP per capita"]]
y = np.c_[cs["Life satisfaction"]]
X
y
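The np.c_ calls above turn a one-dimensional column into a two-dimensional array with one row per sample, which is the layout Sklearn estimators expect for X. A minimal sketch on a made-up Series (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up values standing in for one column of the data
toy = pd.Series([10.0, 20.0, 30.0], name="GDP per capita")

# np.c_ stacks the values as a column: shape goes from (3,) to (3, 1)
column = np.c_[toy]
print(toy.shape, column.shape)

# The same result can be obtained with an explicit reshape
same = np.array_equal(column, toy.to_numpy().reshape(-1, 1))
print(same)
```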
plt.plot(X,y, 'ok', markersize=12, label="Data")
plt.xlabel("GDP per capita", fontsize=20)
plt.ylabel("Life satisfaction", fontsize=20)
plt.legend()
plt.show()
from sklearn.linear_model import LinearRegression
# Select a linear model
model = LinearRegression()
# Train the model
model.fit(X, y)
# Make a prediction for Cyprus
X_new = [[22587]] # Cyprus' GDP per capita
print(model.predict(X_new)) # outputs [[ 5.96242338]]
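The same fit and predict calls can be tried on data where the answer is known in advance. The sketch below uses synthetic points lying exactly on the line y = 1 + 2x (not the course data), so the prediction can be checked by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic points lying exactly on y = 1 + 2x (not the course data)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 1.0 + 2.0 * X_toy

# Fit the model, then predict for a single new sample
toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

pred = toy_model.predict([[10.0]])   # one row, one feature
print(pred)
```

As with X_new above, the input to predict is a nested list: the outer list holds the samples and the inner list holds the features of each sample.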
Extracting data from model¶
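The linear regression model is
$$f(x)=\theta_0 +\theta_1 x_1 +\theta_2 x_2 + \cdots$$
where $\theta_0$ is the intercept and $\theta_1, \theta_2, \ldots$ are the coefficients of the features $x_1, x_2, \ldots$. With a single feature (GDP per capita) it reduces to $f(x)=\theta_0+\theta_1 x_1$.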
#theta_0
t0 = model.intercept_[0]
t0
#theta_1
t1 = model.coef_[0][0]
t1
# Evaluate the fitted line on an evenly spaced grid of GDP values
Xest = np.linspace(0, 60000, 1000)
plt.plot(Xest, t0 + t1*Xest, "r", linewidth=3, label="Model")
plt.plot(X,y, 'ok', markersize=12, label="Data")
plt.xlabel("GDP per capita", fontsize=20)
plt.ylabel("Life satisfaction", fontsize=20)
plt.legend()
plt.show()
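The values stored in intercept_ and coef_ are the ordinary least-squares estimates of theta_0 and theta_1. As a NumPy-only sketch of what that means, the example below solves the same kind of problem on synthetic, noise-free data (the line y = 4 + 0.5x is an arbitrary choice for illustration) and evaluates the resulting line on a grid.

```python
import numpy as np

# Synthetic, noise-free data on the line y = 4 + 0.5x (illustrative only)
x_toy = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
y_toy = 4.0 + 0.5 * x_toy

# Design matrix: a column of ones for the intercept, then the feature
A = np.c_[np.ones_like(x_toy), x_toy]

# Least-squares solution of A @ theta = y
theta, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
theta_0, theta_1 = theta

# Evaluate the fitted line on an evenly spaced grid
x_grid = np.linspace(0, 8, 5)
y_line = theta_0 + theta_1 * x_grid
print(theta_0, theta_1)
```

Because the data are noise-free, the recovered parameters match the line used to generate them up to floating-point error. With real data such as the country statistics, the fitted line only approximates the points.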