<html> <head> </head>

Laboratory Session 1: Machine Learning Landscape

Introduction

For this session (Machine Learning Landscape), the student will explore the use of the Jupyter environment for data analysis and data-science-oriented programming.

In this session you will learn the basics of the Numpy, Pandas, Matplotlib and Sklearn modules by developing a linear model that estimates the level of life satisfaction from the GDP parameter. The model is trained on data collected by the OECD together with GDP data. The parameters obtained after training are then used to evaluate, by visual inspection, how the model behaves against the real data. The session therefore has the following objectives.

General Objective: Develop a linear regression model, using the fit function of the Sklearn module, to determine the level of life satisfaction as a function of each country's GDP.

Specific Objectives:

  • Learn the basic handling of the Jupyter platform and the commands for inserting, copying, deleting and evaluating cells
  • Learn the procedure for reading CSV files
  • Become familiar with accessing data by column name, and with manipulating, editing and filtering data using the Pandas data structure
  • Write a Python function that reads the OECD and GDP datasets and then builds a dataset containing only "Life Satisfaction" and "GDP per capita"
  • Train a model based on linear regression
  • Compare the model with the real data and predict values for new instances

The Data: Life Satisfaction and GDP per capita

The next sections explore the two datasets, one from each of these sources: 1) the Organisation for Economic Co-operation and Development (OECD) and 2) the International Monetary Fund (IMF).

Life satisfaction data description

This dataset was obtained from the OECD's website at: http://stats.oecd.org/index.aspx?DataSetCode=BLI

Int64Index: 3292 entries, 0 to 3291
Data columns (total 17 columns):
"LOCATION"              3292 non-null object
Country                  3292 non-null object
INDICATOR                3292 non-null object
Indicator                3292 non-null object
MEASURE                  3292 non-null object
Measure                  3292 non-null object
INEQUALITY               3292 non-null object
Inequality               3292 non-null object
Unit Code                3292 non-null object
Unit                     3292 non-null object
PowerCode Code           3292 non-null int64
PowerCode                3292 non-null object
Reference Period Code    0 non-null float64
Reference Period         0 non-null float64
Value                    3292 non-null float64
Flag Codes               1120 non-null object
Flags                    1120 non-null object
dtypes: float64(3), int64(1), object(13)
memory usage: 462.9+ KB

Example using Python Pandas

>>> life_sat = pd.read_csv("oecd_bli_2015.csv", thousands=',')

>>> life_sat_total = life_sat[life_sat["INEQUALITY"]=="TOT"]

>>> life_sat_total = life_sat_total.pivot(index="Country", columns="Indicator", values="Value")

>>> life_sat_total.info()
<class 'pandas.core.frame.DataFrame'>
Index: 37 entries, Australia to United States
Data columns (total 24 columns):
Air pollution                                37 non-null float64
Assault rate                                 37 non-null float64
Consultation on rule-making                  37 non-null float64
Dwellings without basic facilities           37 non-null float64
Educational attainment                       37 non-null float64
Employees working very long hours            37 non-null float64
Employment rate                              37 non-null float64
Homicide rate                                37 non-null float64
Household net adjusted disposable income     37 non-null float64
Household net financial wealth               37 non-null float64
Housing expenditure                          37 non-null float64
Job security                                 37 non-null float64
Life expectancy                              37 non-null float64
Life satisfaction                            37 non-null float64
Long-term unemployment rate                  37 non-null float64
Personal earnings                            37 non-null float64
Quality of support network                   37 non-null float64
Rooms per person                             37 non-null float64
Self-reported health                         37 non-null float64
Student skills                               37 non-null float64
Time devoted to leisure and personal care    37 non-null float64
Voter turnout                                37 non-null float64
Water quality                                37 non-null float64
Years in education                           37 non-null float64
dtypes: float64(24)
memory usage: 7.2+ KB
In [6]:
# Load the data
import numpy as np
import pandas as pd
url = "https://raw.githubusercontent.com/machine-learning-course-uac/1-ml-landscape/main/oecd_bli_2015.csv"
oecd_bli = pd.read_csv(url, thousands=',')
oecd_bli.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3292 entries, 0 to 3291
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   LOCATION               3292 non-null   object 
 1   Country                3292 non-null   object 
 2   INDICATOR              3292 non-null   object 
 3   Indicator              3292 non-null   object 
 4   MEASURE                3292 non-null   object 
 5   Measure                3292 non-null   object 
 6   INEQUALITY             3292 non-null   object 
 7   Inequality             3292 non-null   object 
 8   Unit Code              3292 non-null   object 
 9   Unit                   3292 non-null   object 
 10  PowerCode Code         3292 non-null   int64  
 11  PowerCode              3292 non-null   object 
 12  Reference Period Code  0 non-null      float64
 13  Reference Period       0 non-null      float64
 14  Value                  3292 non-null   float64
 15  Flag Codes             1120 non-null   object 
 16  Flags                  1120 non-null   object 
dtypes: float64(3), int64(1), object(13)
memory usage: 437.3+ KB
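The `thousands=','` argument matters whenever numeric fields are written with digit-group separators. A minimal sketch with an in-memory CSV follows; the two rows are made up for illustration and are not taken from the OECD file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose numbers use ',' as a thousands separator
csv_text = 'Country,Value\nA,"1,234"\nB,"56,789"\n'

# Without the argument the quoted fields stay as text
plain = pd.read_csv(io.StringIO(csv_text))
# With thousands=',' the separators are stripped and the column becomes numeric
parsed = pd.read_csv(io.StringIO(csv_text), thousands=',')

print(plain["Value"].tolist())
print(parsed["Value"].tolist())
```

Comparing the two printed lists shows why the argument is passed when loading both datasets in this session.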
In [7]:
oecd_bli
Out[7]:
LOCATION Country INDICATOR Indicator MEASURE Measure INEQUALITY Inequality Unit Code Unit PowerCode Code PowerCode Reference Period Code Reference Period Value Flag Codes Flags
0 AUS Australia HO_BASE Dwellings without basic facilities L Value TOT Total PC Percentage 0 units NaN NaN 1.10 E Estimated value
1 AUT Austria HO_BASE Dwellings without basic facilities L Value TOT Total PC Percentage 0 units NaN NaN 1.00 NaN NaN
2 BEL Belgium HO_BASE Dwellings without basic facilities L Value TOT Total PC Percentage 0 units NaN NaN 2.00 NaN NaN
3 CAN Canada HO_BASE Dwellings without basic facilities L Value TOT Total PC Percentage 0 units NaN NaN 0.20 NaN NaN
4 CZE Czech Republic HO_BASE Dwellings without basic facilities L Value TOT Total PC Percentage 0 units NaN NaN 0.90 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3287 EST Estonia WL_TNOW Time devoted to leisure and personal care L Value WMN Women HOUR Hours 0 units NaN NaN 14.43 NaN NaN
3288 ISR Israel WL_TNOW Time devoted to leisure and personal care L Value WMN Women HOUR Hours 0 units NaN NaN 14.24 E Estimated value
3289 RUS Russia WL_TNOW Time devoted to leisure and personal care L Value WMN Women HOUR Hours 0 units NaN NaN 14.75 E Estimated value
3290 SVN Slovenia WL_TNOW Time devoted to leisure and personal care L Value WMN Women HOUR Hours 0 units NaN NaN 14.12 NaN NaN
3291 OECD OECD - Total WL_TNOW Time devoted to leisure and personal care L Value WMN Women HOUR Hours 0 units NaN NaN 14.74 NaN NaN

3292 rows × 17 columns

In [8]:
life_sat_total = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
life_sat_total.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 888 entries, 0 to 3217
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   LOCATION               888 non-null    object 
 1   Country                888 non-null    object 
 2   INDICATOR              888 non-null    object 
 3   Indicator              888 non-null    object 
 4   MEASURE                888 non-null    object 
 5   Measure                888 non-null    object 
 6   INEQUALITY             888 non-null    object 
 7   Inequality             888 non-null    object 
 8   Unit Code              888 non-null    object 
 9   Unit                   888 non-null    object 
 10  PowerCode Code         888 non-null    int64  
 11  PowerCode              888 non-null    object 
 12  Reference Period Code  0 non-null      float64
 13  Reference Period       0 non-null      float64
 14  Value                  888 non-null    float64
 15  Flag Codes             58 non-null     object 
 16  Flags                  58 non-null     object 
dtypes: float64(3), int64(1), object(13)
memory usage: 124.9+ KB
In [9]:
new = life_sat_total[life_sat_total["Country"]=='Austria']
new.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 1 to 3182
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   LOCATION               24 non-null     object 
 1   Country                24 non-null     object 
 2   INDICATOR              24 non-null     object 
 3   Indicator              24 non-null     object 
 4   MEASURE                24 non-null     object 
 5   Measure                24 non-null     object 
 6   INEQUALITY             24 non-null     object 
 7   Inequality             24 non-null     object 
 8   Unit Code              24 non-null     object 
 9   Unit                   24 non-null     object 
 10  PowerCode Code         24 non-null     int64  
 11  PowerCode              24 non-null     object 
 12  Reference Period Code  0 non-null      float64
 13  Reference Period       0 non-null      float64
 14  Value                  24 non-null     float64
 15  Flag Codes             0 non-null      object 
 16  Flags                  0 non-null      object 
dtypes: float64(3), int64(1), object(13)
memory usage: 3.4+ KB
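The two filters above rely on boolean masks: comparing a column with a value gives one True/False entry per row, and indexing the frame with that mask keeps the True rows. A small sketch on a toy table (the column names follow the OECD file; the "MN" row value is a placeholder, not OECD data):

```python
import pandas as pd

# Toy long-format table; the 0.9 value in the "MN" row is a placeholder
toy = pd.DataFrame({
    "Country": ["Australia", "Austria", "Austria"],
    "INEQUALITY": ["TOT", "TOT", "MN"],
    "Value": [1.1, 1.0, 0.9],
})

is_total = toy["INEQUALITY"] == "TOT"      # boolean Series, one entry per row
is_austria = toy["Country"] == "Austria"

totals = toy[is_total]                      # rows where the mask is True
austria_total = toy[is_total & is_austria]  # masks combine with & (and), | (or)

print(len(totals), len(austria_total))
```

Combining the masks in one step gives the same rows as filtering twice in a row, as cells 8 and 9 do.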
In [10]:
life_sat_pivoted = life_sat_total.pivot(index="Country", columns="Indicator", values="Value")
life_sat_pivoted
Out[10]:
Indicator Air pollution Assault rate Consultation on rule-making Dwellings without basic facilities Educational attainment Employees working very long hours Employment rate Homicide rate Household net adjusted disposable income Household net financial wealth ... Long-term unemployment rate Personal earnings Quality of support network Rooms per person Self-reported health Student skills Time devoted to leisure and personal care Voter turnout Water quality Years in education
Country
Australia 13.0 2.1 10.5 1.1 76.0 14.02 72.0 0.8 31588.0 47657.0 ... 1.08 50449.0 92.0 2.3 85.0 512.0 14.41 93.0 91.0 19.4
Austria 27.0 3.4 7.1 1.0 83.0 7.61 72.0 0.4 31173.0 49887.0 ... 1.19 45199.0 89.0 1.6 69.0 500.0 14.46 75.0 94.0 17.0
Belgium 21.0 6.6 4.5 2.0 72.0 4.57 62.0 1.1 28307.0 83876.0 ... 3.88 48082.0 94.0 2.2 74.0 509.0 15.71 89.0 87.0 18.9
Brazil 18.0 7.9 4.0 6.7 45.0 10.41 67.0 25.5 11664.0 6844.0 ... 1.97 17177.0 90.0 1.6 69.0 402.0 14.97 79.0 72.0 16.3
Canada 15.0 1.3 10.5 0.2 89.0 3.94 72.0 1.5 29365.0 67913.0 ... 0.90 46911.0 92.0 2.5 89.0 522.0 14.25 61.0 91.0 17.2
Chile 46.0 6.9 2.0 9.4 57.0 15.42 62.0 4.4 14533.0 17733.0 ... 1.59 22101.0 86.0 1.2 59.0 436.0 14.41 49.0 73.0 16.5
Czech Republic 16.0 2.8 6.8 0.9 92.0 6.98 68.0 0.8 18404.0 17299.0 ... 3.12 20338.0 85.0 1.4 60.0 500.0 14.98 59.0 85.0 18.1
Denmark 15.0 3.9 7.0 0.9 78.0 2.03 73.0 0.3 26491.0 44488.0 ... 1.78 48347.0 95.0 1.9 72.0 498.0 16.06 88.0 94.0 19.4
Estonia 9.0 5.5 3.3 8.1 90.0 3.30 68.0 4.8 15167.0 7680.0 ... 3.82 18944.0 89.0 1.5 54.0 526.0 14.90 64.0 79.0 17.5
Finland 15.0 2.4 9.0 0.6 85.0 3.58 69.0 1.4 27927.0 18761.0 ... 1.73 40060.0 95.0 1.9 65.0 529.0 14.89 69.0 94.0 19.7
France 12.0 5.0 3.5 0.5 73.0 8.15 64.0 0.6 28799.0 48741.0 ... 3.99 40242.0 87.0 1.8 67.0 500.0 15.33 80.0 82.0 16.4
Germany 16.0 3.6 4.5 0.1 86.0 5.25 73.0 0.5 31252.0 50394.0 ... 2.37 43682.0 94.0 1.8 65.0 515.0 15.31 72.0 95.0 18.2
Greece 27.0 3.7 6.5 0.7 68.0 6.16 49.0 1.6 18575.0 14579.0 ... 18.39 25503.0 83.0 1.2 74.0 466.0 14.91 64.0 69.0 18.6
Hungary 15.0 3.6 7.9 4.8 82.0 3.19 58.0 1.3 15442.0 13277.0 ... 5.10 20948.0 87.0 1.1 57.0 487.0 15.04 62.0 77.0 17.6
Iceland 18.0 2.7 5.1 0.4 71.0 12.25 82.0 0.3 23965.0 43045.0 ... 1.18 55716.0 96.0 1.5 77.0 484.0 14.61 81.0 97.0 19.8
Ireland 13.0 2.6 9.0 0.2 75.0 4.20 60.0 0.8 23917.0 31580.0 ... 8.39 49506.0 96.0 2.1 82.0 516.0 15.19 70.0 80.0 17.6
Israel 21.0 6.4 2.5 3.7 85.0 16.03 67.0 2.3 22104.0 52933.0 ... 0.79 28817.0 87.0 1.2 80.0 474.0 14.48 68.0 68.0 15.8
Italy 21.0 4.7 5.0 1.1 57.0 3.66 56.0 0.7 25166.0 54987.0 ... 6.94 34561.0 90.0 1.4 66.0 490.0 14.98 75.0 71.0 16.8
Japan 24.0 1.4 7.3 6.4 94.0 22.26 72.0 0.3 26111.0 86764.0 ... 1.67 35405.0 89.0 1.8 30.0 540.0 14.93 53.0 85.0 16.3
Korea 30.0 2.1 10.4 4.2 82.0 18.72 64.0 1.1 19510.0 29091.0 ... 0.01 36354.0 72.0 1.4 35.0 542.0 14.63 76.0 78.0 17.5
Luxembourg 12.0 4.3 6.0 0.1 78.0 3.47 66.0 0.4 38951.0 61765.0 ... 1.78 56021.0 87.0 2.0 72.0 490.0 15.12 91.0 86.0 15.1
Mexico 30.0 12.8 9.0 4.2 37.0 28.83 61.0 23.4 13085.0 9056.0 ... 0.08 16193.0 77.0 1.0 66.0 417.0 13.89 63.0 67.0 14.4
Netherlands 30.0 4.9 6.1 0.0 73.0 0.45 74.0 0.9 27888.0 77961.0 ... 2.40 47590.0 90.0 2.0 76.0 519.0 15.44 75.0 92.0 18.7
New Zealand 11.0 2.2 10.3 0.2 74.0 13.87 73.0 1.2 23815.0 28290.0 ... 0.75 35609.0 94.0 2.4 90.0 509.0 14.87 77.0 89.0 18.1
Norway 16.0 3.3 8.1 0.3 82.0 2.82 75.0 0.6 33492.0 8797.0 ... 0.32 50282.0 94.0 2.0 76.0 496.0 15.56 78.0 94.0 17.9
OECD - Total 20.0 3.9 7.3 2.4 75.0 12.51 65.0 4.0 25908.0 67139.0 ... 2.79 36118.0 88.0 1.8 68.0 497.0 14.97 68.0 81.0 17.7
Poland 33.0 1.4 10.8 3.2 90.0 7.41 60.0 0.9 17852.0 10919.0 ... 3.77 22655.0 91.0 1.1 58.0 521.0 14.20 55.0 79.0 18.4
Portugal 18.0 5.7 6.5 0.9 38.0 9.62 61.0 1.1 20086.0 31245.0 ... 9.11 23688.0 86.0 1.6 46.0 488.0 14.95 58.0 86.0 17.6
Russia 15.0 3.8 2.5 15.1 94.0 0.16 69.0 12.8 19292.0 3412.0 ... 1.70 20885.0 90.0 0.9 37.0 481.0 14.97 65.0 56.0 16.0
Slovak Republic 13.0 3.0 6.6 0.6 92.0 7.02 60.0 1.2 17503.0 8663.0 ... 9.46 20307.0 90.0 1.1 66.0 472.0 14.99 59.0 81.0 16.3
Slovenia 26.0 3.9 10.3 0.5 85.0 5.63 63.0 0.4 19326.0 18465.0 ... 5.15 32037.0 90.0 1.5 65.0 499.0 14.62 52.0 88.0 18.4
Spain 24.0 4.2 7.3 0.1 55.0 5.89 56.0 0.6 22477.0 24774.0 ... 12.96 34824.0 95.0 1.9 72.0 490.0 16.06 69.0 71.0 17.6
Sweden 10.0 5.1 10.9 0.0 88.0 1.13 74.0 0.7 29185.0 60328.0 ... 1.37 40818.0 92.0 1.7 81.0 482.0 15.11 86.0 95.0 19.3
Switzerland 20.0 4.2 8.4 0.0 86.0 6.72 80.0 0.5 33491.0 108823.0 ... 1.46 54236.0 96.0 1.8 81.0 518.0 14.98 49.0 96.0 17.3
Turkey 35.0 5.0 5.5 12.7 34.0 40.86 50.0 1.2 14095.0 3251.0 ... 2.37 16919.0 86.0 1.1 68.0 462.0 13.42 88.0 62.0 16.4
United Kingdom 13.0 1.9 11.5 0.2 78.0 12.70 71.0 0.3 27029.0 60778.0 ... 2.77 41192.0 91.0 1.9 74.0 502.0 14.83 66.0 88.0 16.4
United States 18.0 1.5 8.3 0.1 89.0 11.30 67.0 5.2 41355.0 145769.0 ... 1.91 56340.0 90.0 2.4 88.0 492.0 14.27 68.0 85.0 17.2

37 rows × 24 columns
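The pivot step reshapes the long table (one row per country and indicator) into a wide one (one row per country, one column per indicator). A minimal sketch using four values that appear in this notebook's outputs:

```python
import pandas as pd

# Long format: one row per (Country, Indicator) pair
long_df = pd.DataFrame({
    "Country": ["Australia", "Australia", "Austria", "Austria"],
    "Indicator": ["Air pollution", "Life satisfaction"] * 2,
    "Value": [13.0, 7.3, 27.0, 6.9],
})

# Wide format: Country becomes the index, each Indicator becomes a column
wide = long_df.pivot(index="Country", columns="Indicator", values="Value")

print(wide)
print(wide.loc["Austria", "Life satisfaction"])
```

`pivot` needs each (Country, Indicator) pair to be unique, which is one reason to keep only the `"TOT"` rows before reshaping.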

GDP per capita

Gross domestic product (GDP) per capita is an economic metric that expresses a country's economic output per person. It is calculated by dividing a nation's GDP by its population, and economists use it as an indicator of how prosperous a country is.

Countries with a higher GDP per capita tend to be industrialized, developed countries.
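As a quick worked example of the definition, the division can be written out directly. The figures below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical figures, not taken from the IMF dataset
gdp_usd = 500_000_000_000   # total GDP: 500 billion U.S. dollars
population = 10_000_000     # 10 million inhabitants

gdp_per_capita_usd = gdp_usd / population
print(gdp_per_capita_usd)   # 50000.0
```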


The dataset was obtained from the IMF's website at: http://goo.gl/j1MSKe

Data description

Int64Index: 190 entries, 0 to 189
Data columns (total 7 columns):
Country                          190 non-null object
Subject Descriptor               189 non-null object
Units                            189 non-null object
Scale                            189 non-null object
Country/Series-specific Notes    188 non-null object
2015                             187 non-null float64
Estimates Start After            188 non-null float64
dtypes: float64(2), object(5)
memory usage: 11.9+ KB

Example using Python Pandas

>>> gdp_per_capita = pd.read_csv(
...     datapath+"gdp_per_capita.csv", thousands=',', delimiter='\t',
...     encoding='latin1', na_values="n/a", index_col="Country")
...
>>> gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
In [13]:
url2 = "https://raw.githubusercontent.com/machine-learning-course-uac/1-ml-landscape/main/gdp_per_capita.csv"
gdp_per_capita = pd.read_csv(url2,thousands=',',delimiter='\t', encoding='latin1', na_values="n/a")
gdp_per_capita
Out[13]:
Country Subject Descriptor Units Scale Country/Series-specific Notes 2015 Estimates Start After
0 Afghanistan Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 599.994 2013.0
1 Albania Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 3995.383 2010.0
2 Algeria Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 4318.135 2014.0
3 Angola Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 4100.315 2014.0
4 Antigua and Barbuda Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 14414.302 2011.0
... ... ... ... ... ... ... ...
185 Vietnam Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 2088.344 2012.0
186 Yemen Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 1302.940 2008.0
187 Zambia Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 1350.151 2010.0
188 Zimbabwe Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 1064.350 2012.0
189 International Monetary Fund, World Economic Ou... NaN NaN NaN NaN NaN NaN

190 rows × 7 columns
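The extra arguments in this `read_csv` call handle a tab-separated file with thousands separators and textual missing-value markers. The sketch below reproduces that on a small in-memory sample; the sample text, including the "Nowhere" row, is an assumption about the layout and is not copied from the IMF file. `encoding` is omitted because the sample is already a Python string.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample mimicking the layout of the GDP file
tsv_text = (
    "Country\t2015\n"
    "Algeria\t4,318.135\n"
    "Nowhere\tn/a\n"
)

sample = pd.read_csv(io.StringIO(tsv_text), delimiter="\t",
                     thousands=",", na_values="n/a")

print(sample)
print(sample["2015"].isna().tolist())
```

The "n/a" entry is read as NaN, and the remaining value is parsed as a float rather than text.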

In [14]:
gdp_per_capita.rename(columns={"2015": "GDP"}, inplace=True)
gdp_per_capita
Out[14]:
Country Subject Descriptor Units Scale Country/Series-specific Notes GDP Estimates Start After
0 Afghanistan Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 599.994 2013.0
1 Albania Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 3995.383 2010.0
2 Algeria Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 4318.135 2014.0
3 Angola Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 4100.315 2014.0
4 Antigua and Barbuda Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 14414.302 2011.0
... ... ... ... ... ... ... ...
185 Vietnam Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 2088.344 2012.0
186 Yemen Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 1302.940 2008.0
187 Zambia Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 1350.151 2010.0
188 Zimbabwe Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 1064.350 2012.0
189 International Monetary Fund, World Economic Ou... NaN NaN NaN NaN NaN NaN

190 rows × 7 columns

In [15]:
gdp_per_capita.set_index("Country", inplace=True)
gdp_per_capita
Out[15]:
Subject Descriptor Units Scale Country/Series-specific Notes GDP Estimates Start After
Country
Afghanistan Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 599.994 2013.0
Albania Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 3995.383 2010.0
Algeria Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 4318.135 2014.0
Angola Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 4100.315 2014.0
Antigua and Barbuda Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 14414.302 2011.0
... ... ... ... ... ... ...
Vietnam Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 2088.344 2012.0
Yemen Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 1302.940 2008.0
Zambia Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 1350.151 2010.0
Zimbabwe Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 1064.350 2012.0
International Monetary Fund, World Economic Outlook Database, April 2016 NaN NaN NaN NaN NaN NaN

190 rows × 6 columns
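Cells 14 and 15 modify `gdp_per_capita` in place, so running them a second time fails because the "2015" and "Country" columns no longer exist. One alternative, sketched here on two rows taken from the table above, is to chain the non-inplace versions, which return new frames and leave the original untouched.

```python
import pandas as pd

raw = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "2015": [599.994, 3995.383],
})

# rename and set_index each return a new DataFrame, so they can be chained
indexed = raw.rename(columns={"2015": "GDP"}).set_index("Country")

print(indexed.loc["Albania", "GDP"])
print(list(raw.columns))   # raw still has its original columns
```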

Wrapping the whole process in a function

In [20]:
def CountryStats(oecd, gdp):
    # YOUR CODE HERE
    return country_stats[["GDP", 'Life satisfaction']].iloc[keep_indices]
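One possible shape for such a function is sketched below. It is not the reference solution: the name `country_stats_sketch` is hypothetical, the row selection by `keep_indices` is left out because those indices are not defined here, and the toy inputs reuse a few values from this notebook, with `0.0` as a placeholder where no value is shown.

```python
import pandas as pd

def country_stats_sketch(oecd, gdp):
    """Hypothetical helper: pair each country's GDP with its life satisfaction."""
    # Keep the aggregate ("TOT") rows and reshape to one row per country
    totals = oecd[oecd["INEQUALITY"] == "TOT"]
    wide = totals.pivot(index="Country", columns="Indicator", values="Value")
    # rename/set_index return copies, so the caller's frame is left untouched
    gdp = gdp.rename(columns={"2015": "GDP"}).set_index("Country")
    # Inner join on the Country index keeps countries present in both tables
    merged = pd.merge(wide, gdp, left_index=True, right_index=True)
    return merged.sort_values(by="GDP")[["GDP", "Life satisfaction"]]

oecd = pd.DataFrame({
    "Country": ["Austria", "Austria", "Turkey", "OECD - Total"],
    "INEQUALITY": ["TOT", "MN", "TOT", "TOT"],
    "Indicator": ["Life satisfaction"] * 4,
    "Value": [6.9, 0.0, 5.6, 0.0],   # the 0.0 entries are placeholders
})
gdp = pd.DataFrame({
    "Country": ["Afghanistan", "Austria", "Turkey"],
    "2015": [599.994, 43724.031, 9437.372],
})

result = country_stats_sketch(oecd, gdp)
print(result)
```

Only Austria and Turkey appear in both toy tables, so the join drops "OECD - Total" and Afghanistan, and the result is sorted by GDP.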
In [16]:
import mluac as ml
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [17]:
url2 = "https://raw.githubusercontent.com/machine-learning-course-uac/1-ml-landscape/main/gdp_per_capita.csv"
gdp_per_capita = pd.read_csv(url2,thousands=',',delimiter='\t', encoding='latin1', na_values="n/a")
url = "https://raw.githubusercontent.com/machine-learning-course-uac/1-ml-landscape/main/oecd_bli_2015.csv"
oecd_bli = pd.read_csv(url, thousands=',')
cs = ml.prepare_country_stats(oecd_bli, gdp_per_capita)
cs
Out[17]:
                 GDP per capita  Life satisfaction
Country
Russia                 9054.914                6.0
Turkey                 9437.372                5.6
Hungary               12239.894                4.9
Poland                12495.334                5.8
Slovak Republic       15991.736                6.1
Estonia               17288.083                5.6
Greece                18064.288                4.8
Portugal              19121.592                5.1
Slovenia              20732.482                5.7
Spain                 25864.721                6.5
Korea                 27195.197                5.8
Italy                 29866.581                6.0
Japan                 32485.545                5.9
Israel                35343.336                7.4
New Zealand           37044.891                7.3
France                37675.006                6.5
Belgium               40106.632                6.9
Germany               40996.511                7.0
Finland               41973.988                7.4
Canada                43331.961                7.3
Netherlands           43603.115                7.3
Austria               43724.031                6.9
United Kingdom        43770.688                6.8
Sweden                49866.266                7.2
Iceland               50854.583                7.5
Australia             50961.865                7.3
Ireland               51350.744                7.0
Denmark               52114.165                7.5
United States         55805.204                7.2

Exploring the data

In [18]:
X = np.c_[cs["GDP per capita"]]
y = np.c_[cs["Life satisfaction"]]
In [19]:
X
Out[19]:
array([[ 9054.914],
       [ 9437.372],
       [12239.894],
       [12495.334],
       [15991.736],
       [17288.083],
       [18064.288],
       [19121.592],
       [20732.482],
       [25864.721],
       [27195.197],
       [29866.581],
       [32485.545],
       [35343.336],
       [37044.891],
       [37675.006],
       [40106.632],
       [40996.511],
       [41973.988],
       [43331.961],
       [43603.115],
       [43724.031],
       [43770.688],
       [49866.266],
       [50854.583],
       [50961.865],
       [51350.744],
       [52114.165],
       [55805.204]])
In [20]:
y
Out[20]:
array([[6. ],
       [5.6],
       [4.9],
       [5.8],
       [6.1],
       [5.6],
       [4.8],
       [5.1],
       [5.7],
       [6.5],
       [5.8],
       [6. ],
       [5.9],
       [7.4],
       [7.3],
       [6.5],
       [6.9],
       [7. ],
       [7.4],
       [7.3],
       [7.3],
       [6.9],
       [6.8],
       [7.2],
       [7.5],
       [7.3],
       [7. ],
       [7.5],
       [7.2]])
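`np.c_` is used above to turn each one-dimensional column into a two-dimensional column vector, which is the shape Sklearn estimators expect for the feature matrix (samples by features). A short sketch using the first three GDP values from the output above:

```python
import numpy as np
import pandas as pd

gdp = pd.Series([9054.914, 9437.372, 12239.894], name="GDP per capita")

flat = gdp.to_numpy()   # 1-D array, shape (3,)
column = np.c_[gdp]     # 2-D column vector, shape (3, 1)

print(flat.shape, column.shape)
# reshape(-1, 1) is an equivalent way to build the same column vector
print(np.array_equal(column, flat.reshape(-1, 1)))
```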
In [21]:
plt.plot(X,y, 'ok', markersize=12, label="Data")
plt.xlabel("GDP per capita", fontsize=20)
plt.ylabel("Life satisfaction", fontsize=20)
plt.legend()
plt.show()

Model of GDP and Life Satisfaction

Fitting and predictions

In [24]:
from sklearn.linear_model import LinearRegression
# Select a linear model
model = LinearRegression()
In [25]:
# Train the model
model.fit(X, y)
Out[25]:
LinearRegression()
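For a single feature, the same least-squares line can be obtained directly with NumPy, which is a useful cross-check on the fit. The sketch below assumes the 29 rows shown in Out[17] are the full training set; it is not a description of how Sklearn computes the fit internally.

```python
import numpy as np

# (GDP per capita, life satisfaction) pairs copied from the table in Out[17]
data = np.array([
    [9054.914, 6.0], [9437.372, 5.6], [12239.894, 4.9], [12495.334, 5.8],
    [15991.736, 6.1], [17288.083, 5.6], [18064.288, 4.8], [19121.592, 5.1],
    [20732.482, 5.7], [25864.721, 6.5], [27195.197, 5.8], [29866.581, 6.0],
    [32485.545, 5.9], [35343.336, 7.4], [37044.891, 7.3], [37675.006, 6.5],
    [40106.632, 6.9], [40996.511, 7.0], [41973.988, 7.4], [43331.961, 7.3],
    [43603.115, 7.3], [43724.031, 6.9], [43770.688, 6.8], [49866.266, 7.2],
    [50854.583, 7.5], [50961.865, 7.3], [51350.744, 7.0], [52114.165, 7.5],
    [55805.204, 7.2],
])
x, y = data[:, 0], data[:, 1]

# Design matrix: a column of ones (intercept) next to the feature column
A = np.c_[np.ones_like(x), x]

# Solve min ||A @ theta - y||^2 for theta = (intercept, slope)
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
theta_0, theta_1 = theta

print(theta_0, theta_1)
```

The two printed numbers should agree closely with `model.intercept_` and `model.coef_` of the fitted model.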
In [22]:
# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new)) # outputs [[ 5.96242338]]
[[5.96242338]]

Extracting data from model

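The linear regression model

$$f(x)=\theta_0 +\theta_1 x_1 +\theta_2 x_2 + \cdots$$

Here $\theta_0$ is the intercept and $\theta_1, \theta_2, \ldots$ are the coefficients of the features $x_1, x_2, \ldots$. The model in this session has a single feature, GDP per capita.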

In [29]:
#theta_0
t0 = model.intercept_[0]
t0
Out[29]:
4.853052800266435
In [30]:
#theta_1
t1 =model.coef_
t1
Out[30]:
array([[4.91154459e-05]])
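With the intercept and slope in hand, the Cyprus prediction can be reproduced by hand as intercept plus slope times GDP per capita. The sketch below copies the two parameter values from the outputs above and the GDP figure from the prediction cell.

```python
theta_0 = 4.853052800266435   # intercept, from model.intercept_
theta_1 = 4.91154459e-05      # slope, from model.coef_ (as printed)
gdp_cyprus = 22587            # Cyprus' GDP per capita

prediction = theta_0 + theta_1 * gdp_cyprus
print(round(prediction, 4))   # 5.9624
```

This matches the value returned by `model.predict` up to the rounding of the printed slope.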
In [31]:
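Xest = np.linspace(0, 60000, 1000)
# Draw the fitted line on the dense grid; t1 has shape (1, 1), so take its scalar entry
plt.plot(Xest, t0 + t1[0, 0]*Xest, "r", linewidth=3, label="Model")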
plt.plot(X,y, 'ok', markersize=12, label="Data")
plt.xlabel("GDP per capita", fontsize=20)
plt.ylabel("Life satisfaction", fontsize=20)
plt.legend()
plt.show()
</html>