# Project Title

# Module 1: Logistic Regression Classifier

* **Author:** Sofia Samaniego Lopez
* **Institution:** Universidad Autonoma de Baja California (UABC)
* **Advisor:** Dr. Gerardo Marx Chavez Campos

This script establishes the baseline evaluation for logistic-regression classification using the classic Iris dataset. The objective is to analyze linear combination, decision boundaries, and probability estimation based on morphological features, starting with petal width and petal length. The model uses Scikit-Learn's `LogisticRegression` estimator to run the optimization and map inputs to a probability output through the Sigmoid function.

This code serves as the benchmark: it generates the reference curves and spatial visualizations against which the performance and accuracy of subsequent custom implementations are evaluated.

## Experimental Setup and Preliminary Analysis

### Step 1: Import Required Libraries and Environment Setup

In this initial stage, the scientific computing and data processing libraries are installed and imported to set up the development environment:

* **NumPy**: Used for efficient multi-dimensional array operations and core matrix algebra.
* **Matplotlib (pyplot)**: Used to generate the scatter plots and draw the resulting decision boundaries.
* **Scikit-Learn**: Provides the `datasets` module, which loads the morphological data, and `LogisticRegression`, which serves as the validation benchmark. Both are imported in later cells, where they are first used.

```python
# Notebook shell commands: install the dependencies (only needed once)
!pip3 install scikit-learn
!pip3 install matplotlib
!pip3 install numpy

import matplotlib.pyplot as plt
import numpy as np
```

### Step 2: Load and Explore the Iris Dataset Characteristics

The classic **Iris Dataset** is loaded into the workspace to establish the baseline classification task.

#### Dataset Overview and Morphological Features

The complete dataset consists of **150 samples** from three distinct species of Iris flowers (*Iris setosa*, *Iris versicolor*, and *Iris virginica*). For each sample, four continuous geometric features are available:

1. Sepal Length
2. Sepal Width
3. Petal Length
4. Petal Width

#### Problem Simplification & Single-Feature Evaluation

The classification task is first constrained so that the Sigmoid function and linear thresholds can be analyzed on a simpler scale:

* **Target Binarization:** The multi-class target is converted to binary labels ($y=1$ for the species of interest, $y=0$ for the others) to establish a clear threshold. Models 1 to 3 use *Iris setosa* as the positive class; Model 4 uses *Iris virginica*.
* **Single-Feature Isolation:** Instead of combining dimensions, features are evaluated **one at a time** (e.g., petal width or sepal length) to inspect their independent separation power before building multi-dimensional models.

```python
from sklearn import datasets

iris = datasets.load_iris()
print(iris.DESCR)  # full text description of the dataset
```

### Step 3: Exploratory Data Analysis & Target Inspection

Prior to model optimization, a visual and structural inspection evaluates the data distribution:

* **Unclassified Spatial Mapping:** Plotting Sepal Length (`sl`) vs. Sepal Width (`sw`) using black point markers (`'.k'`) reveals the raw data structure. This helps verify whether the morphological features naturally form distinct clusters before applying any algorithmic boundaries.
* **Target Label Mapping (`iris.target`):** Inspecting the ground-truth array shows the numerical encoding of the three species:
    * `0`: *Iris setosa*
    * `1`: *Iris versicolor*
    * `2`: *Iris virginica*

This inspection links the visual spatial groups with their numerical labels, establishing the baseline before data binarization.

```python
sl = iris.data[:, 0:1]  # sepal length (column 0)
sw = iris.data[:, 1:2]  # sepal width (column 1)
plt.plot(sl, sw, '.k')
plt.show()
```

```python
iris.target
```

### Step 4: Theoretical Sigmoid Function & Decision Boundary

A synthetic domain from -10 to 10 is generated to plot the standalone Sigmoid function:

$$\sigma(t) = \frac{1}{1 + e^{-t}}$$

This visualizes how the curve maps inputs to a probability between 0 and 1, establishing the inflection point $\sigma(0) = 0.5$ as the theoretical threshold for the decision boundary.

```python
t = np.linspace(-10, 10, 100)
sig = 1 / (1 + np.exp(-t))
plt.plot(t, sig, '.b', label=r"$\sigma$")
plt.legend(loc='upper left', fontsize=20)
plt.show()
```

## Model Training and Benchmark Evaluation

### Model 1: Iris-Setosa Classifier based on petal width

#### Feature Selection & Setosa Target Binarization

The dataset is filtered and restructured to evaluate a binary classification task:

* **Feature Vector ($X$):** Slicing index `[:, 3:]` isolates **Petal Width** as the continuous predictor variable.
* **Target Binarization ($y$):** The criterion `(iris.target == 0).astype(int)` assigns the positive class ($y=1$) exclusively to **Iris setosa**, mapping all other species to $0$. This creates the target array seen in the output.

```python
x = iris.data[:, 3:]                # petal width, shape (150, 1)
y = (iris.target == 0).astype(int)  # 1 for Iris setosa, 0 otherwise
y
```

#### Benchmark Model Initialization and Fitting

This cell instantiates and trains the baseline classification model using Scikit-Learn:

* **`LogisticRegression(solver='lbfgs', random_state=42)`**: Instantiates the model with the Limited-memory BFGS (`lbfgs`) optimization solver. `random_state=42` is kept for consistency across cells, although Scikit-Learn only uses it with the `sag`, `saga`, and `liblinear` solvers; `lbfgs` is deterministic.
* **`mylr.fit(x, y)`**: Trains the classifier on the isolated feature vector `x` and the binarized target labels `y`. This optimization computes the weight ($w$) and bias ($b$) parameters that minimize the regularized loss function.

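The link between the fitted parameters and the Sigmoid curve can be checked directly. The following is a minimal sketch, not part of the original notebook: it assumes the fitted weight and bias are read from Scikit-Learn's `coef_` and `intercept_` attributes, and the names `clf`, `sigmoid`, and `boundary` are introduced here only for illustration. It recomputes $\sigma(wx + b)$ by hand and locates the point where the probability equals 0.5, which is $x^* = -b/w$.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
x = iris.data[:, 3:]                    # petal width, shape (150, 1)
y = (iris.target == 0).astype(int)      # 1 for Iris setosa, 0 otherwise

# `clf` is an illustrative name; the notebook calls its model `mylr`
clf = LogisticRegression(solver='lbfgs').fit(x, y)
w = clf.coef_[0, 0]       # single weight, because there is one feature
b = clf.intercept_[0]     # bias term


def sigmoid(t):
    """Logistic function, as defined in Step 4."""
    return 1.0 / (1.0 + np.exp(-t))


# sigma(w*x + b) = 0.5 exactly when w*x + b = 0
boundary = -b / w

p_manual = sigmoid(w * x[:, 0] + b)        # probability computed by hand
p_sklearn = clf.predict_proba(x)[:, 1]     # probability reported by the model

print(f"w = {w:.3f}, b = {b:.3f}, boundary = {boundary:.3f} cm")
print("max abs difference:", np.max(np.abs(p_manual - p_sklearn)))
```

A negative `w` is expected here, since larger petal widths make *Iris setosa* less likely.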
```python
from sklearn.linear_model import LogisticRegression

mylr = LogisticRegression(solver='lbfgs', random_state=42)
mylr.fit(x, y)
```

```python
# Evaluate the fitted model on an evenly spaced range of petal widths
Xnew = np.linspace(-1, 3, 100).reshape(-1, 1)
yPred = mylr.predict_proba(Xnew)
#plt.plot(Xnew, yPred[:,0], label= 'No Iris')
plt.plot(Xnew, yPred[:, 1], label='Yes Iris')  # P(y=1), i.e. Iris setosa
plt.legend()
plt.plot(x, y, '*g')  # true labels
plt.show()
```

This plot shows the trained model's Sigmoid prediction curve over the dataset samples:

* **Sample Distribution (Green Stars):** Represents the real dataset. Small petal widths (0.1 to 0.6 cm) belong to *Iris setosa* ($y=1$), while larger widths (1.0 to 2.5 cm) belong to the other species ($y=0$).
* **Sigmoid Mapping (Blue Curve):** Displays an inverted logistic curve: as petal width increases, the probability of the flower being *Iris setosa* drops sharply from 1.0 to 0.0.
* **Decision Boundary Threshold:** The curve crosses the 0.5 probability threshold at approximately 0.75 cm. This crossing, which is also the curve's inflection point, defines the model's decision boundary between the two classes.

### Model 2: Iris-Setosa Classifier based on petal length

#### Feature Shift: Petal Length Isolation

The model configuration is updated to evaluate a different morphological predictor:

* **Feature Vector ($X$):** Slicing index `[:, 2:3]` isolates **Petal Length** as the independent variable.
* **Target Continuity ($y$):** The classification objective remains focused on **Iris setosa** (`iris.target == 0`) to compare the separation power of petal length against the previous petal width baseline.

```python
x = iris.data[:, 2:3]               # petal length
y = (iris.target == 0).astype(int)  # 1 for Iris setosa, 0 otherwise
```

```python
from sklearn.linear_model import LogisticRegression

mylr = LogisticRegression(solver='lbfgs', random_state=42)
mylr.fit(x, y)
```

```python
Xnew = np.linspace(0, 8, 100).reshape(-1, 1)
yPred = mylr.predict_proba(Xnew)
#plt.plot(Xnew, yPred[:,0], label= 'No Iris')
plt.plot(Xnew, yPred[:, 1], label='Yes Iris')  # P(y=1), i.e. Iris setosa
plt.legend()
plt.plot(x, y, '*g')  # true labels
plt.axis([1.5, 5, -0.5, 1.5])
plt.show()
```

This plot illustrates the behavior of the second univariate model, which uses **Petal Length**:

* **Sample Distribution:** Samples with short petal lengths (1.0 to 2.0 cm) cluster as *Iris setosa* ($y=1$), while samples with larger lengths ($>3.0$ cm) map to $y=0$.
* **Sigmoid Mapping:** The descending blue curve shows that as petal length increases, the probability of the sample being *Iris setosa* drops sharply from 1.0 to 0.0.
* **Decision Boundary:** The curve crosses the 0.5 probability threshold at approximately 2.5 cm, the inflection point that separates the target class from the rest of the dataset.

### Model 3: Iris-Setosa Classifier based on Sepal length

#### Feature Shift: Sepal Length Isolation

The model evaluates a third morphological predictor independently:

* **Feature Vector ($X$):** Slicing index `[:, 0:1]` isolates **Sepal Length** as the continuous independent variable.
* **Target Continuity ($y$):** The objective remains focused on **Iris setosa** ($y=1$) to compare the separation power of sepal dimensions against the previous petal metrics.

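The comparison of separation power across single features can also be made numerically. The sketch below is an illustration added here rather than part of the original notebook: it assumes training accuracy (`score`) is an acceptable rough summary, and the names `features`, `results`, and `clf` are hypothetical. It refits the same setosa-vs-rest model on each of the three features and reports the accuracy and the 0.5-probability boundary $-b/w$.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
y = (iris.target == 0).astype(int)   # 1 for Iris setosa, 0 otherwise

# Column indices follow the feature order listed in the dataset overview
features = {"sepal length": 0, "petal length": 2, "petal width": 3}

results = {}
for name, col in features.items():
    x = iris.data[:, col:col + 1]                       # keep 2-D shape (150, 1)
    clf = LogisticRegression(solver='lbfgs').fit(x, y)
    boundary = -clf.intercept_[0] / clf.coef_[0, 0]     # where P(setosa) = 0.5
    results[name] = (clf.score(x, y), boundary)         # accuracy on the training data

for name, (acc, boundary) in results.items():
    print(f"{name:>12}: training accuracy = {acc:.3f}, boundary = {boundary:.2f} cm")
```

Because accuracy is measured on the same samples used for fitting, it is an optimistic figure and is only meant for comparing the features against each other.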
```python
x = iris.data[:, 0:1]               # sepal length
y = (iris.target == 0).astype(int)  # 1 for Iris setosa, 0 otherwise

from sklearn.linear_model import LogisticRegression

mylr = LogisticRegression(solver='lbfgs', random_state=42)
mylr.fit(x, y)
```

```python
Xnew = np.linspace(0, 8, 100).reshape(-1, 1)
yPred = mylr.predict_proba(Xnew)
#plt.plot(Xnew, yPred[:,0], label= 'No Iris')
plt.plot(Xnew, yPred[:, 1], label='Yes Iris')  # P(y=1), i.e. Iris setosa
plt.legend()
plt.plot(x, y, '*g')  # true labels
plt.axis([3.5, 7, -0.1, 1.1])
plt.show()
```

This plot displays the behavior of the third univariate model, which uses **Sepal Length**:

* **Sample Distribution:** Samples representing *Iris setosa* ($y=1$) are concentrated at shorter lengths, but overlap with non-setosa samples ($y=0$) far more than with the previous petal features.
* **Sigmoid Mapping:** The descending curve shows the probability dropping as sepal length increases. Because of this overlap, the slope is less steep, giving a more gradual probabilistic transition.
* **Decision Boundary:** The point where the curve reaches $\sigma = 0.5$ sets the threshold. This boundary carries more classification uncertainty because sepal length differs less between these species.

### Model 4: Multiple features classifier

#### Multi-Class Spatial Mapping (Sepal Features)

This cell extends the initial exploratory plot by adding the ground-truth class labels to the 2D sepal feature space:

* **Feature Interaction:** Plots Sepal Length (`sl`) against Sepal Width (`sw`) to analyze their combined distribution.
* **Class Color-Coding:** Differentiates the three original species using distinct marker colors: green for *Setosa*, red for *Versicolor*, and blue for *Virginica*.
* **Visual Separability Analysis:** Shows the data structure at a glance: *Setosa* forms a clearly separated cluster, while *Versicolor* and *Virginica* overlap significantly, which motivates fitting a model instead of drawing a boundary by eye.

```python
import matplotlib.pyplot as plt

sl = iris.data[:, 0:1]  # sepal length
sw = iris.data[:, 1:2]  # sepal width
tg = iris.target
plt.plot(sl[tg == 0, 0], sw[tg == 0, 0], '.g', label='Set')
plt.plot(sl[tg == 1, 0], sw[tg == 1, 0], '.r', label='Ver')
plt.plot(sl[tg == 2, 0], sw[tg == 2, 0], '.b', label='Vir')
plt.legend()
plt.show()
```

#### Bivariate Model Training for Iris Virginica

This cell configures and trains a two-feature logistic regression model with explicitly chosen optimization parameters:

* **Bivariate Data Selection:**
    * **Features (`X`):** Slices index `[:, 0:2]` to combine **Sepal Length** and **Sepal Width** into a two-dimensional feature space.
    * **Target (`y`):** Shifts the positive class exclusively to **Iris virginica** (`iris.target == 2`).
* **Hyperparameter Settings (`mylrvir`):**
    * **`solver='newton-cg'`**: Uses the Newton-Conjugate Gradient method, a second-order optimizer.
    * **`C=100` & `tol=1e-5`**: `C` is the inverse of the regularization strength, so `C=100` applies weak regularization and allows a tighter fit to the data; the small tolerance tightens the convergence criterion.
* **`mylrvir.fit(X, y)`**: Trains the model to find the weight vector $w = [w_1, w_2]$ and bias ($b$), establishing the multi-variable benchmark line.

```python
X = iris.data[:, 0:2]
y = (iris.target == 2).astype(int)  # 1 for Iris virginica, 0 otherwise
mylrvir = LogisticRegression(
    random_state=22,
    tol=1e-5,
    C=100,
    max_iter=100,
    solver='newton-cg'
)
mylrvir.fit(X, y)
```

#### Coordinate Grid Generation & Probability Mapping

This block builds the evaluation grid used to map out the complete probability landscape:

* **`np.meshgrid`**: Generates a dense $100 \times 100$ coordinate grid across the sepal feature space (Length: 3 to 8 cm, Width: 0 to 6 cm).
* **`Xnew` (`np.c_`)**: Flattens the two grid matrices and pairs them column-wise into a matrix of 10,000 2D coordinates.
* **`predict_proba(Xnew)`**: Evaluates the trained model at every grid point, computing the continuous probabilities needed to render the decision contours and 3D surfaces.

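The flatten-and-pair step is easier to follow on a very small grid. The sketch below is illustrative only: it uses a 3 by 2 grid instead of the notebook's $100 \times 100$, and the name `grid_points` is introduced here. It shows the shapes that `np.meshgrid`, `ravel`, and `np.c_` produce, and how a flat column can be reshaped back to the grid layout for plotting.

```python
import numpy as np

# Tiny grid for illustration: 3 values on the first axis, 2 on the second
x0, x1 = np.meshgrid(np.linspace(3, 8, 3), np.linspace(0, 6, 2))
print(x0.shape)          # one row per x1 value, one column per x0 value

# Flatten both matrices and pair them column-wise: one (x0, x1) pair per row
grid_points = np.c_[x0.ravel(), x1.ravel()]
print(grid_points.shape)
print(grid_points)

# Any per-point result (e.g. a probability) can be reshaped back to the grid
restored = grid_points[:, 0].reshape(x0.shape)
print(np.array_equal(restored, x0))
```

The same reshape is what turns the flat `predict_proba` output back into a matrix that `plt.contour` and `plot_surface` accept.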
```python
# 100 x 100 grid over sepal length (3 to 8 cm) and sepal width (0 to 6 cm)
x0, x1 = np.meshgrid(
    np.linspace(3, 8, 100),
    np.linspace(0, 6, 100)
)
Xnew = np.c_[x0.ravel(), x1.ravel()]  # shape (10000, 2)
yPred = mylrvir.predict_proba(Xnew)
```

```python
plt.figure(figsize=(10, 4))
plt.plot(X[y == 0, 0], X[y == 0, 1], 'bs', label='No Virg')
plt.plot(X[y == 1, 0], X[y == 1, 1], 'g^', label='Virginica')
zz = yPred[:, 1].reshape(x0.shape)  # P(virginica) on the grid
contour = plt.contour(x0, x1, zz)
plt.clabel(contour, inline=1, fontsize=15)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.legend()
plt.show()
```

This plot visualizes the continuous probability space generated by the trained bivariate model:

* **Sample Distribution:** Blue squares represent non-virginica samples ($y=0$), and green triangles represent *Iris virginica* ($y=1$), plotted by Sepal Length and Sepal Width.
* **Probability Contours (`plt.contour`):** The labeled contour lines mark specific probability levels. They show how the model's predicted probability changes across the 2D space.
* **Decision Boundary:** The contour line labeled **0.5** is the decision boundary. Any sample on the side where the probability exceeds 0.5 is classified as *Iris virginica*, which captures the trade-off between both sepal measurements.

```python
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
surf = ax.plot_surface(x0, x1, zz, cmap='jet')
# Color and marker must be passed by keyword; the 4th positional argument is `zdir`
ax.scatter(X[:, 0], X[:, 1], y, c='r', marker='o')
plt.show()
```

This cell projects the bivariate logistic regression model into a 3D coordinate space to visualize the complete probability landscape:

* **Axis Dimensions:** The horizontal axes represent **Sepal Length** ($x_1$) and **Sepal Width** ($x_2$), while the vertical axis ($Z$) tracks the continuous model probability $\sigma(z) \in (0, 1)$.
* **Probability Surface (`plot_surface`):** The Sigmoid function is rendered as a 3D sheet using the `jet` colormap. It displays the non-linear S-curve transition stretched across the two-dimensional feature plane.
* **True Labels Spatial Scatter (`ax.scatter`):** The red markers plot the actual samples at their feature coordinates and true binary height ($z = 1$ for *Iris virginica*, $z = 0$ for others). This highlights how the optimized surface splits the space to fit the data points.

### Model 5: Multiple features and multiclass classifier

#### Multi-Feature & Multi-Class Model Training

This cell configures and trains the final baseline model to handle all three species simultaneously within a two-dimensional feature space:

* **Bivariate Inputs (`X`):** Slices index `[:, 0:2]` to use both **Sepal Length** and **Sepal Width** as the predictor variables.
* **Multi-Class Target (`y`):** Retains the original multi-class target labels (`0`, `1`, `2`) without binarization, expanding the task from a single boundary to a three-class decision space.
* **`LogisticRegression(C=100, solver='lbfgs')`:** Trains a multinomial classifier. The algorithm optimizes a distinct set of weights and biases for each target category, so the model can partition the 2D plane into three classification zones.

```python
X = iris.data[:, 0:2]
y = iris.target  # original labels: 0, 1, 2
lrmc = LogisticRegression(
    solver='lbfgs',
    C=100,
    random_state=22
)
lrmc.fit(X, y)
```

#### Multi-Class Grid Generation and Probability Evaluation

This cell sets up the coordinate grid used to evaluate the multi-class prediction behavior across the entire sepal feature space:

* **`np.meshgrid`**: Constructs a dense $100 \times 100$ coordinate grid covering the Sepal Length (3 to 8 cm) and Sepal Width (0 to 6 cm) ranges.
* **`Xnew` (`np.c_`)**: Flattens and pairs the grid elements into a matrix of 10,000 2D coordinates.
* **`lrmc.predict_proba(Xnew)`**: Computes a three-column probability matrix, one row per grid point. Each row holds the estimated probabilities for *Setosa*, *Versicolor*, and *Virginica*, from which the multi-class boundaries follow.

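The structure of the three-column probability matrix can be inspected on a handful of points. This is a minimal sketch rather than the notebook's own cell: the three sample measurements are hypothetical, `clf` is an illustrative name, and `max_iter=1000` is an assumption added to give `lbfgs` room to converge. It checks that each row sums to 1 and that the class with the largest probability matches what `predict` returns.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris.data[:, 0:2]    # sepal length, sepal width
y = iris.target          # labels 0, 1, 2

# max_iter=1000 is an assumption of this sketch, not a notebook setting
clf = LogisticRegression(solver='lbfgs', C=100, max_iter=1000).fit(X, y)

# Hypothetical (sepal length, sepal width) measurements in cm
sample = np.array([[5.0, 3.5], [6.0, 2.7], [7.2, 3.0]])

proba = clf.predict_proba(sample)          # one row per sample, one column per class
print(np.round(proba, 3))
print("row sums:", proba.sum(axis=1))

# The column with the highest probability gives the predicted label
by_argmax = clf.classes_[proba.argmax(axis=1)]
print(by_argmax, clf.predict(sample))
```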
```python
# 100 x 100 grid over sepal length (3 to 8 cm) and sepal width (0 to 6 cm)
x0, x1 = np.meshgrid(
    np.linspace(3, 8, 100),
    np.linspace(0, 6, 100)
)
Xnew = np.c_[x0.ravel(), x1.ravel()]  # shape (10000, 2)
yPred = lrmc.predict_proba(Xnew)      # shape (10000, 3)
```

```python
plt.figure(figsize=(10, 4))
plt.plot(X[y == 0, 0], X[y == 0, 1], '.b', label='Setosa')
plt.plot(X[y == 1, 0], X[y == 1, 1], '+g', label='Versi')
plt.plot(X[y == 2, 0], X[y == 2, 1], '*m', label='Virgi')
zz = yPred[:, 1].reshape(x0.shape)  # P(versicolor) on the grid
contour = plt.contour(x0, x1, zz)
plt.clabel(contour, inline=1, fontsize=15)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.legend()
plt.show()
```

This plot maps the continuous probability distribution of the middle class within the multi-class decision space:

* **Three-Class Distribution:** Displays all species simultaneously using distinct markers: blue dots for *Setosa* ($y=0$), green pluses for *Versicolor* ($y=1$), and magenta stars for *Virginica* ($y=2$).
* **Target Class Extraction (`yPred[:, 1]`):** Slicing the second column of the probability matrix isolates the probability of a sample being **Iris versicolor**.
* **Localized Probability Ridge:** Unlike the single boundary seen in binary classification, the multinomial model creates a bounded peak or "ridge" for the middle class. The highest contour line (**0.90**) encloses the core *Versicolor* cluster, and the probability drops off as the features move toward *Setosa* (left) or *Virginica* (right) territory.

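The ridge shape follows from how a multinomial model converts per-class linear scores into probabilities. Assuming the standard softmax formulation (the symbols $w_k$ and $b_k$ for the per-class weights and bias are notation introduced here), the probability of class $k$ at a point $x$ is:

$$P(y = k \mid x) = \frac{e^{w_k^\top x + b_k}}{\sum_{j=0}^{2} e^{w_j^\top x + b_j}}$$

Under this formulation, the *Versicolor* probability ($k = 1$) is high only where its score exceeds both the *Setosa* and the *Virginica* scores, so it falls off on both sides instead of rising monotonically like the binary Sigmoid.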
```python
yPred = lrmc.predict(Xnew)  # hard class labels (0, 1, 2) on the grid
plt.figure(figsize=(10, 6))
plt.plot(X[y == 0, 0], X[y == 0, 1], 'bs', label='Setosa')
plt.plot(X[y == 1, 0], X[y == 1, 1], 'g^', label='Versi')
plt.plot(X[y == 2, 0], X[y == 2, 1], '*m', label='Virgi')
zz = yPred.reshape(x0.shape)
contour = plt.contourf(x0, x1, zz, cmap='jet', alpha=0.3)
plt.clabel(contour, inline=1, fontsize=15)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.legend()
plt.show()
```

This plot visualizes the final classification boundaries by partitioning the entire 2D feature space into hard decision zones:

* **Hard Class Assignment (`lrmc.predict`):** Converts continuous probabilities into discrete class labels (`0`, `1`, or `2`) by taking the *argmax*, that is, selecting the class with the highest probability at each point.
* **Filled Decision Regions (`plt.contourf`):** Shades the coordinate plane into three distinct zones using the `jet` colormap:
    * **Blue Region:** Points classified as *Iris setosa*.
    * **Green Region:** Points classified as *Iris versicolor*.
    * **Red/Orange Region:** Points classified as *Iris virginica*.
* **Boundary Analysis:** The edges between the colored blocks are the decision boundaries. This layout shows how the linear multi-class model handles the overlap between the *Versicolor* and *Virginica* samples.

```python
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
surf = ax.plot_surface(x0, x1, zz, cmap='jet')
# Color and marker must be passed by keyword; the 4th positional argument is `zdir`
ax.scatter(X[:, 0], X[:, 1], y, c='r', marker='o')
plt.show()
```

This cell adds the actual dataset samples to the 3D hard decision space to visually evaluate the multi-class model's accuracy:

* **Discrete Vertical Alignment ($Z$):** Both the staircase surface and the scatter markers use the integer class labels ($0$ for *Setosa*, $1$ for *Versicolor*, and $2$ for *Virginica*) instead of continuous probabilities.
* **Stepped Surface (`plot_surface`):** Represents the decision regions computed by the model. Each level is the class assigned to that combination of sepal features.
* **True Labels Scatter (`ax.scatter`):** Plots the real flower samples at their actual feature coordinates and true species height. This allows immediate visual verification of performance: samples resting on the step that matches their height are correctly classified, while those on a different tier are the classification errors caused by spatial overlap.
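
The visual check above can be complemented with a count of correct and incorrect assignments. The following sketch is an addition for illustration, not part of the original notebook: it assumes a confusion matrix on the training samples is an acceptable summary, uses `sklearn.metrics.confusion_matrix`, and sets `max_iter=1000` as an assumption so that `lbfgs` has room to converge. The names `clf`, `y_hat`, and `cm` are hypothetical.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

iris = datasets.load_iris()
X = iris.data[:, 0:2]    # sepal length, sepal width
y = iris.target          # labels 0, 1, 2

# max_iter=1000 is an assumption of this sketch, not a notebook setting
clf = LogisticRegression(solver='lbfgs', C=100, max_iter=1000).fit(X, y)
y_hat = clf.predict(X)

# Rows are the true species, columns the predicted species
cm = confusion_matrix(y, y_hat)
accuracy = cm.trace() / cm.sum()   # correct predictions lie on the diagonal

print(cm)
print(f"training accuracy = {accuracy:.3f}")
```

Off-diagonal entries in the *Versicolor* and *Virginica* rows correspond to the samples that sit on the wrong tier in the 3D plot.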