From 78ff5ef67e1f143f226e6e98076dc624eccb14ec Mon Sep 17 00:00:00 2001
From: Sofia Samaniego <samaniego.sofia@uabc.edu.mx>
Date: Thu, 2 Jul 2026 10:47:07 -0600
Subject: [PATCH] fix plot and add log loss derivation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
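Reviewer note: the derivation this patch adds to README.md can be checked numerically. The sketch below is not part of the README; `bernoulli_likelihood` and `log_loss_sketch` are hypothetical helper names, and the sample labels and probabilities are made up for illustration. It assumes only NumPy and compares the averaged negative log of the Bernoulli likelihood (steps 1 to 3) with the closed-form cost from step 4.

```python
import numpy as np

def bernoulli_likelihood(y, p):
    # Step 1: P(y | x; theta) = p^y * (1 - p)^(1 - y) for each instance
    return p**y * (1.0 - p)**(1.0 - y)

def log_loss_sketch(y, p, eps=1e-12):
    # Step 4: J(theta) = -(1/m) * sum[y*log(p) + (1 - y)*log(1 - p)]
    # Clipping keeps log() away from 0, as the README describes.
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Hypothetical labels and predicted probabilities (illustration only)
y = np.array([1.0, 0.0, 1.0, 0.0])
p = np.array([0.9, 0.2, 0.6, 0.4])

# Steps 2-3: the log of the product equals the sum of the logs, averaged over m
nll = -np.mean(np.log(bernoulli_likelihood(y, p)))
loss = log_loss_sketch(y, p)

print(np.isclose(nll, loss))
```

Both routes should agree up to floating-point error, and flipping the predictions (`1 - p`) should give a larger loss, since confident wrong predictions are penalized more heavily.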
 README.md                                     |  48 ++++++++++++++++--
 .../{README_13_0.png => README_14_0.png}      | Bin
 .../{README_23_0.png => README_25_0.png}      | Bin
 3 files changed, 43 insertions(+), 5 deletions(-)
 rename README_files/{README_13_0.png => README_14_0.png} (100%)
 rename README_files/{README_23_0.png => README_25_0.png} (100%)

diff --git a/README.md b/README.md
index fbbe782..3db007a 100644
--- a/README.md
+++ b/README.md
@@ -201,7 +201,6 @@ y = df['target'].to_numpy().reshape(-1, 1)
 
 # Convert target to binary: 1 if setosa (class 0), 0 otherwise
 y = (y==0).astype(float)
-
 ```
 
 
@@ -227,6 +226,12 @@ Formula:
 $$\sigma(z) = \frac{1}{1+e^{-z}}$$
 *(Note: `np.clip` is used to bound extreme values and prevent overflow errors during exponential calculation).*
 
+### Sigmoid Function Base Behavior
+
+This plot shows the **Sigmoid function's behavior** in isolation, independent of the model's training.
+
+It serves only as a mathematical baseline: it shows that the output always lies strictly between $0$ and $1$ and how sharply the function responds to its input $z$. The plot does not represent the trained decision boundary or the final classification results; it shows the function's intrinsic mapping before any weights are fitted.
+
 
 ```python
 def sigmoid(z):
@@ -245,7 +250,7 @@ plt.show()
 
 
     
-![png](README_files/README_13_0.png)
+![png](README_files/README_14_0.png)
     
 
 
@@ -269,6 +274,41 @@ def sigmoid(z):
 This function calculates the error between the model's predicted probabilities ($p$) and the true binary labels ($y$). 
 Probabilities are clipped using a tiny epsilon ($\epsilon$) to prevent mathematical undefined errors (like $\log(0)$), which would break the algorithm.
 
+### Derivation of the Log Loss (Binary Cross-Entropy) Cost Function
+
+In binary classification models, the goal is to estimate the probability that an instance belongs to the positive class. To optimize the model, we need a cost function that heavily penalizes confident but incorrect predictions. This is achieved using the **Log Loss** (Binary Cross-Entropy), derived via Maximum Likelihood Estimation (MLE).
+
+Here is the step-by-step mathematical derivation:
+
+#### 1. Likelihood of a Single Instance (Bernoulli Distribution)
+For a single training instance $(x^{(i)}, y^{(i)})$, the true label is binary: $y^{(i)} \in \{0, 1\}$. If our model predicts the probability $\hat{y}^{(i)}$ that the instance belongs to the positive class (written $p$ above), we can express the probability (likelihood) of observing the true label using the Bernoulli distribution:
+
+$$P(y^{(i)}|x^{(i)};\theta) = (\hat{y}^{(i)})^{y^{(i)}}(1-\hat{y}^{(i)})^{1-y^{(i)}}$$
+
+This compact expression works for both possible outcomes:
+* If $y^{(i)} = 1$, the probability is $\hat{y}^{(i)}$.
+* If $y^{(i)} = 0$, the probability is $1 - \hat{y}^{(i)}$.
+
+#### 2. Joint Likelihood of the Dataset
+Assuming that all $m$ training instances are independent, the total likelihood of the model, $L(\theta)$, is the product of the individual probabilities:
+
+$$L(\theta) = \prod_{i=1}^{m} P(y^{(i)}|x^{(i)};\theta)$$
+
+During training, our objective is to find the parameters (weights) that **maximize** this likelihood.
+
+#### 3. Log-Likelihood
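+Multiplying thousands of probabilities (numbers between 0 and 1) can lead to numerical underflow. To avoid this, we apply the natural logarithm, which turns the product into a sum and brings the exponents down as multipliers. Because the logarithm is strictly increasing, the parameters that maximize $l(\theta)$ are the same ones that maximize $L(\theta)$: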
+
+$$l(\theta) = \sum_{i=1}^{m} \log P(y^{(i)}|x^{(i)};\theta)$$
+$$l(\theta) = \sum_{i=1}^{m} \left[ y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)}) \right]$$
+
+#### 4. Negative Log-Likelihood (The Cost Function)
+Optimization algorithms like Gradient Descent are conventionally written to **minimize** a cost function rather than maximize it. To convert this into a minimization problem, we multiply the log-likelihood by $-1$. Finally, we divide by the number of samples $m$ to get the average loss per sample; this rescaling does not change which parameters are optimal.
+
+This gives us the final Log Loss equation:
+
+$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)}) \right]$$
+
 
 ```python
 def logLoss(y, p, eps=1e-12):
@@ -341,8 +381,6 @@ def predictProba(x, theta0, theta1):
 def predict(x, theta0, theta1, thresh=0.5):
     model = (predictProba >= thresh).astype(int)
     # Returns 1 if probability >= threshold, else 0
-
-    
 ```
 
 
@@ -357,7 +395,7 @@ plt.show()
 
 
     
-![png](README_files/README_23_0.png)
+![png](README_files/README_25_0.png)
     
 
 
diff --git a/README_files/README_13_0.png b/README_files/README_14_0.png
similarity index 100%
rename from README_files/README_13_0.png
rename to README_files/README_14_0.png
diff --git a/README_files/README_23_0.png b/README_files/README_25_0.png
similarity index 100%
rename from README_files/README_23_0.png
rename to README_files/README_25_0.png