@ -201,7 +201,6 @@ y = df['target'].to_numpy().reshape(-1, 1)
# Convert target to binary: 1 if setosa (class 0), 0 otherwise
y = (y==0).astype(float)
```
@ -227,6 +226,12 @@ Formula:
$$\sigma(z) = \frac{1}{1+e^{-z}}$$
*(Note: `np.clip` bounds extreme input values so that the exponential does not overflow.)*
### Sigmoid Function Base Behavior
This plot shows the **sigmoid function's behavior** in isolation, independent of the model's training.
It serves only as a mathematical baseline: it shows that the output always lies between $0$ and $1$ and how sharply it responds to changes in the input $z$. The plot does not represent the learned decision boundary or the final classification results; it shows the function's fixed mapping before any weights are fitted.
```python
def sigmoid(z):
@ -245,7 +250,7 @@ plt.show()


@ -269,6 +274,41 @@ def sigmoid(z):
This function calculates the error between the model's predicted probabilities ($p$) and the true binary labels ($y$).
Probabilities are clipped using a tiny epsilon ($\epsilon$) so that neither $\log(p)$ nor $\log(1 - p)$ is ever evaluated at $0$. Without clipping, $\log(0)$ would yield infinite or NaN costs and break training.
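The effect of the clipping can be checked in isolation. The following is a minimal sketch, not the function used in this document: the helper name `clip_probabilities` and the value `EPS = 1e-15` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical clipping constant; the epsilon used elsewhere may differ.
EPS = 1e-15

def clip_probabilities(p, eps=EPS):
    """Keep probabilities inside [eps, 1 - eps] so both log terms stay finite."""
    return np.clip(p, eps, 1.0 - eps)

p_raw = np.array([0.0, 0.3, 1.0])
p_safe = clip_probabilities(p_raw)

# Without clipping, np.log(0.0) evaluates to -inf (NumPy also warns about it).
with np.errstate(divide="ignore"):
    unclipped_logs = np.log(p_raw)

# After clipping, both terms that appear in the log loss are finite.
clipped_logs = np.log(p_safe)
clipped_logs_complement = np.log(1.0 - p_safe)
```

Only the extreme entries change; a probability such as `0.3` passes through untouched.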
### Derivation of the Log Loss (Binary Cross-Entropy) Cost Function
In binary classification models, the goal is to estimate the probability that an instance belongs to the positive class. To optimize the model, we need a cost function that heavily penalizes confident but incorrect predictions. This is achieved using the **Log Loss** (Binary Cross-Entropy), derived via Maximum Likelihood Estimation (MLE).
Here is the step-by-step mathematical derivation:
#### 1. Likelihood of a Single Instance (Bernoulli Distribution)
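For a single training instance $(x^{(i)}, y^{(i)})$, the true label is binary: $y^{(i)} \in \{0, 1\}$. If our model predicts the probability $\hat{y}^{(i)}$ that the instance belongs to the positive class, we can express the probability (likelihood) of observing the true label using the Bernoulli distribution:
$$P\left(y^{(i)} \mid x^{(i)}\right) = \left(\hat{y}^{(i)}\right)^{y^{(i)}} \left(1 - \hat{y}^{(i)}\right)^{1 - y^{(i)}}$$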
This compact expression works for both possible outcomes:
* If $y^{(i)} = 1$, the probability is $\hat{y}^{(i)}$.
* If $y^{(i)} = 0$, the probability is $1 - \hat{y}^{(i)}$.
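The two cases above can be verified numerically. This is a small sketch; the function name `bernoulli_likelihood` and the value `0.8` are hypothetical and chosen only for illustration.

```python
def bernoulli_likelihood(y_true, y_hat):
    """Compact Bernoulli form: y_hat**y * (1 - y_hat)**(1 - y) for y in {0, 1}."""
    return (y_hat ** y_true) * ((1.0 - y_hat) ** (1 - y_true))

y_hat = 0.8  # hypothetical predicted probability of the positive class

# With y = 1 the second factor has exponent 0, leaving y_hat.
likelihood_if_positive = bernoulli_likelihood(1, y_hat)

# With y = 0 the first factor has exponent 0, leaving 1 - y_hat.
likelihood_if_negative = bernoulli_likelihood(0, y_hat)
```

Whichever label is observed, the exponent on the other factor is zero, so that factor drops out.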
#### 2. Joint Likelihood of the Dataset
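Assuming that all $m$ training instances are independent, the total likelihood of the model, $L(\theta)$, is the product of the individual probabilities:
$$L(\theta) = \prod_{i=1}^{m} \left(\hat{y}^{(i)}\right)^{y^{(i)}} \left(1 - \hat{y}^{(i)}\right)^{1 - y^{(i)}}$$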
During training, our objective is to find the parameters (weights) that **maximize** this likelihood.
#### 3. Log-Likelihood
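Multiplying thousands of probabilities (numbers between 0 and 1) can cause floating-point underflow. To avoid this, we apply the natural logarithm. This transforms the product into a sum and brings the exponents down as multipliers:
$$\log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$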
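The underflow problem is easy to reproduce. The sketch below uses made-up data (5,000 instances, each with a likelihood of 0.5) purely to illustrate the point; it is not taken from the dataset used in this document.

```python
import numpy as np

# Hypothetical data: 5,000 instances, each assigned a likelihood of 0.5.
per_instance_likelihood = np.full(5000, 0.5)

# The direct product 0.5**5000 is far below the smallest positive float64,
# so it underflows to exactly 0.0.
joint_likelihood = np.prod(per_instance_likelihood)

# The sum of logs is an ordinary finite number: 5000 * log(0.5).
log_likelihood = np.sum(np.log(per_instance_likelihood))
```

Once the product has collapsed to zero it carries no information about the parameters, whereas the log-likelihood remains finite and comparable between models.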
#### 4. Negative Log-Likelihood (The Cost Function)
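Optimization algorithms like Gradient Descent are designed to **minimize** a cost function rather than maximize it. To convert this into a minimization problem, we multiply the log-likelihood by $-1$. Finally, we divide by the total number of samples $m$ to get the average error:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$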
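Putting the steps together, the average negative log-likelihood might be computed as in the sketch below. The function name `log_loss`, the default `eps=1e-15`, and the example arrays are assumptions for illustration and may differ from the implementation used elsewhere in this document.

```python
import numpy as np

def log_loss(y_true, y_hat, eps=1e-15):
    """Average negative log-likelihood (binary cross-entropy)."""
    # eps is an assumed clipping value that keeps both logs finite.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    per_instance = y_true * np.log(y_hat) + (1.0 - y_true) * np.log(1.0 - y_hat)
    # Negate and average over the m instances.
    return -np.mean(per_instance)

# Hypothetical labels and predictions.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
good_predictions = np.array([0.9, 0.1, 0.8, 0.2])  # confident and correct
bad_predictions = np.array([0.1, 0.9, 0.2, 0.8])   # confident and wrong

cost_good = log_loss(y_true, good_predictions)
cost_bad = log_loss(y_true, bad_predictions)
```

With these inputs `cost_bad` is more than ten times `cost_good`, which is the heavy penalty on confident but incorrect predictions that motivates the log loss.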