"# Convert target to binary: 1 if setosa (class 0), 0 otherwise\n",
"# Convert target to binary: 1 if setosa (class 0), 0 otherwise\n",
"y = (y==0).astype(float)\n"
"y = (y==0).astype(float)"
]
]
},
},
{
{
@@ -325,6 +325,18 @@
"*(Note: `np.clip` is used to bound extreme values and prevent overflow errors during exponential calculation).*"
"*(Note: `np.clip` is used to bound extreme values and prevent overflow errors during exponential calculation).*"
]
]
},
},
{
"cell_type": "markdown",
"id": "c302e5fe",
"metadata": {},
"source": [
"### Sigmoid Function Base Behavior\n",
"\n",
"This plot isolates the **Sigmoid function's behavior** in isolation, independent of the model's training optimization. \n",
"\n",
"It serves exclusively as a mathematical baseline to demonstrate the activation range between $0$ and $1$ and the function's sensitivity to input feature values. This visualization does not represent the optimized decision boundary or the final classification results; rather, it highlights the function's intrinsic mapping capability before weight adjustments are applied."
]
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 5,
"execution_count": 5,
@@ -399,6 +411,47 @@
"Probabilities are clipped using a tiny epsilon ($\\epsilon$) to prevent mathematical undefined errors (like $\\log(0)$), which would break the algorithm."
"Probabilities are clipped using a tiny epsilon ($\\epsilon$) to prevent mathematical undefined errors (like $\\log(0)$), which would break the algorithm."
]
]
},
},
{
"cell_type": "markdown",
"id": "ec6509c3",
"metadata": {},
"source": [
"### Derivation of the Log Loss (Binary Cross-Entropy) Cost Function\n",
"\n",
"In binary classification models, the goal is to estimate the probability that an instance belongs to the positive class. To optimize the model, we need a cost function that heavily penalizes confident but incorrect predictions. This is achieved using the **Log Loss** (Binary Cross-Entropy), derived via Maximum Likelihood Estimation (MLE).\n",
"\n",
"Here is the step-by-step mathematical derivation:\n",
"\n",
"#### 1. Likelihood of a Single Instance (Bernoulli Distribution)\n",
"For a single training instance $(x^{(i)}, y^{(i)})$, the true label is binary: $y^{(i)} \\in \\{0, 1\\}$. If our model predicts the probability $\\hat{y}^{(i)}$, we can express the probability (Likelihood) of observing the true label using the Bernoulli distribution:\n",
"This compact expression works for both possible outcomes:\n",
"* If $y^{(i)} = 1$, the probability is $\\hat{y}^{(i)}$.\n",
"* If $y^{(i)} = 0$, the probability is $1 - \\hat{y}^{(i)}$.\n",
"\n",
"#### 2. Joint Likelihood of the Dataset\n",
"Assuming that all $m$ training instances are independent, the total likelihood of the model, $L(\\theta)$, is the product of the individual probabilities:\n",
"During training, our objective is to find the parameters (weights) that **maximize** this likelihood.\n",
"\n",
"#### 3. Log-Likelihood\n",
"Multiplying thousands of probabilities (numbers between 0 and 1) leads to computational underflow. To fix this, we apply the natural logarithm. This transforms the product into a sum and brings the exponents down as multipliers:\n",
"#### 4. Negative Log-Likelihood (The Cost Function)\n",
"Optimization algorithms like Gradient Descent are designed to **minimize** a cost function rather than maximize it. To convert this into a minimization problem, we multiply the Log-Likelihood by $-1$. Finally, we divide by the total number of samples $m$ to get the average error. \n",