diff --git a/logistic-regressor-scratch/main.ipynb b/logistic-regressor-scratch/main.ipynb
index d3c2dab..f79626d 100644
--- a/logistic-regressor-scratch/main.ipynb
+++ b/logistic-regressor-scratch/main.ipynb
@@ -268,7 +268,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": null,
    "id": "710dded0",
    "metadata": {},
    "outputs": [],
@@ -277,7 +277,7 @@
     "y = df['target'].to_numpy().reshape(-1, 1)\n",
     "\n",
     "# Convert target to binary: 1 if setosa (class 0), 0 otherwise\n",
-    "y = (y==0).astype(float)\n"
+    "y = (y==0).astype(float)"
    ]
   },
   {
@@ -325,6 +325,18 @@
     "*(Note: `np.clip` is used to bound extreme values and prevent overflow errors during exponential calculation).*"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "c302e5fe",
+   "metadata": {},
+   "source": [
+    "### Sigmoid Function Base Behavior\n",
+    "\n",
+    "This plot isolates the **Sigmoid function's behavior** in isolation, independent of the model's training optimization. \n",
+    "\n",
+    "It serves exclusively as a mathematical baseline to demonstrate the activation range between $0$ and $1$ and the function's sensitivity to input feature values. This visualization does not represent the optimized decision boundary or the final classification results; rather, it highlights the function's intrinsic mapping capability before weight adjustments are applied."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 5,
@@ -399,6 +411,47 @@
     "Probabilities are clipped using a tiny epsilon ($\\epsilon$) to prevent mathematical undefined errors (like $\\log(0)$), which would break the algorithm."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "ec6509c3",
+   "metadata": {},
+   "source": [
+    "### Derivation of the Log Loss (Binary Cross-Entropy) Cost Function\n",
+    "\n",
+    "In binary classification models, the goal is to estimate the probability that an instance belongs to the positive class. To optimize the model, we need a cost function that heavily penalizes confident but incorrect predictions. This is achieved using the **Log Loss** (Binary Cross-Entropy), derived via Maximum Likelihood Estimation (MLE).\n",
+    "\n",
+    "Here is the step-by-step mathematical derivation:\n",
+    "\n",
+    "#### 1. Likelihood of a Single Instance (Bernoulli Distribution)\n",
+    "For a single training instance $(x^{(i)}, y^{(i)})$, the true label is binary: $y^{(i)} \\in \\{0, 1\\}$. If our model predicts the probability $\\hat{y}^{(i)}$, we can express the probability (Likelihood) of observing the true label using the Bernoulli distribution:\n",
+    "\n",
+    "$$P(y^{(i)}|x^{(i)};\\theta) = (\\hat{y}^{(i)})^{y^{(i)}}(1-\\hat{y}^{(i)})^{1-y^{(i)}}$$\n",
+    "\n",
+    "This compact expression works for both possible outcomes:\n",
+    "* If $y^{(i)} = 1$, the probability is $\\hat{y}^{(i)}$.\n",
+    "* If $y^{(i)} = 0$, the probability is $1 - \\hat{y}^{(i)}$.\n",
+    "\n",
+    "#### 2. Joint Likelihood of the Dataset\n",
+    "Assuming that all $m$ training instances are independent, the total likelihood of the model, $L(\\theta)$, is the product of the individual probabilities:\n",
+    "\n",
+    "$$L(\\theta) = \\prod_{i=1}^{m} P(y^{(i)}|x^{(i)};\\theta)$$\n",
+    "\n",
+    "During training, our objective is to find the parameters (weights) that **maximize** this likelihood.\n",
+    "\n",
+    "#### 3. Log-Likelihood\n",
+    "Multiplying thousands of probabilities (numbers between 0 and 1) leads to computational underflow. To fix this, we apply the natural logarithm. This transforms the product into a sum and brings the exponents down as multipliers:\n",
+    "\n",
+    "$$l(\\theta) = \\sum_{i=1}^{m} \\log P(y^{(i)}|x^{(i)};\\theta)$$\n",
+    "$$l(\\theta) = \\sum_{i=1}^{m} \\left[ y^{(i)}\\log(\\hat{y}^{(i)}) + (1-y^{(i)})\\log(1-\\hat{y}^{(i)}) \\right]$$\n",
+    "\n",
+    "#### 4. Negative Log-Likelihood (The Cost Function)\n",
+    "Optimization algorithms like Gradient Descent are designed to **minimize** a cost function rather than maximize it. To convert this into a minimization problem, we multiply the Log-Likelihood by $-1$. Finally, we divide by the total number of samples $m$ to get the average error. \n",
+    "\n",
+    "This gives us the final Log Loss equation:\n",
+    "\n",
+    "$$J(\\theta) = -\\frac{1}{m} \\sum_{i=1}^{m} \\left[ y^{(i)}\\log(\\hat{y}^{(i)}) + (1-y^{(i)})\\log(1-\\hat{y}^{(i)}) \\right]$$"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 8,
@@ -498,7 +551,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": null,
    "id": "f1156c02",
    "metadata": {},
    "outputs": [],
@@ -510,9 +563,7 @@
     "\n",
     "def predict(x, theta0, theta1, thresh=0.5):\n",
     "    model = (predictProba >= thresh).astype(int)\n",
-    "    # Returns 1 if probability >= threshold, else 0\n",
-    "\n",
-    "    "
+    "    # Returns 1 if probability >= threshold, else 0"
    ]
   },
   {