Suppose we have the cost function $J(w_1, w_2, \ldots, w_n, b)$, and we want to find its **minimum** point.

### Outline

1. Start with some arbitrary values for $w$ and $b$ (e.g. 0).
2. Keep changing $w$ and $b$ to reduce $J$.
3. Stop once we settle at or near a minimum.

### How it works

Here is a plot of the cost function $J$ with respect to the values $w$ and $b$. We want to find the *blue* parts, where the function is at a minimum.

![[Pasted image 20250318141032.png]]

1. Start at a random point on the plot.
2. Spin around and check which direction leads downhill.
3. Take a baby step in that direction.
4. Repeat steps 2 and 3 many times.

> [!info]
> If we take another starting point, we may end up at a different minimum. Each of these points is called a **local minimum**.
> ![[Pasted image 20250318141506.png]]

### Algorithm

In each step, the value of $w$ is re-assigned as:

$$
w = w - \alpha \frac{\partial}{\partial w} J(w, b)
$$

where:

- $\alpha$ is the *learning rate*. It controls how big each step is (it is always positive).
- $\frac{\partial}{\partial w} J(w, b)$ is the (partial) derivative of $J$ with respect to $w$. It tells us in which direction the function decreases.

> [!warning]
> The update rule above is written for a single parameter $w$, but a cost function can have multiple parameters like $w_1, w_2, \ldots, w_n, b$. In that case, all of them should be updated ***simultaneously***:
> ![[Pasted image 20250318143351.png]]

### Summing options

When calculating the cost (and its gradient) in each step, we sum the errors over data points. We can either sum over all points in the dataset (called "**batch** gradient descent") or use a single random sample per iteration (called "**stochastic** gradient descent"). There is also **mini-batch** gradient descent, which uses a random subset of the dataset. As we make the sample size smaller (as in stochastic), we get less computation per iteration (especially on larger datasets), but we add more noise to the gradient.

### Normalization

Sometimes when we have multiple features, some of them have a much larger effect on the cost function than others, so the cost is very sensitive to changes in those parameters. This makes gradient descent slow, since it hops back and forth before finding the minimum point.

![[Pasted image 20250318221449.png]]

The solution is to normalize the features so they all have a similar range. There are different methods of doing so:

1. Feature scaling
   ![[Pasted image 20250318221748.png]]
2. Mean normalization
   ![[Pasted image 20250318222006.png]]
3. Z-score normalization
   ![[Pasted image 20250318222238.png]]
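
As a concrete sketch of the third method, here is z-score normalization in NumPy. This is a minimal illustration (the function name `zscore_normalize` and the choice to return `mu` and `sigma` are my own); the returned mean and standard deviation should also be applied to any new input before making predictions.

```python
import numpy as np

def zscore_normalize(X):
    """Rescale each feature (column) of X to mean 0 and standard deviation 1,
    so that all features end up in a similar range."""
    mu = X.mean(axis=0)       # per-feature mean
    sigma = X.std(axis=0)     # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma
```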
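
To make the update rule from the Algorithm section concrete, here is a minimal sketch of batch gradient descent for linear regression, assuming the squared-error cost $J(w, b) = \frac{1}{2m}\sum_{i}\big(f(x^{(i)}) - y^{(i)}\big)^2$ (this choice of cost, and names like `gradient_descent`, `alpha`, and `num_iters`, are my assumptions, not fixed by the note above):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression with squared-error cost."""
    m, n = X.shape
    w = np.zeros(n)   # arbitrary starting values (here: 0)
    b = 0.0

    for _ in range(num_iters):
        error = X @ w + b - y          # predictions minus targets, shape (m,)
        dj_dw = X.T @ error / m        # partial derivatives w.r.t. each w_j
        dj_db = error.sum() / m        # partial derivative w.r.t. b
        # Simultaneous update: both gradients are computed from the old
        # parameter values before either parameter is changed.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

    return w, b
```

Normalizing `X` first (e.g. with `zscore_normalize` above) usually lets a larger learning rate converge without hopping around.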
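
The summing options only change which points feed each gradient step. A hypothetical helper for that choice (the name `sample_batch` and its signature are my own) might look like this; inside the loop above, its output would replace the full `X` and `y`:

```python
import numpy as np

def sample_batch(X, y, batch_size=None, rng=None):
    """Select the points used for one gradient step.

    batch_size=None      -> batch gradient descent (all points)
    batch_size=1         -> stochastic gradient descent (one random point)
    1 < batch_size < m   -> mini-batch gradient descent (random subset)
    """
    rng = np.random.default_rng() if rng is None else rng
    m = X.shape[0]
    if batch_size is None or batch_size >= m:
        return X, y
    idx = rng.choice(m, size=batch_size, replace=False)
    return X[idx], y[idx]
```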