cl3misch 5 days ago

In the entropy implementation:

    return -np.sum(p * np.log(p, where=p > 0))
Using `where` in ufuncs like `np.log` leaves the output uninitialized (undefined) at the locations where the condition is not met. Summing over that array will almost certainly give incorrect results.

Better would be e.g.

    return -np.sum((p * np.log(p))[p > 0])
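
For reference, a minimal self-contained version along those lines (the function name `entropy` and the test values are just for this sketch) could look like:

    import numpy as np

    def entropy(p):
        # Drop zero-probability entries before the log, so log(0) never occurs
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    print(entropy([0.5, 0.5, 0.0]))  # ln(2) ≈ 0.693 nats

Alternatively, keeping `where` but supplying an initialized output, e.g. `np.log(p, where=p > 0, out=np.zeros_like(p))`, also avoids reading uninitialized memory.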
Also, the cross-entropy code doesn't match the equation. And, as explained in the comment below the post, Ax+b is not a linear operation but an affine one (because of the +b).

Overall it seems like an imprecise post to me. Not bad, but not rigorous enough to serve as a reference.

jpcompartir 5 days ago | parent [-]

I would echo that caution about using it as a reference; in another blog post the writer states:

"Backpropagation, often referred to as “backward propagation of errors,” is the cornerstone of training deep neural networks. It is a supervised learning algorithm that optimizes the weights and biases of a neural network to minimize the error between predicted and actual outputs.."

https://chizkidd.github.io/2025/05/30/backpropagation/

backpropagation is a supervised machine learning algorithm, pardon?

cl3misch 5 days ago | parent [-]

I actually see this a lot: confusing backpropagation with gradient descent (or any optimizer). Backprop is just a way to compute the gradient of the cost function with respect to the weights, not an algorithm to minimize the cost function with respect to the weights.

I guess giving the (mathematically) simple principle of computing a gradient with the chain rule the fancy name "backpropagation" comes from the early days of AI, when computers were much less powerful and this seemed less obvious?
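
To make the distinction concrete, here is a minimal sketch with a single weight and a squared-error cost (all names and numbers are made up for the example):

    # Model f(x) = w * x, cost C = (f(x) - y)^2
    x, y = 2.0, 3.0
    w = 0.5

    # Forward pass
    pred = w * x
    cost = (pred - y) ** 2

    # "Backprop": chain rule gives dC/dw = 2 * (pred - y) * x
    grad_w = 2.0 * (pred - y) * x

    # Separate step, the optimizer: one gradient-descent update
    lr = 0.1
    w = w - lr * grad_w

Backprop ends at `grad_w`; what is done with that gradient (SGD, Adam, ...) is the optimizer's job.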

imtringued 4 days ago | parent | next [-]

The German Wikipedia article makes the same mistake and it is quite infuriating.

cubefox 4 days ago | parent | prev [-]

What does this comment have to do with the previous comment, which talked about supervised learning?

imtringued 4 days ago | parent | next [-]

Reread the comment:

"Backprop is just a way to compute the gradients of the weights with respect to the cost function, not an algorithm to minimize the cost function wrt. the weights."

What does the word "supervised" mean? It means you define the cost function as the difference between the model's output and the labels in the training data.

That is, something like (f(x) - y)^2, the squared difference between the model's output for an input x from the training data and the corresponding label y.

A learning algorithm is an algorithm that produces a model given a cost function; in the case of supervised learning, the cost function is parameterized by the training data.
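
As a toy sketch of what "parameterized by the training data" means (the dataset and model here are invented):

    # Supervised cost for a toy model f(x) = w * x, fixed by the dataset (X, Y)
    X = [1.0, 2.0, 3.0]
    Y = [2.1, 3.9, 6.2]

    def cost(w):
        return sum((w * x - y) ** 2 for x, y in zip(X, Y)) / len(X)

The optimizer only ever sees `cost(w)`; the training data is baked into it.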

The most common way to learn a model is to use an optimization algorithm, and there are many optimization algorithms to choose from. One of the simplest for unconstrained nonlinear optimization is stochastic gradient descent.

It's popular because it is a first-order method. First-order methods only use the first partial derivatives, collected in the gradient, whose size equals the number of parameters. Second-order methods converge faster, but they need the Hessian, whose size scales with the square of the number of parameters being optimized.

How do you calculate the gradient? Either you calculate each partial derivative individually, or you use the chain rule and work backwards to calculate the complete gradient.
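
As a minimal sketch of the second option, here is the chain rule applied backwards through a tiny two-layer model (all names and values are invented for illustration):

    import numpy as np

    # Model: h = tanh(w1 * x), f = w2 * h, cost C = (f - y)^2
    x, y = 1.5, 0.2
    w1, w2 = 0.3, -0.7

    # Forward pass, keeping the intermediates
    z = w1 * x
    h = np.tanh(z)
    f = w2 * h
    C = (f - y) ** 2

    # Backward pass: chain rule from the cost back to each weight
    dC_df = 2.0 * (f - y)
    dC_dw2 = dC_df * h                   # dC/dw2
    dC_dh = dC_df * w2
    dC_dw1 = dC_dh * (1.0 - h ** 2) * x  # dC/dw1, using d(tanh z)/dz = 1 - tanh(z)^2

The backward sweep reuses the forward intermediates, which is what makes computing the complete gradient cheap compared to differentiating each parameter from scratch.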

I hope this makes it clear that your question is exactly backwards. The referenced blog post is about backpropagation and mentions supervised learning unnecessarily; you are the one now sticking with supervised learning, even though the comment you are responding to explained exactly why it is inappropriate to call backpropagation a supervised learning algorithm.

DoctorOetker 4 days ago | parent [-]

regarding "supervised", it is a bit of a small nuance.

Traditional "supervised" training, required the dataset to be annotated with labels (good/bad, such-and-such a bounding box in an image, ...) which cost a lot of human labor to produce.

When people speak of "unsupervised" training, I actually consider it a misnomer: the term grew historically and will not go away quickly, but a more apt name would have been "label-free" training.

For example, consider a corpus of human-written text (books, blogs, ...) without additional labels (verb annotations, subject annotations, ...).

Now consider someone proposing to use next-token prediction; clearly it doesn't require additional labeling. Is it supervised? Nobody calls it supervised under the current convention, but one may actually view next-token prediction on a bare text corpus as a trick to turn an unlabeled dataset into trillions of supervised prediction tasks. Given this N-gram of preceding tokens, what does the model predict as the next token? And what does the corpus actually say the next token is? Let's use that actual next token as if it were a "supervised" (labeled) exercise.
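
As a toy illustration of that framing (the corpus and context length here are made up):

    # Turn an unlabeled text corpus into "labeled" next-token examples
    tokens = "the cat sat on the mat".split()
    n = 3  # context length
    examples = [(tokens[i:i + n], tokens[i + n]) for i in range(len(tokens) - n)]
    # [(['the', 'cat', 'sat'], 'on'), (['cat', 'sat', 'on'], 'the'), (['sat', 'on', 'the'], 'mat')]

Each pair looks exactly like a supervised (input, label) example, even though no human ever labeled anything.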

cubefox 4 days ago | parent [-]

That's also why LeCun promoted the term "self-supervised" a while ago, with some success.

cl3misch 4 days ago | parent | prev [-]

The previous comment highlights an example where backprop is confused with "a supervised learning algorithm".

My comment was about "confusing backpropagation with gradient descent (or any optimizer)."

For me the connection is pretty clear: the core issue is confusing backprop with minimization. The fact that the cited article specifically mentions supervised learning doesn't take away from that.