mathgradthrow 3 hours ago

>An answer usually contains more information than just that one bit.

Isn't the point to ask yes or no questions?

zie 3 hours ago | parent | next [-]

Yes, but you can make assumptions based on what you know about humans generally. Take their example of asking whether you have long hair: if you answer yes, you are probably female.

You can think of all sorts of questions and answers like this, and when you combine each new answer with the answers and assumptions you already have, you can make even more assumptions. They won't always be correct, but depending on your use case you don't have to be "perfect". For advertising purposes, for example, assumptions (even incorrect ones) can still go a long way.

Target got sooo good at identifying pregnant women[0], before the women themselves knew they were pregnant, that it creeped women out and the company had to pull back on what it did with that information. That was a decade or more ago. It's only gotten more accurate since then.

0: one example from 2012: https://techland.time.com/2012/02/17/how-target-knew-a-high-...

armchairhacker 2 hours ago | parent | next [-]

https://medium.com/@colin.fraser/target-didnt-figure-out-a-t...

https://www.predictiveanalyticsworld.com/machinelearningtime...

codedokode 3 hours ago | parent | prev [-]

> Target got sooo good at identifying pregnant women

That's why I pay with cash and do not have a loyalty card (other customers often offer theirs at the cash register anyway). And of course I don't even go to Target.

georgefrowny 2 hours ago | parent [-]

I don't know if Target specifically uses all of these, but I would bet they have data based on at least some of: facial/gait/demographic recognition, Wi-Fi/Bluetooth beaconing, vehicle registrations, time and location tracking, statistical analysis of your purchases, and clustering of people you have made purchases next to (e.g. you bought something at the same time and till as your mother more than once). I'm sure they have other methods too. They can also combine datasets from brokers that do have a face:name link (say you used a card at another store that captured it and sold the data) and resolve you within their own data that way.

emil-lp 3 hours ago | parent | prev | next [-]

It's still a yes/no question, it's just that the question is "do you have long hair".

The goal of these decision trees is to use as few questions as possible, each one dividing the remaining group into two balanced halves (and so on recursively).

Imagine a binary tree with a question in each internal node and a person in each leaf. You want to minimize the height of the tree.
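To put a number on "minimize the height": with perfectly balanced questions, a tree of height h has at most 2^h leaves, so a group of N people needs at least ceil(log2(N)) questions. A minimal Python sketch of that bound follows; the function name `min_questions` and the population figure are my own illustration, not anything from the article.

```python
def min_questions(population: int) -> int:
    """Smallest height of a binary tree with at least `population` leaves.

    Equals ceil(log2(population)), computed with integers to avoid
    floating-point rounding near powers of two.
    """
    if population < 1:
        raise ValueError("population must be at least 1")
    return (population - 1).bit_length()


print(min_questions(8))              # 3
print(min_questions(9))              # 4
# Assumed round figure for the world population, for illustration only.
print(min_questions(8_000_000_000))  # 33
```

This is only the best case. Real questions rarely split the group exactly in half, so an actual tree will usually be taller.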

tetha 3 hours ago | parent | prev [-]

Yes, but in a real dataset, multiple yes-or-no questions in combination can easily yield more information than you would expect from each one alone. That's the real educational point.

gweinberg 2 hours ago | parent [-]

You seem to be confused about the difference between "less" and "more". In general a yes-no question gives less than 1 bit of information if yes and no are not equally likely. There is no way it can be expected to give more.
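A small Python sketch of that claim, using the standard Shannon definitions. The function names and the 1% probability are my own illustration. It also shows the distinction between the expected information of a question and the surprisal of one particular answer, which can exceed 1 bit when that answer is rare.

```python
import math


def binary_entropy(p_yes: float) -> float:
    """Expected information, in bits, of a yes/no answer with P(yes) = p_yes."""
    if not 0.0 <= p_yes <= 1.0:
        raise ValueError("p_yes must be between 0 and 1")
    if p_yes in (0.0, 1.0):
        return 0.0  # the answer is certain, so it tells you nothing
    p_no = 1.0 - p_yes
    return -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))


def surprisal(p: float) -> float:
    """Information, in bits, of observing one specific answer of probability p."""
    return -math.log2(p)


print(binary_entropy(0.5))   # 1.0, the maximum for a yes/no question
print(binary_entropy(0.01))  # roughly 0.08 bits on average
print(surprisal(0.01))       # roughly 6.64 bits for the rare "yes"
```

So a lopsided question is worth well under 1 bit on average, even though the occasional rare answer narrows things down a lot.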

AnthonyMouse 5 minutes ago | parent [-]

> There is no way it can be expected to give more.

It is indeed not possible for it to give more, because it only has a single bit answer, which by the pigeonhole principle can't give you more than one bit.

The best yes/no questions are the ones which are independent of each other and bisect the group evenly. "Are you female" is typically good because the answer will be yes for approximately half the population. Then you want independent questions that bisect the population again, like "does your first name have more than the median number of letters", which should be mostly independent of the first question. Conditional questions are another good kind, like "are you taller than the median for your sex": a pure height question wouldn't be independent of sex, but that one is.
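Here is a rough Python sketch of why independence matters. Both pairs of questions below split the population 50/50 individually, but the joint distributions are made-up numbers for illustration, not real data. Two independent questions give 2 bits together; two correlated ones give noticeably less.

```python
import math
from itertools import product


def entropy(probabilities) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical joint distributions over (answer to Q1, answer to Q2).
# Independent: every combination of answers is equally likely.
independent = {pair: 0.25 for pair in product((True, False), repeat=2)}

# Correlated: each question is still 50/50 on its own, but the two
# answers agree 90% of the time (think "long hair" vs. "female").
correlated = {
    (True, True): 0.45,
    (True, False): 0.05,
    (False, True): 0.05,
    (False, False): 0.45,
}

bits_independent = entropy(independent.values())
bits_correlated = entropy(correlated.values())

print(bits_independent)  # 2.0
print(bits_correlated)   # about 1.47, the second question mostly repeats the first
```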

Bad questions, by contrast, are ones with highly disproportionate responses, like "do you have pink hair with black and green highlights", which might be true for someone somewhere but will have >99% of people answering no, or "were you born on the planet Mercury", which will be 100% no and provide zero bits of information.