It reminds me of a game we played with students of data classification algorithms like ID3: How many yes/no questions do we need to uniquely identify everyone in this room?

With like 12 students, that's 4 bits, and it often ends up with 2-3 questions. It starts off with the obvious ones - man/woman/diverse, but then a realization comes in: An answer usually contains more information than just that one bit. If you have long hair, you're most likely a woman and/or a metalhead for example. That part will get shaken out later on.
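For the arithmetic behind "12 students is 4 bits", here is a minimal sketch. It assumes every question splits the remaining group as evenly as possible; `questions_needed` is a name made up for this example.

```python
import math

def questions_needed(group_size: int) -> int:
    """Worst-case number of yes/no questions needed to single out one person,
    assuming each question halves the remaining group as evenly as possible."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # ceil(log2(n)) computed with integers to avoid floating-point edge cases.
    return (group_size - 1).bit_length()

for n in (12, 16, 30):
    print(f"{n} people: log2 = {math.log2(n):.2f}, questions = {questions_needed(n)}")
```

log2(12) is about 3.58, which rounds up to the 4 questions mentioned above.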

And those thoughts make these browser fingerprinting techniques all the more scary: They contain a lot of information, and that quickly cuts down the number of people you could be. Like, I'm a Linux Firefox user with a screen on the left. I wouldn't be surprised if that put me in a 5-6 digit bucket of people already.
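A rough sketch of how such a bucket estimate could be computed. Every number in it is invented for illustration (the attribute shares and the 5 billion population are not measurements), and it assumes the attributes are independent, which real fingerprint attributes are not.

```python
import math

# Hypothetical shares of the browsing population. Made-up numbers, illustration only.
ASSUMED_SHARES = {
    "os=Linux": 0.04,
    "browser=Firefox": 0.05,
    "second screen on the left": 0.10,
}

# Assumed round figure for the number of people online.
ASSUMED_POPULATION = 5_000_000_000

def surprisal_bits(share: float) -> float:
    """Bits of identifying information revealed by an attribute held by `share` of people."""
    return -math.log2(share)

def bucket_size(population: int, shares) -> float:
    """Expected number of people matching every attribute, assuming independence."""
    remaining = float(population)
    for share in shares:
        remaining *= share
    return remaining

total_bits = sum(surprisal_bits(s) for s in ASSUMED_SHARES.values())
print(f"bits revealed: {total_bits:.1f}")
print(f"estimated bucket: {bucket_size(ASSUMED_POPULATION, ASSUMED_SHARES.values()):,.0f} people")
```

If the attributes are positively correlated (Linux users skew toward Firefox, say), the real bucket is larger than this product suggests.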

georgefrowny 3 hours ago | parent | next [-]

> An answer usually contains more information than just that one bit.

That means there is less information in the question "do they have long hair?", not more. Asking "long hair?" and then "woman?" is probably, in most groups, roughly the same as just the first or second question alone. So the second question added much less than one bit of information because the answer is probably "yes". "Long hair" and then "metalhead" is the same, except that the answer to the second question is probably "no".
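A small sketch of that effect, using a hypothetical group of 16 with made-up headcounts. It computes how much "woman?" adds once "long hair?" has already been answered, via the chain rule H(woman | hair) = H(hair, woman) - H(hair).

```python
import math
from collections import Counter

def entropy_bits(counts) -> float:
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Hypothetical group of 16: (long_hair, woman) -> headcount. Invented numbers.
GROUP = Counter({
    (True, True): 7,
    (True, False): 1,
    (False, True): 1,
    (False, False): 7,
})

def marginal(group: Counter, index: int) -> Counter:
    """Collapse the joint counts down to one of the two attributes."""
    out = Counter()
    for key, count in group.items():
        out[key[index]] += count
    return out

h_hair = entropy_bits(marginal(GROUP, 0).values())   # first question alone
h_both = entropy_bits(GROUP.values())                # both answers together
h_woman_given_hair = h_both - h_hair                 # what the second question adds

print(f"'long hair?' alone: {h_hair:.2f} bits")
print(f"'woman?' after 'long hair?': {h_woman_given_hair:.2f} bits")
```

With these invented counts the first question is worth a full bit and the second adds only about half a bit, because its answer is mostly predictable from the first.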

Yes/no questions on average contain the most information each when they partition the remaining possibilities 50:50. Then each answer gives you exactly one more bit. The closer you get to either a 100:0 or 0:100 yes:no split, the smaller the fraction of a bit you encode in the answer.
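The curve being described is the binary entropy function. A minimal sketch, where `binary_entropy` is just a name picked for the example:

```python
import math

def binary_entropy(p_yes: float) -> float:
    """Expected bits gained from a yes/no question answered 'yes' with probability p_yes."""
    if p_yes <= 0.0 or p_yes >= 1.0:
        return 0.0  # the answer is certain, so it tells you nothing
    p_no = 1.0 - p_yes
    return -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))

for p in (0.5, 0.25, 0.1, 0.01, 0.0):
    print(f"yes:no = {p:.2f}:{1 - p:.2f} -> {binary_entropy(p):.3f} bits")
```

The value peaks at exactly 1 bit for a 50:50 split and falls to 0 at either extreme.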

"Metalhead?" usually gives you lots of bits of information (probably 4 in an "average" group of 16 containing at least one metalhead) if the answer is "yes", but on average that's outweighed by the very high chance that the answer will be "no". If there are no metalheads or only metalheads, it gives you zero information.

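To put numbers on the "Metalhead?" case, here is a sketch that separates what each individual answer tells you from what the question is worth on average. It assumes a group of 16 with exactly one metalhead; `answer_bits` is a hypothetical helper.

```python
import math

def answer_bits(group_size: int, yes_count: int):
    """Return (bits if the answer is yes, bits if no, expected bits) for one question.

    An answer that can never occur is reported as None.
    """
    no_count = group_size - yes_count
    bits_yes = math.log2(group_size / yes_count) if yes_count else None
    bits_no = math.log2(group_size / no_count) if no_count else None
    expected = 0.0
    for count, bits in ((yes_count, bits_yes), (no_count, bits_no)):
        if count:
            expected += count / group_size * bits
    return bits_yes, bits_no, expected

bits_yes, bits_no, expected = answer_bits(16, 1)
print(f"yes: {bits_yes:.2f} bits, no: {bits_no:.2f} bits, on average: {expected:.2f} bits")
```

A "yes" narrows 16 people down to 1, which is 4 bits, but it only comes up one time in 16, so the average stays well below one bit.
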
	tetha 2 hours ago | parent [-]
		Ah, I flipped it in my head. That happens after 10 years. In this case, it was often an interesting exercise in bias as well. "Woman?" would usually single out 1-2 people out of the 15, so it was a terrible question. It was CompSci after all. "Long hair?", lumping women and metalheads into one group, would often split it half and half. That was much better, and then spurred creative thoughts like travel distance, or bus stations.

mathgradthrow 3 hours ago | parent | prev | next [-]

>An answer usually contains more information than just that one bit.

Isn't the point to ask yes or no questions?

zie 3 hours ago | parent | next [-]

Yes, but you can make assumptions based on what you know about humans generally. Take their example of asking whether you have long hair: if you answer yes, you are probably female.

You can think of all sorts of questions and answers like this, and when you combine them with the assumptions and answers from previous questions you can make even more assumptions. They won't always be correct, but you don't have to be "perfect", depending on your use-case. For example, for advertising purposes, assumptions (even if incorrect) can still go a long way.

Target got sooo good at identifying pregnant women[0] before the women knew they were pregnant that it creeped women out, and they had to pull back what they did with that information. This was like a decade or more ago. It's only gotten more accurate since then.

0: one example from 2012: https://techland.time.com/2012/02/17/how-target-knew-a-high-...

armchairhacker 2 hours ago | parent | next [-]

https://medium.com/@colin.fraser/target-didnt-figure-out-a-t...

https://www.predictiveanalyticsworld.com/machinelearningtime...

codedokode 3 hours ago | parent | prev [-]

> Target got sooo good at identifying pregnant women

That's why I pay with cash and do not have a loyalty card (other customers often offer theirs at the cash register anyway). And of course I don't even go to Target.

	georgefrowny 2 hours ago | parent [-]
		I don't know if Target specifically uses all of these, but I would bet they have data based on at least some of: facial/gait/demographic recognition, wi-fi/Bluetooth beaconing, vehicle registrations, time and location tracking, statistical analysis of your purchases, and clustering of people you have made purchases next to (e.g. you bought something at the same time and till as your mother more than once). I'm sure they have other methods too. They can also combine datasets from brokers that do have a face:name link (say you used a card at another store that captured it and sold the data) and resolve you within their own data that way.

emil-lp 3 hours ago | parent | prev | next [-]

It's still a yes/no question, it's just that the question is "do you have long hair".

The goal of these decision trees is to use as few questions as possible, each dividing the group into two balanced halves (and so on recursively).

Imagine a binary tree with a question in each internal node and a person in each leaf. You want to minimize the height of the tree.
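A toy sketch of such a tree, built greedily by always asking the question with the most balanced split. The four people and their attributes are invented, and this is a simplified balanced-split heuristic rather than a full ID3 implementation.

```python
from typing import Dict, List, Union

People = Dict[str, Dict[str, bool]]
Tree = Union[List[str], dict]  # a leaf is a list of names, an inner node is a dict

# Hypothetical mini-class with made-up attributes.
PEOPLE: People = {
    "Ana":  {"long_hair": True,  "glasses": True,  "takes_bus": False},
    "Ben":  {"long_hair": True,  "glasses": False, "takes_bus": True},
    "Cleo": {"long_hair": False, "glasses": True,  "takes_bus": True},
    "Dev":  {"long_hair": False, "glasses": False, "takes_bus": False},
}

def best_question(people: People, names: List[str], questions: List[str]) -> str:
    """Pick the question whose yes/no split of `names` is closest to 50:50."""
    def imbalance(question: str) -> int:
        yes = sum(people[name][question] for name in names)
        return abs(2 * yes - len(names))
    return min(questions, key=imbalance)

def build_tree(people: People, names: List[str], questions: List[str]) -> Tree:
    if len(names) <= 1 or not questions:
        return names  # leaf: one person, or nothing left to ask
    question = best_question(people, names, questions)
    yes = [n for n in names if people[n][question]]
    no = [n for n in names if not people[n][question]]
    if not yes or not no:
        return names  # even the best question cannot separate these people
    rest = [q for q in questions if q != question]
    return {
        "question": question,
        "yes": build_tree(people, yes, rest),
        "no": build_tree(people, no, rest),
    }

def height(tree: Tree) -> int:
    """Number of questions on the longest path from the root to a leaf."""
    if isinstance(tree, list):
        return 0
    return 1 + max(height(tree["yes"]), height(tree["no"]))

tree = build_tree(PEOPLE, list(PEOPLE), ["long_hair", "glasses", "takes_bus"])
print(tree)
print("height:", height(tree))
```

With four people and perfectly balanced questions, the tree needs two levels, matching log2(4) = 2.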

tetha 3 hours ago | parent | prev [-]

Yes, but multiple yes or no questions in combination can easily yield more information than they should in a real dataset. That's the real educational point.

gweinberg 2 hours ago | parent [-]

You seem to be confused about the difference between "less" and "more". In general a yes-no question gives less than 1 bit of information if yes and no are not equally likely. There is no way it can be expected to give more.

	AnthonyMouse 5 minutes ago | parent [-]
		> There is no way it can be expected to give more.
		It is indeed not possible for it to give more, because it only has a single bit answer, which by the pigeonhole principle can't give you more than one bit. The best yes/no questions are the ones which are independent of each other and bisect the group evenly. "Are you female" is typically good because it will be approximately half the population. Then you want independent questions that bisect the population again, like "does your first name have more than the median number of letters" which should be mostly independent of the first question. Another good one is conditional questions like "are you taller than the median for your sex" since a pure height question wouldn't be independent of sex but that one is. Whereas bad questions would be ones with highly disproportionate responses, like "do you have pink hair with black and green highlights" which might be true for someone somewhere but is going to have >99% of people answering no, or "were you born on the planet Mercury" which will be 100% no and provide zero bits of information.

throw8484949 2 hours ago | parent | prev [-]

[flagged]

	542458 2 hours ago | parent | next [-]
		I think a plain reading of the post you’re replying to would be “obvious as a way of segmenting people”.
	Vinnl 2 hours ago | parent | prev [-]
		It's obvious in the sense that most people will start out with that as their first question.