Thanks for the feedback! Rainbow tables probably won't work here since the epoch (rounded to 5 minutes) salts everything, so precomputed tables would expire constantly. You would use use BSSIDs + SSID in real implementations. The MinHash part creates 128 hash values from the entire observed set, not a subset of 3, then LSH divides these into bands where similar sets collide probabilistically based on Jaccard similarity. I've tested with two phones where one saw 3-4 networks and the other saw ~10. They still found each other, and you can try it in the interactive demo yourself. You're right it's not entirely privacy preserving, really depends on your threat model as I discuss in the security section. Your idea about combining area hints with time-based salts is interesting though, feels like it could bridge geospatial indexing with this approach! The whole geospatial indexing thing was something I became aware of late in the project and didn't go into further. It's probably a much simpler approach if geolocation is important. In the current approach, geolocation doesn't matter at all, only 'the environment', whatever that may be.

▲

vessenes 9 hours ago | parent [-]

Thanks for the detailed response! I completely missed the band division / similarity plan - I’ll read more thoroughly next time.

I thought about geo largely because it radically changes the order of magnitude of work necessary; it lets you segment ‘possible’ subsets of APs down to sets of say 100, not millions, and changes the combinatorics. A side effect is knowing a rough spatial location.

Off the top of my head, I don’t think that epochs alone make a big difference. If I want to see if you’ve been somewhere, or tell you I’m somewhere, why not take the 3-4 networks you mentioned, and forward hash them for the next million epochs?

Or, more ambitiously, why not take 3-4 networks each from the geo indexed clusters available at https://wigle.net/ and do the forward and backward epochs, letting me track where you’ve been and pretend to be near you any time in the future?

Wigle reports 1.7bn networks; a rough look at a suburban street near me shows most places have 10 in a reasonable range boundary; so call it 200mm “locations” with 128 segmented hashes, 250 billion hashes per epoch — I think we’re in the “seconds per epoch” range for a reasonable compute heavy server to cover the entire space.

Upshot - I think the salting needs to be something local / not predictable or stored remotely.

Hopefully these comments hit you right - I like the idea a lot - and I don’t fully understand the system - but as I understand it, the system does not offer privacy — I could replay any phone’s hashes against a system that cost a few dollars to reconstruct your location and time, if my understanding is correct.

▲

waerhert 9 hours ago | parent [-]

I see, this sheds some new light on your initial concerns. I'm aware an attacker can keep pretending to be inside an environment once they've seen it. I wasn't accounting for a scenario where an attacker has a huge database for queries like coords -> list of wifi networks. I was under the assumption services like Wigle only provided the reverse lookup (wifi -> coords). Indeed an attacker could potentially reverse the LSH tags if it hashed the wifi environment within very small geofences. It's bit of a needle & haystack problem but not an impossible one with enough resources. I wouldn't say it's a perfect system and I don't mind it falling apart under scrutiny, I just found it an interesting idea so I really appreciate you thinking along here.

Edit: Maybe some preshared group hash (kinda beats the point), or combining multiple modalities (eg bluetooth, shared interests) or some kind of proof of work token could help mitigate some of these issues. I guess anything to reduce the time to attack helps in this case? Or anything that really pins down environment + time, like what smath described in his comment. In essence, the core idea of minhash + lsh works and it doesn't limit you to just wifi networks. The key is being able to grab a fingerprint that is unique enough and different enough each epoch. Wifi networks are just easy enough to grab vs something more low level like an APs beacon timing interval jitter or something.

	▲	kevin_nisbet 8 hours ago \| parent [-]
		> I see, this sheds some new light on your initial concerns. I'm aware an attacker can keep pretending to be inside an environment once they've seen it. I wasn't accounting for a scenario where an attacker has a huge database for queries like coords -> list of wifi networks I think this is the issue, is these datasets are out there and at least big tech companies have them since they're used to assist with GPS. I was about to post the same thing as above but saw vessenes beat me to it. Without thinking about it too hard, the two directions I see are either making observations of the environment in real-time that is only relevant at that time (IE sniffing actual wireless frames, even if they're encrypted and making observations on them, however, most devices won't let you go into promiscuous mode and do this) or encrypting the messages in flight so only participants can decrypt them (IE a model like the signal protocol with E2E message encryption). Anyways, this is a cool approach, but that risk occurred to me as well about the ability to just brute force the entire dataset to decode every location.