Remix clone Hacker News

new | show | ask | jobs Github

	▲	jdup7 6 hours ago
		Fair point in isolation that number doesn't look good. The important context is that this dataset was built to test phishing detection, not to measure false positive rates on normal traffic. It's sourced from our threat intelligence tooling so it's almost entirely malicious URLs by design. The 9 clean sites aren't a random sample of everyday browsing. They're sites that were submitted as suspicious and turned out to be legitimate so they're basically the hardest possible set of clean pages to correctly classify. This seems like a common critique and we definitely could have done a better job of explaining the methodology. Going forward we will include numbers from daily use to give a better picture of FP rate.