rundev 14 hours ago
The claim of linear runtime is only true if K is independent of the dataset size, so it would have been nice to see an exploration of how different values of K impact results, i.e. does clustering get better with larger K, and if so, by how much? The values 50 and 100 seem arbitrary, and suspiciously close to sqrt(N) for the 9K dataset.
romanfll 13 hours ago | parent
Thanks for your comment. To clarify: K is a fixed hyperparameter in this implementation, strictly independent of N. Whether we process 9K points or 90K points, we keep K at ~100.

We found that increasing K yields diminishing returns very quickly. Since the landmarks are generated along a fixed synthetic topology, increasing K essentially just increases resolution along that specific curve. Once you have enough landmarks to define the curve's structure, adding more doesn't reveal new topology; it only adds cost to the distance matrix calculation.

Re: sqrt(N): that is purely a coincidence!
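To make the scaling concrete, here is a minimal sketch of the landmark pattern (random landmark selection and Euclidean distance are stand-ins for illustration; our actual landmarks come from the synthetic topology). The only term that grows with N is the N x K distance matrix, so with K held fixed the whole step is linear in N:

    import numpy as np

    # Toy data: N points in d dimensions; K landmarks, fixed independently of N.
    rng = np.random.default_rng(0)
    N, K, d = 9_000, 100, 3
    points = rng.normal(size=(N, d))
    # Stand-in landmark choice: a random subsample (not how we actually pick them).
    landmarks = points[rng.choice(N, size=K, replace=False)]

    # N x K distance matrix: O(N * K * d) work, linear in N once K is fixed.
    diffs = points[:, None, :] - landmarks[None, :, :]
    D = np.linalg.norm(diffs, axis=2)   # shape (N, K)
    labels = D.argmin(axis=1)           # nearest-landmark assignment

Doubling N doubles the rows of D and nothing else; increasing K only widens it, which is why we treat K as a constant rather than a function of N.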