| ▲ | romanfll 19 hours ago |
| Author here.
I built this because I needed to run dimensionality reduction entirely in the browser (client-side) for an interactive tool. The standard options (UMAP, t-SNE) were either too heavy for JS/WASM or required a GPU backend to run at acceptable speeds for interactive use. This approach ("Sine Landmark Reduction") uses linearised trilateration—similar to GPS positioning—against a synthetic "sine skeleton" of landmarks. The main trade-offs: It is O(N) and deterministic (solves Ax=b instead of iterative gradient descent). It forces the topology onto a loop structure, so it is less accurate than UMAP for complex manifolds (like Swiss Rolls), but it guarantees a clean layout for user interfaces. It can project ~9k points (50 dims) to 3D in about 2 seconds on a laptop CPU. Python implementation and math details are in the post. Happy to answer questions! |
|
| ▲ | lmeyerov 17 hours ago | parent | next [-] |
| Fwiw, we are heavy UMAP users (pygraphistry), and find UMAP CPU fine for interactive use at up to 30K rows and GPU at 100K rows, then generally switch to a trained mode when > 100K rows. Our use case is often highly visual - see correlations, and link together similar entities into explorable & interactive network diagrams. For headless, like in daily anomaly detection, we will do this to much larger scales. We see a lot of wide social, log, and cyber data where this works, anywhere from 5-200 dim. Our bio users are trickier, as we can have 1K+ dimensions pretty fast. We find success there too, and mostly get into preconditioning tricks for those. At the same time, I'm increasingly thinking of learning neural embeddings in general for these instead of traditional clustering algorithms. As scales go up, the performance argument here goes up too. |
| |
| ▲ | abhgh 16 hours ago | parent | next [-] | | I was not aware this existed and it looks cool! I am definitely going to take out some time to explore it further. I have a couple of questions for now:
(1) I am confused by your last sentence. It seems you're saying embeddings are a substitute for clustering. My understanding is that you usually apply a clustering algorithm over embeddings - good embeddings just ensure that the grouping produced by the clustering algo "makes sense". (2) Have you tried PaCMAP? I found it to produce high quality and quick results when I tried it. Haven't tried it in a while though - and I vaguely remember that it won't install properly on my machine (a Mac) the last time I had reached out for it. Their group has some new stuff coming out too (on the linked page). [1] https://github.com/YingfanWang/PaCMAP | | |
| ▲ | lmeyerov 15 hours ago | parent [-] | | We generally run UMAP on regular semi-structured data like database query results. We automatically feature encode that for dates, bools, low-cardinality vals, etc. If there is text, and the right libs available, we may also use text embeddings for those columns. (cucat is our GPU port of dirtycat/skrub, and pygraphistry's .featurize() wraps around that). My last sentence was on more valuable problems, we are finding it makes sense to go straight to GNNs, LLMs, etc and embed multidimensional data that way vs via UMAP dim reductions. We can still use UMAP as a generic hammer to control further dimensionality reductions, but the 'hard' part would be handled by the model. With neural graph layouts, we can potentially even skip the UMAP for that too. Re:pacmap, we have been eyeing several new tools here, but so far haven't felt the need internally to go from UMAP to them. We'd need to see significant improvements given the quality engineering in UMAP has set the bar high. In theory I can imagine some tools doing better in the future, but the creators have't done the engineering investment, so internally, we rather stay with UMAP. We make our API pluggable, so you can pass in results from other tools, and we haven't heard much from that path from others. | | |
| ▲ | abhgh 13 hours ago | parent | next [-] | | Thank you. Your comment about LLMs to semantically parse diverse data, as a first step, makes sense. In fact come to think of it, in the area of prompt optimization too - such as MIPROv2 [1] - the LLM is used to create initial prompt guesses based on its understanding of data. And I agree that UMAP still works well out of the box and has been pretty much like this since its introduction. [1] Section C.1 in the Appendix here https://arxiv.org/pdf/2406.11695 | |
| ▲ | nighthawk454 6 hours ago | parent | prev [-] | | I’m working on a new UMAP alternative - curious what kinds of improvements you’d be interested in? |
|
| |
| ▲ | romanfll 10 hours ago | parent | prev [-] | | The shift from Explicit Reduction to GNNs/Embeddings is where the high-end is going in my view… We hit this exact fork in the road with our forecasting/anomaly detection engine (DriftMind). We considered heavy embedding models but realised that for edge streams, we couldn't afford the inference cost or the latency of round-tripping to a GPU server.
It feels like the domain is splitting into 'Massive Server-Side Intelligence' (I am a big fan of Graphistry) and 'Hyper-Optimized Edge Intelligence' (where we are focused). |
|
|
| ▲ | threeducks 17 hours ago | parent | prev | next [-] |
| Without looking at the code, O(N * k) with N = 9000 points and k = 50 dimensions should take in the order of milliseconds, not seconds. Did you profile your code to see whether there is perhaps something that takes an unexpected amount of time? |
| |
| ▲ | romanfll 14 hours ago | parent | next [-] | | The '2 seconds' figure comes from the end-to-end time on a standard laptop. I quoted 2s to set realistic expectations for the user experience, not the CPU cycle count. You are right that the core linear algebra (Ax=b) is milliseconds. The bottleneck is the DOM/rendering overhead, but strictly speaking, the math itself is blazing fast. | | | |
| ▲ | jdhwosnhw 10 hours ago | parent | prev | next [-] | | Thats not how big-O notation works. You don’t know what proportionality constants are being hidden by the notation so you cant make any assertions about absolute runtimes | | |
| ▲ | threeducks 10 hours ago | parent [-] | | It is true that big-O notation does not necessarily tell you anything about the actual runtime, but if the hidden constant appears suspiciously large, one should double-check whether something else is going on. |
| |
| ▲ | donkeybeer 16 hours ago | parent | prev | next [-] | | If he wrote the for loop in python instead of numpy or C or whatever it could be a plausible runtime. | |
| ▲ | yorwba 16 hours ago | parent | prev [-] | | Each of the N data points is processed through several expensive linear algebra operations. O(N * k) just expresses that if you double N, the runtime also at most doubles. It doesn't mean it has to be fast in an absolute sense for any particular value of N and k. | | |
| ▲ | akoboldfrying 16 hours ago | parent [-] | | Didn't read TFA, but it's hard to think of a linear algebra operation that is both that slow and takes time independent of n and k. |
|
|
|
| ▲ | yxhuvud 14 hours ago | parent | prev | next [-] |
| FWIW, there are iterative SVD implementations out there that could potentially be useful as well in certain contexts when you get more data over time in a streamed manner. |
| |
| ▲ | leecarraher 11 hours ago | parent [-] | | are you referring to this paper https://arxiv.org/abs/1501.01711 ? i believe they won best paper at icml or other impact journal. the published paper and algorithm i recall being compact and succinct, something that took less than a day to implement. |
|
|
| ▲ | aoeusnth1 19 hours ago | parent | prev [-] |
| This is really cool! Are you considering publishing a paper on it? This seems conceptually similar to landmark MDS / Isomap, except using PCA on the landmark matrix instead of MDS. (https://cannoodt.dev/2019/11/lmds-landmark-multi-dimensional...) |
| |
| ▲ | romanfll 18 hours ago | parent [-] | | Thanks! You nailed the intuition! Yes, it shares DNA with Landmark MDS, but we needed something strictly deterministic for the UI.
Re: Publishing: We don't have a paper planned for this specific visualisation technique yet. I just wanted to open-source it because it solved a major bottleneck for our dashboard.
However, our main research focus at Thingbook is DriftMind (a cold start streaming forecaster and anomaly detector, preprint here: https://www.researchgate.net/publication/398142288_DriftMind...). That paper is currently under peer review! It shares the same 'efficiency-first' philosophy as this visualisation tool |
|