Data Activation Thoughts (galsapir.github.io)
9 points by galsapir 11 hours ago | 2 comments

i've been working with healthcare/biobank data and keep thinking about what "data moats" mean now that llms can ingest anything. some a16z piece from 2019 said moats were eroding — now the question seems to be whether you can actually make your data useful to these systems, not just have it. there's some recent work (tables2traces, ehr-r1) showing you can convert structured medical data into reasoning traces that improve llm performance, but the approaches are still rough and synthetic traces don't fully hold up to scrutiny (writing this to think through it, not because i have answers)
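A rough sketch of what "converting structured data into a reasoning trace" can look like at its simplest. This is not the tables2traces or ehr-r1 method, just a template-based illustration under assumptions: the `Lab` fields, the reference ranges, and the trace wording are all hypothetical, and the ranges are placeholders rather than clinical guidance.

```python
from dataclasses import dataclass


@dataclass
class Lab:
    """One structured measurement. Field names are hypothetical."""
    name: str
    value: float
    low: float   # illustrative reference range, not clinical guidance
    high: float


def flag(lab: Lab) -> str:
    """Classify a value relative to its reference range."""
    if lab.value < lab.low:
        return "below"
    if lab.value > lab.high:
        return "above"
    return "within"


def row_to_trace(record_id: str, labs: list[Lab], outcome: str) -> str:
    """Render one tabular record as a step-by-step text trace.

    Each measurement becomes one explicit step, followed by a summary
    of abnormal values and the outcome recorded in the source table.
    """
    steps = [
        f"Step {i}: {lab.name} = {lab.value} is {flag(lab)} "
        f"the reference range [{lab.low}, {lab.high}]."
        for i, lab in enumerate(labs, start=1)
    ]
    abnormal = [lab.name for lab in labs if flag(lab) != "within"]
    summary = (
        "Abnormal: " + ", ".join(abnormal) if abnormal else "No abnormal values."
    )
    return "\n".join(
        [f"Record {record_id}", *steps, summary, f"Recorded outcome: {outcome}"]
    )


if __name__ == "__main__":
    labs = [Lab("glucose", 130.0, 70.0, 100.0), Lab("sodium", 140.0, 135.0, 145.0)]
    print(row_to_trace("p1", labs, "follow-up"))
```

A template like this only restates what is already in the row, which is one reason synthetic traces can look plausible without reflecting how a clinician would actually reason.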

armcat 21 minutes ago

I've been working in the legaltech space and can definitely echo the sentiments here. There are some major legaltech/legal AI companies, but after speaking to dozens of law firms, none of them are finding these tools very valuable. Still, they have signed contracts with many seats, they are busy people, and tech is not their core business, so they are not about to switch tools or build things in-house (though a handful of them do). The problem is that, despite the massive amount of internal data, all the solutions fail on relevance and precision. When I sit down with actual legal associates, I can see how immensely complex these workflows are, and to fully utilize this data moat you need: (1) multi-step agentic retrieval, (2) a set of rules/heuristics to ground and steer everything per transaction/case "type", (3) adaptation/fine-tuning towards the "house language/style", (4) integration with many different data sources and tools; and you need to wrap all of this in real-world evals (where LLM-as-a-judge techniques often fail).
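To make points (2) and the eval wrapper concrete, here is a minimal sketch, assuming a toy keyword retriever: per-case-type rules steer a single retrieval step, and the result is scored against human-labeled relevant documents instead of an LLM judge. It is single-step, not agentic, and the `RULES` table, the `Doc` fields, and the scoring are all hypothetical stand-ins for a real system.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """A document in the firm's corpus. Field names are hypothetical."""
    doc_id: str
    case_type: str
    text: str


# Hypothetical per-case-type heuristics: extra terms that steer retrieval.
RULES = {
    "m&a": ["indemnification", "warranty"],
    "lease": ["rent", "termination"],
}


def retrieve(query: str, case_type: str, corpus: list[Doc], k: int = 3) -> list[str]:
    """Return up to k doc ids, restricted to the case type and ranked by
    how many query or rule terms appear in the document text."""
    terms = set(query.lower().split()) | set(RULES.get(case_type, []))
    scored = []
    for doc in corpus:
        if doc.case_type != case_type:
            continue  # rule: never mix documents across case types
        overlap = len(terms & set(doc.text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc.doc_id))
    # Highest overlap first; ties broken by doc id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def precision_at_k(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved docs that a human reviewer marked relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


if __name__ == "__main__":
    corpus = [
        Doc("a", "lease", "rent review clause and termination notice"),
        Doc("b", "lease", "parking allocation schedule"),
        Doc("c", "m&a", "warranty and indemnification cap"),
    ]
    hits = retrieve("termination notice period", "lease", corpus)
    print(hits, precision_at_k(hits, relevant={"a"}))
```

The point of scoring against associate-labeled relevance sets is that the metric stays tied to what the people doing the work consider a correct result.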

sgt101 an hour ago

Given some data set, how do you know whether to fine-tune/pretrain on it or to use it for RL / reasoning training?