Remix.run Logo
Show HN: NRC nuclear licensing RAG pipeline and regulatory embeddings dataset(huggingface.co)
2 points by davenporten 5 hours ago

I've been building an AI system to automate parts of the NRC Combined Operational License process: gap analysis against the Standard Review Plan, FSAR strength scoring, and RAI prediction using vector similarity to historical NRC requests. I intended this as a SaaS business, but was ultimately beat to the market.

What I think is the most interesting artifact is the dataset: 37,734 chunks of NRC regulatory documents (NUREG-0800, 10 CFR Parts 20/50/51/52/72/73/100, and Regulatory Guides) embedded with OpenAI text-embedding-3-small. It covers the full regulatory corpus an applicant would need for a COL submission. I'm not aware of anything like this being publicly available before.

The embeddings are ready to load directly into ChromaDB, Pinecone, or any other vector store. If you're doing nuclear AI, regulatory NLP, or just want a large real-world RAG dataset to experiment with, it should be useful.

Here's the full codebase if you're interested: https://github.com/Davenporten/nrc-licensing-rag