fareesh 5 days ago
Short answer: in RAG systems the documents are chunked into some predefined size (you pick a size based on your use case), and each chunk is converted into a vector embedding (e.g. via the OpenAI embeddings API) and stored in a vector database like Chroma, Pinecone, or pgvector in Postgres. Your query is then converted into an embedding too, and the top N chunks are returned via similarity search (cosine, dot product, or some other metric) - this has advantages over BM25, which is purely lexical. Then you can do some post-processing, or just hand all the chunks over as context, saying "here are some documents, use them to answer this question" + your query, to the LLM.
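Rough sketch of that pipeline in code, just to make the steps concrete. The embed() function here is a hypothetical placeholder standing in for a real embedding API (e.g. OpenAI's), and the chunking and similarity logic is deliberately naive:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding API here.
    # This just returns a deterministic random vector so the sketch runs.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)

def chunk(document: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real systems often split on sentences or
    # paragraphs and keep some overlap between chunks.
    return [document[i:i + size] for i in range(0, len(document), size)]

def top_n_chunks(query: str, chunks: list[str], n: int = 3) -> list[str]:
    # Cosine similarity between the query embedding and each chunk embedding.
    q = embed(query)
    scores = []
    for c in chunks:
        v = embed(c)
        scores.append(float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))))
    ranked = sorted(zip(scores, chunks), key=lambda s: s[0], reverse=True)
    return [c for _, c in ranked[:n]]

def build_prompt(query: str, retrieved: list[str]) -> str:
    # The retrieved chunks are just pasted into the prompt as context.
    context = "\n\n".join(retrieved)
    return (
        "Here are some documents, use them to answer the question.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
```

In a real setup the similarity search happens inside the vector database rather than in a Python loop, but the shape of the pipeline is the same: chunk, embed, retrieve top N, paste into the prompt.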
potato-peeler 5 days ago
> Then you can do some post-processing, or just hand all the chunks over as context, saying "here are some documents, use them to answer this question" + your query, to the LLM.

This part is what I want to understand. How does the LLM "frame" an answer?