Remix.run Logo
marcyb5st a day ago

Will try to respond in order:

1. It depends on how much embeddings we are talking about. Few millions, probably yes, 100s millions/Billions range? You likely need something custom.

2. Vectors are only one way to search for things. If your corpus contains stuff that don't carry semantic weight (think about part numbers) and you want to find the chunk that contains that information you'll likely need something that uses tf-idf.

3. Regarding chunk size, it really depends on your data and the queries your users will do. The denser the content the smaller the chunk size.

4. Preprocessing - again, it depends. If it's PDFs with just texts, try to remove footers / headers from the extracted text. Of it contains tables, look at something like table former to extract a good html representation. Clean up other artifacts from the text (like dashes for like breaking, square brackets with reference numbers for scientific papers, ... ).