| ▲ | Show HN: Pulpie – Pareto-Optimal Models for Cleaning the Web (usefeyn.com) | |
| 1 point by snyy 5 hours ago | 1 comment | ||
The idea for Pulpie came to us while building a deep research harness. All search APIs return noisy content containing ads, navigational elements, and sidebars. In one instance, an ad for "Gemini on Pixel" made its way into our search results. The ad copy was then passed into the LLM's context, where it was used in the final answer served to the user. Unclean data harms model intelligence: in pre-training, it pollutes the data the model learns from; at inference, it confuses the model. We built Pulpie to make clean data available cheaply. The blog post goes over our process. Happy to answer any questions. | ||
| ▲ | snyy 5 hours ago | parent [-] | |
[dead] | ||