| ▲ | Show HN: Data Engineering Book – An open source, community-driven guide (github.com) | |||||||||||||
| 137 points by xx123122 10 hours ago | 12 comments | ||||||||||||||
Hi HN! I'm currently a Master's student at USTC (University of Science and Technology of China). I've been diving deep into Data Engineering, especially in the context of Large Language Models (LLMs).

The Problem: I found that learning resources for modern data engineering are often fragmented, scattered across hundreds of Medium articles and disjointed tutorials. It's hard to piece everything together into a coherent system.

The Solution: I decided to open-source my learning notes and build them into a structured book. My goal is to help developers fast-track their learning.

Key Features:
- LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.
- Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").
- Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.

This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!

Check it out:
Online: https://datascale-ai.github.io/data_engineering_book/
GitHub: https://github.com/datascale-ai/data_engineering_book | ||||||||||||||
| ▲ | hliyan 2 hours ago | parent | next [-] | |||||||||||||
I'm not sure whether this is an artefact of translation, but things like this don't inspire confidence: > The "Modern Data Stack" (MDS) is a hot concept in data engineering in recent years, referring to a cloud-native, modular, decoupled combination of data infrastructure https://github.com/datascale-ai/data_engineering_book/blob/m... Later parts are better and more to the point, though: https://github.com/datascale-ai/data_engineering_book/blob/m... Edit: perhaps I judged too early. The RAG section isn't bad either: https://github.com/datascale-ai/data_engineering_book/blob/m... | ||||||||||||||
| ▲ | esafak 4 hours ago | parent | prev | next [-] | |||||||||||||
I'd have titled the submission 'Data Engineering for LLMs...' as it is focused on that. | ||||||||||||||
| ▲ | osamabinladen 2 hours ago | parent | prev | next [-] | |||||||||||||
this is great and i bookmarked it so i can read it later. i’m just curious though, was the readme written by chatgpt? i can’t tell if im paranoid thinking everything is written by chatgpt | ||||||||||||||
| ▲ | joshuaissac 9 hours ago | parent | prev | next [-] | |||||||||||||
English version: https://github.com/datascale-ai/data_engineering_book/blob/m... | ||||||||||||||
| ▲ | guillem_lefait 7 hours ago | parent | prev | next [-] | |||||||||||||
The figures in the individual chapters are in English (unlike the image in README_en.md). | ||||||||||||||
| ▲ | dvrp 7 hours ago | parent | prev | next [-] | |||||||||||||
If you are interested in (2026-)internet-scale data engineering challenges (e.g. processing tens to hundreds of petabytes) and pre-training/mid-training/post-training scale challenges, please send me an email at d+data@krea.ai! | ||||||||||||||
| ▲ | 9 hours ago | parent | prev | next [-] | |||||||||||||
| [deleted] | ||||||||||||||
| ▲ | rafavargascom 8 hours ago | parent | prev [-] | |||||||||||||
Thank you! How is it possible that a Chinese publication gets to the top of HN? | ||||||||||||||