Show HN: PyBujia, Easy Unit Testing for PySpark Jobs (github.com)
2 points by jpgerek 11 hours ago

As a Data Engineer, I've often wondered why so many companies don't unit test their Spark jobs.

In my experience, the main reasons are:

- Creating DataFrame fixtures (data and schemas) takes too much time

- Debugging across multiple tables is complicated

- Boilerplate code is verbose and repetitive

To address these pain points, I built PyBujia, a framework that:

- Lets you define table fixtures in Markdown, which makes DataFrames easier to create, debug and read

- Generalizes the boilerplate, saving setup time
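
To give a rough feel for the Markdown fixture idea, here is a minimal sketch in plain Python. It is not PyBujia's actual API: `parse_markdown_table` and `ORDERS_FIXTURE` are hypothetical names made up for this illustration, and every value is kept as a string rather than cast to a schema type.

```python
def parse_markdown_table(text):
    """Parse a Markdown table into a list of dicts (all values stay strings).

    Hypothetical helper for illustration only, not part of PyBujia.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]

    def split_row(line):
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data rows start at index 2.
    return [dict(zip(header, split_row(ln))) for ln in lines[2:]]


# A table fixture written the way it would appear in a Markdown file.
ORDERS_FIXTURE = """
| order_id | customer | amount |
|----------|----------|--------|
| 1        | alice    | 10.5   |
| 2        | bob      | 3.0    |
"""

rows = parse_markdown_table(ORDERS_FIXTURE)
```

In a real test, rows like these could be cast to the right types and passed to `spark.createDataFrame` with an explicit schema. The point is that the fixture stays readable as a table, so it is easy to eyeball when a test fails.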

It's made testing Spark jobs much easier for me (I now do TDD), and I hope it helps other Data Engineers as well.

Feedback is very welcome!