Show HN: Misata – synthetic data engine using LLM and Vectorized NumPy (github.com)
23 points by rasinmuhammed 6 days ago | 1 comments

Hey HN, I’m the author.

I built Misata because existing tools (Faker, Mimesis) are great for random rows but terrible for relational or temporal integrity. I needed to generate data for a dashboard where "Timesheets" must happen after "Project Start Date," and I wanted to define these rules via natural language.

How it works:

LLM Layer: Uses Groq/Llama-3.3 to parse a "story" into a JSON schema constraint config.

Simulation Layer: Uses Vectorized NumPy (no loops) to generate data. It builds a DAG of tables to ensure parent rows exist before child rows (referential integrity).
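To make the parent-before-child idea concrete, here is a minimal sketch, not Misata's actual code. The `parents` mapping, the `generate` function, and the column names are all hypothetical, made up for illustration. It orders the tables topologically, then uses NumPy fancy indexing so every timesheet date lands after its project's start date, with no per-row loop.

```python
from graphlib import TopologicalSorter

import numpy as np

# Hypothetical schema: each table maps to the tables it references (its parents).
parents = {"projects": [], "timesheets": ["projects"]}


def generate(n_projects=1_000, n_timesheets=10_000, seed=0):
    rng = np.random.default_rng(seed)
    tables = {}
    # static_order() yields every parent table before the tables that depend on it.
    for name in TopologicalSorter(parents).static_order():
        if name == "projects":
            days = rng.integers(0, 365, n_projects).astype("timedelta64[D]")
            tables[name] = {
                "id": np.arange(n_projects),
                "start_date": np.datetime64("2024-01-01") + days,
            }
        elif name == "timesheets":
            proj = tables["projects"]
            # Foreign keys are sampled from existing parent ids only.
            fk = rng.integers(0, n_projects, n_timesheets)
            # An offset of at least 1 day keeps each timesheet after the project start.
            offset = rng.integers(1, 181, n_timesheets).astype("timedelta64[D]")
            tables[name] = {
                "id": np.arange(n_timesheets),
                "project_id": fk,
                "date": proj["start_date"][fk] + offset,
            }
    return tables


tables = generate()
```

The only loop here is over tables, not rows. The temporal rule is enforced by construction (parent date plus a positive offset) rather than by generating and filtering.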

Performance: Generates ~250k rows/sec on my M1 Air.

It’s early alpha. The "Graph Reverse Engineering" (describe a chart -> get data) is experimental but working for simple curves.

pip install misata

I’d love feedback on the simulator.py architecture—I’m currently keeping data in-memory (Pandas) which hits a ceiling at ~10M rows. Thinking of moving to DuckDB for out-of-core generation next. Thoughts?
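For context on the out-of-core idea, here is a rough sketch of the shape I have in mind, with everything in it hypothetical: `generate_in_chunks`, `write_all`, and the `sink` callback are illustrative names, not part of Misata today. Rows are produced in fixed-size batches and handed to a sink one batch at a time, so peak memory is bounded by the chunk size rather than the table size. With DuckDB the sink would presumably append each batch to an on-disk table; here a plain list stands in for it.

```python
import numpy as np


def generate_in_chunks(total_rows, chunk_rows, seed=0):
    """Yield column dicts of at most chunk_rows rows each (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    for start in range(0, total_rows, chunk_rows):
        n = min(chunk_rows, total_rows - start)
        yield {
            # Ids continue across chunks so the full table has unique keys.
            "id": np.arange(start, start + n),
            "hours": rng.uniform(0.5, 8.0, n).round(2),
        }


def write_all(chunks, sink):
    """Pass each chunk to sink and return the total number of rows written."""
    rows = 0
    for chunk in chunks:
        sink(chunk)  # assumption: the sink persists the batch; the loop keeps no reference to it
        rows += len(chunk["id"])
    return rows


# Stand-in sink: collect chunks in a list (a real sink would write to disk).
seen = []
total = write_all(generate_in_chunks(25, 10), seen.append)
```

The catch is foreign keys: child chunks still need the parent id range (or the parent key column) available, which is one reason a queryable store looks attractive to me over flat files.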

twelvechess 2 days ago

That would be useful for testing MVPs with dummy data to see if they work. However, "synthetic data" usually means new data derived from existing data. From the README I couldn't quite tell whether that is the case here, but it's still useful.