| ▲ | MarginalGainz 7 hours ago |
| The saddest part about this article being from 2014 is that the situation has arguably gotten worse. We now have even more layers of abstraction (Airflow, dbt, Snowflake) applied to datasets that often fit entirely in RAM. I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'. The incentives are misaligned with efficiency. |
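For concreteness, the kind of 'hacky' script I have in mind is a daily rollup along these lines (the paths, the gzipped source file, and the "count by field 7" logic are all invented for illustration, but the shape is real):

    #!/usr/bin/env bash
    # Sketch: roll up ~10GB of daily request logs into per-endpoint counts.
    # Paths and field positions are illustrative, not a real system.
    set -euo pipefail

    day="${1:-$(date -d yesterday +%F)}"    # GNU date syntax
    src="/var/log/app/access-${day}.log.gz"
    out="/srv/reports/endpoints-${day}.tsv"

    [ -r "$src" ] || { echo "missing $src" >&2; exit 1; }

    # Stream the file; only the small count table is ever held in memory.
    zcat "$src" \
      | awk '{ n[$7]++ } END { for (k in n) print k "\t" n[k] }' \
      | sort -t$'\t' -k2,2nr \
      > "${out}.tmp"

    mv "${out}.tmp" "$out"    # publish atomically

It runs in minutes on one box, reruns safely, and anyone can read it end to end.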
|
| ▲ | jesse__ 3 hours ago | parent | next [-] |
| I've done a handful of interviews recently where the 'scaling' problem involves something that comfortably fits on one machine. The funniest one was ingesting something like 1GB of JSON per day. I explained, from first principles, how it fits, and received feedback along the lines of "our engineers agreed with your technical assessment, but that's not the answer we wanted, so we're going to pass". I've had this experience a good handful of times. I think a lot of people don't realize machines come with TBs of RAM and hundreds of physical cores. One machine is fucking huge these days. |
| |
| ▲ | kevmo314 2 hours ago | parent | next [-] | | The wildest part is they’ll take those massive machines, shard them into tiny Kubernetes pods, and then engineer something that “scales horizontally” with the number of pods. | | |
| ▲ | cyberpunk 3 minutes ago | parent | next [-] | | To be fair, each of those pods can have dedicated, separate external storage volumes, which may actually help, and it’s def easier than maintaining 200 or more iSCSI (or whatever) targets yourself | |
| ▲ | andai an hour ago | parent | prev | next [-] | | I had to re-read this a few times. I am sad now. | |
| ▲ | jesse__ an hour ago | parent | prev | next [-] | | Yeah man, you're running on a multitasking OS. Just let the scheduler do the thing. | |
| ▲ | ahartmetz an hour ago | parent | prev [-] | | I think my brain hurts |
| |
| ▲ | yndoendo an hour ago | parent | prev | next [-] | | I recently had to parse 500MB to 2GB daily log files into analytical information for sales. Quick and dirty, the application would have needed 64GB RAM and my work laptop only has 48GB RAM. After taking time cleaning it up, it was using under 1GB of RAM and worked faster, by only retaining records in RAM between days when it needed to. It is not about what you are doing, it is always about how you do it. This was the same with doing OCR analysis of assembly and production manuals. Quick and dirty, it would have taken over 24 hours of processing time; after moving to semaphores with parallelization it took less than two hours to process all the information. | |
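The same idea sketched in shell terms (my actual code wasn't shell, and the file layout and "amount in column 5" are invented for the example): one pass per day, keeping only the small aggregates.

    # Sketch only: per-day aggregation with roughly constant memory.
    # File names, the tab-separated layout and column numbers are invented.
    for f in /data/logs/sales-*.log; do
      day=$(basename "$f" .log)
      # awk keeps only a running total per SKU, never the raw records
      awk -F'\t' '{ total[$2] += $5 } END { for (sku in total) print sku "\t" total[sku] }' \
        "$f" > "/data/daily/${day}.tsv"
    done

    # The per-day summaries are tiny, so combining them afterwards is trivial.
    cat /data/daily/*.tsv \
      | awk -F'\t' '{ t[$1] += $2 } END { for (s in t) print s "\t" t[s] }' \
      > /data/summary/all.tsv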
| ▲ | dehrmann 39 minutes ago | parent | prev | next [-] | | > but that's not the answer we wanted You could have learned this if you were better about collecting requirements. You can tell the interviewer "I'd do it like this for this size data, but I'd do it like this for 100x data. Which size should I design this for?" If they're looking for one direction and you ask which one, interviewers will tell you. | | |
| ▲ | jesse__ 10 minutes ago | parent [-] | | I've done that too and, in my experience, people that ask a scaling question that fits on a single machine don't have the capacity to have that nuanced conversation. I usually try to help the interviewer adjust the scale to something that actually requires many machines, but they usually don't get it. Said another way, how do you have a meaningful conversation about scaling with a person who thinks their application is huge, but in reality only requires a tiny fraction of a single machine? Sometimes, there's such a massive gulf between perception and reality that the only thing to do is chuckle and move on. |
| |
| ▲ | bauerd an hour ago | parent | prev | next [-] | | In interviews just give them what they are looking for. Don't overthink it. Interviews have gotten so stupidly standardized as the industry at large copied the same Big Tech DSA/System Design/Behavioral process. And therefore interview processes have long been decoupled from the business reality most companies face. Just shard the database and don't forget the API Gateway | | |
| ▲ | mystifyingpoi 19 minutes ago | parent | next [-] | | This. Most interviewers don't want to do interviews, they have more important work to do (at least, that's what they claim). So they learn questions and approaches from the same materials and guides that are used by candidates. Well, I'm guilty of doing exactly this a few times. | |
| ▲ | jesse__ 33 minutes ago | parent | prev [-] | | Meh .. I've played that game; it doesn't work out well for anyone involved. I optimize my answers for the companies I want to work for, and get rejected by the ones I don't. The hardest part of that strategy is coming to terms with the idea that I constantly get rejected by people that I think are mostly <derogatory_words_here>, but I've developed thick skin over the years. I'd much rather spend a year unemployed (and do a ton of painful interviews) and find a company whose values align with mine, than work for a year on a team I disagree with constantly and quit out of frustration. | | |
| ▲ | bauerd 31 minutes ago | parent [-] | | The company's values may align to yours, even though they reject you. It's because the interview process doesn't need to have anything to do with their real-world process. Their engineers probe you for the same "best practices" that they themselves were constantly probed for in their own interviews. Interviewing is its very own skill that doesn't necessarily translate into real-life performance. | | |
| ▲ | jesse__ 16 minutes ago | parent [-] | | I agree with your observation. My issue is (from experience) it's really hard to tell from the outside if a team's values align with mine. Many teams talk the talk, but don't walk the walk, as the saying goes. It's just easier to not participate than it is to guess, and be wrong. I also believe that running a broken interview process actively selects for qualities you actually don't want, so it's much more likely that teams conducting those interviews aren't teams I want to work on. |
|
|
| |
| ▲ | coliveira 2 hours ago | parent | prev | next [-] | | Yes, but then how are these people going to justify the money they're spending on cloud systems?... They need to keep finding reasons to maintain their "investment", otherwise they could be seen as incompetent when their solution is proven to be ineffective. So, they have to show that it was a unanimous technical decision to do whatever they wanted in the first place. | |
| ▲ | badgersnake 21 minutes ago | parent | prev | next [-] | | This kind of bad interview is rife. It’s often more a case of guessing what the interviewer thinks than coming up with a good solution. | |
| ▲ | ahartmetz an hour ago | parent | prev | next [-] | | Every one of these cores is really fast, too! | | | |
| ▲ | yieldcrv 2 hours ago | parent | prev [-] | | “there’s no wrong answer, we just want to see how you think” gaslighting in tech needs to be studied by the EEOC, Department of Labor, FTC, SEC, and Delaware Chancery Court to name a few let’s see how they think and turn this into a paid interview |
|
|
| ▲ | pocketarc 6 hours ago | parent | prev | next [-] |
| I agree - and it's not just what gets you promoted, but also what gets you hired, and what people look for in general. You're looking for your first DevOps person, so you want someone who has experience doing DevOps. They'll tell you about all the fancy frameworks and tooling they've used to do Serious Business™, and you'll be impressed and hire them. They'll then proceed to do exactly that for your company, and you'll feel good because you feel it sets you up for the future. Nobody's against it. So you end up in that situation, running a workload that even a basic home desktop would be more than capable of handling. |
| |
| ▲ | jrjeksjd8d 6 hours ago | parent | next [-] | | I have been the first (and only) DevOps person at a couple of startups. I'm usually pretty guilty of NIH and wanting to develop in-house tooling to improve productivity. But more and more in my career I try to make boring choices. Cost is usually not a huge problem beyond seed stage. At Series A-B, the biggest problem is growing the customer base, so the fixed infra costs become a rounding error. We've built the product and we're usually focused on customer enablement and technical wins - proving that the product works 100% of the time to large enterprises so we can close deals. We can't afford weird flakiness in the middle of a POC. Another factor I rarely see discussed is bus factor. I've been in the industry for over a decade, and I like to be able to go on vacation. It's nice to hand off the pager sometimes. Using established technologies makes it possible to delegate responsibility to the rest of the team, instead of me owning a little rat's-nest fiefdom of my own design. The fact is that if a $5k/month infra cost for a core part of the service sinks your VC-backed startup, you've got bigger problems. Investors gave you a big pile of money to go and get customers _now_. An extra month of runway isn't going to save you. | |
| ▲ | woooooo 6 hours ago | parent [-] | | The issue is when all the spending gets you is more complexity, maintenance, and you don't even get a performance benefit. I once interviewed with a company that did some machine learning stuff, this was a while back when that typically meant "1 layer of weights from a regression we run overnight every night". The company asked how I had solved the complex problem of getting the weights to inference servers. I said we had a 30 line shell script that ssh'd them over and then mv'd them into place. Meanwhile the application reopened the file every so often. Zero problems with it ever. They thought I was a caveman. | | |
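From memory, the whole thing was roughly this shape (host names and paths are made up here; the real script had a bit more error handling):

    #!/usr/bin/env bash
    # Sketch of the weight-distribution script; hosts and paths are invented.
    set -euo pipefail

    weights="/models/latest/weights.bin"
    hosts=(inference-01 inference-02 inference-03)

    for h in "${hosts[@]}"; do
      scp "$weights" "$h:/srv/model/weights.bin.new"
      # mv within one filesystem is atomic, so the app never reads a
      # half-written file; it reopens the path on its own schedule.
      ssh "$h" 'mv /srv/model/weights.bin.new /srv/model/weights.bin'
    done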
| ▲ | ffsm8 4 hours ago | parent [-] | | The issue with solutions like that is usually that people don't know how they work or how to find them if they ever stop working... Basically, discoverability is where shell scripts fail | |
| ▲ | roncesvalles 21 minutes ago | parent | next [-] | | You can literally have a 20-line Python script on cron that verifies everything ran properly and fires off a PagerDuty alert if it didn't. And it looks like PagerDuty even supports heartbeats, so even if your Python script itself failed, you could get alerted. | |
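Sketched here as a shell cron job rather than Python (the path, the size threshold, and the mail-based alert are all placeholders; swap in whatever your paging service expects):

    #!/usr/bin/env bash
    # Sketch: verify last night's job produced sane output.
    # The path, the 10-line threshold and the alert command are placeholders.
    day=$(date -d yesterday +%F)
    out="/data/out/report-${day}.csv"

    if [ ! -s "$out" ] || [ "$(wc -l < "$out")" -lt 10 ]; then
      echo "report for ${day} is missing or suspiciously small" \
        | mail -s "nightly pipeline check failed" oncall@example.com
    fi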
| ▲ | chuckadams 4 hours ago | parent | prev | next [-] | | Those scripts have logs, right? Log a hostname and path when they run. If no one thinks to look at logs, then there's a bigger problem going on than a one-off script. | |
| ▲ | LevGoldstein 2 hours ago | parent | prev | next [-] | | Which is why you take the time to put usage docs in the repo README, make sure the script is packaged and deployed via the same methods that the rest of the company uses, and ensure that it logs success/failure conditions. That's been pretty standard at every organization I've been at my entire professional career. Anyone who can't manage that is going to create worse problems when designing/building/maintaining a more complex system. | | |
| ▲ | mlyle 10 minutes ago | parent [-] | | Yah. A lot of the complexity in data movement or processing is unneeded. But decent standardized orchestration, documentation, and change management isn't optional even for the 20 line shell script. Thankfully, that stuff is a lot easier for the 20 line standard shell script. Or python. The python3 standard library is pretty capable, and it's ubiquitous. You can do a lot in 50-100 lines (counting documentation) with no dependencies. In turn it's easy to plug into the other stuff. |
| |
| ▲ | woooooo 4 hours ago | parent | prev | next [-] | | That becomes a problem if you let the shell script mutate into an "everything" script that's solving tons of business problems. Or if you're reinventing kubernetes with shell scripts. There's still a place for simple solutions to simple problems. | |
| ▲ | justsomehnguy 4 hours ago | parent | prev [-] | | > Basically discoverability is where shell script fail No, it's lack of documentation and no amount of $$$$/m enterprise AI solutions (R)(TM) would help you if there is no documentation. |
|
|
| |
| ▲ | pragma_x 4 hours ago | parent | prev [-] | | I've seen the ramifications of this "CV first" kind of engineering. Let's just say that it's a bad time when you're saddled with tech debt solely from a handful of influential people that really just wanted to work elsewhere. |
|
|
| ▲ | wccrawford 6 hours ago | parent | prev | next [-] |
| I've spent my last 2 decades doing what's right, using the technologies that make sense instead of the techs that are cool on my resume. And then I got laid off. Now, I've got very few modern frameworks on my resume and I've been jobless for over a year. I'm feeling a right fool now. |
| |
| ▲ | hackthemack an hour ago | parent | next [-] | | I have hung on to my job for many years now because of being in a similar situation in regards to trying to do the right thing and the fear of not being hire-able. There is something wrong with the industry in chasing fads and group think. It has always been this way. Businesses chased Java in the late 90s, early 00s. They chased CORBA, WSDL, ESB, ERP and a host of other acronyms back in the day. More recently, Data Lake, Big Data, Cloud Compute, AI. Most of the executives I have met really have no clue. They just go with what is being promoted in the space because it offers a safety net. Look, we are "not behind the curve!". We are innovating along with the rest of the industry. Interviews do not really test much for ability to think and reason. If you ran an entire ISP, if you figured out, on your own, without any help, how to shard databases, put in multiple layers of redundancy, caching... well, nobody cares now. You had to do it in AWS or Azure or whatever stack they have currently. Sadly, I do not think it will ever be fixed. It is something intrinsic to human nature. | |
| ▲ | ahartmetz an hour ago | parent | prev | next [-] | | Try Rust? The system programming world isn't very bullshit-infested and Rust is trendy (which is good for a change), also employers can't realistically expect many years of Rust experience. Need training and something to show? Contribute to some FOSS project. | |
| ▲ | fHr 5 hours ago | parent | prev [-] | | This exactly, actual doers are most of the time not rewarded, meanwhile the senior AWS specialist sucking Jeff's wiener gets a job doing nothing but generating costs and leaving behind more shit after his 3 years moving up the ladder to some even bigger bs pretend consulting job at an even bigger company. It's the same bs mostly for developers. I rewrote their library from TS to Rust and it gains them 50x performance increases and saves them 5k+ a week over all their compute now, but nobody gives a shit and I do not have a certification for that to show off on my LinkedIn. Meanwhile my PM did nothing, got paid to do some shitty certificate, and then gets the credit and the certificate and pisses off to the next bigger fish collecting another 100k more, meanwhile I get a 1k bonus and a pat on the shoulder. Corporate late stage capitalism is complete fucking bs and I think about becoming a PM as well now. I feel like a fool and betrayed. Meanwhile they constantly threaten to lay off or outsource our team, as they say we are too expensive in a first world country and they can easily find equally good people in India etc. What a time to be alive. | |
| ▲ | antonvs an hour ago | parent [-] | | > saves them 5k+ a week over all their compute If you're willing and able to promote yourself internally, you can make people give a shit, or at least publicly claim they do. That's 260k+ per year, and even big businesses are going to care about that at some level, especially if it's something that can be replicated. Find 10 systems you can do that with, and it's 2.6m+ per year. But, if you don't want to play the self-promotion game, yeah someone else is going to benefit from your work. |
|
|
|
| ▲ | nicoburns 6 hours ago | parent | prev | next [-] |
| > datasets that often fit entirely in RAM. Yep, and a lot more datasets fit entirely into RAM now. Ignoring the recent price spikes for a moment, 128GB of RAM in a laptop is entirely achievable and not even the limit of what is possible. That was a pipe dream in 2014 when computers with only 4GB were still common. And of course for servers the max RAM is much higher, and in a lot of scenarios streaming data off a fast local SSD may be almost as good. |
| |
| ▲ | dapperdrake 5 hours ago | parent | next [-] | | Oldie-but-goldy: https://yourdatafitsinram.net/ | |
| ▲ | newyankee 4 hours ago | parent | prev | next [-] | | I have actually worked in a company as a consultant data guy in a non-technical team. I had a 128 GB PC 10 years back and did everything with open-source R then, and it worked! The others thought it was wizardry | |
| ▲ | plagiarist 2 hours ago | parent | prev [-] | | You don't really need to ignore the price spikes even. You can still buy a machine with more than 128GB of RAM for a single month's $5k. |
|
|
| ▲ | reval 7 hours ago | parent | prev | next [-] |
| I’ve seen this pattern play out before. The pushback on simpler alternatives stems from a legitimate need for short time to market on the demand side of the equation, and a lack of knowledge on the supply side. Every time I hear an engineer call something hacky, they are at the edge of their abilities. |
| |
| ▲ | networkadmin 6 hours ago | parent [-] | | > Every time I hear an engineer call something hacky, they are at the edge of their abilities. It's just like the systemd people talking about sysvinit. "Eww, shell scripts! What a terrible hack!" says the guy with no clue and no skills. It's like the whole ship is being steered by noobs. | | |
| ▲ | acdha 6 hours ago | parent | next [-] | | systemd would be a derail even if you weren’t misrepresenting the situation at several levels. Experienced sysadmins in my experience were the ones pushing adoption because they had to clean up the messes caused by SysV’s design limitations and flaws, whereas in this case it’s a different scenario where the extra functionality is both unneeded and making it worse at the core task. | | |
| ▲ | networkadmin 6 hours ago | parent [-] | | > Experienced sysadmins in my experience were the ones pushing adoption because they had to clean up the messes caused by SysV’s design limitations and flaws That's funny. I used to have to clean up the messes caused by systemd's design limitations and flaws, until I built my own distro with a sane init system installed. Many of the noobs groaning about the indignity of shell scripts don't even realize that they could write init 'scripts' in whatever language they want, including Python (the language these types usually love so much, if they do any programming at all.) | | |
| ▲ | acdha an hour ago | parent | next [-] | | I think you’d have a more fruitful discussion if you stopped trying to call people noobs when they don’t agree with you. For example, I’ve been dealing with SysV since the early 90s and while it’s gotten better since we no longer have to support the really bizarre Unix variants, my problem with init scripts wasn’t “indignity” but the lack of consistency across distributions and versions, which affects anyone shipping software professionally (“can’t do this easily until $distro upgrades coreutils”), and from an operator’s perspective using Python doesn’t make that better because instead of supporting one consistent thing you’d end up with the subset of features each application team felt like implementing, consistent only to the extent that they care to follow other projects. One virtue of systemd is that having a single common way to specify dependencies, restarts, customization, etc. avoids the ops people having to learn dozens of different variations of the same ideas and especially how to deal with their gaps. A few years back, a data center power outage at one place I worked really highlighted that: the systemd-based servers recovered quickly because they actually had working retries; all of the older stuff using SysV had to be manually reviewed because there were all kinds of problems like races on dependencies like DNS or NFS, retry logic which failed hard after a short period of time, failures because a stale PID file wasn’t removed, or cases where a vendor had simply never implemented retries in their init scripts. While in theory you can handle all of those in SysV most people never did. After a couple decades of that, a lot of us don’t want to spend time on problems Microsoft solved in Bill Clinton’s first term. | | |
| ▲ | networkadmin 32 minutes ago | parent | next [-] | | I just created my own OS, with my own init system that does things how I think it should be done--and it does it every time, without the bizarre bugs that come from Linux Puttering's shitware code. It's the same thing any corporation should be doing if they were smart, instead of outsourcing everything to RedHat, Microsoft, Google, etc. | |
| ▲ | whatwhaaaaat 16 minutes ago | parent | prev [-] | | I hate to blather on about systemd in this decade but how in the world does creating something completely different than sysv init help people shipping software? Now they have to support yet another init scheme. |
| |
| ▲ | chuckadams 4 hours ago | parent | prev | next [-] | | It's entirely possible that both SysV init and systemd suck for different reasons. I'm still partial to systemd since it takes care of daemons and supervision in a way that init does not, but I'll take s6 or process-compose or even supervisord if I have to. Horses for courses. | | |
| ▲ | plagiarist an hour ago | parent [-] | | I want to love s6 but every time I see the existence of s6-rc-compile I get heated. I'm sure there are excellent reasons behind it but I personally don't want services to work that way. |
| |
| ▲ | bitwize 24 minutes ago | parent | prev | next [-] | | Specifying system processes and their dependencies declaratively, rather than in a tangle of arbitrary executable code, is cleaner, more efficient, easier to use, and more auditable. And that's not even getting into the additional process management duties systemd assumes. | |
| ▲ | plagiarist an hour ago | parent | prev [-] | | You can write arbitrary scripts into systemd... or like one step removed at most? That's not really a difference unless you have some nuance in mind that I don't. I honestly do not like systemd, either. It is okay for managing processes but I wish it didn't spread into everything else in the machine. Or if it must, could it actually work cohesively across their concepts? Would be nice to have an obvious and easy way to run Quadlet as its own user to isolate further, would be nice to have systemd-sysusers present in /etc/subuid so they can run containers. I like what they are doing with atomic distros. It would be great to have a single file declarative setup for something like running a containerized reverse HTTP proxy with an isolated user. Instead of "atomic" but you manually edit files in /etc after install. |
|
| |
| ▲ | dapperdrake 5 hours ago | parent | prev [-] | | Eternal September | | |
|
|
|
| ▲ | RobinL 7 hours ago | parent | prev | next [-] |
| Worse in some ways, better in others. DuckDB is often an excellent tool for this kind of task. Since it can run parallelized reads, I imagine it's often faster than a command-line pipeline, and the syntax is easier to understand |
| |
| ▲ | briHass 5 hours ago | parent | next [-] | | More importantly, you have your data in a structured format that can be easily inspected at any stage of the pipeline using a familiar tool: SQL. I've been using this pattern (scripts or code that execute commands against DuckDB) to process data more recently, and the ability to do deep investigations on the data as you're designing the pipeline (or when things go wrong) is very useful. With a code-based solution (reading data into objects in memory), it's much harder to see your data. Using a debugger to inspect objects on the heap is painful compared to being able to JOIN/WHERE/GROUP BY your data. | |
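For example, being able to stop mid-pipeline and poke at an intermediate file like this is the whole appeal (the file name and columns are invented):

    # Sketch: inspect an intermediate stage with the DuckDB CLI.
    # The file and its columns are made up for illustration.
    echo "
      SELECT customer_id, count(*) AS orders, sum(amount) AS revenue
      FROM read_csv_auto('/data/stage2/orders.csv')
      GROUP BY customer_id
      ORDER BY revenue DESC
      LIMIT 20;
    " | duckdb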
| ▲ | mrgoldenbrown 5 hours ago | parent | prev [-] | | IMHO the main point of the article is that a typical unix command pipeline IS parallelized already. The bottleneck in the example was maxing out disk IO, which I don't think duckdb can help with. | |
| ▲ | chuckadams 4 hours ago | parent [-] | | Pipes are parallelized when you have unidirectional data flow between stages. They really kind of suck for fan-out and joining though. I do love a good long pipeline of do-one-thing-well utilities, but that design still has major limits. To me, the main advantage of pipelines is not so much the parallelism, but being streams that process "lazily". On the other hand, unix sockets combined with socat can perform some real wizardry, but I never quite got the hang of that style. | | |
| ▲ | mdavidn 2 hours ago | parent | next [-] | | Pipelines are indeed one flow, and that works most of the time, but shell scripts make parallel tasks easy too. The shell provides tools to spawn subshells in the background and wait for their completion. Then there are utilities like xargs -P and make -j. | |
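For example (summarize.sh and build_report.sh are stand-ins for whatever per-file or per-region work you need):

    # Fan out over files, eight at a time, with xargs -P.
    ls /data/logs/*.log | xargs -n 1 -P 8 ./summarize.sh

    # Or plain background jobs plus wait:
    for region in us eu apac; do
      ./build_report.sh "$region" > "report-${region}.tsv" &
    done
    wait    # blocks until all three finish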
| ▲ | Linux-Fan 2 hours ago | parent | prev [-] | | UNIX provides the Makefile as the go-to tool if a simple pipeline is not enough. GNU make makes this even more powerful by being able to generate rules on-the-fly. If the tool of interest works with files (like the UNIX tools do) it fits very well. If the tool doesn't work with single files, I have had some success using Makefiles for generic processing tasks by creating, as part of the target, a marker file recording that a given task completed. |
|
|
|
|
| ▲ | attractivechaos 4 hours ago | parent | prev | next [-] |
| On the contrary, the key message from the blog post is not to load the entire dataset to RAM unless necessary. The trick is to stream when the pattern works. This is how our field routinely works with files over 100GB. |
|
| ▲ | willtemperley 5 hours ago | parent | prev | next [-] |
| Yep. The cloud providers however always get paid, and get paid twice on Sunday when the dev-admins forget to turn stuff off. It’s the same story as always, just it used to be Oracle certified tech, now it’s the AWS tech certified to ensure you pay Amazon. |
|
| ▲ | lormayna 6 hours ago | parent | prev | next [-] |
| For a dataset that fits in RAM, the best solutions are DuckDB or clickhouse-local.
Using SQL-ish queries is easier than a bunch of bash scripts and really powerful. |
| |
| ▲ | zX41ZdbW 5 hours ago | parent [-] | | Though ClickHouse is not limited to a single machine or local data processing. It's a full-featured distributed database. |
|
|
| ▲ | 6 hours ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | rawgabbit 3 hours ago | parent | prev | next [-] |
| Well. I try for a middle ground. I am currently ditching both airflow and dbt. In Snowflake, I use scheduled tasks that call stored procedures. The stored procedures do everything I need to do. I even call external APIs like Datadog’s and Okta’s and pull down the logs directly into snowflake. I do try to name my stored procedures with meaningful names. I also add generous comments including urls back to the original story. |
|
| ▲ | data-ottawa 4 hours ago | parent | prev | next [-] |
| Airflow and dbt serve a real purpose. The issue is you can run sub-TiB jobs on a few small/standard instances with better tooling. Spark and Hadoop are for when you need multiple machines. Dbt and airflow let you represent your data as a DAG and operate on that, which is critical if you want to actually maintain and correct data issues and keep your data transforms timely. edit: a little surprised at multiple downvotes. My point is, you can run airflow and dbt on small instances, and you can do all your data processing on small instances with tools like duckdb or polars. But it is very useful to use a tool like dbt that allows you to re-build and manage your data in a clear way, or a tool like airflow which lets you specify dependencies for runs. After say 30 jobs or so, you'll find that being able to re-run all downstreams of a model starts to pay off. |
| |
| ▲ | adammarples 2 hours ago | parent [-] | | Agreed, airflow and dbt have literally nothing to do with the size of the data and can be useful, or overkill, at any size. Dbt just templates the query strings we use to query the data and airflow just schedules when we query the data and what we do next. The fact that you can fit the whole dataset in duckdb without issue is kind of separate from these tools; we still need to be organised about how and when we query it. |
|
|
| ▲ | petcat 7 hours ago | parent | prev | next [-] |
| > a robust bash script These hardly exist in practice. But I get what you mean. |
| |
| ▲ | sam_lowry_ 2 hours ago | parent [-] | | You don't. It's bash only because the parent process is bash, but otherwise it's all grep, sort, tr, cut and other textutils piped together. |
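The sort of one-liner I mean (the log format and field positions are made up): top request paths behind 500s, straight from textutils:

    # Illustrative only: assumes a combined-log-style line with the request
    # path in field 7; the leading grep is a crude filter for 500 responses.
    grep ' 500 ' access.log | cut -d' ' -f7 | sort | uniq -c | sort -rn | head -20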
|
|
| ▲ | hmokiguess 3 hours ago | parent | prev | next [-] |
| This reminds me of this reddit comment from a long time ago: https://www.reddit.com/r/programming/comments/8cckg/comment/... |
|
| ▲ | mritchie712 4 hours ago | parent | prev | next [-] |
| happy middle ground: https://www.definite.app/ (I'm the founder). datalake (DuckLake), pipelines (hubspot, stripe, postgres), and dashboards in a single app for $250/mo. marketing/finance get dashboards, everyone else gets SQL + AI access. one abstraction instead of five, for a fraction of your Snowflake bill. |
|
| ▲ | 1vuio0pswjnm7 5 hours ago | parent | prev | next [-] |
| "I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'." Also seen strange responses from HN commenters when it's mentioned that bash is large and slow compared to ash and bash is better suited for use as an interactive shell whereas ash is better suited for use as a non-interactive shell, i.e., a scripting shell I also use ash (with tabcomplete) as an interactive shell for several reasons |
|
| ▲ | shiandow 5 hours ago | parent | prev [-] |
| If airflow is a layer of abstraction, something is wrong. Yes, it is an additional layer, but if your orchestration starts concerning itself with what it is doing, then something has gone wrong. It is not a layer on top of other logic; it is a single layer where you define how to start your tasks, how to tell when something is wrong, and when to run them. If you don't insist on doing heavy computations within the airflow worker, it is dirt cheap. If it's something that can easily be done in bash or python, you can do it within the worker as long as you're willing to throw a minimal amount of hardware at it. |