HackerRank open sourced its ATS. My resume scored 90/100. Oh wait 74. No – 88 (danunparsed.com)
449 points by sambellll 8 hours ago | 160 comments
dvt 5 hours ago | parent | next [-]

An alarming number of people don't understand that LLMs work via purely stochastic processes, so I'm happy to see in-depth pieces like this. I'm looking for a job and maybe this is why it's so hard to get a callback these days: resumes are just dumped in some LLM black hole and no one really knows how it works. The author says:

> temperature 0.1 — low, supposedly nudging the model toward deterministic outputs

This is not correct (and is briefly touched on later in the piece when he sets temperature to 0). Temperature is not some kind of "deterministic" switch; rather, it affects the sampling distribution (which becomes more "spiky" but is still very much a distribution).

miki123211 3 hours ago | parent | next [-]

In theory, temperature 0 does make the LLM deterministic.

Well, in theory theory, temperature 0 doesn't really exist. Mathematically, as lim temperature->0, the distribution gets spikier and spikier: the probability of the most likely sample goes to almost-but-not-quite 1 and the rest go to almost-but-not-quite 0. In practice, temperature=0 is literally a separate branch of an if statement that just picks the most likely sample (using the actual formula that works for non-zero values would cause a zero division).
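A minimal sketch of the behaviour described here, assuming a plain temperature-scaled softmax and a special-cased zero. The function name `token_probabilities` and the logits are made up for illustration; real samplers differ in detail.

```python
import math

def token_probabilities(logits, temperature):
    """Temperature-scaled softmax, with temperature 0 handled as greedy argmax."""
    if temperature == 0:
        # Separate branch: x / 0 is undefined, so put all probability
        # on the first largest logit instead of evaluating the formula.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.9, 0.0]  # hypothetical logits; the first two are nearly tied
for t in (1.0, 0.1, 0.0):
    print(t, [round(p, 4) for p in token_probabilities(logits, t)])
```

With two nearly tied logits, a temperature of 0.1 still leaves a meaningful chance of picking the second token; only the zero branch removes it entirely.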

However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.

sigmoid10 3 hours ago | parent | next [-]

>in theory theory, temperature 0 doesn't really exist.

It does exist very much, even if you go to pure math. Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice is it easier to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients. But setting T to zero is, in both theory and practice, turning the usual probability function into greedy sampling.

thaumasiotes 8 minutes ago | parent | next [-]

> It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice is it easier to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients.

I don't understand the distinction you're drawing. A Dirac delta function is a "simple if check".

317070 18 minutes ago | parent | prev [-]

> Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function.

In pure math, it does not do that. It becomes a dirac-delta comb with equal weight on every maximum. There can be more than 1 maximum. Setting the temperature to zero turns into greedy sampling, but greedy sampling is not necessarily deterministic as you can have multiple equally optimal options.
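For reference, the limit under discussion can be written out. The symbols are introduced here and are not from the thread: z_i are the logits, T is the temperature, and M is the set of indices that attain the maximum logit.

```latex
p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},
\qquad
\lim_{T \to 0^+} p_i(T) =
\begin{cases}
  1/|M| & \text{if } i \in M, \\
  0     & \text{otherwise,}
\end{cases}
\qquad
M = \{\, i : z_i = \max_j z_j \,\}.
```

With a unique maximum (|M| = 1) all the mass lands on the argmax; with ties it is split evenly among the tied tokens.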

chrisjj an hour ago | parent | prev | next [-]

> However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run

The implementation does not often differ run by run.

lelandbatey 2 hours ago | parent | prev [-]

As I understood it, the "randomness" affecting what is selected at any temperature still comes from a PRNG or CSPRNG (or whatever RNG you want, maybe a hardware one), and if you were to swap out that with something deterministic you'd get the same results every time (barring non-determinism in other parts of the OS/drivers/maybe even hardware).

But theoretically, the output of every LLM is seed-driven (or could be if you wrote the software to isolate it) just like any computer software. It's just none of the software written (even llama.cpp AFAIK) chooses to support stable-seeding due to the changes in stuff like CPU/Vulkan/CUDA/Metal differences making it difficult to make consistent.

They could though! Hopefully one day someone implements it into the mainstream LLM-engine software and it gets exposed in the APIs serving the models. It'd do a lot to show folks the "internals" of these models.
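A toy sketch of the seed-driven idea, using Python's standard `random` module. The helper `sample_tokens` and the distribution are hypothetical. It assumes the probabilities themselves are bit-identical between runs, which is exactly the part other commenters dispute for real inference stacks.

```python
import random

def sample_tokens(probabilities, count, seed):
    """Draw `count` token indices from a fixed distribution using a seeded PRNG."""
    rng = random.Random(seed)  # private generator, independent of global state
    indices = list(range(len(probabilities)))
    return [rng.choices(indices, weights=probabilities)[0] for _ in range(count)]

probabilities = [0.8, 0.19, 0.01]  # made-up next-token distribution
first = sample_tokens(probabilities, 20, seed=1234)
second = sample_tokens(probabilities, 20, seed=1234)
print(first == second)  # same seed and same distribution give the same draws
```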

toolslive 2 hours ago | parent | next [-]

It's probably due to the fact that it's a cloud service. You have no guarantee that your next request will go to the same machine. So even with an identical seed, and temp 0 you might get different hardware and hence different accuracy/noise in the floating point operations.

rightbyte 41 minutes ago | parent [-]

How can there be noise in floating point operations? I could buy something like completion order for parallelized batches, i.e. adding a+b+c is different from a+c+b, etc.

microtonal 2 hours ago | parent | prev | next [-]

Stable seeding is not enough. A lot of modern, fast compute kernels are nondeterministic. Floating point multiplication/addition is not strictly associative and e.g. reductions can combine results from different threads in different orders (e.g. through atomic ops). You can write kernels to be deterministic, but it is generally less efficient.
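A small illustration of the non-associativity point, in plain Python rather than a GPU kernel. The helper `sequential_sum` is made up for this example; it adds strictly left to right so that only the order of the inputs changes.

```python
def sequential_sum(values):
    """Add values strictly left to right, like one thread walking the list."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, two different reduction orders.
values = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
print(sequential_sum(values))     # 0.0: the 1.0 is lost next to 1e16
print(sequential_sum(reordered))  # 1.0: the big terms cancel first
```

A parallel reduction that combines partial results in whatever order threads finish is effectively choosing between orderings like these on every run.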

nok22kon 2 hours ago | parent | prev [-]

that's incorrect in the presence of batching. it's tough work making it truly deterministic:

https://x.com/FireworksAI_HQ/status/2069873437217276015

vidarh an hour ago | parent [-]

It's not that hard. What is hard is making it truly deterministic and retain high throughput.

aesthesia 4 hours ago | parent | prev | next [-]

A distribution with all probability mass on one outcome is deterministic, so in principle, setting temperature to 0 _should_ result in deterministic outputs. There are a few reasons it might not, but I don't think any of these apply when running a local model like the author did.

317070 4 hours ago | parent | next [-]

> so in principle, setting temperature to 0 _should_ result in deterministic outputs

It is a common misconception, but it is not true even in principle. If I have 2 or more logits which are equal to the maximum of my logits, I will sample uniformly random from them with any temperature, even zero. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly from the first or the last element.

Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs put the associativity of the sums in matrix multiplications in arbitrary order, and this has a huge impact on the logits coming out of the neural network.

EvgeniyZh 4 hours ago | parent | next [-]

You don't have to sample uniformly. You could take the lowest index of all maxima. But yeah, the main source of randomness is non-deterministic matmul, and temperature does nothing with it
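The two tie-breaking choices can be sketched side by side on the [1, 0, 1] example from the parent comment. Both function names are hypothetical, and this is only an illustration of the options, not how any particular inference engine does it.

```python
import random

def greedy_pick_uniform(logits, rng):
    """Temperature-0 limit taken literally: choose uniformly among all maxima."""
    top = max(logits)
    tied = [i for i, x in enumerate(logits) if x == top]
    return rng.choice(tied)

def greedy_pick_lowest_index(logits):
    """Deterministic alternative: always take the first maximum."""
    return logits.index(max(logits))

logits = [1.0, 0.0, 1.0]
rng = random.Random(0)
uniform_picks = {greedy_pick_uniform(logits, rng) for _ in range(200)}

print(sorted(uniform_picks))             # both tied indices show up
print(greedy_pick_lowest_index(logits))  # always 0
```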

jstanley 2 hours ago | parent | prev | next [-]

> "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs.

But this isn't a fundamental property of LLMs, it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications correctly and deterministically sample from the highest-probability outputs, you will have a deterministic LLM.

vbarrielle an hour ago | parent [-]

It may be an implementation detail, but in practice, if the only way to get a deterministic output is to run on the CPU, then it's not going to be usable.

317070 15 minutes ago | parent [-]

Actually, Google's TPUs are also deterministic!

DougBTX 2 hours ago | parent | prev [-]

> GPUs put the associativity of the sums in matrix multiplications in arbitrary order

That’s user-controlled too, not an inherent property of GPUs:

https://docs.pytorch.org/docs/2.12/generated/torch.use_deter...

vbarrielle an hour ago | parent [-]

The matrix multiplication is only deterministic for sparse-dense products under these settings:

> torch.bmm() when called on sparse-dense CUDA tensors

And it's not listed under the operations that raise an exception otherwise, so I'm not sure the docs promise that dense-dense matrix-matrix products are deterministic.

easygenes 4 hours ago | parent | prev | next [-]

There are. If the kernels are nondeterministic (e.g. timing issues) there are minor changes between runs, on a single system, even with eager decode enabled (typically what temperature=0 achieves).

IshKebab 4 hours ago | parent | prev | next [-]

Setting the temperature to 0 should give deterministic results but that's not any better - it's just hiding the huge variance by only taking one sample.

croes 3 hours ago | parent | prev | next [-]

So you would get always the same result, but it could be the wrong one

srdjanr 3 hours ago | parent [-]

Of course, nothing can guarantee the right answer from LLMs

valzam 4 hours ago | parent | prev [-]

I mean the easiest explanation would be that the model harness doesn't always take the most likely token but does top-k sampling or similar. temperature just means that probabilities get more and more equalized, boosting the chance that an unlikely token gets picked. but even with temp 0 you could have 0.8 T1, 0.19 T2, ... and sometimes sample T2

aesthesia 4 hours ago | parent [-]

No, this can't happen at temperature 0. The formula defining temperature-adjusted softmax isn't strictly defined at 0, but taking the limit (in the case where all logits are distinct) results in probability 1 being placed on the largest logit. Samplers will typically special case temperature 0 and pick the most likely token at each step.

dvt 4 hours ago | parent [-]

This is a very authoritative answer that should be more nuanced and caveated as implementation-dependent. In some cases, repetition penalties take precedence over sampling; top_k and top_p can also be handled before or after the temperature step. In other cases, `0` is turned into like 1e-10 or some super tiny float value (which can drift if you do any arithmetic with it). Routing, quantization, etc. can also have an effect on sampling. And yes, in some cases, setting temperature to 0 can mean "pure greedy decoding" which makes the decoder about as deterministic as it can get.
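To make the ordering point concrete, here is a hedged sketch of a nucleus (top_p) filter applied before versus after the temperature step. The helpers `softmax` and `nucleus` and all the numbers are assumptions for illustration; actual samplers vary in which order they use.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax for a non-zero temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p):
    """Indices of the smallest high-probability prefix whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.0]  # made-up logits for three tokens
top_p, temperature = 0.8, 0.5

# Order A: filter on the unscaled distribution, apply temperature afterwards.
filter_then_temperature = nucleus(softmax(logits), top_p)
# Order B: apply temperature first, then filter the sharpened distribution.
temperature_then_filter = nucleus(softmax(logits, temperature), top_p)

print(filter_then_temperature)  # two candidate tokens survive
print(temperature_then_filter)  # only one candidate token survives
```

Same settings, different candidate sets, which is why "temperature X" alone does not pin down sampler behaviour.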

nok22kon 2 hours ago | parent | prev | next [-]

it's a bad idea in general to use non-1.0 temperature. there is a reason labs are strongly recommending using 1.0.

using low temperature is more deterministic, but the cost is the model becomes "dumber"

tipsytoad an hour ago | parent | next [-]

1.0 is actually pretty arbitrary and way too high as a general rule. Something like 0.3 is a more sensible default

317070 2 minutes ago | parent | next [-]

If RL was used to train the model, the model will have been trained on its own sequences. Those will have been generated with a temperature of 1.0. They must be, otherwise you would get a premature collapse or explosion of your entropy if the temperature was respectively lower or higher.

After that RL step, you want to stick to the RL distribution, and so keep a temperature of 1.0. Other temperatures will drive the model out-of-distribution.

That is why agents are usually kept at a temperature of 1.0.

zipy124 an hour ago | parent | prev | next [-]

It really depends on the application does it not? I'm not an LLM guy, but for creative tasks like storytelling wouldn't you want a higher temperature usually? Happy to gain insight from anyone with experience here :)

embedding-shape an hour ago | parent | prev [-]

Heavily depends on the model architecture and the implementation though, I don't think you can say what values are better than others without first specifying those, otherwise it's straight up guessing, ironically.

codeflo an hour ago | parent | prev | next [-]

It can be useful for pure translation tasks and stuff like that where you explicitly don't want creativity of any kind.

vidarh an hour ago | parent | prev [-]

Plenty of setups default to lower values than 1.0.

make3 4 hours ago | parent | prev | next [-]

A spikier distribution is exactly what makes the output closer to deterministic. That's not the point, though. Even in greedy (deterministic) decoding, it is still a black box that reacts to its inputs in unpredictable ways. Switching one word around might lead to different scores, for example.

bhanu786 2 hours ago | parent | prev | next [-]

Agree

spwa4 3 hours ago | parent | prev | next [-]

> An alarming number of people don't understand that LLMs work via purely stochastic processes ...

I've been studying AI for 20 years. What really needs to be added to this statement is:

"An alarming number of people don't understand that LLMs work via purely stochastic processes - and so does human thinking. People do NOT arrive at the same conclusion if merely the weather's different. Worse: with human thinking not only do most people not think this is real, a subset of people will actively fight the idea. Of course, depending on the weather"

miki123211 3 hours ago | parent | next [-]

What's even worse, different humans have different weights.

If you train two different LLMs and replace what data they "see" in batch n, that doesn't affect the data they see in batch n+1, or any further batches. In LLMs, you can introduce "noise" into the training process, but that noise doesn't really compound.

Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical twin goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until they're two completely different people, trained on completely different "data mixtures".

chrisjj 40 minutes ago | parent [-]

> What's even worse, different humans have different weights.

Far worse would be different humans having the same weights.

mnky9800n 3 hours ago | parent | prev | next [-]

Test retest reliability is a thing in psychometrics.

spwa4 an hour ago | parent [-]

Ah cool. So there is data? How consistent are humans?

What I'd really love is an actual number for a "human hallucination rate". How often will a random human

1) claim something that is wrong

2) defend the wrong claim and/or logic even when the problem is pointed out to them

(and this of course outside of the usual topics. In politics? I don't care. In religion? Don't care (well, maybe a bit more than politics). Let's say in physics or popular logic or something like that)

smusamashah 3 hours ago | parent | prev | next [-]

We expect computers to be consistent on the other hand. A calculator will always give you the same answer unless some chip gets struck by a particle. LLMs are on computers and should be fairly consistent too.

vidarh an hour ago | parent [-]

And this lies at the heart of the problem.

We expect computers to be consistent despite running programs that are not designed to be consistent.

This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.

But for some reason some people choose to assume LLMs should act like a calculator instead of any of those programs.

chrisjj 37 minutes ago | parent [-]

> This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.

The average user has very little. A word processor with inconsistent pagination or a spreadsheet with inconsistent totals is rightly seen as faulty.

cyanydeez 21 minutes ago | parent | prev [-]

a studied example is sampling judicial decisions before lunch and after lunch. judges are more lenient on a full stomach.

bluechair 5 hours ago | parent | prev [-]

Willing to be corrected but I believe this type of automated resume filtering is illegal. Not saying it never happens but my understanding is it is not typical.

thayne 5 hours ago | parent | next [-]

I would expect that to depend on jurisdiction.

I don't know for sure, but I would be surprised if it was illegal in my particular US state. You might be able to argue the AI has inherent biases that introduce illegal discrimination in the hiring process, but my understanding is that winning a case like that would be very difficult, especially since most employers are very cagey about their hiring process and why they made a decision.

small_scombrus 5 hours ago | parent | prev | next [-]

They don't need to actually filter/blackhole to have virtually the same effect.

Show someone a list of resumes with an "applicant score*" and they'll naturally ignore the ones with a low ranking

*scores are generated with AI, mistakes may be made, use only as a guide and verify results

ivan_gammel 4 hours ago | parent | prev | next [-]

In situations when you get hundreds of applications for one open position (real market now), whatever reduces your pool to the size a human can handle, works. You can preserve some diversity metrics in the process. This particular filtering is rather primitive, but LLM as a first filter can definitely do the job. You may burn less tokens than the hourly rate of your HR and it will be fairer than just dumping 50% of unread CVs in trash.

369548684892826 3 hours ago | parent [-]

Great until someone realises you’ve filtered out minority groups from the application process (most developers are men so maybe the LLM decided they’re the best fit, but you’ll never know exactly why it screwed you over) and you suddenly have an expensive lawsuit

TeMPOraL 17 minutes ago | parent | next [-]

LLMs are DEI-aware, as over the past few years their vendors all had various high-profile news stories about their models and their default biases, so it's more likely they'll heavily discriminate in favor of minority candidates, not against them. Still, in both cases it would indicate whoever is operating the system is doing a really, really lazy job. It's really not hard to test and supervise LLMs on tasks where they give you a mere 2-10x leverage, and prompt adherence today is much better than it was 3 years ago.

cyanydeez 15 minutes ago | parent | prev [-]

this happened a decade ago when a US court tried to make sentencing decisions via ML. it was easily demonstrated that the training data was flawed because the justice system was flawed, so the data it was trained on was weighted against minorities because it oversampled them. because, you know, police routinely oversample and poverty forces oversampling

nonetheless, people will defend history as perfect and say those samples, like nepo babies, are "perfect".

elric 3 hours ago | parent | prev | next [-]

Under GDPR, you have the right to request manual processing whenever personal data is processed automatically to make a decision about you that has "significant impact". Not being hired seems like it would qualify.

dgellow 4 hours ago | parent | prev [-]

Illegal where?

CM30 38 minutes ago | parent | prev | next [-]

I think what's more worrying to me (if other systems work like this ATS) is that it seems to judge based on a bunch of factors that will probably disqualify a ton of decent to good participants.

For example, 65 points are given for a mix of personal projects and open source contributions. Which is great if your one and only interest is in tech, and you don't have a family, dependents or a second/third job. If you have any of those other things, well the odds seem like they're incredibly stacked against you.

And it makes me wonder how many of these systems are stacked in favour of wealthy people with a near special interest level of obsession with tech and no worries outside of going to college/working a single job in their industry of choice.

ryukoposting 5 hours ago | parent | prev | next [-]

At this point we might as well adopt that joke where you blindly throw away half the resumes because you don't want to hire unlucky people.

taffronaut 23 minutes ago | parent | next [-]

At one point in the past a major UK medical school adopted random selection for qualified candidates (Barts and The London School of Medicine and Dentistry - part of Queen Mary University of London). The approach benefitted qualified students from less well-off backgrounds vs those who can afford to win at the ever more elaborate (manual at the time) hurdles of resume assessment criteria and effectively game the system. There was an orchestrated campaign against the lottery around "Why gamble with would-be doctors?". Random selection was quietly dropped.

agnosticmantis 2 hours ago | parent | prev | next [-]

A person's total luck is constant over a lifetime. The remaining half of the candidates already spent some of their luck in this selection, so they'll be on average less lucky than the discarded half.

t-3 an hour ago | parent | next [-]

No, luck would be some expression of the difference between the average and the individual outcomes - it only exists relative to a population at the point in time when it is measured.

latexr an hour ago | parent | prev | next [-]

Even assuming that was genuinely how luck works, the conclusion does not follow from the premise because it’s obvious not everyone “starts with” the same amount of luck to spend.

throwawaythekey 2 hours ago | parent | prev [-]

> A person's total luck is constant over a lifetime

Ah yes, the much revered cosmological fairness constraint.

cyanydeez 10 minutes ago | parent [-]

everyone knows luck is tied to the wealth-gravity and increases as the inverse distance to the density of matter. but because it's relative, everyone thinks they have the same luck when not observing others.

zipy124 an hour ago | parent | prev | next [-]

Or, more to the point: there are generally far more qualified applicants than job roles. That is, training and education have greatly expanded over the last couple of decades to produce more and more job seekers, whilst job creation hasn't really kept pace.

pjio 2 hours ago | parent | prev [-]

This hurts more than it should.

jerrythegerbil 5 hours ago | parent | prev | next [-]

> I fail 65% of the time. Same exact resume, different luck.

As someone who’s run hiring pipelines for technical roles in the past few years, that’s actually a fantastic number. I objectively hate saying that, but it’s true.

35% chance of elevating a technical individual to the next stage with no effort? I’ve seen as many as 100+ applicants an hour even when including a domain specific screener question. That’s 35 “screened” applicants in an hour. Were valid candidates screened out? Yes. Do you still have a candidate pool 35x larger than you need? Unfortunately, also yes.

The volume of applicants is SO HIGH that your chances of getting moved to the next stage are actually markedly worse if AI isn’t involved. If you didn’t apply immediately (using an AI bot) there’s 50+ people ahead of you, and an exhausted technical leader if they ever make it to your resume.

Referral bonuses exist for a reason.

PufPufPuf 4 hours ago | parent | next [-]

In that case, I have a pre-screening system to sell you. Through state of the art technology, it only lets through the best* 1% of applications.

*According to our proprietary, undisclosed, non-deterministic metric, which may or may not be Math.random

rvba 2 hours ago | parent [-]

Reminds me of this

https://stackoverflow.com/questions/16833100/why-does-the-mo...

ludicrousdispla 3 hours ago | parent | prev | next [-]

So the logical solution is for candidates to submit multiple applications with slight variations to their contact info, "John Schmidt", "John J. Schmidt", "John J. J. Schmidt", "John Jacob J. Schmidt", "J. J. Jingleheimer Schmidt", etc.

kyralis 5 hours ago | parent | prev | next [-]

Is it? Or is it a 65% chance of a resume getting ignored before a single human sees it, reducing your pipeline's likelihood of catching qualified candidates by the same?

Gates that reduce resume flow-through are only useful if their reduction is correlated with quality. Otherwise they're just dragging out your hiring process or unnecessarily causing you to ultimately lower your hiring bars.

jerrythegerbil 5 hours ago | parent | next [-]

> Gates that reduce resume flow-through are only useful if their reduction is correlated with quality.

The volume makes it infeasible to review everyone for quality, even at an hour scale. The conclusion and solution is inevitable, though I wish it were different. 35% is actually really good if you’re not coming in through a referral.

The current reality is <1% and the person reviewing you is exhausted.

falsemyrmidon 3 hours ago | parent | next [-]

You may as well just randomly pick 65 to discard, if your only goal is to reduce the number for review.

sevenzero 4 hours ago | parent | prev | next [-]

What an inhumane way of looking at this. Hiring is deeply flawed, you know it, and yet you keep job postings open for weeks/months in case "the one" magically appears on your doorstep instead of just interviewing 10-20 people and just picking one...

Corpo bullshittery at its finest.

LinXitoW an hour ago | parent [-]

What's the alternative? Everyone's up in arms, but I see ZERO viable alternatives proposed.

If you have 1000 applications for every job, and you know that a bunch of these applications are "a bad fit", to put it mildly, you have to filter. And you cannot realistically give every resume a good, human look. By the time HR would be done, the market has already moved on five times.

So, what is the real difference between being overlooked because HR could only look at the first 100 resumes, or the AI filtered all 1000 resumes down to 100? In the end, a fuckton of potentially great people get their feelings hurt either way.

sevenzero an hour ago | parent [-]

>instead of just interviewing 10-20 people and just pick one

Here's a realistic proposition. HR just wants to inflate numbers so that they seem busy looking for the right fit. Keep posting open for 1 week, manually filter for another week, invite people, employ one. Plenty of people with degrees looking for jobs right now, I don't see what's the issue with just trying one. Companies desperately look for the "magic" applicant that checks all boxes, while also trying to pay them almost minimum wage.

Brian_K_White 5 hours ago | parent | prev [-]

This reasoning isn't.

bagels 5 hours ago | parent | prev | next [-]

The goal for the interviewer is to have a much higher ratio of good/bad candidates after the first screening. This means the more costly time you spend on the second step has a better return.

aesthesia 4 hours ago | parent | prev [-]

So the question is: is the score given by this system correlated with candidate quality? I don't think this post gives enough data to know.

recursivecaveat 2 hours ago | parent | prev | next [-]

If you have no requirements for accuracy, you can just advance 35% of applicants at random.

If the first 50 people who apply are all bots, why are you reading resumes in order of submission?

spike021 4 hours ago | parent | prev | next [-]

there have got to be better ways to optimize pipelines. maybe set a limit on number of applications for a role based on the number you/your team can reliably go through them. if more are needed then open the role for another wave of applications.

IshKebab 3 hours ago | parent | prev | next [-]

I wonder if you could solve this for programming specifically as follows:

1. Give them some easy leetcode questions. Nothing that a competent programmer would have any problem with.

2. If they pass, ask for a deposit of like $20. Shouldn't be an issue for people who are actually serious.

3. Do more simple leetcode questions but this time on zoom so you can tell if they are using AI. If they pass that they get the deposit back.

(Yeah I know there are real-time interview cheat AI programs but based on what I've seen on demos of them it's super obvious when they're being used.)

Probably not practical but just a thought!

lowbloodsugar 4 hours ago | parent | prev [-]

Except the bit about ranking a decades long S3 engineer lower than an intern with GitHub repo.

Aurornis 4 hours ago | parent | prev | next [-]

> The default model is gemma3:4b

That’s a tiny model. No LLM is going to be a perfect and repeatable judge, but a tiny 4B model is like plugging an RNG into this system.

This whole exercise feels like someone vibe coded an ATS and got it to the point where the tests were passing because they decided they should have an open source ATS project.

danpalmer 3 hours ago | parent [-]

This sort of model is fine for small problems, when used in the right way. I think there's probably a version of Resume analysis that would work well with this model, but "hey clanker, what projects has this person done" is not the way. You need extraction, cleanup, probably OCR to compare and further clean up, multiple analysis passes per signal with LLMs, judges, etc. None of that needs to be large models, you'll get marginally better performance, but there's very little context, these models will perform well when used correctly.

seanieb 31 minutes ago | parent | prev | next [-]

It's always amazed me that a tech company will pay $300,000+ for a good engineer, because talent is so hard to find... meanwhile their recruiter operates unsupported and has a very different idea about what good looks like. Their ATS black-holes >50% of the resumes because its filtering heuristics are garbage, because recruiting selected the ATS because it has a Google Gmail integration or something, and the ATS's filtering technology was not reviewed by anyone in the engineering or data teams.

gs17 4 hours ago | parent | prev | next [-]

I'm a little confused, is this an ATS system that anyone actually uses? If not, I'm not sure how it's better than just asking ChatGPT to score your resume out of 100. Why would you want to optimize your resume for a system no one is using to score it?

Bukhmanizer 3 hours ago | parent | next [-]

I would assume at least hackerrank is?

I don’t think the point of a lot of this is to optimize your resume. It’s to show how arbitrary these systems are.

marticode 34 minutes ago | parent | prev | next [-]

From my understanding this one is used for hiring tech workers only. The (very) widely used Workday application system for ex seems to have its own built-in ATS.

40four 3 hours ago | parent | prev | next [-]

“I'm a little confused, is this an ATS system that anyone actually uses?”

You read my mind. If the answer is “no”, then we can ignore this.

another-dave an hour ago | parent [-]

For one, if you go on to Hacker Rank's "Screen" page, they mention the product is used by Stripe/AirBnB/LinkedIn/Atlassian/IBM etc etc. I imagine that there's plenty more companies using it too.

But I'd also assume that their competitors are doing something similar so I don't think we as an industry can just ignore that it's happening.

petesergeant 3 hours ago | parent | prev [-]

(Almost) everyone’s using some kind of ATS, every ATS is adding AI auto-ranking (and has been trying to for 15 years), and almost all HR people feel like they have too many obviously bad CVs to read. Whether or not someone is using this ATS specifically, if you submit several CVs to several places, your CV is going into at least one magical 8-ball.

saidnooneever 2 hours ago | parent | prev | next [-]

Count to three, no more, no less. Four shalt thou not count, neither count thou two—excepting that thou then proceed to three. Five is right out.

kailpa1 2 hours ago | parent | prev | next [-]

From `resume_evaluation_system_message.jinja`

> *SCORES MUST NEVER DEPEND ON THE FOLLOWING FACTORS:*

> - College, university, or educational institution name

> - CGPA, GPA, or academic grades

I don't understand why they would omit these factors from the evaluation.

swiftcoder 2 hours ago | parent | next [-]

> I don't understand why they would omit these factors from the evaluation.

Only hiring MIT graduates sounds great to a lot of tech folks! Automatically rejecting applicants from HBCUs, however, sounds like a lawsuit.

As to the GPA thing, I think it's just to stop the LLM glomming onto an obvious numerical grade? LLMs like to rank things by obvious dimensions, and whether someone had a 4.0 or a 3.8 in grad school makes very little difference to their performance 10 years down the line.

sph 2 hours ago | parent | prev [-]

Hopefully so that people like me, that dropped out of high school yet have had a successful career as a self-taught engineer, have a chance. [1]

Just kidding, my resumes are sent to /dev/null like everybody else’s.

——

1: In fact, I will be controversial and say that self-taught engineers tend to be the strongest in their own particular niche, because they are powered by sheer desire to learn and improve. I am routinely appalled by how many people go on forums to ask how to learn a new thing, completely unable to self-direct their learning. I blame the modern school system.

kailpa1 2 hours ago | parent [-]

I'm a self-taught programmer as well, who dropped out of university, and these factors being omitted would benefit me as well, but I feel like good grades and a good university are still indicators that someone is, or is capable of becoming, a good programmer.

This system would drop a Harvard top graduate for someone having a year of experience in some outsourcing firm.

goosejuice an hour ago | parent | next [-]

> I feel like good grades and a good university are still indicators that someone is, or is capable of becoming, a good programmer.

Really depends on the program. In my undergrad program there were some very smart CS students who got great grades but really struggled with the programming. Smart and capable people can be bad at programming and lack many qualities that make for a good hire.

sph an hour ago | parent | prev [-]

I started in an outsourcing firm (body rental actually) but I definitely get your point. Maybe they optimize for real world experience, or rather, how one is used to workplace politics and logistics. The top grad will have higher expectations, and all they want is a cog for the Machine.

cs02rm0 42 minutes ago | parent | prev | next [-]

I feel like hiring is all a bit broken. Roles get flooded with applications, it's chance whether your CV gets through, then there are hiring rounds that seem designed to make you quit the process before they have to filter you out.

Is it working for anyone, on any level?

luckylion 37 minutes ago | parent [-]

I'm on the other side, and my main tip (at least if there are people like me!) is: avoid the usual AI signs.

For one role we got ~70 applications and all CVs looked obviously AI-written. I don't know whether the people did actually do any of the things mentioned and I don't have the time to find out, so the AI-written CVs are a discard-signal for me. (Either those people delegated a very important task to AI and didn't even bother to check, or they are bad at using AI and don't know -- I want neither)

Any CVs that signal they were actually written by a person I will actually look at.

davidpapermill 3 hours ago | parent | prev | next [-]

A better way to reformulate this problem is for the LLM to be tasked with making a _comparative_ judgement between two CVs. This should prove much more reliable, especially if you give it a third “too close to call” option. You can also ask for clear justifications of preference.

srdjanr 2 hours ago | parent [-]

That's a good idea.

The only drawback I see is that you should compare every pair of CVs for best results, and that grows quadratically with the number of CVs. Of course you can settle for fewer comparisons and not perfect results. But then I'm not sure if you can hit a good ratio of quality and token spend.

skribb 2 hours ago | parent | next [-]

Could probably do an Elo system and sample pairs. E.g.

1. Set the Elo rating of all CVs to 1000

2. Randomly pair up CVs and compare. Winners gain Elo, losers lose Elo.

3. Repeat #2 for a few iterations, then remove the bottom X% of CVs.

4. Repeat 2-3 until the number of remaining CVs is small enough to do an exhaustive comparison.

I don't have a mathematical proof, but I suspect that this is a decent cost-effective approximation of comparing every pair (depending on the parameters).
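
A rough sketch of the steps above, for anyone who wants to play with the parameters. Everything here is an assumption for illustration: the `judge` callable stands in for the LLM's pairwise comparison (returning "a", "b", or "tie"), and the K-factor of 32, three rounds per stage, 25% cut, and final pool of 4 are arbitrary defaults, not tuned values.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical judge: given two CVs, returns "a", "b", or "tie".
Judge = Callable[[str, str], str]


def elo_update(ra: float, rb: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Standard Elo update. score_a is 1.0 (a wins), 0.0 (b wins) or 0.5 (tie)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    delta = k * (score_a - expected_a)
    return ra + delta, rb - delta


def elo_tournament(
    cvs: List[str],
    judge: Judge,
    rounds: int = 3,
    cut: float = 0.25,
    final_size: int = 4,
    seed: int = 0,
) -> Tuple[List[str], Dict[str, float]]:
    """Sample random pairs, update ratings, and prune the bottom `cut` fraction."""
    if not 0.0 < cut < 1.0:
        raise ValueError("cut must be strictly between 0 and 1")
    rng = random.Random(seed)
    ratings = {cv: 1000.0 for cv in cvs}  # step 1: everyone starts at 1000
    alive = list(cvs)
    while len(alive) > final_size:
        for _ in range(rounds):  # steps 2-3: a few rounds of random pairings
            rng.shuffle(alive)
            # With an odd number of CVs, the last one sits this round out.
            for a, b in zip(alive[::2], alive[1::2]):
                outcome = judge(a, b)
                score_a = {"a": 1.0, "b": 0.0}.get(outcome, 0.5)
                ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
        alive.sort(key=lambda cv: ratings[cv], reverse=True)
        keep = max(final_size, int(len(alive) * (1.0 - cut)))
        alive = alive[:keep]  # remove the bottom X%
    return alive, ratings
```

The survivors returned at the end are the small set you would then compare exhaustively (step 4). Each stage costs roughly `rounds * len(alive) / 2` judge calls, so the total stays far below the all-pairs count, at the price of some chance that a good CV gets pruned by a few unlucky comparisons.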

swiftcoder 2 hours ago | parent | prev [-]

> you should compare every pair of CVs for best results

Or compare each one to a reference set? Take 5 resumes of existing employees, rank all candidates against that set, maybe you get some useful level prediction into the bargain
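
To put numbers on the trade-off, here is a small sketch comparing the cost of all-pairs comparison with the reference-set idea. The `judge` callable is a hypothetical stand-in for the LLM's pairwise comparison (returning "a", "b", or "tie"), and scoring a candidate as the fraction of reference resumes it beats is just one assumed way to turn the comparisons into a level.

```python
from typing import Callable, List

# Hypothetical judge: given two resumes, returns "a", "b", or "tie".
Judge = Callable[[str, str], str]


def all_pairs_cost(n: int) -> int:
    """Number of comparisons needed to compare every pair of n CVs."""
    return n * (n - 1) // 2


def reference_cost(n: int, k: int) -> int:
    """Number of comparisons when each of n CVs is compared to k references."""
    return n * k


def level_against_references(candidate: str, references: List[str], judge: Judge) -> float:
    """Fraction of reference resumes the candidate beats (a tie counts as half)."""
    if not references:
        raise ValueError("need at least one reference resume")
    points = 0.0
    for ref in references:
        outcome = judge(candidate, ref)
        points += {"a": 1.0, "b": 0.0}.get(outcome, 0.5)
    return points / len(references)
```

For 100 candidates, `all_pairs_cost(100)` is 4950 comparisons, while `reference_cost(100, 5)` is 500, and the reference approach grows linearly rather than quadratically.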

speedgoose an hour ago | parent | prev | next [-]

Many em dashes and a "This is not, it is…" later, I think this article would have been a much better critique if it didn't use an LLM to (re)write some parts of it.

another-dave an hour ago | parent [-]

I always find it funny when a technical crowd starts picking on em dashes as a sure sign of AI. I mean, are keyboard shortcuts really that difficult for developers? Some of us always knew how to use correct punctuation, even before LLMs existed.

Also, neither "this is not" nor "it is" appears at all in the article?

speedgoose an hour ago | parent [-]

It’s a lot of them. It’s a style. I know some people who used them before and use them less nowadays.

> This non-determinism isn’t a bug you can just fine-tune away, it’s a fundamental design flaw.

makeavish 4 hours ago | parent | prev | next [-]

Hiring and job search have been so hard, and AI has amplified the existing problems instead of solving any.

sevenzero 4 hours ago | parent [-]

Wdym, can't you just litter your applications with buzzwords and other bs to automatically get a high score in these systems?

szszrk 3 hours ago | parent | next [-]

The HR market is basically in the early Google-rigging era, when you could place hundreds of keywords in the footer (white text on white background) to start popping up on random searches.

makeavish 19 minutes ago | parent | prev [-]

I have been on both sides of the market. And it sucks so bad at both ends. Companies which deeply care about their next hire are struggling to hire, and actual great people looking for work are outcompeted by AI slop and AI bulk applying.

It is actually a very hard problem to solve.

tasuki 3 hours ago | parent | prev | next [-]

> Sometimes my projects “lack architectural complexity”

Well done you! It is difficult to avoid architectural complexity, but imho well worth it.

bryanrasmussen an hour ago | parent | prev | next [-]

>If your company’s cutoff sits at 85, I fail 65% of the time. Same exact resume, different luck.

Your resume's reception is always affected by random factors, only now you are able to test, debug and technically critique the randomness.
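
As a sketch of what testing the randomness could look like: run the scorer several times on the same resume and count how often it lands below the cutoff. The scores used here are only the three from the headline (90, 74, 88), which is far too few runs to estimate anything reliably, and the strict "below the cutoff fails" rule is an assumption.

```python
from typing import Sequence


def fail_rate(scores: Sequence[float], cutoff: float) -> float:
    """Fraction of runs that land strictly below the cutoff."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(s < cutoff for s in scores) / len(scores)


def score_spread(scores: Sequence[float]) -> float:
    """Difference between the best and worst score for the same resume."""
    if not scores:
        raise ValueError("need at least one score")
    return max(scores) - min(scores)


# The three scores from the headline, for the same resume.
headline_scores = [90, 74, 88]
```

With an 85 cutoff, one of those three runs fails, and the spread between best and worst run is 16 points on an unchanged resume.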

realty_geek 3 hours ago | parent | prev | next [-]

Why doesn't something like this exist for real estate? A popular open source AVM (automated valuation model) that helps home sellers get an idea of what their home will sell for. Right now it seems AVMs are mainly seen as just a way to capture leads. Every estate agent will tell you they have some magic recipe that makes their valuation better than anyone else's. I have had a bunch of ideas on how to approach this, but I really could do with a collaborator or two.

gebruikersnaam 3 hours ago | parent [-]

The article raises a lot of questions the article already answered.

realty_geek an hour ago | parent [-]

No idea.

dc3k 5 hours ago | parent | prev | next [-]

Disregarding the fact that this thing is completely broken, its grading rubric is ridiculous to begin with (as was mentioned in the article itself, but I must reiterate how completely stupid this is):

> 35 points for open source contributions

> 30 for personal projects

I don't contribute to open source or have personal projects because I don't spend my free time doing what I do 40 hours a week to make a living. My 15 years of work experience is worth a maximum of 25%, so any company using this idiotic system would pass on me immediately. Open source and personal projects are fine, but in no sane world are they worth 65% of a resume's score.

adrianN 5 hours ago | parent [-]

They are selecting for people who are fine working in their free time. If you contribute to open source you are more likely to contribute to the company on weekends. If instead you have other hobbies or a family that takes up non-work hours you are more likely to drop your pen after forty hours.

matheusmoreira 4 hours ago | parent | next [-]

Maybe they're selecting for intrinsic motivation. People who enjoy programming to the point they do it for fun, not just because it pays.

Free software work doesn't imply we work for free. We work on our projects, the stuff that we actually enjoy working on. Nobody is going to work on corporate products without adequate compensation.

lukan 4 hours ago | parent [-]

"Nobody is going to work on corporate products without adequate compensation."

I guess there sadly are many nobodies who do this to hope to become somebody.

matheusmoreira 4 hours ago | parent [-]

If the open source work is part of a hiring pipeline, sure. Contributing to some repository and having it serve as a resume that gets you hired is also a form of compensation. If the work is also enjoyable, then it's a win either way.

another-dave an hour ago | parent | prev | next [-]

> If you contribute to open source you are more likely to contribute to the company on weekends

I wonder if that assumption is borne out in reality though?

I'd imagine if someone's OSS contributions are enough of a factor that it's worth hiring them, they're not going to drop it on a whim to work extra hours on the day job.

(Assuming you weed out open source contributions like "I made a todo list app in React but licenced it as MIT" or "I fixed a typo in the docs for NextJS".)

emj 5 hours ago | parent | prev | next [-]

You might have numbers on that, but after working in a place with a strict no-more-than-40-hours policy, my view is that people overwork for many reasons. Being an open source enthusiast is not one of them.

stevesimmons 4 hours ago | parent | prev [-]

I'm not sure that follows. I stopped making open source contributions when I switched from mature companies to startups.

Now all my "non-work" time is spent on startup work. And none of that is visible via GitHub.

cemoktra 3 hours ago | parent | prev | next [-]

So sending my CV to every company three times should get me past the ATS?
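
A back-of-the-envelope check of that idea, taking the "I fail 65% of the time" figure quoted elsewhere in this thread at face value. The big assumption, labeled here because nothing in the article confirms it, is that each submission is scored independently and that the ATS does not deduplicate repeat applications.

```python
def pass_at_least_once(p_fail: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent runs passes."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be a probability")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - p_fail ** attempts


# Assumed per-run failure rate of 65%, three independent submissions.
three_tries = pass_at_least_once(0.65, 3)  # 1 - 0.65**3, about 0.725
```

So under those assumptions three submissions would lift the pass chance from 35% to roughly 73%: better, but not a guarantee.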

rkuska 5 hours ago | parent | prev | next [-]

This reminds me of my former CTO. He would take a bunch of CVs and randomly throw some of them in a bin. He didn’t want to work with “unlucky” people.

psalaun 5 hours ago | parent | next [-]

I thought this was only an old urban legend; some people actually use this technique? Especially in a trade supposed to be led by people trained in sciences?

gregates 2 hours ago | parent | next [-]

Given how often it's been mentioned here, it's likely that this is an urban legend that people are pretending to have first-hand knowledge of for karma. In a trade that's supposed to be led by people trained in sciences, no less!

(A more charitable interpretation would be that aforementioned CTO was making a joke that didn't land.)

aquariusDue 3 hours ago | parent | prev [-]

It's OK! We can disguise it as the Secretary Problem and it'll be fine, we could even write a post on the company blog about it. /s

https://en.wikipedia.org/wiki/Secretary_problem

hahahaa 5 hours ago | parent | prev [-]

The problem is with this system he only worked with unlucky people.

YossarianFrPrez an hour ago | parent | prev | next [-]

Looking at the linked scoring prompt (resume_evaluation_criteria.jinja) [0], I immediately see several red flags that suggest the output won't be reliable. (I'm developing an LLM intensive application where the stakes are high enough that I need the LLM output to be reasonably correct.)

[0] https://github.com/interviewstreet/hiring-agent/blob/main/pr...

In no particular order:

1. The prompt is trying to get the system to do all of the evaluation steps at once. Instead, the system should break down the task of resume evaluation into its subcomponents and have separate prompts for each component. Like "evaluating open source contributions" should be its own task. Same with "assessing the complexity of software projects on the resume." Fwiw, each of the tasks contained within the prompt is woefully underspecified.

2. The prompt leaves spreads of ~10 points up to the LLM, when it's doubtful that humans are that well calibrated. Take for example:

  > SCORING CRITERIA Open Source (0-35 points) 
  HIGH SCORES (25-35 points):
   - Contributions to popular open source projects (1000+ stars)
   - Significant contributions to well-known projects
   - Google Summer of Code (GSoC) participation
   - Substantial community involvement

Are all of these 35-point examples? Is one a 26-point example? If not, what's the difference? If an expert can't reliably make the judgement, the LLM is going to struggle too. One partial fix is to get rid of the ranges and just say all of these are worth 30 points. An additive point scheme would be better...

3. The authors of this prompt have left an incredible number of judgement calls up to the LLM, when that's the very thing you want to minimize. Using the same example as above...

- Are all contributions to open source projects with 1000+ stars equal?

- What counts as a "significant contribution"? Doesn't that imply that the LLM has to know or read through all of the commits in like the last ~6 months at minimum for the project to understand what the given contribution meant to the project? That itself isn't impossible with tool usage, but again, that'd be a separate task.

- What on earth counts as "Substantial community involvement"? Why didn't the prompt authors define this, or at least give a few examples?

Honestly at this point maybe someone should build a tool that scans prompts for adjectives...

4. This sort of thing is just asking for trouble:

  > SCORES MUST NEVER DEPEND ON:
   Candidate's name, gender, or personal demographic information

Just remove this stuff before you send the rest of the resume to the LLM. Even if you ask it not to, it's not a person, it's a very fancy statistical distribution generator. All of the input (including the name) will affect the distribution that gets generated. (This one is not unlike Andreessen's "don't be a sycophant" prompt.)

5. Obviously this one depends on the LLM in question, but instead of writing things like:

  > DO NOT RETURN A RESUME SUMMARY. RETURN ONLY THE SCORING EVALUATION IN THE SPECIFIED JSON FORMAT. Analyze the following resume and provide a JSON response with this EXACT structure (all fields are required):...

The system should utilize the "structured output" option, which guarantees a fixed output format. Also, fwiw, the JSON should force the LLM to pick between categorical options as much as possible. Forced-choice structured output should, at least in theory, cut down on hallucinatory responses and constrain judgement calls.

6. One major thing that's not in the prompt is anything about traceability. This system should be designed so that humans can review the logs and make sure this is working as intended.

7. Another thing that is missing in the file is what I'll call evidence of a theory of coding / coder quality. Most of the examples are designed to have the LLM assess proxies for code quality, not code quality itself. Surely both should be taken into account?

I'm not an expert at evaluating coders. But two pretty basic LLM-answerable things I would ask are: How well do a candidate's 5 most recent commit messages match the contents of those commits? Do the claimed technical skills on the resume match their GitHub code? (i.e., if they say they know R, is there any evidence of that on their GitHub?)

8. The prompt also seems unaware of what it's asking the LLM to do:

  > LIVE DEMO BONUS: Projects with working live demos should receive 10-20% higher scores

This implies that the LLM can use tools, but even then, I'd be pretty wary of its ability to fully execute this part of the prompt without more detailed instructions, examples, and guidance. There are very likely tons of edge cases here.
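
To make points 2 and 5 concrete, here is a minimal sketch of an additive scheme fed by forced-choice fields: the LLM only fills in categorical or countable evidence, and the points are added up in ordinary code. All field names, bands, and point values are hypothetical and chosen for illustration; none of them come from the linked repository.

```python
from dataclasses import dataclass
from enum import Enum


class StarBand(Enum):
    """Forced-choice band for the largest project contributed to."""
    NONE = "none"
    UNDER_1K = "under_1k"
    OVER_1K = "over_1k"


# Hypothetical fixed point values (illustrative only).
STAR_POINTS = {StarBand.NONE: 0, StarBand.UNDER_1K: 5, StarBand.OVER_1K: 15}


@dataclass(frozen=True)
class OpenSourceEvidence:
    star_band: StarBand
    merged_prs: int
    gsoc: bool


def parse_evidence(raw: dict) -> OpenSourceEvidence:
    """Validate the model's JSON reply; anything off-schema raises ValueError."""
    if not isinstance(raw["gsoc"], bool):
        raise ValueError("gsoc must be true or false")
    return OpenSourceEvidence(
        star_band=StarBand(raw["star_band"]),  # unknown band -> ValueError
        merged_prs=max(0, int(raw["merged_prs"])),
        gsoc=raw["gsoc"],
    )


def open_source_score(ev: OpenSourceEvidence) -> int:
    """Additive score out of 35: same evidence always gives the same points."""
    points = STAR_POINTS[ev.star_band]
    points += min(ev.merged_prs, 5) * 2  # 2 points per merged PR, capped at 10
    points += 10 if ev.gsoc else 0
    return min(points, 35)
```

The model can still extract the evidence wrongly, but the arithmetic no longer varies from run to run, and a human reviewing the logs can see exactly which fields produced which points.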

nnevatie 34 minutes ago | parent | prev | next [-]

> An LLM is called

Hooray for incidental non-determinism.

bhanu786 2 hours ago | parent | prev | next [-]

ATS resume checkers usually look at keywords, formatting, and spacing, and give a score accordingly, as if comparing against some reference format. Depending on how far a resume strays from that format, it might get low scores.

pu_pe 3 hours ago | parent | prev | next [-]

He tried with a tiny model (gemma3:4b), got a range from 66 to 99. Then tried again with a small model (gemini 3.1 flash lite), the range was 48 to 64. Would a frontier model be more consistent? Perhaps this tool was optimized for more capable models?

srdjanr 2 hours ago | parent [-]

It makes sense to me intuitively (though I'm not sure if my reasoning is actually correct).

A worse model may not "know" enough to distinguish between a 70 and a 100 candidate, so it's expected that its output has high variance. But a better model might "know" enough, so it can be more confident and thus more consistent.

swingboy 35 minutes ago | parent | prev | next [-]

I’ve always assumed any LLM output that was some type of rating or score was bullshit. Unless the LLM writes a Python script to calculate the score (and even then…) then the score it outputs is just the next most likely token, taking into account temperature and what not.

You see a lot of frameworks for things like spec-driven development make use of scoring how good the spec/design/plan is and it’s like, uhhh…

joelthelion 31 minutes ago | parent [-]

> is just the next most likely token, taking into account temperature and what not.

This doesn't mean anything. All LLM output is like that.

That said, I agree that LLMs are terrible at grading stuff, except perhaps if you give them a very detailed evaluation grid.

0xpgm 3 hours ago | parent | prev | next [-]

With these kinds of ATS systems, is it still a thing to optimize for a one page resume that is easy for a human reviewer to scan, or just include enough buzzwords and external links to try and please the LLM?

jorisw 2 hours ago | parent [-]

I wouldn't assume based on this one thread/article that this is what you need to optimize your resume for. Nor that a majority or even significant group of reviewers is even using LLMs. I've been involved in hiring pipelines and never even thought of using LLMs to review incoming candidates.

However given the time constraints reviewers have, yes, the former (making a resume easy to consume quickly) is a huge help.

ChicagoDave 3 hours ago | parent | prev | next [-]

I was inspired by this. I made a Claude skill to take my resume and compare it to any job description to point out viability and gaps. Pretty cool skill. I'll post it somewhere.

padolsey 2 hours ago | parent | prev | next [-]

This is just the 'LLM judge', very badly implemented without any scientific prudence. What a joke. To be terse: you cannot rely on LLMs to provide standardized scores against arbitrary criteria. To get close to 'reliable' you would need highly tested rubrics, grounded in human decision-making, and you'd need to avoid all the measurement biases these things are riddled with... positional/order effects, anchoring on whatever numbers you stuffed into your own prompt, scale-format sensitivity (a 1–5 and an A–E scale give different answers for the same input), holistic-vs-isolated context effects, and lovely examples like where adding a "be unbiased" instruction makes it more biased. I've studied this at length. You cannot even _begin_ to approach this problem seriously without held-out validation, inter-rater agreement, and ground truth. This repo is just a quagmire of wishful vibes with random numbers littered throughout.
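
For readers unfamiliar with inter-rater agreement, a minimal sketch of one common measure, Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. Treating the LLM as one rater and a human reviewer as the other, and using pass/fail labels, are assumptions made for this example; it is not something the repo does.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 means the two raters agree well beyond chance; a value near 0 means the agreement is about what random labelling with the same frequencies would give. The same function can be applied to two runs of the same model on the same resumes to check self-consistency.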

neya 5 hours ago | parent | prev | next [-]

I wonder how this is even legal. The only useful job the HR departments are ever required to do - they decide to automate it? Aside from being a daycare for adults, what exactly does HR accomplish? It's clearly NOT on the side of employees, but this seems like they're clearly NOT on the side of employers, either.

While resumes are being filtered left and right, they just make TikToks on the company's dime [1]. What a sad state of affairs.

[1] https://www.youtube.com/shorts/wSug80Vg5JU

srdjanr 2 hours ago | parent [-]

They could be using this just to throw out the obviously bad CVs, and then manually go over the rest. I'm not sure if they do this in practice, but the tech itself can be useful.

Also if HR was really useless (or actively hurting the company) they wouldn't still have a job (or they'll lose it eventually). No one likes burning money for no reason. So obviously they are doing something useful.

syockit 2 hours ago | parent [-]

The last time I heard HR being completely let go was with a fintech company Bolt. Then again, that company was midsized, around 200-500 people or so. For larger companies, it's going to be difficult to even realize that HR is redundant in the first place.

steve_j_choi 5 hours ago | parent | prev | next [-]

This could be used as a good way to self-evaluate one's current position from the company's point of view. You would tweak the prompts and guidelines to match what the company expects and see how you score.

hahahaa 5 hours ago | parent [-]

I sort of hope we land on 2 agents, one working for the candidate and one for the employer, doing a screening round. Salary compatibility could be negotiated by a 3rd-party bot that knows both parties' ranges and what would be needed at each end of the range, and figures out yes/no on whether it's worth going ahead. Such a time saver.
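
The yes/no part of that is simple enough to sketch. This assumes each side privately gives the third party a (minimum, maximum) range in the same currency and period; the function names are made up for illustration.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]  # (minimum, maximum), same currency and period


def salary_overlap(candidate: Range, employer: Range) -> Optional[Range]:
    """Return the overlapping part of the two ranges, or None if there is none."""
    low = max(candidate[0], employer[0])
    high = min(candidate[1], employer[1])
    return (low, high) if low <= high else None


def worth_proceeding(candidate: Range, employer: Range) -> bool:
    """The third party only reveals yes/no, never either side's range."""
    return salary_overlap(candidate, employer) is not None
```

For example, a candidate range of (100, 130) and an employer range of (120, 150) overlap at (120, 130), so the answer is yes without either side learning the other's limits.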

jdw64 2 hours ago | parent | prev | next [-]

It seems like the design is flawed, probably because the scoring structure and conditions are wrong. And originally, due to the nature of LLMs, even if the input is unstructured, when you design something like a RAG system, you usually need to create a verifiable evidence table. Even with that, the scores are still probabilistic by nature, but at least they stay within an error distribution that I can verify. But it doesn't seem like there's any such evaluation criteria here.

Typically, retrieval should be tied to evaluation metrics, evidence should be linked to scores, and you also need to account for parsing errors.

But personally, I'm weak to these kinds of ATS systems (ugly appearance, non-native English speaker, didn't go to a good university), so if this kind of filtering existed, I probably would have never had a job in my entire life. Come to think of it, even now I don't have a proper job—I just bid on projects at the lowest price and implement them. So maybe it doesn't really matter whether such a system exists or not

cyberax 5 hours ago | parent | prev | next [-]

Ah... The AI learned the old HR trick: take 50% of resumes and throw them out without looking. Rationale: "we don't need unlucky losers".

worldthruword 3 hours ago | parent [-]

There are plenty of resumes in the sea. Assuming thorough mixing and statistically speaking, throwing out 50% of resumes is a good enough heuristic.

brikym 4 hours ago | parent | prev | next [-]

So that's where the Windows XP file copy dialog author now works.

quink 5 hours ago | parent | prev | next [-]

"A computer can never be held accountable, therefore a computer must never make a management decision."

maxignol 3 hours ago | parent | prev | next [-]

Are many people using the HackerRank ATS?

diimdeep 3 hours ago | parent | prev | next [-]

They forgot to add "masterpiece" /s https://www.youtube.com/watch?v=mcYl70vq_Ns

https://github.com/interviewstreet/hiring-agent/blob/main/pr...

carb 2 hours ago | parent | prev | next [-]

It's a good analysis but the AI slop writing makes me not trust you've reviewed this and I'm unable to finish or subscribe. I'm sure you're a great blogger but this is holding it back!

rvz 2 hours ago | parent | prev | next [-]

I see.

> LLM is called six times to extract structured information

Followed by

> The default model is gemma3:4b, running at temperature 0.1 — low, supposedly nudging the model toward deterministic outputs.

This is exactly why hiring is even more broken: because the people looking for candidates are also just as unqualified, if not more.

Using much weaker LLMs to replace the person in charge of the final judgement call is the wrong solution as this is a plain old social problem.

Even if you wanted to use LLMs for this case, the default configuration and model choice are laughably flawed. This LLM can’t be trusted as it doesn’t even know what it is reading.

The correct solution is either advanced OCR with keyword ranking with a basic filter or a far stronger LLM that excels at document / vision parsing benchmarks with an experienced person making the final judgement call in case the technology misses a critical detail.

Rather than using this less accurate one that hallucinates out its decision depending on a dice roll.

Traubenfuchs 2 hours ago | parent | prev | next [-]

This actually makes a lot of sense, it's testing the luck of the candidate through the rng feeding the LLM. You wouldn't want to hire unlucky employees after all! Hiring managers of the past would solve this by throwing every second resume in the trash, now this is a built-in feature of the ATS.

zuzululu an hour ago | parent | prev | next [-]

this is why i dont feel sorry for working 3 remote jobs

mihaaly 3 hours ago | parent | prev | next [-]

That so many people are willing to participate in this kind of robotic practice in human employment makes me think that many are starting to consider it as unavoidable as global warming, and would rather play along: adapting their career (life) to it, sculpting it towards a specific look, doing things that will earn them points on some arbitrary test run. I find that dangerous, since it leads to a superficial-minded workforce: people who are not good at something, including judging a problem and its solution, but who are good at manipulation.

Speculative thought only, of course.

psychoslave an hour ago | parent | prev | next [-]

>You might as well throw out half the resumes and tell the applicants you don’t fuck with bad luck.

Hmm, well, maybe with a nuance of elite class reproduction (which doesn’t rule out a few class-jumpers to showcase in case anyone criticizes the perfect meritocracy being run), that’s basically what people get. A crude truth, but truth nonetheless.

Oh don’t take it personally. Your own bespoke hand-tailored process of course is different, it does give the opportunity to everyone to reach the most accomplished version of themselves beyond what they ever dare to dream.

It won’t help, though, with the systematic failure to provide an accessible path to flourish for everyone while leaving no one behind.

Again, this is no fault of any specific player, but as long as a majority feel compelled to move within the frame of a game with a few winners who merit all they got, in contrast to a large stock of inept losers, the outcomes are no wonder.

glouwbug 5 hours ago | parent | prev | next [-]

I guess at least HR doesn’t have to read 1,000 resumes. Heck, to be frank, could they make sense of the first 10 resumes?

yieldcrv 4 hours ago | parent | prev [-]

this will get patched, as in I'll optimize my resume for this and so will so many other people that any edge disintegrates