| ▲ | Unexpected security footguns in Go's parsers (blog.trailofbits.com) |
| 207 points by ingve 4 days ago | 113 comments |
| |
|
| ▲ | bravesoul2 11 hours ago | parent | next [-] |
Oof, is this for people who reuse their DTO for business logic and/or share DTOs across different APIs? It never occurred to me to ever (in any language) have a DTO with fields I wish (let alone require, for security) not to unmarshal. This seems doubly strange in Go, the language of "Yes, absolutely DO repeat yourself!" Side rant: JS (even with TypeScript) pisses me off, as this is unavoidable. Splats make it worse. But still, a distinct DTO and business object, and not using splats, would help. (Maybe make ... a lint error) |
| |
| ▲ | commandersaki 2 hours ago | parent | next [-] | | Does the origin of "DTO" come from Java? I've never seen it used anywhere else. | | | |
| ▲ | IceDane 6 hours ago | parent | prev [-] | | This is not unavoidable in TypeScript at all. It really depends a lot on how you have structured your application, but it's basically standard practice at this point to use e.g. zod or similar to parse data at the boundaries. You may have to be careful here (remember to use zod's .strict, for example), but it's absolutely not unavoidable. | |
| ▲ | bravesoul2 4 hours ago | parent [-] | | I should have been clear. Yes, it is avoidable, but only by adding more checking machinery. The bare language doesn't help. In Go etc., if a struct doesn't have a field foo, then there will not be a field foo at runtime. In JS there might be, unless you bring in libraries to help prevent it. You are relying on someone remembering to use zod on every fetch. |
|
|
|
| ▲ | glenjamin a day ago | parent | prev | next [-] |
It’s worth noting that if you use DisallowUnknownFields, it becomes much harder to handle forward/backward-compatible API changes - which is a very common and usually desirable pattern |
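For reference, a minimal sketch of opting in to that strictness with the standard library (the struct and field names here are illustrative):

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
    )

    type CreateUserRequest struct {
        Username string `json:"username"`
    }

    func main() {
        dec := json.NewDecoder(bytes.NewReader([]byte(`{"username":"alice","extra":1}`)))
        dec.DisallowUnknownFields()
        var req CreateUserRequest
        // An unknown key is now a hard error - which is also why a v1 server
        // built this way rejects payloads carrying fields added in v2.
        fmt.Println(dec.Decode(&req)) // json: unknown field "extra"
    }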
| |
| ▲ | georgelyon 15 hours ago | parent | next [-] | | While this is a common view, recently I’ve begun to wonder if it may be secretly an antipattern. I’ve run into a number of cases over the years where additional fields don’t break parsing, or even necessarily the main functionality of a program, but result in subtle incorrect behavior in edge cases. Things like values that are actually distinct being treated as equal because the fields that differ are ignored. More recently, I’ve seen LLMs get confused because they hallucinated tool input fields that were ignored during the invocation of a tool. I’m a little curious to try and build an API where parsing must be exact, and changes always result in a new version of the API. I don’t actually think it would be too difficult, but perhaps some extra tooling around downgrading responses and deprecating old versions may need to be built. | | |
| ▲ | yencabulator 12 hours ago | parent [-] | | It's a convenience and a labor saver, so of course it's fundamentally at odds with security. It's all trade-offs. |
| |
| ▲ | physicles 13 hours ago | parent | prev [-] | | If you’re writing a client, I could see this being a problem. If you’re writing a server, I believe the rule is that any once-valid input must stay valid forever, so you just never delete fields. The main benefit of DisallowUnknownFields is that it makes it easier for clients to know when they’ve sent something wrong or useless. | |
| ▲ | nine_k 8 hours ago | parent [-] | | No, once-valid input can be rejected after a period of deprecation. What actually makes sense is versioning your interfaces (and actually anything you serialize at all), with the version designator being easily accessible without parsing the entire message. (An easy way to have that is to version the endpoint URLs: /api/v1, /api/v2, etc.) For some time, you support two (or more) versions. Eventually you drop the old version if it's problematic. You never have to guess, and can always reject unknown fields. | |
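A minimal sketch of that shape with net/http (the handler bodies are placeholders):

    package main

    import (
        "fmt"
        "net/http"
    )

    func main() {
        mux := http.NewServeMux()
        // Each version gets its own handler and its own strict request struct,
        // so an old shape can be deprecated and dropped independently.
        mux.HandleFunc("/api/v1/users", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "v1 (deprecated)")
        })
        mux.HandleFunc("/api/v2/users", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "v2 (current)")
        })
        http.ListenAndServe(":8080", mux)
    }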
| ▲ | ljm 4 hours ago | parent [-] | | This is only easy to do if an API is designed around versioning from the beginning; often it isn't, and then it requires a lot of boilerplate and duplication, and it just results in everything being slapped into v1. Especially the case in frameworks that prescribe a format for routing. |
|
|
|
|
| ▲ | asimops a day ago | parent | prev | next [-] |
In the case of Attack scenario 2, I do not get why, in a secure design, you would ever forward the client-originated data to the auth service. This is more of a broken best practice than a footgun to me. The logic should be "Parse, don't validate"[0], and after that you work on the parsed data. [0]: https://hn.algolia.com/?q=https%3A%2F%2Flexi-lambda.github.i... |
| |
|
| ▲ | anitil 3 days ago | parent | prev | next [-] |
This was all very interesting, but that polyglot json/yaml/xml payload was a big surprise to me! I had no idea that Go's default xml parser would accept leading and trailing garbage. I'd always thought of json as one of the simpler formats to parse, but I suppose the real world would beg to differ. It's interesting that decisions made about seemingly-innocuous conditions like 'what if there are duplicate keys' have a long tail of consequences |
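A small repro of the XML behavior described in the article (a sketch assuming stock encoding/xml; the bytes around the first element are simply skipped):

    package main

    import (
        "encoding/xml"
        "fmt"
    )

    type User struct {
        Name string `xml:"name"`
    }

    func main() {
        // Leading and trailing garbage around the one real element.
        payload := []byte(`{"json": "here"} <User><name>alice</name></User> trailing`)
        var u User
        err := xml.Unmarshal(payload, &u)
        fmt.Println(u, err) // {alice} <nil> - no parse error despite the garbage
    }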
| |
| ▲ | shakna 8 hours ago | parent [-] | | Tangent for breaking Python's JSON parser: this has worked for five years. The docs do not say that parsing an invalid piece will result in a RecursionError. They specify JSONDecodeError and UnicodeDecodeError. (There is a RecursionError reference tied to a key that is off by default - but even when it's off, we can still raise this...) #!/bin/sh
# Python will hit its recursion limit
# if you supply just 4 less than the recursion limit.
# I assume this means there are a few objects on the call stack first.
# Probably: __main__, print, json.loads, and input.
n="$(python3 -c 'import math; import sys; sys.stdout.write(str(math.floor(sys.getrecursionlimit() - 4)))')"
echo "N: $n"
# Obviously invalid, but unparseable without the matching pair.
# JSON's grammar is... not good at being partially parsed.
left="$(yes [ | head -n "$n" | tr -d '\n')"
# Rather than exploding with the expected JSONDecodeError,
# this will explode with a RecursionError,
# which naturally thrashes the memory cache.
echo "$left" | python3 -c 'import json; print(json.loads(input()))'
|
|
|
| ▲ | octo888 a day ago | parent | prev | next [-] |
| What is "IsAdmin" doing in the "create user" request DTO in the first place? The examples seem to indicate inappropriate re-use of data models. Would it not be better to: type CreateUserRequest struct {
Username string
Password string
}
type UserView struct {
Username string
IsAdmin bool
}
etc. No need to have just one model that maps 1:1 to your DB row. This applies to all languages |
| |
| ▲ | liampulles a day ago | parent | next [-] | | Yeah, it's a bad pattern. And we can expect to see it in a sizable number of production codebases. | |
| ▲ | diggan a day ago | parent [-] | | ORMs love to put people and projects in the situation where it's much easier for data types in your program to match exactly with the tables in your database, but reality often requires them to be decoupled instead. That the "create user" request contains "isAdmin" smells like someone mapped that exact request to a table, and just because the table has "isAdmin", the data type now also has it. |
| |
| ▲ | ljm a day ago | parent | prev | next [-] | | Aside from the security issues it presents, it’s just a really bad idea to tightly couple your data model to your API like that. Any change to your DB schema is liable to become a breaking change on your API. If you need separate types for your requests and responses, so be it. |
| ▲ | matthew16550 20 hours ago | parent | prev | next [-] | | It's a concept from Fielding's REST thesis that is important to the current meaning of REST: Transfer Objects are not the Storage Objects. |
| ▲ | db48x 15 hours ago | parent | prev | next [-] | | Yes! | |
| ▲ | ajross a day ago | parent | prev | next [-] | | Exactly. This isn't a flaw in the runtime or the metaphor, the bug here is that the app exposes private data in a public API. That's a mistake that doesn't care about implementation choice. You can make the same blunder with a RESTful interface or a CGI script. At most, you can argue that simple serialization libraries (Go's is indeed one of the best) make it more tempting to "just send the data" in such a design, so if you squint really (really) hard, you can call this a "footgun" I guess. But the rest of the headline is 100% nonsense. This is not about "Go" or "parsers". At all. | | |
| ▲ | yencabulator 11 hours ago | parent | next [-] | | Except in the case of encoding/xml garbage, that's really a limitation of the Go parser. It's a very non-strict parser and doesn't actually validate that the input conforms to XML. Think of it more as a query language for a stream of XML fragments. | |
| ▲ | tln a day ago | parent | prev [-] | | What? It's about parsers in Go's standard library. | | |
| ▲ | ajross a day ago | parent [-] | | A "parser" is the piece of software that translates unstructured input (usually text) into structured output that better reflects the runtime of a programming language. A "security bug" in a parser is normally construed to be the existence of an input that causes the software to do something incorrect/undocumented/unexpected. Nothing in the article discusses a parser or anything like a parser bug. The article doesn't like that the semantics of the user-facing API wrapped around the parser is, I guess, "easy to make mistakes with". That's an article about API design, at most. But that's boring and specious and doesn't grab clicks, so they want you to think that Go's parsers are insecure instead. | | |
| ▲ | woodruffw a day ago | parent | next [-] | | I would consider the case-insensitive key handling in Go’s JSON parser and the malleability in the XML parser to both be “classic” examples of differential-prone behavior, which is the ultimate point made by the article. (This is why the more formal definition of a parser is useful to ground on: a parser is a type of recognizer, and disagreements between recognizers that claim to recognize the same thing can be exploited. This doesn’t require a bug per se, only difference, which is why it’s a footgun.) | | |
| ▲ | yencabulator 11 hours ago | parent [-] | | > case-insensitive key handling in Go’s JSON parser This was an explicit decision for convenience, because the Go struct fields will be Capitalized to export them but JSON tends to follow a lower case convention. | | |
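A quick demonstration (stock encoding/json; note that even an explicit struct tag doesn't opt you out):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    type User struct {
        IsAdmin bool `json:"isAdmin"`
    }

    func main() {
        for _, in := range []string{`{"isAdmin":true}`, `{"ISADMIN":true}`, `{"iSaDmIn":true}`} {
            var u User
            _ = json.Unmarshal([]byte(in), &u)
            fmt.Println(in, "->", u.IsAdmin) // all three set IsAdmin: key matching is case-insensitive
        }
    }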
| ▲ | woodruffw 2 hours ago | parent [-] | | Just because it’s convenient doesn’t make it a good idea! It’s also not what Go’s other parsers do per the rest of the post. | | |
| ▲ | yencabulator 10 minutes ago | parent [-] | | The "other parsers" in stdlib are gob and encoding/xml. Gob is a format made for Go, so it does not have a field-naming-convention mismatch; what's on the wire is the Go convention. encoding/xml isn't really structured as a "convert between equivalent-shaped structs and a message" utility; its struct tags are more like a query language, a simpler cousin of XPath. Most real-world code using it does things like type Person struct {
    FirstName string `xml:"name>first"`
    LastName  string `xml:"name>last"`
}
whereas with JSON there was a clear desire to have {"email": "jdoe@example.com", "name": "John Doe"}
parse with as little noise as possible: type Person struct {
    Email string
    Name  string
}
|
|
|
| |
| ▲ | wavemode a day ago | parent | prev | next [-] | | > Nothing in the article discusses a parser or anything like a parser bug. And it doesn't claim to. The article is titled "footguns" not "bugs". A footgun is just something that is easy to misuse due to unintuitive or unexpected behavior. | | |
| ▲ | dimgl a day ago | parent [-] | | > And it doesn't claim to. Yes it does. The title is literally "Unexpected security footguns in Go's parsers". The article didn't identify a single footgun. This is just bad design. | | |
| ▲ | mplanchard a day ago | parent | next [-] | | The article links to CVEs directly caused by some of the less intuitive behavior here. I feel like that’s sufficient to qualify as a footgun | |
| ▲ | ajross a day ago | parent | prev [-] | | That title is clearly constructed to imply that Go's parsers are insecure. The text of the article isn't about parsers at all, and only tangentially about Go[1]. It's deliberately misleading clickbait. You know it. I know it. We all know it. If you want to have a considered discussion about pitfalls with the use of automatic serialization paradigms across trust boundaries, I'm here for it. If you just want to flame about Go, get better source material. This one isn't the hill to stand on. [1] Which, again, has a really first-rate serialization story; but not one fundamentally different from any of a zillion others. Cooking data from untrusted sources is just plain hard, and not something that anyone (much less the author of this awful blog post) is going to solve with a serialization API. | |
| ▲ | CactusRocket 20 hours ago | parent | next [-] | | I think you and I read a different article. For example, about treating JSON keys without case sensitivity: > In our opinion, this is the most critical pitfall of Go’s JSON parser because it differs from the default parsers for JavaScript, Python, Rust, Ruby, Java, and all other parsers we tested. It would be kind of difficult to argue that this is not about Go. Don't get me wrong, I love Go just as much as the next person, but this article definitely lays bare some issues with serialization specific to Go. | |
| ▲ | woodruffw 20 hours ago | parent | prev | next [-] | | I don’t think the title is clickbait. You’re fixating on just one example in the post, but the others demonstrate a clear pattern: there are footguns in Go’s various parsers, including forms of malleability or flexibility that enable parser differentials. This isn’t intended to be a ding against Go; it’s a universal problem across programming languages (otherwise there’d be no differentials at all). But they’re worth noting, and I think the post amply demonstrates their security relevance. | |
| ▲ | ioasuncvinvaer 21 hours ago | parent | prev | next [-] | | I disagree. It was a bad design choice to allow JSON keys with different capitalisation and to serialise all public struct members by default. These decisions can easily result in the creation of an insecure system. | |
| ▲ | x0x0 20 hours ago | parent | prev [-] | | Of course it's about Go. They decided to build not just opt-out unmarshaling of objects, but fragile opt-out. See eg rails' strong params for the opposite approach: opt-in. |
|
|
| |
| ▲ | sfvisser a day ago | parent | prev [-] | | Exactly right. Better to have a domain layer with data types representing the domain objects 1:1, and add one or more API layers on top for interacting with those for some modality: creation, deletion, verification, auth, etc. The security failure is not the parsing library, but failing to model your application architecture properly. |
|
|
| |
| ▲ | delusional a day ago | parent | prev [-] | | > No need to have just one model that maps 1:1 to your DB row. This applies to all languages One reason is to avoid copying data constantly. I don't just mean this from an efficiency perspective, but also (and maybe more so) from a simplicity one. If you have a library for shoving data into a struct mechanistically, but you then take the data from that struct and shove it into an additional struct, what's the point of the library? You're writing the code to move the data anyway. In my dayjob I see this tendency constantly to have a lot of different very narrow structs that somehow integrate into some library, and then a TON of supporting code to copy between those structs. Only to then do very little actually useful work with any of the data at the end. I generally think you'd be happier with fatter structs that integrated less with weird "struct-filling" libraries. | |
| ▲ | kgeist a day ago | parent | next [-] | | 1. As shown in the article, exposing the internal model to APIs directly is a security footgun. I've seen sites get hacked because the developer serialized internal objects without any oversight just like in the article and accidentally exposed secrets. 2. Exposing internal models to APIs directly also makes it hard to refactor code because refactoring would change APIs, which would require updating the clients (especially problematic when the clients are owned by other teams). I've seen this firsthand too in a large legacy project, people were afraid to refactor the code because whenever they tried, it broke the clients downstream. So instead of refactoring, they just added various complex hacks to avoid touching the old core code (and of course, their models also mapped directly to the UI). In the end, codebases like that, with no layer separation, become really hard to maintain and full of security problems. All because they thought it was "simpler" to skip writing ~10 lines of extra boilerplate per model to map models to DTOs. Lack of layer separation becomes a problem in the long term. When you're just starting out, it may seem like overengineering, but it isn't | | |
| ▲ | delusional a day ago | parent [-] | | > Lack of layer separation becomes a problem in the long term. When you're just starting out, it may seem like overengineering, but it isn't I actually agree, but you're setting up a false dichotomy. I do believe in strong layering at the interfaces, for exactly the reasons you line up. What I don't believe in is what I might call "struct annotation based parsing" at those interfaces. Typically, you don't want to pass DTOs around your code. Usually, you take in that struct, and then immediately have some code to poke it into the appropriate places in your actual data structures. It's very often much easier to simply take a well-structured but more direct and generic interpretation of the input data, and write the code to poke it into the correct places directly. It is not that you should define your inputs separately from your internal data storage. It's that the specification of your input structure shouldn't exist as a struct; it should exist as the consequence of your parsing code. > When you're just starting out, it may seem like overengineering, but it isn't It's a real shame that the internet has driven us to assume everybody is a novice. | |
| ▲ | kgeist a day ago | parent [-] | | >It's a real shame that the internet has driven us to assume everybody is a novice. Sorry, English is not my native language. I didn't mean to say you're a novice. >Usually, you take in that struct, and then immediately have some code to poke it into the appropriate places in your actual data structures >It is not that you should define your inputs separately from your internal data storage. It's that the specification of your input structure shouldn't exist as a struct, it should exist as the consequence of your parsing code. Can you give an example? | |
| ▲ | delusional a day ago | parent [-] | | > Can you give an example? Sure. I like taking Jackson (the Java library) as an example, since it actually supports both models. The way I've seen it used mostly is with jackson-databind. Here you define classes and annotate the fields with data that tells the library how to marshal them to/from JSON. Superficially, I find that similar to how Go or Serde (from Rust) suggests you handle that. In that programming model, I agree it makes total sense to declare some classes separately from your core structures, for all the reasons we've talked about. The other model Jackson has is what they call the Jackson Tree Model. In this model you get back a representation of a JSON object. From that object you can get fields, and those fields can themselves be objects, arrays, or immediates. It's an AST if you're a compiler person. The first model might lead to code like this: public class Person {
@JsonProperty
String name;
@JsonProperty
String phoneNo;
}
Usually, the annotations won't be able to fully specify the constraints of your code, so you'll see usage code like this: if(personDto.phoneNo.length() != 10) return Http(400, "Phone number must be 10 chars");
person.phone = personDto.phoneNo;
With the Tree Model you'd instead get a representation of the raw JSON from the client and pull out the fields you care about yourself: var phoneNoObj = jsonBody.get("phoneNo");
if(phoneNoObj == null) return Http(400, "Phone number is required");
var phoneNoStr = phoneNoObj.asString()
if(phoneNoStr == null) return Http(400, "Phone number must be a string");
if(phoneNoStr.length() != 10) return Http(400, "Phone number must be 10 chars");
person.phone = phoneNoStr;
Notice that we are now just doing a single application-specific parse of the JSON, and while we were at it we also got to surface a bunch more relevant errors. The Jackson Tree Model is obviously pretty inefficient, but there are ways to implement it that make it more efficient too. | |
| ▲ | kgeist 19 hours ago | parent [-] | | In Go, you can unmarshal JSON into a map, so you can extract fields manually, too. |
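A minimal sketch of that in Go, mirroring the Tree Model flow above (the field name and checks are illustrative):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        var body map[string]any
        if err := json.Unmarshal([]byte(`{"phoneNo":"0123456789"}`), &body); err != nil {
            fmt.Println("bad json:", err)
            return
        }
        // Pull out and validate only the field we care about.
        phone, ok := body["phoneNo"].(string)
        if !ok {
            fmt.Println("phone number must be a string")
            return
        }
        if len(phone) != 10 {
            fmt.Println("phone number must be 10 chars")
            return
        }
        fmt.Println("ok:", phone)
    }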
|
|
|
| |
| ▲ | masklinn a day ago | parent | prev | next [-] | | > In my dayjob I see this tendency constantly to have a lot of different very narrow structs that somehow integrate into some library, and then a TON of supporting code to copy between those structs. Maybe that's the problem to solve, rather than exposing the entire internal world to the outside? Because different views of the same entities are pretty critical; otherwise it's way too easy to start e.g. returning PII to public endpoints because some internal process needed it. | |
| ▲ | delusional a day ago | parent [-] | | > exposing the entire internal world to the outside That's not at all what I said. You don't need a struct to avoid exposing internal data. If you're building a JSON object, you can just not write the code to format some fields out. You don't need a new data layout for that. | | |
| ▲ | CactusRocket a day ago | parent | next [-] | | That's the whole point of the first item in the article, and the original comment you were replying to. In Go (and some other languages) that formatting is implicit and automatic. So you need to write code to NOT format the fields out. Which leads to the potential security issues where data is leaked, or you can inject data into "hidden" fields. So since it's implicit and automatic, it's safer to, as a rule, define separate structs for the data input and map them, so that there is absolutely no chance (implicit or explicit) to leak or inject data. |
| ▲ | masklinn a day ago | parent | prev [-] | | > If you're building a JSON object, you can just not write the code to format some fields out. Did you fail to read the article somehow? Few if any modern languages require you to write the code to format individual fields out. Even Go does not; though when it comes to JSON most fields need to be annotated because of the choice it made to tie casing to visibility, it doesn't even require you to opt the type itself into being serializable - every struct is serializable by default. Skipping fields on a serializable structure is what requires extra work. |
|
| |
| ▲ | eadmund a day ago | parent | prev | next [-] | | > In my dayjob I see this tendency constantly to have a lot of different very narrow structs that somehow integrate into some library, and then a TON of supporting code to copy between those structs. Only to then do very little actually useful work with any of the data at the end. Don’t think of it as doing a little useful work at the end; think of it as doing all the useful work in the centre. Your core logic should be as close to a pure implementation without external libraries as possible (ideally zero, but that is often not easily achievable), but call out to external libraries and services to get its work done as appropriate. That does mean a fair amount of data copying, but IMHO it’s worth it. Testing copies is easy and localised, whereas understanding the implications of a JSON (or Protobuf, or Django, or whatever) object carried deep into one’s core logic and passed into other services and libraries is very very difficult. There’s a common theme with the ORM trap here. The cost of a little bit of magic is often higher than the cost of a little bit of copying. |
| ▲ | TeMPOraL a day ago | parent | prev | next [-] | | > also (and maybe more so) from a simplicity one. If you have a library for shoving data into a struct mechanistically, but you then take the data from that struct and shove it into an additional struct, what's the point of the library? You're writing the code move the data anyway. Super annoying if you need to do it by hand, and wastes compute and memory if you actually need to do copies of copies, but this is the mapping part of "object relational mapping", the M in ORM. Skipping it is a bad idea. Your business/domain model should not be tied directly to your persistence model. It's a common mistake that's responsible for like half of the bad rep ORMs get. Data structures may look superficially similar, but they represent different concepts with different semantics and expectations. If you skip on that, you'll end up with tons of stupid mistakes like 'masklinn mentions, and more subtle bugs when the concepts being squashed together start pulling in opposite directions over time. | |
| ▲ | homebrewer a day ago | parent | prev [-] | I don't know about Go, but in the Java and .NET world this is trivially solvable with libraries like MapStruct. If you have a model with 20 fields and need to create a tiny slice of it (with, let's say, three fields), you need a few lines of boilerplate: create a record with those fields (1-3 LOC): record Profile(int id, String login, boolean isAdmin) {}
create a mapper for it: @Mapper
interface UserMapper {
    // arrays are just one example; plain models and plenty
    // of other data structures are supported
    Profile[] usersToProfiles(User[] users);
    // other mappers...
}
and then use it: class UserController {
    @GET("/profiles")
    Profile[] getUserProfiles() {
        var users = userRepo.getUsers();
        return userMapper.usersToProfiles(users);
    }
}
As long as fields' names match, everything will be handled for you. Adding another "view" of your users requires creating that "view" (as a record or as a plain class) and adding just one line to the mapper interface, even if that class contains all of User's fields but one. So no need to write and maintain 19+ lines of copying data around. It also handles nested/recursive entities, nulls, etc. It's also using codegen, not reflection, so performance is exactly the same as if you had written it by hand, and the code is easy to read. https://mapstruct.org Go developers usually "don't need these complications", so this is just another self-inflicted problem. Or maybe it's solved, look around. | |
| ▲ | jerf a day ago | parent | next [-] | | Actually it's even easier in Go with struct embedding: type SmallerThing struct {
Id int
Login string
IsAdmin bool
}
type UserController struct {
SmallerThing
OtherField Whatever
OtherField2 SomethingElse
}
In principle this could break down if you need super, super complicated non-overlapping mappings; in practice I have yet to need that. |
| ▲ | kgeist a day ago | parent | prev [-] | | >you have a model with 20 fields and need to create a tiny slice of it (with let's say three fields), you need a few lines of boilerplate: create a record with those fields (1-3 LOC): >create a mapper for it: > ... >Go developers usually "don't need these complications", so this is just another self-inflicted problem. In Go: type DTO struct {
A, B, C string
}
Somewhere in your API layer: // copy the fields to the DTO
return DTO{A: o.A, B: o.B, C: o.C}
I fail to see where the "self-inflicted problem" is, or why it requires a whole library (which seems to require around the same number of lines of code at the end of the day, if you count the imports and the additional mapper interface). |
|
|
|
|
| ▲ | tptacek a day ago | parent | prev | next [-] |
| These kinds of issues (parser differentials in particular) are why you shouldn't trust Go SAML implementations that use `encoding/xml`, which was never designed for that application to begin with; I just wrote my own for my SAML. (I mean, don't use SAML to begin with, but.) |
| |
|
| ▲ | piinbinary a day ago | parent | prev | next [-] |
| Kudos to the author for making this very clear and easy to understand. More technical writing should be like this. On another note, it's mind-blowing that a single string can parse as XML, JSON, and YAML. |
|
| ▲ | aintnolove a day ago | parent | prev | next [-] |
| I know these problems are easily avoidable... but I'm finally starting to see the appeal of protocol buffers. Just to have the assurance that, regardless of programming language, you're guaranteed a consistent ser/de experience. |
| |
| ▲ | chubot a day ago | parent | next [-] | | It would be nice if it were true :-) But it’s not, for reasons that have more to do with the languages themselves than with parsing. E.g. C++ numbers are different from Java numbers, which are different from Python numbers, which are different from JavaScript numbers; ditto for strings | |
| ▲ | CactusRocket a day ago | parent [-] | | I imagine that one of the points of a solid protocol buffers library would be to align the types even across programming languages. E.g. explicitly force a 64-bit integer rather than "int" relying on the platform. And to have some custom "string" type which is always UTF-8 encoded in memory rather than depending on the platform-specific encoding. (I have no idea if that is the case with protobuf, I don't have enough experience with it.) | | |
| ▲ | chubot a day ago | parent [-] | | Why would you "imagine" that? Again, the problem has more to do with the programming languages themselves, rather than with protobufs or parsing. Protobuf has both signed and unsigned integers - the initial use case was C++ <-> C++ communication. Java doesn't have unsigned integers. Python has arbitrary-precision integers. JavaScript traditionally only had doubles, which means it can represent integers up to 53 bits exactly. It has since added arbitrary-size integers -- but that doesn't mean that the protobuf libraries actually use them. These aren't the only possibilities -- every language is fundamentally different; OCaml has 31- or 63-bit integers IIRC. https://protobuf.dev/programming-guides/encoding/#int-types And again, strings also differ between all these languages -- there are three main choices, which are basically 8-bit, 16-bit, or 32-bit code units. Go and Rust favor 8-bit units; Java and JavaScript favor 16-bit units; and Python/C/C++ favor 32-bit units (which are code points). | |
| ▲ | CactusRocket 21 hours ago | parent [-] | | I think I didn't explain myself well. As long as a language has bytes and arrays, you can implement anything on top of them, like unsigned integers, 8-bit strings, UTF-8 strings, UCS-2, whatever you want. Sure, it won't be native types, so it will probably be slower and could have an awkward memory layout, but it's possible. Granted, if a language is so gimped that it doesn't even have integers (as you mentioned JavaScript), then that language will not be able to fully support it indeed. | |
| ▲ | chubot 19 hours ago | parent [-] | | Unfortunately that doesn't solve the problem -- it only pushes it around. I recommend writing a protobuf generator for your favorite language. The less it looks like C++, the more hard decisions you'll have to make. If you try your approach, you'll feel the "tax" when interacting with idiomatic code, and then likely make the opposite decision. Re: "so gimped" --> this tends to be what protobuf API design discussions are like. Users of certain languages can't imagine the viewpoints of users of other languages, e.g. is unsigned vs. signed the way the world is? Or an implementation detail? And it's a problem to be MORE expressive than C/C++ -- i.e. from idiomatic Python code, the protobuf data model also causes a problem. Even within C/C++, there is more than one dialect -- C++03 versus C++11 with smart pointers (and probably more in the future). These styles correspond to the protobuf v1 and protobuf v2 APIs (I used both protobuf v1 and protobuf v2 for many years, and did a design review for the protobuf v3 Python API). In other words, protobufs aren't magic; they're another form of parsing, combined with code generation, which solves some technical problems and not others. They also don't resolve arguments about parsing and serialization! |
|
|
|
| |
| ▲ | nottorp 8 hours ago | parent | prev | next [-] | | But since the article isn't really about parser bugs, I don't think using a different data format will save you from most of the problems described there. | |
| ▲ | liampulles a day ago | parent | prev [-] | | I think you can get similar benefits here from writing an RPC style JSON API into an OpenAPI spec and generating structs and route handlers from that. That's what I do for most of my Go projects anyway. |
|
|
| ▲ | e_y_ a day ago | parent | prev | next [-] |
As someone who isn't a Go programmer, on the face of it using strings (struct tags) for field metadata seems pretty backwards compared to Rust macros (which parse the metadata at compile time) or Java annotations (which are processed at runtime but at least don't require parsing a string to break apart options). The accidental omitempty and - are a good example of the weirdness, even if they might not cause problems in practice. |
| |
| ▲ | xnorswap a day ago | parent | next [-] | | As a .NET programmer, the "stringly typed" nature of the metadata horrifies me, but the choices of Go have long confused me. So in .NET, like Java as you mention, we have attributes, e.g. [JsonPropertyName("username")]
[JsonIgnore]
etc. This is simple and obvious. The JsonPropertyName attribute is an override; you can set naming policies for the whole class: camelCase by default, with kebab-case, snake_case, etc. as alternative defaults. C#/.NET of course has the benefit of having public properties, which are serialised by default, and private properties, which aren't, so you're unlikely to be exposing things you don't want to expose. This contrasts with Go's approach, much like Python's, of using casing convention to determine private vs public fields. (Please correct me if I'm wrong on this?) The first example still confuses me though, because either you want IsAdmin to come from the user, in which case you still want to deserialise it, or you don't, in which case it shouldn't even be in your DTO at all. Deserialisation there is a bit of a red herring, as there should be a validation step which includes, "Does this user have the rights to create an admin?". The idea of having a user class which gets directly updated using properties straight from deserialized user input feels weird to me, but I'd probably be dismissed as an "enterprise programmer" who wants to put layers between everything. | |
| ▲ | hmry a day ago | parent | next [-] | | > This contrasts to Go's approach, much like python, of using casing convention to determine private vs public fields. ( Please correct me if I'm wrong on this? ) I think calling it a convention is misleading. In Python, you can access an _field just by writing obj._field. It's not enforced, only a note to the user that they shouldn't do that. But in Go, obj.field is a compiler error. Fields that start with a lowercase letter really are private, and this is enforced. So I think it's better to think of it as true private fields, just with a... unique syntax. | |
| ▲ | masklinn a day ago | parent | prev | next [-] | | > This contrasts to Go's approach, much like python, of using casing convention to determine private vs public fields. ( Please correct me if I'm wrong on this? ) Go actually ties visibility to casing, instead of using separate annotations. And it will not serialise private fields, only public. Python has no concept of visibility at all, conventionally you should not access attributes prefixed with `_` but it won't stop you. | | |
| ▲ | maleldil a day ago | parent [-] | | > conventionally you should not access attributes prefixed with `_` but it won't stop you. Any serious Python project will use at least one linter or typechecker, which can easily enforce this. |
| |
| ▲ | grey-area a day ago | parent | prev [-] | | Not just weird, it’s dangerous to do this - too easy to miss validation as fields are added over time. Better to explicitly validate all fields IMO. |
| |
| ▲ | liampulles a day ago | parent | prev | next [-] | | As someone who is a regular Go programmer, yes, struct tags do suck. It gets even worse if you want to try and combine multiple "annotations" into one struct tag string. The reason it's like that is that Go is philosophically very much against the idea of annotations and macros, and very strongly in favor of a clear, upfront control flow, and this is one of the reasons I love the language. But it does come at the cost of a few highly useful use cases for annotations (like mapping JSON and XML, etc.) becoming obtuse to use. The idea of more compile-time macros in Go is interesting to me, but at the same time the ease of debugging and understanding the Go control flow in my programs is one of the reasons I love it so much, and I would not want to invite the possibility of "magic" web frameworks that would inevitably result from more metaprogramming ability in Go. So I guess I'm prepared to live with this consequence. :/ | |
| ▲ | masklinn a day ago | parent | next [-] | | > The reason its like that is that Go philosophically is very much against the idea of annotations and macros, and very strongly about the idea of a clear upfront control flow Annotations have no control flow, they just attach metadata to items. The difference with struct tags being that that metadata is structured. | |
| ▲ | valenterry a day ago | parent | prev | next [-] | | I understand your feeling. There are many magical frameworks (e.g. Spring) that do these things, and it's super hard to figure out what's going on. The solution is usually to have an even better language - one where the type system is so powerful that such hacks are not necessary. Unfortunately, that also means you have to learn that type system to be productive in the language, and you have to learn it more or less upfront - which is not something that Google wanted for golang, due to the turnover. |
| ▲ | theasisa 4 hours ago | parent | prev [-] | | I find the most annoying thing to be that there is no standard for separating values or fields in the tag string. Sometimes you use a comma (json), sometimes a semicolon (Gorm, IIRC), etc. This has caused me several headaches. |
| |
| ▲ | kjksf a day ago | parent | prev | next [-] | | For some it's stupidity. For others it's brilliance. It's one of many examples of 80/20 design in Go: 80% of functionality with 20% of complexity and cost. Struct tags address an important scenario in an easy-to-use way. But they don't try to address other scenarios, like annotations do. They are not function tags. They're not variable tags. They are not general-purpose annotations. They are annotations for struct fields and struct fields only. Are they as powerful as annotations or macros? Of course not, not even close. Are they as complex to implement, understand, use? Also not. 80/20 design. 80% of functionality at 20% of cost. | |
| ▲ | Philpax a day ago | parent | next [-] | | Go's simplifications often introduce complexities elsewhere, however, as this article demonstrates with the complexities of correctness of a stringly-typed DSL. There's no free lunch here, and the compromises Go makes to achieve its outcomes have shown themselves to be error-prone in ways that were entirely predictable at design time. | | |
| ▲ | Kamq a day ago | parent | next [-] | | > Go's simplifications often introduce complexities elsewhere It does occasionally, although I'll push back on the "often". Go's simplifications allow most of the codebase to be... well... simple. This does come at the cost of some complexity on the edge cases. That's a trade off I'm perfectly willing to make. The weird parts being complex is something I'm willing to accept in exchange for the normal parts being simple, as opposed to constantly dealing with a higher amount of complexity to make the edge cases easier. > There's no free lunch here This I'll agree with as well. The lunch is not free, but it's very reasonably priced (like one of those hole in the wall restaurants that serves food way too good for what you pay for it). > the compromises Go makes to achieve its outcomes have shown themselves to be error-prone in ways that were entirely predictable at design time. I also agree here, although I see this as a benefit. The things that are error prone are clear enough that they can be seen at design time. There's no free lunch here either, something has to be error prone, and I like the trade offs that go has made on which parts are error prone. Adding significant complexity to reduce those error prone places has, in my experience, just increased the surface area of the error prone sections of other languages. Could you make the case that some other spot in design space is a better trade-off? Absolutely, especially for a particular problem. But this spot seems to work really well for ~95% of things. | |
| ▲ | valenterry a day ago | parent | prev [-] | | > Go's simplifications often introduce complexities elsewhere Exactly this. Basically: have a complex compression algorithm? Yes, it's complex, but the resulting filesize (= program complexity) will be low. If you use a very basic compression algorithm, it's easier to understand the algorithm, but the filesize will be much bigger. It's a trade-off. However, as professionals, I think we should really strive to put in the time to properly learn the good complex compression algorithm once and then benefit in all the programs we write. | |
| ▲ | maleldil a day ago | parent [-] | | > I think we should really strive to put time to properly learn [insert Pike's Google young programmers quote here] That's just not the philosophy of the language. The convention in Go is to be as obvious as possible, at the cost of more efficient designs. Some people like it, others don't. It bothers me, so I stopped using Go. | | |
|
| |
| ▲ | timeon 19 hours ago | parent | prev [-] | | Starting to get the feeling that 80/20 design is not a good thing. Many things seem to be driven by worse-is-better, but, looking at things like climate change... was it worth it? |
| |
| ▲ | grey-area a day ago | parent | prev | next [-] | | Yes, they are a horrible idea for many reasons, not just security. It’s like a hidden, ill-defined, poorly understood DSL in strings. You can just not use them though - you can unmarshal to a map instead, select the keys you want, perform validation etc., and then set the values. Same when publishing - I prefer to have an explicit view which defines the keys exposed, rather than publishing all by default based on these poorly understood string keys attached to types. |
| ▲ | reactordev a day ago | parent | prev [-] | | struct tags greatly reduce the boilerplate code required to map fields to fields. It’s really quite novel once you understand it. | | |
| ▲ | masklinn a day ago | parent | next [-] | | > struct tags greatly reduce the boilerplate code required to map fields to fields. Are you somehow under the impression that Go is unique in having a terse way to map fields to fields? > It’s really quite novel once you understand it. It's the opposite of novel, putting ad-hoc annotations in unstructured contexts is what people used to do before java 5. | | |
| ▲ | reactordev a day ago | parent [-] | | No, not at all, but considering C has no such thing, I’ll take it. |
| |
| ▲ | jlouis a day ago | parent | prev [-] | | It's not very novel. There are far better ways of solving this than allowing a random string to be embedded as aux information on a struct field. Examples: F# type providers, or OCaml's PPX system for extending the language in a well-defined way. Macro rewriting systems also allow for better safety in this area. This allows you to derive a safe parser from the structural data, and you can make said parser be really strict. See e.g. Wuffs or LangSec for examples of approaches here. | |
| ▲ | reactordev a day ago | parent [-] | | I’m not disagreeing that there are better ways to solve this, given how other languages have implemented theirs, but considering the constraints they had at the time the Go team designed this, it allowed them to implement marshaling fairly easily and left it open for extension by the community. | |
| ▲ | maleldil a day ago | parent [-] | | > considering the constraints they had at the time the Go team designed this What constraints? Ignoring decades of programming language developments since C89? |
|
|
|
|
|
| ▲ | fainpul a day ago | parent | prev | next [-] |
| I didn't see this mentioned in the article: wouldn't the obvious way be to make the private field private (lowercase)? (I'm not a Go developer, just tried the language casually). type User struct {
Username string `json:"username_json_key,omitempty"`
Password string `json:"password"`
isAdmin bool
}
https://go.dev/play/p/1m-6hO93Xce |
| |
| ▲ | zimpenfish a day ago | parent | next [-] | | > wouldn't the obvious way be to make the private field private (lowercase)? That may break other things - `gorm`, for example, will ignore private fields - inconvenient if you want to serialise `User` to your DB. | | |
| ▲ | CactusRocket a day ago | parent [-] | | If you use the same struct in both an HTTP API and an ORM, you're Doing It Wrong in my opinion. These should be completely separated. Exactly to prevent accidental leaking or injection of data. | | |
| ▲ | zimpenfish 20 hours ago | parent [-] | | > If you use the same struct in both an HTTP API and an ORM, you're Doing It Wrong in my opinion. If you mean "public API", yep, 100% agree. Internal API between microservices though? Perfectly safe and cromulent, I'd say. | | |
| ▲ | CactusRocket 8 hours ago | parent [-] | | I tend to disagree with that, also. :) Even within one codebase there's immense value in having separate structs/classes per "layer" or domain. E.g. a different set of structs for the database layer than for the "business layer" (or whatever your application's internal setup is). When that boundary is moved to outside the application, so an HTTP API between microservices, I feel even more strongly (though indeed still not as strongly as in what you call a "public API"). E.g. I have seen plenty of times a situation where a bunch of applications were managed within one team, the team split up and now this "internal API" has become an API between teams, suddenly making it "public" (when viewed from the teams perspective). |
|
|
| |
| ▲ | never_inline a day ago | parent | prev | next [-] | | You can annotate `json:"-"`, which is the equivalent of @JsonIgnore | |
| ▲ | zimpenfish a day ago | parent [-] | | Covered in the article along with a potential snafu - adding anything else (`-,omitempty`) turns it into a key "-", not the "ignore this field" indicator. |
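A minimal demonstration of that snafu (stock encoding/json):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    type T struct {
        Hidden string `json:"-"`           // ignored, as intended
        Whoops string `json:"-,omitempty"` // serialized under the literal key "-"
    }

    func main() {
        out, _ := json.Marshal(T{Hidden: "secret", Whoops: "leaked"})
        fmt.Println(string(out)) // {"-":"leaked"}
    }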
| |
| ▲ | arp242 a day ago | parent | prev [-] | | That works, but then you also can't access isAdmin from another package. | | |
| ▲ | pdq a day ago | parent [-] | | You can just make an IsAdmin() public function to expose it. |
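For example (a sketch: the unexported field plus a read-only accessor):

    package user

    type User struct {
        Username string `json:"username"`
        isAdmin  bool   // unexported: encoding/json never sees it
    }

    // IsAdmin exposes the flag read-only to other packages.
    func (u *User) IsAdmin() bool { return u.isAdmin }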
|
|
|
| ▲ | em-bee a day ago | parent | prev | next [-] |
I am not aware of any parser that does that differently, but I would also argue that this is not the job of parsers. After parsing (or before exporting) there should be a data validation step based on whitelists. So the user can send in unknown fields all they want; the code will only accept the username and firstname strings and ignore the other ones. Same with fetching data and sending it to the user: I fetch only the fields I want and create the correct data structures before invoking the marshaling step. There are no footguns. If you expect your parser to protect you, you are using it wrong; they were not designed for that. input -> parse -> extract the fields we want, which are valid -> create a data structure with those fields. data -> get fields I want -> create data structures with only wanted fields -> write to output format |
| |
| ▲ | nottorp 8 hours ago | parent | next [-] | | > this is not the job of parsers The part of the article that I read before getting annoyed at the clickbaity title is basically "if you trust external data here's how you can blame that design decision on the parser". | |
| ▲ | securesaml a day ago | parent | prev [-] | | This is correct. In the blog post they say:
> Other examples exist, but most follow the same pattern: the component that does security checks and the component that performs the actions differ in their view of the input data.
This would be solved (as you described) by ensuring that the downstream layer uses only contents that were verified in the security-check layer. If they are using a microservice, then: Security check API -> return verified data (i.e. re-serialize the verified JSON or XML into byte form, NOT the original input) -> Processing layer, i.e. the userCreate API uses the verified data. This is the method we used in fixing the ruby-saml example. See:
https://bsky.app/profile/filippo.abyssdomain.expert/post/3le... |
|
|
| ▲ | jerf a day ago | parent | prev | next [-] |
There's a lot of good information in here, but if you think this is a time to go all language-supremacist about how much better your language is and how this proves Go sucks, you might want to double-check the CVE database real quick first. A lot of these issues are more attributable to plain ol' user error and the fundamental messiness of JSON and XML than to Go qua Go, and I've seen many similar issues everywhere. For instance, there simply isn't a "correct" way for a parser to handle duplicate keys. Because the problem they have is different layers seeing them differently, you can have the problem anywhere duplicate keys are treated differently, and it's not like Go is the only thing to implement "last wins". It doesn't matter what you do. Last wins? Varies from the many "first wins" implementations. First wins? Varies from the many "last wins" implementations. Nondeterministically choose? Now you conflict with everyone, even yourself, sometimes. Crash or throw an exception or otherwise fail? Now you've got a potential DoS. There's no way for a parser to win here, in any language. The code using the parser has some decisions to make. Another example: the streaming JSON decoder "accepts" trailing garbage data because by the act of using the streaming JSON decoder you have indicated a willingness to potentially decode more JSON data. You can use this to handle newline-separated JSON, or other interesting JSON protocols where you're not committing to the string being just one JSON value and absolutely nothing else. It's not "an issue they're not planning on fixing"; it's a feature with an absolutely unavoidable side effect in the context of streams. The JSON parser stops reading the stream at the end of the complete JSON object, by design, and anything else would be wrong because it would be consuming a value to "check" whether the next thing is JSON or not, when you may not even be "claiming" that the "next thing" is JSON, and whatever input it consumed to verify a claim that nobody is even making would itself be a bug. Accepting user input into sensitive variables is a common mistake I've seen multiple times in a number of languages. The root problem there is more the tension between convenience and security than the languages themselves; any language can make it so convenient to load data that developers accidentally load more data than they realize. Etc. The best lesson to take away from this is that there is more than meets the eye with JSON and XML, and they're harder to use safely than their ease of use suggests. Although in the interests of fairness I also consider case insensitivity in the JSON field names to be a mistake; maybe it should be an option, JSON can get messy in the real world, but it's a bad default. I have other quibbles but most of them are things where there isn't a correct answer where you can unambiguously say that some choice is wrong. JSON is really quite fundamentally messier than people realize, and XML, while generally more tightly specified at the grammar level than JSON is, is generally quite messy in the protocols people build on top of it. |
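For the curious, a sketch of what that streaming contract looks like in practice (stock encoding/json; the payload is illustrative):

    package main

    import (
        "encoding/json"
        "errors"
        "fmt"
        "io"
        "strings"
    )

    func main() {
        // One Decode call consumes exactly one JSON value; the decoder
        // never reads past it to "check" what follows.
        dec := json.NewDecoder(strings.NewReader(`{"n":1} {"n":2} not-json`))
        for {
            var v map[string]any
            err := dec.Decode(&v)
            if errors.Is(err, io.EOF) {
                break
            }
            if err != nil {
                fmt.Println("stopped at:", err) // the garbage only fails once reached
                return
            }
            fmt.Println(v)
        }
    }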
| |
| ▲ | conradludgate 19 hours ago | parent | next [-] | | If a parser throwing an error due to duplicate fields causes a DoS, then the same is going to happen when it throws an error due to invalid encoding... So, I don't think that's a relevant critique. I think any ambiguous case in parsing untrusted user input should raise an error, and anyone working on code with untrusted data should be ready to handle errors | |
| ▲ | mplanchard a day ago | parent | prev [-] | | The article didn’t claim that “last wins” is in and of itself an issue, but that the differences between who wins between parsers across services/languages can cause issues. Their position was that everyone should standardize on “last wins,” since that is the most common. | | |
| ▲ | securesaml a day ago | parent [-] | | The correct conclusion is: https://news.ycombinator.com/item?id=44337330 The problem of trying to ensure that each parser behaves the same for all input is twofold:
- JSON and XML specifications are complex, with lots of quirks, so it's not feasible.
- It does not solve the fundamental issue of the processing layer not using the same data that is verified in the verification layer. Note: the processing layer parses the original input bytes, while the verification layer verifies a struct that is parsed using another parser. Processed: Proc(input)
Verified: VerifyingParser(input) |
|
|
|
| ▲ | sudomateo 12 hours ago | parent | prev | next [-] |
| Great informational post. I've been guilty of forgetting that pesky leading comma and changing the field name. |
|
| ▲ | neuroelectron a day ago | parent | prev | next [-] |
| Been seeing these same problems in services for decades now. It's almost like they made these protocol languages exploitable on purpose. |
| |
| ▲ | CactusRocket a day ago | parent | next [-] | | I think it's just kinda dumb parsing. E.g. JSON is an extremely simple spec. Most of the issues that the Go JSON parser has are because of specific choices of the Go implementation, not because of JSON. The fact that it allows case-insensitive key matching is just insane. Also, that it parses invalid XML documents (with garbage) into valid structs without returning an error is very much a problem with the parser and not with XML. | |
| ▲ | cesarb 20 hours ago | parent [-] | | > The fact that it allows case-insensitive key matching is just insane. It's probably a side effect of what is IMO another bad design of that language: letter casing determining field visibility, instead of using a keyword or a sigil. If your field has to be named "User" to be public, and the corresponding entry in the JSON has all-lowercase "user" as the key (probably because the JSON was defined first, and most languages have "field names start with lowercase" as part of their naming conventions), you have to either ignore case when matching, or manually map every field. They probably wanted to be "intuitive" and not require manual mapping. | | |
| ▲ | jeroenhd 3 hours ago | parent | next [-] | | > If your field has to be named "User" to be public, and the corresponding entry in the JSON has all-lowercase "user" as the key then you specify the key to be "user"? Isn't that the point of the ability to remap names? Except you can't, because you don't have a choice whether or not your data is deserialised with case sensitivity enabled. I've written plenty of Rust code to turn camelCase into snake_case, and "it's too much effort" has never been a problem. It's a minor bother that helps prevent real security issues like the ones listed in this article. Even if you want to help lazy programmers, I don't think there's a good reason to confuse "User" and "uſER" by default. | |
| ▲ | TheDong 4 hours ago | parent | prev [-] | | I mean, one of the more silly things here is that even when you manually map, with the tag `json:"user"`, it still ignores the case of that while deserializing. It respects it while serializing though |
|
| |
| ▲ | v5v3 a day ago | parent | prev [-] | | Indeed... |
|
|
| ▲ | DidYaWipe 7 hours ago | parent | prev | next [-] |
| pitfalls |
|
| ▲ | Rikudou 18 hours ago | parent | prev [-] |
| And people keep on hating on PHP... Case-insensitive JSON parsing is another level, one even PHP wouldn't cross. |