jmugan 7 months ago

My problem isn't running out of memory; it's loading in a complex model where the fields are BaseModels and unions of BaseModels multiple levels deep. It doesn't load it all the way and leaves some of the deeper parts as dictionaries. I need like almost a parser to search the space of different loads. Anyone have any ideas for software that does that?

enragedcacti 7 months ago | parent | next [-]

The only reason I can think of for the behavior you are describing is if one of the unioned types at some level of the hierarchy is equivalent to Dict[str, Any]. My understanding is that Pydantic will explore every option provided recursively and raise a ValidationError if none match but will never just give up and hand you a partially validated object.
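A minimal sketch of that difference, assuming Pydantic v2. The `Cat`, `Dog`, `StrictOwner`, and `LooseOwner` models are hypothetical and only there for illustration. Without a dict-like member, the union fails loudly. Once `Dict[str, Any]` is one of the options, the same payload validates and stays a plain dictionary:

```python
from typing import Any, Dict, Union

from pydantic import BaseModel, ValidationError


# Hypothetical models, not taken from the original poster's code.
class Cat(BaseModel):
    meows: int


class Dog(BaseModel):
    barks: int


class StrictOwner(BaseModel):
    # No catch-all member: a payload must match Cat or Dog.
    pet: Union[Cat, Dog]


class LooseOwner(BaseModel):
    # Dict[str, Any] accepts any string-keyed mapping.
    pet: Union[Cat, Dog, Dict[str, Any]]


payload = {"pet": {"quacks": 3}}  # matches neither Cat nor Dog

try:
    StrictOwner.model_validate(payload)
    raised = False
except ValidationError:
    raised = True

loose = LooseOwner.model_validate(payload)

print(raised)      # the strict union rejects the payload
print(loose.pet)   # the loose union keeps it as a raw dict
```

If your model behaves like `LooseOwner`, the "partially loaded" parts may simply be the catch-all branch winning.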

Are you able to share a snippet that reproduces what you're seeing?

jmugan 7 months ago | parent [-]

That's an interesting idea. It's possible there's a Dict[str,Any] in there. And yeah, my assumption was that it tried everything recursively, but I just wasn't seeing that, and my LLM council said that it did not. But I'll check for a Dict[str,Any]. Unfortunately, I don't have a minimal example, but making one should be my next step.
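One way to narrow this down before building a minimal example is to walk the validated object and report every path that still holds a raw dict. This is only a debugging sketch, assuming Pydantic v2. `find_raw_dicts` and the `Inner`/`Outer` models are hypothetical names, not part of Pydantic:

```python
from typing import Any, Dict, Iterator, List, Union

from pydantic import BaseModel


def find_raw_dicts(value: Any, path: str = "") -> Iterator[str]:
    """Yield the path of every plain dict found inside a validated model."""
    if isinstance(value, BaseModel):
        # model_fields is read from the class, not the instance.
        for name in type(value).model_fields:
            child_path = f"{path}.{name}" if path else name
            yield from find_raw_dicts(getattr(value, name), child_path)
    elif isinstance(value, dict):
        yield path or "<root>"
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            yield from find_raw_dicts(item, f"{path}[{index}]")


# Hypothetical models for illustration only.
class Inner(BaseModel):
    x: int


class Outer(BaseModel):
    a: Inner
    b: Union[Inner, Dict[str, Any]]
    c: List[Union[Inner, Dict[str, Any]]]


outer = Outer.model_validate({"a": {"x": 1}, "b": {"y": 2}, "c": [{"y": 1}]})
leftover_paths = list(find_raw_dicts(outer))
print(leftover_paths)
```

Each reported path points at a field where a dict-typed union member was chosen, which is where to look for a `Dict[str, Any]` in the type annotations.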

enragedcacti 7 months ago | parent [-]

One thing to watch out for while you debug is that the default 'smart' mode for union discrimination can be very unintuitive. As you can see in this example, an int vs a string can cause a different model to be chosen two layers up even though both are valid. You may have perfectly valid uses of Dict within your model that are being chosen in error because they result in less type coercion. left_to_right mode (or ideally discriminated unions if your data has easy discriminators) will be much more consistent.

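    >>> from typing import Any, Dict
    >>> from pydantic import BaseModel, Field

    >>> class A(BaseModel):
    ...     a: int
    >>> class B(BaseModel):
    ...     b: A
    >>> class C(BaseModel):
    ...     c: B | Dict[str, Any]

    >>> # smart mode (the default): the int needs no coercion, so B is chosen
    >>> C.model_validate({'c':{'b':{'a':1}}})
    C(c=B(b=A(a=1)))

    >>> # the string "1" would need coercion to int, so the Dict branch wins
    >>> C.model_validate({'c':{'b':{'a':"1"}}})
    C(c={'b': {'a': '1'}})

    >>> class C(BaseModel):
    ...     c: B | Dict[str, Any] = Field(union_mode='left_to_right')

    >>> # left_to_right: B is tried first and accepts the coerced value
    >>> C.model_validate({'c':{'b':{'a':"1"}}})
    C(c=B(b=A(a=1)))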

causasui 7 months ago | parent | prev | next [-]

You probably want to use Discriminated Unions https://docs.pydantic.dev/latest/concepts/unions/#discrimina...
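For reference, a rough sketch of what that looks like in Pydantic v2. The `Card`, `Wire`, and `Payment` models and the `kind` tag field are made up for illustration. The tag alone decides which model is used, so no smart-mode guessing is involved:

```python
from typing import Literal, Union

from pydantic import BaseModel, Field, ValidationError


# Hypothetical models; "kind" is the discriminator field.
class Card(BaseModel):
    kind: Literal["card"]
    last4: str


class Wire(BaseModel):
    kind: Literal["wire"]
    iban: str


class Payment(BaseModel):
    method: Union[Card, Wire] = Field(discriminator="kind")


payment = Payment.model_validate({"method": {"kind": "wire", "iban": "DE00"}})
print(type(payment.method).__name__)

# An unknown tag is rejected instead of falling through to another member.
try:
    Payment.model_validate({"method": {"kind": "cash"}})
    unknown_tag_rejected = False
except ValidationError:
    unknown_tag_rejected = True
```

This only works if every member of the union carries the tag field, so a bare `Dict[str, Any]` member cannot take part in it.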

jmugan 7 months ago | parent [-]

Yeah, I'm doing that

not_skynet 7 months ago | parent | prev | next [-]

going to shamelessly plug my own library here: https://github.com/mivanit/ZANJ

You can have nested dataclasses, as well as specify custom serializers/loaders for things which aren't natively supported by json.

jmugan 7 months ago | parent [-]

Ah, but I need something JSON-based.

not_skynet 7 months ago | parent [-]

It does allow dumping to/recovering from json, apologies if that isn't well documented.

Calling `x: str = json.dumps(MyClass(...).serialize())` will get you json you can recover to the original object, nested classes and custom types and all, with `MyClass.load(json.loads(x))`

cbcoutinho 7 months ago | parent | prev [-]

At some point, we have to admit we're asking too much from our tools.

I know nothing about your context, but in what context would a single model need to support so many permutations of a data structure? Just because software can, doesn't mean it should.

shakna 7 months ago | parent [-]

Anything multi-tenant? There's a reason Salesforce is used by so many large organisations. The multi-nesting lets you account for all the discrepancies that come with scale.

Just tracking payments through multiple tax regions will explode the places where things need to be tweaked.