srean 2 hours ago

Data science solutions are different in the sense that they rarely ever get done and dusted the way a sorting library might.

There's almost always something or other breaking. Did the nature of the data change? Did my upstream data feed change? Why is this small set of examples not working for this high-paying customer?

You would need resources to understand and fix these problems quarter after quarter.

A rich network of data dependencies can be a double-edged sword. Rarely are upstream code and data changes benign to the output of the layer you own.

There are two cases where AI solutions are a perfect fit. The first is that they are so good they are fire-and-forget. The second is that your customer is a farmer, not a gardener: individual failing saplings mean little to him.

If a single misbehaving plant can cause commercially significant damage, then when choosing opaque tools you must consider the maintenance cost you may be signing up for.

Say I have a ton of historical data that is being continuously added to. It's a real temptation to replace the raw data with a model that uses fewer parameters than the raw data: lossy compression, in a sense. That can be a very bad idea. The data instances where the model does not fit well may be the most important pieces of information. The model paints with a broad brush stroke. If you are hunting faults, be aware that lossy compression can paper them over. You are also potentially harming a future model that could have been trained, because you threw away a decade of useful data just because storage costs were running high.
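A minimal sketch of that failure mode (the sensor data, the fault days, and the seasonal fit are all made up for illustration): a three-parameter model reproduces the typical days well, but it fits worst exactly on the rare fault days, and if you store only the coefficients, the evidence of those faults is gone.

```python
import numpy as np

# Hypothetical example: a decade of daily readings with a few genuine fault spikes.
rng = np.random.default_rng(0)
days = np.arange(3650)
raw = 20 + 5 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 0.5, days.size)
raw[[500, 1800, 3000]] += 15          # rare fault days -- the valuable part

# "Lossy compression": keep only a low-parameter seasonal fit instead of the raw data.
X = np.column_stack([np.ones_like(days, dtype=float),
                     np.sin(2 * np.pi * days / 365),
                     np.cos(2 * np.pi * days / 365)])
coef, *_ = np.linalg.lstsq(X, raw, rcond=None)   # 3 numbers replace 3650 points
reconstructed = X @ coef

# The broad-brush model looks fine on a typical day...
print("median abs error:", np.median(np.abs(raw - reconstructed)))
# ...but the fault days are exactly where it fits worst; storing only `coef`
# throws away the residuals that are the evidence of those faults.
print("error on fault days:", np.abs(raw - reconstructed)[[500, 1800, 3000]])
```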

There's no easy solution. The general recommendation would be to compress, but losslessly, simply because you don't know what may be valuable in the future. If that's impossible, then so be it: you have to eat that opportunity cost in the future, but you did your best.
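A toy sketch of the lossless route using Python's standard zlib (the integer counts are made up for illustration): the point is only that every value comes back exactly, so nothing is papered over.

```python
import zlib
import numpy as np

# Hypothetical example: a decade of daily integer event counts, stored losslessly.
rng = np.random.default_rng(0)
counts = rng.poisson(20, 3650).astype(np.int64)

packed = zlib.compress(counts.tobytes(), 9)          # lossless: every byte recoverable
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int64)

assert np.array_equal(counts, restored)              # exact round trip, nothing lost
print(f"{counts.nbytes} bytes -> {len(packed)} bytes")
```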