srean 9 hours ago
A 'secret weapon' that has served me very well for learning classifiers is to first learn a good linear classifier. I am almost hesitant to give this away (kidding). Use the non-thresholded output of that linear classifier as one additional feature dimension over which you learn a decision tree. Then wrap the whole thing up as a system of boosted trees (that is, with more short trees added if needed).

One of the reasons it works so well is that the two models' weaknesses are complementary: (i) decision trees have a hard time fitting linear functions (they have to stair-step a lot, and therefore need many internal nodes), and (ii) linear functions are terrible where equi-label regions have a recursively partitioned structure. In the tree-building process, the first cut usually falls on the synthetic linear feature, which earns the tree the linear classifier's accuracy right away and leaves the DT algorithm free to work on the regions where the linear classifier is struggling. This idea is not that different from boosting.

One could also consider different (random) rotations of the data to form a forest of trees built using the steps above, but that was usually not necessary. Or rotate the axes so that all are orthogonal to the learned linear classifier. One place where DTs struggle is when the features themselves are very (column) sparse: there are not many places to put the cut.
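A minimal sketch of the recipe above using scikit-learn; the dataset, hyperparameters, and variable names are my own illustrative choices, not the commenter's:

```python
# 1) Fit a linear classifier; 2) append its raw (non-thresholded)
# decision-function score as an extra feature column; 3) fit short
# boosted trees on the augmented data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lin = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Augment with the un-thresholded linear score w.x + b.
X_tr_aug = np.column_stack([X_tr, lin.decision_function(X_tr)])
X_te_aug = np.column_stack([X_te, lin.decision_function(X_te)])

# Short trees (max_depth=2): the first cut tends to land on the
# synthetic linear feature, as described above.
gbt = GradientBoostingClassifier(max_depth=2, random_state=0)
gbt.fit(X_tr_aug, y_tr)
acc = gbt.score(X_te_aug, y_te)
```

The trees are free to split on the linear score first and then spend their remaining depth on the regions the linear model gets wrong.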
whatever1 4 hours ago
If you think about it, this is what reinforcement learning folks are doing. They take the vanilla state and lift it to the observed state by doing some additional calculation on the original state data. For example, you start with the raw coordinates of the snake in a snake game, but you can then calculate how many escape routes the snake has and train on that.

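A toy illustration of the lifting described above (my own construction, not the commenter's code): from raw grid coordinates, derive an "escape routes" feature by counting the free cells adjacent to the snake's head.

```python
# Hypothetical 10x10 board; coordinates are (x, y) grid cells.
GRID = 10

def escape_routes(head, body, grid=GRID):
    """Count free cells next to the head: on the board and not
    occupied by the snake's own body."""
    hx, hy = head
    occupied = set(body)
    moves = [(hx + 1, hy), (hx - 1, hy), (hx, hy + 1), (hx, hy - 1)]
    return sum(
        1
        for (x, y) in moves
        if 0 <= x < grid and 0 <= y < grid and (x, y) not in occupied
    )

# Head in a corner with the body blocking one of the two exits:
# only one escape route remains.
print(escape_routes(head=(0, 0), body=[(0, 1)]))  # -> 1
```

The raw state (coordinates) already determines this number, but computing it explicitly hands the learner a feature it would otherwise have to discover.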
3abiton 6 hours ago
I think it's worth mentioning that the Achilles' heel of DTs is, in fact, data engineering (more specifically, feature engineering). If one does not spend significant time cleaning and engineering the features, the results will be much worse than those of, say, a "black box" model like a NN. This is the catch. Ironically, a NN can detect such latent features, but it is very difficult to interpret why.

u1hcw9nx 8 hours ago
There are decision trees for what you want to do: oblique decision trees, model trees (M5 trees, for example), Logistic Model Trees (LMT), or Hierarchical Mixtures of Experts (HME).

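A minimal sketch of the model-tree idea in the M5 spirit (my own toy code, not an implementation of any of the named algorithms): use a depth-1 tree only to pick an axis-aligned split, then fit a separate linear model in each leaf.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
# Piecewise-linear target: slope -2 left of zero, slope 3 right of it.
y = np.where(X[:, 0] < 0, -2 * X[:, 0], 3 * X[:, 0])

# A stump (depth-1 tree) supplies the split point.
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
thresh = stump.tree_.threshold[0]

# Fit one linear model per leaf instead of a leaf constant.
left, right = X[:, 0] <= thresh, X[:, 0] > thresh
leaf_l = LinearRegression().fit(X[left], y[left])
leaf_r = LinearRegression().fit(X[right], y[right])

pred = np.where(X[:, 0] <= thresh, leaf_l.predict(X), leaf_r.predict(X))
mse = float(np.mean((pred - y) ** 2))
stump_mse = float(np.mean((stump.predict(X) - y) ** 2))
```

On this piecewise-linear target, the linear leaves fit each segment almost exactly, while the stump's constant leaves cannot; this is the gap the M5/LMT family is built to close.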
ekjhgkejhgk 9 hours ago
> (ii) linear functions are terrible where equi-label regions have a partitioned structure.

Could you explain what "equi-label regions having a partitioned structure" means?

selimthegrim 3 hours ago
Is this not the heart of the IRM paper by Arjovsky, Bottou et al.?