MontyCarloHall 4 hours ago

Correlated attributes can still lead to the paradox, so long as the scatter measured parallel to the cutoff line (the "fuzziness" of the correlation) is large enough, roughly speaking larger than the spread that remains perpendicular to the cutoff among the included points. Here are a couple of cartoons to demonstrate. The dotted line is the cutoff x + y = z, and each datapoint is marked I or E, depending on whether it is included in or excluded from the region x + y > z.

Uncorrelated attributes:

   y
   │   ∙                
   │    ∙∙ IIIIIII      
   │     E∙∙IIIIIIII    
   │    EEEE∙∙IIIIIII   
   │    EEEEEE∙∙IIIII   
   │    EEEEEEEE∙∙III   
   │     EEEEEEEEE∙∙    
   │       EEEEEEE  ∙∙  
   │                  ∙ 
   └───────────────────x
Looking at just the Included points shows a clear (spurious) negative correlation.

Correlated attributes:

   y
   │  ∙              
   │   ∙∙   IIII   
   │     ∙∙IIIIII  
   │      E∙∙IIIII   
   │     EEEE∙III    
   │    EEEEEE∙∙     
   │     EEEE   ∙∙   
   │       E      ∙∙ 
   │                ∙
   └─────────────────x
The Included points still have a spurious negative correlation, though it's weaker than in the uncorrelated cartoon.
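
To check the cartoons numerically, here is a rough simulation sketch in Python. It assumes things the comment does not specify: both attributes are standard normal, the cutoff is z = 1, and the correlation values 0, 0.5 and 0.9 are arbitrary picks. `pearson` and `included_correlation` are hypothetical helper names, not from any library.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def included_correlation(rho, z, n=100_000, seed=0):
    """Correlation of x and y among the points with x + y > z.

    Assumption: x and y are standard normal with population correlation rho.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Mix x with independent noise so that corr(x, y) = rho.
        y = rho * x + math.sqrt(1 - rho ** 2) * noise
        if x + y > z:  # keep only the Included points
            xs.append(x)
            ys.append(y)
    return pearson(xs, ys)


uncorrelated = included_correlation(rho=0.0, z=1.0)
correlated = included_correlation(rho=0.5, z=1.0)
tight = included_correlation(rho=0.9, z=1.0)

print(f"rho=0.0: {uncorrelated:+.3f}")
print(f"rho=0.5: {correlated:+.3f}")
print(f"rho=0.9: {tight:+.3f}")
```

Under these assumptions the first two values should come out negative, with the uncorrelated case more strongly so, while the tightly correlated case should stay positive because there is too little scatter along the cutoff line to flip the sign.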