"Well that's all fine in practice, but how does it work in theory?"

posted Aug 25, 2013, 7:07 PM by Admin PEMRF   [ updated Aug 25, 2013, 10:16 PM ]


 


Dr. Garret FitzGerald (1926-2011), former Taoiseach (Premier) of Ireland, architect of the Anglo-Irish Peace agreement, and an academically inclined economist who once arrived for a public function in one brown and one black shoe.




"Well that's all fine in practice, but how does it work in theory?"

When presented with a solution to a difficult problem, this was apparently former Taoiseach Garret FitzGerald's response. It is the kind of answer that gets academics a bad name. It struck me today, however, in the context of developing rules to predict rare events.

 

The classic 'I made it up because I'm an expert and then I validated it' rule is the apnea rule proposed by Wilwerth et al. After coming up with the rule, the authors tested it on a retrospective data set, which they apparently had not reviewed before developing the rule. Their rule perfectly predicted all the infants with apnea in the data set.

 

The foundation is supporting a project that is currently attempting to derive a rule prospectively. The authors are using the standard logistic regression approach. But even with adjustment for rare events, problems arise when the goal is a model with 100% sensitivity. One hundred percent sensitivity is required for outcomes such as apnea (it's not really good enough to send home just a few children to die). When the events become very rare and a zero miss rate is required, the imbalance in the data makes the task almost impossible. CART analysis didn't help, for the same reasons.
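To make the difficulty concrete, here is a minimal sketch in Python. It is not the project's model: the risk scores are simulated, and the group sizes, distributions and function names are all assumptions chosen for illustration. It shows that demanding 100% sensitivity pins the cut-off to the single lowest-scoring case, so the specificity you end up with depends on one infant.

```python
import random


def threshold_for_full_sensitivity(case_scores):
    """Highest cut-off that still flags every case: the lowest case score."""
    return min(case_scores)


def sensitivity_specificity(case_scores, control_scores, threshold):
    """A subject is flagged as high risk when its score is >= threshold."""
    true_pos = sum(s >= threshold for s in case_scores)
    true_neg = sum(s < threshold for s in control_scores)
    return true_pos / len(case_scores), true_neg / len(control_scores)


rng = random.Random(0)

# Hypothetical model scores (assumed, not real data): 20 rare cases
# among 2000 controls, with cases scoring higher on average.
controls = [rng.gauss(0.0, 1.0) for _ in range(2000)]
cases = [rng.gauss(1.5, 1.0) for _ in range(20)]

cut = threshold_for_full_sensitivity(cases)
sens, spec = sensitivity_specificity(cases, controls, cut)

print(f"cut-off set by the lowest-scoring case: {cut:.2f}")
print(f"sensitivity {sens:.0%}, specificity {spec:.0%}")
```

Changing the seed moves the lowest case score, and the specificity moves with it, which is one way of seeing why a zero miss rate is so unstable when events are rare.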

 

And yet the authors noted that, with a little thought, it was fairly easy to derive a series of rules which, when tested, had 100% sensitivity and specificity as high as 60%. Our researchers could do this because humans can fairly easily bias a classification tool when they create it manually. At first blush such a tool, with 100% sensitivity and 60% specificity for a potentially fatal outcome in small infants, sounds wonderful. But is this really any different from the exercise performed by Wilwerth et al.? One can argue that it is more data-informed, done as it was following extensive univariable and multivariable modeling. This is true; however, it does not excuse what is happening here. What is happening here, and what happened in Wilwerth et al., is a classic example of overfitting. Predictably, two subsequent authors failed to validate the rule proposed by Wilwerth et al., one in a prospective and one in a retrospective data set, both at centers different from that of Wilwerth et al. The foundation does not want to see its work similarly discredited.

 

The fundamental problem comes down to this. In both cases, informed experts are using either their own experience, or data together with their own experience, to craft a classifier that works perfectly in their own setting (i.e. their ER) but not in anyone else's. This is simply not good enough. Inevitably the data will be overfit and the classifier will not generalize (to somebody else's ER or infant).
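A small simulation makes the point. This is a sketch under stated assumptions, not an analysis of either rule: a single score cut-off stands in for a hand-built rule, the scores are simulated, and every name and number below is hypothetical. The cut-off is tuned to catch every case in a derivation sample, then applied to a fresh sample of cases drawn from exactly the same population.

```python
import random


def tuned_cutoff(derivation_cases):
    """Tune the rule so it catches every derivation case."""
    return min(derivation_cases)


def misses_any(case_scores, cutoff):
    """True if at least one case falls below the cut-off (a missed case)."""
    return any(s < cutoff for s in case_scores)


def fraction_failing_validation(n_derivation, n_validation, trials, seed=0):
    """Share of simulated studies in which the rule misses a validation case."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Both samples come from the same assumed distribution of case scores.
        derivation = [rng.gauss(1.5, 1.0) for _ in range(n_derivation)]
        validation = [rng.gauss(1.5, 1.0) for _ in range(n_validation)]
        cutoff = tuned_cutoff(derivation)
        if misses_any(validation, cutoff):
            failures += 1
    return failures / trials


failure_rate = fraction_failing_validation(20, 20, trials=2000)
print(f"rule misses at least one validation case in {failure_rate:.0%} of runs")
```

With independent, continuous scores the lowest of all the cases is equally likely to be any one of them, so the chance that it lands in the validation sample is n_validation / (n_derivation + n_validation), about one half for equal-sized samples. That is before any real difference between centers is added, which would only make matters worse.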

 

As Garret FitzGerald said, “It's all very well in practice, but how does it work in theory?”

 

SMOTE'ing and other oversampling approaches used to balance the data set can be taken to the extreme, such that cases are oversampled in a proportion that effectively penalizes the controls. This is distinct from applying a penalty for false negatives in the evaluation phase of a classifier, although this distinction is not often brought out in the clinical literature. I’ll discuss how this all pans out in a subsequent entry.
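The distinction can be sketched in a few lines of Python. Two assumptions are worth flagging: plain replication of cases is used as a simpler stand-in for SMOTE (which creates synthetic interpolated cases rather than copies), and the predicted probabilities, the factor k and the cost values are made up for illustration. Replicating each case k times changes the loss a model is fitted to, exactly as a weight of k on every case would. A false-negative penalty at evaluation only re-scores the decisions of a classifier that has already been fitted.

```python
import math


def log_loss_sum(probs, labels, case_weight=1.0):
    """Summed negative log-likelihood, with cases (label 1) weighted."""
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            total += -case_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total


def evaluation_cost(probs, labels, threshold, fn_cost, fp_cost=1.0):
    """Cost of a fixed classifier's decisions; the model is not refitted."""
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return fn_cost * fn + fp_cost * fp


# Hypothetical predicted probabilities for 2 cases and 6 controls.
probs = [0.30, 0.08, 0.05, 0.10, 0.02, 0.20, 0.04, 0.12]
labels = [1, 1, 0, 0, 0, 0, 0, 0]
k = 3  # assumed oversampling factor

# Training side: add k - 1 extra copies of every case to the data set.
case_probs = [p for p, y in zip(probs, labels) if y == 1]
over_probs = probs + case_probs * (k - 1)
over_labels = labels + [1] * (len(case_probs) * (k - 1))

replicated = log_loss_sum(over_probs, over_labels)
weighted = log_loss_sum(probs, labels, case_weight=k)
print(math.isclose(replicated, weighted))  # same training objective

# Evaluation side: same predictions, false negatives simply cost more.
print(evaluation_cost(probs, labels, threshold=0.10, fn_cost=50))
```

In the first half the objective itself is reweighted, so a model fitted to it would shift toward the cases at the controls' expense. In the second half nothing about the model changes; only the score assigned to its mistakes does.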

 
