The diagnosis and management of major trauma in the elderly is a hot topic in Emergency Medicine right now. Compared with younger patients, elderly patients can sustain much more serious injuries from apparently trivial mechanisms. What's more, they present very differently, and it's easy to underestimate injury severity in this group at the time of initial presentation to the ED. Traumatic brain injury, for example, often presents with relatively subtle symptoms because cerebral atrophy leaves more space in the cranial cavity. Subdural haemorrhage is common and can be tough to identify. Even after diagnosis, there remains the question of whether neurosurgery stands to provide any benefit.
That’s why this paper really caught my eye. The authors aimed to derive a simple decision tool that could help to safely reduce referrals to neurosurgery for elderly patients with acute subdural haemorrhage – i.e. a tool that could tell us when a subdural needs surgery.
Unfortunately it’s not #FOAMed but, as always, we’d really encourage you to read the full paper if you can get hold of it.
Not only is this a great idea for a study but the authors are colleagues from Greater Manchester whom I respect enormously. They've done a great job with this study. But will their findings withstand the dispassionate turning of the cogs of critical appraisal?
This was essentially a retrospective diagnostic cohort study. There's no mention of ethical approval or patient consent, but this was a study using routinely collected patient data, which may well have been anonymised – and as such the requirement for formal ethical approval has presumably been waived. The authors' objective was to derive a simple and reliable screening tool to identify elderly patients with a "surgically important" acute traumatic subdural haemorrhage.
The authors looked at the regional database of neurosurgical referrals and selected only those patients who were aged >65 years and who had an acute subdural haemorrhage during a 3-year period.
This does seem like a fairly reasonable target population. It’s the sort of patient group we might want to use a decision tool in. Our only question might be about the accuracy of the database interrogation and how likely it is that patients may have been missed.
A greater concern in this regard may be that, of the 483 patients identified, 250 were excluded. A total of 166 of the excluded patients, we're told, did not actually have a subdural haemorrhage – which makes you wonder why they were referred in the first place. A further 52 CT scans were not available for review, which might raise a few concerns about possible selection bias.
The parameters the authors were looking for were all measurements from the CT brain scans. Each of the CT scans was a ‘presenting’ scan but we don’t know much more about when they were performed exactly. There were also no clinical parameters considered for inclusion in the decision tool, which I thought had the advantage of simplicity but the disadvantage that it might ignore some potentially valuable information (like the GCS and comorbidities, for example).
One medical student made all of the measurements, including: the linear dimensions of the haematoma, the haematoma volume (using a previously validated method) and the extent of midline shift. You might ask whether it’s reasonable for the measurements to be made by a medical student. Would this be how we use the tool in practice? Surely not. But there are ways of addressing that limitation – like, for example, assessing interobserver reliability.
The authors assessed the interobserver reliability of the measurements by having a neurosurgical registrar repeat the measurements in 50 ‘randomly selected’ CT scans. This is >10% of all the scans evaluated – so far, so good.
Because the measurements are continuous variables (they can theoretically take any value) rather than simple yes/no answers (dichotomous variables), you can't actually calculate a kappa score. The kappa score is what we would usually report to assess how much the student and the registrar agreed beyond what we would have expected by chance.
Instead, the authors plotted Bland-Altman charts, which are used to examine the differences between two measurements of the same thing. Essentially, you plot the difference between the measurements against the mean of the two measurements. This gives you an idea of any systematic difference between the findings of the student and the registrar. Bland-Altman charts are most often used when comparing two different laboratory methods of running the same blood test.
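To make the mechanics concrete, here's a minimal sketch of the Bland-Altman quantities. The paired measurements are invented for illustration – they're not the study's data:

```python
# Bland-Altman analysis on invented haematoma thickness measurements (mm)
# from two hypothetical observers -- illustrative only.
student   = [12.0, 15.5, 8.0, 20.0, 11.0, 17.5]
registrar = [11.5, 16.0, 9.0, 19.0, 11.5, 18.0]

diffs = [s - r for s, r in zip(student, registrar)]
means = [(s + r) / 2 for s, r in zip(student, registrar)]  # x-axis of the chart

n = len(diffs)
bias = sum(diffs) / n                       # mean difference = systematic bias
sd = (sum((d - bias) ** 2 for d in diffs) / (n - 1)) ** 0.5
loa = (bias - 1.96 * sd, bias + 1.96 * sd)  # 95% limits of agreement

print(f"bias = {bias:.2f} mm, limits of agreement = {loa[0]:.2f} to {loa[1]:.2f} mm")
```

The chart itself is just `diffs` plotted against `means`, with horizontal lines at the bias and the two limits of agreement; a bias near zero with narrow limits suggests the two observers agree well.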
The authors also looked at the correlation between the measurements made by the student and the registrar using Spearman’s rho. Correlation tells us how much the two measurements depend on each other. A perfect positive correlation would mean that the measurements made by the student and the registrar go up perfectly in tandem. However, there is a limitation to using correlation as a means of testing interobserver reliability.
Suppose that the student’s measurements of diameter for 5 scans were 1mm, 2mm, 3mm, 4mm and 5mm.
And suppose that the registrar’s measurements were 7mm, 8mm, 9mm, 10mm and 11mm.
These two sets of measurements have a perfect correlation. But they are totally different! So you really need to do something more. The authors could have evaluated the intraclass correlation coefficient (ICC), which is essentially like the kappa score but for continuous variables. It takes account of the differences between measurements, whereas simple measures of correlation don't.
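A couple of lines of Python make the point with the toy numbers above: the correlation is perfect even though every pair of measurements disagrees by 6mm (Pearson's r is computed here for simplicity; Spearman's rho would also be exactly 1 for these data):

```python
# Toy data from the example above: perfect correlation, poor agreement.
student   = [1, 2, 3, 4, 5]    # mm
registrar = [7, 8, 9, 10, 11]  # mm

n = len(student)
mean_s = sum(student) / n
mean_r = sum(registrar) / n

cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(student, registrar))
var_s = sum((s - mean_s) ** 2 for s in student)
var_r = sum((r - mean_r) ** 2 for r in registrar)
pearson_r = cov / (var_s * var_r) ** 0.5

mean_abs_diff = sum(abs(s - r) for s, r in zip(student, registrar)) / n

print(pearson_r)      # 1.0 -- 'perfect' correlation...
print(mean_abs_diff)  # 6.0 -- ...despite a 6mm systematic disagreement
```

This is exactly the gap the ICC (or a Bland-Altman analysis) closes: it penalises the 6mm offset, while correlation is blind to it.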
So, although the authors tell us that the correlation coefficient (r) was pretty good at 0.95 (a perfect positive correlation gives an r of 1), we would like to know more before accepting that the interobserver reliability of the measurements is acceptable. What's more, we'd like to know the interobserver reliability of the final decision tool itself – but the authors haven't gone that far.
This is a really crucial question in diagnostic research. If the primary outcome is flawed, so is everything else. You can show 100% sensitivity and specificity – but it means nothing unless you know what the tool is 100% sensitive and specific for.
In this study, the primary outcome was a ‘surgical bleed’, defined as a subdural haemorrhage that a neurosurgeon believed needed an operation based only on the CT appearance. The question is whether that’s a reasonable, clinically meaningful and objective outcome. Does it get us closer to practising evidence-based medicine if we can accurately predict this outcome without telephoning a neurosurgeon? Does it generate new scientific knowledge if we can predict it?
I’d argue not. Unfortunately this outcome doesn’t tell us anything about actual patient outcomes. Not only that, but a neurosurgeon would never make a decision about the need for surgery based only on a CT scan appearance. Other information (such as the patient’s comorbidities and functional status including GCS) would surely factor.
A much better outcome would have given us some idea of the patient's outcome. This would move us towards a truly evidence-based decision tool. Using the opinion of a neurosurgeon based on CT simply moves us towards accurately predicting what a neurosurgeon's opinion might be – and that's not evidence-based medicine. What's more, if the neurosurgeon is deciding whether each bleed is 'surgical' based only on CT appearances, it's likely that they're making the decision based on the size of the bleed. It's therefore absolutely no surprise that measuring the size of the bleed tells us which bleeds are 'surgical'. Effectively, we find out that measuring bleeds helps to identify 'big bleeds'. That means we have some important incorporation bias.
The authors found that several features had 100% sensitivity for ‘surgically important’ subdural haemorrhage. The best combination was:
Maximum thickness of haematoma >10mm
AND midline shift >1mm
Using this combination would have had 100% sensitivity and 83% specificity. Thus, if that is a truly accurate decision tool, perhaps it could be used to reduce unnecessary referrals to neurosurgery. At least, that’s what the authors conclude.
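As a reminder of where figures like those come from, here's the arithmetic on a 2×2 table. The counts below are invented to reproduce roughly those headline numbers – they are not taken from the paper:

```python
# Hypothetical 2x2 table for a 'rule out' style decision tool.
# Counts are invented for illustration, not the study's data.
tp, fn = 40, 0    # surgical bleeds flagged / missed by the tool
fp, tn = 33, 160  # non-surgical bleeds flagged / correctly not flagged

sensitivity = tp / (tp + fn)  # proportion of surgical bleeds caught
specificity = tn / (tn + fp)  # proportion of non-surgical bleeds not flagged

print(sensitivity, round(specificity, 2))  # 1.0 0.83
```

With fn = 0 the sensitivity is 100% by construction, and it's the specificity (here about 83%) that tells you how many referrals a 'rule out' tool would actually save.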
A sensitivity of 100% looks great, doesn't it? But how were those cut-offs derived? The authors plotted receiver operating characteristic (ROC) curves – i.e. sensitivity versus 1 – specificity for every possible cut-off. We use ROC curves for diagnostic tests that are continuous variables and have lots of possible cut-offs. If we don't know the appropriate cut-off, we can decide on one based on the ROC curve. For example, we might want the best overall balance between sensitivity and specificity, in which case we take the point nearest to the top left corner. If we want a 'rule out' test, we might find the point with 100% sensitivity that has the highest possible specificity, and set the cut-off there.
In fact, that’s exactly what the authors did. They chose cut-offs to have 100% sensitivity. It’s a perfectly valid approach. The only problem is that the cut-offs are calibrated to this particular dataset. If they repeated the study, that particular cut-off might not perform so well. This means that you can’t draw a conclusion until you’ve validated the findings. So, we have a big reason to be cautious.
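Here's a minimal sketch of that 'rule out' cut-off selection. The (thickness, surgical) pairs are synthetic, invented purely to show the mechanics:

```python
# Pick the threshold with 100% sensitivity and the best specificity.
# Synthetic (thickness_mm, surgical) pairs -- not the study's data.
cases = [(4, False), (6, False), (8, False), (9, False), (11, True),
         (12, False), (14, True), (18, True), (22, True), (25, True)]

best = None  # (cut_off, specificity)
for cut in sorted({t for t, _ in cases}):
    flagged = [t >= cut for t, _ in cases]
    tp = sum(f and s for f, (_, s) in zip(flagged, cases))
    fn = sum((not f) and s for f, (_, s) in zip(flagged, cases))
    tn = sum((not f) and (not s) for f, (_, s) in zip(flagged, cases))
    fp = sum(f and (not s) for f, (_, s) in zip(flagged, cases))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    if sens == 1.0 and (best is None or spec > best[1]):
        best = (cut, spec)

print(best)  # the highest cut-off that still catches every surgical bleed
```

Notice that the winning cut-off is simply whatever just clears the smallest 'surgical' case in the sample – which is exactly why a threshold chosen this way is calibrated to the derivation dataset and needs validating in fresh data.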
As we often find, after a detailed critical appraisal, we can confidently say that, while this decision tool is a good idea (and this work has got the ball rolling), it’s a long way off being ready for prime time. We can’t use it in our practice yet. It’s several studies away from being something that might help us in our practice.
But at least this has got us to think about an important issue, made a start on the journey towards having a decision tool that could guide our practice, and also helped us to hone our critical appraisal skills.
Until next time!