Scoring terrorism risk: questions and limitations for risk practitioners


This article was originally published in the June/July 2019 edition of ASIAL’s Security Insider magazine. 

Methods to assess terrorism risk were developed quickly, as a matter of necessity, following September 11 to allow organisations to understand, quantify and ultimately manage the complex risk posed by terrorism. This article looks at Aggregated Risk Scoring (ARS) methods of assessing terrorism risk, such as CARVER, EVIL DONE, and the Crowded Places Self-Assessment Tool (CPSAT), and examines several questions that highlight limitations which practitioners who use or rely on these types of tools and methods should keep in mind.

A brief overview of methods to assess terrorism risk

The US Department of Homeland Security (DHS) has led much of the development of terrorism risk assessment methods, which have then been adopted and followed by others. From FY2001 to FY2003, terrorism risk was quantified as a function of population.[1] Population or population density is still used as a measure of terrorism risk by some bodies, such as the Australian Reinsurance Pool Corporation. From FY2004 to FY2005, terrorism risk was determined by an additive formula in which risk was the sum of three features: a threat score, a critical infrastructure score and population density.[2] Whilst the DHS no longer uses additive formulas, or Aggregated Risk Scoring (ARS) as they are referred to in this article, they remain in use elsewhere, as evidenced by the release of the Crowded Places Self-Assessment Tool (CPSAT) by the Australian Government in 2018.

Aggregated Risk Scoring

Aggregated Risk Scoring (ARS) describes any method of assessing risk by scoring certain features, typically on an ordinal scale (e.g. on a scale of 1 to 10), and then adding the scores up to obtain an overall score. This overall score may then be matched to predefined risk levels (e.g. if the score is greater than 20 then the risk is “high”). There are many ARS methods, but the most familiar include CARVER, EVIL DONE, and the CPSAT.[3] ARS provides a way to perform structured assessments using standardised scores or ranks, and there is little doubt it will produce more consistent and reliable assessments than unstructured intuitive judgments of risk. The key strength of ARS is its convenience: it is easy to understand and use, with assessors simply needing to read the criteria, score, add and compare. But there are limitations and important questions which need to be acknowledged by any practitioner using or relying on these methods.
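The mechanics described above can be sketched in a few lines of Python. This is a minimal illustration only, not any published tool: the feature names, the 1 to 10 scale and the "medium" band are hypothetical, and the cut-off of 20 for "high" simply echoes the example given above.

```python
# Hypothetical ARS sketch: feature names, scale and thresholds are illustrative
# assumptions, not taken from CARVER, EVIL DONE or the CPSAT.

def aggregate_score(scores):
    """Add up the ordinal score given to each feature."""
    return sum(scores.values())


def risk_level(total, high_above=20, medium_above=10):
    """Match an overall score to a predefined risk level."""
    if total > high_above:
        return "high"
    if total > medium_above:
        return "medium"
    return "low"


# One assessor's scores for a single (invented) site.
site = {"feature_a": 8, "feature_b": 7, "feature_c": 6}
total = aggregate_score(site)
print(total, risk_level(total))
```

With these invented scores the total is 21, which falls just above the assumed cut-off for "high". A one-point change on any single feature would move the site into a different band.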

How do we know these are the right features or factors to look at?

Most ARS methods provide little explanation as to why they have chosen the features they have in their scoring scheme. In most cases, we simply assume that the features make sense and then accept that they are appropriate. For example, in the case of CARVER, no specific evidence is provided as to why “Criticality”, “Accessibility”, “Recuperability”, “Vulnerability”, “Effect”, and “Recognisability” are the correct features versus any other combination of possible scoring features. ARS methods do not typically test for the appropriateness of the features they use, which is perhaps why there are so many variations of ARS methods: no one can really agree (or prove) that their set of scoring features is actually the correct set to be looking at.

What is the importance of one feature relative to the other features?

Another related issue is that many ARS methods do not explain or account for the importance of features relative to other features. In the case of CARVER, all features are given the same weight, and this assumption has a significant impact on the results. This can of course be addressed by weighting the features, but the weighting itself requires evidence as to why one feature is more important than another, and by how much.
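To illustrate how much the weighting assumption can matter, the following sketch scores two invented sites under equal weights and under an alternative weighting. The scores, the three features chosen and the weights are all hypothetical; the point is only that the ranking of the two sites can reverse.

```python
# Hypothetical scores (1 to 10) for two sites; values are illustrative only.
site_a = {"criticality": 9, "accessibility": 3, "recognisability": 3}
site_b = {"criticality": 4, "accessibility": 6, "recognisability": 6}


def weighted_score(scores, weights):
    """Sum each feature score multiplied by its weight."""
    return sum(weights[name] * value for name, value in scores.items())


# Equal weighting: every feature counts the same.
equal = {"criticality": 1, "accessibility": 1, "recognisability": 1}
# An assumed alternative in which criticality counts three times as much.
criticality_heavy = {"criticality": 3, "accessibility": 1, "recognisability": 1}

for label, weights in [("equal", equal), ("criticality_heavy", criticality_heavy)]:
    a = weighted_score(site_a, weights)
    b = weighted_score(site_b, weights)
    print(label, "site_a:", a, "site_b:", b)
```

Under equal weights site_b scores higher (16 against 15). Under the assumed criticality-heavy weights site_a scores higher (33 against 24). Neither weighting is justified by evidence here, which is exactly the difficulty.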

How reliably does a person score a feature? How subjective is a feature to score?

One frustration with ARS methods is their use of features which are difficult to objectively measure or define, and therefore introduce a high degree of subjectivity into the scoring process. In the case of CPSAT, whilst some features have a quantifiable and somewhat objective grounding (such as the density of people at a location), other features such as the symbolism of a site and its social importance are highly subjective. With the exception of some major and prominent sites, what is symbolic or considered socially important is inevitably going to vary from person to person. This in turn affects the reliability of ARS methods using such subjective features.

How well are these features defined? What exactly is the implication of scoring X and scoring Y?

In addition to whether features are innately subjective, there is also an issue with how well ARS methods define their features. The CPSAT asks the assessor to score the density of people at a location between 1 and 7 but does not provide any guidance as to what constitutes a 1 and what constitutes a 7. Quite simply, there is often a lack of a definitive reference point in ARS methods which results in the same situation being given two different scores due to variations in understanding. To ensure reliability in the results, all assessors need to have the same understanding of what a feature means and what scoring it a particular value means versus scoring it another value.

Are the features independent or do they overlap with one another?

Related to how well the features are defined, one issue encountered in many ARS methods is that the features tend to overlap with each other. For example, in the CPSAT, there is arguably some overlap between whether a location is symbolic and whether it can be considered socially important, i.e. a symbolic location will also be socially important. In the case of CARVER, there is arguably overlap between the “Criticality” and “Effect” features of a location, i.e. a critical location will always have a large effect. The result of this overlap is that a single underlying characteristic may be counted twice in the overall score, distorting the overall result.
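The effect of double counting can be shown with a small sketch. The scores below are invented, the feature names only loosely follow the CPSAT examples mentioned above, and merging the overlapping pair by taking the higher of the two scores is an assumption made purely for illustration, not a recommended correction.

```python
# Hypothetical scores on a 1 to 7 scale; values are invented for illustration.
site_x = {"symbolism": 7, "social_importance": 7, "density": 2}
site_y = {"symbolism": 2, "social_importance": 2, "density": 7}


def total_all(scores):
    """Plain ARS total: every feature is added, overlapping or not."""
    return sum(scores.values())


def total_merged(scores):
    """Assumed adjustment: count the overlapping pair once, using the higher score."""
    merged = max(scores["symbolism"], scores["social_importance"])
    return merged + scores["density"]


for name, scores in [("site_x", site_x), ("site_y", site_y)]:
    print(name, "all features:", total_all(scores), "merged:", total_merged(scores))
```

Adding every feature puts site_x well ahead (16 against 11). If the two overlapping features are treated as one, the sites score the same (9 each), so the apparent gap comes entirely from counting one characteristic twice.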

How well do ARS methods actually perform?

This final question is perhaps the most important, and yet it is also the one which most try to avoid. There has been no systematic collection of data on how well ARS methods actually perform in identifying targets that are then actually attacked. The most important metric for a method of assessing terrorism risk is its false negative rate: how many locations were assessed as being low risk but were subsequently attacked. For most ARS methods, we have no idea what the false negative rate is. It is possible to perform retrospective testing of ARS methods, that is, to assess places which have been attacked using an ARS method. This may be one way to gather data on the performance of an ARS method, but it may be difficult to do without the bias of hindsight affecting the results. The fact that we do not know how the ARS methods in use actually perform should be front of mind for those who use and rely upon these methods to make critical decisions which impact the safety and security of others.
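If assessment records were collected, the false negative rate could be calculated along the following lines. The records here are entirely invented, and the sketch assumes the rate is expressed as the share of attacked locations that had been assessed as low risk, which is one common way of turning the count described above into a rate.

```python
# Invented records: (assessed risk level, whether the location was later attacked).
records = [
    ("low", True),
    ("high", True),
    ("low", False),
    ("medium", True),
    ("low", True),
    ("high", False),
]


def false_negative_rate(records):
    """Share of attacked locations that had been assessed as low risk.

    Returns None when no location in the records was attacked.
    """
    attacked_levels = [level for level, attacked in records if attacked]
    if not attacked_levels:
        return None
    missed = sum(1 for level in attacked_levels if level == "low")
    return missed / len(attacked_levels)


print(false_negative_rate(records))
```

In this invented data set, four locations were attacked and two of them had been assessed as low risk, giving a rate of 0.5. In retrospective testing the assessed level would be assigned after the attack, so hindsight could pull the rate lower than it would be in real use.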


[1] Masse, Todd, Siobhan O’Neil, and John Rollins. “The Department of Homeland Security’s risk assessment methodology: Evolution, issues, and options for Congress.” Congressional Research Service, Library of Congress, Washington DC, 2007, 5-6.

[2] Ibid.

[3] Newman, Graeme R., and R. V. G. Clarke. Policing terrorism: An executive’s guide. Diane Publishing, 2010; Murray‐Tuite, Pamela M., and Xiang Fei. “A methodology for assessing transportation network terrorism risk with attacker and defender interactions.” Computer‐Aided Civil and Infrastructure Engineering 25.6 (2010): 396-410.