Why backtesting fails against modern fraud

Why backtesting fails
Martin Rehak
Published on 13.03.2026
Updated on 13.03.2026

Backtesting, in our context, is the age-old practice of validating any new or modified fincrime prevention method on past, known, and labeled data in order to judge the impact it may have upon deployment. Backtesting is and will remain indispensable, but its usefulness will be far more limited in the future.

Backtesting is based on the critical assumption that the criminal and legitimate behaviors observed in the past are representative of future behaviors. However, that assumption is breaking down in an increasing number of fraud and fincrime domains.

Criminal behavior is no longer stationary and defined by a broad population. Because the professional evolution of criminals is accelerating and the statistical properties of fraud are shifting much faster than our historical datasets can capture, backtesting increasingly provides biased and misleading results.

The fincrime game is getting faster, though not uniformly and not everywhere. Criminals operate globally. They outsource elements of their business, like document forgery, account opening, or money muling.

On top of that, they have started using AI to innovate rapidly. This completely changes how we design fraud detection systems and measure their effectiveness.

Traditional fraud model

Backtesting was a very good approach to system evaluation when criminal behavior was carried out by individuals, stayed repetitive, and was thus stable. Traditional financial criminals were and remain individuals, operating and learning independently and locally.

If they were really good and went global, they might find themselves on the silver screen, like Frank Abagnale in Catch Me If You Can. But, by and large, individual criminals operated similarly to each other and repeated mistakes others had made in the past. Because they were learning independently of each other, their mistakes were repetitive and predictable.

Machine learning and risk management techniques designed for approximately i.i.d. data worked reasonably well because fraud exhibited time stationarity, relatively stable population characteristics, and slow-learning dynamics. If we drew a very abstract graph of traditional criminal behavior likelihood projected onto the sophistication axis, we would get a continuous distribution.

On a large scale, we can mentally approximate it with a Gaussian-like shape for intuition, although in reality it is an empirical distribution specific to each company and each time period.

Each criminal contributes marginally to the overall fraud distribution, and only the best ones progressively climb the distribution to the right. Most criminals are “full stack”, independent operators with a limited number of frauds committed.

Traditional fraud and backtesting

Backtesting makes perfect sense in this context. It is a great way to evaluate the defenses against a relatively stable distribution of criminal behavior. It lets you measure the overlap between the distribution of criminal behavior and legitimate behavior.

Once you have the overlap, you can place the decision boundary with a specific ratio of false positives and false negatives, thus creating the simplest supervised detector¹.
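To make the overlap idea concrete, here is a minimal sketch on synthetic data. The Gaussian score distributions, their parameters, and the sample sizes are illustrative assumptions, not real fraud statistics: sliding a single threshold over a one-dimensional risk score trades false positives against false negatives.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 1-D "risk scores": legitimate users cluster low,
# traditional (broadly distributed) criminals cluster higher.
legit = rng.normal(loc=0.0, scale=1.0, size=100_000)
fraud = rng.normal(loc=3.0, scale=1.0, size=1_000)

def rates(threshold):
    """Classify score >= threshold as criminal (the positive class)."""
    fp = np.mean(legit >= threshold)   # legitimate users flagged
    fn = np.mean(fraud < threshold)    # criminals missed
    return fp, fn

# Sliding the decision boundary trades one error for the other.
for t in (1.0, 1.5, 2.0, 2.5):
    fp, fn = rates(t)
    print(f"threshold={t:.1f}  FP={fp:.4f}  FN={fn:.4f}")
```

Against a stable, broad fraud distribution, backtesting this threshold on last year's labels gives a trustworthy estimate of next year's error rates; that is exactly the property modern fraud takes away.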

Sadly, today’s fraud is very different and breaks all the assumptions behind the distribution. While financial crime still exists on an individual level, it is rapidly outpaced by criminal professionals, whose output now dwarfs traditional crime in high-risk domains.

What fraud is like today

It is critical to understand that criminal innovation is now both rapid and done simultaneously on different levels of the criminal business. Criminals have a broad library of operational procedures, and can select an approach and then modify it rapidly to respond to any changes you make in your defensive posture.

The steps a criminal gang takes today to open a bank account, to name but one example, can be very different from the ones the same gang used yesterday. The process will likely be different again from the one the same organization will use tomorrow.

Missing crime and introducing bias

Backtesting fundamentally fails against professional criminals because its estimate of your ability to detect fraud is highly unstable and dependent on your past defenses. Backtesting measures the intersection of the very narrow, dynamic behaviors of highly scalable, automated criminals, behaviors shaped by your past defenses.

“Shaped by your past defenses” is the key phrase here.

 

Criminals respond to your past controls. They are economical: their attacks are no more sophisticated than strictly necessary. They often sell their products (like business identities or onboarded accounts) on very efficient black markets, and the reputation and escrow mechanisms those markets impose create strong short-term incentives.

If you successfully off-board the newly created account within hours, they often deliver another one to protect their reputation even at cost to themselves. If you block the same account after two months, they are off the hook.

These incentives are behind a hoarding phenomenon, where your past defenses (resulting in failed onboardings and off-boarded accounts), combined with the cost factors on the attacker side, create an environment where:

  1. Your defenses define what passes through.
  2. The cost structure of each gang shapes their attack behavior and determines their cost.
  3. Many professional attackers compete in a very efficient global market, and one or more of the cheapest reliable providers progressively capture most of it.
  4. This further reduces their costs and drives more concentrated fraud behavior.
  5. The price of the crime decreases further.
  6. At some point you notice that something is wrong and assess new techniques using backtesting.

The effect is the distribution we show below. Instead of a “statistically reasonable” distribution of individual criminals, modern fraud is a set of narrow, Dirac-like distributions. We are hunting highly concentrated and constantly shifting behavioral clusters².

Why backtesting fails

Backtesting fails on a technical level for four main reasons:

  1. Estimating the intersection of a few narrow spikes with the more regular distribution of background traffic is highly unstable.

  2. Backtesting is also unstable against criminals’ change of strategy induced by your change of posture. The minute you reject the first criminal attempt on a new version of your defenses, the criminals start to evolve. They reach into their library of behaviors and start experimenting with additional attack options, perfected on other targets. It is a race, because the winner will be able to capture a larger part of the black market. Criminals used to react in weeks or months. Now, they react within hours or days.

  3. Defense by others increases the pressure on you as well, albeit over a longer time frame. Criminals constantly innovate across their complete set of targets to reduce their costs and to increase their operational efficiency and security. This produces a slower concept drift, informed by the defenses of others and by new AI automation options. The library of strategies available to criminals keeps growing, and with it the complexity of the fraud problem.

  4. Backtesting against a few narrow spikes is easy to game. Every single vendor who wants to survive in a backtesting-defined market will always look at the data and check if the spikes have been detected. Honest vendors will tweak their systems to make sure they perform well against a small number of big targets. Less honest vendors will bend their systems beyond what is sustainable in production in an attempt to win the deal at any cost.
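Point 1 can be illustrated with a toy simulation. All the distributions, thresholds, and “playbook centers” below are invented for illustration: repeating the same backtest over many simulated historical periods, the measured recall is stable when fraud is one broad population, but swings wildly when each period is dominated by a single narrow spike.

```python
import numpy as np

rng = np.random.default_rng(7)
THRESHOLD = 2.0          # fixed detector: score >= 2.0 -> flag

def backtest_recall(fraud_scores):
    """Fraction of known fraud the detector would have caught."""
    return np.mean(fraud_scores >= THRESHOLD)

def broad_fraud(n):
    # Traditional fraud: one broad population of individual criminals.
    return rng.normal(3.0, 1.0, n)

def spiky_fraud(n):
    # Modern fraud: one gang dominates each period; its behavior is a
    # narrow spike whose position depends on the playbook in use.
    center = rng.choice([1.0, 2.5, 4.0])
    return rng.normal(center, 0.05, n)

# Repeat the backtest over many simulated historical periods.
broad = [backtest_recall(broad_fraud(200)) for _ in range(500)]
spiky = [backtest_recall(spiky_fraud(200)) for _ in range(500)]

print(f"broad fraud:  recall {np.mean(broad):.2f} +/- {np.std(broad):.2f}")
print(f"spiky fraud:  recall {np.mean(spiky):.2f} +/- {np.std(spiky):.2f}")
```

In the spiky case, each individual backtest reports a recall near 0 or near 1 depending on where the dominant gang’s spike happens to sit relative to the detector, so any single historical window is a poor predictor of the next one.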

The decreasing predictive value of backtesting also helps explain why naive supervised learning approaches often miss new criminals and new behaviors of existing criminals. They can over-index on past crimes and lean on correlated features not directly tied to the fraud mechanism.

As a result, models may become very confident at detecting people similar to past fraudsters, while the up-to-date criminal mechanisms are already shifting elsewhere.

Beyond backtesting

Given the current pace of criminal innovation, deploying a fraud detection system is not a one-and-done proposition. It is a commitment to both an initial investment and to a constant daily investment afterward to keep it effective against ever-changing criminal activity.

So, if backtesting does not work, how do you pick the best vendor, whether internal or external?

 

The answer is in assessing the vendor’s maturity (including your in-house vendor’s). Measure where they are now, how deep their system is, how well they detect new attacks, and how fast they react.

How does the vendor perform now?

  1. Live testing. Instead of backtesting, perform live testing over a period of time. Ideally assess multiple vendors, with diverse technologies, in order to cross-check their outputs.

  2. Test over a longer time period. The exact length depends on your data volumes and attack variability, but a few months is a good middle ground.

  3. Constantly assess the statistical significance of the results. Make sure they are not skewed or biased by a small set of low-variance, high-volume events. If they are, disregard those events and keep testing.

  4. Measure the variety of detections, not their volume. Detecting a high volume of uniform events should still count as 1. See below for detection timing assessment.
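A minimal sketch of the variety-over-volume idea, using an invented detection log and a hypothetical behavioral fingerprint field (not any real product schema): group each vendor’s hits by fingerprint and count distinct campaigns rather than raw alerts.

```python
# Hypothetical detection log: (vendor, behavioral fingerprint) pairs.
# A fingerprint might hash the device, document template, and flow;
# the names here are illustrative only.
detections = [
    ("vendor_a", "template_x"), ("vendor_a", "template_x"),
    ("vendor_a", "template_x"), ("vendor_a", "template_x"),
    ("vendor_b", "template_x"), ("vendor_b", "mule_ring_7"),
    ("vendor_b", "synthetic_id_3"),
]

def volume(vendor):
    """Raw alert count: rewards catching one uniform campaign many times."""
    return sum(1 for v, _ in detections if v == vendor)

def variety(vendor):
    """Distinct behaviors caught: a thousand hits on one campaign count as 1."""
    return len({f for v, f in detections if v == vendor})

for vendor in ("vendor_a", "vendor_b"):
    print(vendor, "volume:", volume(vendor), "variety:", variety(vendor))
```

Here vendor_a looks better on volume (4 vs. 3) but vendor_b demonstrably sees three distinct criminal behaviors instead of one, which is the signal that matters against fast-moving attackers.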

Measure the visibility of new criminal behaviour to the vendor

  1. Global visibility. Am I going to be the first client in their portfolio to be hit? Criminals act globally, and you want a vendor with global visibility, ideally one focused on the industries where criminal innovation is at the very bleeding edge.

  2. Intelligence collection operations. Assess how the vendor collects, publishes and leverages the information about the criminals. Measure the feedback speed from collection to deployment in production.

  3. KYC. Know Your Criminal. Measure the proportion of detections that can be properly explained, attributed to a specific threat actor or where the criminal intent is made clear by the system.

Resilience to the new criminal behavior

  1. Defense in depth. Is the system composed of independent detection layers (the Swiss cheese model) with different detection principles, in order to maximize its capability to detect new attacks?

  2. Feedback loops. Are the feedback loops between different layers and detectors properly engineered to maximize the overall detection performance over time?

  3. Time-to-detect. In competitive assessments, how often was the system the first to detect a specific novel attack or attack variant?

Resistance to probing

  1. Probing resistance. Is the system designed to withstand continuous probing by a technically savvy criminal with adversarial machine learning skills? Is there a monitoring process that can detect the probing and react?

  2. Probing complexity. Is the system deliberately designed to increase the complexity of adversarial probing and to reduce the attacker’s information gained from each successive rejection?
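One way to reason about probing complexity is to model the visible accept/reject signal as a noisy channel. The sketch below is a simplification for intuition, not a description of any real product feature: if the system randomizes its visible response with some probability, the attacker’s information gain per probe drops like the capacity of a binary symmetric channel.

```python
import math

def info_per_probe(flip_p):
    """Bits an attacker learns per probe when the visible accept/reject
    signal is randomly flipped with probability flip_p (capacity of a
    binary symmetric channel: 1 - H(flip_p))."""
    if flip_p in (0.0, 1.0):
        return 1.0   # a deterministic (or always-inverted) signal leaks fully
    h = -flip_p * math.log2(flip_p) - (1 - flip_p) * math.log2(1 - flip_p)
    return 1.0 - h

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"flip probability {p:.1f}: {info_per_probe(p):.3f} bits/probe")
```

Under this toy model, at a 30% randomization rate a probe leaks roughly 0.12 bits instead of a full bit, so mapping out a decision boundary costs the attacker roughly eight times as many rejected attempts.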

Speed of adaptation

  1. Change deployment speed. How much time does it take to deploy an updated version of the system to production? Is it measured in minutes or hours, rather than days?

  2. Incremental improvement pace. How well does the system perform against the incrementally improved versions of recent attacks? Does it keep improving with every new version of the attack? Are the new versions discovered faster?

  3. Ability to discover variant attacks. What percentage of new attack variants is blocked outright by the system? This ratio directly impacts the ability of the criminals to iteratively learn from the system responses.

Conclusion

Fincrime, like fintech, is always changing. Criminals can use vibe coding to build new attacks much faster, and generative AI to increase the diversity of the attacks they produce. On the other side, the defenders can also accelerate their response using this technology.

In reality, the progress is uneven. There are very few real-world consequences to failed fraud attempts. Therefore, the criminal’s loss function is much less punishing, and the criminals are able to leverage generative AI much faster. So we can expect future frauds to be:

  • Innovating faster.
  • Producing less uniform, more diverse attacks at minimal incremental cost.
  • Reacting faster to any defensive response.
  • Leveraging any information leaked through your defensive responses more efficiently.

The game keeps growing on us. It demands we grow with it. Resistant AI is the way to keep pace.


¹ In this article, we detect criminals (positives); therefore false positives are legitimate users classified as criminals, and false negatives are criminals classified as legitimate users who avoided detection.

² All this is not unprecedented. Historically, computer security was a small field with few products: antivirus, firewall, maybe email security for the overachievers. Products were aimed at catching a small number of viruses and some esoteric threats. As the stakes increased, criminals started to try harder, the security industry responded, and a typical financial company now runs 80+ security tools covering a far broader and far more sophisticated set of attacks.

 
