
Behind the Numbers: What Goes Into a SentiLink Retrostudy

Charlie Custer

Published May 5, 2026

Every fraud vendor offers something like a retrostudy — a test that uses historical performance data to measure how their model would have performed for your business. But how a retrostudy is designed and executed matters a lot; it is very possible to design a retrostudy process that produces appealing results but isn't predictive of the real-world experience you'll have using their tools.

SentiLink's retrostudies are carefully designed to ensure that the output is accurate, fair, and genuinely predictive of the performance you can expect when using our solutions in production. In this article we'll walk through how that process works and note some critical elements that one should look for when evaluating retrostudies from any vendor.

Before getting into methodology, a bit of table stakes: a good retrostudy should cost nothing. Free retrostudies are the norm, but we occasionally hear of vendors charging for them. SentiLink does not, except when a partner requires us to charge a fee (some public-sector organizations cannot legally accept free services).

Data selection

The foundation of any retrostudy is the dataset, and getting this wrong can undermine everything that follows, regardless of how rigorous the rest of the methodology is.

Most critically, the sample needs to be representative of the full application population: goods and bads together in their real-world proportions, not just accounts that eventually went bad. A dataset composed only of known fraud or charge-off accounts makes it impossible to evaluate whether a model can distinguish high-risk applications from low-risk ones, and doesn't provide any insight into how frequently applications flagged by the model are false positives.

Volume matters too. Our models are designed for precision, typically flagging roughly 2–4% of applications as high-risk. In a dataset of 5,000 records, that yields somewhere between 100 and 200 flagged cases — too few to draw statistically meaningful conclusions. As a rule, a retrostudy should include at least 100,000 applications, and ideally more.
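To make the arithmetic concrete, here is a rough sketch of how the uncertainty around measured precision shrinks as the flagged population grows. The 2–4% flag rate comes from above; the 50% assumed precision and the normal-approximation confidence interval are illustrative assumptions, not SentiLink parameters.

```python
# Rough sketch: half-width of a 95% confidence interval around measured
# precision, as a function of how many applications end up flagged.
# The 50% assumed precision is purely illustrative.
import math

def precision_ci_halfwidth(n_flagged: int, precision: float = 0.5, z: float = 1.96) -> float:
    """Normal-approximation 95% CI half-width for a measured precision."""
    return z * math.sqrt(precision * (1 - precision) / n_flagged)

for n_apps in (5_000, 100_000, 500_000):
    for flag_rate in (0.02, 0.04):
        n_flagged = int(n_apps * flag_rate)
        print(f"{n_apps:>7,} apps @ {flag_rate:.0%} flagged -> "
              f"{n_flagged:>6,} cases, precision CI ±{precision_ci_halfwidth(n_flagged):.1%}")
```

At 5,000 applications and a 2% flag rate, the interval is roughly ±10 percentage points; at 100,000 applications it tightens to about ±2, which is the difference between a suggestive result and a defensible one.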

Timeframe is a related concern. A dataset covering only a few months may not capture seasonal variation in application behavior or fraud patterns. Targeting at least 12 months of data gives the study more stability and makes the results more generalizable.

Finally, the data needs to reflect your current fraud controls. If your process includes top-of-funnel identity fraud rejections but the retrostudy dataset excludes those rejected applications, the study can only measure the model's performance on already-filtered traffic, and the incremental value of the model against the full population won't be visible. Including top-of-funnel rejects with reject reason tags (fraud, credit, CIP, and where possible the specific fraud type) enables us to provide a much clearer picture of the incremental value our models can offer.
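As an illustration, a retrostudy extract that includes top-of-funnel rejects might look something like the sketch below. The field names and layout are hypothetical, not a SentiLink file specification.

```python
# Hypothetical record layout for a retrostudy extract that includes
# top-of-funnel rejects alongside booked accounts. Field names are
# illustrative only.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class RetroRecord:
    application_id: str
    application_date: date
    approved: bool
    reject_reason: Optional[str] = None   # "fraud", "credit", "CIP", or None if approved
    fraud_type: Optional[str] = None      # e.g. "synthetic", "identity_theft", where known
    charged_off: Optional[bool] = None    # outcome, for applications that were booked
    months_on_book: Optional[int] = None

records = [
    RetroRecord("A-0001", date(2024, 3, 14), approved=True, charged_off=False, months_on_book=18),
    RetroRecord("A-0002", date(2024, 3, 15), approved=False, reject_reason="fraud", fraud_type="synthetic"),
    RetroRecord("A-0003", date(2024, 3, 16), approved=False, reject_reason="credit"),
]
```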

Scoring methodology

How applications are scored during a retrostudy has as much impact on the results as the model itself.

Retrostudies should be conducted in as-of fashion: each application is scored using only information that was available at the time it was originally submitted. Without this constraint, models can benefit from look-ahead bias. An application submitted on October 31, 2024 and rescored on October 31, 2025 may produce a different — and seemingly more accurate — result simply because the model has an additional year of data to work with. That 2025 result looks good on a slide, but it tells you nothing about how the model will perform in production.
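A minimal sketch of what as-of scoring implies in practice, assuming a feature built from time-stamped identity events; the function and field names are illustrative.

```python
# Minimal sketch of "as-of" feature construction: only events observed on or
# before the application date may contribute to that application's score.
from datetime import date

def as_of_features(application_date: date, identity_events: list[dict]) -> dict:
    """Build features using only the events visible at application time."""
    visible = [e for e in identity_events if e["observed_at"] <= application_date]
    return {
        "prior_events": len(visible),
        "distinct_addresses": len({e["address"] for e in visible}),
    }

events = [
    {"observed_at": date(2024, 6, 1), "address": "123 Main St"},
    {"observed_at": date(2025, 2, 9), "address": "456 Oak Ave"},  # would leak into an Oct 2024 score if not excluded
]
print(as_of_features(date(2024, 10, 31), events))  # {'prior_events': 1, 'distinct_addresses': 1}
```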

A related issue arises when vendors train a custom model using your retro data and then evaluate that same model on the same data. Overfitting — or "training to the test" — will produce impressive retro results that may not hold up once the model is deployed against a broader, live population. If evaluating vendors for a custom model, it's reasonable to put those models head-to-head, but the evaluation should include performance against a held-out dataset that the vendors didn't use for training to confirm the results are replicable rather than a product of overfitting.
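One way to structure that check is sketched below with a generic scikit-learn-style classifier on stand-in data; the only point being made is that the holdout set is never touched during training.

```python
# Sketch of a held-out evaluation: train on one slice, report performance on a
# slice the model never saw. Data and model choice here are stand-ins.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))                                    # stand-in features
y = (X[:, 0] + rng.normal(scale=2, size=10_000) > 2).astype(int)    # stand-in fraud label

X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("train AP:  ", average_precision_score(y_train, model.predict_proba(X_train)[:, 1]))
print("holdout AP:", average_precision_score(y_holdout, model.predict_proba(X_holdout)[:, 1]))
```

A large gap between the training and holdout numbers is exactly the overfitting signature described above.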

Defining the target problem

A retrostudy can only tell you how well a model solves the problem it is being evaluated against. That sounds obvious, but misalignment here is a common source of misleading results.

The first question is which fraud types are in scope. Synthetic fraud, identity theft, and first-party fraud have different underlying signals, different loss curves, and different label quality characteristics, so pitting one vendor's single-vector model against another's blended model isn't a fair test. Even two models nominally targeting the same fraud type may define that fraud type differently. In any retrostudy, and especially in a comparative head-to-head, it is critical that models be evaluated on the same problem, against the same data, and using the same fraud definitions and taxonomy.

The second question is which outcome label the model is being evaluated against. Charge-offs, account closures, and first payment defaults all capture fraud in different ways and with different degrees of completeness. Being explicit about this up front again ensures vendors are optimizing for the same thing, and also ensures that you're seeing the results in the right framing and context for the business problem you're aiming to solve.

Labeling and seasoning

Fraud labeling is imperfect. Ops teams have limited capacity, fraud methodologies evolve, and a meaningful share of fraud losses often end up classified as credit losses or charge-offs rather than fraud. Relying on a formal fraud label alone will undercount actual fraud exposure.

A more complete picture typically comes from combining fraud labels with charge-off data, account closure flags, and other performance indicators. The appropriate combination depends on the fraud type in question — which brings us to seasoning.
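Before turning to seasoning, here is a brief sketch of what a combined label might look like; the column names and the specific combination rule are illustrative and would vary by fraud type.

```python
# Sketch: building a combined "bad" label from several imperfect signals,
# since a formal fraud flag alone tends to undercount losses.
import pandas as pd

def combined_bad_label(accounts: pd.DataFrame) -> pd.Series:
    """True if any of the available bad-outcome signals fired."""
    signals = accounts[["fraud_flag", "charged_off", "closed_for_cause"]]
    return signals.fillna(False).any(axis=1)
```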

Fraud charge-off curves behave differently by fraud type. Synthetic fraud losses tend to taper around 24–36 months on book, while identity theft and first-party fraud typically plateau closer to 12–18 months. This means accounts need to be aged long enough for losses to have materialized before they're useful as labels. A minimum of six months of seasoning is a reasonable floor; 12–24 months is better for most use cases.

A useful step during study design is to break out charge-off rates by months on book, which makes it easier to identify where the loss curve has flattened and pick the right analysis window accordingly.
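A sketch of that breakout, assuming a pandas DataFrame where `chargeoff_mob` is the months on book at charge-off (blank if the account never charged off); the column name, and the assumption that every account is seasoned to the full window, are illustrative.

```python
# Sketch: cumulative charge-off rate by months on book, used to spot where the
# loss curve flattens. Assumes every account has aged at least `max_mob`
# months and that `chargeoff_mob` is NaN for accounts that never charged off.
import pandas as pd

def cumulative_chargeoff_curve(accounts: pd.DataFrame, max_mob: int = 36) -> pd.Series:
    rates = {m: (accounts["chargeoff_mob"] <= m).mean() for m in range(1, max_mob + 1)}
    return pd.Series(rates, name="cumulative_chargeoff_rate")

# Toy example: most of the charge-offs land in the first 24 months.
accounts = pd.DataFrame({"chargeoff_mob": [6, 11, 14, 22, 30] + [float("nan")] * 95})
curve = cumulative_chargeoff_curve(accounts)
print(curve.loc[[12, 24, 36]])  # rate at 12, 24, and 36 months on book
```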

Evaluating performance

Once scores are in hand, how performance is measured determines what conclusions can actually be drawn.

A rank-ordered performance table is the right tool for comparing models. Rather than comparing results at a particular threshold — which reflects configuration choices as much as model quality — a rank-ordered table shows precision and recall across the riskiest segments of the population: the top 10 bps, 25 bps, 50 bps, and so on. Comparing these tables across vendors isolates actual model performance from threshold decisions either vendor may have made.
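A minimal sketch of how such a table might be computed from scores and outcome labels; the basis-point cutoffs match those mentioned above, and the function and column names are illustrative.

```python
# Sketch: precision and recall within the riskiest slices of the scored
# population, by basis points of total volume.
import numpy as np
import pandas as pd

def rank_ordered_table(scores: np.ndarray, labels: np.ndarray,
                       bps_cutoffs=(10, 25, 50, 100, 200)) -> pd.DataFrame:
    order = np.argsort(-scores)            # riskiest applications first
    labels_sorted = labels[order]
    total_bads = labels.sum()
    rows = []
    for bps in bps_cutoffs:
        k = max(1, int(len(scores) * bps / 10_000))
        flagged = labels_sorted[:k]
        rows.append({
            "top_bps": bps,
            "flagged": k,
            "precision": flagged.mean(),
            "recall": flagged.sum() / total_bads if total_bads else np.nan,
        })
    return pd.DataFrame(rows)
```

Building the same table for each vendor, from the same scores-and-labels file, is what makes the comparison threshold-independent.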

This is also why all vendors in a comparison should be evaluated on the same dataset over the same timeframe. Even minor differences in the underlying data — one vendor working from a filtered export, another from the full population — can make a head-to-head comparison unreliable.

Calculating financial impact

The financial figures that come out of a retrostudy tend to get a lot of attention, so it's important to understand where they're coming from and what they represent.

In the context of fraud prevention, ROI typically comes from three main sources:


  • Losses prevented — fraudulent applications that were approved and charged off, but that would have been flagged by our model had it been in place at the time of application.
  • Incremental approvals — legitimate applications that would otherwise be declined because their identities couldn't be verified, but that can be approved because SentiLink positively verifies the identity.
  • Operational expense reductions — hours spent on manual case reviews, which often drop because the accuracy of SentiLink's scoring and the efficiency of tools like Intercept mean both fewer reviews and faster, more accurate handling of the reviews that remain.

It matters which of these sources a topline financial estimate draws on and what assumptions went into the calculation. In head-to-head retrostudies, it is critical that the financial figures you're seeing from both parties are apples-to-apples; otherwise they won't be useful for comparing performance.
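For illustration only, here is roughly how the three components might be rolled up into a topline figure. Every input below (counts, average loss, per-review cost) is a hypothetical assumption, which is exactly why the assumptions behind any vendor's figure deserve scrutiny.

```python
# Simplified sketch of rolling up the three ROI components.
# All inputs are hypothetical and exist only to show the arithmetic.
def retro_roi_estimate(
    prevented_chargeoffs: int, avg_loss_per_chargeoff: float,
    incremental_approvals: int, avg_margin_per_account: float,
    reviews_avoided: int, cost_per_review: float,
) -> dict:
    losses_prevented = prevented_chargeoffs * avg_loss_per_chargeoff
    approval_lift = incremental_approvals * avg_margin_per_account
    opex_savings = reviews_avoided * cost_per_review
    return {
        "losses_prevented": losses_prevented,
        "incremental_approvals": approval_lift,
        "opex_savings": opex_savings,
        "total": losses_prevented + approval_lift + opex_savings,
    }

print(retro_roi_estimate(120, 3_500.0, 400, 250.0, 5_000, 6.0))
```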

Conclusion

A well-designed retrostudy accounts for all of the above: a representative, sufficiently large, and appropriately seasoned dataset; a clearly defined target problem; as-of scoring; rank-ordered performance measurement; and transparent loss calculations. When any of these elements are missing or handled loosely, the results become harder to interpret and easier to game.

SentiLink's retrostudy process is built around these standards. It's part of why we've won a strong majority of the deals that reached the retrostudy stage since 2024, and why we're happy to talk through our methodology in detail before any test begins. If you'd like to see how SentiLink performs against your current solution, reach out — we're confident in what the data will show.

 

