How Human Intelligence Drives Model Performance in Fraud Prevention
Jade Gu
Published March 12, 2026
From an external perspective, SentiLink's fraud scoring appears fully automated. Partners submit application information via API and receive risk scores in return, all in well under a second.
This blink-and-you'll-miss-it interaction looks like pure artificial intelligence, and the scores are indeed generated in real time by sophisticated machine learning models. But the fuel that powers those models is human intelligence: manual, boots-on-the-ground case reviews.
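To make that partner-facing round trip concrete, here is a minimal sketch of what submitting an application and receiving a score could look like. The endpoint URL, field names, auth scheme, and response shape below are illustrative assumptions, not SentiLink's actual API.

```python
# Minimal sketch of a partner-side request; endpoint, fields, and auth scheme
# are hypothetical placeholders, not SentiLink's real API.
import requests

application = {
    "first_name": "Jane",
    "last_name": "Doe",
    "dob": "1990-01-01",
    "phone": "+15555550100",
    "email": "jane.doe@example.com",
    "address": "123 Main St, Springfield, IL",
}

resp = requests.post(
    "https://api.example.com/v1/risk-scores",  # hypothetical endpoint
    json=application,
    auth=("partner_id", "api_key"),
    timeout=1.0,  # responses typically arrive well under a second
)
resp.raise_for_status()
print(resp.json().get("identity_theft_score"))  # e.g. a numeric risk score
```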
As one of the newest analysts on SentiLink's Fraud Intelligence Team (FIT), I've gotten to see firsthand over the past few months just how much of what happens at SentiLink, both in our products and in the way we function as a company, is driven by manual review.
What do Fraud Intelligence Analysts do?
The main responsibility of the Fraud Intelligence Team (FIT) at SentiLink is to manually review applications and provide a label from our Fraud Taxonomy. We assess these applications via a number of categorized queues. For example:
- The generic queues are monitoring queues that contain stratified samples of applications across different score bins, used to pulse-test the models' performance on a week-by-week basis (see the sampling sketch after this list)
- Swap queues contain applications picked to help our Data Science team compare the performance of different model iterations against each other
- Pilot queues contain applications from specific prospective partners to help us better understand the scale and scope of the fraud threats they're facing
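As a rough illustration of how such a monitoring queue could be assembled, here is a minimal stratified-sampling sketch. The bin edges, per-bin sample sizes, and data shape are assumptions for illustration, not SentiLink's actual pipeline.

```python
# Sketch of building a weekly monitoring queue by sampling each score bin;
# bin edges, per-bin counts, and the input shape are illustrative assumptions.
import random

def build_generic_queue(scored_apps, per_bin=50, seed=0):
    """scored_apps: list of dicts with an 'id' and a numeric 'score' (0-1000)."""
    bins = [(0, 400), (400, 700), (700, 1001)]  # low-, mid-, and high-score strata
    rng = random.Random(seed)
    queue = []
    for lo, hi in bins:
        in_bin = [a for a in scored_apps if lo <= a["score"] < hi]
        queue.extend(rng.sample(in_bin, min(per_bin, len(in_bin))))
    return queue  # analysts label every sampled case to pulse-test the model
```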
FIT also manually labels escalations — tricky cases that our partners' fraud teams have referred to SentiLink for manual review and labeling. This is something we've covered in a previous blog post.
As a new Fraud Intelligence Analyst, I work on the generic queues every week, monitoring for active fraud trends or any unusual patterns.
To assess and label applications, FIT leverages a variety of tools, both internal and external. On the internal side, we use SentiLink's Intercept product, the cluster (a collection of other applications linked to the current application by PII elements), Manifest (a tool that aggregates all known identities linked to the application's PII), and more. When the situation calls for it, we also use external tools such as inmate searches, phone verification tools, and social media.
High-scoring (> 700) and low-scoring (< 400) applications are typically cases that we score confidently as fraudulent or not fraudulent, respectively. For example, when we see a single VOIP phone being used by applicants across multiple states with a high velocity of applications in our network, our Identity Theft Score is likely to be in the 800s or 900s. On the flip side, if a person is using an email, phone, and address that they have been associated with for years, the Identity Theft Score will most likely indicate low risk.
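To give a flavor of the kind of network signal behind that first example, here is an illustrative sketch of a simple phone-velocity feature; the data shape and field names are assumptions, not SentiLink's feature code.

```python
# Illustrative only: count how many applications share a phone number,
# and across how many states, as a simple velocity signal.
from collections import defaultdict

def phone_velocity(applications):
    """applications: iterable of dicts with 'phone' and 'state' keys."""
    by_phone = defaultdict(list)
    for app in applications:
        by_phone[app["phone"]].append(app)
    return {
        phone: {
            "application_count": len(apps),
            "distinct_states": len({a["state"] for a in apps}),
        }
        for phone, apps in by_phone.items()
    }
```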
Some more complicated scenarios fall in the middle of the score range, and these often benefit greatly from the perspective of a human being. You might see a slightly elevated Identity Theft Score on an application containing a typoed email address and a phone number that the application identity has not been associated with before. When a FIT analyst assesses the case, we can pick up on minute details that contextualize why elements of an application look the way they do.
For example, we may determine that this application likely came from an in-store interaction, which often opens the door to innocuous typos. We may see that this phone belongs to a family member or some other associate with whom the applicant shares address history. The contents of an application are not black and white, and it is in these gray areas where FIT analysts are able to add value both for our partners and for future versions of our models through more accurate labeling.
The breadth of labels in our Fraud Taxonomy allows for a more granular understanding of fraud and how it manifests than scores alone can signal. As analysts, we aim to discern the empirical truth of each application we review, and applying highly specific labels allows us to capture the broad range of behaviors we observe. Those labels feed the models' scoring logic in a continual cycle of refinement, manual review, and quality assurance, resulting in scores that more cleanly separate fraud from non-fraud.
Human intelligence to inform machine learning
FIT's labels are critical for ensuring our models' scoring remains accurate and that the models can adapt as fraudsters change tactics. So FIT prioritizes labeling at high volume: in 2025, FIT manually labeled more than 42,000 cases. We also use cluster labeling, which lets us bulk-label applications that share certain PII (e.g., when a new email ties to a high volume of applications in an identity theft attack) to maximize impact on the models.
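A minimal sketch of what cluster labeling could look like, assuming applications are simple records keyed by PII fields; the function name, field names, and label values are hypothetical.

```python
# Sketch of cluster labeling: propagate one label to every application that
# shares a given PII element (field names and label values are assumptions).
def cluster_label(applications, pii_field, pii_value, label):
    """Return {application_id: label} for every application sharing the PII element."""
    return {
        app["id"]: label
        for app in applications
        if app.get(pii_field) == pii_value
    }

# e.g., bulk-label everything tied to one email seen in an identity theft attack:
# labels = cluster_label(apps, "email", "suspect.address@example.com", "identity_theft")
```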
But of course, generating a high volume of labels is not useful unless the labels are accurate. The accuracy of FIT labels, which serve both SentiLink partners and the Data Science team, is ensured by a secondary review process in which we require a 90% accuracy rate. A tenured FIT analyst randomly samples a set of cases that each analyst has labeled and independently provides their own label. If there is a discrepancy between the original analyst's label and the QA label, the case is brought to the full team for discussion until a consensus is reached.
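As a back-of-the-envelope illustration of that check, here is a sketch that computes an agreement rate between an analyst's labels and the QA reviewer's independent labels; the data shapes, sample size, and threshold handling are assumptions, not the team's actual tooling.

```python
# Sketch of the secondary-review check: sample QA'd cases, compare labels, and
# report whether agreement meets a 90% bar (shapes and defaults are illustrative).
import random

def qa_agreement(analyst_labels, qa_labels, sample_size=20, threshold=0.90, seed=0):
    """Both arguments map case_id -> label; qa_labels covers independently re-reviewed cases."""
    rng = random.Random(seed)
    sampled = rng.sample(sorted(qa_labels), min(sample_size, len(qa_labels)))
    disagreements = [c for c in sampled if analyst_labels[c] != qa_labels[c]]
    rate = 1 - len(disagreements) / len(sampled)
    return rate >= threshold, rate, disagreements  # disagreements go to team discussion
```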
As a new analyst, my labels were QAed at a higher rate during my first few weeks than those of experienced analysts, to ensure accuracy as I transitioned from learning SentiLink's fraud taxonomy into high-volume queue labeling. Through this manual QA process, I was able to learn about specificities and edge cases of the taxonomy that are not immediately apparent until your own label is reexamined by someone with more experience. For example, through a case I initially mislabeled as clear, I learned that newly created addresses on certain email domains (ones that have become more archaic in public use over time) can be riskier indicators of identity theft.
Our QA process exists not only to double-check analyst accuracy, but also to assess model accuracy. We carefully track "model misses" (cases where the analyst's label for an application contradicts the model's score), and each one is closely examined by QA to determine whether the model really missed the mark and, if it did, why. Often, these "misses" flag a blind spot in a model or surface an edge case the model isn't picking up on. Finding true model misses is a high-impact way to improve the models' ability to pick up on nuance, and even when a miss turns out not to be one, examining why the model and the label disagreed is valuable.
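Conceptually, flagging these cases amounts to comparing each analyst label with the model's score bucket. The sketch below reuses the score thresholds mentioned earlier and a hypothetical "clear" label for non-fraud; none of the names come from SentiLink's codebase.

```python
# Sketch of flagging "model misses": analyst label vs. model score bucket.
# Thresholds echo the blog's examples; label values are illustrative.
def flag_model_misses(cases, high=700, low=400):
    """cases: list of dicts with 'id', numeric 'score', and analyst 'label'."""
    misses = []
    for case in cases:
        labeled_fraud = case["label"] != "clear"
        if case["score"] >= high and not labeled_fraud:
            misses.append((case["id"], "high score, labeled clear"))
        elif case["score"] < low and labeled_fraud:
            misses.append((case["id"], "low score, labeled fraud"))
    return misses  # each flagged case is then examined in QA
```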
An example of how FIT labels influence our models
Historically, SentiLink's Identity Theft Score model treated an aged email (an email with years of tenure, as opposed to a newly created one) as a positive sign, on the generally true assumption that the applicant has been using this email for a long time.
However, through manual review, FIT discerned that this was not always the case: aged emails were appearing on many applications that were determined to be fraudulent despite those applications scoring lower. This occurs, we determined, because the longer an email has existed, the more likely it is to have appeared in a data breach or otherwise been compromised over time. Sophisticated fraudsters, who are now aware that email age is a widely used fraud signal, are increasingly using "legacy PII" like these compromised emails when it is available. With FIT's input, the Data Science team adjusted the model's weighting of aged emails to account for this nuance. There are still nuances to consider with the adjusted scoring; rules alone cannot completely determine whether a person is using an aged, leaked email or an email they have been using consistently for many years.
(For more on the "legacy PII" fraud MO, check out the latest SentiLink Fraud Report).
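Purely as a conceptual sketch (and not SentiLink's actual model logic or weights), the adjustment can be thought of as discounting the benefit of email tenure when an aged address shows signs of compromise and has no history with the applicant. Every number and signal name below is invented for illustration.

```python
# Conceptual sketch only: email tenure lowers risk by default, but that benefit
# is discounted for aged addresses that look like compromised "legacy PII".
# All weights and signal names here are invented for illustration.
def email_age_adjustment(email_age_years, seen_in_breach, years_with_applicant):
    if email_age_years < 1:
        return +0.3   # newly created email: modest risk bump
    if seen_in_breach and years_with_applicant < 1:
        return +0.1   # aged but compromised and new to this applicant: slight risk bump
    return -0.3       # aged email with consistent applicant history: risk reduction
```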
Manual review beyond FIT
Although supervised learning is not a novel approach, SentiLink maintains deference to the process of manual review even as our scores continually improve. Crucially, we believe it is important to keep this process close to every person's work in our organization, so that we continually create models that increasingly mirror analyst intelligence at scale. We achieve this by sharing Fraud Intelligence Team insights across our Product, GTM, and Sales teams, as well as by cultivating a deep understanding of fraud in every individual at SentiLink. In fact, it is not just FIT that reviews cases: all SentiLinkers manually review and label at least a few fraud cases every week.
Each application contains a human story, whether fraudulent or not, and it is through manual review that the small nuances found in each case can be applied broadly, continually teaching us how fraud occurs. SentiLink's fraud detection is built on the earnest belief that human review provides the essential nuance that allows for evolving precision.