class: center, middle, inverse, title-slide .title[ #
An End-to-end Project-based Approach to Teaching Data Mining Process
] .subtitle[ ##
A Case Study in Credit Card Fraud Detection
] .author[ ###
Cheng Peng
] .institute[ ###
West Chester University of Pennsylvania
] .date[ ###
05/14/2022
Presented at
eCOTS 2022: Teaching Data Mining
Slides available at:
]

---
class: inverse, middle

## Agenda

### Learning from Learning Theories

- Learning Theories
- Pedagogical Strategies

### Case Study: Credit Card Fraud Mining

- Fraud Background
- Analytic View of Fraud and Challenges
- Feature Extraction
- Analytic Fraud Identification Methods and Assessment
- Deployment and Automation

---
class: inverse, center, middle

# Learning from Learning Theories

---
class: center, middle

# Teaching DM Process vs Techniques

Cross-Industry Standard Process for Data Mining (CRISP-DM)

<img src = "https://github.com/pengdsci/eCOTS2022/blob/main/CRISP-DM.png?raw=true">

---

# Learning from Learning Theories

There are many learning theories, and they all fall under three major families.

<center><img src = "https://github.com/pengdsci/eCOTS2022/blob/main/learningTheory.png?raw=true" width="200" height="150"></center>

- **Behaviorism Learning Theory**: knowledge is independent of, and external to, the learner. It focuses on how the outside environment influences learning.
- **Cognitive Learning Theory**: learners process the information they receive rather than just responding to a stimulus, as in behaviorism learning theory. It uses metacognition (“thinking about thinking”) to understand how <font color = "red"> thought processes influence learning </font>.
- **Constructivism Learning Theory**: learners construct new ideas from prior knowledge and experiences <font color = "red"> through active engagement with the world (such as experiments or real-world problem solving) </font>.

---

# Some Principles of Constructivism Theory

I am a firm believer in constructivism learning theory.

- <font color = "blue"><b>Knowledge is constructed</b></font>. This is the basic principle: knowledge is built upon the foundation of previous learning.
- <font color = "blue"><b>Learning is a social activity</b></font>. Learning is something we do together, in interaction with each other, rather than an abstract concept.
- There is no knowledge independent of the meaning that the learner, or community of learners, attributes to (constructs from) experience.
- <font color = "blue"><b>Learning is contextual</b></font>: we do not learn isolated facts and theories that are separated from the rest of our lives.
- <font color = "blue"><b>Motivation is key to learning</b></font>. Cognitive motivation is rooted in the availability of information and past experience/prior knowledge.

---

# My Adopted Pedagogies in Teaching Analytics

- Providing experience with the knowledge construction process - students determine how they will learn.
- Providing experience in and appreciation for multiple perspectives - evaluation of alternative solutions.
- Embedding learning in realistic contexts - authentic tasks.
- Embedding learning in social experience - collaborative learning.
- Encouraging awareness of the knowledge construction process - reflection, metacognition.
- Helping students make sense of the information presently available and determine how to respond or relate to the current situation.

---
class: inverse, center, middle

# Case Study

### Credit Card Fraud Detection

---

# Adapt CRISP-DM for the Fraud Mining Process

<img src = "https://github.com/pengdsci/eCOTS2022/blob/main/CRISP-FD.png?raw=true">

---

# Credit Card Transaction Process

<img src = "https://github.com/pengdsci/eCOTS2022/blob/main/creditCardProcessing.png?raw=true">

---

# What is Credit Card Fraud?

**Credit card fraud** is a form of identity theft that involves the unauthorized taking of another’s credit card information for the purpose of charging purchases to the account or removing funds from it.

**Credit Card Fraud Types**: Credit card fraud schemes generally fall into one of two categories: application fraud and account takeover.
- Identity theft
- [Skimming Fraud (a kind of account takeover)](https://www.youtube.com/watch?v=G_aH50Tn8Fo)

**Why Combat Credit Fraud Loss**: Card fraud over the next decade will cost the industry a collective $408.50 billion in losses globally, according to an annual report from the industry research firm Nilson Report.

---

# Fraud Data Generation Process & Availability

<font color = "darkred"><b>Pre-authorization</b></font>: timestamp, geo-info of POS, card information (card number, expiration date, billing address, security code)

<font color = "darkred"><b>Authorization</b></font>: Pre-auth info + requested payment amount

<font color = "darkred"><b>Authentication</b></font>: the issuing bank will

- verify the authorization information sent from the processor: validating card info and checking the availability of funds (credit line); and
- send the result of the authentication to the merchant: approval or denial.
- The merchant will send the complete transaction information to the issuing bank or the processor.

---

# Fraud Data Generation Process: A General Fraud Management System

<img src = "https://github.com/pengdsci/eCOTS2022/blob/main/fraudDetationProcess.png?raw=true">

---

# Availability and Types of Data

Based on credit card processing and the general fraud detection system, the following information is available at different processing stages:

- **Pre-authorization Data**: geo-info of POS, timestamps, card information.
- **Authorization and Authentication Data**: pre-auth info + payment info.
- **Historical Data**: complete transaction information (at least 2 years back), confirmed fraud (labels), account information, etc.
- **Other Public Data**: e.g., crime rate.

---

# Data Preparation - Collection

**Goal**: detect/identify fraudulent transactions.

**Challenges**:

- No information about fraudsters!
- Real-time detection.
- Rarity of fraud.

**What information is relevant?**

- Current transaction: card info, timestamps, amount, POS info.
- Historical transactions: timestamps, amount, POS info, fraud labels.
- Account information: card holder’s info.
- Derived merchant site info (including publicly available info).

---

# Creating Analytic Data According to Potential Analytic Methods

- **Key Point**: <font color = "red"><b>Fraudulent activity alters genuine customers’ spending patterns!</b></font>
- **Cross-sectional Data**: current transactions.
- **Longitudinal/Panel Data**: current and historical transactions.
- **Hybrid Cross-sectional and Longitudinal Data**: both current transactions and aggregated information of historical transactions.

---

# Types of Candidate Models/Algorithms

- Business rules (expert system).
- Supervised classification models/algorithms: logistic and tree-based classification models/algorithms trained on fraud labels. They need to handle the rarity of fraudulent transactions, and the fraud index will be the most powerful predictor variable.
- Unsupervised anomaly detection methods: using the distribution to detect fraud, i.e., flagging high quantiles subject to operational constraints.
- Other probabilistic models/algorithms such as hidden Markov models (HMM).

---

# Fraud Index Based on Historical Transactions

<font color = "red"><b>How fraudulent activities alter genuine customers’ spending patterns.</b></font>

<center><img src = "https://github.com/pengdsci/eCOTS2022/blob/main/alterPattern.png?raw=true" width="600" height="300"></center>

- The transaction dollar amount is significantly different from that of genuine customers.
- The genuine customers’ spending frequency changes.
- The genuine customers’ transaction gap times (time between consecutive transactions) change.

---

# What is a Process Capability Index (PCI)?

Process capability compares the output of an in-control process to the specification limits by using capability indices.
<img src = "https://github.com/pengdsci/eCOTS2022/blob/main/rollingPCI.png?raw=true">

- If the PCI of a process is under a threshold, the process is incapable.
- There are different PCIs for different processes (mainly manufacturing processes).
- The upper and lower specification limits (USL and LSL) need to be estimated (there are different estimation methods).

---
class: inverse, center, middle

# A Numerical Example

### Data Layout, Candidate Models and Algorithms

---

# The “Capability” of Customers’ Spending Process – Fraud Index

For illustration, we define a fraud index based on the payment dollar amount, as shown in the following figure.

<img src = "https://github.com/pengdsci/eCOTS2022/blob/main/fraudIdxDef.png?raw=true">

---

# Pre-processed Data (Long Table)

<img src = "https://github.com/pengdsci/eCOTS2022/blob/main/rawData.png?raw=true">

---

# Data Matrix

<img src = "https://github.com/pengdsci/eCOTS2022/blob/main/dataMatriix.png?raw=true">

---
class: inverse, center, middle

# A PCI-like Fraud Index Using Payment Amount

## `$$idx=\frac{(USL-\mu)^2}{9(\mbox{max} - \mu)^2+(T-\mu)^2}$$`

### USL, T: estimated from the larger data set.

### max, `\(\mu\)`: estimated from the smaller data set.
### Sample sizes of both data sets are tuning parameters.

---

# How Fraud Index Works in Fraud Detection

<video width="800" height="550" controls> <source src="https://github.com/pengdsci/eCOTS2022/raw/main/IndexModelExample.mp4" type="video/mp4"> </video>

---

# Distribution of Resulting Fraud Index

<img src = "https://github.com/pengdsci/eCOTS2022/blob/main/fraudDist.png?raw=true">

- <font color = "red">The above figure indicates that the fraud index can be used as a standalone fraud detection algorithm with no structural parameters</font> - <font color = "blue"><b>an unsupervised anomaly detection method.</b></font>

---

# Performance Analysis

<center><img src = "https://github.com/pengdsci/eCOTS2022/blob/main/liftAnalysis.png?raw=true" width="600" height="400"></center>

- Consider a multivariate fraud index that incorporates gap time and spending frequency to boost the discriminatory power of the index.

---

# Supervised Algorithms and Models

The fraud index will be used as a feature variable. Models and algorithms need to account for imbalanced labels.

- Firth penalized logit models.
- King and Zeng's rare event logistic model.
- Qing's semi-parametric logistic model.
- Penalized tree-based algorithms (including bagging; RF is not an option for this particular case).
- Regular logit models based on over-/under-sampled data.
- Asymmetric-link GLMs.

---
class: inverse, center, middle

# Deployment / Monitoring and Updating

---
class: inverse, center, middle

# Thanks!

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).
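
---

# Appendix: Sketch of the Fraud Index Computation

The PCI-like fraud index defined earlier, `idx = (USL - mu)^2 / (9 (max - mu)^2 + (T - mu)^2)`, can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind these slides. The function name `fraud_index`, the use of the 99th percentile of the larger sample as USL, the use of the larger sample's mean as the target T, and the toy dollar amounts are all assumptions made here, since the slides only say that different estimation methods exist.

```python
from statistics import mean, quantiles


def fraud_index(history, recent, usl_pct=99):
    """PCI-like index: (USL - mu)^2 / (9 * (max - mu)^2 + (T - mu)^2).

    history: the larger sample of payment amounts (used for USL and T).
    recent:  the smaller, most recent sample (used for max and mu).
    """
    # Assumption: USL is a high empirical percentile of the larger sample.
    cuts = quantiles(history, n=100, method="inclusive")  # 99 cut points
    usl = cuts[usl_pct - 1]
    # Assumption: the target T is the mean of the larger sample.
    target = mean(history)
    mu = mean(recent)
    peak = max(recent)
    denom = 9 * (peak - mu) ** 2 + (target - mu) ** 2
    if denom == 0:
        # Recent window has no spread and sits exactly on the target.
        return float("inf")
    return (usl - mu) ** 2 / denom


# Hypothetical payment amounts in dollars, for illustration only.
history = [20, 25, 30, 22, 28, 26, 24, 27, 23, 29] * 5
idx_genuine = fraud_index(history, [24, 27, 23])
idx_fraud = fraud_index(history, [24, 27, 950])
print(f"genuine window: {idx_genuine:.3f}, suspicious window: {idx_fraud:.3f}")
```

With these toy inputs, the window containing the unusually large charge yields a much smaller index than the window of ordinary charges, in line with the PCI reading that an index under a threshold signals an "incapable" process. The two window lengths play the role of the tuning parameters mentioned on the fraud index slide.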