A deep dive into the dataset, feature engineering, XGBoost model training, and performance metrics behind the fake account detection system.
The model was trained on a labeled Instagram account dataset of 5,000 profiles with extracted behavioral and structural metadata, split using stratified sampling to preserve class balance.
Each feature was selected based on its correlation with inauthentic account behavior. The model uses 11 features: 4 binary flags (profile picture, private status, external URL, and name-username match) plus 7 counts and digit ratios. The table below rates each one by importance.
| Feature | Type | Description | Why it matters | Importance |
|---|---|---|---|---|
| profile pic | Binary | Whether the account has a profile photo | A missing profile picture is one of the strongest fake signals: 80% of fake accounts have none | High |
| nums/length username | Float 0–1 | Proportion of digits in the username | Bot usernames like "user39471856" have very high digit ratios; real names rarely do | High |
| #followers | Integer | Number of followers | Fake accounts typically have very few followers; accounts with more than 50k followers are almost always authentic | High |
| #follows | Integer | Number of accounts followed | Mass-following thousands while having few followers is a classic bot pattern | High |
| #posts | Integer | Total number of posts | Zero posts is a very strong fake signal; real active accounts post regularly | Medium |
| description length | Integer | Character count of the bio | Real accounts tend to have meaningful bios; fake accounts usually have empty or very short ones | Medium |
| fullname words | Integer | Number of words in the display name | Real people typically have 2-word names; bots often have 0 or 1 word display names | Medium |
| nums/length fullname | Float 0–1 | Proportion of digits in the display name | Display names with many digits (e.g. "John99887") are highly suspicious | Medium |
| external URL | Binary | Whether a website link exists in bio | Real influencers and businesses usually have URLs; fake accounts rarely do | Medium |
| private | Binary | Whether the account is set to private | Private accounts with very few followers are a mild suspicious signal | Low |
| name==username | Binary | Whether display name exactly matches username | Bot accounts often use the same string for both name and username fields | Low |
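
The string-derived features in the table can be computed directly from the raw profile fields. The following is a minimal sketch, assuming the inputs are the username, display name, and bio as plain strings; the function names `digit_ratio` and `string_features` are hypothetical and not taken from the project.

```python
def digit_ratio(text: str) -> float:
    """Proportion of characters in `text` that are digits (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isdigit() for ch in text) / len(text)


def string_features(username: str, fullname: str, bio: str) -> dict:
    """Derive the text-based features listed in the table from raw fields."""
    return {
        "nums/length username": digit_ratio(username),
        "fullname words": len(fullname.split()),
        "nums/length fullname": digit_ratio(fullname),
        "name==username": int(fullname == username),
        "description length": len(bio),
    }


# The bot-like examples quoted in the table: 8 of 12 username characters
# and 5 of 9 display-name characters are digits.
features = string_features("user39471856", "John99887", "")
print(features)
```

The count features (#followers, #follows, #posts) and the remaining binary flags come straight from the profile metadata and need no derivation.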
The end-to-end process, from raw data to an ML prediction system.
Instagram account data was collected and labeled as real (0) or fake (1). The dataset contains 5,000 profiles with 11 extracted metadata features including profile picture presence, follower/following counts, post counts, username structure, and bio information.
5,000 labeled samples · Balanced 50/50
Column names were normalized (stripped, lowercased, special characters replaced). Binary features were validated as 0/1. Numeric columns were coerced and missing values filled. Features were aligned to the model's training schema using reindex.
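
A rough sketch of these preprocessing steps with pandas follows. Several details are assumptions because the text does not specify them: special characters are replaced with underscores, missing values are filled with 0, and the column names in `BINARY_COLUMNS` are hypothetical normalized names.

```python
import re

import pandas as pd

# Hypothetical normalized names of the four binary features.
BINARY_COLUMNS = ["profile_pic", "external_url", "private", "name_username"]


def normalize_column(name: str) -> str:
    """Strip, lowercase, and replace runs of special characters with '_'."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.strip().lower())
    return cleaned.strip("_")


def preprocess(df: pd.DataFrame, training_columns: list) -> pd.DataFrame:
    out = df.rename(columns=normalize_column)
    # Coerce everything to numeric; unparseable values become NaN, then 0.
    out = out.apply(pd.to_numeric, errors="coerce").fillna(0)
    # Binary features must contain only 0 or 1.
    bad = [c for c in BINARY_COLUMNS if c in out and not out[c].isin([0, 1]).all()]
    if bad:
        raise ValueError(f"non-binary values in columns: {bad}")
    # Align to the training schema: drop extras, add missing columns as 0.
    return out.reindex(columns=training_columns, fill_value=0)
```

Reindexing against the training schema keeps the column order identical at training and inference time, which matters because tree models address features by position when given a plain array.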
Column normalization · Type coercion · Reindex
The dataset was split 80/10/10 (4,000/500/500 accounts) using stratified sampling to preserve class balance across all three sets. The test set is held out from every phase of training and validation, which gives an unbiased performance estimate.
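
One common way to get a stratified 80/10/10 split with scikit-learn is to call `train_test_split` twice. This is a sketch of that approach, not necessarily the exact code used here; the function name `split_80_10_10` and the seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_80_10_10(X, y, seed=42):
    """Stratified train/validation/test split in 80/10/10 proportions."""
    # First cut: 80% train, 20% held back.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # Second cut: halve the held-back 20% into validation and test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Placeholder data with the dataset's shape: 5,000 rows, 11 features, 50/50 labels.
X = np.zeros((5000, 11))
y = np.array([0, 1] * 2500)
train, val, test = split_80_10_10(X, y)
print(len(train[1]), len(val[1]), len(test[1]))
```

Passing `stratify` on both calls keeps the 50/50 class ratio in each of the three sets.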
80/10/10 stratified split · train_test_split(stratify=y)
An XGBoost classifier was tuned using 5-fold cross-validated Grid Search across n_estimators [100, 200], max_depth [3, 5, 7], learning_rate [0.01, 0.1, 0.2], subsample [0.8, 1.0], and colsample_bytree [0.8, 1.0]. The scoring metric was ROC-AUC. The best estimator was selected and saved.
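
The grid described above can be written down as follows. The parameter values, fold count, and scoring metric come from the text; the helper `build_search`, the `eval_metric` setting, and the random seed are assumptions made for this sketch.

```python
from itertools import product

PARAM_GRID = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}
CV_FOLDS = 5

# Size of the search: 2 * 3 * 3 * 2 * 2 = 72 candidates, each fitted 5 times.
n_candidates = len(list(product(*PARAM_GRID.values())))
n_fits = n_candidates * CV_FOLDS
print(n_candidates, n_fits)  # 72 360


def build_search(random_state=42):
    """Return an unfitted GridSearchCV over PARAM_GRID (hypothetical helper)."""
    # Imported lazily so the grid arithmetic above runs without these packages.
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    estimator = XGBClassifier(eval_metric="logloss", random_state=random_state)
    return GridSearchCV(
        estimator, PARAM_GRID, scoring="roc_auc", cv=CV_FOLDS, n_jobs=-1
    )
```

After `search = build_search()` and `search.fit(X_train, y_train)`, the winning model is available as `search.best_estimator_`, because `GridSearchCV` refits the best configuration on the full training data by default.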
XGBoost · GridSearchCV · ROC-AUC · 5-fold CV
The model was evaluated on the held-out test set of 500 accounts. Accuracy, precision, recall, F1 score, and ROC-AUC were measured. The confusion matrix confirmed near-perfect classification, with only 2 false positives and 2 false negatives.
500-sample test set · ROC-AUC: 0.9995
The trained model was serialized with joblib and served through a Flask REST API. The /predict endpoint accepts account features and returns the ML verdict, confidence level, probability scores, and feature influence data for explainability.
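
A minimal sketch of such an endpoint is shown below. The JSON field names, the 0.5 decision threshold, the feature order, and the `create_app` factory are all assumptions; the real API's request and response schema is not given here, and the feature influence output is omitted.

```python
import numpy as np
from flask import Flask, jsonify, request

# Assumed feature order; it must match the order used at training time.
FEATURES = [
    "profile pic", "nums/length username", "fullname words",
    "nums/length fullname", "name==username", "description length",
    "external URL", "private", "#posts", "#followers", "#follows",
]


def create_app(model):
    """Build a Flask app around any model exposing predict_proba."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        missing = [name for name in FEATURES if name not in payload]
        if missing:
            return jsonify({"error": "missing features", "missing": missing}), 400
        try:
            row = np.asarray([[float(payload[name]) for name in FEATURES]])
        except (TypeError, ValueError):
            return jsonify({"error": "features must be numeric"}), 400
        # Column 1 is the probability of class 1 (fake).
        p_fake = float(model.predict_proba(row)[0][1])
        return jsonify({
            "verdict": "fake" if p_fake >= 0.5 else "real",
            "confidence": max(p_fake, 1.0 - p_fake),
            "probabilities": {"real": 1.0 - p_fake, "fake": p_fake},
        })

    return app
```

In a deployment, the model would be loaded once at startup, for example `create_app(joblib.load("model.joblib"))`, where the file name is a placeholder. Taking the model as an argument also makes the endpoint easy to exercise with Flask's test client and a stub model.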
Flask · joblib · REST API · ML inference
Evaluation metrics on the held-out test set of 500 accounts. The ROC-AUC score of 0.9995 indicates near-perfect class separation.
| | Predicted real | Predicted fake |
|---|---|---|
| Actual real | 248 (true negative) | 2 (false positive) |
| Actual fake | 2 (false negative) | 248 (true positive) |
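
The headline metrics follow directly from the four counts in the confusion matrix. The short calculation below treats fake (label 1) as the positive class; ROC-AUC is not included because it depends on the predicted probabilities, not only on these counts.

```python
# Counts from the confusion matrix, with fake (1) as the positive class.
tp, fn = 248, 2   # actual fake: predicted fake / predicted real
tn, fp = 248, 2   # actual real: predicted real / predicted fake

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}")    # accuracy=0.992
print(f"precision={precision:.3f}")  # precision=0.992
print(f"recall={recall:.3f}")        # recall=0.992
print(f"f1={f1:.3f}")                # f1=0.992
```

Because the errors are symmetric (2 of each kind), all four metrics coincide at 496/500 = 0.992.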
Try the live dashboard with demo presets or your own data.