Research & technical details

How InstaGuard actually works

A deep dive into the dataset, feature engineering, XGBoost model training, and performance metrics behind the fake account detection system.

Dataset

Training data

The model was trained on a labeled Instagram account dataset of 5,000 profiles with extracted behavioral and structural metadata, split using stratified sampling to preserve class balance.

Dataset overview

5,000 total accounts
Real accounts · 50%
Fake accounts · 50%

Stratified split (80 / 10 / 10)

Training set · 4,000 (80%)
Validation set · 500 (10%)
Test set · 500 (10%)
Feature engineering

The 11 features

Each feature was selected based on its correlation with inauthentic account behavior. Four of the eleven (profile picture, private status, external URL, and name-username match) are binary flags; the remaining seven are counts and ratios.

| Feature | Type | Description | Why it matters | Importance |
| --- | --- | --- | --- | --- |
| profile pic | Binary | Whether the account has a profile photo | No profile picture is one of the single strongest fake signals; 80% of fake accounts have none | High |
| nums/length username | Float 0–1 | Proportion of digits in the username | Bot usernames like "user39471856" have very high digit ratios; real names rarely do | High |
| #followers | Integer | Number of followers | Fake accounts typically have very few followers; real accounts above 50k are almost always authentic | High |
| #follows | Integer | Number of accounts followed | Mass-following thousands while having few followers is a classic bot pattern | High |
| #posts | Integer | Total number of posts | Zero posts is a very strong fake signal; real active accounts post regularly | Medium |
| description length | Integer | Character count of the bio | Real accounts tend to have meaningful bios; fake accounts usually have empty or very short ones | Medium |
| fullname words | Integer | Number of words in the display name | Real people typically have 2-word names; bots often have 0- or 1-word display names | Medium |
| nums/length fullname | Float 0–1 | Proportion of digits in the display name | Display names with many digits (e.g. "John99887") are highly suspicious | Medium |
| external URL | Binary | Whether a website link exists in the bio | Real influencers and businesses usually have URLs; fake accounts rarely do | Medium |
| private | Binary | Whether the account is set to private | Private accounts with very few followers are a mildly suspicious signal | Low |
| name==username | Binary | Whether the display name exactly matches the username | Bot accounts often use the same string for both the name and username fields | Low |
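To make the ratio and match features concrete, here is a minimal sketch of how they could be derived from raw profile strings. The function name and dictionary keys mirror the table above; this is illustrative, not the project's actual extraction code:

```python
def username_features(username: str, fullname: str) -> dict:
    """Derive the username/display-name features from raw strings.
    Helper name and keys are illustrative, mirroring the feature table."""
    digits = sum(c.isdigit() for c in username)
    name_digits = sum(c.isdigit() for c in fullname)
    return {
        # proportion of digits in the username, e.g. "user39471856" -> 8/12
        "nums/length username": digits / len(username) if username else 0.0,
        # proportion of digits in the display name
        "nums/length fullname": name_digits / len(fullname) if fullname else 0.0,
        # word count of the display name; bots often have 0 or 1
        "fullname words": len(fullname.split()),
        # exact (case-insensitive) match between display name and username
        "name==username": int(fullname.replace(" ", "").lower() == username.lower()),
    }
```

For the bot-style example in the table, `username_features("user39471856", "John99887")` yields a username digit ratio of 8/12 and a one-word display name, both suspicious signals.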
Pipeline

Training & deployment pipeline

The end-to-end process from raw data to a deployed ML prediction system.

1. Data collection & labeling

Instagram account data was collected and labeled as real (0) or fake (1). The dataset contains 5,000 profiles with 11 extracted metadata features including profile picture presence, follower/following counts, post counts, username structure, and bio information.

5,000 labeled samples · Balanced 50/50

2. Preprocessing & column normalization

Column names were normalized (stripped, lowercased, special characters replaced). Binary features were validated as 0/1. Numeric columns were coerced and missing values filled. Features were aligned to the model's training schema using reindex.

Column normalization · Type coercion · Reindex
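A sketch of this preprocessing step, assuming the column names listed in the feature table (lowercased) as the training schema; the exact schema, helper name, and special-character handling in the real pipeline may differ:

```python
import pandas as pd

# Assumed training schema, taken from the feature table (lowercased)
MODEL_COLUMNS = [
    "profile pic", "nums/length username", "fullname words",
    "nums/length fullname", "name==username", "description length",
    "external url", "private", "#posts", "#followers", "#follows",
]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize column names: strip whitespace, lowercase
    df = df.rename(columns=lambda c: c.strip().lower())
    # Coerce everything numeric; unparseable values become NaN, then 0
    df = df.apply(pd.to_numeric, errors="coerce").fillna(0)
    # Validate binary features as 0/1
    for col in ("profile pic", "external url", "private", "name==username"):
        if col in df:
            df[col] = (df[col] != 0).astype(int)
    # Align to the model's training schema; missing columns are filled with 0
    return df.reindex(columns=MODEL_COLUMNS, fill_value=0)
```

The `reindex` call is what guarantees the model always sees the same columns in the same order, regardless of how the incoming data was formatted.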

3. Stratified train / validation / test split

The dataset was split 80/10/10 using stratified sampling to preserve class balance across all three sets. This prevents the model from seeing test data during any phase of training or validation, giving an unbiased performance estimate.

80/10/10 stratified split · train_test_split(stratify=y)
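An 80/10/10 stratified split takes two calls to `train_test_split`, since the function only splits two ways at a time. A sketch (the helper name and seed are illustrative):

```python
from sklearn.model_selection import train_test_split

def stratified_80_10_10(X, y, seed=42):
    """Split into 80% train / 10% validation / 10% test, preserving class balance."""
    # First carve off 20% as a temporary holdout, stratified on the labels
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed)
    # Split the holdout in half: 10% validation, 10% test
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```

On the balanced 5,000-account dataset this yields the 4,000 / 500 / 500 sets described above, each still 50/50 real and fake.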

4. XGBoost model training with GridSearchCV

An XGBoost classifier was tuned using 5-fold cross-validated Grid Search across n_estimators [100, 200], max_depth [3, 5, 7], learning_rate [0.01, 0.1, 0.2], subsample [0.8, 1.0], and colsample_bytree [0.8, 1.0]. Scoring metric was ROC-AUC. The best estimator was selected and saved.

XGBoost · GridSearchCV · ROC-AUC · 5-fold CV

5. Evaluation & validation

The model was evaluated on the held-out test set of 500 accounts. Accuracy, precision, recall, F1 score, and ROC-AUC were measured. The confusion matrix confirmed near-perfect classification with very few false positives and false negatives.

500-sample test set · ROC-AUC: 0.9995
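The evaluation step amounts to a handful of scikit-learn metric calls; a sketch (the helper name and returned keys are illustrative):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def evaluate(model, X_test, y_test) -> dict:
    """Score a fitted classifier on the held-out test set."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # P(fake), since fake = 1
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),  # needs probabilities, not labels
        "confusion_matrix": confusion_matrix(y_test, y_pred).tolist(),
    }
```

Note that ROC-AUC is computed from the predicted probabilities rather than the hard labels, which is what lets it measure class separation independently of the 0.5 decision threshold.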

6. Deployment via Flask API

The trained model was serialized with joblib and served through a Flask REST API. The /predict endpoint accepts account features and returns the ML verdict, confidence level, probability scores, and feature influence data for explainability.

Flask · joblib · REST API · ML inference
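A minimal sketch of such an endpoint, assuming a fitted classifier and the feature names from the table above; the app-factory shape, feature order, and response fields are illustrative, not the project's exact code:

```python
import numpy as np
from flask import Flask, jsonify, request

# Assumed feature order, taken from the feature table above
FEATURE_ORDER = [
    "profile pic", "nums/length username", "fullname words",
    "nums/length fullname", "name==username", "description length",
    "external URL", "private", "#posts", "#followers", "#follows",
]

def create_app(model):
    """Wrap a fitted classifier (e.g. loaded via joblib) in a small REST API."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()
        # Build a single-row feature matrix; missing features default to 0
        row = np.array([[payload.get(f, 0) for f in FEATURE_ORDER]], dtype=float)
        proba = model.predict_proba(row)[0]
        return jsonify({
            "verdict": "fake" if proba[1] >= 0.5 else "real",
            "confidence": round(float(proba.max()), 4),
            "probabilities": {"real": float(proba[0]), "fake": float(proba[1])},
        })

    return app

# Typical use:
# import joblib
# app = create_app(joblib.load("model.joblib"))  # path is illustrative
# app.run()
```

Passing the model into an app factory rather than loading it at import time keeps the endpoint easy to test with a stub classifier.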
Results

Model performance

Evaluation metrics on the held-out test set of 500 accounts. The ROC-AUC score of 0.9995 indicates near-perfect class separation.

99.2%
Accuracy
99.6%
Precision
98.8%
Recall
99.2%
F1 score
Confusion matrix (test set · 500 accounts)
| | Predicted real | Predicted fake |
| --- | --- | --- |
| Actual real | 248 (true negative) | 2 (false positive) |
| Actual fake | 2 (false negative) | 248 (true positive) |

Reading the matrix

TP
True positive — 248. Fake accounts correctly flagged as fake (fake is the positive class, labeled 1). The model successfully identifies inauthentic profiles.
TN
True negative — 248. Real accounts correctly identified as real. The model accurately recognizes authentic profiles.
FP
False positive — 2. Real accounts incorrectly flagged as fake. Only 2 of 500 predictions, reflecting high precision.
FN
False negative — 2. Fake accounts the model missed. So few missed cases reflect strong recall.

See it in action

Try the live dashboard with demo presets or your own data.