Final Year Project SETU Carlow 2026

Detecting Phishing
in Images and Email

A hybrid phishing detection system that uses OCR, computer vision, and threat intelligence to analyse phishing content embedded in images and emails: content that traditional text-based filters cannot read.

Lorcan Kelly Zazera, C00288941, BSc (Hons) Cybercrime and IT Security, South East Technological University, Carlow
29 detection indicators across 8 attack categories
81% precision on email detection
100% precision on image detection
0% false positive rate on images

What is ImageAware+?

Phishing attacks increasingly embed malicious content inside image files to bypass traditional email security filters. A fake Geek Squad invoice, a PayPal billing alert, or a DocuSign impersonation email rendered as a graphic is completely invisible to text-based scanners but convincing to any human who reads it.

ImageAware+ was built to close this gap. It combines multi-pass OCR with OpenCV image preprocessing, QR code detection, HTML href URL extraction, email header analysis, and a 29-indicator rule-based scoring engine to produce an explainable forensic risk assessment for any submitted image or email file.

Every point in the final score is traced back to a named indicator with supporting evidence, making the system suitable for forensic documentation, not just binary classification. The system is deployed as a live educational platform covering phishing awareness, attack types, and real-time sample analysis.
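To illustrate what an explainable, rule-based score can look like, here is a minimal sketch in Python. It is not the ImageAware+ engine: the three rules, their names, categories, point values, and keywords are all hypothetical stand-ins for the project's 29 indicators. The only ideas taken from the description above are that each fired indicator contributes points and carries its own evidence, and that the total is bounded at 100.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    indicator: str  # name of the rule that fired
    category: str   # attack category the rule belongs to
    points: int     # contribution to the final score
    evidence: str   # the text that triggered the rule


# Hypothetical rules: (indicator name, category, points, keyword to look for).
# The real indicator set and weights are not given in this document.
RULES = [
    ("urgency_language", "social_engineering", 15, "immediately"),
    ("payment_request", "financial", 20, "invoice"),
    ("brand_mention", "impersonation", 10, "paypal"),
]


def score_text(text: str) -> tuple[int, list[Finding]]:
    """Return a 0-100 score plus the findings that explain every point."""
    lowered = text.lower()
    findings = [
        Finding(name, category, points, keyword)
        for name, category, points, keyword in RULES
        if keyword in lowered
    ]
    # Cap the sum so the score stays within the 0-100 range.
    total = min(100, sum(f.points for f in findings))
    return total, findings


score, findings = score_text("Your PayPal invoice is overdue. Pay immediately.")
for f in findings:
    print(f"{f.indicator} [{f.category}] +{f.points}: matched '{f.evidence}'")
print("score:", score)
```

Because the score is just the capped sum of the listed findings, a report can show exactly which indicator contributed which points, which is the property that makes this style of scoring usable as forensic documentation.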

How It Works

01

Upload

Submit a phishing image (PNG, JPG) or email file (.eml) through the web interface.

02

Extract

Multi-pass OCR extracts text, HTML hrefs recover hidden URLs, and QR codes are decoded.

03

Enrich

Extracted URLs are checked against VirusTotal, URLScan.io, and PhishTank APIs.

04

Score

29 indicators across 8 attack categories produce an explainable risk score from 0 to 100.
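The href recovery mentioned in the Extract step can be sketched with Python's standard-library HTML parser. This is an illustrative assumption about how such a step might work, not the project's actual code: the class name `HrefCollector`, the helper `extract_links`, and the sample HTML are all hypothetical. The point it shows is that the link target lives in the `href` attribute, which can differ from the text a reader sees.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect (href, visible text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor currently being read
        self._text = []    # visible text fragments inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str):
    """Return every (href, visible text) pair found in the HTML body."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical email fragment: the visible text names one site,
# while the href points somewhere else.
sample = '<p>Review at <a href="http://example.test/login">www.paypal.com</a></p>'
for href, text in extract_links(sample):
    print(f"shown: {text!r} -> actual target: {href!r}")
```

A mismatch between the displayed text and the real target is the kind of evidence a URL-based indicator could record, and the recovered `href` is what would be passed on to the Enrich step for reputation lookups.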

Detection Performance

Formally evaluated on 300 labelled samples: 150 phishing emails from the Nazario 2025 corpus and 150 legitimate emails from the TREC 2007 ham corpus. The image pipeline was evaluated on 22 labelled samples.

Email Pipeline, Medium Threshold (35)

Precision: 80.95%
Recall: 11.33%
F1 Score: 0.199
False Positive Rate: 2.67%
True / False Positives: 17 TP / 4 FP

Image Pipeline, Medium Threshold (35)

Precision: 100%
Recall: 58.33%
F1 Score: 0.737
False Positive Rate: 0.00%
True / False Positives: 7 TP / 0 FP
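The reported figures can be reproduced from confusion-matrix counts using the standard definitions of precision, recall, F1, and false positive rate. The sketch below is a check on the arithmetic, not the project's evaluation script. The email counts follow directly from the stated 150 phishing and 150 legitimate samples. The image split is an inference: a recall of 58.33% with 7 true positives implies 12 phishing images, which would leave 10 legitimate images out of 22.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Email: 150 phishing (17 caught, 133 missed), 150 legitimate (4 flagged).
email = metrics(tp=17, fp=4, fn=133, tn=146)

# Image: assumed split of 12 phishing and 10 legitimate, inferred from recall.
image = metrics(tp=7, fp=0, fn=5, tn=10)

for name, m in (("email", email), ("image", image)):
    print(
        f"{name}: precision={m['precision']:.2%} recall={m['recall']:.2%} "
        f"f1={m['f1']:.3f} fpr={m['fpr']:.2%}"
    )
```

At this threshold both pipelines trade recall for precision: few legitimate samples are flagged, but a large share of phishing samples score below 35 and are not flagged.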

Technology Stack

Built with Python and Flask, deployed as a Docker container on Render.com with automatic GitHub CI/CD deployment.

Python 3.11, Flask, Tesseract OCR, OpenCV, SQLite, ReportLab, VirusTotal API, URLScan.io API, Chart.js, Docker, gunicorn, HTML / CSS / JavaScript

Documentation

All project documentation produced throughout the year, plus the live platform and source code.