Using Few-Shot Learning for Creating Phishing Chat Dataset

Please use this identifier to cite or link to this item: http://kmutnb-ir.kmutnb.ac.th/jspui/handle/123456789/301

Title:	Using Few-Shot Learning for Creating Phishing Chat Dataset
Authors:	SARAN HANSAKUL (ศรัณย์ หงสกุล); NAWAPORN WISITPONGPHAN (นวพร วิสิฐพงศ์พันธ์), nawaporn.w@itd.kmutnb.ac.th, nawapornn@kmutnb.ac.th; King Mongkut's University of Technology North Bangkok
Keywords:	Chat Messages; Phishing; Online Platform; LLM; Few-Shot; Augmentation; Machine Learning
Issue Date:	8
Publisher:	King Mongkut's University of Technology North Bangkok
Abstract:	Online businesses that rely on messaging services face increasingly severe phishing threats delivered through chat, so they need to train employees to recognize malicious messages. High employee turnover disrupts these efforts, because staff often leave before they have been adequately trained. Although many phishing detection tools exist for emails, websites, SMS messages, and similar channels, they cannot be applied directly to one-to-one chat with customers, which follows a different format. There is also no public phishing chat dataset of sufficient quality for training detection models. However, advances in AI, particularly in large language models (LLMs), make it possible to augment a small amount of real-world data through few-shot learning. This study aims to create a phishing chat dataset from limited real-world data and to evaluate its quality for training phishing detection models. Using the GPT-3.5 Turbo model, we increased the number of chat messages tenfold. The generated dataset received an average realism score of 7.68 out of 10 when evaluated by the GPT-4o model. When the dataset was tested with various machine learning models, Logistic Regression with features transformed by MaxAbsScaler outperformed the others, achieving 99.87% accuracy, 100% precision (binary), and an F1 score of 0.99. These results suggest that the augmented dataset can be used effectively to train phishing detection models.
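
The few-shot augmentation step named in the abstract can be illustrated with a minimal sketch. It only shows how a handful of labeled seed chats might be assembled into a prompt for an LLM; it is not the authors' actual prompt or pipeline. The `SeedChat` structure, the `build_few_shot_prompt` function, the prompt wording, and the seed messages are all assumptions made for illustration, and the default `augmentation_factor` of 10 simply mirrors the tenfold increase reported above. No model API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedChat:
    """One labeled seed message (hypothetical structure, not from the thesis)."""
    label: str  # e.g. "phishing" or "legitimate"
    text: str


def build_few_shot_prompt(seeds, target_label, augmentation_factor=10):
    """Assemble a few-shot prompt asking for new messages of one label.

    The number of requested messages is augmentation_factor times the
    number of seed examples that carry target_label.
    """
    if augmentation_factor < 1:
        raise ValueError("augmentation_factor must be at least 1")
    shots = [s for s in seeds if s.label == target_label]
    if not shots:
        raise ValueError(f"no seed examples with label {target_label!r}")

    n_new = augmentation_factor * len(shots)
    lines = [f"Below are {len(shots)} example {target_label} chat messages.", ""]
    for i, shot in enumerate(shots, start=1):
        lines.append(f"Example {i}: {shot.text}")
    lines.append("")
    lines.append(
        f"Write {n_new} new {target_label} chat messages in the same style, "
        "one per line."
    )
    return "\n".join(lines)


# Placeholder seeds invented for this sketch; they are not real data.
SEEDS = [
    SeedChat("phishing", "Your shop account is locked. Verify at the link now."),
    SeedChat("phishing", "We are the platform team. Send your login code."),
    SeedChat("legitimate", "Hi, is this item still in stock?"),
]

PROMPT = build_few_shot_prompt(SEEDS, "phishing")
print(PROMPT)
```

The resulting string would then be sent to a chat model of the reader's choice. Generating each label separately, as in this sketch, keeps the label of every generated message known in advance, which is one simple way to obtain labeled training data.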
URI:	http://kmutnb-ir.kmutnb.ac.th/jspui/handle/123456789/301
Appears in Collections:	FACULTY OF INFORMATION TECHNOLOGY AND DIGITAL INNOVATION

Files in This Item:

File	Description	Size	Format
s6607031857086.pdf		8.93 MB	Adobe PDF
