Development of Spatio-temporal Deep Learning Model for Visual speech recognition

Please use this identifier to cite or link to this item: http://kmutnb-ir.kmutnb.ac.th/jspui/handle/123456789/187

Title:	Development of Spatio-temporal Deep Learning Model for Visual speech recognition
Authors:	TEERAPONG NGAMPROMWONG (ธีรพงศ์ งามพร้อมวงษ์); LUEPOL PIPANMEKAPORN (ลือพล พิพานเมฆาภรณ์); King Mongkut's University of Technology North Bangkok; luepol.p@sci.kmutnb.ac.th, luepolp@kmutnb.ac.th
Keywords:	Visual Speech Recognition; Lipreading; Deep Learning; Spatial Attention
Issue Date:	9
Publisher:	King Mongkut's University of Technology North Bangkok
Abstract:	Lipreading research focuses on developing efficient models that decode spoken words from the motion of the lips. Despite continuous advances, two significant challenges remain: (1) model complexity and (2) lip-movement homophones. Complex models are often designed to increase lipreading accuracy, but they demand substantial computational resources, which makes them unsuitable for portable devices or resource-constrained systems. In addition, many words are pronounced similarly and therefore produce lip movements that closely resemble one another; for instance, the words "million" and "millions" must be told apart from visual data alone. To address these issues, this research develops a streamlined yet highly accurate model capable of distinguishing between lip movements that closely resemble one another. The model uses a hybrid deep learning approach that integrates Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and attention mechanisms, with the aim of making lipreading more applicable and efficient in real-world scenarios.
URI:	http://kmutnb-ir.kmutnb.ac.th/jspui/handle/123456789/187
Appears in Collections:	FACULTY OF APPLIED SCIENCE

Files in This Item:

File	Description	Size	Format
s6504062856039.pdf		3.62 MB	Adobe PDF
