🧠 Boston Neuromind
Provisional Patent Application · 3/3

Temporal Congruence Detection System

A computer-implemented method for automatically detecting latent clinical states that do not surface in conscious reporting (including masked depression, suppressed anger, and incongruent self-report) by performing multi-modal analysis of temporal alignment patterns among a user's facial micro-expressions, vocal prosody, and textual semantics.
Applicant: Boston Neuromind, LLC
Inventor: [Inventor Name] (BCN, PhD)
Status: USPTO Provisional Pending
Classification: G06V 40/16 / G10L 25/63 / G16H 50/20
Table of Contents
  1. Abstract
  2. Field of Invention
  3. Background
  4. Problem Statement
  5. Summary of Invention
  6. Detailed Description
  7. Drawings
  8. Claims
  9. Prior Art Comparison
  10. Industrial Applicability
  11. References

01 Abstract

📋 One-Paragraph Summary

The present invention relates to a computer-implemented method and system that extracts, as time series from a user's synchronized video, audio, and text inputs, (i) Facial Action Units (FAUs), (ii) prosodic features (F0, shimmer, jitter, pitch contour), and (iii) textual sentiment polarity. It then computes inter-modal temporal lag and incongruence scores via cross-correlation, coherence, and Dynamic Time Warping (DTW), converting temporal mismatch patterns across expression channels (for example, vocal tremor immediately following a smile, or an anger micro-expression preceding positive text by 0.4 seconds) into clinical signals. These signals are Z-scored against a clinical normative population to classify and alert on eight latent states, including masked depression, suppressed anger, and dissociative self-report.

02 Field of Invention

The present invention pertains to the fields of Multi-Modal Affective Computing, Clinical Diagnostic Aids, and Artificial Intelligence Conversational Systems. More specifically, it concerns a system that detects latent clinical states by analyzing temporal congruence among facial, vocal, and textual channels.

Related Technical Fields

03 Background

3.1 Limits of Self-Report and Latent States

One of the oldest problems in clinical psychology is the gap between "what the patient reports" and "the patient's actual state." The following is a non-exhaustive list of self-report limitations consistently documented in the clinical literature:

3.2 Ekman's Micro-Expression Research

A series of studies by Dr. Paul Ekman (1969-2003) established that micro-expressions, which last 0.04 to 0.2 seconds and are difficult to suppress consciously, reveal genuine emotion. The Facial Action Coding System (FACS) decomposes facial expressions into 44 Action Units (AUs), and specific AU combinations can distinguish genuine from feigned affect.

| AU | Muscle Movement | Affective Signal |
|---|---|---|
| AU1 | inner brow raise | sadness, fear |
| AU4 | brow lowerer | anger, concentration |
| AU6 | cheek raiser (Duchenne) | genuine smile |
| AU12 | lip corner puller | smile |
| AU15 | lip corner depressor | sadness |
| AU17 | chin raiser | doubt, anger |
| AU23 | lip tightener | suppressed anger |
| AU45 | blink frequency | anxiety, cognitive load |

3.3 Limitations of Single-Modality Analysis

Existing affective computing systems (Affectiva, RealEyes, iMotions, and the like) suffer from the following limitations:

  1. Single-modality analysis: face only, voice only, or text only, with no multi-modal integration
  2. Instantaneous classification: per-frame affect classification with no temporal congruence or lag analysis
  3. Non-clinical populations: optimized for advertising and consumer-response analysis, with no normalization against clinical normative populations
  4. No latent-state inference: classifies only conscious expression and cannot infer latent states

04 Problem Statement

  1. Absence of temporal congruence analysis. No existing system analyzes temporal lag among facial, vocal, and textual channels to infer latent states.
  2. No simultaneous integration of three modalities. Existing systems combine at most two modalities; none performs simultaneous temporal congruence analysis across three.
  3. No clinical latent-state mapping. No standardized criterion exists for mapping temporal incongruence patterns to clinical latent states (masked depression, suppressed anger, and the like).
  4. No real-time processing. Existing multi-modal systems perform post-hoc analysis only and cannot detect latent states during live conversation.
  5. No self-report validation tool. No system objectively validates whether a patient's self-report aligns with non-verbal signals.

05 Summary of Invention

5.1 System Components

  1. 3-Channel Time-Series Extractor (TSE): computes FAU, prosody, and sentiment polarity time series from synchronized video, audio, and text
  2. Cross-Modal Aligner (CMA): computes the time lag between each pair of modalities via cross-correlation, coherence, and DTW
  3. Incongruence Scorer (ISC): computes a composite incongruence score across the three modalities (0-100)
  4. Latent State Classifier (LSC): classifies into eight latent state classes
  5. Clinical Normalizer (CNM): Z-scores the result against the clinical normative population
  6. Real-Time Alert Module (RTA): issues clinical alerts when a threshold is exceeded

5.2 Inventive Steps

06 Detailed Description

6.1 Input Data Synchronization

The three channels are combined into temporally synchronized input streams:

The three channels are aligned to a common timestamp axis and produce one synchronized sample every 100 ms.
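As a non-limiting illustration, the alignment onto a common 100 ms axis could be sketched as below. The sketch assumes each channel arrives as timestamped values at its own native rate; the function name `resample_to_grid` and the choice of linear interpolation are illustrative assumptions and are not specified in this application.

```python
import numpy as np

GRID_STEP_S = 0.1  # 100 ms synchronized sample period (from the text)

def resample_to_grid(timestamps, values, t_start, t_end, step=GRID_STEP_S):
    """Linearly interpolate one channel onto the common timestamp axis.

    Assumption: linear interpolation is acceptable for the feature at hand.
    """
    n = int(round((t_end - t_start) / step))
    grid = t_start + step * np.arange(n)
    return grid, np.interp(grid, timestamps, values)

# Example: a 30 fps facial feature resampled onto the 100 ms grid.
video_t = np.arange(150) / 30.0             # 5 s of frame timestamps
au12 = np.linspace(0.0, 5.0, video_t.size)  # synthetic AU12 intensity ramp
grid, au12_sync = resample_to_grid(video_t, au12, 0.0, 5.0)
```

Five seconds of input yield 50 synchronized samples, which matches the window size used in the real-time pipeline of Section 6.7.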

6.2 Per-Modality Time-Series Feature Extraction

(a) Facial Channel: Action Units

Using OpenFace or an equivalent library, the intensity (0-5) of seventeen core AUs is computed at each frame:

F(t) = [AU1, AU2, AU4, AU5, AU6, AU7, AU9, AU10, AU12, AU14, AU15, AU17, AU20, AU23, AU25, AU26, AU45] at time t (sampled @ 30 fps)

(b) Vocal Channel: Prosodic Features

Features are analyzed in 10 ms windows and downsampled to 100 ms:

V(t) = [F0_mean, F0_variance, F0_slope, Jitter, Shimmer, Energy, Pitch_contour, Speech_rate, Pause_ratio]
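A minimal sketch of the 10 ms to 100 ms downsampling step follows. It assumes simple block averaging of ten consecutive frames; this application does not state the pooling rule, and averaging may not suit every entry of V(t) (for example, variance or slope features). The name `downsample_mean` is hypothetical.

```python
import numpy as np

def downsample_mean(frames_10ms, factor=10):
    """Average consecutive 10 ms frames into 100 ms samples (factor=10).

    Accepts a 1-D series or a 2-D array of shape (frames, features).
    Block averaging is an assumption, not the application's stated method.
    """
    frames = np.asarray(frames_10ms, dtype=float)
    usable = (frames.shape[0] // factor) * factor  # drop a trailing partial block
    return frames[:usable].reshape(-1, factor, *frames.shape[1:]).mean(axis=1)

# 50 synthetic F0 frames (Hz) covering 0.5 s become 5 samples at 100 ms.
f0_10ms = np.tile(np.arange(10.0), 5) + 100.0
f0_100ms = downsample_mean(f0_10ms)
```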

(c) Textual Channel: Sentiment plus Semantics

Transformer-based (BERT/RoBERTa) word-level analysis, mapped to word timestamps:

T(t) = [Sentiment_polarity (-1..+1), Sentiment_intensity (0..1), Emotion_anger, Emotion_sadness, Emotion_joy, Emotion_fear, Word_negation, Self_reference]
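To illustrate how word-level scores might be placed on the same 100 ms axis as the other channels, the sketch below assumes each word comes with start and end times (for example from an ASR aligner) and a polarity score from a sentiment model. Both upstream components, the tuple layout, and the name `words_to_grid` are hypothetical; silence between words is filled with 0.0 as a further assumption.

```python
import numpy as np

def words_to_grid(words, duration_s, step=0.1, fill=0.0):
    """Place word-level scores on the 100 ms grid.

    `words` is a list of (start_s, end_s, polarity) tuples (hypothetical layout).
    Grid samples that fall inside no word keep the `fill` value.
    """
    n = int(round(duration_s / step))
    grid = step * np.arange(n)
    series = np.full(n, fill, dtype=float)
    for start, end, polarity in words:
        active = (grid >= start) & (grid < end)
        series[active] = polarity
    return grid, series

# "I'm", "fine", "really" with synthetic timings and polarities.
words = [(0.0, 0.3, 0.2), (0.3, 0.7, 0.7), (1.0, 1.4, 0.6)]
grid, polarity = words_to_grid(words, duration_s=2.0)
```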

6.3 Cross-Modal Time Lag Computation

For each of the three channel pairs (F-V, F-T, V-T), the cross-correlation function is computed and the lag at which its absolute value peaks is extracted:

Equation 1: Cross-Correlation

\[
R_{xy}(\tau) = \frac{\sum_t x(t)\, y(t+\tau)}{\sqrt{\sum_t x(t)^2 \cdot \sum_t y(t)^2}}, \qquad
\mathrm{lag}_{xy} = \arg\max_{\tau} \lvert R_{xy}(\tau) \rvert, \quad \tau \in [-2.0\ \mathrm{s}, +2.0\ \mathrm{s}]
\]

In normally congruent states, inter-modal lag remains within ±200 ms. In latent states, the lag exceeds 200 ms in absolute value or reverses in sign (for example, positive text together with a negative facial expression that precedes it by 0.4 seconds, indicating suppression).
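As a non-limiting illustration, Equation 1 could be sketched as follows. The sketch assumes both series are already on the 100 ms grid and have equal length. The function name `cross_corr_lag` and the sign convention (a positive lag means the second series follows the first) are illustrative assumptions, and no mean removal is applied because Equation 1 does not specify any.

```python
import numpy as np

def cross_corr_lag(x, y, step_s=0.1, max_lag_s=2.0):
    """Return (lag_seconds, R at that lag, R at zero lag) per Equation 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.ndim != 1:
        raise ValueError("x and y must be 1-D series of equal length")
    norm = np.sqrt(np.sum(x**2) * np.sum(y**2))
    if norm == 0.0:
        return 0.0, 0.0, 0.0  # an all-zero channel carries no lag information
    n = x.size
    max_k = min(int(round(max_lag_s / step_s)), n - 1)
    lags = np.arange(-max_k, max_k + 1)
    r = np.empty(lags.size)
    for i, k in enumerate(lags):
        if k >= 0:
            r[i] = np.dot(x[: n - k], y[k:]) / norm   # sum_t x(t) * y(t + k)
        else:
            r[i] = np.dot(x[-k:], y[: n + k]) / norm
    best = int(np.argmax(np.abs(r)))
    return float(lags[best] * step_s), float(r[best]), float(r[max_k])

# Synthetic example: a facial burst followed 0.4 s later by a textual burst.
face = np.zeros(50); face[10:13] = 1.0
text = np.zeros(50); text[14:17] = 1.0
lag_s, r_peak, r_zero = cross_corr_lag(face, text)
```

In this synthetic case the recovered lag is 0.4 s, which exceeds the 200 ms congruence bound stated above.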

6.4 Incongruence Composite Score (ICS)

Equation 2: Incongruence Score

\[
\mathrm{ICS} = w_{FV}\,\lvert 1 - R_{FV}(0)\rvert\, \exp\!\left(\frac{\lvert \mathrm{lag}_{FV}\rvert}{\tau_0}\right)
+ w_{FT}\,\lvert 1 - R_{FT}(0)\rvert\, \exp\!\left(\frac{\lvert \mathrm{lag}_{FT}\rvert}{\tau_0}\right)
+ w_{VT}\,\lvert 1 - R_{VT}(0)\rvert\, \exp\!\left(\frac{\lvert \mathrm{lag}_{VT}\rvert}{\tau_0}\right)
\]

where:
  $R_{xy}(0)$ = correlation at zero lag (synchronicity)
  $\mathrm{lag}_{xy}$ = optimal lag from Equation 1
  $\tau_0$ = scale (200 ms, the normal congruence threshold)
  $w_{*}$ = empirically calibrated weights, summing to 1

ICS range: 0 (perfect congruence) to 100 (extreme incongruence)
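A minimal sketch of Equation 2 is given below. Note that the raw weighted sum is not bounded at 100 by itself (with a lag of 2 s and a 200 ms scale the exponential term alone reaches e^10), and this application does not state how the sum is mapped onto the 0-100 range. The `scale` factor and the clipping step are therefore placeholders introduced purely for illustration, as is the name `incongruence_score`.

```python
import math

TAU_0_S = 0.2  # 200 ms scale from Equation 2

def incongruence_score(pairs, weights, tau0=TAU_0_S, scale=10.0, cap=100.0):
    """Return (raw Equation 2 sum, clipped score).

    `pairs` maps a channel-pair name to (R at zero lag, lag in seconds).
    `scale` and `cap` are ASSUMED placeholders, not values from the application.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    raw = sum(
        weights[name] * abs(1.0 - r0) * math.exp(abs(lag) / tau0)
        for name, (r0, lag) in pairs.items()
    )
    return raw, min(cap, scale * raw)

# Illustrative inputs only; the weights are not calibrated values.
pairs = {"FV": (0.9, 0.0), "FT": (0.2, 0.4), "VT": (0.5, 0.2)}
weights = {"FV": 0.3, "FT": 0.4, "VT": 0.3}
raw, ics = incongruence_score(pairs, weights)
```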

6.5 Eight Latent State Classes

| # | Latent State | Detection Pattern | ICS |
|---|---|---|---|
| L1 | Masked Depression | positive text + facial AU1+AU15 + vocal F0 ↓ and Energy ↓ | ≥45 |
| L2 | Suppressed Anger | neutral text + AU4 / AU23 leading by ≥ 200 ms | ≥40 |
| L3 | Dissociative Self-Report | emotion words + monotone voice and flat facial affect | ≥50 |
| L4 | Alexithymia | strong facial/vocal affect with absent emotional vocabulary in text | ≥35 |
| L5 | Social Compliance | positive text + AU14 (deceptive smile) + vocal incongruence | ≥30 |
| L6 | Anxiety Concealment | neutral text + AU45 ↑ and vocal jitter ↑ | ≥30 |
| L7 | Intentional Deception | F-T lag > 500 ms with elevated vocal pauses | ≥55 |
| L8 | Suicide-Risk Signal | calm text + strong AU15+AU17 + dramatic vocal F0 drop | ≥60 |

⚠️ L8 Suicide-Risk Signal: Clinical Responsibility

When the L8 class is detected, the system automatically issues an immediate alert to the clinical supervisor and activates a patient-safety protocol. This represents a core clinical value: the objective capture of concealed suicide risk that self-report-based systems are likely to miss.
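The ICS floors in the class table of this section can be expressed as a simple lookup, sketched below. The sketch only gates classes on their minimum ICS; detection of the channel patterns themselves is assumed to happen upstream and is passed in as a set of class identifiers. The names `ICS_THRESHOLDS` and `candidate_states` are hypothetical, and returning several candidates rather than a single class is a simplification.

```python
# Minimum ICS per latent-state class, copied from the class table.
ICS_THRESHOLDS = {
    "L1": 45, "L2": 40, "L3": 50, "L4": 35,
    "L5": 30, "L6": 30, "L7": 55, "L8": 60,
}

def candidate_states(ics, matched_patterns):
    """Return classes whose pattern matched AND whose ICS floor is met.

    `matched_patterns` is a set of class ids whose detection pattern was
    found upstream (pattern detection is not shown in this sketch).
    """
    return sorted(
        state for state in matched_patterns
        if ics >= ICS_THRESHOLDS[state]
    )

candidates = candidate_states(52.0, {"L1", "L3", "L7"})
print(candidates)  # ['L1', 'L3']
```

With an ICS of 52, the L7 pattern is rejected because its floor is 55, while L1 and L3 remain candidates.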

6.6 Clinical Normative-Population Normalization

The ICS is Z-scored against the normative ICS distribution computed from the Boston Neuromind clinical dataset (N ≥ 500):

\[
Z_{\mathrm{ICS}} = \frac{\mathrm{ICS}_{\mathrm{observed}} - \mu_{\mathrm{normal}}}{\sigma_{\mathrm{normal}}}
\]

Thresholds:
  Z < +1.0: Normal congruence
  +1.0 ≤ Z < +2.0: Mild incongruence
  +2.0 ≤ Z < +3.0: Moderate incongruence (alert)
  Z ≥ +3.0: Severe incongruence (immediate alert + supervisor)
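The normalization and threshold bands can be sketched as below. The normative mean and standard deviation used in the example are made-up placeholders, since this application does not report the dataset statistics; the function names are likewise illustrative.

```python
def z_score(ics_observed, mu_normal, sigma_normal):
    """Z-score an observed ICS against the normative distribution."""
    if sigma_normal <= 0:
        raise ValueError("sigma_normal must be positive")
    return (ics_observed - mu_normal) / sigma_normal

def severity(z):
    """Map a Z-score onto the four bands listed in Section 6.6."""
    if z < 1.0:
        return "normal"
    if z < 2.0:
        return "mild"
    if z < 3.0:
        return "moderate"  # alert
    return "severe"        # immediate alert + supervisor

# Placeholder normative statistics (NOT values from the application).
MU_NORMAL, SIGMA_NORMAL = 20.0, 10.0
band = severity(z_score(50.0, MU_NORMAL, SIGMA_NORMAL))
```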

6.7 Real-Time Processing Pipeline

  1. Receive streaming input (video / audio / ASR text)
  2. Synchronously sample the three channels every 100 ms
  3. Extract features within a 5-second sliding window (50 samples)
  4. Update the ICS every 1 second and reclassify latent states
  5. Issue a clinical alert when a threshold is exceeded (within 5 seconds)
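The windowing in steps 3 and 4 above can be sketched with a fixed-size buffer, as below. The sketch assumes samples arrive one per 100 ms and that the first update is produced as soon as the 5-second buffer fills; the generator name `stream_windows` and that start-up behavior are assumptions, not details given in this application.

```python
from collections import deque

WINDOW_SAMPLES = 50  # 5 s window at 100 ms per sample
HOP_SAMPLES = 10     # recompute every 1 s

def stream_windows(samples, window=WINDOW_SAMPLES, hop=HOP_SAMPLES):
    """Yield (index of newest sample, window contents) once per hop.

    Nothing is yielded until the buffer holds a full window.
    """
    buf = deque(maxlen=window)
    for i, sample in enumerate(samples):
        buf.append(sample)
        if len(buf) == window and (i + 1 - window) % hop == 0:
            yield i, list(buf)

# 10 s of synthetic samples produce one update per second after the first 5 s.
updates = [end for end, _ in stream_windows(range(100))]
first_window = next(stream_windows(range(100)))[1]
```

Each yielded window would then be passed through feature extraction, Equation 1, Equation 2, and classification before the next hop arrives.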

07 Drawings

[Figure 1: block diagram. Inputs: Video Channel (30 fps, face detection), Audio Channel (44.1 kHz, prosody), Text Channel (ASR + sentiment). Stages: TSE (Time-Series Extractor); CMA (cross-correlation, coherence, DTW); ISC (Incongruence Composite Score); LSC (L1-L8 classification of eight latent states); CNM (Z-score vs. N ≥ 500 normative population); RTA (Real-Time Alert, Z ≥ +2 → supervisor alert); Clinical Decision Support (latent-state alert, clinical notes, supervisor handoff).]
FIG. 1. Three-channel input (Video/Audio/Text) → TSE → CMA → ISC → LSC → CNM → RTA → Clinical Decision Support.
[Figure 2: timeline from t = 0 to 4 s, titled "Temporal Lag Pattern: Masked Depression Example". TEXT: "I'm fine, really" (sentiment +0.7). VOICE: F0 ↓, Energy ↓. FACE: AU1+AU15 sadness AU at +1.4 s. Lag = 1.4 s (≫ 200 ms).]
FIG. 2. Detection example for masked depression (L1). "I'm fine" (text sentiment +0.7) is immediately followed by a vocal F0 drop, with a sadness AU emerging 1.4 seconds later → ICS ≥ 50, Z ≥ 2.5 → classified as masked depression.

08 Claims

Claim 1 (Independent)
A computer-implemented method for automatically detecting a user's non-verbal latent clinical state, the method comprising:
(a) receiving three temporally synchronized input channels of video, audio, and text from a user;
(b) extracting from the video channel intensity time series of a plurality of Facial Action Units (FAUs); extracting from the audio channel time series of a plurality of prosodic features (F0, jitter, shimmer, energy); and extracting from the text channel time series of sentiment polarity and intensity;
(c) computing, for each pair of the three channels' time series, a cross-correlation function and the time lag corresponding to the maximum correlation value;
(d) computing an Incongruence Composite Score (ICS) in the range of 0 to 100 by aggregating the per-pair time lags and correlation values;
(e) classifying, based on the ICS and channel-specific time-series patterns, the user's state into one of eight latent state classes comprising masked depression, suppressed anger, dissociative self-report, alexithymia, social compliance, anxiety concealment, intentional deception, and suicide-risk signal; and
(f) issuing a clinical alert when the Z-score of the classified latent state, computed against a clinical normative population, is greater than or equal to +2.0.
Claim 2 (Dependent)
The method of Claim 1, wherein the Facial Action Units of step (b) comprise the intensities (0-5) of seventeen core AUs including at least AU1, AU4, AU6, AU12, AU15, AU17, AU23, and AU45, sampled at thirty (30) frames per second or greater.
Claim 3 (Dependent)
The method of Claim 1, wherein the time lag of step (c) is computed within a range of -2.0 to +2.0 seconds, the normal congruence threshold is ±200 ms, and time lags exceeding 200 ms in absolute value are classified as incongruence signals.
Claim 4 (Dependent)
The method of Claim 1, wherein the Incongruence Composite Score of step (d) is computed by the following equation: $\mathrm{ICS} = \sum_{p \in \{FV, FT, VT\}} w_p \cdot \lvert 1 - R_p(0) \rvert \cdot \exp(\lvert \mathrm{lag}_p \rvert / \tau_0)$, where $R_p(0)$ is the correlation value at zero lag, $\mathrm{lag}_p$ is the optimal time lag for channel pair $p$, $\tau_0$ is the 200 ms scale, and $w_p$ are empirically calibrated weights summing to one.
Claim 5 (Dependent)
The method of Claim 1, wherein the suicide-risk-signal class among the classifications of step (e) is identified by a combined pattern of calm emotional expression in the text channel, strong concurrent expression of AU15 and AU17 in the facial channel, and a sharp drop in F0 in the audio channel.
Claim 6 (Dependent)
The method of Claim 1, further comprising performing real-time processing by iterating steps (b) through (e) every one second within a five-second sliding window, wherein the clinical alert of step (f) is issued within five seconds of the threshold being exceeded.
Claim 7 (Dependent)
The method of Claim 1, wherein the clinical normative population used for Z-score computation in step (f) is based on the mean and standard deviation of the ICS distribution collected from at least five hundred (500) clinically non-pathological control subjects.
Claim 8 (Independent: System)
A multi-modal latent-state detection system for performing the method of any one of Claims 1 through 7, the system comprising at least one processor, a multi-modal interface for receiving video, audio, and text inputs, and a non-transitory computer-readable storage medium storing instructions executable by the processor.
Claim 9 (Independent: Medium)
A non-transitory computer-readable storage medium storing instructions which, when executed by a computer, cause the computer to perform the method of any one of Claims 1 through 7.

09 Prior Art Comparison

| Prior Art | Approach | Limitation | Distinction |
|---|---|---|---|
| Affectiva (Smart Eye) | face only; single modality | no integration with voice or text | integrated tri-channel temporal congruence |
| iMotions | simultaneous biometric and facial measurement | no temporal-lag analysis | quantification of lag via cross-correlation and DTW |
| RealEyes / FacePlusPlus | facial affect classification | advertising-oriented; not clinically validated | normalization against clinical normative population |
| Cogito (call-center voice) | voice-only analysis | no face or text | simultaneous tri-modal temporal congruence |
| Ginger.io / Mindstrong | smartphone keystroke patterns | no face or voice | real-time video, audio, and text |
| Ekman Group / METT | micro-expression training | human-rated; not automated | fully automated, real-time, and algorithmic |

🎯 Inventive Step

The present invention (1) is the first system to simultaneously analyze temporal congruence across the three channels of face, voice, and text; (2) maps temporal incongruence patterns to eight clinical latent states; (3) sets objective thresholds via clinical normative-population normalization; and (4) issues real-time alerts within five seconds. In particular, the objective detection of the suicide-risk signal (L8) holds clinical, legal, and ethical value by capturing concealed risk that self-report-dependent systems are likely to miss.

10 Industrial Applicability

10.1 Target Markets

10.2 Ethical Safeguards

The invention complies with the following ethical principles: (1) the system operates only after user consent; (2) all data is encrypted end-to-end; (3) clinical outputs are disclosed only to qualified clinicians; (4) algorithmic decisions always undergo clinician review; and (5) users retain the rights to delete their data and to withdraw. In particular, the L7 (intentional deception) class is not recommended for use outside clinical settings, and forensic-psychology applications are conducted only under the supervision of qualified experts.

10.3 Regulatory Pathway

The invention is amenable to review as an FDA Class II medical device under the 510(k) or De Novo pathway, potentially as a Digital Therapeutic (DTx). Following the applicable marketing authorization supported by clinical trials (510(k) clearance or De Novo classification, or Premarket Approval (PMA) should the device be classified as Class III), the system may be used clinically as a diagnostic-aid tool.

11 References

Key papers and resources providing the theoretical and clinical basis for this invention.

A. Facial Action Coding (FACS) and Micro-Expressions
B. Vocal Prosody and Clinical Voice Analysis
C. Text Sentiment Analysis and NLP
D. Multi-Modal Affect Fusion
E. Temporal Alignment and Dynamic Time Warping (DTW)
F. Masked Depression and Self-Report Limits
G. Suicide Risk Assessment and Safety Monitoring
H. Ethics and Forensic Psychology