The present invention relates to a computer-implemented method and system that extracts three time series from a user's synchronized video, audio, and text inputs: (i) Facial Action Units (FAUs), (ii) prosodic features (F0, shimmer, jitter, and pitch contour), and (iii) textual sentiment polarity. The system computes inter-modal temporal lag and incongruence scores via cross-correlation, coherence, and Dynamic Time Warping (DTW), thereby converting temporal mismatch patterns across expression channels (for example, a vocal tremor immediately following a smile, or an anger micro-expression preceding positive text by 0.4 seconds) into clinical signals. These signals are Z-scored against a clinical normative population to classify and alert on eight latent states, including masked depression, suppressed anger, and dissociative self-report.
The present invention pertains to the fields of Multi-Modal Affective Computing, Clinical Diagnostic Aids, and Artificial Intelligence Conversational Systems. More specifically, it concerns a system that detects latent clinical states by analyzing temporal congruence among facial, vocal, and textual channels.
One of the oldest problems in clinical psychology is the gap between "what the patient reports" and "the patient's actual state." The following is a non-exhaustive list of self-report limitations consistently documented in the clinical literature:
A series of studies by Dr. Paul Ekman (1969-2003) established that micro-expressions, which last 0.04 to 0.2 seconds and are difficult to suppress consciously, reveal genuine emotion. The Facial Action Coding System (FACS) decomposes facial expressions into 44 Action Units (AUs), and specific AU combinations can distinguish genuine from feigned affect.
| AU | Muscle Movement | Affective Signal |
|---|---|---|
| AU1 | inner brow raiser | sadness, fear |
| AU4 | brow lowerer | anger, concentration |
| AU6 | cheek raiser (Duchenne) | genuine smile |
| AU12 | lip corner puller | smile |
| AU15 | lip corner depressor | sadness |
| AU17 | chin raiser | doubt, anger |
| AU23 | lip tightener | suppressed anger |
| AU45 | blink (frequency) | anxiety, cognitive load |
Existing affective computing systems (Affectiva, RealEyes, iMotions, and the like) suffer from the following limitations:
The three channels are combined into temporally synchronized input streams.
They are aligned to a common timestamp axis, and a synchronized sample is produced every 100 ms.
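The alignment step can be illustrated with a minimal sketch. The text states only the common timestamp axis and the 100 ms interval; the sample-and-hold strategy, the `(timestamp_ms, value)` input layout, and the function names `resample_to_grid` and `synchronize` are assumptions made for illustration, not the method described in the specification.

```python
from bisect import bisect_right

STEP_MS = 100  # synchronized sample interval stated in the text


def resample_to_grid(samples, start_ms, end_ms, step_ms=STEP_MS):
    """Sample-and-hold resampling onto a fixed grid (an assumed strategy).

    `samples` is a list of (timestamp_ms, value) pairs sorted by timestamp.
    Grid points that precede the first sample receive None.
    """
    times = [t for t, _ in samples]
    grid = []
    for t in range(start_ms, end_ms + 1, step_ms):
        i = bisect_right(times, t) - 1  # latest sample at or before t
        grid.append((t, samples[i][1] if i >= 0 else None))
    return grid


def synchronize(face, voice, text, start_ms, end_ms):
    """Return one (t, face, voice, text) tuple per 100 ms grid point."""
    f = resample_to_grid(face, start_ms, end_ms)
    v = resample_to_grid(voice, start_ms, end_ms)
    x = resample_to_grid(text, start_ms, end_ms)
    return [(t, fv, vv, xv) for (t, fv), (_, vv), (_, xv) in zip(f, v, x)]
```

Sample-and-hold keeps the most recent observation of each channel at every grid point; averaging within each 100 ms bin or interpolating would be equally plausible choices.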
Using OpenFace or an equivalent library, the intensity (0 to 5) of seventeen core AUs is computed at each frame.
Prosodic features are analyzed in 10 ms windows and downsampled to the 100 ms sample interval.
Textual sentiment is analyzed at the word level with a Transformer-based model (BERT/RoBERTa) and mapped to word timestamps.
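As a rough sketch of the timestamp mapping, the following spreads word-level polarity over 100 ms bins. It assumes the sentiment model has already produced one polarity value per word together with word start and end times; the tuple layout, the neutral default of 0.0, and the name `polarity_series` are hypothetical.

```python
def polarity_series(words, duration_ms, step_ms=100, default=0.0):
    """Map word-level polarity onto fixed-width time bins.

    `words` is a list of (start_ms, end_ms, polarity) tuples. Bin i covers
    [i * step_ms, (i + 1) * step_ms). Every bin that a word overlaps takes
    that word's polarity; if two words share a bin, the later one wins.
    Bins with no word keep `default` (assumed neutral).
    """
    n_bins = -(-duration_ms // step_ms)  # ceiling division
    series = [default] * n_bins
    for start_ms, end_ms, polarity in words:
        first = start_ms // step_ms
        last = min(max(end_ms - 1, start_ms) // step_ms, n_bins - 1)
        for i in range(first, last + 1):
            series[i] = polarity
    return series
```

The resulting list shares the 100 ms grid of the facial and vocal series, so it can be compared with them sample by sample.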
For each of the three channel pairs (F-V, F-T, and V-T, where F, V, and T denote the facial, vocal, and textual channels), the cross-correlation function is computed, and the lag at which it peaks is extracted.
In normally congruent states, the inter-modal lag stays within ±200 ms. In latent states, the lag magnitude exceeds 200 ms or the lag reverses in sign (for example, positive text accompanied by a negative facial expression that precedes it by 0.4 seconds indicates suppression).
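The lag extraction can be sketched as follows for two channels already sampled on the 100 ms grid. The sketch uses a Pearson-normalized cross-correlation evaluated at each candidate lag, which is one common variant and an assumption here; the function names and the sign convention (positive lag means the first series leads) are likewise illustrative.

```python
from math import sqrt

STEP_MS = 100           # grid interval stated in the text
CONGRUENT_LAG_MS = 200  # congruence bound stated in the text


def _pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    if saa == 0 or sbb == 0:
        return 0.0
    return sab / sqrt(saa * sbb)


def peak_lag(x, y, max_lag):
    """Return (lag, r) maximizing the correlation of x[t] with y[t + lag].

    x and y must have equal length; lag is in samples.
    A positive lag means x leads y.
    """
    best_lag, best_r = 0, -2.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        if len(a) < 2:
            continue  # too little overlap to correlate
        r = _pearson(a, b)
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r


def is_congruent(lag_samples):
    """True when the lag magnitude stays within the 200 ms bound."""
    return abs(lag_samples * STEP_MS) <= CONGRUENT_LAG_MS
```

With a facial event at sample 2 and the matching textual event at sample 6, `peak_lag` returns a lag of 4 samples (400 ms), which `is_congruent` rejects, in line with the 0.4-second example in the text.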
| # | Latent State | Detection Pattern | ICS Threshold |
|---|---|---|---|
| L1 | Masked Depression | positive text + facial AU1+AU15 + reduced vocal F0 and energy | ≥45 |
| L2 | Suppressed Anger | neutral text + AU4 / AU23 leading by ≥ 200 ms | ≥40 |
| L3 | Dissociative Self-Report | emotion words + monotone voice and flat facial affect | ≥50 |
| L4 | Alexithymia | strong facial/vocal affect + no emotional vocabulary in text | ≥35 |
| L5 | Social Compliance | positive text + AU14 (deceptive smile) + vocal incongruence | ≥30 |
| L6 | Anxiety Concealment | neutral text + elevated AU45 + elevated vocal jitter | ≥30 |
| L7 | Intentional Deception | F-T lag > 500 ms + elevated vocal pauses | ≥55 |
| L8 | Suicide-Risk Signal | calm text + strong AU15+AU17 + sharp drop in vocal F0 | ≥60 |
When the L8 class is detected, the system automatically issues an immediate alert to the clinical supervisor and activates a patient-safety protocol. This represents a core clinical value: the objective capture of concealed suicide risk that self-report-based systems are likely to miss.
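A minimal sketch of the thresholding and escalation logic follows. The threshold values are taken from the table; the assumption that each state whose detection pattern matched carries its own ICS, along with the names `ICS_THRESHOLDS`, `ALERT_STATES`, and `triage`, is illustrative only.

```python
# ICS thresholds per latent state, as listed in the table.
ICS_THRESHOLDS = {
    "L1": 45, "L2": 40, "L3": 50, "L4": 35,
    "L5": 30, "L6": 30, "L7": 55, "L8": 60,
}

# States that trigger an immediate supervisor alert (only L8 per the text).
ALERT_STATES = {"L8"}


def triage(scores):
    """Split matched states into flagged and urgent lists.

    `scores` maps a state id to the ICS computed for that state's
    detection pattern (an assumed input format).
    """
    flagged = sorted(
        state for state, ics in scores.items()
        if ics >= ICS_THRESHOLDS[state]
    )
    urgent = [state for state in flagged if state in ALERT_STATES]
    return flagged, urgent
```

Keeping the alert set separate from the threshold table makes the escalation rule explicit and easy to review by a clinician.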
The ICS is Z-scored against the normative ICS distribution computed from the Boston Neuromind clinical dataset (N ≥ 500).
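The normalization can be sketched as a standard Z-score, z = (ICS - mean) / standard deviation. The use of the sample standard deviation and the function name `ics_z_score` are assumptions; the normative values in the usage lines are placeholders, not data from the clinical dataset.

```python
from statistics import mean, stdev


def ics_z_score(ics, normative_scores):
    """Z-score an ICS against a list of normative-population ICS values.

    Uses the sample standard deviation (an assumed choice).
    """
    mu = mean(normative_scores)
    sigma = stdev(normative_scores)
    if sigma == 0:
        raise ValueError("normative scores have zero variance")
    return (ics - mu) / sigma


# Placeholder normative values for illustration only.
normative = [10, 20, 30, 40, 50]
z = ics_z_score(60, normative)  # about 1.90 standard deviations above the mean
```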
| Prior Art | Approach | Limitation | Distinction from the Present Invention |
|---|---|---|---|
| Affectiva (Smart Eye) | face only; single modality | no integration with voice or text | integrated tri-channel temporal congruence |
| iMotions | simultaneous biometric and facial measurement | no temporal-lag analysis | quantification of lag via cross-correlation and DTW |
| RealEyes / FacePlusPlus | facial affect classification | advertising-oriented; not clinically validated | normalization against clinical normative population |
| Cogito (call-center voice) | voice-only analysis | no face or text | simultaneous tri-modal temporal congruence |
| Ginger.io / Mindstrong | smartphone keystroke patterns | no face or voice | real-time video, audio, and text |
| Ekman Group / METT | micro-expression training | human-rated; not automated | fully automated, real-time, and algorithmic |
The present invention (1) is the first system to analyze temporal congruence across the three channels of face, voice, and text simultaneously; (2) maps temporal incongruence patterns to eight clinical latent states; (3) sets objective thresholds via normalization against a clinical normative population; and (4) issues real-time alerts within five seconds. In particular, the objective detection of the suicide-risk signal (L8) holds clinical, legal, and ethical value by capturing concealed risk that self-report-dependent systems are likely to miss.
The invention complies with the following ethical principles: (1) the system operates only after user consent; (2) all data are encrypted end-to-end; (3) clinical outputs are disclosed only to qualified clinicians; (4) algorithmic decisions always undergo clinician review; and (5) users retain the rights to delete their data and to withdraw. In particular, the L7 (intentional deception) class is not recommended for use outside clinical settings, and forensic-psychology applications are conducted only under the supervision of qualified experts.
The invention may be reviewed as an FDA Class II medical device under the 510(k) pathway, or as a digital therapeutic (DTx) under the De Novo pathway. Following clinical trials and FDA marketing authorization (for example, Premarket Approval (PMA), should that route apply), the system may be used clinically as a diagnostic-aid tool.
Key papers and resources providing the theoretical and clinical basis for this invention.