Provisional Patent Application · 3/3

Temporal Congruence Detection System

A computer-implemented method for automatically detecting latent clinical states that do not surface in conscious reporting (including masked depression, suppressed anger, and incongruent self-report) by performing multi-modal analysis of temporal alignment patterns among a user's facial micro-expressions, vocal prosody, and textual semantics.

Applicant: Boston Neuromind, LLC

Inventor: [Inventor Name] (BCN, PhD)

Status: USPTO Provisional Pending

Classification: G06V 40/16 / G10L 25/63 / G16H 50/20

Table of Contents

Abstract
Field of Invention
Background
Problem Statement
Summary of Invention
Detailed Description
Drawings
Claims
Prior Art Comparison
Industrial Applicability
References

01 Abstract

📋 One-Paragraph Summary

The present invention relates to a computer-implemented method and system that extracts three time series from a user's synchronized video, audio, and text inputs: (i) Facial Action Units (FAUs), (ii) prosodic features (F0, shimmer, jitter, pitch contour), and (iii) textual sentiment polarity. The system computes inter-modal temporal lag and incongruence scores via cross-correlation, coherence, and Dynamic Time Warping (DTW), thereby converting temporal mismatch patterns across expression channels into clinical signals; examples include vocal tremor immediately following a smile, or an anger micro-expression preceding positive text by 0.4 seconds. These signals are Z-scored against a clinical normative population to classify and alert on eight latent states, including masked depression, suppressed anger, and dissociative self-report.

02 Field of Invention

The present invention pertains to the fields of Multi-Modal Affective Computing, Clinical Diagnostic Aids, and Artificial Intelligence Conversational Systems. More specifically, it concerns a system that detects latent clinical states by analyzing temporal congruence among facial, vocal, and textual channels.

03 Background

3.1 Limits of Self-Report and Latent States

One of the oldest problems in clinical psychology is the gap between "what the patient reports" and "the patient's actual state." The following is a non-exhaustive list of self-report limitations consistently documented in the clinical literature:

Masked depression: a patient who says "I'm fine" while micro-expressions and vocal markers show depression signals (Goldney 1989)
Alexithymia: inability to verbalize one's own emotions (Sifneos 1973)
Suppressed anger: 0.2–0.5-second anger AUs immediately preceding or following a smile or positive expression (Ekman 2003)
Dissociative self-report: trauma patients narrating events without affect or from a "third-person" perspective (van der Kolk 2014)
Social desirability bias: compliant "I'm doing well" responses to a clinician (Edwards 1957)

3.2 Ekman's Micro-Expression Research

A series of studies by Dr. Paul Ekman (1969–2003) established that micro-expressions, lasting 0.04 to 0.2 seconds and difficult to consciously suppress, reveal genuine emotion. The Facial Action Coding System (FACS) decomposes facial expressions into 44 Action Units (AUs), and specific AU combinations can distinguish genuine from feigned affect.

AU	Muscle Movement	Affective Signal
AU1	inner brow raise	sadness, fear
AU4	brow lowerer	anger, concentration
AU6	cheek raiser (Duchenne)	genuine smile
AU12	lip corner puller	smile
AU15	lip corner depressor	sadness
AU17	chin raiser	doubt, anger
AU23	lip tightener	suppressed anger
AU45	blink (frequency)	anxiety, cognitive load

3.3 Limitations of Single-Modality Analysis

Existing affective computing systems (Affectiva, RealEyes, iMotions, and the like) suffer from the following limitations:

Single-modality analysis: face only, voice only, or text only, with no multi-modal integration
Instantaneous classification: per-frame affect classification with no temporal congruence or lag analysis
Non-clinical populations: optimized for advertising and consumer-response analysis, with no normalization against a clinical normative population
No latent-state inference: classifies only conscious expression and cannot infer latent states

04 Problem Statement

Absence of temporal congruence analysis. No existing system analyzes temporal lag among facial, vocal, and textual channels to infer latent states.
No simultaneous integration of three modalities. Existing systems combine at most two modalities; none performs simultaneous temporal congruence analysis across three.
No clinical latent-state mapping. No standardized criterion exists for mapping temporal incongruence patterns to clinical latent states (masked depression, suppressed anger, and the like).
No real-time processing. Existing multi-modal systems perform post-hoc analysis only and cannot detect latent states during live conversation.
No self-report validation tool. No system objectively validates whether a patient's self-report aligns with non-verbal signals.

05 Summary of Invention

5.1 System Components

3-Channel Time-Series Extractor (TSE): computes FAU, prosody, and sentiment polarity time series from synchronized video, audio, and text
Cross-Modal Aligner (CMA): computes the time lag between each pair of modalities via cross-correlation, coherence, and DTW
Incongruence Scorer (ISC): computes a composite incongruence score across the three modalities (0–100)
Latent State Classifier (LSC): classifies into eight latent state classes
Clinical Normalizer (CNM): Z-scores against the clinical normative population
Real-Time Alert Module (RTA): issues clinical alerts when a threshold is exceeded
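The Cross-Modal Aligner names DTW as one of its alignment techniques, but the text does not spell out how it is applied. As a minimal sketch of the classic dynamic-programming formulation only (the function name `dtw_distance` and the absolute-difference local cost are assumptions for illustration, not taken from the application):

```python
def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping distance between two 1-D series.

    Local cost is |a[i] - b[j]| (an assumed choice). The returned value is
    the accumulated cost of the cheapest monotonic alignment path.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# A series and a time-stretched copy of it align with zero cost.
face = [0, 1, 2, 1, 0]
voice = [0, 0, 1, 2, 1, 0]
distance = dtw_distance(face, voice)
```

Unlike cross-correlation, which assumes one fixed shift for the whole window, DTW tolerates a lag that varies over time, which may be why both are listed.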

5.2 Inventive Steps

Simultaneous tri-modal temporal congruence: first integrated three-channel temporal analysis
Latent-state mapping algorithm: conversion of temporal incongruence patterns into eight clinical classes
Clinical normative-population calibration: Z-scoring against a healthy population for threshold determination
Real-time processing: latent-state detection within five seconds of streaming input

06 Detailed Description

6.1 Input Data Synchronization

The three channels are combined into temporally synchronized input streams:

Video: 30 fps, 720p+ resolution, face detection plus alignment
Audio: 44.1 kHz sampling, mono PCM, noise reduction plus speaker diarization
Text: ASR (Automatic Speech Recognition) or direct user input, with word-level timestamps

The three channels are aligned to a common timestamp axis and produce a synchronized sample every 100 ms.
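The text does not say how streams with different native rates (30 fps video, word-level text timestamps) are brought onto the 100 ms axis. One simple possibility is sample-and-hold resampling, sketched below; the function name `resample_to_grid`, the hold strategy, and the use of `None` before the first observation are all assumptions made for illustration.

```python
from bisect import bisect_right


def resample_to_grid(timestamps, values, step=0.1, duration=None):
    """Sample-and-hold resampling onto a uniform grid (default 100 ms).

    timestamps: ascending observation times in seconds.
    values:     one value per timestamp.
    Each grid point takes the most recent observation at or before it;
    grid points earlier than the first observation get None.
    """
    if duration is None:
        duration = timestamps[-1]
    n_points = int(round(duration / step)) + 1
    grid_times = [round(i * step, 6) for i in range(n_points)]
    grid_values = []
    for t in grid_times:
        # Small epsilon guards against floating-point error at exact ties.
        idx = bisect_right(timestamps, t + 1e-9) - 1
        grid_values.append(values[idx] if idx >= 0 else None)
    return grid_times, grid_values


# 30 fps frame indices over one second, reduced to 11 grid samples.
frame_times = [i / 30 for i in range(31)]
grid_t, grid_frames = resample_to_grid(frame_times, list(range(31)), duration=1.0)

# Word-level sentiment values held until the next word starts.
_, grid_text = resample_to_grid([0.25, 0.62], [0.7, -0.2], duration=1.0)
```

Averaging the frames that fall inside each 100 ms bin would be an equally plausible choice for the dense video and audio channels; hold-last-value is shown because it also works for sparse word timestamps.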

6.2 Per-Modality Time-Series Feature Extraction

(a) Facial Channel (Action Units)

Using OpenFace or an equivalent library, the intensity (0–5) of seventeen core AUs is computed at each frame:

F(t) = [AU1, AU2, AU4, AU5, AU6, AU7, AU9, AU10, AU12, AU14, AU15, AU17, AU20, AU23, AU25, AU26, AU45] at time t (sampled @ 30 fps)

(b) Vocal Channel (Prosodic Features)

Features are analyzed in 10 ms windows and downsampled to 100 ms:

V(t) = [F0_mean, F0_variance, F0_slope, Jitter, Shimmer, Energy, Pitch_contour, Speech_rate, Pause_ratio]

(c) Textual Channel (Sentiment plus Semantics)

Transformer-based (BERT/RoBERTa) word-level analysis, mapped to word timestamps:

T(t) = [Sentiment_polarity (-1..+1), Sentiment_intensity (0..1), Emotion_anger, Emotion_sadness, Emotion_joy, Emotion_fear, Word_negation, Self_reference]

6.3 Cross-Modal Time Lag Computation

For each of the three channel pairs (F–V, F–T, V–T), the cross-correlation function is computed and the lag at its maximum is extracted:

Equation 1: Cross-Correlation

R_xy(τ) = Σ_t [ x(t) · y(t+τ) ] / sqrt(Σx² · Σy²)
lag_xy = argmax_τ |R_xy(τ)|, where τ ∈ [-2.0 s, +2.0 s]

In normally congruent states, inter-modal lag remains within ±200 ms. In latent states, lag exceeds ±200 ms or reverses in sign (for example, positive text together with a negative facial expression that precedes it by 0.4 seconds → suppression).
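Equation 1 can be sketched directly on the 100 ms grid, where the ±2.0 s search range corresponds to shifts of -20 to +20 samples. The function names below are hypothetical, the series are assumed to be single-feature and already mean-centred by the caller, and ties are broken toward the smaller absolute lag (a choice the text does not specify).

```python
import math


def cross_correlation(x, y, k):
    """R_xy at integer shift k: sum_t x[t] * y[t + k] / sqrt(sum x^2 * sum y^2)."""
    n = len(x)
    norm = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    if norm == 0.0:
        return 0.0  # a flat (all-zero) series carries no timing information
    total = sum(x[t] * y[t + k] for t in range(n) if 0 <= t + k < n)
    return total / norm


def best_lag(x, y, step=0.1, max_lag_s=2.0):
    """Return (lag in seconds, R at that lag) maximising |R_xy| over the search range.

    A positive lag means y follows x, i.e. x leads.
    """
    max_k = min(int(round(max_lag_s / step)), len(x) - 1)
    best_k = max(range(-max_k, max_k + 1),
                 key=lambda k: (abs(cross_correlation(x, y, k)), -abs(k)))
    return best_k * step, cross_correlation(x, y, best_k)


# Toy 5-second window: a facial event at sample 10, a text event at sample 14.
face = [0.0] * 50
text = [0.0] * 50
face[10] = 1.0
text[14] = 1.0

lag_s, r = best_lag(face, text)
exceeds_normal_range = abs(lag_s) > 0.2  # the ±200 ms congruence bound from the text
```

In this toy window the facial event leads the text event by four samples, so the recovered lag falls outside the ±200 ms range described above.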

6.4 Incongruence Composite Score (ICS)

Equation 2: Incongruence Score

ICS = w_FV · |1 - R_FV(0)| · exp(|lag_FV|/τ_0)
    + w_FT · |1 - R_FT(0)| · exp(|lag_FT|/τ_0)
    + w_VT · |1 - R_VT(0)| · exp(|lag_VT|/τ_0)

where:
  R_xy(0) = correlation at zero lag (synchronicity)
  lag_xy = optimal lag from Equation 1
  τ_0 = scale (200 ms, the normal congruence threshold)
  w_* = empirically calibrated weights, sum = 1

ICS range: 0 (perfect congruence) to 100 (extreme incongruence)
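A minimal sketch of Equation 2 follows. One point the text leaves open: with τ_0 = 200 ms and lags of up to 2 s, the exponential factor can reach exp(10), so the raw weighted sum is not by itself confined to 0–100. The `scale` parameter and the clip at 100 below are therefore assumptions added for illustration, as are the function name and the example weights.

```python
import math


def incongruence_composite_score(pairs, weights, tau0=0.2, scale=1.0):
    """Weighted sum from Equation 2, clipped to at most 100.

    pairs:   {"FV": (r0, lag_s), "FT": (...), "VT": (...)} where r0 is the
             zero-lag correlation and lag_s the optimal lag in seconds.
    weights: {"FV": w, "FT": w, "VT": w}, required to sum to 1.
    scale:   hypothetical calibration factor (not given in the text).
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    raw = sum(
        weights[name] * abs(1.0 - r0) * math.exp(abs(lag_s) / tau0)
        for name, (r0, lag_s) in pairs.items()
    )
    return min(100.0, scale * raw)


# Illustrative values only: face and voice agree, text trails the face by 0.4 s.
example_pairs = {"FV": (1.0, 0.0), "FT": (0.5, 0.4), "VT": (0.5, 0.0)}
example_weights = {"FV": 0.4, "FT": 0.4, "VT": 0.2}
ics = incongruence_composite_score(example_pairs, example_weights)
```

Because the lag term is exponential, a single pair with a large lag dominates the sum; any calibration of `scale` or of the weights would need to account for that.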

6.5 Eight Latent State Classes

#	Latent State	Detection Pattern	ICS
L1	Masked Depression	positive text + facial AU1+AU15 + vocal F0↓ and Energy↓	≥45
L2	Suppressed Anger	neutral text + AU4 / AU23 leading by ≥ 200 ms	≥40
L3	Dissociative Self-Report	emotion words plus monotone voice and facial flat affect	≥50
L4	Alexithymia	strong facial/vocal affect with absent emotional vocabulary in text	≥35
L5	Social Compliance	positive text + AU14 (deceptive smile) + vocal incongruence	≥30
L6	Anxiety Concealment	neutral text + AU45 ↑ and vocal jitter ↑	≥30
L7	Intentional Deception	F–T lag > 500 ms with elevated vocal pauses	≥55
L8	Suicide-Risk Signal	calm text + strong AU15+AU17 + dramatic vocal F0 drop	≥60
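The table pairs each class with a channel pattern and a minimum ICS. The sketch below shows only the ICS gating step: it assumes some upstream detector (not shown, and not specified here) has already decided which channel patterns matched, and it returns every class whose minimum is met. How a single class is selected when several qualify is not described in the text, so no tie-breaking is attempted. The names `ICS_MINIMUM` and `candidate_classes` are hypothetical.

```python
# Minimum ICS per class, copied from the table above.
ICS_MINIMUM = {
    "L1": 45,  # masked depression
    "L2": 40,  # suppressed anger
    "L3": 50,  # dissociative self-report
    "L4": 35,  # alexithymia
    "L5": 30,  # social compliance
    "L6": 30,  # anxiety concealment
    "L7": 55,  # intentional deception
    "L8": 60,  # suicide-risk signal
}


def candidate_classes(ics, matched_patterns):
    """Return the classes whose pattern matched and whose ICS minimum is met.

    matched_patterns: iterable of class ids whose channel pattern was
    detected upstream (assumed input).
    """
    return sorted(c for c in matched_patterns if ics >= ICS_MINIMUM[c])


# Patterns for L1, L3 and L5 matched, but ICS = 47 is below the L3 minimum.
candidates = candidate_classes(47, {"L1", "L3", "L5"})
```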

⚠️ L8 Suicide-Risk Signal: Clinical Responsibility

When the L8 class is detected, the system automatically issues an immediate alert to the clinical supervisor and activates a patient-safety protocol. This represents a core clinical value: the objective capture of concealed suicide risk that self-report-based systems are likely to miss.

6.6 Clinical Normative-Population Normalization

The ICS is Z-scored against the normative ICS distribution computed from the Boston Neuromind clinical dataset (N ≥ 500):

Z_ICS = (ICS_observed - μ_normal) / σ_normal

Threshold:
  Z < +1.0 → Normal congruence
  +1.0 ≤ Z < +2.0 → Mild incongruence
  +2.0 ≤ Z < +3.0 → Moderate incongruence (alert)
  Z ≥ +3.0 → Severe incongruence (immediate alert + supervisor)
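The Z-scoring and banding above reduce to a few lines. In this sketch the normative mean and standard deviation (30 and 12) are placeholder numbers chosen only to make the example concrete; the actual values would come from the normative dataset, and the function names are hypothetical.

```python
def z_score(ics_observed, mu_normal, sigma_normal):
    """Z_ICS = (ICS_observed - mu_normal) / sigma_normal."""
    if sigma_normal <= 0:
        raise ValueError("sigma_normal must be positive")
    return (ics_observed - mu_normal) / sigma_normal


def severity_band(z):
    """Map a Z-score to the four bands listed in the threshold table."""
    if z < 1.0:
        return "normal"
    if z < 2.0:
        return "mild"
    if z < 3.0:
        return "moderate"  # alert
    return "severe"        # immediate alert + supervisor


# Placeholder normative statistics, for illustration only.
z = z_score(62.0, mu_normal=30.0, sigma_normal=12.0)
band = severity_band(z)
```

Each lower bound is inclusive, so a Z-score of exactly +2.0 already falls in the alerting band, matching the "≥ +2.0" condition in step (f) of Claim 1.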

6.7 Real-Time Processing Pipeline

Receive streaming input (video / audio / ASR text)
Synchronously sample the three channels every 100 ms
Extract features within a 5-second sliding window (50 samples)
Update the ICS every 1 second and reclassify latent states
Issue a clinical alert when a threshold is exceeded (within 5 seconds)
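The windowing in the steps above (50 samples of 100 ms each, re-evaluated once per second) can be sketched with a fixed-length buffer. The class name `SlidingWindow` and its interface are assumptions; scoring, classification, and alerting would be applied to each window it returns.

```python
from collections import deque


class SlidingWindow:
    """Holds the latest `window_samples` samples and releases a snapshot
    every `hop_samples` pushes once the buffer is full.

    Defaults correspond to a 5 s window and a 1 s update interval at one
    sample per 100 ms.
    """

    def __init__(self, window_samples=50, hop_samples=10):
        self.buffer = deque(maxlen=window_samples)
        self.hop_samples = hop_samples
        self.since_update = 0

    def push(self, sample):
        """Add one synchronized sample; return a full window or None."""
        self.buffer.append(sample)
        self.since_update += 1
        full = len(self.buffer) == self.buffer.maxlen
        if full and self.since_update >= self.hop_samples:
            self.since_update = 0
            return list(self.buffer)
        return None


# Feed 7 seconds of dummy samples and collect the windows that are released.
window = SlidingWindow()
released = []
for i in range(70):
    snapshot = window.push(i)
    if snapshot is not None:
        released.append(snapshot)
```

With these defaults the first window becomes available only after 5 seconds of input, and consecutive windows overlap by 4 seconds.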

07 Drawings

FIG. 1. Three-channel input (Video/Audio/Text) → TSE → CMA → ISC → LSC → CNM → RTA → Clinical Decision Support.

FIG. 2. Detection example for masked depression (L1). "I'm fine" (text sentiment +0.7) is immediately followed by a vocal F0 drop, with a sadness AU emerging 1.4 seconds later → ICS ≥ 50, Z ≥ 2.5 → classified as masked depression.

08 Claims

Claim 1 (Independent)

A computer-implemented method for automatically detecting a user's non-verbal latent clinical state, the method comprising:
(a) receiving three temporally synchronized input channels of video, audio, and text from a user;
(b) extracting from the video channel intensity time series of a plurality of Facial Action Units (FAUs); extracting from the audio channel time series of a plurality of prosodic features (F0, jitter, shimmer, energy); and extracting from the text channel time series of sentiment polarity and intensity;
(c) computing, for each pair of the three channels' time series, a cross-correlation function and the time lag corresponding to the maximum correlation value;
(d) computing an Incongruence Composite Score (ICS) in the range of 0 to 100 by aggregating the per-pair time lags and correlation values;
(e) classifying, based on the ICS and channel-specific time-series patterns, the user's state into one of eight latent state classes comprising masked depression, suppressed anger, dissociative self-report, alexithymia, social compliance, anxiety concealment, intentional deception, and suicide-risk signal; and
(f) issuing a clinical alert when the Z-score of the classified latent state, computed against a clinical normative population, is greater than or equal to +2.0.

Claim 2 (Dependent)

The method of Claim 1, wherein the Facial Action Units of step (b) comprise the intensities (0–5) of seventeen core AUs including at least AU1, AU4, AU6, AU12, AU15, AU17, AU23, and AU45, sampled at thirty (30) frames per second or greater.

Claim 3 (Dependent)

The method of Claim 1, wherein the time lag of step (c) is computed within a range of −2.0 to +2.0 seconds, the normal congruence threshold is ±200 ms, and time lags exceeding 200 ms in absolute value are classified as incongruence signals.

Claim 4 (Dependent)

The method of Claim 1, wherein the Incongruence Composite Score of step (d) is computed by the following equation: ICS = Σ_(p∈{FV, FT, VT}) w_p · |1 − R_p(0)| · exp(|lag_p|/τ_0), where R_p(0) is the correlation value at zero lag, lag_p is the optimal time lag for channel pair p, τ_0 is the 200 ms scale, and w_p are empirically calibrated weights summing to one.

Claim 5 (Dependent)

The method of Claim 1, wherein the suicide-risk-signal class among the classifications of step (e) is identified by a combined pattern of calm emotional expression in the text channel, strong concurrent expression of AU15 and AU17 in the facial channel, and a sharp drop in F0 in the audio channel.

Claim 6 (Dependent)

The method of Claim 1, further comprising performing real-time processing by iterating steps (b) through (e) every one second within a five-second sliding window, and wherein the clinical alert of step (f) is issued within five seconds of threshold exceedance.

Claim 7 (Dependent)

The method of Claim 1, wherein the clinical normative population used for Z-score computation in step (f) is based on the mean and standard deviation of the ICS distribution collected from at least five hundred (500) clinically non-pathological control subjects.

Claim 8 (Independent, System)

A multi-modal latent-state detection system for performing the method of any one of Claims 1 through 7, the system comprising at least one processor, a multi-modal interface for receiving video, audio, and text inputs, and a non-transitory computer-readable storage medium storing instructions executable by the processor.

Claim 9 (Independent, Medium)

A non-transitory computer-readable storage medium storing instructions which, when executed by a computer, cause the computer to perform the method of any one of Claims 1 through 7.

09 Prior Art Comparison

Prior Art	Approach	Limitation	Distinction
Affectiva (Smart Eye)	face only; single modality	no integration with voice or text	integrated tri-channel temporal congruence
iMotions	simultaneous biometric and facial measurement	no temporal-lag analysis	quantification of lag via cross-correlation and DTW
RealEyes / FacePlusPlus	facial affect classification	advertising-oriented; not clinically validated	normalization against clinical normative population
Cogito (call-center voice)	voice-only analysis	no face or text	simultaneous tri-modal temporal congruence
Ginger.io / Mindstrong	smartphone keystroke patterns	no face or voice	real-time video, audio, and text
Ekman Group / METT	micro-expression training	human-rated; not automated	fully automated, real-time, and algorithmic

🎯 Inventive Step

The present invention (1) is the first system to simultaneously analyze temporal congruence across the three channels of face, voice, and text; (2) maps temporal incongruence patterns to eight clinical latent states; (3) sets objective thresholds via clinical normative-population normalization; and (4) issues real-time alerts within five seconds. In particular, the objective detection of the suicide-risk signal (L8) holds clinical, legal, and ethical value by capturing concealed risk that self-report-dependent systems are likely to miss.

10 Industrial Applicability

10.1 Target Markets

Clinical diagnostic aid: latent-state detection during therapist sessions, especially suicide risk
Digital mental health SaaS: safety-monitoring modules integrated into services such as Talkspace and BetterHelp
Companion bots: AI companions such as TalkCatcher objectively detecting risk states
Forensic psychology: credibility assessment and courtroom-testimony analysis
Telehealth: automatic detection of non-verbal risk signals during video consultations
School and workplace mental health: counselor support tools

10.2 Ethical Safeguards

The invention complies with the following ethical principles: (1) it operates only after user consent; (2) all data is encrypted end-to-end; (3) clinical outputs are disclosed only to qualified clinicians; (4) algorithmic decisions always undergo clinician review; and (5) users retain rights of data deletion and withdrawal. In particular, the L7 (intentional deception) class is not recommended for use outside clinical settings, and forensic-psychology applications are conducted only under the supervision of qualified experts.

10.3 Regulatory Pathway

The invention may be reviewed under the FDA 510(k) pathway as a Class II medical device or under the De Novo pathway as a digital therapeutic (DTx). Premarket Approval (PMA) is a separate pathway that applies to Class III devices. Following the applicable FDA marketing authorization, supported by clinical trials, the system may be used clinically as a diagnostic-aid tool.

11 References

Key papers and resources providing the theoretical and clinical basis for this invention.

A. Facial Action Coding (FACS) and Micro-Expressions

A1. Ekman P, Friesen WV. Facial Action Coding System (FACS): A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, 1978.
A2. Ekman P. Emotions Revealed: Recognizing Faces and Feelings to Improve Communication and Emotional Life. Henry Holt, 2003.
A3. Ekman P, Friesen WV. Nonverbal leakage and clues to deception. Psychiatry, 1969; 32(1):88-106.
A4. Cohn JF, Kruez TS, Matthews I, et al. Detecting depression from facial actions and vocal prosody. 3rd International Conference on Affective Computing and Intelligent Interaction (ACII), 2009.
A5. Yan WJ, Wu Q, Liang J, et al. CASME II: An improved spontaneous micro-expression database and the baseline evaluation. PLOS ONE, 2014; 9(1):e86041.

B. Vocal Prosody and Clinical Voice Analysis

B1. Cummins N, Scherer S, Krajewski J, et al. A review of depression and suicide risk assessment using speech analysis. Speech Communication, 2015; 71:10-49.
B2. Mundt JC, Snyder PJ, Cannizzaro MS, et al. Voice acoustic measures of depression severity and treatment response. Journal of Neurolinguistics, 2007; 20(1):50-64.
B3. Scherer KR. Vocal communication of emotion: A review of research paradigms. Speech Communication, 2003; 40(1-2):227-256.
B4. Patel S, Scherer KR, Björkner E, Sundberg J. Mapping emotions into acoustic space: The role of voice production. Biological Psychology, 2011; 87(1):93-98.

C. Text Sentiment Analysis and NLP

C1. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
C2. Liu Y, Ott M, Goyal N, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692, 2019.
C3. Pennebaker JW, Boyd RL, Jordan K, Blackburn K. The development and psychometric properties of LIWC2015. University of Texas at Austin, 2015.

D. Multi-Modal Affect Fusion

D1. Poria S, Cambria E, Bajpai R, Hussain A. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 2017; 37:98-125.
D2. Baltrušaitis T, Ahuja C, Morency LP. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019; 41(2):423-443.
D3. Calvo RA, D'Mello SK. Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 2010; 1(1):18-37.

E. Temporal Alignment and Dynamic Time Warping

E1. Sakoe H, Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1978; 26(1):43-49.
E2. Berndt DJ, Clifford J. Using Dynamic Time Warping to Find Patterns in Time Series. KDD Workshop, 1994; 359-370.
E3. Chatfield C. The Analysis of Time Series: An Introduction, 6th ed. Chapman & Hall/CRC, 2003.

F. Masked Depression and Self-Report Limits

F1. Goldney RD. Suicide and depression: An overview. Australasian Psychiatry, 1989; 23(4):529-538.
F2. Sifneos PE. The prevalence of 'alexithymic' characteristics in psychosomatic patients. Psychotherapy and Psychosomatics, 1973; 22(2-6):255-262.
F3. Edwards AL. The social desirability variable in personality assessment and research. Holt, 1957.
F4. van der Kolk BA. The Body Keeps the Score: Brain, Mind, and Body in the Healing of Trauma. Viking, 2014.
F5. Bagby RM, Parker JDA, Taylor GJ. The twenty-item Toronto Alexithymia Scale-I. Journal of Psychosomatic Research, 1994; 38(1):23-32.

G. Suicide Risk Assessment and Safety Monitoring

G1. Posner K, Brown GK, Stanley B, et al. The Columbia–Suicide Severity Rating Scale (C-SSRS): Initial Validity and Internal Consistency Findings. American Journal of Psychiatry, 2011; 168(12):1266-1277.
G2. Franklin JC, Ribeiro JD, Fox KR, et al. Risk factors for suicidal thoughts and behaviors: A meta-analysis of 50 years of research. Psychological Bulletin, 2017; 143(2):187-232.

H. Ethics and Forensic Psychology

H1. National Research Council. The Polygraph and Lie Detection. National Academies Press, 2003.
H2. U.S. Food and Drug Administration. Software as a Medical Device (SaMD): Clinical Evaluation. FDA Guidance, 2017.