Speaker Recognition
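Course topics: Audio processing, Feature extraction, Speaker recognition
Instructor: Quan Wang (Software Engineer, Google) [[course](https://www.udemy.com/course/speaker-recognition/)]
:bulb: Goals
- Study the basic concepts of speaker recognition
Learn the fundamental concepts of speaker recognition.
- Understand speech processing techniques
Understand acoustics and the techniques used to process speech data.
🚩 List of notes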
📖 Basics of Audio Processing
pattern matching, per-segment matching, Gaussian Mixture Models, factor analysis, deep learning
utterance, generation of speech(vocal folds, glottis, vocal tract), human hearing(ear canal, ossicles, cochlea)
sine wave, Fourier analysis, spectrum, frequency, fundamental frequency, pitch, formant
intensity, loudness
nonlinearity of frequency(Bark scale, Mel scale, Equivalent Rectangular Bandwidth, Cochlear frequency-position function scale), nonlinearity of intensity
Analog-to-Digital Converter(ADC): sampling(sampling rate, Nyquist frequency), quantization
audio coding: linear PCM, non-linear PCM(μ-law, A-law), Adaptive PCM, Differential PCM, Linear Predictive Coding(LPC), frequency domain coding(Sub-Band Coding, Adaptive Transform Coding)
audio formats: WAV, SPHERE, FLAC, MP3, AAC, OPUS, Speex, WMA
sound processing programs: SoX, FFmpeg
short-time analysis: framing(frame size, frame step), window function(Gaussian, Hanning, Hamming)
Frame post-processing: frame stacking, frame subsampling, frame normalization
Time domain features: short-time energy, short-time average magnitude, short-time zero-crossing rate, short-time auto-correlation, short-time average magnitude difference function, short-time linear predictive coding
Frequency domain features: Discrete Fourier Transform(DFT), Fast Fourier Transform(FFT), Short-Time Fourier Transform(STFT), Spectrogram, Cepstrum
Commonly used features: Perceptual Linear Prediction(PLP), Mel-Frequency Cepstral Coefficients(MFCC), Power-Normalized Cepstral Coefficients(PNCC), Log-mel Filterbank Energies(LFBE)
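As a rough illustration of the short-time analysis items above (framing, window function, short-time energy, short-time zero-crossing rate), here is a minimal standard-library sketch. The function names, the 16 kHz sampling rate, and the 25 ms / 10 ms frame size and step are assumptions chosen for the example, not values taken from the course.

```python
import math


def frame_signal(samples, frame_size, frame_step):
    """Split samples into overlapping frames; leftover samples that do not fill a frame are dropped."""
    return [
        samples[start:start + frame_size]
        for start in range(0, len(samples) - frame_size + 1, frame_step)
    ]


def hamming_window(n):
    """Hamming window of length n (symmetric form)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def short_time_energy(frame, window):
    """Sum of squared windowed samples in one frame."""
    return sum((x * w) ** 2 for x, w in zip(frame, window))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# Assumed setup: 1 second of a 440 Hz sine sampled at 16 kHz,
# 25 ms frames (400 samples) with a 10 ms step (160 samples).
SAMPLE_RATE = 16000
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
frames = frame_signal(signal, frame_size=400, frame_step=160)
window = hamming_window(400)
energies = [short_time_energy(f, window) for f in frames]
zcrs = [zero_crossing_rate(f) for f in frames]
```

For a pure tone the zero-crossing rate per frame should sit near twice the tone frequency divided by the sampling rate, which is one reason it is sometimes used as a cheap frequency cue.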
⚙️ Fundamentals of Speaker Recognition
speech vs speaker recognition, speaker verification, speaker identification
Textual content(Text-Dependent, Text-Independent, Text-Prompted)
system workflow: training(speaker encoder, embedding), enrollment(aggregation), recognition
Thresholding, Similarity Score(Cosine Similarity, Euclidean Distance, Model-based similarity score), Score Triaging
Evaluation: Pair-based Evaluation, Set-based Evaluation
False Accept Rate(FAR), False Reject Rate(FRR), ROC curve(Area Under Curve(AUC)), DET curve, Equal Error Rate(EER), Minimum Detection Cost Function(minDCF)
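To make the scoring and evaluation terms above concrete (cosine similarity, thresholding, FAR, FRR, EER), the following is a small sketch using only the standard library. The function names and the toy score lists are hypothetical, and the EER here is a simple approximation that sweeps the observed scores as thresholds rather than interpolating the DET curve.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def far_frr(target_scores, impostor_scores, threshold):
    """A trial is accepted when its score is >= threshold.

    FAR: fraction of impostor trials accepted.
    FRR: fraction of target trials rejected.
    """
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    return far, frr


def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: pick the observed score where |FAR - FRR| is smallest.

    Returns (eer, threshold), with eer taken as the mean of FAR and FRR there.
    """
    best = None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        far, frr = far_frr(target_scores, impostor_scores, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Toy scores (made up for illustration only).
target_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer, eer_threshold = equal_error_rate(target_scores, impostor_scores)
```

Raising the threshold lowers FAR and raises FRR, so the EER is the operating point where the two error types balance.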
📻 Early Speaker Recognition Approaches
Gaussian Distribution, Multivariate Gaussian Distribution
Gaussian Mixture Model(GMM): sharing covariance matrix/simpler covariance matrix(diagonal covariance matrix), parameter estimation(Expectation-Maximization algorithm)
Universal Background Model(UBM), Bayesian Adaptation, GMM-UBM, supervector
Support Vector Machine(SVM): linear SVM, linear SVM with soft margin, non-linear SVM(kernel trick)
Factor Analysis: observed/unobserved, correlated/uncorrelated variables, common factor, unique factor, loading matrix
Joint Factor Analysis, JFA-SVM, i-vector(channel compensation)
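The GMM and GMM-UBM entries above can be illustrated with a minimal scoring sketch: the log-likelihood of a frame under a diagonal-covariance GMM, and an average log-likelihood ratio between a speaker model and a UBM. This assumes the model parameters are already estimated (no EM or Bayesian adaptation is shown), and the function names and the `(weights, means, variances)` tuple layout are hypothetical choices for this example.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a multivariate Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log sum_k w_k * N(x | mean_k, diag(var_k)), computed with log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio between the speaker GMM and the UBM."""
    total = sum(
        gmm_log_likelihood(f, *speaker_gmm) - gmm_log_likelihood(f, *ubm)
        for f in frames
    )
    return total / len(frames)


# Toy one-dimensional models: (weights, means, variances), values made up for illustration.
speaker_gmm = ([1.0], [[1.0]], [[1.0]])
ubm = ([1.0], [[-1.0]], [[1.0]])
score = llr_score([[1.0]], speaker_gmm, ubm)
```

A positive score means the frames are better explained by the speaker model than by the background model; a decision threshold is then applied to this score.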
🧠 Speaker Recognition with Deep Learning
Indirect use(Tandem deep features, DNN i-vector, j-vector), Direct use(neural network encoder, embeddings, loss function, optimizer)
Inference: Frame-independent, Fixed window, Full sequence(RNN, attention), Sliding window(RNN, attention)