Application of Audio Fingerprinting Techniques for Real-Time Scalable Speech Retrieval and Speech Clusterization
Abstract: Audio fingerprinting techniques have advanced considerably in recent years, enabling fast and accurate audio retrieval even when the query sample is heavily degraded or recorded in noisy conditions. As expected, most existing work centers on music, with popular music identification services such as Apple's Shazam and Google's Now Playing designed for recognizing individual audio samples on mobile devices. However, the spectral content of speech differs from that of music, so current audio fingerprinting approaches must be modified. This paper offers new insights into adapting existing techniques to the specialized challenge of speech retrieval on telecommunications and cloud communications platforms. The focus is on rapid and accurate audio retrieval in batch processing, typically on a centralized server, rather than on serving single requests. The paper also demonstrates how this approach can support clustering audio by its speech transcript without performing actual speech-to-text conversion. This enables significantly faster processing without GPU computing, which state-of-the-art speech-to-text tools typically require for real-time operation.