The Engineer Behind Samsung’s Speech Recognition Software


Every time you use your voice to generate a message on a Samsung Galaxy mobile phone or activate a Google Home device, you're using tools Chanwoo Kim helped develop. The former executive vice president of Samsung Research's Global AI Centers specializes in end-to-end speech recognition, end-to-end text-to-speech tools, and language modeling. "The most rewarding part of my career is helping to develop technologies that my friends and family members use and enjoy," Kim says.

He recently left Samsung to continue his work in the field at Korea University, in Seoul, leading the school's speech and language processing laboratory. A professor of artificial intelligence, he says he is passionate about teaching the next generation of tech leaders. "I'm excited to have my own lab at the school and to guide students in research," he says.

Bringing Google Home to market

When Amazon announced in 2014 that it was developing smart speakers with AI assistive technology, a gadget now known as Echo, Google decided to develop its own version. Kim saw a role for his expertise in the endeavor: he has a Ph.D. in language and information technology from Carnegie Mellon, where he specialized in robust speech recognition. Friends of his who were working on such projects at Google in Mountain View, Calif., encouraged him to apply for a software engineering job there. He left Microsoft in Seattle, where he had worked for three years as a software development engineer and speech scientist. After joining Google's acoustic modeling team in 2013, he worked to ensure that the company's AI assistive technology, used in Google Home products, could perform in the presence of background noise.
Chanwoo Kim
Employer: Korea University, in Seoul
Title: Director of the speech and language processing lab and professor of artificial intelligence
Member grade: Member
Alma maters: Seoul National University; Carnegie Mellon

He led an effort to improve Google Home's speech-recognition algorithms, including the use of acoustic modeling, which allows a device to interpret the relationship between speech and phonemes (the phonetic units of a language). "When people used the speech-recognition function on their mobile phones, they were only standing about 1 meter away from the device at most," he says. "For the speaker, my team and I had to make sure it understood the user when they were talking farther away."

Kim proposed using large-scale data augmentation that simulates far-field speech to enhance the device's speech-recognition capabilities. Data augmentation artificially generates additional training data from the data already collected, improving recognition accuracy. His contributions enabled the company to release its first Google Home product, a smart speaker, in 2016. "That was a really rewarding experience," he says.

That same year, Kim was promoted to senior software engineer and continued improving the large-scale data-augmentation algorithms used by Google Home. He also developed technologies to reduce the time and computing power consumed by the neural network and to improve multi-microphone beamforming for far-field speech recognition.

Kim, who grew up in South Korea, missed his family, and in 2018 he moved back, joining Samsung as vice president of its AI Center in Seoul. There he aimed to develop end-to-end speech-recognition and text-to-speech engines for the company's products, focusing on on-device processing.
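The article doesn't describe Google's pipeline in detail, but far-field simulation of the kind Kim describes is commonly done by convolving clean speech with a room impulse response (to add reverberation) and mixing in background noise at a chosen signal-to-noise ratio. The sketch below illustrates that general idea with toy signals; the function name and signals are illustrative, not Google's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_far_field(clean, rir, noise, snr_db):
    """Simulate a far-field recording: convolve clean speech with a room
    impulse response, then add background noise at a target SNR."""
    reverberant = np.convolve(clean, rir)[: len(clean)]
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so speech_power / scaled_noise_power == 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise[: len(reverberant)]

# Toy stand-ins for real speech, a measured/simulated room response, and noise.
clean = rng.standard_normal(16000)                      # 1 s of "speech" at 16 kHz
rir = np.exp(-np.linspace(0, 8, 800)) * rng.standard_normal(800)  # decaying "room"
noise = rng.standard_normal(16000)

augmented = simulate_far_field(clean, rir, noise, snr_db=10.0)
```

Each clean utterance can be reused many times with different rooms, distances, and noise levels, which is what makes the approach "large-scale."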
To help him reach his goals, he founded a speech processing lab and led a team of researchers developing neural networks to replace the conventional speech-recognition systems then used by Samsung's AI devices.

"The most rewarding part of my work is helping to develop technologies that my friends and family members use and enjoy."

Those systems included an acoustic model, a language model, a pronunciation model, a weighted finite-state transducer, and an inverse text normalizer. The language model captures the relationships among the words a user speaks, while the pronunciation model acts as a dictionary. The inverse text normalizer converts recognized spoken-form words into written expressions, turning "ten thirty," for example, into "10:30." Because those components were bulky, it was not possible to build an accurate, on-device speech-recognition system using conventional technology, Kim says. An end-to-end neural network could complete all the tasks and "greatly simplify speech-recognition systems," he says.

[Photo: Chanwoo Kim (top row, seventh from the right) with some of the members of his speech processing lab at Samsung Research.]

He and his team used a streaming attention-based approach to develop their model. The input sequence, the spoken words, is encoded, then decoded into a target sequence with the help of a context vector: a weighted summary of the encoded input computed by the attention mechanism. The model was commercialized in 2019 and is now part of Samsung's Galaxy phones. That same year, a cloud version of the system was commercialized; it is used by the phones' virtual assistant, Bixby. Kim's team continued to improve speech-recognition and text-to-speech systems in other products, commercializing a new engine every year.
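To make the context vector concrete: at each decoding step, an attention mechanism scores every encoded input frame against the decoder's current state, normalizes the scores into weights, and takes a weighted sum of the frames. The following is a minimal dot-product-attention sketch of that computation, not Samsung's streaming model; the array shapes are illustrative.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: shift by the max before exponentiating.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(encoder_states, decoder_state):
    """Score each encoded frame against the decoder state, normalize the
    scores, and return the weighted sum of frames (the context vector)."""
    scores = encoder_states @ decoder_state   # one score per input frame
    weights = softmax(scores)                 # weights sum to 1
    context = weights @ encoder_states        # weighted sum of frames
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((50, 64))  # 50 encoded frames, dim 64
decoder_state = rng.standard_normal(64)

context, weights = attention_context(encoder_states, decoder_state)
```

The decoder consumes the context vector at each step to predict the next output token, which is what lets a single network replace the separate acoustic, pronunciation, and language models.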
One such engine uses power-normalized cepstral coefficients (PNCC), features that improve the accuracy of speech recognition in environments with disturbances such as additive noise, changes in the signal, multiple speakers, and reverberation. The technique uses power statistics to estimate and suppress the effects of background noise. It is now used in a variety of Samsung products including air conditioners, cellphones, and robotic vacuum cleaners.

Samsung promoted Kim in 2021 to executive vice president of its six Global AI Centers, located in Cambridge, England; Montreal; New York; Seoul; Silicon Valley; and Toronto. In that role he oversaw research on incorporating artificial intelligence and machine learning into Samsung products. He is the youngest person to have been an executive vice president at the company. He also led the development of Samsung's generative large language models, which evolved into Samsung Gauss, a suite of generative AI models that can produce code, images, and text.

In March he left the company to join Korea University as a professor of artificial intelligence, a dream come true, he says. "When I first started my doctoral work, my dream was to pursue a career in academia," Kim says. "But after earning my Ph.D., I found myself drawn to the impact my research could have on real products, so I decided to go into industry." He says he was excited to join Korea University, as "it has a strong presence in artificial intelligence" and is one of the top universities in the country. Kim says his research will focus on generative speech models, multimodal processing, and integrating generative speech with language models.

Chasing his dream at Carnegie Mellon

Kim's father was an electrical engineer, and from a young age Kim wanted to follow in his footsteps, he says. He attended a science-focused high school in Seoul to get a head start on engineering topics and programming.
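One well-documented ingredient of PNCC (from Kim and Stern's published work) is replacing the logarithmic compression of conventional MFCC-style features with a power-law nonlinearity with an exponent of roughly 1/15, which behaves gracefully when a frequency band's power drops toward zero. This toy comparison shows only that one ingredient, not the full PNCC pipeline:

```python
import numpy as np

def log_compress(power, floor=1e-10):
    """MFCC-style logarithmic compression of filterbank power
    (with a floor to avoid log of zero)."""
    return np.log(np.maximum(power, floor))

def power_law_compress(power, exponent=1.0 / 15.0):
    """PNCC-style power-law compression: roughly logarithmic at normal
    levels, but bounded and well behaved as power approaches zero."""
    return power ** exponent

powers = np.array([0.0, 1e-6, 1e-3, 1.0, 1e3])
log_features = log_compress(powers)          # clamps to log(floor) at zero power
power_features = power_law_compress(powers)  # simply goes to 0 at zero power
```

Quiet or noise-masked bands therefore perturb the features far less under the power law than under the log, which is part of why the features hold up in noisy rooms.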
He earned his bachelor's and master's degrees in electrical engineering from Seoul National University in 1998 and 2001, respectively. Kim had long hoped to earn a doctoral degree from a U.S. university because he felt it would give him more opportunities. And that's exactly what he did: he left for Pittsburgh in 2005 to pursue a Ph.D. in language and information technology at Carnegie Mellon.

"I decided to major in speech recognition because I was interested in raising the standard of quality," he says. "I also liked that the field is multifaceted, and I could work on hardware or software and easily shift focus from real-time signal processing to image signal processing or another sector of the field."

Kim did his doctoral work under the guidance of IEEE Life Fellow Richard Stern, who is probably best known for his theoretical work on how the human brain compares the sounds arriving at each ear to judge where a sound is coming from. "At that time, I wanted to improve the accuracy of automatic speech recognition systems in noisy environments or when there were multiple speakers," Kim says. He developed several signal-processing algorithms that used mathematical representations of how humans process auditory information.

Kim earned his Ph.D. in 2010 and joined Microsoft in Seattle as a software development engineer and speech scientist. He worked there for three years before joining Google.

Access to trustworthy information

Kim joined IEEE as a doctoral student so he could present his research papers at IEEE conferences. In 2016 a paper he wrote with Stern was published in IEEE/ACM Transactions on Audio, Speech, and Language Processing; it won them the 2019 IEEE Signal Processing Society Best Paper Award. Kim says he felt honored to receive the "prestigious award."

Kim maintains his IEEE membership partly because, he says, IEEE is a trustworthy source of the latest technical information.
Another benefit of membership is IEEE's global network, Kim says. "By being a member, I have the opportunity to meet other engineers in my field," he says. He is a regular attendee of the annual IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). This year he is vice chair of the technical program committee for the meeting, which is scheduled for next month in Seoul.

