As internet services become available in every region of the world, we face the unprecedented task of overcoming barriers such as language, so as to provide cutting-edge content to each person on the planet. Artificial intelligence (AI) technologies such as speech recognition, which are key to this transformation, require diverse training corpora such as speech recordings, which are not widely available for many world languages. By building a simple and easy-to-use karaoke system, we can crowdsource a speech-to-text corpus that can be used to train AI models, e.g., for speech recognition.
There are around 6,500 languages in the world. Unfortunately, much of today's artificial intelligence technology exists only for major languages such as English, Spanish, and Mandarin (Chinese). A major barrier to developing AI technology for the remaining languages is the lack of training data. Data collection is expensive, however, so we must invent ways to continuously collect and refine training data while minimizing the cognitive load on the people contributing it. To this end, we propose a Raspberry Pi karaoke system that displays song lyrics on the piTFT screen and records speech input from users listening to the karaoke track through an earphone connected to the Pi.

The piTFT screen, an LED panel, a microphone, and an earphone are the primary components of the karaoke system. After a user chooses a song on the GUI displayed on the piTFT, the music video plays on the piTFT and the real-time frequency spectrum of the playback is shown on the LED panel. The user listens to the music through the earphone and sings into the microphone. The singing is recorded, and after the song finishes it is scored based on its correlation and consistency with the playback. The voice recording would then be uploaded to the cloud along with other user and song metadata, which could be used to further train an automatic speech recognition (ASR) deep learning model.
Our goal is to create a diverse, high-quality voice-to-text training corpus for ASR by engaging users in a fun task, karaoke singing, in languages that have musical content with corresponding lyrics but immature ASR technology. This appeals to a broader audience, in contrast with current mechanisms such as Amazon Mechanical Turk that impose a large cognitive load on contributors.
Design and Testing
The LED panel we used is the medium 16x32 RGB LED matrix panel produced by Adafruit. The panel requires 12 digital pins (6 for color data and 6 for control) and a 5V power supply. The pins connect directly to the GPIO pins on the RPi. Although the panel nominally expects 5V logic and the RPi outputs 3.3V logic, the 3.3V levels are sufficient to drive it. A 5V 2A charger supplies power to the LED panel through a 2.1mm-to-screw-jack adapter. We used the rpi-rgb-led-matrix library by Henner Zeller on GitHub to control the display. The figure shows how the panel pins and GPIO pins are connected according to the library's wiring instructions.
A few changes were made to the samplebase.py script in the python/samples subdirectory: the matrix size was set to 16x32, the chain length to one panel, and the brightness to level 50. The LED GPIO mapping was changed to 'adafruit-hat', because we used the Adafruit HAT's default GPIO mapping, and --led-slowdown-gpio was set to 4 to slow down the GPIO data output.
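The settings above can be expressed through the library's Python bindings roughly as follows (a sketch, assuming the rpi-rgb-led-matrix `rgbmatrix` module is installed; actually constructing the matrix requires the panel hardware):

```python
# Sketch of our samplebase.py configuration via the rpi-rgb-led-matrix
# Python bindings (illustrative; requires the panel hardware to run).
from rgbmatrix import RGBMatrix, RGBMatrixOptions

options = RGBMatrixOptions()
options.rows = 16                          # 16x32 panel
options.cols = 32
options.chain_length = 1                   # one panel in the chain
options.brightness = 50                    # brightness level 50
options.hardware_mapping = 'adafruit-hat'  # Adafruit HAT default GPIO mapping
options.gpio_slowdown = 4                  # same effect as --led-slowdown-gpio 4

matrix = RGBMatrix(options=options)
```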
To display words on the LED panel, the graphics.DrawText function was used, which draws text at a specified position on the panel in a given color. The graphics.DrawLine function was used to draw graphics on the panel. We used these two functions to display the "Karaoke" title, a tone animation, and the scores. The following figures show the text and graphics display.
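As an illustration, drawing the title with these two functions looks roughly like this (a sketch using the rpi-rgb-led-matrix Python bindings; the font path and coordinates are our own illustrative choices, and the code only runs with the panel attached):

```python
from rgbmatrix import RGBMatrix, RGBMatrixOptions, graphics

# Minimal panel setup (full configuration as in our samplebase changes).
options = RGBMatrixOptions()
options.rows, options.cols = 16, 32
options.hardware_mapping = 'adafruit-hat'
matrix = RGBMatrix(options=options)
canvas = matrix.CreateFrameCanvas()

font = graphics.Font()
font.LoadFont('fonts/5x7.bdf')             # a BDF font shipped with the library
white = graphics.Color(255, 255, 255)

graphics.DrawText(canvas, font, 1, 8, white, 'Karaoke')  # text at x=1, baseline y=8
graphics.DrawLine(canvas, 0, 15, 31, 15, white)          # line across the bottom row
canvas = matrix.SwapOnVSync(canvas)                      # show the frame
```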
The frequency spectrum is displayed in real time while the music plays. We used the wave module to open the song's audio file and read 4096 frames of audio at a time, unpacking the raw bytes with struct.unpack. We then used the numpy.fft.rfft function to compute the discrete Fourier transform of each chunk and stored the output, ready to be displayed on the LED panel. To display the spectrum in real time, we played the audio with the pyaudio module; as each chunk of audio data was played, its spectrum was computed and drawn on the panel with the SetPixel function of the LED-matrix library. The figure shows the spectrum display.
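The per-chunk computation can be sketched as follows (the wav/pyaudio I/O and the SetPixel drawing are replaced by a synthetic tone and an array of bar heights; CHUNK matches the 4096-frame reads above, while the helper name and the band-averaging scheme are our own illustrative choices):

```python
import numpy as np

CHUNK = 4096          # frames read per iteration, as in our script
RATE = 44100          # playback sample rate
COLS, ROWS = 32, 16   # LED panel dimensions

def spectrum_columns(samples):
    """Map one audio chunk to a bar height (0..ROWS) for each LED column."""
    mags = np.abs(np.fft.rfft(samples))      # magnitude spectrum of the chunk
    bands = np.array_split(mags[1:], COLS)   # drop DC, group into 32 bands
    levels = np.array([band.mean() for band in bands])
    peak = levels.max()
    if peak > 0:
        levels = levels / peak               # normalize the loudest band to 1
    return (levels * ROWS).astype(int)       # scale to the panel height

# Example: a 1 kHz tone lights up a low-frequency column.
t = np.arange(CHUNK) / RATE
heights = spectrum_columns(np.sin(2 * np.pi * 1000 * t))
# A real loop would then call matrix.SetPixel(x, y, r, g, b) for each lit cell.
```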
The GUI on the piTFT is built with the Pygame module. The pygame.event.get function detects touch input, and helper functions add buttons and text to the screen. There are two levels: the Start level and the Song Choice level. The Start level contains a Start button at the center of the screen and an EXIT button at the bottom right; the Start button leads to the second level and the EXIT button terminates the program. The second level contains six buttons, each corresponding to one song, plus BACK and EXIT buttons. Pressing a song button starts the playback on the piTFT and the spectrum display of that song, and the BACK button returns the GUI to the first level. The figures show the first and second levels of the GUI.
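The two-level menu behavior can be sketched as a small state machine (pure Python, with the pygame drawing and event polling omitted; the button names mirror the GUI, but the coordinates and the hit-box size are illustrative):

```python
# Button centers per menu level on the 320x240 piTFT (positions illustrative).
LEVELS = {
    'start': {'Start': (160, 120), 'EXIT': (280, 220)},
    'songs': {'Morning': (60, 60), 'Despacito': (160, 60), 'Lao Shu': (260, 60),
              'Nayan': (60, 140), 'Twinkle': (160, 140), 'Man Udhan': (260, 140),
              'BACK': (40, 220), 'EXIT': (280, 220)},
}

def hit(pos, center, half=30):
    """True if a touch at pos lands inside a square hit box around center."""
    return abs(pos[0] - center[0]) <= half and abs(pos[1] - center[1]) <= half

def handle_touch(level, pos):
    """Return (new_level, chosen_song_or_None, quit_flag) for one touch."""
    for name, center in LEVELS[level].items():
        if hit(pos, center):
            if name == 'EXIT':
                return level, None, True     # terminate the program
            if name == 'Start':
                return 'songs', None, False  # go to the song choice level
            if name == 'BACK':
                return 'start', None, False  # back to the start level
            return level, name, False        # a song was chosen
    return level, None, False                # touch missed every button
```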
The labels in blue circles on the piTFT screen are the six songs we curated for our karaoke system. Three of the six songs are in high-resource languages, while the rest are in low-resource languages. Here is the list of songs:
"Morning": Morning Song in Cherokee
"Despacito": Despacito in Spanish
"Lao Shu": Lao Shu in Mandarin (Chinese)
"Nayan": Nayan Ne Bandh Rakhine in Gujarati
"Twinkle": Twinkle Twinkle Little Star in English
"Man Udhan": Man Udhan Varyache in Marathi
While the purpose of our device is to collect training data for speech recognition AI models, we score user recordings based on their consistency and correlation with the playback in order to incentivize good recording practices. We define consistency as the fraction of time the user makes some attempt to sing while the song is playing. In most songs, the lyrics correspond to the high-amplitude regions of the spectrum.
We therefore estimate the user's consistency by computing the fraction of time they were singing (i.e., their amplitude was above a threshold) and dividing it by the fraction of time the playback's amplitude was above a threshold. We scale this ratio to a maximum of 50 points by multiplying it by 50. We also calculate the cross-correlation between the playback (reference) and the user recording (source). Due to noise, this correlation cannot be expected to be perfect even for a good singer, so we cap it at 0.03: at or above that value the user receives the full 50 points, and below it the score is scaled down proportionally. Finally, the sum of the correlation score (out of 50) and the consistency score (also out of 50) is displayed on the LED panel.
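A simplified version of this scoring can be sketched as follows (the helper names and the 0.1 activity threshold are our own illustrative choices; only the 0.03 correlation cap and the two 50-point scales come from the method described above):

```python
import numpy as np

MAX_CORR = 0.03   # correlation at or above which the user earns full marks

def consistency_score(recording, playback, thresh=0.1):
    """Up to 50 points for singing whenever the playback is loud."""
    sing_frac = np.mean(np.abs(recording) > thresh)   # fraction of time singing
    play_frac = np.mean(np.abs(playback) > thresh)    # fraction of loud playback
    ratio = min(sing_frac / play_frac, 1.0) if play_frac > 0 else 0.0
    return 50.0 * ratio

def correlation_score(recording, playback):
    """Up to 50 points, scaling normalized cross-correlation against the cap."""
    r = recording - recording.mean()
    p = playback - playback.mean()
    corr = abs(np.dot(r, p)) / (np.linalg.norm(r) * np.linalg.norm(p))
    return 50.0 * min(corr / MAX_CORR, 1.0)

def total_score(recording, playback):
    return consistency_score(recording, playback) + correlation_score(recording, playback)

# Sanity check: a recording identical to the playback earns the full 100 points.
ref = np.sin(np.linspace(0, 60 * np.pi, 48000))
score = total_score(ref, ref)
```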
Combining the Sections
All the scripts are combined in a bash script. The "Karaoke" title and the GUI run first. After a song is chosen, the choice is written to a text file. The bash script reads the text file to find the chosen song, prepares the song's mp4 and wav files, and creates a recording file for the user. Meanwhile, the spectrum script reads the choice from the text file, plays the playback, and displays the spectrum in the background. The song's music video is played by mplayer. The length of the song is computed, and the singing is recorded only for that duration. After the song finishes, the score is calculated and displayed on the LED panel.
Each section ran well on its own, but problems arose when combining them. Both the music video and pyaudio play audio, and the built-in audio of the Raspberry Pi was disabled, as required by the LED-matrix library. To keep the two audio streams from conflicting, we used two USB audio adapters: one outputs the pyaudio playback and the other outputs the music video's audio.
Because the RPi has to control the LED panel and the piTFT at the same time, odd behavior appeared when we ran them together. At first, we put both the pygame GUI and the "Karaoke" title in the same script. This prevented pygame from initializing without being interrupted by the keyboard, and it affected mplayer as well: even after the program terminated, mplayer could not be run. We solved the problem by splitting the pygame GUI and the "Karaoke" title into two separate scripts.
We designed each program separately and tested it by running it multiple times and fixing any problems. After all the programs were designed and tested, we combined them in the same bash file and ran them together; the problems that then surfaced are described in the issues section above. We ran the bash file repeatedly until those problems were solved.
Everything performed as planned: the pygame GUI, LED panel, mplayer, recording, and scoring algorithm all ran smoothly. The "Karaoke" title and the pygame GUI are displayed after the program starts: the "Karaoke!" animation runs first, then the tone animation runs and loops back to the "Karaoke!" animation, while the Start button is displayed on the piTFT. Pressing the Start button leads to the six song buttons; the GUI returns to the first level via the BACK button, and the program quits via the EXIT button. Once a song is chosen, both the spectrum and the music video are displayed, and the microphone records the singing. A "Loading" message is shown on the LED panel after the song finishes, and the score is displayed once it is calculated. The program then automatically loops back to the "Karaoke" title. Our team met all the goals outlined in the description.
We achieved all the goals listed in the objectives, and every section ended up running smoothly. One major lesson we learned is that combining two different modules in the same script and running them together can interfere with the RPi: the initialization of both pygame and mplayer was affected. Another lesson is that the operating system forcibly kills a process if it consumes too much of the RPi's memory.
With more time, we would improve the program's GUI. We plan to add more GPIO push-buttons so that the music video and spectrum display can be paused and resumed: the music video via FIFO file control, and the spectrum display via a callback function. The singing is currently recorded to a wav file first; we would also display the spectrum of the singing if the audio samples could be acquired in real time.
Raspberry Pi 4 model B
320x240 2.8" TFT+Touchscreen for Raspberry Pi
Medium 16x32 RGB LED matrix panel
Sabrent Aluminum USB External Stereo Sound Adapter
Universal 5V 2A Power Supply with 8-Tip Switching Connector
Design LED display (“Karaoke” title, spectrum display and score display)
Design scoring algorithm
Design Pygame GUI
Preprocess videos into wav files for further processing
Write Combine.sh bash script
Set up arecord and mplayer for playing the karaoke video and recording the user's voice input
We highly appreciate Professor Joseph Skovira and the ECE 5725 TAs for their helpful guidance. They helped a great deal by providing information and examining results throughout the project. It was an interesting course, and we learned a lot about embedded system design.