ᎣᏏᏲ. ᏙᎯᏧ? Hey, welcome back!
This has been an incredibly long year. I have put aside all of my hopes and dreams so I could finish paramedic school… I said what I said. All that's left now is waiting to test; everything else is done. With that, I've returned to Serinda (V3) Graceland.
I had some time to think about what it was that I wanted to do. I mean, I've only dreamt of this for more than 35 years… maybe longer. What I really had to think through, though, was the implementation approach (a change in direction?).
The last time I ran tests, the Flask server was bogged down. That's not the server's fault; I was having it do too many things. The short list: the Flask server itself, the webRTC manager, STT, TTS, and more. I'm going to separate those tasks into individual components to make it a bit easier on the server.
Server – I think the server will only need to take input via REST requests, hand that off to be processed, and serve pages. So, basically, a page server. All of the logic for how OpenCV is managed via webRTC I'm going to attempt to offload.
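To make that concrete, here's a minimal sketch of what I mean by "page server" – serve a page, accept an intent over REST, and hand it off. The route names and payload fields here are just placeholders, not the final API.

```python
# serinda_server.py - minimal "page server": serve pages and accept intents via REST.
# Route names and payload fields are placeholders, not the final API.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/")
def index():
    # In the real thing this serves the UI; here it's just a stand-in page.
    return "<h1>Serinda V3</h1>"

@app.route("/api/intent", methods=["POST"])
def handle_intent():
    data = request.get_json(force=True)
    intent = data.get("intent", "unknown")
    # Hand the intent off to the plugin system and stay thin; the server itself
    # does no heavy lifting.
    return jsonify({"intent": intent, "speech": f"Got it: {intent}"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```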
SwanSong – I don't want to call this just 'audio,' but it's pretty much handling the STT and TTS. This is the second app, and it will run in a loop until it's done. The transcribed speech comes in and gets processed through the CommandProcessor to get back an intent. That intent is sent to the server, where it gets routed through the system and the handler for that intent is run.
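A rough sketch of that loop, assuming speech_recognition for STT and pyttsx3 for TTS (stand-ins for whatever engines I end up with), and a CommandProcessor whose get_intent() method name is a placeholder:

```python
# swansong.py - sketch of the SwanSong loop: listen, transcribe, get an intent,
# send it to the Flask server, then speak the reply. Library choices and the
# CommandProcessor interface are placeholders.
import requests
import speech_recognition as sr
import pyttsx3

from command_processor import CommandProcessor  # hypothetical module path

SERVER_URL = "http://localhost:5000/api/intent"

def main():
    recognizer = sr.Recognizer()
    tts = pyttsx3.init()
    processor = CommandProcessor()

    with sr.Microphone() as mic:
        while True:
            audio = recognizer.listen(mic)
            try:
                text = recognizer.recognize_google(audio)  # any STT backend works here
            except sr.UnknownValueError:
                continue  # nothing intelligible, keep listening

            intent = processor.get_intent(text)  # placeholder method name
            resp = requests.post(SERVER_URL, json={"intent": intent, "text": text})
            reply = resp.json().get("speech", "")

            if reply:
                tts.say(reply)
                tts.runAndWait()

if __name__ == "__main__":
    main()
```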
OpenCV – I like the ability to do this via Python. That means many, many people can utilize it and tinker at will. I would also like the ability to run the OpenCV processing in Rust, which means there will be a python folder and a rust folder – where the python code is translated to Rust.
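As a taste of the kind of processing that would live in the python folder (and later get a rust twin), here's a throwaway edge-overlay filter – not part of Serinda, just the shape of a filter module:

```python
# python/filters/edge_overlay.py - example of the kind of frame processing that
# lives in the python folder (and would be mirrored in the rust folder).
import cv2

def process(frame):
    """Take a BGR frame, return it with Canny edges painted on top."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    frame[edges > 0] = (0, 255, 0)  # draw detected edges in green
    return frame

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imshow("edge overlay", process(frame))
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```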
I have already built a pretty good system of loading plugins. They are automatically loaded and the creator need only drop a new package into the plugin path and then add the command to the list of commands – everything else is handled automagically.
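The loading could look something like this – this isn't my exact implementation, just the general shape of it, and the COMMAND/handle() convention is a placeholder:

```python
# plugin_loader.py - one way the "drop a package in and go" loading could look.
import importlib
import pkgutil

import plugins  # the plugin path is a regular package, e.g. plugins/weather/, plugins/translate/

def load_plugins():
    """Import every package under plugins/ and map its command name to its handler."""
    registry = {}
    for info in pkgutil.iter_modules(plugins.__path__):
        module = importlib.import_module(f"plugins.{info.name}")
        # Each plugin is expected to expose COMMAND and handle(); names are placeholders.
        registry[module.COMMAND] = module.handle
    return registry

if __name__ == "__main__":
    registry = load_plugins()
    print("loaded commands:", list(registry))
```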
For the steps, I’ve broken it down like this:
User speaks a command
The speech is transcribed in SwanSong
The transcription is run through the command processor to get the intent
The intent is sent to the Flask server via REST
The server runs whatever process is requested of it – via plugins, e.g. translate, weather, etc. (a rough dispatch sketch follows this list)
The server gets the result
The server sends the result back to SwanSong
If the server needs to display something it can
The user is then updated with the results from SwanSong – audio and visual as needed
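For the dispatch step, the server side might look like this, reusing the plugin registry sketched above; the result keys ("speech", "display") are placeholders for whatever goes back to SwanSong and the web page:

```python
# dispatch.py - sketch of the dispatch step: look the intent up in the plugin
# registry, run it, and return the result. Keys and handler contract are placeholders.
from plugin_loader import load_plugins

REGISTRY = load_plugins()

def run_intent(intent, text):
    handler = REGISTRY.get(intent)
    if handler is None:
        return {"speech": f"Sorry, I don't know how to {intent}."}
    result = handler(text)  # e.g. the weather or translate plugin does its work here
    # "speech" goes back to SwanSong for TTS; "display" would go to the web page.
    return {"speech": result.get("speech", ""), "display": result.get("display")}
```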
So here are my goals over the next couple of weeks:
1 – get a generic Flask server running
2 – get a generic SwanSong running
3 – process commands via SwanSong and interact with the Flask server
4 – test audio in, process, audio out
5 – set up webRTC with Flask
6 – shoehorn my visuals into webRTC as I want via plugins and OpenCV (rough sketch after this list)
7 – display what I need to on the web page as well
Many of these tasks have already been done. I'm pretty much separating them out, and hopefully that makes the overall app faster and keeps it from getting bogged down with too much.
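For goals 5 and 6, here's one way the webRTC side could be wired up, using aiortc to push OpenCV-processed frames; the plugin object with a process(frame) method is a placeholder for whatever plugin gets selected:

```python
# video_track.py - goals 5 and 6: push OpenCV-processed frames over webRTC.
# Uses aiortc; the plugin with a process(frame) method is a placeholder.
import cv2
from aiortc import VideoStreamTrack
from av import VideoFrame

class PluginVideoTrack(VideoStreamTrack):
    """Reads the camera, runs each frame through a plugin, and serves it over webRTC."""

    def __init__(self, plugin):
        super().__init__()
        self.plugin = plugin
        self.cap = cv2.VideoCapture(0)

    async def recv(self):
        pts, time_base = await self.next_timestamp()
        ok, frame = self.cap.read()
        if not ok:
            raise RuntimeError("camera read failed")
        frame = self.plugin.process(frame)       # e.g. the edge_overlay filter above
        new_frame = VideoFrame.from_ndarray(frame, format="bgr24")
        new_frame.pts = pts
        new_frame.time_base = time_base
        return new_frame

# The track gets added to an aiortc RTCPeerConnection during the offer/answer
# exchange that the Flask server brokers, e.g. pc.addTrack(PluginVideoTrack(plugin)).
```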
8 – This is the real start of the project. This is where I have it integrated with a headset. The headset isn't going to be anything awesome; we've already seen the ways I've done this in the past. What I'm going to do is set up a monocular view and a binocular view. I have wanted to be able to read a PDF while walking, and that's pretty much the goal for the first phase. I should be able to say “Serinda open PDF” and “Serinda scroll down one page” and have it match. Then I should also be able to say “Serinda make a note to get articles on xyz, reference page 6 of this PDF,” where ‘this PDF’ will be filled in, so later I can look through those notes and do what I need to.
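Here's a rough idea of how those phrases could be matched to intents, with ‘this PDF’ filled in from whatever document is currently open – the patterns and intent names are placeholders, not the real CommandProcessor:

```python
# Sketch of matching the phase-1 phrases to intents; regexes and intent names
# are placeholders, not the actual CommandProcessor.
import re

PATTERNS = [
    (re.compile(r"^serinda open (?P<target>.+)$", re.I), "open_document"),
    (re.compile(r"^serinda scroll (?P<direction>up|down) (?P<amount>one|\d+) pages?$", re.I), "scroll"),
    (re.compile(r"^serinda make a note to (?P<note>.+)$", re.I), "make_note"),
]

def match(text, context):
    for pattern, intent in PATTERNS:
        m = pattern.match(text.strip())
        if m:
            slots = m.groupdict()
            # "this pdf" gets resolved from whatever document is currently open.
            if "note" in slots and "this pdf" in slots["note"].lower():
                slots["document"] = context.get("open_document")
            return {"intent": intent, "slots": slots}
    return {"intent": "unknown", "slots": {}}

print(match("Serinda scroll down one page", {"open_document": "articles.pdf"}))
```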
That's it. That is my end-of-January goal. I have already sketched out this crappy monocular headset. I have the binocular headset as well and will test both… but I'm most interested in the monocular version to start.
I'm not entirely sure what will be in phase 2. However, I know that I want to include gestures. Because I'm keeping things generic as I start over on V3, I can work those in with an OAK-D or some other approach, like a single camera and hand gestures. I prefer the OAK-D or some other stereo camera with an IMU so I can place things in the real world and get back to them in the future. For example, the ability to get a geolocation, drop an object, find all of the objects I've dropped, pick them up, and return to them at any time – say, a virtual desktop that I leave in my house. Again, this may well be a clunky version to start. But I'm excited to get back into it.
I’ll do my best to share updates to this project as often as I can and I’ll post on TikTok (while we can), Instagram, Lemon8, and X.
Until next time. Dodadagohvi. ᏙᏓᏓᎪᎲᎢ.