SERINDA v2 Updates

Osiyo. Dohiju? Hey welcome back.

I have managed to add and change features to improve functionality. I don't know why I said it that way; it seems like adding features should always improve functionality. Maybe because I'm a developer I know that's not always true. idk.

Alright, so one item I have not been happy with is that the browser handles the speech recognition. I liked it initially because it was a way to get that working, and I'd been using the same method for a while, so it was a natural approach. However, I decided that what I really want is to handle as much "natively" as I can, meaning Python should actually handle speech recognition for me. TTS can be done by the browser, or I could make it work through Python; it just depends on what I want to do. As of this moment it's via the browser, though I have some code in my tests to do this via Python.
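
To give a feel for the Python side, here's a minimal sketch, assuming the SpeechRecognition package and a working microphone (not necessarily what ends up in SERINDA):

```python
# Minimal sketch: capture one utterance and turn it into text.
# Assumes the SpeechRecognition package (pip install SpeechRecognition).
import speech_recognition as sr

recognizer = sr.Recognizer()

def listen_once():
    """Listen for a single phrase on the default microphone and return text."""
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio)   # online recognizer for now
    except sr.UnknownValueError:
        return ""                                   # couldn't make out the speech
    except sr.RequestError:
        return ""                                   # recognizer service unreachable

if __name__ == "__main__":
    print(listen_once())
```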

A lot of my move towards wanting this to happen in Python is that the browser was irritating me: the web speech kit would stop working, and during that time commands weren't working. I wanted to reduce that so the computer handles as much of the processing as possible. One element I had to overcome was having Python detect speech, process the commands, and then send that information to the page via a push so the front end can process what it's given. I answered a speech detection question on Stack Overflow regarding my approach in Serinda. You can read that here.
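
To show the push half of that pipeline, here's a rough sketch. I'm using Flask-SocketIO purely for illustration; any server-push mechanism (websockets, SSE, long polling) fills the same role:

```python
# Rough sketch: a background task keeps listening, and every recognized
# phrase gets pushed to the browser so the front end can react.
import speech_recognition as sr
from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app)
recognizer = sr.Recognizer()

def listen_once():
    # Same idea as the recognition sketch above: one utterance -> text.
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio)
    except (sr.UnknownValueError, sr.RequestError):
        return ""

def speech_loop():
    while True:
        phrase = listen_once()
        if phrase:
            # Push the raw phrase to every connected page; the front end
            # decides what to do with it (display it, run a command, etc.).
            socketio.emit("speech", {"text": phrase})

if __name__ == "__main__":
    socketio.start_background_task(speech_loop)
    socketio.run(app, host="0.0.0.0", port=5000)
```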

Originally I wanted to use pocketsphinx for speech recognition. I still could; however, for the time being, so I can move forward with development, I've decided to use Google speech. I'd rather keep it offline if possible since the goal is to be able to use this whole application without internet.
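
One reason I'm not too worried about switching later: if the SpeechRecognition package from the sketch above is in play, going offline is roughly a one-line swap (assuming the pocketsphinx package is installed):

```python
# Illustrative only: swap the online recognizer for offline PocketSphinx.
import speech_recognition as sr

def transcribe(recognizer: sr.Recognizer, audio: sr.AudioData, offline: bool = False) -> str:
    if offline:
        return recognizer.recognize_sphinx(audio)   # PocketSphinx, no internet needed
    return recognizer.recognize_google(audio)       # Google Web Speech, needs internet
```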

The next element I've added is a SQLite database. I want to be able to move the GUI dynamically, and the initial layout can be manipulated by loading the GUI values saved in the database. For example, the user can say "move video 2 left 50 pixels," that gets translated to something jQuery will understand, and then it's saved in the database. When the webpage loads it'll call for all of the GUI update values and move the UI on the page how it needs to. This also means I can say "hide video 3" and it will toggle show/hide on that element. I'd like to be able to add additional display features on the fly: using some grammar rules I can create a new display feature and wire it to Flask, run pip or apt-get installs to support functions I need, and so on. It's one of the features I've wanted to include since I watched Star Trek: Voyager and B'Elanna Torres said "computer, create subroutine" and then went on to define it. My intent is to continue building my work so it can be used in this way. Eventually, I'd like to add some sort of NN intention mapping to be able to find code examples. For now, I'd be happy with "computer create new class" creating a Python file with a stub, so I can continue to add features to the page and application that will be reflected when the application is reloaded.
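
Back on the database piece, here's a minimal sketch of what the layout table could look like. The schema and function names are made up for illustration: one row per GUI element, the spoken command becomes an UPDATE, and the page asks for every row when it loads:

```python
# Illustrative layout table: one row per GUI element with its offsets
# and visibility. "move video 2 left 50 pixels" ends up in move_left().
import sqlite3

conn = sqlite3.connect("serinda.db")
conn.execute("""CREATE TABLE IF NOT EXISTS gui_layout (
    element  TEXT PRIMARY KEY,   -- e.g. 'video2'
    left_px  INTEGER DEFAULT 0,
    top_px   INTEGER DEFAULT 0,
    visible  INTEGER DEFAULT 1
)""")

def move_left(element, pixels):
    conn.execute("INSERT OR IGNORE INTO gui_layout (element) VALUES (?)", (element,))
    conn.execute("UPDATE gui_layout SET left_px = left_px - ? WHERE element = ?",
                 (pixels, element))
    conn.commit()

def layout_for_page():
    # Called when the page loads so jQuery can reposition everything.
    rows = conn.execute("SELECT element, left_px, top_px, visible FROM gui_layout")
    return [dict(zip(("element", "left", "top", "visible"), row)) for row in rows]
```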

I added Google Translate to the speech API, so I can say "computer translate to Spanish," it'll prompt me, and then I can say whatever I want translated and it'll return that translation. I can also do this back into English. If you read the Stack Overflow answer above you'll see what I've done. I'm very happy with it at the moment; I may do some rewrites later, but it works very well. I'd like to be able to do translations offline, and that may be harder. Again, for now, I'm going to stick with Google and work on the offline stuff later.
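
For a feel of the translate flow, here's a small sketch. The googletrans package and the tiny language map are assumptions made purely for illustration; the actual client library and prompt handling could easily be different:

```python
# Illustrative sketch of the "computer translate to X" flow.
# Assumes the googletrans package; the language map is just an example.
from googletrans import Translator

translator = Translator()
LANGUAGES = {"spanish": "es", "english": "en"}

def translate_phrase(phrase: str, target_language: str) -> str:
    """Translate the spoken phrase into the requested target language."""
    dest = LANGUAGES.get(target_language.lower(), "en")
    return translator.translate(phrase, dest=dest).text
```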

I now have 4 cameras that display on the page. I'm working on a way to detect cameras and then add them dynamically. I'm not too concerned with that at this point because I've got many other features to add, so manually commenting out 3 cameras for now isn't a big deal. With the addition of the database I can add configuration parameters as well.
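
The detection itself doesn't have to be fancy. A hedged sketch of the idea with OpenCV, just probing the first few device indices and keeping the ones that return a frame:

```python
# Probe OpenCV device indices and keep the ones that actually produce a frame.
import cv2

def detect_cameras(max_index: int = 8) -> list:
    found = []
    for index in range(max_index):
        cap = cv2.VideoCapture(index)
        ok, _frame = cap.read()    # a working camera returns (True, frame)
        cap.release()
        if ok:
            found.append(index)
    return found
```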

I also wrote a Python script that generates the intention JSON, starts the server, and opens the webpage in a new tab in Chrome. I like this approach because it's not limited to Windows or Linux. I run it on my Raspberry Pi, WSL, Debian on Windows, Debian, and Windows, and it works on all of them.
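
Something in the spirit of that launcher is sketched below. The serinda module and function names are hypothetical placeholders, not the real layout; the point is that the standard library's threading and webbrowser modules keep it portable (this sketch opens the default browser rather than forcing Chrome):

```python
# Hypothetical launcher sketch: rebuild the intent JSON, start Flask,
# and pop the UI open in a browser tab. Module names are placeholders.
import threading
import webbrowser

from serinda.intents import generate_intention_json   # hypothetical helper
from serinda.server import app                         # hypothetical Flask app

def main():
    generate_intention_json()
    # Give the server a moment to come up, then open the UI in a new tab.
    threading.Timer(2.0, webbrowser.open_new_tab, args=("http://localhost:5000",)).start()
    app.run(host="0.0.0.0", port=5000)

if __name__ == "__main__":
    main()
```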

So, what's left until I release the code for people to be able to use? I need to integrate Keras, TensorFlow, and PyTorch. I'll integrate them under a NeuralNetwork class and the user can choose which they want to employ, the idea being that a user can use pretrained models from Darknet, Kaggle, or wherever else pretrained models live. I need to fix loading a PDF and scrolling a PDF in the webpage. More than one camera on the Raspberry Pi crashes; I'm told this is a bug in Buster, so I'll be watching for it to be fixed. I can detect a face, but it will not recognize anyone yet. I need to be able to recognize objects and gestures. I also need to be able to take real-time photos and video and train the NN to recognize audio and people. I'd like to get OpenPose, skeleton pose, body language, micro expressions, human poses, eye tracking, and human activity recognition into the first release, but if I can get most everything else done I won't worry about those until the second iteration. I also need to make sure I put together a requirements document so people can install what they need. I can process an image via TesseractJS, and I want to make Tesseract a real-time streaming service. And I absolutely have to update documentation in the Readme and wiki page and probably add more code comments.
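
For the NeuralNetwork class, the shape I have in mind is roughly this. Everything here, from the method names to the loading details, is an illustrative guess rather than settled code:

```python
class NeuralNetwork:
    """Thin wrapper so the rest of the app doesn't care which framework
    loaded the pretrained model. Method names are illustrative guesses."""

    def __init__(self, backend="keras"):
        self.backend = backend
        self.model = None

    def load(self, model_path):
        if self.backend in ("keras", "tensorflow"):
            from tensorflow import keras
            self.model = keras.models.load_model(model_path)
        elif self.backend == "pytorch":
            import torch
            self.model = torch.load(model_path)   # assumes a fully serialized model
            self.model.eval()
        else:
            raise ValueError(f"unsupported backend: {self.backend}")
        return self.model

    def predict(self, inputs):
        if self.backend in ("keras", "tensorflow"):
            return self.model.predict(inputs)
        import torch
        with torch.no_grad():
            return self.model(torch.as_tensor(inputs))
```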

I'd like to add OpenVX to support OpenCV. I'd like to add AR/MR/XR to the display, but that's outside of what I can do now; I'll start by going back to using a screen reflected on a see-through medium so I can view the display. I have a couple of utility functions I'd like to explore so I can leverage existing code, such as a real-time sudoku solver, and I wrote some JavaScript code to take a picture of a chess board and provide chess engine recommendations; I'd like to make those more augmented reality. Because I've included some Groovy and Java, just in case I need to do something with them, I need to be able to compile them, so I'll add a compiler (also not necessary for the first release). The last item I'd really like to get in is Kivy so I can possibly use this as a mobile app on Android or iOS or both. I don't know how well it'll work, but if Kivy doesn't work then it'll never be a mobile app: I've already made more progress in the last month and a half using mostly Python than I did in the last 10+ years using Java, Node, and other languages, so I'm not rewriting this to work as a mobile app. If I can, I'd like to add NN translation. This isn't an immediate need, but it'd be nice to be able to train languages that don't have machine translation yet and then pump out a pretrained model, no matter how rudimentary it is.

I think that’s it for this update. My goal is to have the official V2 release by 1 Nov. Anything not done by that time will get pushed to the V2 dot release.

Until next time. Dodadagohvi.
