SERINDA – Why the framework choices?

The “competition”
Here’s a pretty good article on what’s available in the “good” section. I’ll hit some highs and lows of each, and a couple of others.

Vuzix Blade – beautiful. I’m not going into a full review. There’s a projector on one eye and you’ve got to link to it via their OS. These are the closest I’ve come to purchasing something similar to what I’ve wanted to do. They have videos on the product page. There were three issues that turned me off of this product: 1) the one-eye display; 2) its own OS – I didn’t see where I could screencast my own work, and if that were an option it’d be closer to a purchase; 3) the price. It’s a fair price; however, for that money I could do what I’m doing and purchase hardware to test for those who can’t, and experiment on better ways for everyone, not just commercial solutions. Again, great product.

Moverio BT-300 – I have heard and read rave reviews about this product from FPV drone fliers, and I got excited about these glasses. However, I haven’t found a single review of the non-drone version that I’ve liked. Again, if I could stream my own content I’d be happy, because I could use my own OS. But it’s got a little box for moving through the content of the connected Android system, and I don’t want that. I will keep looking at these, because if I can figure out how to do more with them, the reviews are incredible.

Sony Smart Glasses – Not full color. These aren’t for me.

Osterhout – beautiful. I didn’t look further than the price. I don’t care what they can do for $2750 – I don’t have it.

Magic Leap – The demos are amazing. I signed up, since they trick you into doing so in order to purchase. Over $2k, so it’s out.

Dream Glass – This is where I want to go. They’ve got the start. But it doesn’t work on anything but Windows. What?! Why?! If they change this, then this might be a real contender too. Or if I have time to figure out hacking it. See the hardware section later on, because this is a display and not really a full system.

Node.Js
I started with Java and wrote some DLLs in J++. I used C/C++ and interfaced with Java. I used Visual Basic. Python has been the best up to this point. I had two issues with Python, though. 1) Windows use was spotty. That’s a lot better with WSL, but I still wanted something better for development, and if someone runs this on their machine I don’t want to answer a bunch of questions (I’m lazy like that – I just want it to work). 2) I don’t know Python as well, so there were times I’d be doing something with the language and have issues, only to find out that the return value was three elements and that’s why there’s an underscore (_), or that I hadn’t indented code correctly. I’m not bashing the language – the syntax is easy – I’ve just been a front-end and Java/Groovy developer for so long that I’m much more versed in HTML, CSS, and JavaScript.

I switched to a MEAN-ish stack (but I’m using MySQL, not MongoDB) for one of my sites, and it dawned on me that I could do everything in Node.js if there was enough support for OpenCV. If there wasn’t, I could still work around it by calling Python to do what I wanted and manipulating things with Node.js the way I wanted to. It wouldn’t be ideal, but I could do it.
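To make that fallback concrete, here is a rough sketch of the idea: shell out to a Python script from Node.js and keep working with its output on the Node side. The script name and arguments are hypothetical, not part of the actual project.

```javascript
// Sketch: call a Python script from Node.js and use its output.
const { execFile } = require('child_process');

function runPythonStep(scriptPath, args) {
  return new Promise((resolve, reject) => {
    execFile('python3', [scriptPath, ...args], (err, stdout, stderr) => {
      if (err) return reject(new Error(stderr || err.message));
      resolve(stdout.trim());
    });
  });
}

// Example: let Python do the heavy OpenCV work, then keep manipulating
// the result (JSON, text, a file path) in Node.js.
runPythonStep('scripts/detect_objects.py', ['snapshot.jpg'])
  .then(result => console.log('Python returned:', result))
  .catch(err => console.error(err));
```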

To my surprise, there were a lot more mature bindings for Node.js that did what I wanted than I anticipated.

As of this moment, using Express, Node.js, and jQuery I’m able to do nearly everything that I want to do. My base example for this project was a bastardized Express project that I found that was doing what I was doing, but better and complete. I’ve not been sorry about the switch to Node.js from Python. My processor and my RAM have both been grateful for the switch from other implementations I’ve run in the past that pushed them to the limits. Let’s not forget it’s been 17 years since I started the first version, so hardware has come a long way too, but in the last two years the other versions in Java and Python would run the system to its brink. Node.js and Express simply don’t do that, and I can manipulate the DOM how I want to.
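For context, the base architecture is nothing more exotic than an Express server feeding a browser front end. A minimal sketch of that kind of skeleton is below; the folder names and port are placeholders, not the actual project layout.

```javascript
// Minimal Express skeleton: serve the browser UI and expose a JSON endpoint.
const express = require('express');
const path = require('path');

const app = express();

// Serve the front-end (HTML/CSS/JS, jQuery) from a static folder.
app.use(express.static(path.join(__dirname, 'public')));

// A simple JSON endpoint the browser-side code could call.
app.get('/api/status', (req, res) => {
  res.json({ ok: true, uptime: process.uptime() });
});

app.listen(3000, () => console.log('SERINDA UI on http://localhost:3000'));
```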

Speech-to-text (STT)
I read articles like How to Build Your Own Artificial Intelligence Assistant and followed inspiring projects like ILA (and its successor SEPIA), Jasper (I used a version that Matt Curry was working on for home automation), and a bunch of others that are all really good. They all like to use online voice APIs. I don’t disagree with those choices; there are some fantastic APIs out there. I just don’t want an online voice API to start. The configuration will eventually include the user’s choice. For more rapid development and initial gains I’ve chosen to go with Chrome and webkitSpeechRecognition (when I refer to ‘webkit’, this is what I mean).

Here is a fantastic introduction to the HTML5 Speech Recognition API.
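The appeal is how little code it takes. Here is a minimal browser-side sketch (Chrome only); grammar handling and command parsing are left out, and the “hand off to a command handler” step is just a comment.

```javascript
// Minimal webkitSpeechRecognition sketch: log every final transcript.
const recognition = new webkitSpeechRecognition();
recognition.continuous = true;      // keep listening instead of stopping after one phrase
recognition.interimResults = false; // only deliver final results

recognition.onresult = (event) => {
  const last = event.results[event.results.length - 1];
  const transcript = last[0].transcript.trim();
  console.log('Heard:', transcript);
  // hand the transcript off to whatever command handling exists
};

recognition.onerror = (event) => console.warn('STT error:', event.error);

recognition.start();
```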

I’ve used pocketsphinx and CMU Sphinx for many years and love them both, but I wanted to get the other parts of the project working first. Neither is hard to set up, but one minute with webkitSpeechRecognition and I’m done. Later, when I move to grammar files and more, I’ll need to look elsewhere, and I’ll have to talk about the Natural Language Processing (NLP) portions of this project. Again, for now, we’re talking about rapid development.

Text-to-speech (TTS)
I am using the browser speechSynthesis for the moment. Again, it’s simply easier for rapid development gains. I’ll provide some links to some of the code I looked at to wire this in. Basically, I took some code from Stack Overflow (which I’ll also link to) for a JavaScript version of Java’s String.format, and I use that to ‘speak’ to the user. In the future, speech synthesis could be through the browser or it could be configured natively through Festival, Flite, eSpeak, etc. I’ve used many of them and I’m partial to Flite because it was super easy to work with. Others have raved about eSpeak.

Here is a demo for speechSynthesis.
Here is a link to the SpeechSynthesis API.
Here is a link to the overall Web Speech API.
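Putting those two pieces together, here is a rough sketch of the TTS side: a small String.format-style helper (in the spirit of the Stack Overflow snippet mentioned above, not a copy of it) plus speechSynthesis to say the result. The example phrase is made up.

```javascript
// Replace {0}, {1}, ... in a template with the corresponding argument.
function format(template, ...args) {
  return template.replace(/{(\d+)}/g, (match, index) =>
    typeof args[index] !== 'undefined' ? args[index] : match
  );
}

// Speak a string through the browser's built-in speech synthesis.
function speak(text) {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = 1.0; // tweak rate/pitch/voice as desired
  window.speechSynthesis.speak(utterance);
}

speak(format('I found {0} faces in the last {1} frames.', 2, 30));
```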

Vision
I’ve been using OpenCV (the Open Source Computer Vision library) for a few years now and I didn’t see a need to change that. I’ve been perusing PyImageSearch, Learn OpenCV, and other sites for years. I know how to use it; no need to change. Since the base architecture is now Node.js and Express, I’m using opencv4nodejs. Vincent Mühler has done a beautiful job with the module.
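As a taste of what that looks like, here is a minimal face-detection sketch with opencv4nodejs, assuming the module built correctly on your machine; the image paths are placeholders.

```javascript
// Minimal face detection with opencv4nodejs using a bundled Haar cascade.
const cv = require('opencv4nodejs');

const image = cv.imread('snapshot.jpg');
const gray = image.bgrToGray();

const classifier = new cv.CascadeClassifier(cv.HAAR_FRONTALFACE_ALT2);
const { objects } = classifier.detectMultiScale(gray);

// Draw a rectangle around each detected face and save the result.
objects.forEach(rect => {
  image.drawRectangle(rect, new cv.Vec3(0, 255, 0), 2);
});
cv.imwrite('snapshot-faces.jpg', image);
console.log(`Detected ${objects.length} face(s)`);
```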

Vision is a broad topic with so many subtopics integral to this project.  Vision covers object detection, object recognition, facial detection, facial recognition, hand gestures, object tracking, active gaze, and more.  OpenCV does a lot of the hard work in manipulating the image or video itself. The rest is up to how we utilize it.

For example, all of these may initially be processed well by OpenCV, but does OpenFace or FaceNet do the job better? Or face-api.js? Is there a hand gesture project out there that is magnificent but only in C++ or Python, which would need a node module written? In addition, we’ll need neural networks to process this data. So there will be experimentation with different frameworks on the JavaScript side, as well as identifying the best tools to use. Do those come from OpenCV, or do we need to enhance OpenCV with something else?

Neural Networks
So much going on here. Keras, TensorFlow, SyntaxNet – what does it all mean? Neural networks and deep learning are integral to this project as well. Without neural networks we couldn’t have any of the OpenCV enhancements like object recognition – we’d just have some basics – and we wouldn’t have NLP.

There are many choices, but the two that stand out over and over again are Keras and TensorFlow for deep learning. Keras runs on top of TensorFlow. Here is a great intro article about both.

Neural networks are also part of the Natural Language Processing that we’re going to want to use. Here is a pretty good intro that goes further in depth as the article goes on. The basic idea is that when you speak, the computer needs to take that and figure out what you mean in terms of commands. There are several libraries available that lead the way: Stanford’s CoreNLP, Apache’s OpenNLP, NLTK, TextBlob, and of course SyntaxNet (and here), among others. It really depends on the programming language and your needs. I used TextBlob to create a machine translation for a conlang I created. Here is an article on tools for NLP, and another on open source NLP; both are pretty good introductions.

Tesseract
Tesseract is an Optical Character Recognition (OCR) library. It’s awesome. I have never looked up any other library, actually. Take an image: Tesseract scans it and pulls out the text. It does this quite well in Python. There are two JavaScript libraries out there. One is a Node.js wrapper for Tesseract, node-tesseract-ocr, and it seems to do OK; I haven’t used it extensively since I just wired it up the other day. The second project, Tesseract.js, is a pure JavaScript implementation, and it is phenomenal. Tesseract.js does not support as many languages as Tesseract OCR does, yet. I’ve used it on some clear images and it works fantastically. I have an image that [Python] Tesseract will process that neither node-tesseract-ocr nor Tesseract.js processes well at all (though the two produced nearly identical output). I may have to do some OpenCV processing to clean up the background a bit, or simply take the image, save it to the hard drive, then let Python run it and return the processed text… it’s not ideal, but again, this is my project, I can do what I want. I’m not writing some external library for someone else to use.
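For reference, here is a quick Tesseract.js sketch; the exact call signature varies a bit between versions, and this follows the promise-based API with an English model. The image path is a placeholder.

```javascript
// Quick OCR sketch with Tesseract.js: recognize text in an image.
const Tesseract = require('tesseract.js');

Tesseract.recognize('sign.jpg', 'eng')
  .then(({ data: { text } }) => {
    console.log('Recognized text:', text);
    // feed the text to TTS, a command parser, translation, etc.
  })
  .catch(err => console.error('OCR failed:', err));
```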

Database
I haven’t fully decided yet; however, I’m leaning towards SQLite for the database. Since this is supposed to run on a smaller computer (despite what you’ve read, the intent is a single-board computer, or SBC), I am trying to minimize the number of services eating up the processor. Reading a flat-file DB seems nicer for that, but not as nice as, say, MySQL or some other database. I’m not entirely sure yet. Since I’ve not implemented a database yet, I haven’t explored this more in depth. I’ve used SQLite and MySQL on many projects, so neither is out of my depth. I’ve never used MongoDB, so I just don’t know about it.

The database will mostly house state, unless I just want to store state as a flat file – where state means the GUI, config, maybe even the command objects themselves, the speak format for a user, etc. Again, since I’m more interested in getting all of my other code ported and my examples working, I’m neglecting this part for now. I can always write some sort of pattern to deal with interfacing with the end result and plug in whatever that result is – or configure either whenever I feel like it.
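If SQLite wins, storing that kind of state could be as simple as a key/value table. Here is a small sketch using the sqlite3 module; the file, table, and key names are made up for illustration.

```javascript
// Sketch: persist simple state (config, speak formats, etc.) in SQLite.
const sqlite3 = require('sqlite3').verbose();
const db = new sqlite3.Database('serinda-state.db');

db.serialize(() => {
  // One key/value table covers GUI state, config, speak formats, and so on.
  db.run('CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT)');

  const save = db.prepare('INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)');
  save.run('speakFormat', 'I found {0} results for {1}.');
  save.finalize();

  db.get('SELECT value FROM state WHERE key = ?', ['speakFormat'], (err, row) => {
    if (err) return console.error(err);
    console.log('speakFormat:', row && row.value);
  });
});

db.close();
```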

Hardware and OS
This is a rough one for me. I have worked with a bunch of different components. I worked through the frustration of Windows 2000, XP, 7, 10, Mac flavors, Debian, Raspbian, and more. All of the hardware issues up to this point have been in the form of “I wanted it to be cheap enough anyone could do it with what they have.” My intent is that someone doesn’t have to purchase some gaming system to utilize this project. I *want* it to be used on a Raspberry Pi. That seems to be a limitation. I’ve spent countless hours working to get integration on any of these systems, and it’s literally hit or miss depending on the day of the week. Right now, I have this running on Parallels Debian 9 (was 8) on a 2011 iMac and on Windows 10 WSL Ubuntu, and I’ve stopped there. I was installing on the Raspberry Pi, but I have a 3 B and it’s slow. I’m looking at a LattePanda to see how well it does. The issue is that this kind of runs against my “what you have” thought, but really I’m building a system for wear or computer use that anyone can build. They may just have to spend some money on a faster computer, or carry a laptop around running Win 10 WSL Ubuntu? I don’t know. Again, some of this I’m skipping for the time being, because I’ve been so focused on keeping things on an SBC that I lost focus of the overall project. I even tried Docker and was having success, then that turned into failure too – especially on Windows 10 Home with docker-toolbox.

I tested the Vufine eyepiece.  Fantastic piece of hardware.  I could not use it.  My son loved it.  The eyepiece blocked too much of my vision to do what I want.  Again, not knocking the hardware – it was great. I couldn’t use it. It’s built for the right eye only. I have to use my left eye.  I flipped the computer image and was able to kind of use it then. But it isn’t what will work for this project.  I really need that AR projection.

I have, on order, the AR Headset by Xhope, a face piece that you put a phone into that projects the image onto a screen. I’m going to take the display from the HDMI out and project that and see what happens.  Here’s an unboxing video of the AR Headset.

There’s also Project Northstar (on GitHub, and a writeup on Hackaday). This shows a lot of promise. I do not like the headset; however, I’m wondering how well I could adapt the viewscreens to the AR headset above. So this combination is something I’m seriously looking at.

Here are two of the better attempts at a Google Glass implementation on the cheap. The first one is pretty ingenious, though it might need some modification if you went this route. The second one is a lot more involved and would need modification. The majority of the modification comes from the fact that if, instead of using a helmet or faceplate, you used glasses or a viewfinder-like device, then an OLED would not work. I tried this here, and without a tiny projector like what NUVIZ used there’s just no way to make it work better. I think this could work well if you were able to mount the projector and do the same as you would for either of these projects. The issue then is going to be readability and whether it will really work.

As a note, before I continue, I’m also looking at dev boards from various companies. Here are the links to those that I’ve looked at so far. If they’re cheap enough, with USB dev boards or some easy way to use the projectors, then I’m interested. I don’t have time during prototyping to spend writing a lot of embedded code; I want to test stuff. In any case, the list: Himax, Syndiant, OmniVision, VRFocus (OmniVision article), Holo-eye.

Where does all of that get the SERINDA project?
It seems like a lot is going on, and yet I’ve hardly listed anything. Not including the hardware (save a basic computer with camera, mic, and speakers), the features available for a base list are: facial detection, facial recognition, object detection, object recognition, object tracking, STT, TTS, reading text aloud, processing images, processing video, recording video, taking photos, saving video, saving photos, enhancing photos, enhancing video, converting video, converting photos, gaze tracking, gestures, and predictive “targeting” (where is an object going to end up).

Granted, the conversion and processing of photos and video are limited to OpenCV unless I look for other processing features for Node.js, but I can do a lot with just what’s above. This is also not including internet access.

When we add internet access we also get, well, the full ability of the internet: searches, map navigation, translation, maybe even location, sending email, receiving email.

When we add hardware like a LattePanda or Raspberry Pi, we also get location, acceleration, axes, viewing texts, replying to texts, answering calls, making calls, and more.

Summary
I think with just a few tools this is quite a big setup. I’m still working on transcribing all of my handwritten notes to the project wiki. I’m not looking to compete with some commercial product.  I want to finish a dream I had a long time ago and do something that maybe others would use or maybe inspire others.  Who knows?  Maybe someone will look at my work and say “That’s crap. I can do better” and I’ll be a blog note – that’s good enough for me.

This is just a summary post about the reasons I chose frameworks or solutions the way I did. I couldn’t recount the many thousands of hours of research leading up to some of the previous decisions. I can tell you that I have many bookmarks of project-specific items that I’ll go through more in-depth as I move forward. For example, when I work on gestures I’ll go through all of the research I’ve ever done on gestures, the relevant links I have, my expectations, experiments, etc.
