I suspect anyone who has tried some of the newer AI-based image descriptions, such as those from Be My Eyes, has noticed how high the quality is. I’ve been curious about how I could apply that to videos, so I did a little experimentation.
I want to emphasize that I do not consider this a replacement for audio description. There is so much more to that experience than just giving details on what’s in an image.
The first step for my experiment was getting individual images from the video. An article on doing this with a tool called ffmpeg was very helpful, and getting the images is a snap with this tool. You can get an image for every frame in the video, at specific time intervals, or at one specific time, to name just a few of the choices you have.
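For anyone who wants to try the same thing, here is a minimal sketch of those three choices as a small Python wrapper around ffmpeg. It assumes ffmpeg is installed and on your PATH. The function names, the output file names, and the `frames` folder are my own placeholders, not anything from the article I followed.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video, out_dir, mode="interval", seconds=1.0):
    """Return the ffmpeg argument list for one of three extraction modes.

    mode="all"      -> one image for every frame in the video
    mode="interval" -> one image every `seconds` seconds
    mode="single"   -> one image taken `seconds` seconds into the video
    """
    out_dir = Path(out_dir)
    if mode == "all":
        # With no filter, ffmpeg writes every decoded frame as an image.
        return ["ffmpeg", "-i", str(video), str(out_dir / "frame_%06d.png")]
    if mode == "interval":
        # The fps filter resamples the video to one frame per interval.
        return ["ffmpeg", "-i", str(video), "-vf", f"fps=1/{seconds:g}",
                str(out_dir / "frame_%04d.png")]
    if mode == "single":
        # -ss seeks to the requested time; -frames:v 1 stops after one image.
        return ["ffmpeg", "-ss", f"{seconds:g}", "-i", str(video),
                "-frames:v", "1", str(out_dir / "frame_single.png")]
    raise ValueError(f"unknown mode: {mode!r}")


def extract_frames(video, out_dir, mode="interval", seconds=1.0):
    """Create the output folder and run ffmpeg, raising if ffmpeg fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video, out_dir, mode, seconds), check=True)
```

Calling `extract_frames("clip.mp4", "frames", mode="interval", seconds=1)` should leave one image per second of video in the `frames` folder.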
The sheer number of images this can produce is one reason why I do not consider this a replacement for audio description. There is so much content, even in a single picture, that it can be overwhelming. There is also the challenge of identifying when enough has changed on screen to warrant a new description.
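One simple way to approach the "has enough changed?" question is to compare each frame with the last one that was described and only ask for a new description when the difference is large. The sketch below is an illustration, not something I have tuned. It assumes each frame has already been loaded as a flat list of grayscale values from 0 to 255, and the threshold of 20 is an arbitrary starting point. ffmpeg also has its own scene-detection score available through its select filter, which may be a better fit.

```python
def mean_abs_difference(frame_a, frame_b):
    """Average absolute difference between two equal-length pixel lists."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    if not frame_a:
        return 0.0
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def frames_needing_description(frames, threshold=20.0):
    """Return the indices of frames that differ enough to describe again.

    The first frame is always selected. After that, a frame is selected
    only when it differs from the most recently selected frame by more
    than `threshold` on average.
    """
    selected = []
    reference = None
    for index, frame in enumerate(frames):
        if reference is None or mean_abs_difference(reference, frame) > threshold:
            selected.append(index)
            reference = frame
    return selected
```

Comparing against the last described frame, rather than the frame immediately before, means a slow fade or pan will still trigger a new description once the picture has drifted far enough.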
So far, I’ve simply used Be My Eyes to describe the extracted images. For example, a video clip shared on social media can quickly be separated into one image per second, and each image can then be described by Be My Eyes or another service.
I’m sure there are APIs I can explore to automate the image description part of my experiment. Anyone with experience doing this already is welcome to share your knowledge in the comments here.
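Until I settle on a service, the automation around it can still be sketched. The example below assumes the frames were saved one per second with names like frame_0001.png, which is a naming pattern I made up for illustration. The `describe` argument is a placeholder for whatever function ends up calling a real image description service. I have not tied it to any particular API.

```python
from pathlib import Path


def describe_frames(frame_dir, describe, seconds_per_frame=1.0):
    """Return (timestamp_in_seconds, description) pairs in playback order.

    `describe` is a caller-supplied function that takes the Path of an
    image and returns text. It stands in for the real description service.
    Timestamps are approximate and assume evenly spaced frames.
    """
    results = []
    # Zero-padded names sort in the same order the frames were extracted.
    for index, path in enumerate(sorted(Path(frame_dir).glob("frame_*.png"))):
        results.append((index * seconds_per_frame, describe(path)))
    return results
```

Keeping the service behind a plain function means the same loop works whether the descriptions come from a cloud API, a local model, or a test stub.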
My 30-minute experiment also tells me that it would be great if the various media players would add an option to describe the current scene. Again, this is not audio description, but imagine if you could press a button at any point in a video and get a detailed description. The technology to make all this happen definitely exists today. Here’s hoping the media player makers will incorporate it into a user-friendly experience sooner rather than later.
Even without such experiences being added directly, I have found that a screenshot of the current point in time or even a photo of the television screen can yield quality results.
I view what I’ve explored here as a supplement to human-created and human-narrated audio description and will continue to explore what is possible.
I could see this being helpful for playing certain video games from a blind gamer’s perspective. The same technique you applied to videos could be used. And if I remember correctly, modern consoles do offer the ability to take screenshots. Emulators certainly offer that capability. I think with retro games, the capability might be even more useful, since 16-bit games tend to be 2D and often more playable.