I want to say upfront that the vast majority of the code here was written using prompts to ChatGPT. I wanted to see how the AI tool worked for a simple coding project and to jump-start my own use of Python. Once I started with ChatGPT, I found that making adjustments in the scripts myself and then returning to ChatGPT became a challenge, so for this effort I opted to use prompts to ChatGPT exclusively.
I suspect, like many, I have thousands of pictures from over the years. I wanted a way to process these in bulk and get descriptions.
I had hoped to use OpenAI but never found a way to use a vision model from them, and their support department made it sound like it wouldn’t be available with the basic OpenAI subscription I have. If someone knows differently, please share more details. I certainly do not want to upload images individually.
That led me to explore Ollama and their llama3.2-vision model, which you can run locally. I’ve published scripts and instructions in a GitHub project that will take a directory of images, read the prompt you want to use from a file, and write out individual description files as well as an HTML file with all the descriptions.
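The overall flow can be sketched roughly like this in Python. To be clear, the function names here are my own for illustration and are not the actual scripts in the GitHub project; the model call assumes the `ollama` Python package with a local Ollama server running:

```python
# Rough sketch of the pipeline: gather images, describe each, build one HTML page.
# Hypothetical helper names; see the GitHub project for the real scripts.
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def find_images(directory):
    """Collect image files from a directory, sorted by name."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.suffix.lower() in IMAGE_EXTENSIONS)

def describe_image(image_path, prompt):
    """Ask the local llama3.2-vision model for a description.
    Requires `pip install ollama` and a running Ollama server."""
    import ollama
    response = ollama.chat(
        model="llama3.2-vision",
        messages=[{"role": "user", "content": prompt, "images": [str(image_path)]}],
    )
    return response["message"]["content"]

def build_html(descriptions):
    """Combine (filename, description) pairs into a single HTML page."""
    rows = "\n".join(f"<h2>{name}</h2>\n<p>{text}</p>"
                     for name, text in descriptions)
    return f"<html><body>\n{rows}\n</body></html>"
```

A driver would then read the prompt from a file, loop `describe_image` over `find_images(...)`, write each description to its own text file, and save the `build_html` output alongside them.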
This does work but is raw and still needs refinement. “Works” here is definitely defined as “works on my equipment and in the environments where I’ve tried it.” I wanted to share what I have so far because even in this form, I’ve found it works well for my task. Again, others may already know of better ways to do this. Some of the enhancements I want to add include:
* Better image selection versus just a directory of images.
* Linking to the image file in the HTML descriptions.
* Extracting metadata from the image files, such as the date, to help recall when the images were taken.
* If possible, using GPS data that may be embedded in the image to provide location information for the images.
* Learning more about the llama model and its processing to ensure I’m taking advantage of all it offers.
* Cleaning up file handling and allowing things such as the image path and results location to be configured outside the scripts.
* Figuring out how to make this work on Windows and Mac from one script if possible. I’ve run it on both with success, but this documentation and script are based on Windows.
* Packaging this up as an executable to make it easier to use.
* Exploring a way to flag descriptions for another pass when you want more detail.
* Long term, again assuming something doesn’t already exist, exploring building GUI apps here.
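For the date and GPS items above, here is a rough sketch of what the extraction could look like. These helpers are hypothetical, not part of the published scripts, and the date lookup assumes the Pillow library (`pip install Pillow`); the EXIF tag numbers used are the standard ones:

```python
def get_capture_date(image_path):
    """Return the EXIF capture date string, if present (requires Pillow)."""
    from PIL import Image  # pip install Pillow
    exif = Image.open(image_path).getexif()
    # 36867 = DateTimeOriginal (lives in the Exif sub-IFD, 0x8769);
    # 306 = DateTime in the base IFD, used as a fallback.
    return exif.get_ifd(0x8769).get(36867) or exif.get(306)

def dms_to_decimal(dms, ref):
    """Convert EXIF GPS (degrees, minutes, seconds) plus a hemisphere
    reference ('N'/'S'/'E'/'W') to signed decimal degrees."""
    degrees, minutes, seconds = (float(v) for v in dms)
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value
```

EXIF GPS coordinates are stored as degree/minute/second rationals with a separate hemisphere letter, so a conversion like `dms_to_decimal` is needed before the values are usable as map coordinates.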
My primary goal is processing images in bulk. I went to a museum recently and ended up with more than 150 images taken with Meta glasses. I did get some descriptions there, but I want more, and again, I have thousands of pictures from over the years.
As I said at the outset, I do not want to take any credit for what ChatGPT did here with the code. I guided it toward the goals I had in mind, and that itself was an interesting activity. It is by no means automatic. It is also possible there is already a better way to do this, so if someone reads all this and says, “hey, just use this tool,” I have no investment in this being the end-all of image description experiences. I tried finding something that would do what I wanted but didn’t have success, so this was my attempt.
It is my understanding that running Ollama on Windows used to require the use of WSL. I don’t know when that changed but documentation and my own use says that you can now use Ollama on Windows without WSL and that’s what I’ve done here.
If you do try this and want to interrupt the script, just press Ctrl+C at the command prompt. You’ll get an error from Python, but processing will stop.
If there is value in this effort and you want to contribute, I have a GitHub project for this effort. You can also grab the scripts mentioned from the project page.
Last, this is by no means instantaneous. On an M1 MacBook Air and a Lenovo ARM Slim 7, it takes about three minutes per image. According to the Ollama documentation, you do not need an ARM processor on Windows, though. This is the sort of thing you run in the background.
If you opt to try this, take note of the scripts and the areas where you need to modify file paths and such. Feedback is of course welcome. If you try this and it doesn’t work, please do your best to troubleshoot. Until I make more progress, this is kind of an as-is idea and not something where I can offer a lot of assistance. Most errors are likely from not having one of the Python dependencies installed or something not configured with file paths.