
The Idea Place Posts

The Power of Prompting – Image Description Toolkit 4 beta 3 with NASA Images

Since writing about the Image Description Toolkit (IDT) 4.0 Beta 1 in February, a great deal of work has gone into the toolkit. A new beta is out, new models have been added, the prompt library has grown, and I found a perfect real-world dataset to put the whole system through its paces: NASA’s Image of the Day collection. This post covers what I learned, what the data showed, and why the right prompt can make all the difference.

What Has Changed Since Beta 1

The February post introduced IDT 4.0’s two main tools — the ImageDescriber GUI and the idt command line — along with support for Ollama, OpenAI, and Claude models. Since then, 4.0.0 Beta 3 has shipped with a number of additions worth knowing about:

  • idt describe — a friendlier alias for idt workflow. Same command, easier to remember.
  • idt redescribe — re-runs AI description on a set of images you have already processed, skipping the video extraction and conversion steps. Useful for quickly testing a new model or prompt on the same images without starting from scratch.
  • idt guideme now accepts URLs — in addition to a folder path, you can paste a web page URL at the image folder prompt and IDT will download images from that page and continue the workflow automatically.
  • Downloaded images organized by source — images downloaded from a URL are now placed in a subfolder named after the domain and page title, so you always know where a set of images came from.

The full list of prompts has also grown. IDT now ships with twelve built-in prompts, each designed to draw out a different aspect of an image:

  1. Narrative — a straightforward scene description, grounding you in what is present
  2. Detailed — everything Narrative gives you, plus more technical depth
  3. Concise — a brief, single-sentence summary
  4. Colorful — emphasis on color, tone, and visual atmosphere
  5. Technical — image quality, composition, exposure, and photographic characteristics
  6. Artistic — style, mood, and creative interpretation
  7. Simple — plain-language description suitable for a general audience
  8. Accessibility — structured for screen readers and assistive technology
  9. Comparison — designed for comparing two versions of the same image
  10. Mood — emotional tone and atmosphere
  11. Functional — what the image is for, not just what it shows
  12. AI Alt Text — three website alt text options at 25, 50, and 100 words, designed as a starting draft for human review before publishing

You can edit any of these or add your own through the Tools menu in ImageDescriber or by editing prompts.json directly.
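If you prefer editing the file directly, a custom prompt entry might look something like this (a sketch only: the exact schema of prompts.json is not shown in this post, so the layout here is an assumption):

```json
{
  "prompts": {
    "colorful": "Describe this image with emphasis on color, tone, and visual atmosphere.",
    "birdwatch": "Identify any birds in this image and describe their plumage in detail."
  }
}
```

Here, birdwatch is a hypothetical custom prompt added alongside a built-in one.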

When You Download Images, IDT Now Captures Alt Text Too

One feature that did not get much attention in the Beta 1 post: when IDT downloads images from a web page, it also captures the alt text that is already on those images. That data is stored alongside your downloaded images and is available for comparison when you run the analysis tools.

This turned out to be far more interesting than I expected once I chose my test dataset.

The NASA Opportunity

On April 1, 2026, NASA launched the Artemis II mission — the first crewed Moon trip since the Apollo program. NASA’s Image of the Day page had been building up to this for weeks with stunning photographs: the crew in training, the rocket being prepared, and then the mission itself: Earthrise from lunar orbit, the Milky Way seen from deep space, the crew capsule from a window looking back at a shrunken planet.

This was a natural fit for IDT. I used idt workflow to download 64 images directly from the NASA Image of the Day page and ran all twelve built-in prompts across multiple AI models. The dataset includes descriptions from Claude Haiku 4.5, Claude Sonnet 4.6, Ollama Gemma4 31b (cloud), Ollama Kimi-K2.5 (cloud), Ollama Moondream, Ollama Qwen3-VL 235b (cloud), GPT-4.1 Mini, and GPT-4.1 Nano — over 700 descriptions in total.

And because IDT captured the alt text from NASA’s page during download, I had a built-in comparison baseline.

NASA Does Quality Alt Text Already

Before getting into what IDT produced, it is worth saying clearly: NASA already writes excellent alt text. This is not a critique. NASA’s image descriptions are thoughtful, specific, and informative. They name people, identify locations, and include context that no AI can supply from an image alone.

Here is an example. The image is a photograph taken from inside the Orion spacecraft. An astronaut is silhouetted against a window, looking out at Earth floating in space.

NASA’s alt text:

NASA astronaut Reid Wiseman looks out of the Orion spacecraft’s rectangular window at Earth. He is barely visible in the dimly lit cabin, but Earth is bright. The entirety of Earth cannot be seen; in this view, Earth’s round shape is cut off. Earth itself looks like a bright blue ball with swirling white clouds all over it; at center is a brown and green landmass.

That is excellent alt text. It names the astronaut. It identifies the spacecraft. It describes what Earth looks like from that vantage point. An AI looking at the same image cannot do any of that.

Here is what Kimi-K2.5 produced for the same image with the Narrative prompt:

The image shows the interior of a spacecraft. The left side of the frame is filled with solid black darkness. Within this darkness, the silhouette of a human head appears in left profile, positioned approximately one-third from the left edge. The person’s forehead, nose, closed mouth, and chin are outlined by light entering from the right.

Good, accurate description of what is visible — but no name, no mission context, no spacecraft identification. The AI sees the scene; NASA’s writer knows the story.

Prompts Change What You Get

Here is where the Power of Prompting shows up. Take a different image: Earthset from the far side of the Moon — the Earth appearing as a crescent above the cratered lunar horizon.

NASA’s alt text:

Earthset From the Lunar Far Side

That is the title, essentially. It tells you what it is but nothing about what you are seeing.

Now watch what happens as IDT works through different prompts, using Kimi-K2.5 (cloud):

Narrative prompt:

The image presents a view of space dominated by the curved bulk of the Moon occupying the lower two-thirds of the frame. The scene contains no human figures, clothing, or artificial objects.

A starting point, but sparse.

Colorful prompt:

This image depicts Earth rising above the Moon’s horizon, with the planet’s partially illuminated disk displaying navy oceans and ivory cloud formations against the charcoal blackness of space, while the foreground shows the Moon’s cratered, uneven surface in shades of slate and charcoal gray. The lighting originates from the right side of the frame, producing hard-edged shadows across the lunar terrain.

Now we have color, light direction, texture. A vivid picture is forming.

Accessibility prompt (Gemma4 31b):

A wide-angle shot of Earth rising over the horizon of the Moon against a black void.

Concise. Functional. Ready for a screen reader.

AI Alt Text prompt (Kimi-K2.5, 25/50/100 words):

25 words: A stunning view of Earth rising over the Moon’s cratered horizon, with the blue planet visible as a crescent against the vast blackness of space.

50 words: A dramatic photograph showing the heavily cratered surface of the Moon in the foreground, with planet Earth visible as a blue and white crescent rising above the lunar horizon. The stark contrast between the gray moonscape and vibrant Earth suspended in black void of space creates a breathtaking cosmic perspective.

100 words: This awe-inspiring image captures the iconic “Earthrise” phenomenon, showing the rugged, crater-covered surface of the Moon dominating the foreground while the delicate blue and white crescent of Earth emerges above the lunar horizon. The photograph reveals the stark beauty of the moonscape’s gray, pockmarked terrain, illuminated by sunlight, contrasting dramatically with the vibrant colors of our home planet suspended in the infinite blackness. This perspective, captured from lunar orbit, emphasizes the profound isolation of the Moon and the fragility of Earth, offering a humbling reminder of our planet’s place in the cosmos.

Each prompt gives you something different. None replaces NASA’s contextual knowledge. But together they build a much fuller picture than any single description could.

Introducing the AI Alt Text Prompt (Experimental)

The eleven standard prompts I was originally going to use in this version of the IDT are all about understanding an image in depth. But there is also a practical question: can AI generate usable alt text for web images?

I added a twelfth experimental prompt — aialttext — that asks the AI for three versions of website alt text at different lengths: 25, 50, and 100 words. The goal is to give whoever publishes the image options to choose from depending on the context. I will give away the answer now: it is a qualified yes, which is why IDT now ships with twelve standard prompts.

The yes is qualified because AI-generated text should always be reviewed by a human. “AI can assist in generating alt text” is probably the better answer.

The results above show what Kimi-K2.5 produced for the Earthset image. Let me give one more example. Here is NASA’s alt text for the Saturn image taken by the James Webb Space Telescope:

NASA’s alt text:

NASA Webb, Hubble Share Most Comprehensive View of Saturn to Date

Kimi AI Alt Text — 25 words:

Infrared image of Saturn captured by Webb Telescope on November 29, 2024, showing bright glowing blue rings and three labeled moons: Janus, Dione, and Enceladus.

This is interesting. The AI correctly read the labels on the image — Webb infrared, the date, the moon names. NASA’s alt text is a headline; the AI’s version is descriptive. Neither is wrong. They serve different purposes.

But here is the point that matters most: you should not publish AI-generated alt text without reviewing it first.

Unless you would automatically publish an AI-generated image on your website, you should not publish AI-generated alt text without review. Alt text is not decoration. It is the description that a person using a screen reader gets instead of seeing the image. Getting it wrong — or getting it confidently wrong, as AI sometimes does — causes real harm.

The AI cannot name people it has not been trained on. It may misread or invent labels. It may describe what looks plausible instead of what is actually there. Domain knowledge matters enormously. NASA’s writers know the astronauts by name, the spacecraft by model, the mission by number. AI does not.

Used as a starting point for a human editor, the AI alt text prompt can save time and surface details a writer might miss. Used as a replacement for human review, it introduces exactly the kind of accessibility failure that ruins the experience for the people who most need it.

Kimi-K2.5 and Gemma4: Same Prompt, Different Voice

One of the things I wanted to know was whether two different models given the exact same alt text prompt would produce meaningfully different results — or whether the prompt would dominate and the model would not matter much.

The short answer: both models followed the three-length structure cleanly, but they have genuinely different voices. And the differences matter for publishing.

Speed is the most obvious difference — and it comes with a lesson. Kimi-K2.5 averaged 116 seconds per image for this run. Gemma4 averaged 14 seconds. For 64 images, Kimi took just over two hours; Gemma took about 15 minutes.

What makes that gap interesting is that Kimi is not normally a slow model. Across all the other prompts in this dataset, it averaged between 11 and 36 seconds per image. The aialttext run was a significant outlier. My hypothesis is that asking for three outputs of different lengths in a single prompt caused Kimi to generate substantially more text per image — and longer outputs simply take longer to produce. Gemma handled the same multi-part prompt without the same time cost.

The practical takeaway: any time you combine things in a prompt — multiple outputs, multiple formats, multiple lengths — it is worth running a small test batch first to understand how your model of choice responds before committing to 64 images. What works efficiently with one model may behave very differently with another.
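The batch-time figures above are simple arithmetic (seconds per image times image count); a tiny sketch for estimating your own runs:

```python
def batch_minutes(seconds_per_image: float, image_count: int) -> float:
    """Estimated wall-clock time for a sequential run, in minutes."""
    return seconds_per_image * image_count / 60

# Figures from the run described above (64 images):
kimi_minutes = batch_minutes(116, 64)   # roughly 124 minutes, just over two hours
gemma_minutes = batch_minutes(14, 64)   # roughly 15 minutes
```

A five-image test batch, by the same arithmetic, tells you in a few minutes whether a prompt is going to be expensive at full scale.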

Formatting differed in small but telling ways. Kimi labeled its sections **25 words:**, **50 words:**, **100 words:** — a clean, precise claim. Gemma used **Short (approx. 25 words)**, **Medium (approx. 50 words)**, **Long (approx. 100 words)**. Gemma’s “approx.” qualifier is more honest: neither model is actually hitting the word counts precisely, but Gemma at least acknowledges it.
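You can check this yourself with a simple whitespace word count (an approximation, but good enough to see whether a target length was hit). Gemma’s short variant quoted later in this post, for example, comes out to 23 words rather than exactly 25:

```python
def word_count(text: str) -> int:
    """Approximate word count by splitting on whitespace."""
    return len(text.split())

gemma_short = (
    "Silhouette of an astronaut looking through a spacecraft window "
    "at a vibrant blue and white Earth against the deep black void of space."
)
print(word_count(gemma_short))  # 23: close to, but not exactly, 25
```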

Here is the same image — an astronaut silhouetted against a spacecraft window, with Earth filling the frame outside — described by both models.

Kimi-K2.5:

25 words: Silhouette of an astronaut gazing at Earth through a spacecraft window, showing blue oceans, swirling white clouds, and the planet’s curvature against the black space.

50 words: Astronaut’s silhouette visible against a spacecraft window, gazing at Earth suspended in the blackness of space. The planet displays vibrant blue oceans, swirling white cloud formations, and hints of landmasses. The curved horizon emphasizes Earth’s spherical shape, creating a striking contrast between the dark interior and the illuminated planet beyond.

100 words: The dark silhouette of an astronaut’s profile is visible against a spacecraft window, gazing at the stunning view of Earth from orbit. The planet fills the window with vibrant blue oceans, swirling white cloud formations, and visible landmasses partially obscured by haze. The curvature of Earth is prominently displayed against the infinite blackness of space, creating a dramatic contrast between the illuminated planet and the dark void surrounding it. The spacecraft interior remains shadowed, emphasizing the isolation of space travel while highlighting the breathtaking beauty of our home planet. This perspective showcases Earth’s fragility and the profound experience of orbital observation.

Gemma4:

Short (approx. 25 words): Silhouette of an astronaut looking through a spacecraft window at a vibrant blue and white Earth against the deep black void of space.

Medium (approx. 50 words): A dramatic silhouette of a person’s profile, likely an astronaut, gazing through a rounded spacecraft portal. Outside the window, the Earth is visible as a brilliant blue marble with swirling white clouds and brown landmasses, contrasting sharply against the absolute darkness of the surrounding outer space.

Long (approx. 100 words): A poignant, high-contrast photograph taken from inside a spacecraft. In the foreground, the dark silhouette of an astronaut’s face is seen in profile, looking out through a curved observation window. The window frames a breathtaking view of planet Earth, showcasing deep blue oceans, intricate white cloud patterns, and the reddish-brown hues of a continent. The stark contrast between the pitch-black interior of the cabin and the luminous, glowing sphere of the planet emphasizes the isolation of space and the fragile beauty of Earth seen from a low-orbit perspective.

Neither names the astronaut (Reid Wiseman). Neither identifies the spacecraft (Orion). That knowledge has to come from the human editor. But as starting drafts, both are genuinely useful.

Kimi reads as more narrative and expansive. The 100-word version builds toward a conclusion — “the profound experience of orbital observation.” Gemma is tighter and more visual — “the reddish-brown hues of a continent.” Different writers will reach for one or the other depending on the context and tone of the publication.

There is also a case where model voice led to an actual error worth noting. For the Earthset image — Earth appearing above the lunar horizon, taken from the far side of the Moon — Kimi’s 100-word version referred to it as the “Earthrise” phenomenon. Earthrise and Earthset are different events. Earthrise famously refers to the Apollo 8 photograph from lunar orbit. This image, taken by the Artemis II crew, shows Earthset from the lunar far side. Kimi reached for the famous reference but got it wrong. NASA’s captured alt text — “Earthset From the Lunar Far Side” — was correct. Gemma described it accurately without using either term.

This is exactly why human review matters before publishing. The AI knew the image was significant. It chose a famous, evocative label. But the label was wrong. A human editor who knew the difference would catch it instantly. A human editor who did not might publish the error.

Follow-Up Questions and Chat

One more IDT capability worth highlighting, now that the NASA dataset illustrates it well. After any image has been described, IDT lets you ask follow-up questions — press F in the ImageDescriber GUI or use the CLI. You can switch to a different model for the follow-up if you want.

For example: after running the Narrative prompt on the Artemis II launch photograph, you might ask: “How does this compare to a Saturn V launch in terms of visual scale?” Or for the supernova remnant: “What telescope instruments were used to capture the different wavelengths in this image?” The AI does not always get these right, but the capability is there, and for scientific images especially, the follow-up question is often where the real value is.

IDT also has a freestanding chat mode (press C in ImageDescriber) for model-to-model conversation without any image attached.

The NASA Meatball

Some images are so iconic that AI models reference them by name without stopping to describe them — and the NASA meatball logo is the clearest example in this dataset.

The term “meatball” appeared in 61 descriptions across 12 images. In 11 of those images, the logo appeared as an incidental background element — on a wall at JPL, projected as a backdrop behind a rock sample display, on the tail of a T-38 jet, on the building behind the Artemis II crew. Across all of those background appearances, models named the logo 26 times and actually described what it looks like only 7 times. Most of those 7 were borderline — noting that the logo was “circular” or appeared on the wall. Almost no description explained what a viewer would actually see.

NASA’s own alt text set the tone. For the JPL auditorium selfie, NASA’s caption reads: “In the far background, there is a NASA ‘meatball’ insignia and the letters ‘JPL’ on the wall.” The name, nothing more. The AI models followed the same pattern. If you have never seen the NASA meatball and no one has ever described it to you, most of these descriptions left you exactly where you started.

The one exception: when the NASA logo itself was the image subject (a standalone logo PNG in the dataset), models gave rich, accurate descriptions across virtually every prompt. Gemma4’s AI alt text treatment is a good example: “The NASA insignia, known as the ‘meatball’ logo. It consists of a blue sphere representing a planet, containing the word ‘NASA’ in bold white text. A red V-shaped vector symbolizes aeronautics, while a white orbital ring and scattered white stars represent space exploration.” That’s exactly what someone unfamiliar with the logo needs. The problem is that this only happened when the logo was the foreground subject — never when it appeared in the background.

What the NASA meatball actually looks like: A circular emblem on a deep blue field, with the word “NASA” in large bold white letters across the center. A red swoosh — sometimes described as a wing or chevron — cuts diagonally across the circle from lower-left to upper-right, representing aeronautics. A white curved orbital path arcs around the circle, suggesting a spacecraft in orbit. Small white stars are scattered through the blue field. The overall effect is dense but balanced: space, flight, and identity in one symbol.

Follow-up questions help, but they are not foolproof. When asked as a standalone follow-up — “Please describe what the NASA meatball looks like to someone who has never seen it” — GPT-4.1 Mini returned an accurate description on the first try: blue background, red wing-shaped vector, white stars, bold white NASA lettering. Claude Haiku’s response, by contrast, described it as having “a large red circle dominating the center” as the main element — significantly wrong. The logo is primarily blue. Even when a model knows the name of an iconic image, its ability to accurately describe that image from memory alone varies.

The prompt implication. The accessibility and AI alt text prompts in IDT currently do not explicitly instruct models to describe iconic logos and emblems when they appear incidentally in an image. Adding language like “when referencing iconic logos, insignia, or well-known symbols by name, include a brief visual description of what the symbol looks like” may close this gap in a future prompt revision.

Data Completeness

In a few instances, the data for a given prompt and model combination is incomplete. In the case of Moondream, some prompts — such as Technical — failed to return any response after multiple tries.

In the case of Claude, not all prompts were run due to the amount of data already gathered and the costs associated with Claude models. In addition, due to image sizing and the way IDT currently works, some images exceeded the size limit for Claude and were not described.

The AI Alt Text prompt was only run on Kimi-K2.5 and Gemma4 models for this dataset.

The Data

You can obtain the full set of image descriptions and prompts used for this set of NASA images at the following locations.

The CSV includes descriptions from eight models across three providers: Claude Haiku 4.5, Claude Sonnet 4.6, Gemma4 31b, Kimi-K2.5, Moondream, Qwen3-VL 235b, GPT-4.1 Mini, and GPT-4.1 Nano. For a brief description of each model and links to official documentation, see the models file above.

Try It Yourself

The latest version of the Image Description Toolkit can be obtained from the GitHub releases page or on the project page at theideaplace.net. Full documentation is in the User Guide on GitHub.

If you want to replicate the NASA experiment, try:

idt workflow https://www.nasa.gov/image-of-the-day/

That will download the current Image of the Day collection and describe the images using your configured model and prompt. From there, idt combinedescriptions will compile everything into a CSV you can explore in Excel or any spreadsheet tool.
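Once you have the combined CSV, even a few lines of Python can answer questions like which model produced the most descriptions. A sketch, assuming the CSV has a model column (the actual column names produced by idt combinedescriptions may differ):

```python
import csv
from collections import Counter

def descriptions_per_model(csv_path: str) -> Counter:
    """Tally how many description rows each model produced.

    Assumes a 'model' column; check the header row of your own
    combined CSV, since the real column names may differ.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row["model"] for row in csv.DictReader(f))
```

A quick way to spot the incomplete prompt/model combinations mentioned earlier: any model whose count falls short of images times prompts has gaps.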

Questions, issues, and pull requests are welcome at github.com/kellylford/Image-Description-Toolkit.

*This blog post used AI for parts of the image research, content organization and overall editorial review.


Fidelity and Elan Show Zero Accountability for Accessibility

This is a textbook example of a complete lack of accountability from both Fidelity Investments and Elan Financial Services for a fundamental accessibility failure. On January 11, 2026, I reported a serious accessibility issue to both Fidelity and Elan around the website used for a Fidelity-branded credit card from Elan. Fidelity confirmed the issue via email to me on January 16, 2026. I’ve had no response from Elan. Two months later and there is no change. Why?

As I mentioned, Fidelity offers a credit card that comes from Elan but is branded as a Fidelity credit card. As you’d expect, there is a website where you can view transactions and more.

When reviewing transactions, an aria-label on the table row holding all the transaction details is used. This is an absolute disaster for accessibility. On top of this, just about all the table columns read as blank cells with a screen reader, because aria-hidden is applied to the data in those cells.

It is bad enough that all the data, as a result of the ARIA-Label, is jammed into one long string. The fact that an actual table is still present but the columns are all blank with a screen reader, when you really need a properly structured table to review financial transactions effectively, just compounds the problem.

Ironically, about a month ago, the site was changed so that if you have multiple cardholders on the account, transactions show up under the name of each cardholder. That column actually is shown in the table, with its data, when using a screen reader.

A Real Example

Here is an example from my account of one of these aria-labels. Note that I changed the ending digits of my card to 0000 and deleted some of the HTML used for styling.

<tr aria-label="Transaction details for POSTED Transaction Date: 2026-03-24, Transaction Description: Sheraton Hotel, User: KellyF… 0000, DEBIT Transaction Amount: 28.88">

Can You Say Double Standard?

You can bet if the visual experience was broken in some equivalent way, such as all the table data jammed into a single cell, it wouldn’t take more than two months for something to be done. The sad thing is that this is just par for the course when it comes to accessibility. There is little to no accountability for accessibility basics when they are broken. Fine, mistakes happen. What you do or do not do about them speaks volumes. Leaving the user to deal with this kind of garbage, well, what’s that message? Using a screen reader should not mean you get an inferior experience.


Windows Unattended Install with Parallels and a Simple Machine Manager

Background

Virtual machines, running one computer as an application inside another, still give me that “kid in a candy store” feeling. They allow me to test, experiment, and safely break things; if it all goes horribly wrong, I can just delete the virtual machine and start over. There is no risk of losing computer access or screen-reading functionality.

For the past couple of years, my primary environment has been Parallels on a Mac. I mainly run Windows virtual machines, and Parallels has done a great job of supporting Windows 11 despite its more stringent hardware requirements.

For sighted users, Parallels makes creating a VM easy: you simply choose “New VM,” and there is an option to download a Windows ISO directly from Microsoft. Unfortunately, this part of the Parallels interface (along with several other features) suffers from significant accessibility gaps.

One workaround is to use VOCR, a tool that performs optical character recognition on the screen. It is incredibly handy for this and many other situations where standard screen reading fails.

As someone who originally set out to be a journalist, I find the combination of AI-assisted code development and my own ideas to be a game-changer. I’ve long said, “I’ll tell you what to build; you build it.” We still have a long way to go, but the progress I’ve seen in just one year is truly amazing.

Another Solution: The Unattended Install

I decided to use AI to improve the Parallels experience. I asked it to help me create a script that would download a version of Windows for ARM hardware, create the Parallels virtual machine, and perform an unattended install so that the OS is ready at the desktop. This essentially replicates what Parallels does natively, but through a more accessible interface.

Managing Machines

Once I had a script that allowed me to trigger a VM creation from the Mac Finder and have a working desktop in 20 minutes, I wanted to go further. Managing existing VMs through the Parallels UI can also be hit-or-miss regarding accessibility.

Parallels does offer a robust command-line utility called prlctl. While powerful, command-line tools can be unforgiving if your syntax is slightly off. To bridge this gap, I went back to the AI and created a very basic Parallels Machine Manager. This utility displays a list of your running machines and provides basic management tasks like starting, stopping, cloning, and deleting. I also integrated the targeted Windows VM creation script directly into the app.
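A minimal sketch of what wrapping prlctl can look like, using the subcommand names mentioned above (start, stop, clone, delete). This is illustrative only, not the actual Machine Manager code:

```python
import subprocess

def prlctl_cmd(action: str, vm_name: str, *extra: str) -> list[str]:
    """Build an argv list for a prlctl call, e.g. start, stop, clone, delete."""
    return ["prlctl", action, vm_name, *extra]

def run_prlctl(action: str, vm_name: str, *extra: str) -> None:
    """Invoke prlctl; raises CalledProcessError if the command fails."""
    subprocess.run(prlctl_cmd(action, vm_name, *extra), check=True)

# Examples (not executed here):
# run_prlctl("start", "Windows 11")
# run_prlctl("clone", "Windows 11", "--name", "Windows 11 Test")
```

Building the argv as a list, rather than a shell string, is exactly what makes a thin wrapper like this forgiving where raw command-line syntax is not.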

Getting These Tools

You can download the standalone unattended install script here, or download the full Parallels Machine Manager here.

Contributions Welcome

I’ve hosted the script and the app on GitHub in a repository I’m calling The Workbench. You can find it at github.com/kellylford/TheWorkBench.

I create simple solutions like this from time to time and plan to populate this repository with more of them. Contributions—whether to these projects or sharing your own—are more than welcome.

My contributions to this repo are functional, but they are “immediate-need” projects. Once I’ve solved the specific problem I’m facing, I don’t typically invest significant time in extending them. While these tools are valuable, they would require more work to become “production-quality” software.

More Formal Projects

I also maintain a list of more formal projects at www.theideaplace.net/projects. These include:

  • A weather app
  • A sports scores app
  • An image description toolkit
  • A simple RSS reader

Introducing IDT 4.0 Beta 1: An Enhanced Way to Describe Your Digital Images

The 4.0 Beta 1 release of my Image Description Toolkit is now available. If you are unfamiliar with this project, it provides apps to handle image description using Ollama AI models that can run locally. The toolkit is specifically aimed at handling large numbers of images and videos. The toolkit offers both a command line (IDT) version and a graphical version (ImageDescriber) and allows you to generate descriptions based on a series of predefined prompts or add your own to tailor the sort of information you get back from an AI model. The app also supports OpenAI and Claude models with your own user-supplied API key.

What the Image Description Toolkit Is Not

It is important to understand what the toolkit is not. Countless apps exist for getting image descriptions, and I use many myself. Most of these are aimed at describing one or two images, though, and often the descriptions must be generated anew each time you want an image described. IDT is not meant to replace those experiences directly.

What the Image Description Toolkit Is

The IDT is aimed at situations where you have dozens, hundreds, or thousands of photos you want described and you want permanent descriptions of the images. IDT also makes it easy to generate multiple descriptions based on different AI prompts and to ask follow-up questions. If you want more emphasis on the colors in an image, for example, you can use the ‘Colorful’ prompt. The ‘Technical’ prompt gives you more detail about image quality. All prompts are customizable, and you can add your own prompts to the system.

Quick Start

After installing the IDT, you have two main tools for image description. ImageDescriber is the first option you’ll want to explore for describing smaller groups of images and generating multiple descriptions per image. The default install puts this in ‘c:\idt\imagedescriber.exe’. Load the app, press ctrl+l or choose File:Load Directory and then arrow through images that have been loaded. Press the letter p on any image to process that image or choose Processing:Process All Undescribed to have the full set of images you have loaded described. In either case you’ll be prompted for the AI provider you want to use, the model and prompt. The system defaults to Ollama, Moondream and the Narrative prompt.

The second option is IDT command line. Open a command prompt and change to the default install location of ‘c:\idt’, assuming a standard install. Enter ‘idt guideme’ and the system will ask you a series of questions and then kick off a workflow to describe your images. Alternatively enter ‘idt workflow <path to images>’ and the system will just start describing images using the same defaults as mentioned earlier.

Additional Functionality

In addition to providing image descriptions, both IDT and the ImageDescriber can download images from a web page and provide descriptions. In IDT just enter ‘idt workflow <url>’ and in ImageDescriber press ctrl+u and provide the address of a web page.

ImageDescriber also allows you to ask follow-up questions about any image using the original AI provider and model or by selecting another of your choice. Just press f on any image or description in the app.

Monitoring Costs for Paid Models

When using Claude or OpenAI models, the token counts for the AI prompt (which includes the image) and for the response are shown as part of the image metadata. Review these to gauge the cost of the various models from these companies; both publish per-token pricing for their different models.
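As a back-of-the-envelope sketch of how those token counts translate into dollars, multiply each count by the provider’s published rate. The rates below are hypothetical placeholders, not real prices; check the provider’s pricing page for current numbers:

```python
# Sketch: estimate the cost of one described image from the token counts
# IDT records in the image metadata. Prices here are HYPOTHETICAL
# placeholders; substitute the provider's current published rates.

def estimate_cost(prompt_tokens: int, response_tokens: int,
                  input_price_per_mtok: float,
                  output_price_per_mtok: float) -> float:
    """Return the estimated dollar cost for one description.

    Prices are expressed per million tokens, the way most providers
    publish them.
    """
    return (prompt_tokens * input_price_per_mtok +
            response_tokens * output_price_per_mtok) / 1_000_000

# Example: a 1,600-token image prompt and a 350-token description at
# assumed rates of $3 per million input and $15 per million output tokens.
cost = estimate_cost(1600, 350, 3.0, 15.0)
print(f"${cost:.4f}")  # about a penny per image at these assumed rates
```

Multiplying that per-image figure by the size of your photo library gives a quick sense of whether a paid model is practical for a large batch.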

A Basic Chat with AI

Finally, ImageDescriber has a basic chat system, requiring no images, for any AI models you have installed or configured. Press c when the program is open or choose Process:Chat. Next select your model and the chat experience will open. Enter your question and the AI will respond. You can press shift+tab to move to the chat history. Note that currently the only way to save the chat history is to press ctrl+a and then ctrl+c to copy the full chat and paste it into another app. Work to fully integrate saving of chats into an Image Description Toolkit workspace file is ongoing.

User Guide and Issue Reporting

A full User Guide is available for the IDT. You can also report issues from within the program with the Report an Issue menu item on the Help menu or by visiting the New Issue form on GitHub. Check the list of current known issues as well.

Windows Version Today, Mac Coming Soon

This release is for Windows, with a Mac version coming soon. The Mac version is in fact done, and those who are so inclined can build it from source on GitHub. Look in the wxmigration branch under MacBuilds for build scripts. I’m not officially releasing the Mac version yet because I am still refining how VoiceOver reads lists within the program and doing further testing.

Getting the Image Description Toolkit

You can download the 4.0Beta1 release directly from my projects page.

2/14 Update on Downloading

My application is currently not code signed. As a result, you may be prompted that the file is untrusted or otherwise not safe to download. I run security checks before making my files available and am still researching costs for code signing on Windows. You can try downloading a .zip file with the installer. Unzip the file and run the installer. You’ll still likely have to deal with untrusted warnings, but they are a bit easier to work through than those presented as part of the browser download process. Some browsers may still present the warning experience as well.


Basic Accessibility Gaps Still Too Common; Microsoft, Please Justify CopilotNative.Chat.Controls

With the end of the year approaching, I was doing some tax prep. I asked the Windows version of Copilot to convert a final pay summary from my last employer into something that was more readable. To Copilot’s credit, it converted the PDF into a highly accessible Excel spreadsheet. That’s the positive. It is really handy.

Unfortunately, that’s not the full story. In reviewing the conversation, here is how the entry providing my answer reads with a screen reader. Note this is just the start of the answer.

CopilotNative.Chat.Controls.ViewModels.MessageThinkingAndActivityPartCopilotNative.Chat.Controls.ViewModels.MessageDocumentPartYour Excel spreadsheet is ready to go — you can **click the card above to download it**.

I have to ask why this sort of basic accessibility failure is still making it to customers. Both automated and manual testing could easily find this sort of failure, and likely basic code inspection and automated checks at the code level could as well.

People who do not use screen readers would never be shown text like CopilotNative.Chat.Controls and the rest of the long string those of us who do use such technology are offered. This is simply not acceptable.

Sadly, this is still far too often the reality of accessibility in 2025. It is part of why I wrote my post on Accessibility Island years ago at https://theideaplace.net/accessibility-island-the-journey-of-many-experienced-as-one/. It is also why I say the accessibility field, especially at the corporate level, needs to be far more accountable. I wrote about this as well at https://theideaplace.net/from-word-fluff-to-real-impact-achieving-specific-measurable-and-accountable-accessibility/.

Some accessibility is more complex to address but these sorts of issues are not. All it takes is commitment and attention to the most basic of details.

If you try to reproduce this, the issue seems to occur when Copilot is producing downloadable content. Of course, in my opinion the Windows Copilot app in general still serves at best half an answer to screen reading users.

Yes, it is true that AI, as with many technologies, has done many innovative things for accessibility. But the full story should be positive, not just the innovations. These tools still need sizable attention because the basics of accessibility within many of them still fall short.


Image Description Toolkit 3.6.0 Available

A 3.6.0 release for my Image Description Toolkit is now available. The release features a new --redescribe option as well as support for two Microsoft Florence-2 models via Hugging Face.

Redescribe

The redescribe option can be used with the command line version of the toolkit to redescribe the same set of images without the need to extract video frames or convert image files to JPG from HEIC. This can save time and allow you to quickly test different prompts on the same set of images. To use the redescribe option, simply add --redescribe as the first option in an “idt workflow” command followed by the standard --provider, --model and --prompt-style options. Use the workflow directory from a previous image description workflow as the file directory for images to be described. Product documentation has full details.

Hugging Face and Florence Models

Hugging Face is described as “The platform where the machine learning community collaborates on models, datasets, and applications.” It contains a wide range of AI models, datasets and apps across the full spectrum of AI tasks.

The Florence-2 AI model is described in part as, “an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks.” It runs locally and requires no separate installation. Note that the model will download on first use, so expect a longer image description time initially. Also be aware that, for a variety of reasons, the Florence-2 model is only supported in the command line version of the Image Description Toolkit.

You can get the latest update for the Image Description Toolkit from my projects page at www.theideaplace.net/projects or the GitHub release page.


Give Accessibility a Seat at the Notepad Table Please

Unreal. I read today that Microsoft has started rolling out table support for Notepad. But guess what: if you are using a screen reader, the fact that you are in a table when editing is not communicated at all. This is despite the fact that Office has clearly communicated tables when editing with screen readers, including Narrator, for many years. So clearly technical solutions exist and have been used for a long time. Why is something this basic once again left out for accessibility?

If you are so inclined, you can up vote feedback I submitted on this issue at this feedback link.

This is another example of why I say we need more accountability around accessibility. If Microsoft is going to enhance Notepad, accessibility needs an equal seat at the table, not something missed yet again.


Windows Copilot Serves At Best Half an Answer to Screen Reading Users

If you follow me on social media, one of the themes you’ll find I talk about fairly often related to accessibility is how text that isn’t shown visually needs to have better verification. This can be anything from alt text on web pages, to the accessible names used in different applications for controls such as menus and more. For several months now, the Windows Copilot app has had a serious problem that fits in this category. I know I reported it to Microsoft at the start of March of this year and the problem remains as broken today as it was then.

In this case, the challenge is that the text communicated for accessibility purposes is clipped, such that when reviewing answers with Copilot, the answers you receive are incomplete.

If I ask Copilot for a list of the fifty United States, I’m led to believe that there are only 27 states. In the way the list is rendered to a screen reader, the list ends at the state of Nebraska. I guess Wisconsin, where I am writing this from, has disappeared?

If I ask for a list of the U.S. presidents, the list when read with a screen reader only shows 10 presidents and in fact cuts off the last name such that you end up getting only a partial name. At least my history never said we had a president named “10. John T.”

You can try various modes with the different screen readers, such as Narrator’s Scan Mode, the JAWS Touch Cursor and more and the problem is not improved. In most cases, such as Narrator’s Scan Mode, you end up reading some text twice but still never get the full answer.

It should not be necessary to copy a response and paste it into another app just to read the answer from an AI system or any app for that matter.

The frustrating thing is that none of this was an issue at the start of 2025. Microsoft appeared to be coalescing around a fairly standard user experience for the multitude of Copilot experiences available to customers. The foundation for the Windows app was a well-structured HTML experience with headings and more that made reading answers quite straightforward.

Sometime after the start of the year, the Windows Copilot app appears to have changed some of its foundational technology, causing this sorry state of incomplete answers when read with a screen reader. One can only wonder how long this busted experience will be allowed to exist. It has already existed several months too long in my opinion.

To be honest, it is incredibly frustrating to write blog posts like this. I doubt they do much good, but I am not sure what other avenues to pursue. As I said, I reported this situation to Microsoft in early March of this year. Through multiple email exchanges, all I know is that Microsoft claims they are taking the issue seriously.

It is repeated situations like this, where accessibility issues linger on and on, that lead me to say we need a shift to true accountable accessibility. I outlined a proposal for such a shift earlier this year as well.

This is just one of several significant accessibility challenges with the Windows Copilot app. I’m a firm believer that all experiences should be usable and accessible, but this is especially true for AI-based experiences. We need to be able to access the responses AI puts out effectively and, obviously, comprehensively.

In fact, if I ask the Windows Copilot app to give me a definition of accessible software, the experience is, to me, completely unacceptable. Things start out well enough with automatic reading of a message that Copilot is preparing the answer. Reading of the beginning of the answer then starts, only to be interrupted by a restart of the reading from the beginning, which then automatically reads more of the full answer.

Ignoring the multiple tabs or shift+tabs it takes to get from the edit box for communicating with Copilot to the chat history, the answer in the chat history is again clipped compared to what you heard with the automatic reading. Here is my chat history as read by Narrator when moving to the Copilot chat history for the accessibility prompt and response.

give me the definition of accessible software please.
**Accessible software is software designed to be usable by people with a wide range of disabilities, ensuring equal access to digital tools and content.**
### 🧩 What Accessible Software Means
Accessible software incorporates features and design principles that remove barriers for users with:
– **Visual impairments** (e.g., blindness, low vision, color blindness)
– **Auditory impairments** (e.g., deafness, hard of hearing)
– **Motor impairments** (e.g., limited dexterity, paralysis)
– **Cognitive or neurol

If I copy the answer, you can read just how much is skipped, and also the fact that none of the structure, such as lists, is communicated when reading in the Copilot app. Here is the full copied answer to my question.

Accessible software is software designed to be usable by people with a wide range of disabilities, ensuring equal access to digital tools and content.

🧩 What Accessible Software Means

Accessible software incorporates features and design principles that remove barriers for users with:

  • Visual impairments (e.g., blindness, low vision, color blindness)
  • Auditory impairments (e.g., deafness, hard of hearing)
  • Motor impairments (e.g., limited dexterity, paralysis)
  • Cognitive or neurological conditions (e.g., dyslexia, ADHD, autism)

The goal is to ensure that everyone, regardless of ability, can interact with the software effectively and independently.

🔍 Key Features of Accessible Software

  • Keyboard navigation: All functions can be accessed without a mouse.
  • Screen reader compatibility: Proper use of semantic HTML, ARIA roles, and alt text for images.
  • Color contrast and scalable text: Ensures readability for users with low vision.
  • Captions and transcripts: For multimedia content, aiding users with hearing impairments.
  • Logical structure and focus management: Helps users navigate and understand content flow.
  • Customizable interfaces: Allows users to adjust layout, font size, and input methods.

📐 Standards and Compliance

Accessible software typically adheres to guidelines such as:

  • WCAG (Web Content Accessibility Guidelines) – the global standard for digital accessibility.
  • Section 508 (U.S.) – mandates accessible technology for federal agencies.
  • EN 301 549 (EU) – accessibility requirements for ICT products and services.

🧠 Why It Matters

  • Inclusivity: Promotes equal participation in digital spaces.
  • Legal compliance: Avoids lawsuits and meets regulatory requirements.
  • Better UX for all: Accessibility often improves usability for everyone.
  • Ethical responsibility: Reflects a commitment to social equity and universal design.

If you’d like examples of accessible software in action or want to benchmark a specific app, I can help you analyze it.


Image Description Toolkit 3.5 Beta Featuring Geolocation Data and Web Image Downloads

With more AI-driven development I have another sizable update for my Image Description Toolkit or IDT. There is a full What’s New document available.

Highlights for this beta release include use of geolocation data when present in images, the ability to download images from a specified web address and have them run through the image description system and numerous other enhancements.

You can also keep current with all my projects from my Projects page.


Feedback on IDT Demo Gallery

I’m looking to crowdsource some feedback. I’ve mentioned here a few times a collection of tools I’ve created called the Image Description Toolkit. The short version is that it is a way to get image descriptions that you can save, with a customizable level of detail. This can be a bit of an abstract concept in a world where many still do not understand alt text.

So, I’ve put together a demo page at www.kellford.com/idtdemo. It has the traditional image gallery but also a Description Explorer. The Description Explorer lets you see how different AI prompts result in different image descriptions and how different AI providers perform at image description. There are a total of four prompts (narrative, colorful, technical and detailed) across 10 different AI provider/model combinations.

For example, choose Description Explorer and then the option for all prompts from a provider. Note how the descriptions build on each other from Narrative to Colorful to Technical.

The point of this demo is to showcase the sort of data my toolkit can make available. Whether you are an individual like me who wants more access to my pictures with different descriptions, or you want longer descriptions for other purposes, this is an example of what my toolkit makes possible.

This is not the one-off random describe this picture type of system. There are hundreds of those. This is the I want permanent descriptions at scale type of system.

Feedback I’d love to have: First off, does the web page look reasonable and free from glaring problems? Do the concepts of what info you can get from my toolkit make sense from this demo? If not, what would help?

One very interesting challenge: AI vision models are, in my experience, not great at generating alt text. I tried a range of prompts to get them to do so. In the end, the alt text (not my longer descriptions) was created by taking the Narrative descriptions created by AI and running those through AI again, asking for alt text to be created. You can see an example of this in action by using the Image Browser and choosing to show the alt text visibly. Note that choosing this mode with a screen reader will result in the alt text reading twice: once as alt text on the images and once as the visible version of the alt text. I debated what to do about this situation and, so far, have opted to turn off the visible display of alt text on page load. I do want people to be able to see the alt text on demand because it is part of the overall system.
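The two-pass idea described above can be sketched as follows. Both functions are stand-ins: `describe_image` for a vision-model call using the Narrative prompt, and `condense_to_alt_text` for the second AI pass (faked here with simple truncation). The names and prompt wording are mine, not the toolkit’s:

```python
# Sketch of the two-pass alt-text approach: first get a full Narrative
# description from a vision model, then ask a second model to condense
# it into alt text. Both model calls are stubbed out here; wire them
# to your own provider. Function names and prompts are illustrative.

def describe_image(path: str) -> str:
    """Stand-in for a vision-model call using the Narrative prompt."""
    # In the real pipeline this would send the image to Ollama, OpenAI,
    # or Claude. Here we return a canned description for demonstration.
    return ("A golden retriever runs across a sunlit meadow, "
            "ears flying, with pine trees in the background.")

def condense_to_alt_text(description: str, max_words: int = 15) -> str:
    """Stand-in for the second AI pass that writes alt text.

    A real implementation would prompt a text model with something like
    'Write concise alt text for an image with this description: ...'.
    This stub just keeps the first sentence, capped at a word limit.
    """
    first_sentence = description.split(".")[0]
    return " ".join(first_sentence.split()[:max_words])

alt = condense_to_alt_text(describe_image("meadow.jpg"))
print(alt)
```

The key point is that the second pass works from text, not pixels, which in my experience is where current models do a better job of producing short, alt-text-shaped output.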

The toolkit allows this sort of data gathering and gallery creation to be done automatically. Just point the tools at a collection of images and an AI provider, and you can choose how the info is shown.

Again, visit http://www.kellford.com/idtdemo for the gallery. Visit https://github.com/kellylford/Image-Description-Toolkit/releases/tag/v3.0.1 for the toolkit itself and https://theideaplace.net/image-description-toolkit-3-0-available/ for my latest blog post on the toolkit.
