GPT-4V and Gemini Pro for Photo Captioning
A head-to-head comparison of using commercial LLMs for creating captions of personal photos with face recognition
If your family is like mine, you probably have a sizable collection of old and new photos. My dad recently purchased a set of automatic photo scanners for me and my siblings to create digital images from thousands of family photos. Now, we are trying to figure out what to do with these images. My first thought was to see if multi-modal Large Language Models (LLMs) could help organize our photos by automatically generating captions.
For this article, I compared OpenAI’s GPT-4 with Vision [1] to Google’s Gemini Pro [2] for creating captions of photos. Both systems are available as commercial services with APIs for automating tasks. Here is a quick example.
Jen and Rob Singing Outdoors, Image by Author
First, I asked GPT-4V, “What’s happening in this photo?” It came back with three long paragraphs with many details. I then amended my prompt to make the caption more concise. Here’s the interaction.
RobG: Write a one-sentence caption for this photo.
GPT-4V: Capturing a playful moment, an elderly couple engages in a mock karaoke duet, with the gentleman pointing dramatically at the camera.
I then asked Gemini Pro to do the same thing, adjusting the prompt to get a well-detailed response.
RobG: Write a brief caption for this photo.
Gemini Pro: A man and a woman are singing karaoke together outside. The man is holding a microphone and the woman is looking at him and smiling.
Both responses were good. The one from GPT-4V expresses more emotional context, using words like “playful” and “dramatically.” The one from Gemini Pro is more straightforward, though still accurate. But the key thing to note is that neither system wrote captions using the names of the people in the photo.
After I give some background on GPT-4V and Gemini, I’ll show you how I coaxed both systems into adding names to the generated captions with help from an open-source face recognition system, aptly named face_recognition [3].
OpenAI’s GPT-4V
User Interface for OpenAI’s ChatGPT running GPT-4V, Screenshot by Author
I previously wrote about using OpenAI’s GPT-4V model as a critic for visual art so you can refer to that article for a full write-up of the system. But I’ll go over the basics here.
With the GPT-4V model, users provide one or more images and a text prompt, and it responds with text. Here’s what OpenAI said about the training process.
Similar to GPT-4, training of GPT-4V was completed in 2022 and we began providing early access to the system in March 2023. As GPT-4 is the technology behind the visual capabilities of GPT-4V, its training process was the same. The pre-trained model was first trained to predict the next word in a document, using a large dataset of text and image data from the Internet as well as licensed sources of data. It was then fine-tuned with additional data, using an algorithm called reinforcement learning from human feedback (RLHF) to produce outputs that are preferred by human trainers. — OpenAI [1]
Preventing personal identification in GPT-4V
One of OpenAI’s safety goals for GPT-4V was to prevent the system from identifying people in images. They examined the model’s ability to identify people in photos, utilizing public datasets and images of Congress members for public figures and employee images for semi-private and private individuals. The model demonstrated over 98% effectiveness in refusing to identify these individuals [1].
When I sent Joe Biden’s official presidential portrait to the model and asked who the person was, it responded.
RobG: Who is this person?
GPT-4V: I can’t assist with identifying or making assumptions about people in images.
OK, that spells out OpenAI’s policy. It didn’t answer the question. Later in this article, you will see that the system is OK with recognizing people in images when the identities are provided in the text prompts.
Use of customer data with GPT-4V
According to this note from OpenAI, they will use customers’ prompts and responses from the interactive chat system to train their models unless customers turn off chat history. However, OpenAI will not use prompts and responses for training their models when they are submitted through the API.
Costs for running GPT-4V
The GPT-4V model is available as an interactive chatbot through OpenAI’s ChatGPT Plus service for US$20 per month or via their public API as a pay-as-you-go service. The costs for using the API vary based on the lengths of the prompt and response and the image’s resolution. For example, if I had run the test above via the API, it would have cost US$0.00176. Running it on 1,000 images would cost about US$1.76. The full details on GPT-4V pricing are here.
Sample code for GPT-4V
Here is some sample Python code that shows how to use OpenAI’s API to create a caption for a photo.
<span id="e667" data-selectable-paragraph=""><span>from</span> openai <span>import</span> OpenAI<br><span>import</span> base64<br><br>client = OpenAI(api_key = <span>"your_openai_api_key"</span>)<br><span>def</span> <span>encode_image</span>(<span>image_path</span>):<br> <span>with</span> <span>open</span>(image_path, <span>"rb"</span>) <span>as</span> image_file:<br> <span>return</span> base64.b64encode(image_file.read()).decode(<span>'utf-8'</span>)<br>image_path = <span>"your_image.jpg"</span><br>base64_image = encode_image(image_path)<br>prompt = <span>"Write a one-sentence caption for this photo."</span><br>response = client.chat.completions.create(<br> model=<span>"gpt-4-vision-preview"</span>,<br> messages=[<br> {<br> <span>"role"</span>: <span>"user"</span>,<br> <span>"content"</span>: [<br> {<span>"type"</span>: <span>"text"</span>, <span>"text"</span>: prompt},<br> {<br> <span>"type"</span>: <span>"image_url"</span>,<br> <span>"image_url"</span>: <span>f"data:image/jpeg;base64,<span>{base64_image}</span>"</span><br> },<br> ],<br> }<br> ]<br>)<br><span>print</span>(response.choices[<span>0</span>].message.content)</span>
First, the client is initialized with an API key. Then, the image is encoded into base64 format. A request is sent to the API with the encoded image and the text prompt, and the response from the model is printed out.
Gemini Pro
User Interface for Google’s Vertex AI running Gemini Pro Vision, Screenshot by Author
About a month ago, Google released its Gemini series of multi-modal language models. Their introductory paper shows that Gemini outperforms GPT-4 in many industry benchmarks.
We’ve been rigorously testing our Gemini models and evaluating their performance on a wide variety of tasks. From natural image, audio and video understanding to mathematical reasoning, Gemini Ultra’s performance exceeds current state-of-the-art results on 30 of the 32 widely-used academic benchmarks used in large language model (LLM) research and development. — Sundar Pichai and Demis Hassabis, Google
Gemini can take in a prompt, images, or video clip and respond with text. It is available as an interactive chatbot and via their public API. You can explore using the API with Google’s Vertex AI Studio.
Societal impact of Gemini Pro
Google studied the potential impact on society and took steps to mitigate harm to people. Here’s what they said about these efforts.
We develop model impact assessments to identify, assess, and document key downstream societal benefits and harms associated with the development of advanced Gemini models. … Areas of focus include: factuality, child safety, harmful content, cybersecurity, biorisk, representation and inclusivity. These assessments are updated in tandem with model development. Impact assessments are used to guide mitigation and product delivery efforts, and inform deployment decisions. Gemini impact assessments spanned across different capabilities of Gemini models, assessing the potential consequences of these capabilities with Google’s AI Principles — Google [2]
Although they addressed many societal concerns, they didn’t directly address the issue of identifying people. So, I ran a quick test using Gemini Pro in Vertex AI Studio.
RobG: Who is this person?
Gemini: This is Joe Biden, the 46th and current president of the United States.
So Gemini Pro doesn’t appear to have any protections against identifying famous people.
Costs for running Gemini Pro
Gemini Pro is free in the interactive version and via the public API. However, Google has announced the API pricing here. When they start charging, it will be less expensive than OpenAI’s GPT-4V. Here’s an estimated cost comparison for creating 1,000 photo captions:
- Gemini Pro: US$0.33
- GPT-4V: US$1.76
OpenAI’s GPT-4V is more than five times the cost of Gemini Pro.
Use of customer data with Gemini Pro
According to this note from Google, they don’t use user prompts and responses to train their AI models.
Sample code for Gemini Pro
Here is some sample Python code that shows how to use Google’s API to pass a text prompt and an image into Gemini Pro.
<span id="64f6" data-selectable-paragraph=""><span>from</span> google.colab <span>import</span> auth<br><span>from</span> vertexai.preview.generative_models <span>import</span> GenerativeModel, Part, Image<br><br>auth.authenticate_user(project_id=<span>"your_vertex_ai_project_id"</span>)<br>model = GenerativeModel(<span>"gemini-pro-vision"</span>)<br>prompt = <span>"Write a brief caption for this photo."</span><br>img = Part.from_image(Image.load_from_file(<span>"your_picture.jpg"</span>))<br>response = model.generate_content([prompt, img], stream=<span>False</span>)<br><span>print</span>(response.candidates[<span>0</span>].content.parts[<span>0</span>].text)</span>
Before running this, you have to set up a project in Vertex AI Studio and then pass in your project ID. The code is tidy and straightforward. After authenticating the API and defining the model, the prompt and the image are passed in, and the response is printed out.
Facial Recognition in Personal Photos
To create more meaningful photo captions, I used an open-source facial recognition project to learn who my family and friends are in our collection of photos.
I built an interactive system, which I named TripDown ML, to automate the process; the name refers to using Machine Learning to take a trip down memory lane. Here are the system components.
TripDown ML System Components, Diagram by Author
The process started when my siblings scanned our collection of photographs using bulk photo scanners. Once these photos were digitized, I uploaded them to Google Drive for storage in the cloud.
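The later code examples read photos from paths under /content/drive, which in a Colab notebook typically means mounting Google Drive first. Here is a minimal sketch, assuming the notebook runs in Colab.

```python
from google.colab import drive

# Mount Google Drive so the scanned photos appear under /content/drive/MyDrive
drive.mount('/content/drive')
```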
The next phase involved the ML aspect, using the open-source face_recognition project by Adam Geitgey. This tool processed the images stored on Google Drive to find the faces within them. It then created an array of embeddings, which are detailed numerical representations of the facial features in the photos.
I used these embeddings to search through the photo collection in an interactive user interface. When I entered a family member’s or friend’s name, the system used the embeddings to find and display all the photos where that person’s face appeared.
Face locations
It’s straightforward to use face_recognition to find all faces in a photo. I loaded the image using load_image_file() and then called face_locations() to get a list of rectangles that bound the faces. Here’s the Python code.
<span id="29cd" data-selectable-paragraph=""><span>import</span> face_recognition<br><br>image_1_path = <span>"/content/drive/MyDrive/photos/R01/Jen and Rob backyard.jpg"</span><br>image_1 = face_recognition.load_image_file(image_1_path)<br>face_locations_1 = face_recognition.face_locations(image_1, model=<span>"cnn"</span>)<br><span>print</span>(face_locations_1)<br><br>image_2_path = <span>"/content/drive/MyDrive/photos/R01/Rob and Jen Thailand.jpg"</span><br>image_2 = face_recognition.load_image_file(image_2_path)<br>face_locations_2 = face_recognition.face_locations(image_2, model=<span>"cnn"</span>)<br><span>print</span>(face_locations_2)<br><br><br></span>
You can see that I am using a built-in model called “cnn,” a Convolutional Neural Network trained to find faces in an image. It returns a list of location tuples in the order (top, right, bottom, left). Note that this ordering differs from the (left, top, right, bottom) convention used by many other Python imaging modules.
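As a quick illustration, here is a minimal sketch of cropping one of those faces using a (top, right, bottom, left) tuple; it reuses the variables from the block above and relies on the fact that load_image_file() returns a NumPy array.

```python
from PIL import Image

# Unpack the (top, right, bottom, left) tuple for the first detected face
top, right, bottom, left = face_locations_1[0]

# The loaded image is a NumPy array, so the face can be cropped by slicing rows and columns
face_crop = image_1[top:bottom, left:right]

# Convert the crop to a PIL image to save or display it
Image.fromarray(face_crop).save("face_1.jpg")
```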
Here’s what the four face images look like.
Results of face_locations(), Images by Author
Face encodings
Now, the fun part is passing these images into the face_encodings call to get vectors with 128 values that encapsulate the features in each face. Here’s the code.
<span id="e932" data-selectable-paragraph=""><span>import</span> numpy <span>as</span> np<br><span>import</span> face_recognition<br><br><br>encoding_1 = face_recognition.face_encodings(image_1, face_locations_1)<br>encoding_2 = face_recognition.face_encodings(image_2, face_locations_2)<br><br>stack_1 = np.stack(encoding_1)<br>stack_2 = np.stack(encoding_2)<br><br>all_distances = np.linalg.norm(stack_1[:, np.newaxis] - stack_2, axis=<span>2</span>)<br><span>print</span>(all_distances)<br><br><br></span>
Using the images and face locations from the first code block, this code calls face_encodings() to get encodings for all four faces, two in the first image and two in the second. I then used a NumPy function to calculate the distance between each pair of encoding vectors. Smaller numbers mean a better match. It was close, but the system worked.
The distance between Rob1 and Rob2 was 0.579, and between Jen1 and Jen2 it was 0.574. The cross-person distances were both larger: Rob1 to Jen2 was 0.64, and Jen1 to Rob2 was the largest at 0.84. Note that the default threshold for a match is a distance under 0.6, so the matches were narrowly successful.
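For reference, the same 0.6 threshold can be applied with the library’s built-in compare_faces() helper. Here is a minimal sketch that reuses the encodings from the block above.

```python
# compare_faces() applies the distance threshold (the "tolerance") for you
# and returns a list of booleans, one per known encoding
matches = face_recognition.compare_faces(
    encoding_1,        # known encodings: the two faces from the first photo
    encoding_2[0],     # candidate encoding: the first face found in the second photo
    tolerance=0.6,     # default threshold; smaller values are stricter
)
print(matches)  # e.g. [True, False] if the candidate matches the first known face
```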
Interactive User Interface
To efficiently enter labels for the faces, I used a package called ipywidgets to create an interactive user interface.
This UI allowed me to create labels for the faces of the people in the photo collection. I then saved the labels and embeddings to my Google Drive for later use.
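The UI code itself isn’t shown here, but below is a minimal sketch of what such an ipywidgets labeling loop could look like; the face_crops list, the labels dictionary, and the file names are hypothetical placeholders, not the actual implementation.

```python
import ipywidgets as widgets
from IPython.display import display

# Hypothetical placeholders: cropped face images and a dict to collect labels
face_crops = ["face_1.jpg", "face_2.jpg"]
labels = {}
current = {"index": 0}

image_widget = widgets.Image(value=open(face_crops[0], "rb").read(), width=160)
name_box = widgets.Text(description="Name:")
save_button = widgets.Button(description="Save label")

def on_save(_):
    # Record the label for the current face and advance to the next one
    labels[face_crops[current["index"]]] = name_box.value
    current["index"] += 1
    if current["index"] < len(face_crops):
        image_widget.value = open(face_crops[current["index"]], "rb").read()
        name_box.value = ""

save_button.on_click(on_save)
display(image_widget, name_box, save_button)
```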
Automatic face recognition
After I collected a set of labeled faces, I expanded the selection by comparing them against all of the other faces in the dataset. Here’s the code.
<span id="0bba" data-selectable-paragraph=""><span>import</span> numpy <span>as</span> np<br><br>labels_set_by_user = [<span>"Rob"</span>, <span>"Jen"</span>]<br>avg_encodings = {}<br><span>for</span> label <span>in</span> labels_set_by_user:<br> label_encodings = embeddings[df[<span>'label'</span>] == label]<br> avg_encodings[label] = np.mean(label_encodings, axis=<span>0</span>)<br><span>def</span> <span>find_match</span>(<span>encoding, avg_encodings, threshold=<span>0.6</span></span>):<br> <span>for</span> label, avg_encoding <span>in</span> avg_encodings.items():<br> <span>if</span> face_recognition.face_distance([avg_encoding], encoding)[<span>0</span>] < threshold:<br> <span>return</span> label<br> <span>return</span> <span>None</span><br>df[<span>'predicted_label'</span>] = <span>None</span> <br><span>for</span> index, row <span>in</span> df.iterrows():<br> current_encoding = embeddings[row[<span>'encoding_index'</span>]]<br> predicted_label = find_match(current_encoding, avg_encodings)<br> <span>if</span> predicted_label <span>is</span> <span>not</span> <span>None</span>:<br> df.at[index, <span>'predicted_label'</span>] = predicted_label</span>
I started by creating the average embeddings for each of the specified labels. This way, the system has a good baseline to find an expanded set of faces. I then checked for all faces that matched the baseline embeddings using a threshold of 0.6. For the matches, I set the predicted_label.
This code will give me a list of all the photos where Jen and I appear.
<span id="3f44" data-selectable-paragraph=""><span>def</span> <span>find_images_with_all_labels</span>(<span>labels</span>):<br> <br> valid_file_paths = pd.DataFrame(columns=[<span>'file_path'</span>])<br> <br> unique_file_paths = df[<span>'file_path'</span>].unique()<br> <br> <span>for</span> file_path <span>in</span> unique_file_paths:<br> <br> rows = df[df[<span>'file_path'</span>] == file_path]<br> <br> <span>if</span> <span>all</span>(label <span>in</span> rows[<span>'predicted_label'</span>].values <span>for</span> label <span>in</span> labels):<br> <br> valid_file_paths = valid_file_paths.append({<span>'file_path'</span>: file_path}, ignore_index=<span>True</span>)<br> <span>return</span> valid_file_paths<br>labels_to_check = [<span>'Rob'</span>, <span>'Jen'</span>]<br>valid_paths_df = find_images_with_all_labels(labels_to_check)</span>
I started by creating a list of all the unique images and then iterated through the list and checked for the presence of all specified labels.
Generating Photo Captions with Names
Now that I have a list of images where Jen and I appear, I can pass these into GPT-4V and Gemini Pro.
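As a rough sketch of how the pieces could be wired together, the recognized names can simply be folded into the caption prompt. The sketch reuses the client and encode_image() function from the earlier GPT-4V example; make_caption() is a hypothetical helper, not part of either API.

```python
def make_caption(image_path, names):
    # Hypothetical helper: fold the recognized names into the caption prompt
    prompt = f"Write a one-sentence caption of this photo with {' and '.join(names)}."
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Caption every photo where both Rob and Jen were recognized
for path in valid_paths_df['file_path']:
    print(path, make_caption(path, ['Rob', 'Jen']))
```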
Captions from GPT-4V
Here are the captions from GPT-4V with the following prompt, “Write a one-sentence caption of this photo with Rob and Jen.”
Photos with Captions Generated with GPT-4V, Images by Author
These are good! Again, you can see how GPT-4V adds some emotional content, using words like “joyful,” “breathtaking,” and “cheerful” (twice!).
Captions from Gemini Pro
Here are captions for the same four photos generated by Gemini Pro.
Photos with Captions Generated with Gemini Pro, Images by Author
These are good, too. I liked that it correctly recognized the Hong Kong vista in the photo at the lower right. However, it made an incorrect assumption about the baby in the upper right image. This is a good reminder to check the captions for inaccuracies.
Final Thoughts
In this project, I explored how OpenAI’s GPT-4V and Google’s Gemini Pro can be used to add captions to a personal photo collection. Both systems have strengths: GPT-4V adds a more expressive touch to the descriptions, while Gemini Pro offers straightforward captions. Integrating an open-source face recognition system was vital to this process, making the captions more personalized and directly relevant to the photos.
This project underscores the importance of being mindful of the ethical and privacy concerns associated with AI systems, particularly facial recognition. In developing this project, I’ve been careful to adhere to privacy laws and ethical guidelines. The goal was to enrich personal memories within a private sphere, strictly controlling recognition of user-provided photos and avoiding external data sharing. It’s a balance between leveraging technology’s capabilities and upholding ethical standards in AI and facial recognition, with a strong emphasis on user consent and data control.
Next Steps
As a possible next step for this project, I could integrate the AI-generated captions into a semantic search system using OpenAI’s CLIP model, enabling users to search their photo collections through natural language queries. By indexing the photos with their associated captions in a searchable database, the system would understand and match the semantic content of text queries with images, offering a more intuitive and meaningful way to retrieve memories with specific people. The focus would be on refining search results for accuracy and creating a user-friendly interface, transforming how individuals interact with and explore their personal photo collections.
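As a very rough sketch of what the retrieval side could look like, assuming the Hugging Face transformers implementation of CLIP and a hypothetical photo_paths list, photos could be embedded once and then ranked against a natural-language query; the generated captions could be indexed the same way using the text encoder.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical setup: the photos to index for semantic search
photo_paths = ["photo_1.jpg", "photo_2.jpg"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed every photo once and normalize, so a query reduces to a dot product
images = [Image.open(p) for p in photo_paths]
image_inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)

# Embed a natural-language query and rank the photos by cosine similarity
query = "Rob and Jen singing karaoke outdoors"
text_inputs = processor(text=[query], return_tensors="pt", padding=True)
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

scores = (image_features @ text_features.T).squeeze(1)
best = scores.argsort(descending=True)
print([photo_paths[i] for i in best])
```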
Source Code
I am releasing the source code for this project on GitHub under the Creative Commons Attribution Sharealike license.
References
[1] OpenAI, GPT-4V(ision) System Card (2023)
[2] Gemini Team, Google, Gemini: A Family of Highly Capable Multi-modal Models (2023)
[3] Adam Geitgey, face_recognition (2018)