A Pro’s Guide to Multimodal Magic: Python-Powered Gemini for Creative Breakthroughs

In the ever-evolving landscape of artificial intelligence, Google’s Gemini has emerged as a groundbreaking large language model (LLM) that redefines the boundaries of multimodality. This sophisticated tool, trained on a vast corpus of text, images, audio, video, and even computer code, possesses the remarkable ability to seamlessly interweave these disparate modalities, unlocking a world of creative possibilities.

Harnessing Gemini’s Power with Python

To fully harness Gemini’s potential, we turn to the versatility of Python, a powerful programming language that seamlessly integrates with AI frameworks like Gemini.

Delving into the World of Multimodal Prompts

Multimodal prompts are the cornerstone of Gemini’s multimodal capabilities. By meticulously crafting prompts that combine text, imagery, or other modalities, we can instruct Gemini to generate content that seamlessly blends these elements.
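
As a rough sketch of what such a prompt looks like under the hood, the request body is a single user turn whose parts mix plain text with inline image data. The dictionary layout below mirrors the REST API body (the SDK normally builds this for you), so treat it as an illustration rather than code you must write yourself:

```python
import base64

def build_multimodal_prompt(text: str, image_bytes: bytes) -> dict:
    """Build the JSON-style body a Gemini multimodal request carries:
    one user turn whose parts mix plain text with inline image data."""
    return {
        "contents": [{
            "role": "user",
            "parts": [
                {"text": text},
                {"inline_data": {
                    "mime_type": "image/jpeg",
                    # Raw bytes are base64-encoded for transport
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ],
        }]
    }

prompt = build_multimodal_prompt("What do you see?", b"\xff\xd8\xff\xe0")
print(len(prompt["contents"][0]["parts"]))  # → 2 (one text part, one image part)
```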

What I found even more intriguing was the possibility of helping people. In a matter of days, using Python and the Gemini model, I was able to create an app that could read from my webcam and tell me what it sees. One of the issues my grandmother had was in identifying money. Paper money has the same size and shape, so it is hard for blind people to know the denomination of each bill. My grandmother would have the bank turn down the corners of the bills in certain ways so she knew what each bill was.

Gemini could tell me what each bill was when I simply showed it to my webcam.
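
The app itself isn’t shown in this post, but its two moving parts can be sketched: grabbing a frame from the webcam and pulling a denomination out of Gemini’s free-text reply. Both helpers below are assumptions for illustration (`capture_frame` assumes the `opencv-python` package is installed; the regex in `parse_denomination` is mine, since the model’s exact wording varies):

```python
import re
from pathlib import Path

def parse_denomination(model_reply: str) -> str:
    """Pull a dollar amount such as '$20' out of Gemini's free-text reply.
    The regex is an assumption -- the model's wording varies."""
    match = re.search(r"\$\s?(\d+)", model_reply)
    return f"${match.group(1)}" if match else "unknown"

def capture_frame(path: str = "frame.jpg") -> bytes:
    """Grab one webcam frame (assumes the opencv-python package)."""
    import cv2  # imported here so the rest of the module works without it
    cam = cv2.VideoCapture(0)
    ok, frame = cam.read()
    cam.release()
    if not ok:
        raise RuntimeError("could not read from the webcam")
    cv2.imwrite(path, frame)
    return Path(path).read_bytes()

print(parse_denomination("This appears to be a $20 bill."))  # → $20
```

The frame bytes would then go to Gemini exactly as in the main script below, with a prompt like “What denomination is this bill?”.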

I was also able, again using Python, to scan a document and have Gemini create a study-guide quiz for me.
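
Since that app only sends text, it can use the text-only model. A minimal sketch of the prompt-building step follows; the wording is my assumption, not the exact prompt from the app:

```python
def quiz_prompt(document_text: str, num_questions: int = 5) -> str:
    """Build a study-guide prompt for the text-only gemini-pro model.
    The wording is an assumption, not the exact prompt from the app."""
    return (
        f"Create a study-guide quiz with {num_questions} questions, "
        f"followed by an answer key, based on this document:\n\n{document_text}"
    )

prompt = quiz_prompt("Photosynthesis converts light energy into chemical energy.", 3)
print(prompt.splitlines()[0])
```

The resulting string would be passed to the gemini-pro model the same way the image prompt is passed to gemini-pro-vision below.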

The Code

import google.generativeai as genai
from google.ai import generativelanguage_v1beta
from dotenv import load_dotenv
import os
import requests
import asyncio
from pathlib import Path


# Load the API key for accessing the API, stored in .env
load_dotenv()
api_key = os.getenv("API_KEY")
path_to_service_account_key_file = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")

# Configure the client library and set the environment variable
genai.configure(api_key=api_key)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path_to_service_account_key_file

# Set up the model
generation_config = {
    'temperature': 0.9,
    'top_p': 1,
    'top_k': 40,
    'max_output_tokens': 2048,
    'stop_sequences': [],
}

safety_settings = [{"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
                   {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
                   {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
                   {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"}]

async def sample_generate_text_image_content(text, image):
    """
    This function sends a text and image request to Gemini.

    :param text: The text prompt from the user.
    :param image: The image from the user as a byte array
    :return: The response from Gemini
    """
    # Create a client
    client = generativelanguage_v1beta.GenerativeServiceAsyncClient()

    image_blob = generativelanguage_v1beta.Blob(mime_type="image/jpeg", data=image)
    text_part = generativelanguage_v1beta.Part(text=text)
    image_part = generativelanguage_v1beta.Part(inline_data=image_blob)
    contents = generativelanguage_v1beta.Content(parts=[image_part, text_part], role="user")

    # Initialize request argument(s)
    request = generativelanguage_v1beta.GenerateContentRequest(
        model="models/gemini-pro-vision",
        contents=[contents],
    )

    # Make the request
    response = await client.generate_content(request=request)

    # Handle the response
    return response.candidates[0].content.parts[0].text

async def main():
    # Gemini provides a multimodal model (gemini-pro-vision) that accepts both text and images as inputs. The
    # GenerativeModel.generate_content API is designed to handle multi-media prompts and returns a text output.

    # Download an image to test with
    if not os.path.exists("image.jpg"):
        image_url = "https://storage.googleapis.com/generativeai-downloads/images/scones.jpg"
        response = requests.get(image_url)
        if response.status_code == 200:
            print("Image downloaded successfully")
            with open("image.jpg", "wb") as f:
                f.write(response.content)
        else:
            print("Error downloading image:", response.status_code)

    image_bites = Path("image.jpg").read_bytes()
    response = await sample_generate_text_image_content('What do you see?', image_bites)
    print(f'Text Image response: {response}')


# Just for testing

if __name__ == "__main__":
    asyncio.run(main())

This Python code uses Google AI’s Generative Language API to generate text descriptions from both text prompts and image inputs. It leverages the gemini-pro-vision multimodal model, which is specifically designed to handle both text and image prompts.

Code Breakdown

Importing Libraries:

  • google.generativeai as genai: Imports the GenerativeAI library from Google AI.
  • from google.ai import generativelanguage_v1beta: Imports specific functions from the GenerativeLanguage API.
  • from dotenv import load_dotenv: Imports the dotenv library for loading environment variables.
  • import os: Imports the os module for accessing operating system functionalities.
  • import requests: Imports the requests library for making HTTP requests.
  • import asyncio: Imports the asyncio library for asynchronous programming.
  • from pathlib import Path: Imports the Path class from the pathlib module for working with file paths.

Loading Environment Variables:

This code loads environment variables from a .env file, which is typically used to store sensitive information like API keys.
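
For readers curious what `load_dotenv` actually does, the sketch below is a rough, simplified stand-in: it reads `KEY=VALUE` lines from a file and puts them into `os.environ` without overwriting anything already set. (The real python-dotenv library handles quoting and more edge cases.)

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Roughly what dotenv.load_dotenv does: read KEY=VALUE lines
    from a file into os.environ, without overwriting existing values."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```

A matching `.env` file would contain lines such as `API_KEY=your-key-here` and `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json`, and should never be committed to source control.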


Configuring GenerativeAI:

  • api_key = os.getenv("API_KEY"): Retrieves the API key from the loaded environment variables.
  • path_to_service_account_key_file = os.getenv("GOOGLE_APPLICATION_CREDENTIALS"): Retrieves the path to the service account key file, which grants access to the API.
  • genai.configure(api_key=api_key): Configures the GenerativeAI library using the obtained API key.
  • os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path_to_service_account_key_file: Sets the environment variable for the service account key file.

Setting Up Model and Safety Settings:

  • generation_config = {...}: Defines a dictionary containing configuration settings for generating text.
  • safety_settings = [...]: Defines a list of safety settings that control the type of content generated.

Sampling Text from Image and Text Prompt:

  • async def sample_generate_text_image_content(text, image):: Defines an asynchronous function for generating text from an image and text prompt.

Image Download and Handling:

In order to test this code, we download a sample image to send. Feel free to replace it with your own image.

Generating Text from Image and Text Prompt:

It is very important to send the image to Gemini as bytes.

  • image_bites = Path("image.jpg").read_bytes(): Reads the image data from the file as bytes.
  • response = await sample_generate_text_image_content('What do you see?', image_bites): Calls the asynchronous function to generate text from the image and prompt.
  • print(f'Text Image response: {response}'): Prints the generated text to the console.

Main Function:

async def main():: Defines the main asynchronous function that coordinates the code execution.


  • if __name__ == "__main__":: Ensures the code only runs when the script is executed directly.
  • asyncio.run(main()): Initiates the asynchronous execution of the main function.


This Python code demonstrates the capability of Google AI’s Generative Language API to generate text descriptions from both text prompts and image inputs. It uses the gemini-pro-vision multimodal model to process and integrate text and image information, and it employs asynchronous programming to handle the image download and text generation efficiently.

Final note: when sending to the gemini-pro-vision model, you must include an image. Text is optional, but the image is not. If you just want to send text, you will need to use the gemini-pro model.
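
That rule can be captured in a tiny helper. This is just a sketch of the decision, using the model names from this post:

```python
def pick_model(has_image: bool) -> str:
    """gemini-pro-vision requires at least one image part;
    gemini-pro handles text-only prompts."""
    return "gemini-pro-vision" if has_image else "gemini-pro"

print(pick_model(False))  # → gemini-pro
```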

If you’re wondering: I created the code, and Gemini helped me write this blog post. I gave it the Wikipedia page on Gemini AI and my code, and it did the rest, with just a bit of editing from me. Comment and let me know how you think it did.

About Linda Lawton

My name is Linda Lawton. I have more than 20 years of experience working as an application developer and a database expert. I have also been working with Google APIs since 2012, and I have been contributing to the Google .NET client library since 2013. In 2013 I became a Google Developer Expert for Google Analytics.
