On Monday, OpenAI introduced GPT-4o, its new flagship generative AI model, with the “o” signifying “omni,” a nod to the model’s ability to handle text, speech, and video. GPT-4o will roll out gradually across OpenAI’s developer and consumer products over the coming weeks.
Mira Murati, OpenAI’s CTO, explained that GPT-4o offers “GPT-4-level” intelligence with enhanced capabilities across multiple modalities and media. During a streamed presentation at OpenAI’s San Francisco offices, Murati stated, “GPT-4o reasons across voice, text, and vision. This is incredibly important as it represents the future of interaction between humans and machines.”
Previously, OpenAI’s most advanced model was GPT-4 Turbo, which was trained on both images and text and could analyze the two to perform tasks like extracting text from images or describing their content. GPT-4o adds speech to the mix, significantly advancing its functionality.
One notable improvement with GPT-4o is the ChatGPT experience. ChatGPT already offered a voice mode that read its responses aloud using a text-to-speech model, but GPT-4o enhances this feature, letting users interact with ChatGPT more fluidly, akin to an assistant. Users can now ask questions and interrupt ChatGPT mid-response. OpenAI claims the model delivers “real-time” responsiveness and can pick up on nuances in a user’s voice, generating responses in a range of emotive styles, including singing.
Additionally, GPT-4o enhances ChatGPT’s vision capabilities. For instance, given a photo or a screenshot, ChatGPT can now quickly answer questions related to the image, from identifying software code issues to recognizing a brand of clothing. Murati mentioned that while GPT-4o can currently translate a menu written in a foreign language, future versions could allow ChatGPT to “watch” a live sports game and explain the rules.
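For developers, this same multimodal question-answering is exposed through OpenAI’s Chat Completions API. Here is a minimal sketch, assuming the openai Python SDK (v1+), an OPENAI_API_KEY set in the environment, and a placeholder image URL:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask GPT-4o a question about an image, mixing text and image content parts.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What brand is the jacket in this photo?"},
                # Hypothetical image URL, for illustration only.
                {"type": "image_url", "image_url": {"url": "https://example.com/jacket.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```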
Murati emphasized that despite the increasing complexity of these models, the goal is to make interactions more natural and effortless, shifting the focus away from the user interface and towards collaboration with ChatGPT. She highlighted that while the past few years have focused on improving model intelligence, GPT-4o represents a significant step towards enhancing usability.
Moreover, OpenAI says GPT-4o is more multilingual, with improved performance in around 50 languages. And in OpenAI’s API and Microsoft’s Azure OpenAI Service, GPT-4o is twice as fast as, half the price of, and has higher rate limits than GPT-4 Turbo.
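Because GPT-4o is served through the same Chat Completions endpoint as GPT-4 Turbo, migrating should amount to a one-line model change in most API code. A minimal sketch, under the same SDK assumptions as above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The same Chat Completions call as before; only the model name changes.
response = client.chat.completions.create(
    model="gpt-4o",  # previously "gpt-4-turbo"
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize GPT-4o's headline features in one sentence."},
    ],
)
print(response.choices[0].message.content)
```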
Currently, GPT-4o’s audio capabilities are not available in the API for all customers; citing the risk of misuse, OpenAI plans to launch them first to a select group of trusted partners.
Starting today, GPT-4o is available in the free tier of ChatGPT and to subscribers of OpenAI’s premium ChatGPT Plus and Team plans, who get “5x higher” message limits. When users hit the rate limit, ChatGPT automatically switches to the older GPT-3.5 model. The enhanced voice experience powered by GPT-4o will arrive in alpha for Plus users within the next month, alongside enterprise-focused options.
Additionally, OpenAI announced a refreshed ChatGPT web UI with a more conversational home screen and message layout, plus a desktop version of ChatGPT for macOS. The desktop app lets users ask questions via a keyboard shortcut or take and discuss screenshots. ChatGPT Plus users will get first access, with a Windows version expected later in the year.
Furthermore, OpenAI’s GPT Store, which offers tools for creating third-party chatbots built on its AI models, is now accessible to free-tier users. Free users can also access formerly paywalled features, such as a memory capability that remembers user preferences, file and photo uploads, and web search for timely answers.