"If GPT-5 is released, OpenAI is still far ahead. If it is AI Search or voice assistant, it means OpenAI has fallen."
A practitioner in the AI large-model industry told Huxiu that the industry's expectations for OpenAI are too high: unless it ships a disruptive innovation on the level of GPT-5, it will be hard to satisfy the audience's appetite.
Although Sam Altman had announced before OpenAI's livestream that neither GPT-5 nor GPT-4.5 would be released, outside expectations for OpenAI had long since become unstoppable.
In the early morning of May 14th, Beijing time, OpenAI announced its latest model, GPT-4o, where the "o" stands for "omni". The 20-odd-minute live demo showed an AI interaction experience far beyond any current voice assistant, largely matching what foreign media had previously reported.
Although the GPT-4o demo can still be called "explosive", industry insiders generally believe it falls short of the word "magic" in Altman's teaser. Many even feel these product features "deviate from OpenAI's mission."
OpenAI's PR team seemed to have anticipated this turn of public opinion. Altman addressed it at the event and in a blog post afterwards:
"A key part of our mission is to make very powerful AI tools available to people for free (or at a discounted price). I'm very proud that we're making the world's best models available for free in ChatGPT, with no ads or anything like that.
When we founded OpenAI, our original idea was that we would create AI and use it to create all kinds of benefits for the world. Instead, it now looks like we're going to create AI and then other people will use it to create all kinds of amazing things that benefit us all."
"If we had to wait 5 seconds for 'each' response, the user experience would plummet. Even if the synthetic audio itself sounds real, it would destroy the immersion and make people feel lifeless."
" On the eve of the OpenAI conference, Jim Fan, head of Nvidia’s Embodied AI, predicted the voice assistant that OpenAI would release at X and proposed:
Almost all voice AI goes through three stages:
1. Speech recognition, or "ASR": audio -> text1, e.g. Whisper;
2. An LLM that plans what to say next: text1 -> text2;
3. Speech synthesis, or "TTS": text2 -> audio, e.g. ElevenLabs or VALL-E.
Chaining these three stages introduces huge latency; a minimal sketch of such a cascaded pipeline follows below.
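To make the latency argument concrete, here is a minimal sketch of such a cascade. The three stub functions and their sleep times are hypothetical stand-ins, not any vendor's real API; the point is only that three serial stages add their delays together:

```python
import time

def asr(audio: bytes) -> str:
    # Stage 1: speech recognition (e.g. Whisper). Stubbed with a plausible delay.
    time.sleep(0.5)
    return "What's the weather like today?"

def llm(text1: str) -> str:
    # Stage 2: an LLM plans what to say next. Typically the slowest stage.
    time.sleep(2.0)
    return "It looks sunny with a light breeze."

def tts(text2: str) -> bytes:
    # Stage 3: speech synthesis (e.g. ElevenLabs or VALL-E). Stubbed delay.
    time.sleep(0.5)
    return b"<synthesized audio>"

start = time.time()
reply = tts(llm(asr(b"<user audio>")))  # three serial hops
print(f"end-to-end latency: {time.time() - start:.1f}s")  # ~3.0s: the delays add up
```

An end-to-end model like GPT-4o collapses these hops into a single network, which is how it gets response times down to the hundreds of milliseconds cited below.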
In terms of response speed, GPT-4o has all but solved the latency problem: its shortest response time to audio input is 232 milliseconds and its average is 320 milliseconds, close to human response times in conversation. By comparison, ChatGPT's existing voice mode (without GPT-4o) averages 2.8 seconds of delay on GPT-3.5 and 5.4 seconds on GPT-4.
Beyond the shorter delay, GPT-4o brings many upgrades over GPT-4, including:
Strong multimodal interaction, spanning voice, video, and screen sharing.
Real-time recognition and understanding of human expressions, text, and mathematical formulas.
Emotionally rich voice output, with adjustable tone and style; it can do impressions and even improvise a song.
Ultra-low latency: users can interrupt the AI mid-conversation in real time to add information or switch topics.
Free for all ChatGPT users (with a usage cap).
Twice as fast as GPT-4 Turbo, with API costs 50% lower and rate limits 5 times higher (see the API sketch below).
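For developers, that last point lands at the API level. Below is a minimal sketch of a plain text call to GPT-4o, assuming the OpenAI Python SDK (v1.x) and an OPENAI_API_KEY environment variable; the prompt is purely illustrative:

```python
# Minimal GPT-4o API call, assuming the OpenAI Python SDK (pip install openai).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # the new omni model, billed at roughly half the GPT-4 Turbo rate
    messages=[
        {"role": "user", "content": "Summarize GPT-4o's launch in one sentence."}
    ],
)
print(response.choices[0].message.content)
```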
"Breaking through these limitations is innovation."
Some industry experts believe GPT-4o's multimodal capabilities only look good on stage, and that OpenAI has not actually demonstrated any truly "breakthrough" capability in visual multimodality.
Here, following the large-model industry's habit, let's compare it with Claude 3 from Anthropic, the shop next door.
The technical documentation of Claude 3 mentions that "although Claude's image understanding capabilities are cutting-edge, some limitations need to be noted."
These include:
Person recognition: Claude cannot be used to identify (i.e. name) people in images and will refuse to do so.
Accuracy: Claude may hallucinate or make mistakes when interpreting low-quality, rotated, or very small images under 200 pixels.
Spatial Reasoning: Claude has limited spatial reasoning abilities. It may have difficulty with tasks that require precise positioning or layout, such as reading an analog clock face or describing the exact position of a chess piece.
Counting: Claude can give an approximate count of objects in an image, but may not always be precisely accurate, especially for large numbers of small objects.
AI-generated images: Claude does not know whether an image is AI-generated and may answer incorrectly if asked. Do not rely on it to detect fake or synthetic images.
Inappropriate content: Claude will not process inappropriate or explicit images that violate our Acceptable Use Policy.
Healthcare Applications: While Claude can analyze general medical images, it is not designed to interpret complex diagnostic scans such as CT or MRI. Claude's output should not be considered a substitute for professional medical advice or diagnosis.
Among the examples published on the GPT-4o page, some do touch on "spatial reasoning", but they are still hard to call breakthroughs.
Moreover, it is easy to see from GPT-4o's output in the live demo that its model capability is not much different from GPT-4's.
(Chart: GPT-4o benchmark scores)
Although the model can inflect its tone and even improvise a song, the substance of its replies still lacks detail and creativity, much like GPT-4.
In addition, after the event OpenAI's website published a series of GPT-4o application explorations, including: converting photos to comic style; meeting minutes; image synthesis; image-based 3D content generation; handwriting and draft generation; stylized posters and comic strips; and artistic fonts.
Among these, converting photos to comic style, meeting minutes, and the like are fairly ordinary text-to-image or large-model features.
"If I register 5 free ChatGPT accounts, do I not need to subscribe to ChatGPT Plus for $20 per month?"
Under OpenAI's announced GPT-4o usage policy, ChatGPT Plus users get a usage cap 5 times higher than free users'.
With GPT-4o free for everyone, the first thing it challenges seems to be OpenAI's own business model.
Data from the third-party market analytics platform Sensor Tower show that over the past month ChatGPT was downloaded 7 million times on the global App Store with $12 million in subscription revenue, and 90 million times on global Google Play with $3 million in subscription revenue.
ChatGPT Plus is currently priced at $19.99 in both app stores, so the subscription data imply roughly 750,000 paying subscribers through the app stores over the past month. ChatGPT Plus still has plenty of direct paying users on the web, but going by mobile revenue, annualized income is under $200 million; even multiplied several times over, that can hardly support OpenAI's valuation of nearly $100 billion.
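The arithmetic behind those figures, using only the Sensor Tower numbers above (a back-of-the-envelope check, not OpenAI's own accounting):

```python
# Back-of-the-envelope check of the app-store subscription figures cited above.
app_store_monthly = 12_000_000   # USD, App Store subscription revenue, past month
google_play_monthly = 3_000_000  # USD, Google Play subscription revenue, past month
price = 19.99                    # USD per ChatGPT Plus subscription per month

monthly = app_store_monthly + google_play_monthly
subscribers = monthly / price    # ~750,000 paying subscribers via the app stores
annualized = monthly * 12        # ~$180M a year, i.e. under $200M

print(f"{subscribers:,.0f} subscribers, ${annualized:,.0f} annualized")
```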
Seen this way, OpenAI may not need to worry much about individual users' payments.
What's more, GPT-4o's selling point is the experience. If your chat with the AI gets cut off at the usage cap and you have to switch accounts to carry on, wouldn't you grit your teeth and pay up?
"The original ChatGPT hinted at the possibility of language interfaces; this new thing feels fundamentally different. It's fast, smart, fun, natural, and helpful."
Sam Altman's latest blog post mentions the possibilities of "language interfaces", which is exactly what GPT-4o may do next: challenge every GUI (graphical user interface) and everyone hoping to build an LUI (language user interface, i.e. voice interaction).
Combined with recent foreign media reports of OpenAI's cooperation with Apple, one can speculate that GPT-4o may soon either extend an olive branch to, or flip the table on, every AI PC and AI phone maker.
Whatever the voice assistant or large model, its core value for AI PCs and AI phones is optimizing the experience, and GPT-4o pushes that experience to the extreme.
GPT-4o is likely to affect every known app, even the SaaS industry. All the AI agents developed, and still being developed, over the past year or so now face a threat.
The product manager of a resource-aggregation app once told Huxiu: "My operating flow is the core of my product. If that flow gets optimized away by your ChatGPT, my app has no value."
Imagine a food-delivery app whose UI collapses into the single sentence "Order for me": it would no longer matter to users whether they open Meituan or Ele.me.
The only move left for vendors would be squeezing supply-chain and ecosystem margins, or even a vicious price war.
As things stand, it may take other vendors some time to beat OpenAI on model capability.
A product that wants to benchmark against OpenAI may only be able to do so by building a cheaper model.
"I've been so busy lately that I haven't paid attention to them."
The founder of an industry-focused large-model company told Huxiu that he has recently been busy with strategic partnerships, product launches, customer meetings, and investor meetings, and has had no time to follow releases like OpenAI's.
Before OpenAI's event, Huxiu also asked a number of domestic AI practitioners from various fields. Their predictions and views on OpenAI's latest release were remarkably consistent: looking forward to it, but it has nothing to do with me.
One practitioner said that given China's current progress, catching up with OpenAI in the short term is unrealistic; so even if you care about what OpenAI releases, at most you watch it for the latest technical direction.
At present, domestic companies generally put more weight on engineering and on vertical models in large-model R&D, both of which are more pragmatic and easier to monetize.
On the engineering side, DeepSeek, which has recently gained popularity, is setting off a token price war in the domestic large-model industry. On vertical models, many industry insiders told Huxiu that in the short term, R&D on small and vertical models will be largely unaffected by OpenAI.
"Sometimes OpenAI's technical direction is not very worth learning from."A model expert told Huxiu that Sora is a good example. In February 2024, OpenAI released the video model Sora, which achieved a stable output of 60 seconds of video. Although it looks very effective, there is almost no subsequent practice and the landing speed is very slow.
Before Sora, many domestic companies and institutions working in the field of viz video had achieved 15-second stable video generation. After Sora came out, the R&D, financing, and product rhythm of some companies were disrupted, and even the development of the entire viz video industry evolved into a "technological leap forward."
Fortunately, GPT-4o this time is very different from Sora. OpenAI CTO Mira Murati said that over the next few weeks, OpenAI will continue its iterative rollout to bring all of these capabilities to users.
And indeed, soon after the event, GPT-4o was already available to try online.