The internet is about to be flooded with AI Assistants.
OpenAI’s new GPT-4o model sounds just like a human and can see our world, too. We are about to see a proliferation of AI Assistants across industries. Are we ready for it?
The 2-minute version
Woah!!! AI now talks just like us: This week we saw big announcements from OpenAI and Google on all the cool things their LLMs can now do for us. While Google’s Gemini will be further entrenched into Google’s products, including Google Search, OpenAI’s ChatGPT can now talk just like us, with remarkable intonation and voice modulation that feels surreal and... a little weird. We had to go back and rewatch the movie Her.
“Release The Kr(AI)ken”: Based on social media chatter, ChatGPT’s ability to hold conversations with such ease may have created yet another iPhone moment. As Google and OpenAI redefine user interactions, get ready for a proliferation of AI Assistants in fields as diverse as coding, dating, logistics, and sales, with products such as DevinAI, Replika, Braintrust's AIR, and Alice promising enhanced productivity and tailored experiences.
Interacting with AI will become hands-free: For those excited by the prospects of AI’s multi-modal capabilities, we believe there will soon come a point when users feel the limitations of AI assistants because of the constraints of smartphones. We think Mixed Reality glasses, such as Apple’s Vision Pro ($$), Meta’s Quest headset, or Ray-Ban AR glasses, are one way technology may bridge the gap. Then there is also Project Gameface, Google’s attempt to envision how humans can wean themselves off the mouse, keypad, and touchscreen using gestures.
AI is about to become omnipresent: It’s not just about the multimodal features. The Internet of Things (IoT), which soared in popularity in the previous decade with the launch of Nest thermostats, connected home appliances, and smart light fixtures, provides the perfect breeding ground for today’s conversational AI to integrate into any IoT device and displace primitive assistants like Amazon’s Alexa, Apple’s Siri, or Google Home.
With great power comes…: great responsibility. We stole that line from Spider-Man to delve into the risks of AI’s new conversational and perception abilities. ChatGPT and Gemini demonstrated how they can hold conversations, recall from memory, and literally see our world. Cool as that may be, it also poses many risks, from privacy to mental health. Once again, we are faced with the axiomatic tradeoff between user privacy and promised productivity.
P.S.: This post has multiple references to movies about AI. Amrita and I are visual people, so we try to visualize concepts in as many ways as we can.
🎥 Before we begin, mark your calendars 📅: I will join the amazing duo behind Tech Tonic, LinkedIn’s #1 AI Thought Show, on its 19th episode at 4 pm ET (1 pm PT), where I will discuss the 🤖 impact of AI agents in the workforce, 💸 the potential for AI-powered vertical solutions to bring a renaissance of small businesses and solo-preneurs, and 🚀 how the weapons of mass production can pave the path for mass personalization and new business models. There’s a lot to unpack, so don’t forget to tune in.
This week could arguably be another pivotal week in the history of AI. There were two big announcements from two firms competing for the top prize to break the barriers of what AI can (or can’t) do: OpenAI and Google.
As we sat watching the product demos and walkthroughs, we couldn’t help but think about the opening scene of the movie Her. As the protagonist, Theodore, completes the setup of his new operating system, he begins interacting with its virtual assistant, Samantha.
That scene captures our exact reaction to the demos of the last couple of days. Exciting as they are, these demos point to a world where such assistants proliferate further into our lives, and this is just the beginning.
ChatGPT sounds more human than ever
So far, OpenAI’s AI lineup has operated as a three-headed virtual body, with each head handling a different mode.
First, there was DALL-E, the image-generation model first revealed in early 2021. Users will remember when images of polar bears playing bass guitars or avocado-shaped armchairs started surfacing on the internet at the time; the images were generated from simple text prompts.
Then there was OpenAI’s speech recognition and transcription model, Whisper, launched in 2022, quickly followed by the release of the global sensation ChatGPT, powered by GPT-3.5, in the fall of 2022.
On Monday this week, OpenAI took the wraps off its latest model, GPT-4o (“o” for “omni”). The focus of the demo was the multi-modality of the new 4o model: the capability for users to interact with the same model through different modes such as voice, text, images, and video.
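For readers who want a concrete sense of what “one model, many modes” means in practice, here is a minimal sketch, using the publicly documented OpenAI Python SDK, of a single request to GPT-4o that mixes text and an image. The prompt, image URL, and variable names are illustrative placeholders we chose for this example; the live demo itself ran through ChatGPT’s voice interface rather than raw API calls.

```python
# A minimal sketch of a multimodal GPT-4o request, assuming the OpenAI
# Python SDK (v1.x). The prompt and image URL below are illustrative
# placeholders, not taken from OpenAI's demo.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            # A single message can carry several "modes" at once:
            # plain text plus an image for the model to look at.
            "content": [
                {"type": "text", "text": "What equation is written on this whiteboard?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/whiteboard.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

Voice works the same way in spirit: audio is simply another input and output mode for the same underlying model, which is what the “omni” refers to.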
OpenAI’s employees ran a live demonstration of the 4o model, with ChatGPT helping an employee solve a linear equation. In a couple of other segments, employees demonstrated 4o’s voice capabilities, with ChatGPT acting as a mental health coach, a live translator, and a live coding assistant.
OpenAI’s intentions on demo day were clear from the word go. In addition to making ChatGPT multi-modal, OpenAI wanted to make it sound as human-like as possible. ChatGPT provided answers in real time, the intonation was remarkable, and it felt as if the employees were interacting, digitally, with someone like us.
At many points in the back-and-forth, the AI assistant showed a stunning ability to modulate its voice to express a broad range of emotions such as excitement, optimism, and laughter, to use filler sounds such as “Ahhh” and “Awww,” and even to exhibit embarrassment when it incorrectly identified one employee’s face as a wooden table. Watching this demo made us feel like we had to rewatch the movie Her all over again.
Google Gemini’s Reach Extends Deeper Into Google’s Products
In contrast to OpenAI’s homey, candid conversation about its latest model releases, Google’s annual developer conference, I/O 2024, sounded closer to the scripted Big Tech keynotes we have become accustomed to since Steve Jobs’ 2007 iPhone reveal.
The event started off with improv DJ Marc Rebillet demoing Google’s AI-powered music tool, MusicFX, which launched in December, but it quickly switched gears with announcements about new capabilities in Google Search, updates to Gemini, and Project Astra.
Google is redesigning its search engine around AI. Users in the U.S. will begin to see AI Overviews in their search results starting this week, with support for more countries set to follow shortly.
Project Astra appeared to be Google’s answer to the multi-modal, ever-present AI assistant that OpenAI demonstrated. Much of the demonstration revolved around a Google employee pointing a smartphone camera at objects in the room, letting the camera serve as the eyes and ears of Google’s conversational AI assistant.
The highlight of the video was the assistant pointing out exactly where the employee had left her glasses when she later asked, “Do you remember where I put my glasses?” This was similar to ChatGPT’s demo, where employees posed specific questions testing its ability to recall from memory.
Release the Kr(AI)ken: Brace for the flood of AI Assistants
It has become increasingly evident that as more agents and assistants become available, the internet, which ironically played a role in training all of today’s models, will fade into the background. Searching and summarizing are two of the flagship features most models compete on today.
Many of Google’s Gemini demonstrations centered on the model’s ability to search quickly through emails, photos, PDF files, or the web for that one photo of a driver’s license, the rental agreement, or a contractor’s quote to fix the kitchen sink, and to present the user with an accurate summary as quickly as possible. OpenAI’s GPT-4o, on the other hand, focused more on personalizing the interaction: users can communicate with ChatGPT across a variety of modes, yet walk away feeling as if they had almost interacted with a human.
But Google and OpenAI aren’t alone.
Last year, Meta launched its own AI assistants across its entire family of apps and devices. Last month, the social media company added conversational features to its Meta AI assistant, including the ability to customize it to sound like Snoop Dogg or Tom Brady, for example.
While Google and OpenAI showed off their respective models as coding assistants for developers, other examples include DevinAI. There is also Replika, an AI for dating; a logistics-focused AI assistant for searching through supply chains; Braintrust's AIR, which handles employee recruitment and onboarding; and Alice, an AI-powered sales representative.
These are just some of the many conversational AI products being developed for users like us, with the goal of personalizing experiences and delivering productivity.
Is there anything AI can’t do?
Plenty.
The demonstrations by both OpenAI and Google took place in ideal settings. In software and design modeling, that’s called the Happy Path: the default scenario in which nothing throws an exception or error.
In OpenAI’s demonstration, there were already a few errors. Let’s start with the example where OpenAI’s employees were solving the linear equation. Before they could even write down the equation, ChatGPT claimed it could see one! The employees had to interrupt ChatGPT and ask it to wait for them to write the equation down. This was a first sign of the new conversational 4o model hallucinating, a term used to describe AI output that is non-existent or non-factual, a.k.a. “made up.” Then, towards the end of the presentation, ChatGPT incorrectly identified someone’s face as a wooden surface, as mentioned earlier. It eventually got the answers right, but it had to be nudged and nurtured along the way.
Google’s Gemini, on the other hand, still had a near-perfect demonstration of Project Astra, as shown in the video above. But then, it was pre-recorded, and Google has admitted in the past to staging scripted videos to showcase certain multi-modal aspects of Gemini.
Earlier this year, Google was forced to issue a public apology after one of its image models, Imagen 2, incorrectly injected diversity into historical pictures. This would probably point to sample bias in the training data used for Google’s previous image models.
For now, these models still lack moral intuition, ethical reasoning, and an understanding of broader societal implications without explicit guidance from humans, which highlights the need for extreme caution when using these assistants in their current, primitive state.
The Future Of Interacting With AI Is Hands-Free
As multi-modal features become more widely available, expect a wider push by technology companies to get ahead in the race to roll out AI assistants and agents to consumers. It definitely feels like a barrage of AI assistants is coming our way. For those excited by the prospects of AI’s multi-modal capabilities, there will soon come a point when users feel the limitations of AI assistants because of the constraints of smartphones and today’s devices.
As we watched the demos from both Google and OpenAI, we noticed that employees in both demonstrations resorted to the delicate balancing act of pointing their phone’s camera at an object, asking for a summary or solution, and working on it in real time. For a technology that promises productivity benefits, productivity currently seems capped when a user has to constantly hold a smartphone and wave it around for the AI to see their spatial surroundings.
Mixed Reality glasses could bridge that gap by letting users go hands-free for certain tasks. Apple’s Vision Pro (only if they give a sweet deal), Meta’s Quest headset, or Ray-Ban smart glasses now have a few more use cases to plug the usability gap as more users jump on the multi-modal train.
For now, Google looks to plug that gap by integrating Gemini deeper across its services and its Pixel devices. In addition, Google quietly snuck a pair of prototype AR glasses into its AI Sandbox at the event this week, while OpenAI just launched a brand-new macOS app alongside its smartphone apps. Then there is also Project Gameface, Google’s other attempt to envision how humans can wean themselves off the mouse, keypad, and touchscreen.
In addition to all the multimodal features being pushed into AI by developers, expect AI to gradually become omnipresent. The Internet of Things (IoT), which soared in popularity in the previous decade with the launch of Nest thermostats, connected home appliances, and smart light fixtures, provides the perfect breeding ground for today’s conversational AI to integrate into any IoT device and displace primitive assistants like Amazon’s Alexa, Apple’s Siri, or Google Home.
Blade Runner 2049’s virtual assistant Joi is a good example for visualizing the omnipresence of AI.
With AI’s power comes AI’s responsibility
As OpenAI’s employees concluded their live linear-equation demo with ChatGPT, the audience cheered and the employees beamed as they summarized the key insights. But then something strange happened.
ChatGPT recalled one of the employees’ outfits from memory and suddenly interrupted, saying, “That’s quite the outfit you got on,” before it had to be turned off. This was unsolicited, yes, but it could also be viewed as rude, or even a breach of privacy.
Privacy has emerged as one of the most important digital issues in this decade. Now, with AI, one can imagine the breadth and depth of privacy-centric questions. (Remember when Amazon’s Alexa and Google Home reportedly listened in on conversations at home?) This becomes even more fundamental in Google’s case as it prepares to deploy Gemini across services like Gmail, Photos, Drive, etc.
How do technology firms balance user features with user privacy in this new world? Will privacy today be a commodity tomorrow? The New York Times has done a deep analysis of the evolution of Google’s privacy policies over the years. Policies change, but changes should be made with decades of accumulated user data in mind, especially since this week’s presentations show that AI assistants can now remember by observation and recall from memory. That’s similar to the concerns people had when they woke up one day and realized that “Facebook knows literally everything about you.” Yes, today’s AI, in all its shiny avatars, offers hope that we can feel superhuman in our work, but we should not ignore the axiomatic tradeoff between user privacy and promised productivity.
The implications become more critical as AI inches closer to passing the Turing test, a test of whether a machine can exhibit behavior indistinguishable from a human’s. With every company eager to launch its own AI Assistants, these implications become even more profound. As Bloomberg reasons below:
What are the social and psychological consequences of regularly speaking to a flirty, fun and ultimately agreeable artificial voice on your phone, and then encountering a very different dynamic with men and women in real life? What happens when emotionally vulnerable people develop an unhealthy attachment to GPT-4o?
To answer these questions, we leave you with another scene from the movie Her, where our protagonist, Theodore, confronts his AI assistant over all the other users she has been talking to.