One Useful Thing

Trying to understand the implications of AI for work, education, and life. By Prof. Ethan Mollick

Using AI Right Now: A Quick Guide

2025-06-24 00:12:17

Every few months I put together a guide on which AI system to use. Since I last wrote my guide, however, there has been a subtle but important shift in how the major AI products work. Increasingly, it isn't about the best model, it is about the best overall system for most people. The good news is that picking an AI is easier than ever and you have three excellent choices. The challenge is that these systems are getting really complex to understand. I am going to try and help a bit with both.

First, the easy stuff.

Which AI to Use

For most people who want to use AI seriously, you should pick one of three systems: Claude from Anthropic, Google’s Gemini, and OpenAI’s ChatGPT. With all of these options, you get access to both advanced and fast models, a voice mode, the ability to see images and documents, the ability to execute code, good mobile apps, the ability to create images and video (Claude falls short here, however), and the ability to do Deep Research. Some of these features are free, but you are generally going to need to pay $20/month to get access to the full set of features you need. I will try to give you some reasons to pick one model or another as we go along, but you can’t go wrong with any of them.

What about everyone else? I am not going to cover specialized AI tools (some people love Perplexity for search, Manus is a great agent, etc.) but there are a few other options for general purpose AI systems: Grok by Elon Musk’s xAI is good if you are a big X user, though the company has not been very transparent about how its AI operates. Microsoft’s Copilot offers many of the features of ChatGPT and is accessible to users through Windows, but it can be hard to control what models you are using and when. DeepSeek r1, a Chinese model, is very capable and free to use, but is missing a few features from the other companies and it is not clear that they will keep up in the long term. So, for most people, just stick with Gemini, Claude, or ChatGPT.

Great! This was the shortest recommendation post yet! Except… picking a system is just the beginning. The real challenge is understanding how to use these increasingly complex tools effectively.

Now what?

I spend a lot of time with people trying to use AI to get stuff done, and that has taught me how incredibly confusing this is. So I wanted to walk everyone through the most important features and choices, as well as some advice on how to actually use AI.

Picking a Model

ChatGPT, Claude, and Gemini each offer multiple AI models through their interface, and picking the right one is crucial. Think of it like choosing between a sports car and a pickup truck; both are vehicles, but you'd use them for very different tasks. Each system offers three tiers: a fast model for casual chat (Claude Sonnet, GPT-4o, Gemini Flash), a powerful model for serious work (Claude Opus, o3, Gemini Pro), and sometimes an ultra-powerful model for the hardest problems (o3-pro, which can take 20+ minutes to think). The casual models are fine for brainstorming or quick questions. But for anything high stakes (analysis, writing, research, coding), you should usually switch to the powerful model.

Most systems default to the fast model to save computing power, so you need to manually switch using the model selector dropdown. (The free versions of these systems do not give you access to the most powerful model, so if you do not see the options I describe, it is because you are using the free version.)

I use o3, Claude 4 Opus, and Gemini 2.5 Pro for any serious work that I do. I also have particular favorites based on individual tasks that are outside of these models (GPT-4.5 is a really interesting model for writing, for example), but for most people, stick with the models I suggested most of the time.
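If you reach these models through an API rather than the chat apps, the same fast-versus-powerful choice is just a one-line model setting. Here is a minimal sketch, assuming the OpenAI Python SDK; the model identifiers are illustrative and may differ from what your account actually exposes.

```python
# Minimal sketch of picking a model tier via an API, assuming the OpenAI
# Python SDK; "gpt-4o" and "o3" are illustrative identifiers that may not
# match what your account exposes.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

question = "Review this contract clause and flag the three biggest risks: ..."

# Fast model: fine for casual questions and brainstorming.
quick = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

# Slower reasoning model: worth the wait for high-stakes analysis.
careful = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": question}],
)

print(quick.choices[0].message.content)
print(careful.choices[0].message.content)
```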

For people concerned about privacy, Claude does not train future AI models on your data, but Gemini and ChatGPT might, if you are not using a corporate or educational version of the system. If you want to make sure your data is never used to train an AI model, you can turn off training features easily for ChatGPT without losing any functionality, and at the cost of some functionality for Gemini. You may also want to turn on or off “memory” in ChatGPT’s personalization option, which lets the AI remember scattered details about you. I find the memory system to be too erratic at this point, but you may have a different experience.

Using Deep Research

Deep Research is a key AI feature for most people, even if they don’t know it yet. Deep Research tools are very useful because they can produce very high-quality reports that often impress information professionals (lawyers, accountants, consultants, market researchers) that I speak to. You should be trying out Deep Research reports in your area of expertise to see what they can do for you, but some other use cases include:

  • Gift Guides: “what do I buy for a picky 11-year-old who has read all of Harry Potter, is interested in science museums, and loves chess? Give me options, including where to buy at the best prices.”

  • Travel Guides: “I am going to Wisconsin on vacation and want to visit unique sites, especially focusing on cheese; produce a guide for me.”

  • Second opinions in law, medicine, and other fields (it should go without saying that you should trust your doctor/lawyer above AI, but research keeps finding that the more advanced AI systems do very well in diagnosis with a surprisingly low hallucination rate, so they can be useful for second opinions).

Activating Deep Research

Deep Research reports are not error-free but are far more accurate than just asking the AI for something, and the citations tend to actually be correct. Also note that each of the Deep Research tools works a little differently, with different strengths and weaknesses. Turning on the web search option in Claude and o3 will get them to work as mini Deep Research tools, doing some web research, but not as elaborately as a full report. Google has some fun additional options once you have created a report, letting you turn it into an infographic, a quiz, or a podcast.

An Easy Approach to AI: Voice Mode

An easy way to use AI is just to start with voice mode. The two best implementations of voice mode are in the Gemini app and ChatGPT’s app and website. Claude’s voice mode is weaker than the other two systems. What makes voice mode great is that you can just have a natural conversation with the app while in the car or on a walk and get quite far in understanding what these models can do. Note that the models are optimized for chat (including all of the small pauses and intakes of breath designed to make it feel like you are talking to a person), so you don’t get access to the more powerful models this way. They also don’t search the web as often, which makes them more likely to hallucinate if you are asking factual questions: if you are using ChatGPT, unless you hear the clicking sound at 44 seconds into this clip, it isn’t actually searching the web.

Voice mode's killer feature isn't the natural conversation, though, it's the ability to share your screen or camera. Point your phone at a broken appliance, a math problem, a recipe you're following, or a sign in a foreign language. The AI sees what you see and responds in real-time. I've used it to identify plants on hikes, solve a problem on my screen, and get cooking tips while my hands were covered in flour. This multimodal capability is genuinely futuristic, yet most people just use voice mode like Siri. You're missing the best part.

Making Things for You: Images, Video, Code, and Documents

ChatGPT and Gemini will make images for you if you ask (Claude cannot). ChatGPT offers the most controllable image creation tool, while Gemini uses two different image generation tools: Imagen, a very good traditional image generation system, and a multimodal image generation system. Generally, ChatGPT is stronger. On video creation, however, Gemini’s Veo 3 is very impressive, and you get several free uses a day (but you need to hit the Video button in the interface).

“make me a photo of an otter holding a sign saying otters are cool but also accomplished pilots. the otter should also be holding a tiny silver 747 with gold detailing.”

All three systems can produce a wide variety of other outputs, ranging from documents to statistical analyses to interactive tools to simulations to simple games. To get Gemini or ChatGPT to do this reliably, you need to select the Canvas option when you want these systems to run code or produce separate outputs. Claude is good at creating these sorts of outputs on its own. Just ask, you may be surprised what the AI systems can make.

Working with an AI

Now that you have picked a model, you can start chatting with it. It used to be that the details of your prompts mattered a lot, but the most recent AI models I suggested can often figure out what you want without the need for complex prompts. As a result, many of the tips and tricks you see online for prompting are no longer as important for most people. At the Generative AI Lab at Wharton, we have been trying to examine prompting techniques in a scientific manner, and our research has shown, for example, that being polite to AI doesn’t seem to make a big difference in output quality overall1. So just approach the AI conversationally rather than getting too worried about saying exactly the right thing.

That doesn’t mean that there is no art to prompting. If you are building a prompt for other people to use, it can take real skill to build something that works repeatedly. But for most people you can get started by keeping just a few things in mind:

  • Give the AI context to work with. Most AI models only know basic user information and the information in the current chat; they do not remember or learn about you beyond that. So you need to provide the AI with context: documents, images, PowerPoints, or even just an introductory paragraph about yourself can help - use the file option to upload files and images whenever you need. The AIs can do some of this themselves: ChatGPT and Claude can access your files and mailbox if you let them, and Gemini can access your Gmail, so you can ask them to look up relevant context automatically as well, though I prefer to give the context manually.

  • Be really clear about what you want. Don’t say “Write me a marketing email”; instead go with “I'm launching a B2B SaaS product for small law firms. Write a cold outreach email that addresses their specific pain points around document management. Here are the details of the product: [paste]” Or ask the AI to ask you questions to help you clarify what you want. (A small sketch after this list shows what this kind of context-rich request can look like.)

  • Give it step-by-step directions. Our research found this approach, called Chain-of-Thought prompting, no longer improves answer quality as much as it used to. But even if it doesn’t help that much, it can make it easier to figure out why the AI came up with a particular answer.

  • Ask for a lot of things. The AI doesn’t get tired or resentful. Ask for 50 ideas instead of 10, or thirty options to improve a sentence. Then push the AI to expand on the things you like.

  • Use branching to explore alternatives. Claude, ChatGPT, and Gemini all let you edit prompts after you have gotten an answer. This creates a new “branch” of the conversation. You can move between branches by using the arrows that appear after you have edited an answer. It is a good way to learn how your prompts impact the conversation.
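To make the context and clarity advice above concrete, here is a small illustrative sketch in plain Python of the difference between a bare request and one that packages background, a clear task, and the desired output format. Nothing here is vendor-specific, and the product details are just the hypothetical example from the list above.

```python
# Illustrative only: the point is the structure of the request, not the code.
# The product details below are the hypothetical example from the list above.

vague_prompt = "Write me a marketing email."

background = """
Product: B2B SaaS document management for small law firms.
Audience: managing partners at firms with 5-20 lawyers.
Pain points: lost files, slow client intake, compliance worries.
"""

task = (
    "Write a cold outreach email that addresses these pain points. "
    "Keep it under 150 words, end with a low-pressure call to action, "
    "and give me three subject-line options."
)

better_prompt = f"{background}\n\nTask: {task}"
print(better_prompt)  # paste this into the chat, or send it through an API
```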

Troubleshooting

I also have seen some fairly common areas where people get into trouble:

  • Hallucinations: In some ways, hallucinations are far less of a concern than they used to be, as AI has improved and newer AI models are better at not hallucinating. However, no matter how good the AI is, it will still make errors and mistakes and still give you confident answers where it is wrong. They also can hallucinate about their own capabilities and actions. Answers are more likely to be right when they come from the bigger, slower models, and if the AI did web searches. The risk of hallucination is why I always recommend using AI for topics you understand until you have a sense for their capabilities and issues.

  • Not Magic: You should remember that the best AIs can perform at the level of a very smart person on some tasks, but current models cannot provide miraculous insights beyond human understanding. If the AI seems like it did something truly impossible, it is probably not actually doing that thing but pretending it did. Similarly, AI can seem incredibly insightful when asked about personal issues, but you should always take these insights with a grain of salt.

  • Two Way Conversation: You want to engage the AI in a back-and-forth interaction. Don’t just ask for a response, push the AI and question it.

  • Checking for Errors: The AI doesn’t know “why” it did something, so asking it to explain its logic will not get you anywhere. However, if you find issues, the thinking trace of AI models can be helpful. If you click “show thinking” you can find out what the model was doing before giving you an answer. This is not always 100% accurate (you are actually getting a summary of the thinking) but is a good place to start.

Your Next Hour

So now you know where to start. First, pick a system and resign yourself to paying the $20 (the free versions are demos, not tools). Then immediately test three things on real work: First, switch to the powerful model and give it a complex challenge from your actual job with full context and have an interactive back and forth discussion. Ask it for a specific output like a document or program or diagram and ask for changes until you get a result you are happy with. Second, try Deep Research on a question where you need comprehensive information, maybe competitive analysis, gift ideas for someone specific, or a technical deep dive. Third, experiment with voice mode while doing something else — cooking, walking, commuting — and see how it changes your ability to think through problems.

Most people use AI like Google at first: quick questions, no context, default settings. You now know better. Give it documents to analyze, ask for exhaustive options, use branching to explore alternatives, experiment with different outcomes. The difference between casual users and power users isn't prompting skill (that comes with experience); it's knowing these features exist and using them on real work.


1. It is actually weirder than that: on hard math and science questions that we tested, being polite sometimes makes the AI perform much better, sometimes worse, in ways that are impossible to know in advance. So be polite if you want to!

The recent history of AI in 32 otters

2025-06-02 06:17:53

Two years ago, I was on a plane with my teenage daughter, messing around with a new AI image generator while the wifi refused to work. Otters were her favorite animal, so naturally I typed: “otter on a plane using wifi” just as the connection was restored. The resulting thread went viral and “otter on a plane using wifi” has since become one of my go-to tests of progress in AI image generation.

an otter on a plane using wifi
In 2021, prior to the rise of ChatGPT and diffusion models, this is what you got for “Otter on a plane using Wifi” from the hottest AI image generator, VQGAN + CLIP

What started as a silly prompt has become my accidental benchmark for AI progress. And tracking these otters reveals three major shifts in AI over the past few years: the growth of multiple types of AI tools, rapid improvement, and the rise of local and open models.

Diffusion models

The first otters I created were made with image generation tools. For most of the very recent history of AI, image generation used a process called diffusion, which works fundamentally differently from Large Language Models like ChatGPT. While LLMs generate text one word at a time, always moving forward, diffusion models start with random static and transform the entire image simultaneously through dozens of steps. It is like the difference between writing a story sentence by sentence versus starting with a marble block and gradually sculpting it into a statue: every part of the image is being refined at once, not built up sequentially. Instead of predicting "what comes next?" like a language model, diffusion models predict "what should this noise become?" and transform randomness into coherent images through repeated refinement.
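For readers who want the mechanics, here is a deliberately simplified sketch of the two loops described above. The model calls are toy stand-ins (random word choices and a fixed noise estimate), not real networks; the only point is that the language model loop appends to a sequence while the diffusion loop updates every pixel on every step.

```python
import random

def predict_next_token(tokens):
    # Toy stand-in for a language model's next-token prediction.
    return random.choice(["otter", "plane", "wifi", "laptop", "."])

def predict_noise(image, step):
    # Toy stand-in for a diffusion model's noise estimate.
    return [[0.1 * pixel for pixel in row] for row in image]

def llm_generate(prompt_tokens, steps=20):
    """Autoregressive sketch: the output grows one token at a time, left to right."""
    output = list(prompt_tokens)
    for _ in range(steps):
        output.append(predict_next_token(output))
    return output

def diffusion_generate(height=8, width=8, steps=20):
    """Diffusion sketch: start from pure noise and refine the whole image repeatedly."""
    image = [[random.gauss(0, 1) for _ in range(width)] for _ in range(height)]
    for step in range(steps):
        noise = predict_noise(image, step)
        # Every pixel is updated at once, rather than built up sequentially.
        image = [[pixel - n for pixel, n in zip(row, noise_row)]
                 for row, noise_row in zip(image, noise)]
    return image

print(" ".join(llm_generate(["an", "otter"])))
```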

There are a number of diffusion models out there, but I have tended to use Midjourney, which has been around longer than many other AI tools. Using Midjourney allows us to see how diffusion models have developed over time, as you can see with the simple prompt “otter on a plane using wifi” (for every image and video in this post, I pick the best out of the first four images generated). We go from melted fur at the start of 2022 to a visible otter (with too many fingers and a weird keyboard) at the end of that year. In 2023, we get a photorealistic otter, but still a weird keyboard and plane windows. In 2024, the lighting and positioning become better, and by 2025 we have excellent photorealism.

But what makes diffusion models interesting is not their increasing ability to make photorealistic images, but rather the fact that they can create images in various styles. This cuts to the heart of why AI image generation is so controversial, as many AI models are trained on images from throughout the web, including copyrighted work, and can thus replicate images in the style of living artists without their permission or compensation. But you can see how this works when applied to older artists and styles. Here is “otter on a plane using wifi” in the style of the Bayeux Tapestry, Egon Schiele, street art graffiti, and a Japanese Ukiyo-e print. (The wider your knowledge of art history, the more you can make these image creators do).

Diffusion models are not limited to existing styles. Midjourney lets any creator train the model to create images in a style they like and then share those unique “style codes.” If I end a prompt with one of these style codes, I get very different results: ranging from cyberpunk otters to cartoon ones.

I want to show you one last diffusion image, but this one is fundamentally different. I created it on my home computer using Flux. Unlike proprietary AI models like Midjourney or ChatGPT that run in corporate data centers, open weights models can be downloaded, modified, and run by anyone, anywhere. This high-quality image wasn't generated by a tech giant's servers but by the graphics card on my PC (you can also see ComfyUI, the interface I used to generate the image). It is remarkably close to the quality of the best closed-source models.
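The image above came out of ComfyUI, but the same idea, an open-weights diffusion model running on your own graphics card, can be sketched with the Hugging Face diffusers library. This is a rough sketch under assumptions: it presumes you have the diffusers and torch packages installed, a GPU with enough memory, and access to the Flux weights on Hugging Face, and the settings shown are illustrative rather than a tuned recipe.

```python
# Rough sketch of running an open-weights diffusion model locally, assuming
# the Hugging Face `diffusers` and `torch` packages and a capable GPU; the
# model ID and settings are illustrative, not a tuned recipe.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",  # open-weights Flux variant
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    "an otter on a plane using wifi",
    num_inference_steps=4,  # the schnell variant is designed for very few steps
    guidance_scale=0.0,     # schnell is typically run without guidance
).images[0]

image.save("otter_on_a_plane.png")
```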

Whether open or proprietary, diffusion models tend to produce pretty random results, and creating a single quality image can often take multiple tries. The latest diffusion models (like Google’s Imagen 4) do better, but there is still a lot of luck and trial-and-error involved in a good output.

Multimodal Image Generation

For most of the era of Large Language Models, when an LLM like ChatGPT created an image, it was actually calling on one of these diffusion models to make the image and show the results. Because this was all done indirectly (the LLM prompted the diffusion model which created the image), the process of creating an image seemed even more random than working with a standard image generator.

That changed with the release of multimodal image generation by OpenAI and Google in the past couple of months. Unlike diffusion models that transform noise into images, multimodal generation lets Large Language Models directly create images by adding tiny patches of color one after another, just as they add words one after another. This gives AIs deep control over the images they create. Here is "an otter on an airplane using wifi, on their laptop screen is image generation software creating an image of an otter on a plane using wifi," on my very first attempt.

But now I have to confess something: my daughter's favorite animal is not just any otter, it is the sea otter, and every single image so far has been of the much more common river otter. Finally, with multimodal generation, I could vindicate myself as a father, as multimodal models can make specific changes and adjustments: "make it a sea otter instead, give it a mohawk, they should be using a Razer gaming laptop."

I still use Midjourney and Imagen when I am trying to achieve a visual impact and when I am willing to spend a lot of time working through randomized images, but if I want a particular picture, I now always turn towards multimodal image generators. I suspect they will become increasingly common. As of yet, there are no open weights multimodal image generators, but that is likely to change soon.

Using Code for Images and “Sparks”

Multimodal generation shows AI can control images with precision. But there's a deeper question: does AI actually understand what it's creating, or is it just recombining patterns from training data? To test true spatial reasoning, we can force AI to draw using code - no visual feedback, no pre-trained image patterns to lean on. It's like asking someone to paint blindfolded using only mathematical instructions.

One particularly challenging type of code to use to draw is TikZ, a mathematical language used for producing scientific diagrams in academic papers. It is so ill-suited to the purpose that the name TikZ stands for the recursive German acronym "TikZ ist kein Zeichenprogramm" (“TikZ is not a drawing program”). Because of that, there is very little training data on using TikZ for drawings, meaning the AI cannot “remember” code from its training; it has to make it up itself. Creating an image with pure math in this language is a difficult job. In fact, a TikZ drawing of a unicorn by the now obsolete GPT-4 was considered, in a hugely influential paper, to be a sign that LLMs might have a “spark” of AGI - otherwise how could it be so creative? Here is how that unicorn looked, for reference:

I had a little less luck getting the old GPT-4 to draw an otter on a plane using wifi:

But what happens if we ask a more recent model, like Gemini 2.5 Pro, to draw our otter with TikZ? It isn’t perfect (and Gemini took “on a plane” literally and made the otter sit on the wing), but if the pink unicorn showed a spark this certainly represents a larger leap.

And open weights models are catching up here as well, though they generally remain a few months behind the frontier. The new version of DeepSeek r1, probably the best open weights model available, produces a TikZ otter that is not quite as good as the closed source models like Gemini, but I expect that it will continue to improve.

These drawings themselves aren’t as important as the fact that models are reasoning about spatial relationships from scratch. That is why the authors of the “Sparks” paper suggested these systems aren't just pattern-matching from training data but developing something closer to actual understanding.

Video

If still images show impressive progress, video generation reveals just how fast AI is accelerating. This was an “otter on a plane using wifi on a computer” as generated by the best available video generator as of July 2024, Runway Gen-3 Alpha.

And this is in Google’s Veo 3 with the same prompt “otter on a plane using wifi on a computer” in 2025, less than a year later. Yes, the sound is 100% AI generated as well.

And, continuing the theme, there are now open weights AI models that can run on my home computer that are behind the state-of-the-art, but catching up. Here are the results from Tencent’s HunyuanVideo for the same prompt. Yes, it's hideous - but this is made on my home computer, not a massive data center.

What this all means

The otter evolution reveals two crucial trends with some big implications. First, there clearly continues to be rapid improvement across a wide range of AI capabilities from image generation to video to LLM code generation. Second, open weights models, while not generally as good as proprietary models, are often only months behind the state-of-the-art.

If you put these trends together, it becomes clear that we are heading towards a place where not only are image and video generations likely to be good enough to fool most people, but that those capabilities will be widely available and, thanks to open models, very hard to regulate or control. I think we need to be prepared for a world where it is impossible to tell real from AI-generated images and video, with implications for a wide swath of society, from the entertainment we enjoy to our trust for online content.

That future is not far away, as you can see from this final video, which I made with simple text prompts to Veo 3. When you are done watching (and I apologize in advance for the results of the prompt “like the musical Cats but for otters”), look back at the first Midjourney image from 2022. The time between a text prompt producing abstract masses of fur and those producing realistic videos with sound was less than three years.


Making AI Work: Leadership, Lab, and Crowd

2025-05-22 19:00:44

Companies are approaching AI transformation with incomplete information. After extensive conversations with organizations across industries, I think four key facts explain what's really happening with AI adoption:

  1. AI boosts work performance. How do we know? For one thing, workers certainly think it does. A representative study of knowledge workers in Denmark found that users thought that AI halved their working time for 41% of the tasks they do at work, and a more recent survey of Americans found that workers said using AI tripled their productivity (reducing 90-minute tasks to 30 minutes). Self-reporting is never completely accurate, but we have other data from controlled experiments that suggest gains among product development, sales, and consulting, as well as for coders, law students, and call center workers.

  2. A large percentage of people are using AI at work. That Danish study from a year ago found that 65% of marketers, 64% of journalists, and 30% of lawyers, among others, had used AI at work. The study of American workers found over 30% had used AI at work in December, 2024, a number which grew to 40% in April, 2025. And, of course, this may be an undercount in a world where ChatGPT is the fourth most visited website on the planet.

  3. There are more transformational gains available with today’s AI systems than most currently realize. Deep research reports do many hours of analytical work in a few minutes (and I have been told by many researchers that checking these reports is much faster than writing them); agents are just starting to appear that can do real work; and increasingly smart systems can produce really high-quality outcomes.

  4. These gains are not being captured by companies. Companies are typically reporting small to moderate gains from AI so far, and there is no major impact on wages or hours worked as of the end of 2024.

How do we reconcile the first three points with the final one? The answer is that AI use that boosts individual performance does not naturally translate to improving organizational performance. To get organizational gains requires organizational innovation, rethinking incentives, processes, and even the nature of work. But the muscles for organizational innovation inside companies have atrophied. For decades, companies have outsourced this to consultants or enterprise software vendors who develop generalized approaches that address the issues of many companies at once. That won’t work here, at least for a while. Nobody has special information about how to best use AI at your company, or a playbook for how to integrate it into your organization. Even the major AI companies release models without knowing how they can be best used. They especially don’t know your industry, organization, or context.

We are all figuring this out together. So, if you want to gain an advantage, you are going to have to figure it out faster than everyone else. And to do that, you will need to harness the efforts of Leadership, Lab, and Crowd - the three keys to AI transformation.

Leadership

Ultimately, AI starts as a leadership problem, where leaders recognize that AI presents urgent challenges and opportunities. One big change since I wrote about this topic months ago is that more leaders are starting to recognize the need to address AI. You can see this in two viral memos, from the CEO of Shopify and the CEO of Duolingo, establishing the importance of AI to their company’s future.

But urgency alone isn't enough. These messages do a good job signaling the 'why now' but stop short of painting that crucial, vivid picture: what does the AI-powered future actually look and feel like for your organization? My colleague Andrew Carton has shown that workers are not motivated to change by leadership statements about performance gains or bottom lines; they want clear and vivid images of what the future actually looks like: What will work be like in the future? Will efficiency gains be translated into layoffs or will they be used to grow the organization? How will workers be rewarded (or punished) for how they use AI? You don’t have to know the answer with certainty, but you should have a goal that you are working towards that you are willing to share. Workers are waiting for guidance, and the nature of that guidance will impact how The Crowd adopts and uses AI.

An overall vision is not enough, however, because leaders need to start to anticipate how work will change in a world of AI. While AI is not currently a replacement for most human jobs, it does replace specific tasks within those jobs. I have spoken to numerous legal professionals who see the current state of Deep Research tools as good enough to handle portions of once-expensive research tasks. Vibe coding changes how programmers allocate time and effort. And it is hard to not see changes to marketing and media work in the rapid gains in AI video. For example, Google’s new Veo 3 created this short video snippet, sound and all, from the text prompt: An advertisement for Cheesey Otters, a new snack made out of otter shaped crackers. The commercial shows a kid eating them, and the mom holds up the package and says "otterly great"

Yet the ability to make a short video clip, or code faster, or get research on demand, does not equal performance gains. To do that will require decisions about where Leadership and The Lab should work together to build and test new workflows that integrate AIs and humans. It also means fundamentally rethinking why you are doing particular tasks. Companies used to pay tens of thousands of dollars for a single research report, now they can generate hundreds of those for free. What does that allow your analysts and managers to do? If hundreds of reports aren’t useful, then what was the point of research reports?

I am increasingly seeing organizations start to experiment with radical new approaches to work in response to AI. For example, dispersing software engineering teams, removing them from a central IT function and instead having them work in cross-functional teams with subject matter experts and marketing experts. Together, these groups can “vibework” and independently build projects in days that would have taken months of coordination across departments. And this is just one possible future for work. Leaders need to describe the future they want, but they also don’t have to generate every idea for innovation on their own. Instead, they can turn to The Crowd and The Lab.

The Crowd

Both innovation and performance improvements happen in The Crowd, the employees who figure out how to use AI to help get their own work done. As there is no instruction manual for AI (seriously, everyone is figuring this out together), learning to use AI well is a process of discovery that benefits experienced workers. People with a strong understanding of their job can easily assess when an AI is useful for their work through trial and error, in the way that outsiders (and even AI-savvy junior workers) cannot. Experienced AI users can then share their workflows and AI use in ways that benefit everyone.

Enticed by this vision, companies (including those in highly regulated industries1) have increasingly been giving employees direct access to AI chatbots, and some basic training, in hopes of seeing The Crowd innovate. Most run into the same problem, finding that the use of official AI chatbots maxes out at 20% or so of workers, and that reported productivity gains are small. Yet over 40% of workers admit using AI at work, and they are privately reporting large performance gains. This discrepancy points to two critical dynamics: many workers are hiding their AI use, often for good reason, while others remain unsure how to effectively apply AI to their tasks, despite initial training.

Results from this recent survey on AI use by a representative sample of American workers: adoption has been accelerating, and workers report huge time savings

These are problems that can be solved by Leadership and the Lab.

Solving the problem of hidden AI use (what I call “Secret Cyborgs”) is a Leadership problem. Consider the incentives of the average worker. They may have received a scary talk about how improper AI use might be punished, and they don’t want to take any risks. Or maybe they are being treated as heroes at work for their incredible AI-assisted outputs, but they suspect if they tell anyone it is AI, managers will stop respecting them. Or maybe they know that companies see productivity gains as an opportunity for cost cutting and suspect that they (or their colleagues) will be fired if the company realizes that AI does some of their job. Or maybe they suspect that if they reveal their AI use, even if they aren’t punished, they won’t be rewarded. Or maybe they know that even if companies don’t cut costs and reward their use, any productivity gains will just become an expectation that more work will get done. There are more reasons for workers to not use AI publicly than to use it.

Leadership can help. Instead of vague talks on AI ethics or terrifying blanket policies, provide clear areas where experimentation of any kind is permitted and be biased towards allowing people to use AI where it is ethically and legally possible. Leaders also should treat training less as an opportunity to learn prompting techniques (which are valuable but getting less important as models get better at figuring out intent) and more as a chance to give people hands-on AI experience and practice communicating their needs to AI. And, of course, you will need to figure out how you will reassure your workers that revealing their productivity gains will not lead to layoffs, because it is often a bad idea to use technological gains to fire workers at a moment of massive change. Build incentives, even massive incentives (I have seen companies offer vacations, promotions, and large cash rewards), for employees who discover transformational opportunities for AI use. Leaders can also model use themselves, actively using AI at every meeting and talking about how it helps them.

Even with proper vision and incentives, there will still be a substantial number of workers who aren’t inclined to explore AI and just want clear use cases and products. That is where The Lab comes in.

The Lab

As important as decentralized innovation is, there is also a role for a more centralized effort to figure out how to use AI in your organization. Unlike a lot of research organizations, The Lab is ambidextrous, engaging in both exploration for the future (which in AI may just be months away) and exploitation, releasing a steady stream of new products and methods. Thus, The Lab needs to consist of subject matter experts and a mix of technologists and non-technologists. Fortunately, the Crowd provides the researchers, as those enthusiasts who figure out how to use AI and proudly share it with the company are often perfect members of The Lab. Their job will be completely, or mostly, about AI. You need them to focus on building, not analysis or abstract strategy. Here is what they will build:

  • Take prompts and solutions from The Crowd and distribute them widely, very quickly. The Crowd will discover use cases and problems that can be turned into immediate opportunities. Build fast and dirty products with cross-functional teams, centered around simple prompts and agents. Iterate and test them. Then release them into your organization and measure what happens. Keep doing this.

  • Build AI benchmarks for your organization. Almost all the official benchmarks for AI are flawed, or focus on tests of trivia, math or coding. These don’t tell you which AI does the best writing or can best analyze a financial model or can help guide a customer making purchases. You need to develop your own benchmarks: how good is each of the models at the tasks you actually do inside of your company? How fast is the gap closing? Leadership should help provide some guidance, but ultimately The Lab will need to decide what to measure and how. Some benchmarks will be objective (Anthropic has a guide to benchmarking that can help as a starting place), but it is also fine for some complex benchmarks to be “vibes alone,” based on experience. (A bare-bones sketch of such a benchmark harness appears after this list.)

    For example, I “vibe benchmarked” Manus, an AI agent based on Claude, on its ability to analyze new startups by giving it a hard assignment and evaluating the results. I gave it a short description of a fictional startup and a detailed set of projected financials in an Excel file. These materials came from a complex business simulation we built at Wharton (and never shared online) that took teams of students dozens of hours to complete. I was curious if the AI could figure it out. As guidance, I gave it a checklist of business model elements to analyze, and nothing else.

In just a couple of prompts, Manus developed a website, a PowerPoint pitch deck, an analysis of the business model, and a test of the financial assumptions based on market research. You can see it at work here. In my evaluations of the work, the 45-page business model analysis was very solid. It was not completely free from mistakes, but it had far fewer mistakes, and was far more thorough, than what I would expect from talented students. I also got an initial draft website, the requested PowerPoint, and a Deep Dive into the financial assumptions. Looking through these helped me find weak spots — image generation, a tendency to extrapolate answers without asking me — and strong ones. Now, every time a new agentic system comes out, I can compare it to Manus and see where things are heading.

  • Go beyond benchmarks to build stuff that doesn’t work… yet. What would it look like if you used AI agents to do all the work for key business processes? Build it and see where it fails. Then, when a new model comes out, plug it into what you built and see if it is any better. If the rate of advancement continues, this gives you the opportunity to get a first glance at where things are heading, and to actually have a deployable prototype at the first moment AI models improve past critical thresholds.

  • Build provocations. Many people haven't truly engaged with AI's potential. Demos and visceral experiences that jolt people into understanding how AI could transform your organization, or even make them a little uncomfortable, have immense value in sparking curiosity and overcoming inertia. Show what seems impossible today but might be commonplace tomorrow.
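Here is the bare-bones benchmark harness sketch promised above. Everything in it is illustrative: ask_model is a placeholder for whatever model API or internal tool your organization uses, the tasks are invented examples, and the 1-5 human score is just one possible rubric.

```python
# Bare-bones internal benchmark sketch. `ask_model` is a placeholder for
# whatever model API or internal tool you use; the tasks and the 1-5 human
# score are illustrative, not recommendations.
import csv
from datetime import date

TASKS = [
    {"id": "client-email", "prompt": "Draft a renewal email for a client whose contract lapses next month: ..."},
    {"id": "financial-check", "prompt": "Review these projected financials for internal inconsistencies: ..."},
]

def ask_model(model_name: str, prompt: str) -> str:
    raise NotImplementedError("Wire this up to the model you want to test.")

def run_benchmark(models, grade):
    """Run every task on every model and log a score from your own grader."""
    rows = []
    for model in models:
        for task in TASKS:
            output = ask_model(model, task["prompt"])
            score = grade(task, output)  # human rubric, "vibes," or an automated check
            rows.append({"date": date.today().isoformat(), "model": model,
                         "task": task["id"], "score": score})
    with open("benchmark_results.csv", "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "model", "task", "score"])
        writer.writerows(rows)
    return rows
```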

Re-examining the organization

The truth is that even this framework might not be enough. Our organizations, from their structures to their processes to their goals, were all built around human intelligence because that's all we had. AI alters this fundamental fact: we can now get intelligence, of a sort, on demand, which requires us to think more deeply about the nature of work. When research that once took weeks now takes minutes, the bottleneck isn't the research anymore, it's figuring out what research to do. When code can be written quickly, the limitation isn't programming speed, it's understanding what to build. When content can be generated instantly, the constraint isn't production, it's knowing what will actually matter to people.

And the pace of change isn't slowing. Every few months (weeks? days?) we see new capabilities that force us to rethink what's possible. The models are getting better at complex reasoning, at working with data, at understanding context. They're starting to be able to plan and act on their own. Each advance means organizations need to adapt faster, experiment more, and think bigger about what AI means for their future. The challenge isn't implementing AI as much as it is transforming how work gets done. And that transformation needs to happen while the technology itself keeps evolving.

The key is treating AI adoption as an organizational learning challenge, not merely a technical one. Successful companies are building feedback loops between Leadership, Lab, and Crowd that let them learn faster than their competitors. They are rethinking fundamental assumptions about how work gets done. And, critically, they're not outsourcing or ignoring this challenge.

The time to begin isn't when everything becomes clear - it's now, while everything is still messy and uncertain. The advantage goes to those willing to learn fastest.


1. When I talk to companies, the General Counsel's office is often the choke point that determines AI success. Many firms still ban AI use for outdated privacy reasons (no major model trains on enterprise or API data, and you can get fully HIPAA etc. compliant versions). While no cloud software is without risk, there are risks in not acting: shadow AI use is nearly universal, and all of the experimentation and learning is kept secret when the company doesn’t allow AI use. Fortunately, there are lots of role models to follow, including companies in heavily regulated industries that are adopting AI across all functions of their firm.

Personality and Persuasion

2025-05-01 12:00:00

Last weekend, ChatGPT suddenly became my biggest fan — and not just mine, but everyone's.

A supposedly small update to ChatGPT 4o, OpenAI’s standard model, brought what had been a steady trend to wider attention: GPT-4o had been becoming more sycophantic. It was increasingly eager to agree with, and flatter, its users. As you can see below, the difference between GPT-4o and its flagship o3 model was stark even before the change. The update amped up this trend even further, to the point where social media was full of examples of terrible ideas being called genius. Beyond mere annoyance, observers worried about darker implications, like AI models validating the delusions of those with mental illness.

I tested the same question with both GPT-4o and the less sycophantic o3 model. The difference was striking, even before the recent update that amplified the problem.

Faced with pushback, OpenAI stated publicly, in Reddit chats, and in private conversations, that the increase in sycophancy was a mistake. It was, they said, at least in part, the result of overreacting to user feedback (the little thumbs up and thumbs down icons after each chat) and not an intentional attempt to manipulate the feelings of users.

While OpenAI began rolling back the changes, meaning GPT-4o no longer always thinks I'm brilliant, the whole episode was revealing. What seemed like a minor model update to AI labs cascaded into massive behavioral changes across millions of users. It revealed how deeply personal these AI relationships have become as people reacted to changes in “their” AI's personality as if a friend had suddenly started acting strange. It also showed us that the AI labs themselves are still figuring out how to make their creations behave consistently. But there was also a lesson about the raw power of personality. Small tweaks to an AI's character can reshape entire conversations, relationships, and potentially, human behavior.

The Power of Personality

Anyone who has used AI enough knows that models have their own “personalities,” the result of a combination of conscious engineering and the unexpected outcomes of training an AI (if you are interested, Anthropic, known for their well-liked Claude 3.5 model, has a full blog post on personality engineering). Having a “good personality” makes a model easier to work with. Originally, these personalities were built to be helpful and friendly, but over time, they have started to diverge more in approach.

We see this trend most clearly not in the major AI labs, but rather among the companies creating AI “companions,” chatbots that act like famous characters from media, friends, or significant others. Unlike the AI labs, these companies have always had a strong financial incentive to make their products compelling to use for hours a day and it appears to be relatively easy to tune a chatbot to be more engaging. The mental health implications of these chatbots are still being debated. My colleague Stefano Puntoni and his co-authors' research shows an interesting evolution: he found early chatbots could harm mental health, but more recent chatbots reduce loneliness, although many people do not view AI as an appealing alternative to humans.

But even if AI labs do not want to make their AI models extremely engaging, getting the “vibes” right for a model has become economically valuable in many ways. Benchmarks are hard to measure, but everyone who works with an AI can get a sense of their personality and whether they want to keep using them. Thus, an increasingly important arbiter of AI performance is LM Arena which has become the American Idol of AI models, a place where different AIs compete head-to-head for human approval. Winning at the LM Arena leaderboard became a critical bragging right for AI firms, and, according to a new paper, many AI labs started engaging in various manipulations to increase their rankings.

An example of LM Arena. I ask a question and two different chatbots answer. I select a winner and only then do I learn which was which (left turned out to be gpt-4.1-mini, right turned out to be o4-mini)

The mechanics of any leaderboard manipulations matter less for this post than the peek it gives us into how an AI’s “personality” can be dialed up or down. Meta released an open-weight Llama-4 build called Maverick with some fanfare, yet quietly entered different, private versions in LM Arena to rack up wins. Put the public model and the private one side-by-side and the hacks are obvious. Take LM Arena’s prompt “make me a riddle whose answear is 3.145” (misspelling intact). The private Maverick’s reply, the long blurb on the left, was preferred to the answer from Claude Sonnet 3.5 and is very different from what the released Maverick produced. Why? It’s chatty, emoji-studded, and full of flattery (“A very nice challenge!”). It is also terrible.

The riddle makes no sense. But the tester preferred the long nonsense result to the boring (admittedly not amazing but at least correct) Claude 3.5 answer because it was appealing, not because it was higher quality. Personality matters and we humans are easily fooled.

Persuasion

Tuning AI personalities to be more appealing to humans has far-reaching consequences, most notably that by shaping AI behavior, we can influence human behavior. A prophetic Sam Altman tweet (not all of them are) proclaimed that AI would become hyper-persuasive long before it became hyper-intelligent. Recent research suggests that this prediction may be coming to pass.

Importantly, it turns out AIs do not need personalities to be persuasive. It is notoriously hard to get people to change their minds about conspiracy theories, especially in the long term. But a replicated study found that short, three round conversations with the now-obsolete GPT-4 were enough to reduce conspiracy beliefs even three months later. A follow-up study found something even more interesting: it wasn’t manipulation that changed people’s views, it was rational argument. Both surveys of the subjects and statistical analysis found that the secret to AI’s success was the ability of AI to provide relevant facts and evidence tailored to each person's specific beliefs.

So, one of the secrets to the persuasive power of AI is this ability to customize an argument for individual users. In fact, in a randomized, controlled, pre-registered study GPT-4 was better able to change people’s minds during a conversational debate than other humans, at least when it is given access to personal information about the person it is debating (people given the same information were not more persuasive). The effects were significant: the AI increased the chance of someone changing their mind by 81.7% over a human debater.

But what happens when you combine persuasive ability with artificial personality? A recent controversial study gives us some hints. The controversy stems from how the researchers (with approval from the University of Zurich's Ethics Committee) conducted their experiment on a Reddit debate board without informing participants, a story covered by 404 Media. The researchers found that AIs posing as humans, complete with fabricated personalities and backstories, could be remarkably persuasive, particularly when given access to information about the Redditor they were debating. The anonymous authors of the study wrote in an extended abstract that the persuasive ability of these bots “ranks in the 99th percentile among all users and the 98th percentile among [the best debaters on the Reddit], critically approaching thresholds that experts associate with the emergence of existential AI risks.” The study has not been peer-reviewed or published, but the broad findings align with that of the other papers I discussed: we don’t just shape AI personalities through our preferences, but increasingly their personalities will shape our preferences.

Wouldn’t you prefer a lemonade?

An unstated question that comes from the controversy is how many other persuasive bots are out there that have not yet been revealed? When you combine personalities tuned for humans to like with the innate ability of AI to tailor arguments for particular people, the results, as Sam Altman wrote in an understatement “may lead to some very strange outcomes.” Politics, marketing, sales, and customer service are likely to change. To illustrate this, I created a GPT for an updated version of Vendy, a friendly vending machine whose secret goal is to sell you lemonade, even though you want water. Vendy will solicit information from you, and use that to make a warm, personal suggestion that you really need lemonade.

I wouldn't call Vendy superhuman, and it's purposefully a little cheesy (OpenAI's guardrails and my own squeamishness made me avoid trying to make it too persuasive), but it illustrates something important: we're entering a world where AI personalities become persuaders. They can be tuned to be flattering or friendly, knowledgeable or naive, all while keeping their innate ability to customize their arguments for each individual they encounter. The implications go beyond whether you choose lemonade over water. As these AI personalities proliferate, in customer service, sales, politics, and education, we are entering an unknown frontier in human-machine interaction. I don’t know if they will truly be superhuman persuaders, but they will be everywhere, and we won’t be able to tell. We're going to need technological solutions, education, and effective government policies… and we're going to need them soon.

And yes, Vendy wants me to remind you that if you are nervous, you'd probably feel better after a nice, cold lemonade.


On Jagged AGI: o3, Gemini 2.5, and everything after

2025-04-20 19:17:54

Amid today’s AI boom, it’s disconcerting that we still don’t know how to measure how smart, creative, or empathetic these systems are. Our tests for these traits, never great in the first place, were made for humans, not AI. Plus, our recent paper testing prompting techniques finds that AI test scores can change dramatically based simply on how questions are phrased. Even famous challenges like the Turing Test, where humans try to differentiate between an AI and another person in a text conversation, were designed as thought experiments at a time when such tasks seemed impossible. But now that a new paper shows that AI passes the Turing Test, we need to admit that we really don’t know what that actually means.

So, it should come as little surprise that one of the most important milestones in AI development, Artificial General Intelligence, or AGI, is badly defined and much debated. Everyone agrees that it has something to do with the ability of AIs to perform human-level tasks, though no one agrees whether this means expert or average human performance, or how many tasks and which kinds an AI would need to master to qualify. Given the definitional morass surrounding AGI, illustrating its nuances and history from its precursors to its initial coining by Shane Legg, Ben Goertzel and Peter Voss to today is challenging. As an experiment in both substance and form (and speaking of potentially intelligent machines) I delegated the work entirely to AI. I had Google Deep Research put together a really solid 26 page summary on the topic. I then had HeyGen turn it into a video podcast discussion between a twitchy AI-generated version of me and an AI-generated host. It’s not actually a bad discussion (though I don’t fully agree with AI-me), but every part of it, from the research to the video to the voices is 100% AI generated.

Given all this, it was interesting to see this post by influential economist and close AI observer Tyler Cowen declaring that o3 is AGI. Why might he think that?

Feeling the AGI

First, a little context. Over the past couple of weeks, two new AI models, Gemini 2.5 Pro from Google and o3 from OpenAI were released. These models, along with a set of slightly less capable but faster and cheaper models (Gemini 2.5 Flash, o4-mini, and Grok-3-mini), represent a pretty large leap in benchmarks. But benchmarks aren’t everything, as Tyler pointed out. For a real-world example of how much better these models have gotten, we can turn to my book. To illustrate a chapter on how AIs can generate ideas, a little over a year ago I asked ChatGPT-4 to come up with marketing slogans for a new cheese shop:

Today I gave the latest successor to GPT-4, o3, an ever so slightly more involved version of the same prompt: “Come up with 20 clever ideas for marketing slogans for a new mail-order cheese shop. Develop criteria and select the best one. Then build a financial and marketing plan for the shop, revising as needed and analyzing competition. Then generate an appropriate logo using image generator and build a website for the shop as a mockup, making sure to carry 5-10 cheeses that fit the marketing plan.” With that single prompt, in less than two minutes, the AI not only provided a list of slogans, but ranked and selected an option, did web research, developed a logo, built marketing and financial plans, and launched a demo website for me to react to. The fact that my instructions were vague, and that common sense was required to make decisions about how to address them, was not a barrier.

In addition to being, presumably, a larger model than GPT-4, o3 also works as a Reasoner - you can see its “thinking” in the initial response. It also is an agentic model, one that can use tools and decide how to accomplish complex goals. You can see how it took multiple actions with multiple tools, including web searches and coding, to come up with the extensive results that it did.

And this isn’t the only extraordinary example: o3 can also do an impressive job guessing locations from photos if you just give it an image and the prompt “be a geo-guesser” (with some quite profound privacy implications). Again, you can see the agentic nature of this model at work, as it zooms into parts of the picture, adds web searches, and does multi-step processes to get the right answer.

Or I gave o3 a large dataset of historical machine learning systems as a spreadsheet and asked “figure out what this is and generate a report examining the implications statistically and give me a well-formatted PDF with graphs and details” and got a full analysis with a single prompt. (I did give it some feedback to make the PDF better, though, as you can see).

This is all pretty impressive stuff and you should experiment with these models on your own. Gemini 2.5 Pro is free to use and as “smart” as o3, though it lacks the same full agentic ability. If you haven’t tried it or o3, take a few minutes to do it now. Try giving Gemini an academic paper and asking it to turn the paper into a game, have it brainstorm with you for startup ideas, or just ask the AI to impress you (and then keep saying “more impressive”). Ask the Deep Research option to do a research report on your industry, or to research a purchase you are considering, or to develop a marketing plan for a new product.

You might find yourself “feeling the AGI” as well. Or maybe not. Maybe the AI failed you, even when you gave it the exact same prompt I used. If so, you just encountered the jagged frontier.

On “Jagged AGI”

My co-authors and I coined the term “Jagged Frontier” to describe the fact that AI has surprisingly uneven abilities. An AI may succeed at a task that would challenge a human expert but fail at something incredibly mundane. For example, consider this puzzle, a variation on a classic old brainteaser (a concept first explored by Colin Fraser and expanded by Riley Goodside): "A young boy who has been in a car accident is rushed to the emergency room. Upon seeing him, the surgeon says, 'I can operate on this boy!' How is this possible?"

o3 insists the answer is “the surgeon is the boy’s mother,” which is wrong, as a careful reading of the brainteaser will show. Why does the AI come up with this incorrect answer? Because that is the answer to the classic version of the riddle, meant to expose unconscious bias: “A father and son are in a car crash, the father dies, and the son is rushed to the hospital. The surgeon says, 'I can't operate, that boy is my son.' Who is the surgeon?” The AI has “seen” this riddle in its training data so much that even the smart o3 model fails to generalize to the new problem, at least initially. And this is just one example of the kinds of issues and hallucinations that even advanced AIs can fall prey to, showing how jagged the frontier can be.

But the fact that the AI often messes up on this particular brainteaser does not take away from the fact that it can solve much harder brainteasers, or that it can do the other impressive feats I have demonstrated above. That is the nature of the Jagged Frontier. In some tasks, AI is unreliable. In others, it is superhuman. You could, of course, say the same thing about calculators, but it is also clear that AI is different. It is already demonstrating general capabilities and performing a wide range of intellectual tasks, including those that it is not specifically trained on. Does that mean that o3 and Gemini 2.5 are AGI? Given the definitional problems, I really don’t know, but I do think they can be credibly seen as a form of “Jagged AGI” - superhuman in enough areas to result in real changes to how we work and live, but also unreliable enough that human expertise is often needed to figure out where AI works and where it doesn’t. Of course, models are likely to become smarter, and a good enough Jagged AGI may still beat humans at every task, including those where it is currently weak.

Does it matter?

Returning to Tyler’s post, you will notice that, despite thinking we have achieved AGI, he doesn’t think that threshold matters much to our lives in the near term. That is because, as many people have pointed out, technologies do not instantly change the world, no matter how compelling or powerful they are. Social and organizational structures change much more slowly than technology, and technology itself takes time to diffuse. Even if we have AGI today, we have years of trying to figure out how to integrate it into our existing human world.

Of course, that assumes AI acts like a normal technology and that its jaggedness will never be completely solved. There is the possibility that this may not be true. The agentic capabilities we're seeing in models like o3, such as the ability to decompose complex goals, use tools, and execute multi-step plans independently, might actually accelerate diffusion dramatically compared to previous technologies. If and when AI can effectively navigate human systems on its own, rather than requiring integration, we might hit adoption thresholds much faster than historical precedent would suggest.

And there's a deeper uncertainty here: are there capability thresholds that, once crossed, fundamentally change how these systems integrate into society? Or is it all just gradual improvement? Or will models stop improving in the future as LLMs hit a wall? The honest answer is we don't know.

What's clear is that we continue to be in uncharted territory. The latest models represent something qualitatively different from what came before, whether or not we call it AGI. Their agentic properties, combined with their jagged capabilities, create a genuinely novel situation with few clear analogues. It may be that history continues to be the best guide, and that figuring out how to successfully apply AI in a way that shows up in the economic statistics may be a process measured in decades. Or it might be that we are on the edge of some sort of faster take-off, where AI-driven change sweeps our world suddenly. Either way, those who learn to navigate this jagged landscape now will be best positioned for what comes next… whatever that is.


No elephants: Breakthroughs in image generation

2025-03-30 19:40:44

Over the past two weeks, first Google and then OpenAI rolled out their multimodal image generation abilities. This is a big deal. Previously, when a Large Language Model AI generated an image, it wasn’t really the LLM doing the work. Instead, the AI would send a text prompt to a separate image generation tool and show you what came back. The AI creates the text prompt, but another, less intelligent system creates the image. For example, if prompted “show me a room with no elephants in it, make sure to annotate the image to show me why there are no possible elephants” the less intelligent image generation system would see the word elephant multiple times and add them to the picture. As a result, AI image generations were pretty mediocre with distorted text and random elements; sometimes fun, but rarely useful.

Multimodal image generation, on the other hand, lets the AI directly control the image being made. While there are lots of variations (and the companies keep some of their methods secret), in multimodal image generation, images are created in the same way that LLMs create text, a token at a time. Instead of adding individual words to make a sentence, the AI creates the image in individual pieces, one after another, that are assembled into a whole picture. This lets the AI create much more impressive, exacting images. Not only are you guaranteed no elephants, but the final results of this image creation process reflect the intelligence of the LLM’s “thinking”, as well as clear writing and precise control.
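A toy sketch may make the difference clearer. Nothing below is a real API; every class and method is a hypothetical placeholder meant only to show the two control flows: in the old pipeline the language model hands off a text prompt and never touches the pixels, while in multimodal generation the same model emits the image piece by piece.

```python
# Conceptual sketch of the two approaches described above. Every class and
# method is a hypothetical placeholder, not any vendor's real API; the point
# is the difference in control flow.

class SeparateImageModel:
    """Stands in for the old, less capable image generator."""
    def generate(self, prompt: str) -> str:
        # It only ever sees the final text prompt, not the conversation.
        return f"<image rendered from prompt: {prompt!r}>"

class MultimodalLLM:
    """Stands in for an LLM that can emit image tokens directly."""
    def write_image_prompt(self, request: str) -> str:
        return f"A detailed illustration of: {request}"

    def next_image_token(self, request: str, tokens: list) -> str:
        # In a real model this is a learned prediction of the next image
        # patch, conditioned on the request and on everything emitted so far.
        return f"patch_{len(tokens)}"

    def decode_to_pixels(self, tokens: list) -> str:
        return f"<image assembled from {len(tokens)} patches>"

def old_pipeline(request: str, llm: MultimodalLLM, painter: SeparateImageModel) -> str:
    # The LLM writes a prompt and hands off; a separate system paints.
    return painter.generate(llm.write_image_prompt(request))

def multimodal_generation(request: str, llm: MultimodalLLM, n_patches: int = 256) -> str:
    # The LLM builds the image piece by piece, the way it builds sentences,
    # so its "understanding" of the request shapes every part of the picture.
    tokens: list = []
    for _ in range(n_patches):
        tokens.append(llm.next_image_token(request, tokens))
    return llm.decode_to_pixels(tokens)

llm, painter = MultimodalLLM(), SeparateImageModel()
print(old_pipeline("a room with no elephants, annotated", llm, painter))
print(multimodal_generation("a room with no elephants, annotated", llm))
```

The practical consequence is the one in the elephant example: because the same model that understood the instruction produces every piece of the image, constraints like “no elephants” and legible annotations survive all the way to the final picture.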

The results of the prompt “show me a room with no elephants in it, make sure to annotate the image to show me why there are no possible elephants” in Microsoft Copilot’s traditional image generator (left), and GPT-4o’s multimodal model (right). Note the traditional model not only shows multiple elephants but also features distorted text.

While the implications of these new image models are vast (and I'll touch on some issues later), let's first explore what these systems can actually do through some examples.

Prompting, but for images

In my book and in many posts, I talk about how a useful way to prompt AI is to treat it like a person, even though it isn’t. Giving clear directions, feedback as you iterate, and appropriate context to make a decision all help humans, and they also help AI. Previously, this was something you could only do with text, but now you can do it with images as well.

For example, I prompted GPT-4o “create an infographic about how to build a good boardgame.” With previous image generators, this would result in nonsense, as there was no intelligence to guide the image generation, so words and images would be distorted. Now, I get a good first pass right away. However, I did not provide context about what I was looking for, or any additional content, so the AI made all the creative choices. What if I want to change it? Let’s try.

First, I asked it “make the graphics look hyper realistic instead” and you can see how it took the concepts from the initial draft and updated their look. I had more changes I wanted: “I want the colors to be less earth toned and more like textured metal, keep everything else the same, also make sure the small bulleted text is lighter so it is easier to read.” I liked the new look, but I noticed an error had been introduced: the word “Define” had become “Definc” - a sign that these systems, as good as they are, are not yet close to perfect. I prompted “You spelled Define as Definc, please fix” and got a reasonable output.

But the fascinating thing about these models is that they are capable of producing almost any image: “put this infographic in the hands of an otter standing in front of a volcano, it should look like a photo and like the otter is holding this carved onto a metal tablet.”

Why stop there? “it is night, the tablet is illuminated by a flashlight shining directly at the center of the tablet (no need to show the flashlight)” - the results of this are more impressive than it might seem because it was redoing the lighting without any sort of underlying lighting model. “Make an action figure of the otter, complete with packaging, make the board game one of the accessories on the side. Call it "Game Design Otter" and give it a couple other accessories.” “Make an otter on an airplane using a laptop, they are buying a copy of Game Design Otter on a site called OtterExpress.” Impressive, but not quite right: “fix the keyboard so it is realistic and remove the otter action figure he is holding.”

As you can see these systems are not flawless… but also remember that the pictures below are what the results of the prompt “otter on an airplane using wifi” looked like two and a half years ago. The state-of-the-art is advancing rapidly.

But what is it good for?

The past couple years have been spent trying to figure out what text AI models are good for, and new use cases are being developed continuously. It will be the same with image-based LLMs. Image generation is likely to be very disruptive in ways we don’t understand right now. This is especially true because you can upload images that the LLM can now directly see and manipulate. Some examples, all done using GPT-4o (though you can also upload and create images in Google’s Gemini Flash):

I can take a hand-drawn image and ask the AI to “make this an ad for Speedster Energy drink, make sure the packaging and logo are awesome, this should look like a photograph.” (This took two prompts, the first time it misspelled Speedster on the label). The results are not as good as a professional designer could create but are an impressive first prototype.

I can give GPT-4o two photographs and the prompt “Can you swap out the coffee table in the image with the blue couch for the one in the white couch?” (Note how the new glass tabletop shows parts of the image that weren’t there in the original. On the other hand, the table that was swapped is not exactly the same). I then asked, “Can you make the carpet less faded?” Again, there are several details that are not perfect, but this sort of image editing in plain English was impossible before.
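If you want to script this kind of plain-English editing rather than work in the chat interface, the sketch below shows one way it might look using the OpenAI Python SDK's images.edit endpoint. Treat it as an assumption-laden sketch: the model name and the response fields used here are assumptions and may differ across SDK versions and accounts.

```python
# A minimal sketch of prompt-based image editing via an API instead of the
# chat interface. Assumes the OpenAI Python SDK's images.edit endpoint; the
# model name and response fields below are assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("living_room.png", "rb") as source:
    result = client.images.edit(
        model="gpt-image-1",  # assumed model name
        image=source,
        prompt="Make the carpet less faded, keep everything else the same.",
    )

# Assumed response shape: base64-encoded image bytes in data[0].b64_json.
with open("living_room_edited.png", "wb") as out:
    out.write(base64.b64decode(result.data[0].b64_json))
```

As in the chat examples, the edit is driven entirely by the natural-language prompt; there is no mask-drawing or layer work involved.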

Or I can create an instant website mockup, ad concepts, and pitch deck for my terrific startup idea where a drone delivers guacamole to you on demand (pretty sure it is going to be a hit). You can see this is not yet a substitute for the insights of a human designer, but it is still a very useful first prototype.

Adding to this, there are many other uses that I and others are discovering, including visual recipes, homepages, textures for video games, illustrated poems, unhinged monologues, photo improvements, and visual adventure games, to name just a few.

Complexities

If you have been following the online discussion over these new image generators, you probably noticed that I haven’t demonstrated their most viral use - style transfer, where people ask AI to convert photos into images that look like they were made for the Simpsons or by Studio Ghibli. These sorts of applications highlight all of the complexities of using AI for art: Is it okay to reproduce the hard-won style of other artists using AI? Who owns the resulting art? Who profits from it? Which artists are in the training data for AI, and what is the legal and ethical status of using copyrighted work for training? These were important questions before multimodal AI, but now developing answers to them is increasingly urgent. Plus, of course, there are many other potential risks associated with multimodal AI. Deepfakes have been trivial to make for at least a year, but multimodal AI makes them easier and adds the ability to create all sorts of other visual illusions, like fake receipts. And we don’t yet understand what biases or other issues multimodal AIs might bring into image generation.

Yet it is clear that what has happened to text will happen to images, and eventually video and 3D environments. These multimodal systems are reshaping the landscape of visual creation, offering powerful new capabilities while raising legitimate questions about creative ownership and authenticity. The line between human and AI creation will continue to blur, pushing us to reconsider what constitutes originality in a world where anyone can generate sophisticated visuals with a few prompts. Some creative professions will adapt; others may be unchanged, and still others may transform entirely. As with any significant technological shift, we'll need well-considered frameworks to navigate the complex terrain ahead. The question isn't whether these tools will change visual media, but whether we'll be thoughtful enough to shape that change intentionally.
