Study: Generative AI results depend on user prompts as much as models
As generative artificial intelligence systems improve, a natural assumption is that better large language models (LLMs) will lead to better results. But new research from several MIT Sloan affiliates suggests that LLM advances are only part of the story.
In a large-scale experiment, researchers found that only half of the performance gains seen after switching to a more advanced AI model came from the model itself.
The other half came from how users adapted their prompts — that is, the written instructions that tell an AI model what to do — to take advantage of the new system.
This simple but powerful insight, that user adaptation contributes as much to performance as the model upgrade itself, highlights a critical reality for businesses: Investing in new AI tools won’t deliver the anticipated value unless employees also refine how they use them. The study also suggests that prompting is a learnable skill that people can improve quickly, even without instruction.
“People often assume that better results come mostly from better models,” said Columbia University assistant professor David Holtz, SM ’18, PhD ’21, a research affiliate at the MIT Initiative on the Digital Economy and one of the study’s co-authors. “The fact that nearly half the improvement came from user behavior really challenges that belief.”
Better prompts, improved models boost performance
In the experiment, nearly 1,900 participants were randomly assigned to one of three versions of OpenAI’s DALL-E image generation system: DALL-E 2, the more advanced DALL-E 3, or DALL-E 3 with the users’ prompts automatically rewritten by the GPT-4 LLM without their knowledge.
Participants were shown a reference image — such as a photo, graphic design, or piece of art — and asked to re-create it by typing instructions into the AI. They had 25 minutes to submit at least 10 prompts, and they were told that the top 20% of performers would receive a bonus payment, which motivated them to test and improve their instructions.
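To make the setup concrete, the sketch below shows how the three conditions could be wired together using the OpenAI Python SDK. It is an illustration only, not the study’s code: the model names follow OpenAI’s public API, and the rewrite instruction given to GPT-4 is an assumed placeholder.

```python
# Illustrative sketch of the three experimental conditions (not the study's code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_image(prompt: str, condition: str) -> str:
    """Return the URL of an image generated under one of the three conditions."""
    if condition == "dalle2":
        # Condition 1: the older model, given the user's prompt verbatim.
        result = client.images.generate(model="dall-e-2", prompt=prompt)
    elif condition == "dalle3":
        # Condition 2: the newer model, given the user's prompt verbatim.
        result = client.images.generate(model="dall-e-3", prompt=prompt)
    elif condition == "dalle3_rewrite":
        # Condition 3: the prompt is silently rewritten by GPT-4 before it
        # reaches the image model; the user never sees the rewritten text.
        rewrite = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "Rewrite the user's image prompt to be more detailed."},
                {"role": "user", "content": prompt},
            ],
        )
        result = client.images.generate(
            model="dall-e-3", prompt=rewrite.choices[0].message.content
        )
    else:
        raise ValueError(f"unknown condition: {condition}")
    return result.data[0].url

# Example: the same user prompt sent to the newer model, unmodified.
url = generate_image("a watercolor painting of a lighthouse at dusk", "dalle3")
```

In the study, participants in the third group never saw the rewritten prompts, a detail that matters for the results discussed below.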
The researchers found the following:
- Participants who used the baseline version of DALL-E 3 produced images that were more similar to the target image than those generated by DALL-E 2 users.
- Participants using the baseline version of DALL-E 3 also wrote prompts that were 24% longer than those of DALL-E 2 users. Their prompts tended to be more similar to one another and contained more descriptive words.
- About half of the improvement in image similarity came from the improved model, while the other half came from how users adjusted their prompts to take advantage of improved models.
While this study looked at image generation, the researchers think the same pattern will apply to other tasks too, such as writing and coding.
Prompting is about communication, not coding
The research showed that the ability to adapt prompts over time was not limited to tech-savvy users.
“People often think that you need to be a software engineer to prompt well and benefit from AI,” Holtz said. “But our participants came from a wide range of jobs, education levels, and age groups — and even those without technical backgrounds were able to make the most of the new model’s capabilities.”
The data suggests that prompting is more about communication than coding. “The best prompters weren’t software engineers,” Holtz said. “They were people who knew how to express ideas clearly in everyday language, not necessarily in code.”
That accessibility may also help reduce performance gaps between users with different skill levels and experience, noted University of Maryland assistant professor Eaman Jahani, PhD ’22, a digital fellow at the MIT Initiative on the Digital Economy and a study co-author.
“People who start off at the lower end of the [performance] scale benefited the most, which means the differences in outcomes became smaller,” Jahani said. “Model advances can actually help reduce inequality in output.”
Jahani pointed out that the team’s findings apply to tasks with clear, measurable outcomes, where there’s an upper limit on what counts as a good result. It’s not clear, he noted, whether the same pattern would hold in more open-ended tasks without a single right answer and with potentially large payoffs, such as coming up with transformative new ideas.
Rewriting prompts using generative AI led to worse performance
One of the more surprising results came from the group whose prompts were automatically rewritten by GPT-4 before being passed to DALL-E 3. While this feature was designed to help users, it backfired, degrading performance on the image-generation task by 58% relative to the baseline DALL-E 3 group.
The team found that the automatic rewrites often added extra details or changed the meaning of what users were trying to say, leading the AI to produce the wrong kind of image.
“[Automatic prompt rewriting] just doesn’t work well for a task like this, where the goal is to match a target image as closely as possible,” Holtz said. “More importantly, it shows how AI systems can break down when designers make assumptions about how people will use them. If you hard-code hidden instructions into the tool, they can easily conflict with what the user is actually trying to do.”
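The design lesson can be made concrete with a small sketch: rather than hard-coding a hidden rewrite, a tool can surface the rewritten prompt and let the user keep or discard it. This is a minimal illustration, not a feature of the tools studied or a recommendation from the paper; the toy rewriter is hypothetical and simply appends stylistic detail to mimic what an automatic rewrite might do.

```python
# Minimal sketch: surface an automatic rewrite instead of applying it silently.
from typing import Callable, Optional

def submit_prompt(
    user_prompt: str,
    rewriter: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the prompt that will actually be sent to the image model.

    `rewriter` stands in for whatever LLM rewriting step a product might use;
    the key point is that its output is shown to the user, not swapped in silently.
    """
    if rewriter is None:
        return user_prompt

    rewritten = rewriter(user_prompt)
    print("Your prompt was rewritten to:")
    print(f"  {rewritten}")
    keep = input("Use the rewritten version? [y/N] ").strip().lower() == "y"
    # The user's original wording wins unless they explicitly accept the rewrite.
    return rewritten if keep else user_prompt

def add_detail(prompt: str) -> str:
    """Hypothetical stand-in for an automatic 'add more detail' rewrite."""
    return prompt + ", photorealistic, golden-hour lighting, shallow depth of field"

final_prompt = submit_prompt(
    "a red bicycle leaning against a brick wall",
    rewriter=add_detail,
)
```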

How businesses can unlock value in AI
The takeaway is that beyond choosing the “right” AI model, business leaders should also focus on enabling the right kind of user learning and experimentation. Prompting is not a plug-and-play skill, Jahani said. “Companies need to continually invest in their human resources,” he said. “People need to be caught up with these technologies and know how to use them well.”
To build on the gains enabled by generative AI, the researchers offer several priorities for business leaders looking to make AI systems more effective in real-world settings:
- Invest in training and experimentation. Technical upgrades alone are not enough. Giving employees time and support to refine how they interact with AI systems is essential to realizing full performance gains.
- Design for iteration. Interfaces that encourage users to test, revise, and learn — and display the results clearly — help drive better outcomes over time.
- Be cautious with automation. Automated prompt rewriting may be convenient, but if it obscures or overrides user intent, it can hinder performance rather than improve it.
The paper was also co-authored by MIT Sloan PhD students Benjamin S. Manning, SM ’24; Hong-Yi TuYe, SM ’23; and Mohammed Alsobay, ’16, SM ’24; as well as Stanford University PhD student Joe Zhang, Microsoft computational social scientist Siddharth Suri, and University of Cyprus assistant professor Christos Nicolaides, SM ’11, PhD ’14.