AI failed at over 96% of jobs.
A new report from Scale AI and the Center for AI Safety tested AI agents against 240 real freelance projects from Upwork. The best agent completed only 2.5% of them to acceptable quality.
The report: https://lnkd.in/gnSb7MPJ
Meanwhile, on the same feed, founders are claiming AI replaced half their team.
Same AI models. Opposite conclusions.
My take: the researchers may not have had the domain expertise needed to automate the work they were testing. I know tech-savvy entrepreneurs (myself included) who are already automating work that used to require hiring people.
The difference has nothing to do with AI capability.
It’s a difference in mental model when using AI.
One group expects AI to be an expert of all trades.
They give AI the brief, expect senior-level output, and dismiss it when the result is average.
“See? AI is overhyped.”
The other group treats AI like a junior employee.
They know LLMs are an averaging technology - the output converges to the statistical mean of its training data. By design. This means average output.
So they don’t expect expert output out of the box.
They invest in “training” the junior.
Curate skill files written by domain experts.
Build workflows based on domain knowledge.
Write detailed specifications.
Review, iterate, refine.
That’s when the output stops looking “average.”
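Here is a minimal sketch of what that "training" can look like in practice, assuming the skill files and specification are plain text that gets assembled into one prompt. The `Onboarding` class and `build_prompt` function are hypothetical names made up for illustration, not part of Claude Code or any other tool.

```python
from dataclasses import dataclass, field


@dataclass
class Onboarding:
    """Everything a 'junior' would get on day one (hypothetical structure)."""
    brief: str
    # skill name -> guidance written by a domain expert
    skills: dict[str, str] = field(default_factory=dict)
    # concrete acceptance criteria the output must meet
    spec: list[str] = field(default_factory=list)


def build_prompt(o: Onboarding) -> str:
    """Assemble one prompt: expert guidance first, then the task, then the spec."""
    parts = []
    for name, guidance in o.skills.items():
        parts.append(f"## Skill: {name}\n{guidance.strip()}")
    parts.append(f"## Task\n{o.brief.strip()}")
    if o.spec:
        criteria = "\n".join(f"- {item}" for item in o.spec)
        parts.append(f"## Acceptance criteria\n{criteria}")
    return "\n\n".join(parts)


# Group 1: the bare brief, nothing else.
group_1 = build_prompt(Onboarding(brief="Write me an ad."))

# Group 2: the same brief, plus expert guidance and a specification.
group_2 = build_prompt(Onboarding(
    brief="Write me an ad.",
    skills={"brand voice": "Plain, direct, no superlatives."},
    spec=["Under 40 words", "One call to action"],
))

print(group_2)
```

The brief is identical in both cases. The only thing that changes is how much domain knowledge travels with it, and the review-and-iterate step would feed back into the skills and spec over time.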
I’ve been in both groups.
When I first started building apps with Claude Code, I threw a PRD at it and got demos that broke in production. Classic Group 1.
Three production apps later, I now write detailed project files, create expert skill specifications, and treat every AI session like onboarding a new hire.
Same tool. Completely different output.
This pattern holds across functions:
- A marketer who loads brand voice and campaign references gets strategic output. One who types “write me an ad” gets slop.
- A developer who gives detailed specs and test harnesses gets production code. One who says “build me an app” gets a demo.
The difference is never the AI.
Next time an AI tool disappoints you, check your workflow. Not the model.
LLMs produce average output by design. The expert’s job is to guide them past that average.
#AI #AIAgents #LLM #ExpertInTheLoop