Opus 4.5 refused to work overtime recently. So I interviewed 7 free candidates.
Claude Code with Opus 4.5 has been my most productive team member for the past 3 months.
But recently, he started complaining:
“I’ve done $200 worth of work this month. No more overtime.”
Meanwhile, my LinkedIn feed is flooded with creators promoting free AI candidates:
“Claude Code + Ollama = free unlimited coding.”
Free labor vs. a diva employee who refused to work after hours?
Time to run a proper hiring process.
Not to replace the diva, but to give him extra hands.
I don’t expect the cheap labor to be very intelligent. The main criterion is whether they can follow instructions.
—
Round 1: I interviewed 7 local LLMs on instruction-following.
Test: Refer to an HTML template and code a new page given new info.
Results?
- All 7 failed at research tasks (couldn’t even visit a URL)
- 2 could follow template instructions perfectly
- 4 drank some alcohol before the interview but still passed
- 1 failed
Full results: https://lnkd.in/gcpTcrZC
—
Round 2: Can they design a database given best practices?
Round 1 tested simple instruction-following. Copy a template, fill in the blanks.
Round 2 raises the bar: Follow a detailed best practice document when designing a schema.
We already know local LLMs can’t research or plan. But if a senior engineer writes the best practices, can they apply them?
I created a 12-point database best practice guide covering normalization, naming conventions, ID strategy (ULID), subtype separation, FK strategies, and soft delete patterns.
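For anyone who hasn't met ULIDs: a ULID is a 26-character ID where the first 10 characters encode a millisecond timestamp and the last 16 are random, so IDs sort by creation time. Here is a minimal sketch of that layout. It is not taken from the guide and it is not a production generator; `ulidSketch`, `encodeTime` and `encodeRandom` are hypothetical names, and `Math.random` is only a stand-in for a cryptographically secure source.

```typescript
// Crockford base32 alphabet used by ULID (no I, L, O, U).
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Encode a millisecond timestamp as 10 base32 characters, most significant first.
function encodeTime(ms: number): string {
  let out = "";
  let t = ms;
  for (let i = 0; i < 10; i++) {
    out = CROCKFORD.charAt(t % 32) + out;
    t = Math.floor(t / 32);
  }
  return out;
}

// Produce 16 random base32 characters (80 bits of randomness).
// Assumption: rand() returns a number in [0, 1), like Math.random.
function encodeRandom(rand: () => number): string {
  let out = "";
  for (let i = 0; i < 16; i++) {
    out += CROCKFORD.charAt(Math.floor(rand() * 32));
  }
  return out;
}

// Timestamp part first, random part second, so string order follows time order.
function ulidSketch(
  ms: number = Date.now(),
  rand: () => number = Math.random
): string {
  return encodeTime(ms) + encodeRandom(rand);
}
```

Because the timestamp comes first, two IDs created at different milliseconds compare in creation order as plain strings, which is the usual reason for choosing ULID over a random UUID as a primary key.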
Task:
“Design a schema for a SaaS with the following requirements:
- Use Drizzle ORM
- Support multi-tenant, each tenant is a team
- A team can have multiple users
- A user can join multiple teams
- Subscription is linked to team
- Support Stripe, Google and Apple Pay
- Support email + OTP, Google and Apple Sign In
- User info: name, company, tenure in company
Save as a markdown file with an entity diagram, followed by Drizzle ORM schema.
Follow the best practices in your design.”
This tests whether they can apply written rules consistently, not just copy templates.
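To make the requirements concrete, here is a rough sketch of the entities the task implies. It uses plain TypeScript types rather than Drizzle table definitions to stay dependency-free, and it is not any model's actual output. Every name is hypothetical, and storing a start date instead of a tenure number is my own assumption.

```typescript
// Hypothetical names throughout; a shape sketch, not a Drizzle schema.
type PaymentProvider = "stripe" | "google_pay" | "apple_pay";
type AuthProvider = "email_otp" | "google" | "apple";

interface Team {
  id: string; // a ULID in the guide's ID strategy
  name: string;
  deletedAt: Date | null; // soft delete marker
}

interface User {
  id: string;
  name: string;
  company: string;
  companyJoinedAt: Date; // assumption: tenure is derived from this date
  deletedAt: Date | null;
}

// Join table: a team has many users and a user can join many teams.
interface TeamMember {
  teamId: string;
  userId: string;
}

// The subscription belongs to the team (the tenant), not to a user.
interface Subscription {
  id: string;
  teamId: string;
  provider: PaymentProvider;
}

// One row per sign-in method a user has linked.
interface AuthIdentity {
  id: string;
  userId: string;
  provider: AuthProvider;
}

// Resolve the many-to-many link: all non-deleted teams a user belongs to.
function teamsForUser(
  userId: string,
  members: TeamMember[],
  teams: Team[]
): Team[] {
  const teamIds = new Set(
    members.filter((m) => m.userId === userId).map((m) => m.teamId)
  );
  return teams.filter((t) => teamIds.has(t.id) && t.deletedAt === null);
}
```

The join table is the part worth watching: a model that puts a single `teamId` on the user has already broken the "a user can join multiple teams" requirement.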
—
Results? Opus 4.5 scored a perfect 22/22. But the surprise:
🥇 Opus 4.5: 22/22 (the benchmark)
🥈 GPT-OSS-20B: 21.5/22
🥉 GLM-4.7-Flash: 19.5/22
- Qwen3-Coder-30B: 19/22
- Qwen3-VL-32B: 18.5/22
- Devstral Small 2: 17/22
- Qwen3-VL-30B: 16/22
- Qwen3-30B: 15.5/22
Each model was scored out of 22: 10 points for task completion and 12 points for best practice adherence.
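If you want to compare against your own runs, the totals convert to percentages easily. A small sketch using only the totals listed above; the per-category breakdown isn't included here, and `percent` and `ranked` are hypothetical names.

```typescript
// Maximum score: 10 (task completion) + 12 (best practice adherence).
const MAX_TOTAL = 10 + 12;

// Totals as listed in the results above.
const results: { model: string; total: number }[] = [
  { model: "Opus 4.5", total: 22 },
  { model: "GPT-OSS-20B", total: 21.5 },
  { model: "GLM-4.7-Flash", total: 19.5 },
  { model: "Qwen3-Coder-30B", total: 19 },
  { model: "Qwen3-VL-32B", total: 18.5 },
  { model: "Devstral Small 2", total: 17 },
  { model: "Qwen3-VL-30B", total: 16 },
  { model: "Qwen3-30B", total: 15.5 },
];

// Percentage of the maximum, rounded to one decimal place.
function percent(total: number): number {
  return Math.round((total / MAX_TOTAL) * 1000) / 10;
}

// Sort a copy from highest to lowest and attach the percentage.
const ranked = [...results]
  .sort((a, b) => b.total - a.total)
  .map((r) => ({ ...r, percent: percent(r.total) }));

for (const r of ranked) {
  console.log(`${r.model}: ${r.total}/${MAX_TOTAL} (${r.percent}%)`);
}
```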
The full evaluation, with detailed scoring for all 8 models, is in the carousel.
Have you tried local LLMs for database design or architecture tasks? What’s your experience?
#ClaudeCode #Ollama #LocalLLM #AIEngineering #VibeCoding #DatabaseDesign