Opus 4.5 recently refused to work overtime. So I interviewed 7 free candidates.

Claude Code with Opus 4.5 has been my most productive team member for the past 3 months.

But recently, he started complaining:

“I’ve done $200 worth of work this month. No more overtime.”

Meanwhile, my LinkedIn feed is flooded with creators promoting free AI candidates:

“Claude Code + Ollama = free unlimited coding.”

Free labor vs. a diva employee who refuses to work after hours?

Time to run a proper hiring process.

Not to replace the diva, but to give him extra hands.

I don’t expect the cheap labor to be very intelligent. The main criterion is whether they can follow instructions.

—

Round 1: I interviewed 7 local LLMs on instruction-following.

Test: Refer to an HTML template and code a new page given new info.

Results?

All 7 failed at research tasks (couldn’t even visit a URL)
2 could follow template instructions perfectly
4 drank some alcohol before the interview but still passed
1 failed

Full results: https://lnkd.in/gcpTcrZC

—

Round 2: Can they design a database given best practices?

Round 1 tested simple instruction-following. Copy a template, fill in the blanks.

Round 2 raises the bar: follow a detailed best-practices document when designing a schema.

We already know local LLMs can’t research or plan. But if a senior engineer writes the best practices, can they apply them?

I created a 12-point database best practice guide covering normalization, naming conventions, ID strategy (ULID), subtype separation, FK strategies, and soft delete patterns.
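To make the ID strategy concrete, here is a minimal sketch of how a ULID is put together: a 48-bit millisecond timestamp followed by 80 random bits, written as 26 Crockford Base32 characters. The function names are mine, not taken from the guide, and Math.random is an assumption to keep the sketch dependency-free; real code would use a cryptographic random source or an existing ULID library.

```typescript
// Crockford Base32 alphabet used by the ULID spec (no I, L, O, U).
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Encode a millisecond timestamp as 10 Base32 characters (zero-padded).
function encodeTime(ms: number): string {
  let out = "";
  let t = ms;
  for (let i = 0; i < 10; i++) {
    out = CROCKFORD.charAt(t % 32) + out;
    t = Math.floor(t / 32);
  }
  return out;
}

// Encode 16 random Base32 characters (16 x 5 = 80 bits).
// Assumption: Math.random is used only for illustration; it is not a CSPRNG.
function encodeRandom(): string {
  let out = "";
  for (let i = 0; i < 16; i++) {
    out += CROCKFORD.charAt(Math.floor(Math.random() * 32));
  }
  return out;
}

// Hypothetical helper: timestamp part first, random part second.
function ulid(now: number = Date.now()): string {
  return encodeTime(now) + encodeRandom();
}
```

Because the timestamp comes first and the alphabet is in ASCII order, IDs created in later milliseconds sort after earlier ones as plain strings, which is the usual reason to prefer ULIDs over random UUIDs for primary keys. This sketch skips the spec's optional monotonic ordering within the same millisecond.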

Task:

“Design a schema for a SaaS with the following requirements:

Use Drizzle ORM
Support multi-tenant, each tenant is a team
A team can have multiple users
A user can join multiple teams
Subscription is linked to team
Support Stripe, Google and Apple Pay
Support email + OTP, Google and Apple Sign In
User info: name, company, tenure in company

Save as a markdown file with an entity diagram, followed by Drizzle ORM schema.

Follow the best practices in your design.”

This tests whether they can apply written rules consistently, not just copy templates.
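For readers who want to picture the shape of the expected answer, here is a rough sketch of the entities the requirements imply. It is plain TypeScript rather than Drizzle ORM code, every name is hypothetical, and it is not any model's actual output. The key piece is the join entity that lets a user belong to several teams.

```typescript
// All names are hypothetical; plain TypeScript types, not Drizzle ORM code.
type PaymentProvider = "stripe" | "google_pay" | "apple_pay";
type AuthProvider = "email_otp" | "google" | "apple";

interface Team { id: string; name: string; deletedAt: Date | null }

// Assumption: tenure is derived from a start date instead of being stored directly.
interface User {
  id: string;
  name: string;
  company: string;
  companyStartDate: Date | null;
  deletedAt: Date | null;
}

// Join entity: one row per (team, user) pair gives the many-to-many link.
interface TeamMember { teamId: string; userId: string; deletedAt: Date | null }

// The subscription hangs off the team (the tenant), not the user.
interface Subscription { id: string; teamId: string; provider: PaymentProvider; deletedAt: Date | null }

// One row per sign-in method, so a user can have several.
interface AuthIdentity { id: string; userId: string; provider: AuthProvider }

// Soft delete: rows are kept, and "active" means deletedAt is null.
function activeTeamIdsForUser(members: TeamMember[], userId: string): string[] {
  return members
    .filter((m) => m.userId === userId && m.deletedAt === null)
    .map((m) => m.teamId);
}
```

A real submission would express the same relationships as Drizzle table definitions with foreign keys, following the guide's naming and ID rules.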

—

Results? Opus 4.5 scored a perfect 22/22. But the surprise:

🥇 Opus 4.5 — 22/22 (the benchmark)

🥈 GPT-OSS-20B — 21.5/22

🥉 GLM-4.7-Flash — 19.5/22

Qwen3-Coder-30B — 19/22
Qwen3-VL-32B — 18.5/22
Devstral Small 2 — 17/22
Qwen3-VL-30B — 16/22
Qwen3-30B — 15.5/22

Each model was scored out of 22: 10 points for task completion and 12 points for best-practice adherence.
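If you prefer percentages, the arithmetic is simple. This small sketch uses only the totals listed above; the variable and function names are mine.

```typescript
// Totals as reported above; the maximum is 10 + 12 = 22 points.
const MAX_SCORE = 22;

const results: Array<{ model: string; score: number }> = [
  { model: "Opus 4.5", score: 22 },
  { model: "GPT-OSS-20B", score: 21.5 },
  { model: "GLM-4.7-Flash", score: 19.5 },
  { model: "Qwen3-Coder-30B", score: 19 },
  { model: "Qwen3-VL-32B", score: 18.5 },
  { model: "Devstral Small 2", score: 17 },
  { model: "Qwen3-VL-30B", score: 16 },
  { model: "Qwen3-30B", score: 15.5 },
];

// Percentage of the maximum score, rounded to one decimal place.
function percentOfMax(score: number): number {
  return Math.round((score / MAX_SCORE) * 1000) / 10;
}

// Sort a copy from highest to lowest score and attach the percentage.
const ranked = results
  .slice()
  .sort((a, b) => b.score - a.score)
  .map((r) => ({ model: r.model, score: r.score, percent: percentOfMax(r.score) }));

for (const r of ranked) {
  console.log(r.model + ": " + r.score + "/" + MAX_SCORE + " (" + r.percent + "%)");
}
```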

The full evaluation, with detailed scoring for all 8 models, is in the carousel.

Have you tried local LLMs for database design or architecture tasks? What’s your experience?

#ClaudeCode #Ollama #LocalLLM #AIEngineering #VibeCoding #DatabaseDesign
