Introduction
My suitcase was on the bed, half-packed, when I decided to fix the blog navigation.
This was, in hindsight, not the ideal time to undertake a significant UX overhaul. But the blog has grown considerably over the past year — the series on AI tools alone runs to twelve posts, the Python basics series to six — and the tag list at the top of every /blog/ and /tag/ page had quietly become a wall of text. Dozens of pills, no grouping, no search, no sense of priority. Perfectly functional if you already knew what you were looking for. Not very useful if you did not.
I had a flight to catch. I also had two AI coding assistants I had been meaning to compare in a real task rather than a contrived benchmark. The navigation overhaul became the experiment.
This post is an honest account of how that went: what each assistant did well, where each one frustrated me, and what the final result looks like. The suitcase is zipped now. The navigation looks great on mobile.
What Needed Fixing
Before getting into the assistants, it helps to understand the starting point.
The existing system had a flat tag list — every tag rendered as a pill at the top of the blog and tag pages, with no grouping, no hierarchy, and no way to search or filter. For a blog with ten posts and eight tags, this is fine. For a blog with sixty-plus posts across four series and twenty-odd tags, it starts to feel like reading a phone book.
What I wanted was roughly:
- Tags grouped by parent topic (Python, AI, Series, Tools) rather than a flat alphabetical pile
- Filtering that actually worked — click a parent topic, see only posts in that topic
- A search box scoped to the current filtered view
- Pagination that behaved correctly when switching between tags
- Some way to discover related content without already knowing what to search for
The last point I had only a vague sense of. I knew I wanted content discovery. I did not know exactly what form it should take.
Act 1: Codex with GPT-5.3-Codex Medium
I opened Codex and described what I wanted. The planning phase was genuinely impressive.
Codex understood the architecture immediately. Within the first exchange it had mapped out a complete system: a _data/tag_taxonomy.yml file to define parent/child tag relationships, a page_tags.html include for the blog navigation UI, a search_paginated_wrapped.html include as a single source of truth for the filtering and search engine, a tag_paginated layout for individual tag pages, and query-based pagination using ?page=N rather than /page/N. It even thought through the row metadata approach — adding data-tags="tag1,tag2,..." attributes to post table rows so the JavaScript filter could operate on them without extra API calls.
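For concreteness, here is a minimal sketch of what that row metadata approach looks like in practice. Only the data-tags attribute and the file names come from the system itself; the markup, the helper, and the child tag slugs below are my illustration of the idea, not the actual code.

```js
// Each post row in the rendered table carries its tags inline, e.g.
//   <tr data-tags="python,pandas">...</tr>
// so client-side filtering needs no extra requests. The mapping below is
// conceptually what _data/tag_taxonomy.yml defines (entries invented here).
const PARENT_TAGS = { python: ['python', 'pandas'], ai: ['ai', 'llm'] };

function filterRows(parentTopic) {
  const children = PARENT_TAGS[parentTopic] || [];
  document.querySelectorAll('tr[data-tags]').forEach((row) => {
    const tags = row.dataset.tags.split(',');
    row.hidden = !tags.some((t) => children.includes(t));
  });
}
```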
This is exactly the kind of work where a capable AI assistant earns its keep. I could have designed that architecture myself. It would have taken considerably longer, and I probably would have made structural decisions I would later regret. Codex did it in minutes, with clear reasoning for each choice.
The implementation, though, was where things started to slip.
Tag filtering was the first problem. When I clicked a parent topic pill — say, “Python” — the pill highlighted to acknowledge the selection, but the post list did not change. All posts remained visible. The click event was firing, the URL was updating to /blog/?parent=python, but the JavaScript reading that URL state and filtering the rows was not synchronising with the event emitted by page_tags.html. The two components were talking past each other.
Pagination compounded this. When I selected a tag and then clicked to page 2 of the results, the pagination rendered against the full unfiltered post count rather than the filtered subset. So page 2 of “Python posts” might show posts with nothing to do with Python, because the pagination had lost track of what was currently filtered.
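Reduced to its essence, the failure mode looked like this. An illustrative sketch, not the actual code — the rows and helpers are invented stand-ins:

```js
const PER_PAGE = 10;

// Hypothetical stand-ins for the real post rows and helpers.
const allRows = [
  { title: 'Pandas basics', tags: ['python', 'pandas'] },
  { title: 'Prompting notes', tags: ['ai'] },
  // ...imagine sixty more
];
const paginate = (rows, page) => rows.slice((page - 1) * PER_PAGE, page * PER_PAGE);
const isPython = (row) => row.tags.includes('python');

// Buggy order: page boundaries computed from ALL rows, filter applied after,
// so "page 2 of Python" is really page 2 of everything, minus non-matches.
const buggyPage2 = paginate(allRows, 2).filter(isPython);

// Correct order: filter first, then paginate the filtered subset.
const fixedPage2 = paginate(allRows.filter(isPython), 2);
```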
What followed was a series of commits that became increasingly frustrating. Each fix addressed one symptom while introducing a slight regression somewhere else. We fixed the filter synchronisation and the pagination broke differently. We fixed the pagination and the breadcrumb stopped rendering correctly on filtered views. I made many more commits than I care to count, and while each moved incrementally in the right direction, the cumulative effect was a codebase I no longer entirely trusted.
I want to be clear: Codex was not producing bad code. The individual pieces were well-reasoned and readable. The problem was in the interactions between components — the event flow, the state management across files that needed to agree on what was currently selected. These are exactly the kinds of bugs that are hardest to diagnose through natural language description, because the symptom and the cause are often in different files and different execution contexts.
After enough commits and enough patience, I made a decision. I asked Codex to describe the complete implemented system in precise technical detail — every component, every data flow, every file. The prompt was:
Please describe the tag, search and blog pagination system we have realised.

The description it produced was thorough and accurate.

Thinking about it now, the whole of Act 1 felt like working with a very talented junior engineer. The architecture was sound, the code was clean, the brief was understood immediately. The mistakes were not careless — they were the specific kind that come from not yet having the instinct to anticipate how components fail when they interact under real conditions. Each piece worked; the integration was where things came apart. That is a recognisable pattern to anyone who has reviewed a junior’s pull request.

The speed, though, was nothing like any junior I have met. A human engineer would have taken days on that architecture. Codex took minutes. So perhaps more precisely: a very fast junior with excellent theoretical knowledge and limited production debugging experience — exactly the kind of collaborator you want on the first day of a project.

I copied the description, closed Codex, and opened Antigravity.
The Handoff Document
This is the piece of the workflow I want to highlight, because I think it is more generally useful than the specific tool comparison.
Before switching assistants, I had Codex produce a complete technical description of the existing implementation. Here is a condensed version of what it captured:
- _data/tag_taxonomy.yml defines parent/child tag relationships — parents are topic groups, children are real tag slugs
- /blog/ navigation UI rendered by _includes/page_tags.html, showing parent topic pills (foldable), child tag pills, and a Browse by Tag link
- Public tag browser at tag/index.md, grouped by parent taxonomy with an “Other Tags” section
- All tag/*.md pages use the tag_paginated layout, with breadcrumbs, scoped search, and pagination
- Pagination at 10 posts per page using ?page=N query parameters
- Single filtering and search engine in _includes/search_paginated_wrapped.html, reading URL state (?parent=, ?tag=, &page=N), filtering rows first, then paginating the filtered subset, then applying search
- Post table rows carry data-tags="tag1,tag2,..." metadata for client-side filtering
- Non-taxonomy tags normalised to taxonomy-compatible slugs for consistent filtering
This document — generated by Codex about its own work — became the foundation for everything that followed. Without it, handing the project to a different assistant would have meant re-explaining the entire architecture from scratch, probably imprecisely. With it, the context was complete and accurate from the first message.
Act 2: Antigravity with Gemini 3.1 Pro High
I pasted the full system description into Antigravity and asked it to identify areas for improvement, prioritised from highest to lowest. Crucially, I added two constraints: do not implement yet, analyse — and before that, confirm you have understood the task and ask questions if needed.
That second instruction is one I have started adding to almost every significant prompt, and I think it is underrated. It forces the AI to surface ambiguities before they become bugs. In this case, Antigravity confirmed its understanding of the component boundaries, asked one clarifying question about whether the tag normalisation should be treated as fixed or open to revision, and only then proceeded to the analysis. A small exchange — but it meant we were genuinely aligned before any code changed. A senior engineer does the same thing in a kickoff: they repeat the brief back, they ask the one question that heads off the problem that would otherwise surface two days later, and then they start.
This prompt discipline matters for a deeper reason too. Separating analysis from implementation means you get a plan you can review and push back on, rather than code you have to reverse-engineer to understand whether it is doing what you intended. Asking for understanding confirmation before even the analysis adds another layer: you catch misreadings of the brief before they shape the plan.
Gemini 3.1 Pro’s analysis was immediate and precise. It correctly identified the filter synchronisation problem — the gap between the event emitted by page_tags.html and the state read by search_paginated_wrapped.html — and explained the root cause more clearly than I had managed to articulate myself. It identified the pagination scoping issue as a consequence of pagination being applied before filtering rather than after. It had read the architecture description carefully and understood which component owned which part of the state.
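To make that diagnosis concrete: the corrected engine owns the whole pipeline in one place, reading URL state, filtering, and only then paginating. The sketch below is my reconstruction under those constraints — the event name and most identifiers are invented, and the search step that runs last in the real engine is omitted for brevity:

```js
// Single source of truth, conceptually what search_paginated_wrapped.html does.
const PER_PAGE = 10;
const PARENT_TAGS = { python: ['python', 'pandas'] }; // really built from _data/tag_taxonomy.yml

function renderFromUrlState() {
  const params = new URLSearchParams(window.location.search);
  const parent = params.get('parent');            // e.g. /blog/?parent=python
  const tag = params.get('tag');
  const page = parseInt(params.get('page') || '1', 10);

  const rows = [...document.querySelectorAll('tr[data-tags]')];

  // 1. Filter first.
  const filtered = rows.filter((row) => {
    const tags = row.dataset.tags.split(',');
    if (tag && !tags.includes(tag)) return false;
    if (parent && !tags.some((t) => (PARENT_TAGS[parent] || []).includes(t))) return false;
    return true;
  });

  // 2. Then paginate the filtered subset.
  const start = (page - 1) * PER_PAGE;
  const visible = new Set(filtered.slice(start, start + PER_PAGE));
  rows.forEach((row) => { row.hidden = !visible.has(row); });
}

// The navigation UI never filters anything itself: it updates the URL and
// announces the change, so both components read the same state.
document.addEventListener('filters-changed', renderFromUrlState); // hypothetical event name
window.addEventListener('popstate', renderFromUrlState);
```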
Then it proposed features I had not asked for.
The Popularity and A-Z sorting buttons for the tag browser were a natural extension — I had a tag browser, so sorting options made obvious sense. I had simply not thought to ask.
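The sorting itself is the easy part — under the hood it can hardly be more than a pair of comparators. An illustrative sketch, with the tag shape assumed:

```js
// Assumed tag shape: { name: 'pandas', count: 7 }
const byPopularity = (a, b) => b.count - a.count || a.name.localeCompare(b.name);
const byName = (a, b) => a.name.localeCompare(b.name);

const sortTags = (tags, mode) =>
  [...tags].sort(mode === 'popularity' ? byPopularity : byName);
```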
The Grid layout for categories replaced the linear grouped list with a responsive dashboard — topic categories as cards rather than sections in a long scrolling page. More scannable, more compact, noticeably better on mobile.
The Live tag search box in the tag browser — separate from the post search box — lets you type to filter the tag browser itself. Useful when the tag list grows long enough that you know roughly what you are looking for but do not want to scroll to find it.
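Again, a small amount of client-side code does the work; something along these lines, with invented selector names:

```js
// Live tag search: hide tag cards that do not match as the user types.
const tagSearch = document.querySelector('#tag-search');      // hypothetical id
tagSearch?.addEventListener('input', () => {
  const q = tagSearch.value.trim().toLowerCase();
  document.querySelectorAll('.tag-card').forEach((card) => {  // hypothetical class
    card.hidden = q !== '' && !card.textContent.toLowerCase().includes(q);
  });
});
```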
And then the proposal that genuinely surprised me: a Related Tags suggestion engine.
Gemini’s description was elegant: “The engine extracts all tags from every post on the page and tallies them up. It creates a temporary leaderboard of which tags appear most often alongside the current tag.” No separate data file. No precomputed index. A live co-occurrence analysis across the posts visible on the current tag page, surfaced as a “you might also be interested in” section.
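The whole engine fits in a dozen lines. A minimal sketch of that tallying, assuming the data-tags row metadata described earlier (the function name is mine):

```js
// Related Tags: tally how often other tags co-occur with the current one
// across the posts visible on this tag page. No precomputed index needed.
function relatedTags(currentTag, limit = 5) {
  const counts = new Map();
  document.querySelectorAll('tr[data-tags]').forEach((row) => {
    row.dataset.tags.split(',').forEach((tag) => {
      if (tag !== currentTag) counts.set(tag, (counts.get(tag) || 0) + 1);
    });
  });
  // The temporary leaderboard: most frequent co-occurring tags first.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([tag]) => tag);
}
```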
I had a vague sense that I wanted content discovery. Gemini noticed the gap and proposed a concrete, architecturally coherent solution that required no new data infrastructure. That kind of unprompted insight — seeing what is missing from a system and proposing something that fits naturally — is the thing I find most valuable in a high-capability model.
If Codex was the junior engineer, this was the senior colleague’s move. Not just fixing the pull request, but saying “while I am here, have you considered doing it this way?” The Related Tags engine was never in the brief. It was a senior noticing an opportunity the junior had not seen — not because it was beyond the brief, but because experience lets you see the shape of what is missing. That, more than any benchmark score, is what separates capability tiers in practice.
The implementation phase was different from the Codex experience. Not dramatically faster in total elapsed time — it was still a real codebase with real component interactions to get right — but the iteration loop was tighter. When something did not work, the explanation of why was accurate, and the fix addressed the actual cause. The number of commits that had to be redone was far smaller.
The Result
If you visit /blog/ or any /tag/ page now, the difference is visible.
The tag browser at /tag/ is a grid dashboard. Topics appear as cards with post counts. A live search box filters the grid as you type. Popularity and A-Z sort buttons reorder the results. Hovering a tag shows its definition. Individual tag pages show a Related Tags section that surfaces co-occurring tags from the current page’s posts — no manual curation required.
The blog page at /blog/ has parent topic pills that fold and unfold, a highlighted active state for the selected parent, dimmed pills for the others, and a search box scoped to the currently filtered view. The URL reflects the filter state — /blog/?parent=python — so filtered views can be bookmarked and shared. Pagination correctly scopes to the filtered subset: page 2 of a Python filter shows posts 11–14 of the Python set, not posts 11–14 of everything.
It is, honestly, the navigation system I should have built two years ago.
What the Experiment Actually Tells Us
The junior/senior analogy is the most useful frame I have found for this experience, and I want to stay with it for a moment because I think it generalises beyond this specific experiment.
A junior engineer is not a bad engineer. They understand the brief, they write clean code, they get the architecture right. Their mistakes are not about effort or intelligence — they are about the specific experience of having seen a particular class of failure before. The filter event and the pagination state failing to synchronise is exactly the kind of bug a junior produces: each piece is correct, the integration is where it falls apart. You cannot learn to anticipate that from reading documentation. You learn it from having shipped it wrong.
A senior colleague fixes it differently. They do not start by reading the code — they start by forming a hypothesis about where the failure is likely to be. They go straight to the event flow, the state boundary, the timing. And when they are done, they do not just leave. They notice what else could be better. The handoff from a junior pull request to a senior review is not just a quality gate — it is where the system gets improved in ways that were not in the original brief.
Gemini 3.1 Pro High did exactly this. It read the architecture description, identified the root causes accurately, fixed them, and then proposed the Related Tags engine. That last part — the unprompted proposal — is the senior move. It requires understanding not just what was asked for but what the system needs, which is a different and harder question.
I want to resist framing this as “Codex lost, Antigravity won.” That is not what happened. The junior engineer delivered real value — the architecture Codex produced was sound, and without it there would have been nothing for the senior to refine. The handoff document that made Act 2 work was itself generated by Codex. That is not a consolation prize; it is a genuine contribution to the outcome.
The right framing is: different capability levels suit different phases of the work. Use the junior for the first pass, the architecture, the scaffolding. Use the senior for the integration debugging, the root cause analysis, the improvements you did not think to ask for. And document the handoff properly — the same way you would in a real engineering team.
The broader lesson for anyone building with AI tools: the capability tier of the model matters most precisely in the situations where you are most tempted to assume it does not. When everything is going well, a capable junior and a senior produce similar results. When things start to interact in unexpected ways — when the symptom and the cause are in different files, when the fix for one thing breaks another — that is when the senior earns their place.
Conclusion
The blog is easier to navigate now. That is the practical outcome.
The more interesting outcome is the workflow pattern itself — architecture with one assistant, precise documentation of the result, refinement and extension with a higher-capability model. Think of it as staffing a project correctly rather than picking the best single tool: a fast, capable junior for the first pass and scaffolding, a senior for the integration debugging and the improvements nobody thought to put in the brief. The handoff document is the standup where they sync.
If you spot anything in the navigation that still misbehaves, or if the Related Tags engine surfaces something unexpected, let me know. It runs on live co-occurrence data and will only get more accurate as the post count grows.
And one last thing: if you are trying to fix blog navigation while packing for a trip, budget more time than you think you need. The suitcase was packed. My travel companion had been ready for considerably longer.