Sopheak

Apr 2026 • Build Log #009

Khmer Web Search: what we have been doing, and what is still getting in the way

A more honest update on the work behind Khmer Web Search, the problems that keep slowing it down, and why better search is still a systems problem, not a single ranking tweak.

Khmer Web Search is live now, which is good, but being live also removes the luxury of pretending the hard parts are somewhere in the future.

The project is real enough that the problems are no longer abstract. People can search. The homepage is populated. New content is flowing through. But the gap between "the system works" and "the system feels trustworthy and useful" is still very real.

So this is the less polished version of the update. Not just that things are moving, but what we have actually been doing, where the friction really is, and why this is taking longer than a simple crawl-and-index story would suggest.

What we have actually been doing

Most of the recent work has not been flashy. It has been the kind of work search products need if they are going to stop lying to themselves.

One part has been tightening the pipeline between crawling, extraction, classification, persistence, indexing, and the product surfaces that people actually see. That includes making the ingestion path less fragile, expanding source support, and fixing the hidden failure points where pages existed in the raw crawl but never became durable, usable articles.

Another part has been building a more practical source-onboarding workflow. We now have a real AI-assisted blueprint flow for new sources: compile a draft, inspect it, promote what looks good, run targeted processing, and then check whether the homepage and search output actually improve. That is a much better loop than manually poking around every new source from scratch.

We have also been fixing source-specific extraction issues one by one. Published dates have been a recurring problem: some sources expose a day but no time, and some expose misleading timestamps. Other sources appear fresh but are really surfacing category pages, tag pages, or thin pages that should never have earned that placement in the first place.
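One way to stop low-precision dates from quietly distorting freshness is to record the precision alongside the date itself. This is only an illustrative sketch, not the project's actual code: the names (`DatePrecision`, `parse_published`) and the two example formats are invented here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class DatePrecision(Enum):
    """How much of the published timestamp the source actually exposed."""
    FULL = "full"        # date and time
    DAY = "day"          # date only, time unknown
    UNKNOWN = "unknown"  # no trustworthy date at all

@dataclass
class PublishedDate:
    value: Optional[datetime]
    precision: DatePrecision

def parse_published(raw: Optional[str]) -> PublishedDate:
    """Parse a raw timestamp string, recording how precise it really is."""
    if not raw:
        return PublishedDate(None, DatePrecision.UNKNOWN)
    for fmt, precision in [
        ("%Y-%m-%dT%H:%M:%S%z", DatePrecision.FULL),
        ("%Y-%m-%d", DatePrecision.DAY),
    ]:
        try:
            dt = datetime.strptime(raw, fmt)
            if dt.tzinfo is None:
                dt = dt.replace(tzinfo=timezone.utc)
            return PublishedDate(dt, precision)
        except ValueError:
            continue
    return PublishedDate(None, DatePrecision.UNKNOWN)
```

The point of the explicit precision tag is that a ranking layer can then choose to trust full timestamps for fine-grained freshness while treating day-only dates more coarsely, instead of pretending every date is equally precise.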

On top of that, there has been product-path cleanup: homepage fixes, source-diversity caps so one source does not dominate the whole surface, tighter recent-news behavior, smaller query windows to avoid timeouts, and more deliberate evaluation instead of judging quality by vague impressions.
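A source-diversity cap of the kind mentioned above is a small algorithm in itself. The sketch below is a hypothetical version, assuming the surface is built from an already-ranked list; the function name and the pair shape are mine, not the product's.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def apply_source_cap(
    ranked: Iterable[Tuple[str, str]],
    max_per_source: int = 2,
) -> List[Tuple[str, str]]:
    """Walk a ranked list of (source, title) pairs in order and drop
    items once a source has used up its slots, so no single source
    can dominate the surface."""
    used = Counter()
    kept = []
    for source, title in ranked:
        if used[source] < max_per_source:
            used[source] += 1
            kept.append((source, title))
    return kept
```

Because the walk preserves the incoming order, the cap only demotes a source's surplus items; it never reorders the results that survive.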

The first big obstacle: quality is spread across the whole system

A search engine teaches humility fast.

The obvious fantasy is that the hard part is ranking. Just tune the search layer, maybe add a better heuristic, and the product gets noticeably better. In reality, search quality is distributed across the entire stack.

If the crawler brings in weak pages, ranking suffers. If extraction fails, good pages become weak documents. If published dates are wrong, freshness gets distorted. If classification is too strict, relevant content never makes it into the article set. If classification is too loose, noise leaks through. If the snippets are weak, even decent results feel less convincing than they are.

That means there is no single fix. It is a compounding game of removing friction and distortion layer by layer.

The second obstacle: hidden architecture friction

One lesson from the recent work is that some of the hardest blockers were not visible from the outside at all.

For example, it turned out that promoting source blueprints in the database was useful for discovery and inspection, but not enough to change live runtime behavior by itself. Real classification still depended on in-repo runtime logic and allowlists. So a source could look promoted and "supported" in one layer, while still being quietly ignored by the actual runtime path that mattered.
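That kind of mismatch is cheap to detect once you look for it: diff the two layers directly. This is a hypothetical sketch, assuming the promoted sources and the runtime allowlist can each be read out as a set of source identifiers; nothing here is the project's real schema.

```python
from typing import Dict, List, Set

def find_promotion_gaps(
    promoted_in_db: Set[str],
    runtime_allowlist: Set[str],
) -> Dict[str, List[str]]:
    """Compare what the database says is promoted against what the
    runtime classification path will actually accept, so a source
    cannot look 'supported' in one layer while being ignored by the
    layer that matters."""
    return {
        "promoted_but_ignored": sorted(promoted_in_db - runtime_allowlist),
        "allowed_but_unpromoted": sorted(runtime_allowlist - promoted_in_db),
    }
```

Run as a periodic check or CI step, a diff like this turns a silent architecture mismatch into a visible list.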

That kind of mismatch is expensive. It creates false confidence. It makes a source look fixed before the product benefits from the fix. And it slows down iteration because the problem is not where you first think it is.

We have been cleaning that up, but it is exactly the kind of systems debt that makes progress feel slower from the outside than it really is.

The third obstacle: source quality is inconsistent and messy

Khmer Web Search is not indexing one clean corpus. It is dealing with a web made of different publishing habits, different structures, inconsistent metadata, and varying content quality.

Some sources behave well end to end. Some work but only expose low-precision dates. Some have useful pages but weak extraction quality. Some need better source-specific logic before they stop leaking noise into the system. Some likely need to be deprioritized or treated with more skepticism until the quality bar improves.

This matters because product quality is downstream of source quality. If one source floods the homepage, or if date precision is weak, or if the system keeps mistaking a thin page for a fresh article, the user does not experience that as a pipeline issue. They experience it as bad search.

The fourth obstacle: evaluation has to become more honest

Another thing we have been doing is trying to stop grading the product by vibes.

We started defining more representative Khmer query benchmarks so improvements can be measured against real searches instead of whatever happened to look good in one lucky session. That is important because search can fool you. A few good queries create false confidence. A few bad ones create false despair. Neither is enough.

The right question is whether the system is becoming more useful on a small but representative set of Khmer queries: broad queries, phrase queries, entity queries, navigational queries, and the kinds of ambiguous searches where intent is only partly visible from the words themselves.
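A benchmark over those query kinds does not need to be elaborate to be useful. The harness below is a minimal sketch under my own assumptions (hand-labeled relevant URLs per query, a success-at-k metric); it is not the project's actual evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

@dataclass
class BenchQuery:
    text: str
    kind: str               # e.g. "broad", "phrase", "entity", "navigational", "ambiguous"
    relevant_urls: Set[str] # hand-labeled URLs judged relevant for this query

def success_at_k(
    queries: List[BenchQuery],
    search: Callable[[str], List[str]],
    k: int = 5,
) -> Dict[str, float]:
    """For each query kind, the fraction of queries where at least one
    labeled-relevant URL appears in the top k results."""
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for q in queries:
        totals[q.kind] = totals.get(q.kind, 0) + 1
        top = search(q.text)[:k]
        if any(url in q.relevant_urls for url in top):
            hits[q.kind] = hits.get(q.kind, 0) + 1
    return {kind: hits.get(kind, 0) / totals[kind] for kind in totals}
```

Reporting per-kind scores matters: an average over all queries can hide a collapse on, say, broad Khmer queries behind strong navigational performance.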

That work is still early, but it is necessary. Without it, every change risks becoming performance theater.

The fifth obstacle: reliability still matters more than cleverness

Some of the recent trouble has been less about intelligence and more about basic product reliability.

We hit homepage and API issues where certain endpoints could time out or return 502s because the query window was too wide, making the underlying queries too heavy. That kind of problem is boring, but it matters. A search product does not get to feel "almost good" if the obvious surfaces are empty, slow, or inconsistent.
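One common way to keep the window small without risking empty surfaces is to widen it only when a narrow window comes back too thin. This is a sketch of that pattern, not the actual fix; `fetch_window` is a stand-in for whatever query function the real endpoint uses.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional, Tuple

def fetch_recent(
    fetch_window: Callable[[datetime], List[object]],
    now: Optional[datetime] = None,
    window_hours: Tuple[int, ...] = (24, 72, 168),
    min_results: int = 10,
) -> List[object]:
    """Try progressively wider time windows, starting narrow so the
    common case stays cheap, and only widening when a window returns
    too few results to fill the surface."""
    now = now or datetime.now(timezone.utc)
    results: List[object] = []
    for hours in window_hours:
        results = fetch_window(now - timedelta(hours=hours))
        if len(results) >= min_results:
            break
    return results
```

The trade-off is an occasional extra query on quiet days in exchange for the hot path almost never scanning a week of data at once.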

So part of the work has simply been making the product behave more reliably under its current architecture. Less breakage. Fewer empty states. Fewer cases where the product technically has the data but fails to present it well.

What the real bottleneck is now

The main bottleneck now is still search relevance and trust.

Not because nothing else matters, but because that is the layer users judge first and hardest. If the top results do not feel sensible, if freshness looks suspicious, if one source dominates too often, or if broad Khmer queries still feel weak, the product does not earn confidence yet.

That is the bar now. Not just more data. Not just more features. More trust in the result set.

What happens next

The next phase of work is fairly clear.

Keep tightening source coverage and extraction quality. Keep improving date and freshness handling. Keep using a representative Khmer query benchmark instead of relying on intuition. Keep removing the hidden architecture mismatches that slow down real iteration. And keep pushing the product toward results that feel more useful, more trustworthy, and less accidental.

This is slower than shipping a flashy demo, but it is the real work. Khmer Web Search does not become good because it exists. It becomes good if the whole loop, from source to search result, gets more honest and more disciplined over time.

Still worth doing

Even with the friction, I still think this matters.

Better Khmer search is not just a nicer interface. It is infrastructure for discovery, access, and usefulness across a web that still feels harder to navigate than it should. The current stage is messy, but at least the mess is real now. That is better than a polished illusion.

You can try the product at khmerwebsearch.com.