Building Khrawler: the crawler for KhmerWebSearch.com

If KhmerWebSearch.com is the search engine you see, Khrawler is the machine room underneath it.

The crawler exists because better search does not start at ranking. It starts with what the system can acquire, what it can trust, and what it can keep out of the index in the first place.

That sounds obvious. In practice, it is where a lot of the pain lives.

Why a dedicated crawler was necessary

The Khmer web is not one clean corpus with neat metadata and predictable publishing habits. Sources differ wildly in structure, timestamps, page quality, navigation noise, canonical URL behavior, and how much of the useful content is actually easy to extract.

That means crawling is not just fetch-and-store work. It is judgment work. Which domains deserve support? Which URL patterns are real articles versus dead weight? Which pages are canonical? Which ones are category pages, tag pages, auth surfaces, or junk that should never reach search?
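To make that judgment concrete, here is a minimal sketch of path-based URL classification. It is an illustration, not Khrawler's actual rule set: the function name classify_url, the labels, and every pattern below are assumptions, and real rules would be tuned per source.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns for illustration only; real sources need their own rules.
BLOCKED_PATTERNS = {
    "category": re.compile(r"^/(category|categories)/", re.I),
    "tag": re.compile(r"^/(tag|tags)/", re.I),
    "auth": re.compile(r"^/(login|logout|register|account)(/|$)", re.I),
}

# Assumed article shapes: /YYYY/MM/slug or /news/<id>, /article/<id>, /articles/<id>.
ARTICLE_PATTERN = re.compile(
    r"^/(\d{4}/\d{2}/[^/]+|(news|article)s?/[^/]+)/?$", re.I
)


def classify_url(url: str) -> str:
    """Return a coarse label for a URL based only on its path."""
    path = urlsplit(url).path or "/"
    # Blocked surfaces are checked first so they can never be mistaken for articles.
    for label, pattern in BLOCKED_PATTERNS.items():
        if pattern.search(path):
            return label
    if ARTICLE_PATTERN.search(path):
        return "article"
    # Anything unrecognized is left for a human or a source-specific rule to decide.
    return "unknown"
```

Checking blocked surfaces before article patterns is a deliberate choice in this sketch: a false "article" pollutes the index, while a false "unknown" only delays a page.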

Khrawler had to become opinionated enough to answer those questions instead of pretending the downstream stack could clean up every mess later.

What Khrawler is responsible for

Khrawler handles raw acquisition and source discovery. It is the layer that keeps track of where content comes from, what should be revisited, and how new pages enter the system.

Over time the job expanded beyond simple crawling. The crawler needs to respect source rules, handle domain-specific quirks, and feed the rest of the stack enough metadata that later stages can make better decisions.

In other words: the crawler is not only gathering pages. It is shaping the quality ceiling of everything that comes after.

The hard part was not volume

A bigger page count is easy to celebrate and mostly useless by itself.

The real difficulty was making crawling disciplined: source onboarding rules, URL-pattern handling, blocked surfaces, cleaner canonical behavior, and enough source-specific logic that the system stops acting surprised by the same problems over and over.

That is also why source support became more deliberate. New domains need usable extraction paths and indexing rules, not just a checkbox that says they were seen once.
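One way to picture "more than a checkbox" is a per-source record that has to be complete before a domain counts as supported. This is a sketch under assumptions: SourceConfig, its fields, and onboarding_gaps are hypothetical names, not Khrawler's real schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceConfig:
    """Hypothetical onboarding record for one domain."""
    domain: str
    article_url_patterns: List[str] = field(default_factory=list)
    blocked_path_prefixes: List[str] = field(default_factory=list)
    content_selector: Optional[str] = None  # assumed: where the article body lives
    index_enabled: bool = False


def onboarding_gaps(cfg: SourceConfig) -> List[str]:
    """List what is still missing before the source should feed the index."""
    gaps = []
    if not cfg.article_url_patterns:
        gaps.append("no article URL patterns")
    if cfg.content_selector is None:
        gaps.append("no extraction path")
    if not cfg.index_enabled:
        gaps.append("indexing not enabled")
    return gaps
```

A domain that was merely seen once would report all three gaps; a supported one reports none.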

Why crawling quality leaks into search quality

Search is downstream of crawling whether the product admits it or not.

If the crawler brings in weak pages, ranking suffers. If the crawler misses good sources, search coverage suffers. If canonical pages are confused with variants or thin surfaces, the index gets polluted. If timestamps or source metadata are weak, freshness gets distorted before the search engine ever scores a query.
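The canonical-versus-variant problem is the easiest of these to show in code. The sketch below collapses common URL variants into one key using only the Python standard library. The function name normalize_url and the list of tracking parameters are assumptions, and this heuristic is not a substitute for whatever canonical signal a page declares itself.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend per source as needed.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Collapse trivial variants of a URL into a single comparable form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Treat /path and /path/ as the same page; keep "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters, then sort so parameter order does not matter.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(pairs))
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))
```

Two fetched URLs that normalize to the same string can then be treated as one candidate page instead of two index entries.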

That is why Khrawler became a real subsystem instead of a disposable pre-step.

What building it changed

Building Khrawler changed how I think about the project.

It made the whole stack more concrete: acquisition, preparation, indexing, API, and product surfaces all needed to be treated as connected layers, not isolated hacks. It also made it obvious that better Khmer search would be a compounding systems problem, not a one-time build.

Once Khrawler was doing real work, the next step was to turn that raw acquisition into a product stack that people could actually use.