The crawler is a hybrid: it uses asynchronous Python requests and Puppeteer with uBlock Origin. Detection works by counting the number of requests that uBO blocks on a page; if there are too many (the threshold is set to 5), the page is dropped, leaving only "clean" pages in the index.
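The filtering rule described above can be sketched in a few lines. This is only an illustration, not the crawler's actual code: `PageStats`, `is_clean`, and `filter_clean` are hypothetical names, and the sketch assumes that "too many" means strictly more than the threshold of 5.

```python
from dataclasses import dataclass

# Threshold value stated in the text.
BLOCKED_THRESHOLD = 5


@dataclass
class PageStats:
    """Hypothetical record of one crawled page."""
    url: str
    blocked_requests: int  # number of requests uBlock Origin blocked on the page


def is_clean(page: PageStats, threshold: int = BLOCKED_THRESHOLD) -> bool:
    # Assumption: a page is rejected only when it exceeds the threshold.
    return page.blocked_requests <= threshold


def filter_clean(pages: list[PageStats]) -> list[PageStats]:
    """Keep only the pages that pass the blocked-request check."""
    return [page for page in pages if is_clean(page)]
```

Keeping the threshold as a parameter makes it easy to experiment with a stricter or looser cutoff without touching the filter itself.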
The crawler is also unique in that it will follow an interesting dead link to its Internet Archive page, doing its best to preserve the page in our index (you will see those results under the "Internet Archive" section).
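As a rough sketch of the dead-link fallback, the snippet below rewrites a dead URL to its Wayback Machine address. The `https://web.archive.org/web/` prefix is the Internet Archive's public URL scheme; everything else (`wayback_url`, `resolve_link`, and treating only 404 and 410 as "dead") is an assumption made for illustration, since the text does not say how the crawler decides a link is dead.

```python
WAYBACK_PREFIX = "https://web.archive.org/web/"

# Assumption: only these HTTP statuses count as a dead link.
DEAD_STATUSES = {404, 410}


def wayback_url(url: str, timestamp: str = "") -> str:
    """Build a Wayback Machine URL for a page.

    With a timestamp (YYYYMMDDhhmmss) the archive serves the capture
    closest to that time; without one it serves the most recent capture.
    """
    if timestamp:
        return f"{WAYBACK_PREFIX}{timestamp}/{url}"
    return f"{WAYBACK_PREFIX}{url}"


def resolve_link(url: str, status_code: int) -> str:
    """Return the original URL if it is alive, else its archive URL."""
    if status_code in DEAD_STATUSES:
        return wayback_url(url)
    return url
```

For example, `resolve_link("https://example.com/gone", 404)` yields the archive address for that page, while a link that answers with 200 is returned unchanged.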
Content and semantic metadata are extracted using trafilatura and readability.js, while page language is detected using fastText.
To produce search results, rankings from full-text search (via Elasticsearch / Typesense) and NLP-based semantic search (using sentence transformers to produce embeddings and ScaNN to search through the vector space) are combined to produce the final ranking of results. Responses are served using FastAPI.
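The text does not say how the full-text and semantic rankings are combined. One common way to merge ranked lists is reciprocal rank fusion (RRF), sketched below purely as an illustration; the function name, the document IDs, and the constant `k = 60` are assumptions, not details of Teclis.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1); documents are returned by descending total score,
    with ties broken alphabetically by ID.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two search backends.
full_text_results = ["a", "b", "c"]
semantic_results = ["b", "d", "a"]
combined = reciprocal_rank_fusion([full_text_results, semantic_results])
```

RRF needs only rank positions, so it sidesteps the problem that full-text relevance scores and embedding similarities are on different scales.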
"Interesting Recently" section is provided by
TinyGem, a content recommendation
and bookmarking tool built using similar stack.
Teclis also uses results, with permission, from Marginalia Search (another noncommercial search engine).