The Teclis crawler is hybrid, using async Python requests and Puppeteer
with uBlock Origin. Detection works by counting the number of requests
that uBO blocks on the page; if there are too many (the threshold is set
to 5), we kick the page out, leaving only "clean" pages
in the index.
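As a minimal sketch of the filter described above, assuming the crawler has already collected a count of uBO-blocked requests per page: the function names and the input dictionary are hypothetical, and reading "too many" as strictly more than the threshold is an assumption.

```python
# Hypothetical sketch: keep only pages whose uBO-blocked request count
# does not exceed the threshold (5, as stated above).
BLOCKED_REQUEST_THRESHOLD = 5


def is_clean(blocked_requests: int, threshold: int = BLOCKED_REQUEST_THRESHOLD) -> bool:
    """Return True if the page stays in the index.

    Assumption: "too many" means strictly more than the threshold.
    """
    return blocked_requests <= threshold


def filter_clean_pages(blocked_counts: dict[str, int]) -> list[str]:
    """Given {url: blocked request count}, return the URLs to keep."""
    return [url for url, count in blocked_counts.items() if is_clean(count)]
```

For example, `filter_clean_pages({"a": 1, "b": 9})` keeps only `"a"`.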
Teclis is also unique in the sense that it will follow an interesting dead link to its Internet
Archive page, trying its best to preserve the page in our
index (you will see those results under the "Internet Archive" section).
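How the archived copy is located is not described here. One way to do it, shown only as a sketch, is to query the Internet Archive's public Wayback availability endpoint. Whether Teclis uses this endpoint is an assumption, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Public Wayback Machine availability endpoint (use by Teclis is an assumption).
WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(dead_url: str) -> str:
    """Build the availability query URL for a dead link."""
    return WAYBACK_API + "?" + urllib.parse.urlencode({"url": dead_url})


def snapshot_from_response(payload: dict) -> Optional[str]:
    """Extract the closest snapshot URL from the JSON payload, if one exists."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_snapshot(dead_url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the Wayback Machine for an archived copy (performs a network call)."""
    with urllib.request.urlopen(availability_url(dead_url), timeout=timeout) as resp:
        return snapshot_from_response(json.load(resp))
```

Keeping the response parsing separate from the network call makes the fallback logic easy to test offline.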
Content and semantic metadata are
extracted using trafilatura, and
page language is detected using fastText.
To produce search results, rankings from full-text search (via Elasticsearch)
and NLP-based semantic search (using Sentence Transformers
to produce embeddings and ScaNN
to search through vector space) are combined to produce the final
ranking of results. Responses are served using FastAPI.
The "Interesting Recently" section is provided by TinyGem,
a content recommendation
and bookmarking tool built using a similar stack.
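How the full-text and semantic rankings are combined into the final ranking is not detailed here. One common option is reciprocal rank fusion, shown below purely as an illustration and not necessarily what Teclis does. The function name, the document ids, and the constant `k = 60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1); documents are returned by descending total score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical inputs: one list from full-text search, one from vector search.
full_text_hits = ["a", "b", "c"]
semantic_hits = ["b", "d", "a"]
merged = reciprocal_rank_fusion([full_text_hits, semantic_hits])
```

Rank-based fusion sidesteps the fact that full-text relevance scores and vector distances are on different scales, since only each document's position in each list is used.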
Teclis also uses results with permission from Marginalia
(another noncommercial search engine).