Why Andrej Karpathy uses an SVM in 2026 (and you should too)
Hundreds of machine-learning papers land on arXiv every day. Reading them all is impossible, and missing something important hurts. Andrej Karpathy, former Director of AI at Tesla and co-creator of Stanford's CS231n course, solved this problem in an unexpected way. He picked neither BERT, nor GPT, nor some fashionable transformer. He settled on good old SVM, an algorithm that is several decades old. And you know what? It works so well that it is used even in academic systems. In this article we break down how his solution works, why the "primitive" approach beats complex neural networks, and when you too should choose an SVM over a transformer. Let's dig in!
https://habr.com/ru/articles/990386/
#SVM #Andrej_Karpathy #TFIDF #machine_learning #Support_Vector_Machine #neural_networks #classification_algorithms
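The recipe the article describes (TF-IDF features feeding a linear SVM) can be sketched in a few lines of scikit-learn. The toy corpus, labels, and parameters below are invented for illustration and are not taken from the article:

```python
# Sketch of the TF-IDF + linear SVM recipe for classifying paper abstracts.
# The tiny corpus and 0/1 labels here are made up, not from the article.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = [
    "attention is all you need transformers",
    "support vector machines for text classification",
    "convolutional networks for image recognition",
    "margin maximization in kernel methods",
]
labels = [0, 1, 0, 1]  # 0 = not interesting, 1 = interesting (invented)

vec = TfidfVectorizer(ngram_range=(1, 2))   # unigrams + bigrams
X = vec.fit_transform(docs)                 # sparse TF-IDF matrix
clf = LinearSVC(C=1.0).fit(X, labels)       # linear SVM on sparse features

# Score an unseen abstract; a positive decision value means class 1
new = vec.transform(["kernel methods and support vectors"])
print(clf.decision_function(new))
```

In practice you would rank thousands of new abstracts by their decision-function score and read the top of the list; retraining the whole thing takes seconds, which is exactly why the "primitive" approach is attractive here.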
🚀 I've finished building a standalone search engine in Java! It uses the TF-IDF and BM25 algorithms, with support for tokenization, stopword removal, and document ranking. Built entirely in Java 21, with no external libraries. The open-source version is on GitHub. Learn more about information retrieval and databases!
#SearchEngine #Java #TFIDF #BM25 #OpenSource #LearningProject #Search #JavaCore #Programming #NgoQuyet
https://www.reddit.com/r/opensource/comments/1o5wf1y/built_my_own_search_engine_from_scratc
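The post doesn't include code, so here is a plain-Python sketch of BM25 scoring as the formula is usually written; the Java engine's actual implementation may differ, and the k1/b values below are the common defaults, not ones taken from the post:

```python
import math
from collections import Counter

# Standard BM25 ranking function; k1 and b are the usual default parameters.
def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()                     # document frequency per term
    for d in tokenized:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if df[t] == 0:
                continue               # term absent from the corpus
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = ["the quick brown fox", "quick quick fox", "lazy dog sleeps"]
print(bm25_scores("quick fox", docs))
```

The key difference from plain TF-IDF is the saturation term in the denominator: repeating a query word many times in a document yields diminishing returns, and longer documents are penalized via the length normalization controlled by `b`.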
Recently I've combined various functions which I've been using in other projects (e.g. my personal PKM toolchain) and published them as new library https://thi.ng/text-analysis for better re-use:
- customizable, composable & extensible tokenization (transducer based)
- ngram generation
- Porter-stemming & stopword removal
- vocabulary (bi-directional index) creation
- dense & sparse multi-hot vector encoding/decoding
- histograms (incl. sorted versions)
- tf-idf (term frequency & inverse document frequency), multiple strategies
- k-means clustering (with k-means++ initialization & customizable distance metrics)
- similarity/distance functions (dense & sparse versions)
- central terms extraction
The attached code example (also in the project readme) uses this package to create a clustering of all ~210 #ThingUmbrella packages, based on their assigned tags/keywords...
The library is not intended to be a full-blown NLP solution, but I keep on finding myself running into these functions/concepts quite often, and maybe you'll find them useful too...
#Text #Analysis #Cluster #KMeans #TFIDF #Ngram #Vector #TypeScript #JavaScript
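As a rough illustration of the pipeline the post describes (vocabulary creation, multi-hot encoding, then k-means clustering of packages by their tags), here is a language-agnostic Python sketch. The package names and tags are invented, the seeding is a deterministic farthest-point stand-in for k-means++, and none of this is the thi.ng/text-analysis API:

```python
# Conceptual sketch: vocabulary -> multi-hot tag vectors -> k-means clusters.
# Package names and tags are invented; this is NOT the thi.ng API.
packages = {
    "pkg-color":  ["color", "conversion", "gradient"],
    "pkg-theme":  ["color", "theme", "gradient"],
    "pkg-vec":    ["vector", "math", "geometry"],
    "pkg-matrix": ["matrix", "math", "geometry"],
}

# bi-directional vocabulary: term -> index (and sorted index -> term)
vocab = sorted({t for tags in packages.values() for t in tags})
t2i = {t: i for i, t in enumerate(vocab)}

def multi_hot(tags):
    v = [0.0] * len(vocab)
    for t in tags:
        v[t2i[t]] = 1.0
    return v

def dist(a, b):  # squared euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vecs, k, iters=10):
    # farthest-point seeding: a simple deterministic stand-in for k-means++
    centroids = [vecs[0]]
    while len(centroids) < k:
        centroids.append(max(vecs, key=lambda v: min(dist(v, c) for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vecs:
            clusters[min(range(k), key=lambda i: dist(v, centroids[i]))].append(v)
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return [min(range(k), key=lambda i: dist(v, centroids[i])) for v in vecs]

names = list(packages)
assignment = kmeans([multi_hot(packages[n]) for n in names], k=2)
print(dict(zip(names, assignment)))
```

Packages sharing tags end up in the same cluster, which is the same idea the readme example applies to the ~210 real packages.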
Okay, back-of-the-napkin math:
- There are probably 100 million sites and 1.5 billion pages worth indexing in a #search engine
- It takes about 1TB to #index 30 million pages.
- We only care about text on a page.
I define a page as worth indexing if:
- It is not a FAANG site
- It has at least one referrer (no DD Web)
- It's active
So, this means we need 40TB of fast storage to make a good index for the internet. That's not "runs locally" sized, but it is nonprofit sized.
My size assumptions are basically as follows:
- #URL
- #TFIDF information
- Text #Embeddings
- Snippet
We can store a page's index entry in about 30kb. So, with 40TB we can store a full internet index. That's about $500 in storage.
Access time becomes a problem. TF-IDF for the whole internet can easily fit in RAM. Even with #quantized embeddings, you can only fit 2 million per GB of RAM.
Assuming you had enough RAM it could be fast: TF-IDF to get 100 million candidates, #FAISS to sort those, load snippets dynamically, potentially modify rank by referrers, etc.
Six 128 GB #Framework #desktops, each with a 5TB HD (plus one Raspberry Pi to sort the final candidates from the six machines), is enough to replace #Google. That's about $15k.
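The arithmetic above can be checked in a few lines; all the figures are the post's own estimates:

```python
# Sanity-checking the napkin math (every number is the post's own estimate).
pages = 1.5e9                    # pages worth indexing
bytes_per_page = 30_000          # ~30 KB per index entry
index_tb = pages * bytes_per_page / 1e12
print(index_tb)                  # 45.0 TB, in the ballpark of the 40 TB figure

embeddings_per_gb = 2e6          # quantized embeddings per GB of RAM
ram_gb = pages / embeddings_per_gb
print(ram_gb)                    # 750.0 GB of RAM for all embeddings

machines = 6
ram_per_machine = 128            # GB, as in the six-desktop setup
print(machines * ram_per_machine >= ram_gb)  # True: 768 GB covers it
```

Notably, the six-machine total of 768 GB of RAM just clears the 750 GB needed to hold quantized embeddings for all 1.5 billion pages, which is presumably why six desktops is the magic number.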
In two to three years this will be doable on a single machine for around $3k.
By the end of the decade it should run as an app on a powerful desktop.
Three years after that it can run on a #laptop.
Three years after that it can run on a #cellphone.
By #2040 it's a background process on your cellphone.