Title: P2: fixing a mistake in a job task for Allmagen [2023-11-17 Fri]
GaussianProcessClassifier may show a better result; I just don't have time
to test it now, since I have a new task.

I updated my code to scikit-learn 1.4: newer versions of OneHotEncoder (and
OrdinalEncoder) have a min_frequency parameter that helps fight the problem of
sparse categories, when a categorical feature has many infrequent values.
⚰
#ml #machinelearning #datascience #ds #kNN
Title: P1: fixing a mistake in a job task for Allmagen [2023-11-17 Fri]
Simple division by the standard deviation was enough to increase accuracy. kNN
is used in the SMOTE algorithm for oversampling to fix the extreme skew in the
target.
Before: ⛈
- accuracy 0.8228730152627735
- roc_auc 0.8231424341535153
After: ⛅️
- accuracy 0.9710410979970012
- roc_auc 0.9708914087236685
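A small sketch of why dividing by the deviation matters here (my own toy numbers, plain NumPy): without scaling, a large-scale feature dominates the Euclidean distance, so kNN, and the SMOTE step built on it, picks neighbours almost entirely by that one column.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on wildly different scales, e.g. a salary and a 0-1 ratio.
X = np.column_stack([rng.normal(50_000, 10_000, 100),
                     rng.normal(0.5, 0.1, 100)])

def nearest(X, i):
    """Index of the nearest neighbour of row i under Euclidean distance."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                       # exclude the point itself
    return int(np.argmin(d))

raw_nn = nearest(X, 0)                  # dominated by the salary column
X_scaled = X / X.std(axis=0)            # divide each column by its deviation
scaled_nn = nearest(X_scaled, 0)        # both features now weigh in equally
print(raw_nn, scaled_nn)
```

The two calls often return different neighbours, which is the distortion the fix above removes.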
I used a random forest classifier. As I see from my algorithm-selection step,
#ml #machinelearning #datascience #ds #kNN
Title: P0: fixing a mistake in a job task for Allmagen [2023-11-17 Fri]
I found my mistake in the task I solved for the Allmagen company last week.
The task was to build a simple binary-classification model with calibration.
I studied kNN and manifold learning (a new-to-me approach for non-linear
dimensionality reduction). kNN uses a metric and a distance matrix, just like
clustering, and hence requires mixed-data-type scaling.
#ml #machinelearning #datascience #ds #kNN

If you are building an application that requires search, I recommend adopting Elasticsearch early on. Beyond the usual full-text search, Elasticsearch lets you run hybrid search: combining the results of text and vector search.
Of course, for small amounts of data you can use PostgreSQL's tsvector together with the pgvector extension, but in the long term Elasticsearch will provide good performance.
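A minimal sketch of such a hybrid request (the index name, field names, and the query vector are illustrative placeholders; the request shape follows the Elasticsearch 8.x search API, where a lexical `query` and a `knn` clause are combined in one search and their scores merged):

```python
# Hybrid search: combine full-text matching with kNN over an embedding field.
# "articles", "body", "body_vector" and the vector values are made up.
query_vector = [0.12, -0.53, 0.07]  # would come from a sentence-transformer

request = {
    "query": {                      # lexical part: BM25 full-text match
        "match": {"body": "hybrid search"}
    },
    "knn": {                        # vector part: approximate nearest neighbours
        "field": "body_vector",
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 100,
    },
}

# With the official Python client this would be sent as:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   hits = es.search(index="articles", **request)["hits"]["hits"]
print(sorted(request))  # ['knn', 'query']
```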

#Elasticsearch #Search #tsvector #pgvector #KNN #Embedding #SentenceTransformers #AI

Given how vocal I am against the "AI" industry, some of my followers might be surprised to learn that I'm now a co-author on a machine learning paper.

That paper has been submitted to the proceedings of an upcoming conference under their "Responsible AI" track, but it has nothing to do with LLMs or really anything that has recently been pushed by the industry's hype-machine. A pre-print is available on arxiv.org ("Tiny, Hardware-Independent, Compression-based Classification") while its formal review is pending.

Our paper expands on a technique I've been using to classify my emails for more than two years called "NCD-KNN" (Normalized Compression Distance with K-Nearest Neighbours). This method uses commonly available compression utilities like GZIP to estimate the relative "distance" between an input and a set of labeled examples, ultimately categorizing that input according to the labels of the K-nearest examples.

We solved some fundamental problems that could result in negative distances under specific circumstances we identified, addressed other theoretical limitations that prevented its broader use, and extended NCD to applications using Support Vector Machines (SVMs) for non-linear classification.

My co-authors are not on Fedi, but if any of this interests you then feel free to Ask Me Anything

#AMA #ML #AI #SVM #NCD #KNN

Tiny, Hardware-Independent, Compression-based Classification

The recent developments in machine learning have highlighted a conflict between online platforms and their users in terms of privacy. The importance of user privacy and the struggle for power over user data has been intensified as regulators and operators attempt to police online platforms. As users have become increasingly aware of privacy issues, client-side data storage, management, and analysis have become a favoured alternative to large-scale centralised machine learning. However, state-of-the-art machine learning methods require vast amounts of labelled user data, making them unsuitable for models that reside client-side and only have access to a single user's data. State-of-the-art methods are also computationally expensive, which degrades the user experience on compute-limited hardware and also reduces battery life. A recent alternative approach has proven remarkably successful in classification tasks across a wide variety of data -- using a compression-based distance measure (called normalised compression distance) to measure the distance between generic objects in classical distance-based machine learning methods. In this work, we demonstrate that the normalised compression distance is actually not a metric; develop it for the wider context of kernel methods to allow modelling of complex data; and present techniques to improve the training time of models that use this distance measure. We demonstrate that the normalised compression distance works as well as and sometimes better than other metrics and kernels -- while requiring only marginally more computational cost and in spite of the lack of formal metric properties. The end result is a simple model with remarkable accuracy even when trained on a very small number of samples, allowing for models that are small and effective enough to run entirely on a client device using only user-supplied data.


How to write your own classification classes, for beginners

Last time I already talked about how, during my studies at School 21, I built a linear-regression class; this time we will look at implementing LogisticRegression, GaussianNB, and KNN. As before: minimum theory, maximum practice.

https://habr.com/ru/articles/966764/
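In the spirit of the article, a minimal from-scratch KNN classifier (this sketch is mine, not the article's code: plain NumPy, Euclidean distance, majority vote):

```python
import numpy as np
from collections import Counter

class KNN:
    """k-nearest-neighbours classifier: no training, just memorize the data."""
    def __init__(self, k: int = 3):
        self.k = k

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X, dtype=float):
            # Euclidean distances from x to every stored sample.
            d = np.linalg.norm(self.X - x, axis=1)
            nearest = self.y[np.argsort(d)[: self.k]]
            preds.append(Counter(nearest).most_common(1)[0][0])
        return np.array(preds)

model = KNN(k=3).fit([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]],
                     [0, 0, 0, 1, 1, 1])
print(model.predict([[0.5, 0.5], [5.5, 5.5]]))  # [0 1]
```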

#LogisticRegression #GaussianNB #KNN #школа_21

🌘 Feature Extraction with KNN
➤ Creating more discriminative features from KNN distance computations
https://davpinto.github.io/fastknn/articles/knn-extraction.html
This article introduces the knnExtract function from the fastknn package, which generates new features from the distances between each sample and its k nearest neighbours within each class. Cross-validation is used to avoid overfitting, and parallel computation is supported. Benchmarks show that KNN-extracted features significantly improve classification accuracy over a linear model trained on the raw features alone.
+ A very practical technique, especially for datasets whose raw features do not separate the classes well. Thanks to the author for sharing!
+ I thought KNN was only for classification; I had no idea it could also be used for feature extraction. The parallel-computation part is great too.
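fastknn is an R package; as a rough Python sketch of the same idea (my own illustration, not the package's implementation): for each class, the distances from a sample to its k nearest training neighbours of that class become new features.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_extract(X_train, y_train, X, k=2):
    """Append, per class, the distances from each sample in X to its k
    nearest training neighbours of that class (fastknn-style features).
    Note: for the training set itself, fastknn uses cross-validation
    folds to avoid leakage; this sketch skips that step."""
    feats = []
    for cls in np.unique(y_train):
        nn = NearestNeighbors(n_neighbors=k).fit(X_train[y_train == cls])
        dist, _ = nn.kneighbors(X)          # shape (n_samples, k)
        feats.append(dist)
    return np.hstack(feats)                 # shape (n_samples, k * n_classes)

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
F = knn_extract(X_train, y_train, X_train, k=2)
print(F.shape)  # (40, 4)
```

The extracted columns are near zero for the sample's own class and large for the other class, which is what makes them easy for a linear model to separate.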
#MachineLearning #FeatureEngineering #KNN