How to Build Vector Search from Scratch in Python

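This article explains in detail how to implement a vector search engine from scratch using only Python and NumPy. It covers the basic principle of vector search, which is to convert text into high-dimensional embedding vectors and measure semantic closeness with cosine similarity, and it uses a simple product-description dataset to walk step by step through embedding generation, normalization, indexing, and query handling. It also uses PCA to visualize the embedding space in two dimensions, so that the cluster structure and the position of the query vector are easy to grasp intuitively. It is a practical introduction for AI developers who want to understand the core concepts and implementation principles of vector search.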
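
A minimal sketch of the steps summarized above (embed, normalize, index, query, then project to 2D with PCA), assuming only NumPy. It is not the article's code: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the three product descriptions and all function names are made up for illustration.

```python
import numpy as np

def build_vocab(docs):
    """Assign each distinct token in the corpus a column index."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: token counts over vocab."""
    vec = np.zeros(len(vocab))
    for token in text.lower().split():
        if token in vocab:  # tokens unseen in the corpus are ignored
            vec[vocab[token]] += 1.0
    return vec

def normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against zero vectors

def build_index(docs, vocab):
    """The 'index' here is just a matrix of unit-length document vectors."""
    return normalize(np.stack([toy_embed(d, vocab) for d in docs]))

def search(query, docs, index, vocab, top_k=2):
    """Brute-force search: score every document, return the top_k matches."""
    q = normalize(toy_embed(query, vocab))
    scores = index @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:top_k]     # highest score first
    return [(docs[i], float(scores[i])) for i in order]

def pca_2d(matrix):
    """Project rows onto the first two principal components via SVD."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Made-up product descriptions, for illustration only.
docs = [
    "wireless noise cancelling headphones",
    "stainless steel water bottle",
    "bluetooth wireless speaker",
]
vocab = build_vocab(docs)
index = build_index(docs, vocab)
results = search("wireless headphones", docs, index, vocab)
coords = pca_2d(index)  # one (x, y) point per document, ready for plotting
```

Because every vector is normalized up front, the query step reduces to a single matrix-vector product. A real system would swap `toy_embed` for an actual embedding model; the brute-force scan stays workable only while the index fits comfortably in memory.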

https://www.kdnuggets.com/how-to-build-vector-search-from-scratch-in-python

#vectorsearch #python #embedding #cosinesimilarity #pca

How to Build Vector Search from Scratch in Python

Learn how to build a vector search engine from scratch in Python with embeddings, similarity scoring, and basic retrieval logic.

KDnuggets
Finding Similar Products with LINQ: An Efficient Approach
Discover efficient product similarity search using LINQ! Learn how LINQ in .NET enables elegant & efficient methods for finding similar products based on various criteria. Optimize your queries for large datasets. #LINQ #.NET #SQL #ProductRecommendation #CosineSimilarity #MachineLearning
https://tech-champion.com/database/sql-server/finding-similar-products-with-linq-an-efficient-approach/
...
How to Implement a Cosine Similarity Function in TypeScript for Vector Comparison | alexop.dev

Learn how to build an efficient cosine similarity function in TypeScript for comparing vector embeddings. This step-by-step guide includes code examples, performance optimizations, and practical applications for semantic search and AI recommendation systems

I’m excited to share my newest blog post, "Don't use cosine similarity carelessly"

https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity

We often rely on cosine similarity to compare embeddings—it's like “duct tape” for vector comparisons. But just like duct tape, it can quietly mask deeper problems. Sometimes, embeddings pick up a “wrong kind” of similarity, matching questions to questions instead of questions to answers or getting thrown off by formatting quirks and typos rather than the text's real meaning.

In my post, I discuss what can go wrong with off-the-shelf cosine similarity and share practical alternatives. If you’ve ever wondered why your retrieval system returns oddly matched items or how to refine your embeddings for more meaningful results, this is for you!

I want to thank Max Salamonowicz and Grzegorz Kossakowski for their feedback after my flash talk at the Warsaw AI Breakfast, Rafał Małanij for inviting me to give a talk at the Python Summit, and everyone who asked curious questions at the conference and on LinkedIn.

#cosineSimilarity #embedding #llm #similarity

Don't use cosine similarity carelessly

Cosine similarity - the duct tape of AI. Convenient but often misused. Let's find out how to use it better.

For folks who work in #DataScience, what's the easiest way for me to calculate the #CosineSimilarity of two strings? I'm looking at sklearn cosine_similarity first.
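
One possible starting point for the question above, as a rough sketch rather than a recommendation: sklearn's `cosine_similarity` works on numeric vectors, so the strings have to be turned into vectors first. The standard-library version below assumes the simplest possible vectorization, whitespace-separated word counts, and the function name `cosine_similarity_strings` is made up for this example.

```python
import math
from collections import Counter

def cosine_similarity_strings(a, b):
    """Cosine similarity of two strings using lowercase word-count vectors."""
    va = Counter(a.lower().split())
    vb = Counter(b.lower().split())
    # Only words present in both strings contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm_a * norm_b)

score = cosine_similarity_strings("the cat sat", "the cat ran")
```

Word counts only capture shared vocabulary, so two paraphrases with no words in common score 0; comparing meaning rather than wording would need embeddings from a model instead.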

Related to hallucination detection in #ASR: low cosine similarity is indicative of hallucination.