Mastodawn

Visualizing LLM embeddings on a sphere

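This project takes 1,002 word embeddings generated with OpenAI's text-embedding-3-small model, reduces them to coordinates on a sphere using PCA and UMAP, and renders them as an interactive 3D visualization with React Three Fiber. Because it uses UMAP's haversine metric to map the embeddings directly onto the sphere (S²), it avoids the clustering distortion that arises in flat 3D space. The frontend and backend are separate, so the visualization works immediately without an API key, and Python scripts are included for regenerating the embeddings and recomputing the coordinates. It is a useful tool for AI developers who want to understand and analyze the spatial relationships among LLM embeddings intuitively.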
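As a rough illustration of the sphere mapping described above: a reducer that embeds onto a sphere yields two angles per word, and these have to be converted to x, y, z before a 3D renderer can place them. The sketch below assumes the two angles are a polar angle and an azimuth in radians. The function names and the sample angles are hypothetical and are not taken from the repository.

```python
import math

def sphere_to_xyz(theta, phi):
    """Map a polar angle theta and an azimuth phi (radians) to a unit-sphere point."""
    x = math.sin(theta) * math.cos(phi)
    y = math.sin(theta) * math.sin(phi)
    z = math.cos(theta)
    return (x, y, z)

def great_circle_distance(p, q):
    """Angle in radians between two unit vectors, i.e. arc length on the unit sphere."""
    dot = sum(a * b for a, b in zip(p, q))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot)))

# Hypothetical angular coordinates standing in for the reducer's output.
angles = {"cat": (1.0, 0.2), "dog": (1.1, 0.3), "car": (2.5, 4.0)}
points = {word: sphere_to_xyz(t, p) for word, (t, p) in angles.items()}

near = great_circle_distance(points["cat"], points["dog"])
far = great_circle_distance(points["cat"], points["car"])
```

Distances are measured along the surface rather than through the interior, which is the property that makes a sphere layout differ from a flat 3D scatter plot.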

https://github.com/dbyter/sphere-embed

#openai #embedding #visualization #umap #react

GitHub - dbyter/sphere-embed: Visualize embeddings and LLM relationships on a sphere's surface


GitHub

sayzard 6h ago

Extracting alignment data in open models

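This paper shows that a substantial amount of alignment training data can be extracted from post-trained open models. Rather than relying on string matching, it measures semantic similarity with a high-quality embedding model, and by the authors' conservative estimate this identifies at least 10 times more extractable data than approximate string matching would. It also finds that models readily reproduce data used in post-training stages such as SFT and RL, and that the extracted data can be used to train a base model that recovers a meaningful amount of the original performance. The study exposes a potential risk in alignment data extraction and suggests that distillation can effectively amount to indirectly training on the model's original dataset.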

https://arxiv.org/abs/2510.18554

#alignment #openmodels #embedding #trainingdata #distillation


In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model -- useful to steer the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring success of training data extraction through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high quality embedding model can identify semantic similarities between strings that a different metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of $10\times$) the amount of data that can be extracted due to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can be then used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk towards extracting alignment data. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models seem to be regurgitating aspects of their training set, distillation can therefore be thought of as indirectly training on the model's original dataset.

arXiv.org

sayzard 1d ago

How to Build Vector Search from Scratch in Python

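This article explains in detail how to build a vector search engine from scratch using only Python and NumPy. It covers the basic principle of vector search, which converts text into high-dimensional embedding vectors and measures semantic closeness with cosine similarity, and it walks step by step through embedding generation, normalization, indexing, and query handling on a small dataset of product descriptions. It also uses PCA to project the embedding space into two dimensions, making the cluster structure and the position of the query vector easy to see. It is a practical introduction for AI developers who want to understand the core concepts and implementation of vector search.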
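The normalize, index, and query steps summarized above can be sketched in a few lines of NumPy. This is a minimal illustration and not the article's code: the function names are hypothetical, and the three-dimensional vectors are hand-made stand-ins for real embeddings from a model.

```python
import numpy as np

def normalize(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def search(index, query, k=2):
    """Return (row, score) pairs for the k rows most similar to the query."""
    q = normalize(query.reshape(1, -1))[0]
    scores = index @ q                    # cosine similarity against every row
    top = np.argsort(-scores)[:k]         # highest scores first
    return [(int(i), float(scores[i])) for i in top]

# Toy "embeddings"; real ones would come from an embedding model.
docs = ["wireless headphones", "bluetooth earbuds", "steel water bottle"]
vectors = np.array([[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.0, 0.2, 0.9]])
index = normalize(vectors)                # the "index" is just the normalized matrix

results = search(index, np.array([1.0, 0.2, 0.0]), k=2)
for row, score in results:
    print(docs[row], round(score, 3))
```

This brute-force scan compares the query with every stored vector, which is fine for small collections and is the baseline that approximate indexes try to speed up.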

https://www.kdnuggets.com/how-to-build-vector-search-from-scratch-in-python

#vectorsearch #python #embedding #cosinesimilarity #pca


Learn how to build a vector search engine from scratch in Python with embeddings, similarity scoring, and basic retrieval logic.

KDnuggets

sayzard 1d ago

Show HN: Obsidian-Semantic, a CLI that lets agents search your vault by meaning

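Obsidian-Semantic is a CLI tool that lets AI agents search an Obsidian vault by meaning. Agents can go beyond plain text search to find connections between notes and gradually expand the vault's knowledge like a wiki. It supports local embedding models through Ollama and LMStudio as well as cloud models through the Gemini API, so users can choose and control which model is used. It is a useful tool for integrating AI agents with personal knowledge management.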

https://github.com/ravila4/obsidian-semantic-search

#obsidian #semanticsearch #embedding #cli #aiagent

GitHub - ravila4/obsidian-semantic-search: Semantic search for Obsidian vaults using LanceDB and Gemini/Ollama embeddings


GitHub

sayzard 1d ago

S Banerjee (@SB434223)

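The post stresses that embedding quality alone is not enough for RAG: as the collection grows, the search space becomes denser, "almost relevant" documents multiply, and recall drops. The technical takeaway is that large-scale RAG therefore depends on retrieval design and post-processing steps such as reranking.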
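A two-stage retrieve-then-rerank pipeline of the kind the post alludes to might look like the sketch below. Everything here is illustrative: the vectors are hand-made to simulate a crowded neighborhood, and `term_overlap` is a hypothetical stand-in for a real reranker such as a cross-encoder.

```python
def retrieve(query_vec, doc_vecs, k):
    """First stage: rank documents by dot product and keep the top k ids."""
    scores = {doc_id: sum(a * b for a, b in zip(query_vec, vec))
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def rerank(query, candidates, texts, scorer, n):
    """Second stage: re-score only the candidates with a more precise scorer."""
    ranked = sorted(candidates, key=lambda d: scorer(query, texts[d]), reverse=True)
    return ranked[:n]

def term_overlap(query, text):
    """Hypothetical reranker stand-in: fraction of query terms found in the text."""
    q_terms = set(query.lower().split())
    return len(q_terms & set(text.lower().split())) / len(q_terms)

texts = {
    "a": "reset your account password",
    "b": "password strength tips",
    "c": "reset router to factory settings",
}
# Nearly identical vectors simulate a dense region of "almost relevant" documents.
doc_vecs = {"a": [0.70, 0.70], "b": [0.72, 0.69], "c": [0.71, 0.70]}
query, query_vec = "reset account password", [0.7, 0.7]

candidates = retrieve(query_vec, doc_vecs, k=3)
best = rerank(query, candidates, texts, term_overlap, n=1)
```

In this toy data the first stage places the most relevant document last among near-ties, and the second stage moves it to the top, which is the failure mode and the remedy the post describes.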

https://x.com/SB434223/status/2052648564321595428

#rag #embedding #reranking #retrieval #llm

S Banerjee (@SB434223) on X

@akshay_pachaar this is such an important point people miss with RAG embedding quality alone isn’t enough , retrieval becomes a density problem at scale as collection grow, semantic neighborhoods become crowded with “almost relevant” docs, and recall collapses which is why: - reranking

X (formerly Twitter)

sayzard 2d ago

My Claude dreams at night and remembers everything. Better than mempalace

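iai-mcp is a local long-term memory system for Claude and other MCP-compatible AI assistants. It records every conversation verbatim, automatically recalls relevant information, and injects it at the start of a conversation. Embeddings are computed entirely locally, and stored memories are encrypted with AES-256-GCM to strengthen privacy. Through automatic capture, automatic recall, and background consolidation, its memory becomes more tailored to the user over time. It consists of a Python daemon and a TypeScript wrapper, runs on macOS and Linux, and ships with installation and health-check tools.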

https://github.com/CodeAbra/iai-mcp

#localmemory #mcp #claude #aicodeassistant #embedding

sayzard 3d ago

Show HN: Vectors.Space – A free service for embeddings

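Vectors.Space is a free service that unifies several embedding providers, including OpenAI, Gemini, Voyage, and local Llama, behind a single API, so developers can focus on building their product instead of managing embedding pipelines. Built-in caching, usage tracking, and token overflow handling reduce cost and latency, and a dashboard shows usage and performance at a glance. It also offers instant switching between embedding models, key management, and detailed logs, providing the stability and convenience needed to operate AI infrastructure.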
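The general pattern of one embedding call in front of several providers, with caching and usage tracking, can be sketched as follows. This is not the Vectors.Space API: the `EmbeddingRouter` class and the fake provider functions are hypothetical and exist only to show the idea.

```python
import hashlib

class EmbeddingRouter:
    """Hypothetical wrapper: one embed() call, swappable providers, in-memory cache."""

    def __init__(self, providers, default):
        self.providers = providers                    # name -> callable(text) -> list of floats
        self.default = default
        self.cache = {}
        self.calls = {name: 0 for name in providers}  # provider calls actually made

    def embed(self, text, provider=None):
        name = provider or self.default
        key = (name, hashlib.sha256(text.encode("utf-8")).hexdigest())
        if key not in self.cache:
            self.calls[name] += 1                     # only cache misses reach the provider
            self.cache[key] = self.providers[name](text)
        return self.cache[key]

# Fake providers for illustration; real ones would call a remote or local model.
def fake_small(text):
    return [float(len(text)), 0.0]

def fake_large(text):
    return [float(len(text)), 1.0, 2.0]

router = EmbeddingRouter({"small": fake_small, "large": fake_large}, default="small")
first = router.embed("hello")
second = router.embed("hello")           # served from the cache
other = router.embed("hello", "large")   # different provider, separate cache entry
```

Keying the cache on both provider name and text hash matters because vectors from different models are not interchangeable, even for identical input.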

https://vectors.space

#embedding #api #caching #llm #aiinfrastructure