Moxy, a Go reliability layer for Redis-style queues

Moxy is a Go-based open-source project that adds a reliability layer on top of Redis-style queues. It provides leases, ACKs, visibility timeouts, and expired-lease recovery so that jobs are not lost when a worker dies mid-task. It mitigates the risk of Redis's bare LPOP pattern, supports both Redis and in-memory queue backends, and implements atomic job operations with Lua scripts. It is currently single-node and built around backend adapters; a TCP/RESP protocol and distributed features are planned. Useful as a reference for developers building reliable job queues.
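
The heart of the pattern Moxy implements, popping a job and recording a lease in one atomic step so a crashed worker's job can be reclaimed, can be sketched with go-redis and an embedded Lua script. This is a minimal illustration of the general pattern, not Moxy's actual script; the key names, lease layout, and timeout below are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// Atomically pop a job and record a lease deadline in a sorted set.
// If the worker never ACKs (removes the lease), a reaper can re-queue it.
// NOTE: illustrative only -- Moxy's real script and key layout may differ.
const claimScript = `
local job = redis.call('LPOP', KEYS[1])
if not job then return false end
redis.call('ZADD', KEYS[2], ARGV[1], job)
return job
`

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	deadline := time.Now().Add(30 * time.Second).Unix() // visibility timeout
	job, err := rdb.Eval(ctx, claimScript, []string{"jobs", "leases"}, deadline).Result()
	if err == redis.Nil {
		fmt.Println("queue empty")
		return
	}
	if err != nil {
		panic(err)
	}

	// ... process job ...

	// ACK: remove the lease so the reaper won't re-deliver.
	rdb.ZRem(ctx, "leases", job)

	// A reaper loop would periodically re-queue expired leases:
	//   ZRANGEBYSCORE leases -inf <now>  ->  RPUSH jobs <job>; ZREM leases <job>
}
```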

https://github.com/an8kk/moxy

#go #redis #queue #reliability #distributedsystems

What 9 years of data reveals about B.C. Ferries cancellations and delays
How have cancellations and delays changed on the routes you travel? CBC News took a look at data provided by the corporation to see what course these ships are taking when it comes to reliability.
https://www.cbc.ca/news/canada/british-columbia/what-s-changed-for-b-c-ferries-cancellations-and-delays-9.7194665?cmp=rss

fly51fly (@fly51fly)

Google researchers propose 'The Silent Vote', a technique that improves zero-shot LLM reliability by aggregating semantically close response candidates. Without any additional training, it leverages multiple semantic neighbors of a response to improve output stability and consistency.
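
The paper's exact algorithm isn't described here, but "aggregating semantic neighborhoods" belongs to the same family as self-consistency voting: sample several answers, group the semantically close ones, and return the answer with the most neighbors. A toy sketch under that reading (the greedy clustering, the threshold, and the precomputed embeddings are all assumptions, not the paper's method):

```go
package main

import (
	"fmt"
	"math"
)

// cosine computes cosine similarity between two embedding vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}

// silentVote greedily groups candidates whose embeddings are within
// thresh of a cluster seed, then returns the seed of the largest
// cluster -- i.e. the answer with the most semantic neighbors.
func silentVote(answers []string, embs [][]float64, thresh float64) string {
	assigned := make([]bool, len(answers))
	best, bestSize := "", 0
	for i := range answers {
		if assigned[i] {
			continue
		}
		size := 1
		assigned[i] = true
		for j := i + 1; j < len(answers); j++ {
			if !assigned[j] && cosine(embs[i], embs[j]) >= thresh {
				assigned[j] = true
				size++
			}
		}
		if size > bestSize {
			best, bestSize = answers[i], size
		}
	}
	return best
}

func main() {
	// Toy example: three paraphrases of one answer outvote an outlier.
	answers := []string{"Paris", "paris.", "The capital is Paris", "Lyon"}
	embs := [][]float64{{1, 0}, {0.98, 0.1}, {0.95, 0.2}, {0, 1}}
	fmt.Println(silentVote(answers, embs, 0.9)) // -> Paris
}
```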

https://x.com/fly51fly/status/2054314303034163618

#llm #zeroshot #reliability #google #arxiv

[CL] The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods S Badhe, P Tiwari, D Shah [Google] (2026) https://t.co/JGoCuMJUUG

Have you tried turning it off and on again?

In software and hardware alike, 'restart, reboot, reinstall' is the most universal and effective troubleshooting method. The author shares two experiences: curing a memory leak in a large telecom server fleet with a script that rebooted machines at random, and an e-bike whose firmware bug caused a costly loss because it had no reset button. The takeaway is that software and hardware design should admit that failures will happen and make restarting fast and easy; from a practitioner's perspective, fast boots, painless reinstalls, and a physical reset mechanism are essential.
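
The author's random-reboot fix generalizes to any long-running process with a slow leak: restart it on a jittered schedule so a fleet never restarts everything at once. A minimal sketch of such a supervisor (the binary path and intervals are placeholders, not from the article):

```go
package main

import (
	"log"
	"math/rand"
	"os/exec"
	"time"
)

// Supervise a leaky process by restarting it on a jittered schedule,
// so many instances running this loop never restart simultaneously.
func main() {
	const base = 6 * time.Hour // restart at most this often

	for {
		cmd := exec.Command("/usr/local/bin/leaky-server") // placeholder binary
		if err := cmd.Start(); err != nil {
			log.Fatalf("start: %v", err)
		}

		// Kill and relaunch after base time plus up to 1h of random jitter.
		jitter := time.Duration(rand.Int63n(int64(time.Hour)))
		time.Sleep(base + jitter)

		if err := cmd.Process.Kill(); err != nil {
			log.Printf("kill: %v", err)
		}
		cmd.Wait() // reap the child before relaunching
	}
}
```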

https://eblog.fly.dev/onoff.html

#softwareengineering #reboot #reliability #iot #bugmanagement

A practical guide to designing software for failure by making it easy to restart, reboot, and reinstall.

Count seven times, drop once: modeling incidents before you deploy

A rocket is not sent into space just because its engine and its pump each passed bench tests separately. Before launch, engineers compute the trajectory, model operating regimes, and analyze failure scenarios. The calculation does not replace real tests, but it gives them a meaningful frame. In software it usually works the other way. A distributed user journey, say, placing an order, is assembled from dozens of microservices, databases, and queues. Developers add a new dependency, see green tests, check local metrics, and ship the release. The assumption is that if something goes wrong during a failure, the well-configured observability stack will surely show it. It will, of course. But why, when designing microservices, are we so comfortable learning about the fragility of the architecture mostly from the incident itself? This article is about getting a rough calculation of system degradation before the release: not as a replacement for chaos engineering or monitoring, but as a step before them. I describe two experiments in which a topological model was automatically extracted from distributed traces and failure scenarios were then simulated on it with a Monte Carlo method. I then compared the simulation results with real fault injections on the DeathStarBench and OpenTelemetry Demo testbeds. Two experiments, results, and code.
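
The mechanics are easy to picture: treat the call graph recovered from traces as a dependency graph, assign each service a failure probability, and sample which services are down to estimate how often a given user journey breaks. A toy Monte Carlo sketch (the graph, the probabilities, and the hard-dependency semantics are assumptions; the article's model is richer):

```go
package main

import (
	"fmt"
	"math/rand"
)

// deps maps each service to the services it hard-depends on
// (in the article's setup, extracted from distributed traces).
var deps = map[string][]string{
	"checkout": {"cart", "payment"},
	"cart":     {"redis"},
	"payment":  {"db", "fraud"},
}

// pFail is each service's per-request failure probability (made-up numbers).
var pFail = map[string]float64{
	"checkout": 0.001, "cart": 0.002, "payment": 0.002,
	"redis": 0.005, "db": 0.003, "fraud": 0.01,
}

// ok reports whether svc works given this trial's set of down services,
// assuming every dependency is a hard dependency.
func ok(svc string, down map[string]bool) bool {
	if down[svc] {
		return false
	}
	for _, d := range deps[svc] {
		if !ok(d, down) {
			return false
		}
	}
	return true
}

func main() {
	const trials = 100_000
	failures := 0
	for t := 0; t < trials; t++ {
		down := map[string]bool{}
		for svc, p := range pFail {
			if rand.Float64() < p {
				down[svc] = true
			}
		}
		if !ok("checkout", down) {
			failures++
		}
	}
	fmt.Printf("estimated checkout failure rate: %.4f%%\n",
		100*float64(failures)/trials)
}
```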

https://habr.com/ru/articles/1033570/

#resilience #causality #graphs #sre #reliability #modeling

AI Model Reliability

A vulnerability in a popular AI model's training data is a wake-up call, prompting a closer look at AI model reliability.

https://airanked.dev/posts/ai-model-reliability-concerns

#AI #Model #Reliability

I've got a new article up on the Resilience in Software Foundation (RISF) website!

This post is an introduction to the concept of practicing together in teams and points to some resources for learning more, including the RISF event this coming Wednesday where we'll play through one of the games!

https://resilienceinsoftware.org/news/11517597

#PracticeOfPractice #Expertise #ConnectiveLabor #CommonGround #SRE #Resilience #Reliability

Sources of Practice

Practice of Practice is a way to provide a social architecture that supports connective labor between teams and team members. Solidifying this layer of relationships is key to nourishing resilience in any organization. It does this through a tripod of supportive and interconnected parts:

- A Practice of Practice regimen forms a community of practice that reflects ambient values: the types of learning that happen there are useful elsewhere, driving a culture surrounding these values.
- The norms and rituals provided by Practice of Practice come about because the organization sanctions and supports the time and effort that goes into making space in people's schedules to become part of the community of practice.
- This all requires a dedicated leader who makes sure the time and resources exist for connection and empathy. They learn to facilitate and run the gameplay and discover new things about the system along with other community members. They are the champion of this discipline at the org.

This article introduces a resource for Practice of Practice gameplay by providing some of the context for why this is a good idea for teams managing ambiguity. We'll see how communication is dependent on the signals we use to connect as humans and how practicing them helps us collaborate better.

When software systems fail, people are called to action. Because failures are unexpected, this typically results in those people being interrupted. Regardless of sophisticated auto-remediation or defect location, there will be a point where everyone is gathered together in a chat channel or video bridge, and people are the ones who coordinate to restore the system. Coordination is gained through close collaboration: people working together. Collaboration is possible because people communicate. To do that, people use signals.

Signals

Our basic building block of communication is a signal. To understand the origin of how we came to need digital signals, let's talk a little bit about talk. In 1983, a unix command called talk was added to BSD. This version could connect users across the network, not just on the local machine. In two panes on the screen, one above the other, two people can write out a conversation. When I first used it around 1990 it felt magical; I still remember the room where I sat late at night on campus, astonished while chatting over talk with someone in England. It's still in macOS and other unix-derived systems today.

The rise of simultaneous chat came shortly after the dawn of the smiley, or emoticon. :-) This one is a signal to someone else, portraying a smile, using only the character set available in text-based apps like talk. Japanese kaomoji also appeared around the same time, used by teenagers instead of computer nerds. They tend to be more expressive, in that they carry stronger signals. (^_^)

Digital media has expanded around emoticons and kaomoji. We use Group Emails, Video Calls, Chat Threads, Memes, DMs, @ Mentions, even Document Comments to send signals. Contrast this with what we do when we're in a room together, when we can make eye contact or even touch. When we're not in the same room, humans will adapt their signaling to the situation. ¯\_(ツ)_/¯

By doing things together as a team we learn our signals. Over time, we notice how they appear, shared across digital media where we communicate. Actions together become streamlined because we begin to share a common jargon, or argot: "Cycling 30 in 5, someone on dbmain WAL?" This may look like it says one thing, but mean quite another to a well-practiced team.

Practice

Consider that Practice has two phases that are closely interconnected and yet separate activities: declarative study where we practice theory, and interactive application where we practice the craft. The first is the learned skill of performing a role. This is declarative knowledge that can be learned through observation and study. This Theory of Practice is connected strongly to intellectual activity; it represents a map of how things should happen. The second is a non-intellectual activity. The Practice of Practice connects us to instincts, intuitions, and insights. It contains practical knowledge and represents hands-on learning. Ambiguity is a quality of this kind of practice.

Deliberate Practice contains both. It takes deliberate practice to become skillful through theories of the technique, and it takes deliberate practice to work in concert with a group of humans. Blending the phases of Practice is necessary for overall collaboration, communication, and coordination. Rehearsing our Practice together is what "Practice of Practice" means. This is why music groups practice improvising together, why arctic rescuers practice improvising together, and why site reliability engineers practice improvising together. To practice improvisation, many disciplines play games. Through their signal building, games help us build relationships.

Relationships

Practice of Practice Resources is a catalog of games, exercises, definitions, and links. These are activities for teams looking for ideas to get their own learning discipline going. There are no more than a dozen or so items listed today, with potential to grow larger as we find and try more things to play.

When we engage in play together, our group unlocks a special ability called connective labor. Each person feels seen by the others. Community is created, trust is built, knowledge is shared, and empathy is gained. Training to improvise together reveals another special ability in teams: adaptive capacity. An example of this is when responders become overloaded by a new ambiguity and our practiced team knows quickly how to shed load or get expert help using the limited resources they have at hand. Different games will cover different ways that resilience like this shows up during failures.

Regardless of which gameplay style fits your needs, you can help make the practice stick by following some basic guidelines:

- Designate a person who stewards these sessions and champions them in the org. They are usually the first to pick the game, plan, schedule, and facilitate. This isn't a shared responsibility; this is a chance for someone to pick up leadership skills.
- Establish your sessions as a Community of Practice by making it a recurring event. Block that time for future meetings and let attendance be optional. Pick a time when people are generally available.
- Get official buy-in from leadership by including them in the activity. When leaders join and participate in learning, they are supporting the kind of social architecture that nourishes connective labor.

Building relationships is the primary thing Practice of Practice hopes to do. For that, we need opportunities for sharing. Practice is where the context of our work comes into play. For musicians it might be jazz bebop, for arctic rescuers a harsh dynamic landscape, for SRE the joint cognitive activities we use to manage computer systems. Hopefully this library of games and exercises will help your team.

If you are interested in learning more, we're holding our first Practice of Practice Gamelan session on May 13th, with interactive exercises and an opportunity to learn how to help your teams work together better.

Matt Davis

Build Reliable Notifications with Postgres

This post shows how to build a reliable notification and messaging system on Postgres. Storing messages in a Postgres table enables an atomic workflow where each message is sent and consumed exactly once, while LISTEN/NOTIFY provides efficient, low-latency delivery of notifications. To avoid losing notifications when a connection drops, the post also describes re-querying the table on reconnect. This approach cleanly addresses duplicate-message handling and failure recovery in distributed systems.
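
The shape of the pattern: the table is the source of truth, NOTIFY is only a wake-up signal, and every connect or reconnect starts by draining unconsumed rows so nothing sent while you were offline is missed. A rough sketch with pgx (the channel name and schema are invented; the DBOS post's actual implementation differs):

```go
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://localhost/app")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Subscribe first, then drain: anything committed before LISTEN is
	// picked up by the query, anything after wakes us via NOTIFY.
	if _, err := conn.Exec(ctx, "LISTEN messages"); err != nil {
		log.Fatal(err)
	}
	drain(ctx, conn)

	for {
		// Blocks until a NOTIFY arrives on any listened channel.
		if _, err := conn.WaitForNotification(ctx); err != nil {
			log.Fatal(err) // real code would reconnect, LISTEN, and drain again
		}
		drain(ctx, conn)
	}
}

// drain consumes all unconsumed messages; the table, not the
// notification payload, is the source of truth.
func drain(ctx context.Context, conn *pgx.Conn) {
	rows, err := conn.Query(ctx,
		`UPDATE messages SET consumed = true
		 WHERE consumed = false
		 RETURNING id, body`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var id int64
		var body string
		if err := rows.Scan(&id, &body); err != nil {
			log.Fatal(err)
		}
		log.Printf("message %d: %s", id, body)
	}
}
```

On the sending side, the INSERT and the NOTIFY go in the same transaction; since Postgres delivers NOTIFY only on commit, the write and the signal succeed or fail together.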

https://www.dbos.dev/blog/low-latency-reliable-messaging-with-postgres

#postgresql #distributedsystems #messaging #reliability #workflow

Signed, Sealed, Delivered: Low-Latency, Reliable Messaging with Postgres

How to use Postgres to implement high-performance, fault-tolerant, atomic notifications and messaging.

Show HN: Statewright – Visual state machines that make AI agents reliable

Statewright is a tool that uses visual state machines to make AI agents reliable; it is designed so that even a 1.3B-parameter model can carry out tasks dependably on commodity hardware. Instead of relying on large models and long prompts, it constrains the tool set and the solution space with a state machine, achieving efficiency and reliability at the same time. It currently integrates easily with Claude Code, and custom states, transitions, and guardrails can be designed in a graphical interface. For AI-agent developers, it offers a practical means of workflow control.
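
Independent of Statewright's own format, the enforcement idea is a plain guarded state machine: the agent may only call tools permitted by the current state, and only guard-approved transitions advance it. A toy sketch (the states, tools, and guards here are invented for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

// A transition names the next state plus a guard that must approve it.
type transition struct {
	to    string
	guard func() bool
}

type machine struct {
	state string
	tools map[string][]string     // state -> tools the agent may call
	next  map[string][]transition // state -> allowed transitions
}

// allow rejects any tool call the current state doesn't permit,
// regardless of what the model asked for.
func (m *machine) allow(tool string) error {
	for _, t := range m.tools[m.state] {
		if t == tool {
			return nil
		}
	}
	return fmt.Errorf("tool %q not allowed in state %q", tool, m.state)
}

// advance moves to `to` only if that edge exists and its guard passes.
func (m *machine) advance(to string) error {
	for _, tr := range m.next[m.state] {
		if tr.to == to && tr.guard() {
			m.state = to
			return nil
		}
	}
	return errors.New("transition rejected")
}

func main() {
	testsPass := false // e.g. set by actually running the test suite
	m := &machine{
		state: "plan",
		tools: map[string][]string{
			"plan": {"read_file"},
			"edit": {"read_file", "write_file", "run_tests"},
		},
		next: map[string][]transition{
			"plan": {{to: "edit", guard: func() bool { return true }}},
			"edit": {{to: "done", guard: func() bool { return testsPass }}},
		},
	}

	fmt.Println(m.allow("write_file")) // rejected: still in "plan"
	m.advance("edit")
	fmt.Println(m.allow("write_file")) // nil: allowed in "edit"
	fmt.Println(m.advance("done"))     // rejected until tests pass
}
```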

https://statewright.ai/

#aiagent #statemachine #workflow #reliability #claudecode

Statewright — Design the process. Your agent follows it exactly.

Visual workflow builder with protocol-level enforcement for AI coding agents. Free tier. Open-source engine.

What's Gone Wrong at GitHub?

Over the past year GitHub has suffered serious reliability problems, with a sharp rise in outages; GitHub Actions in particular has failed often enough to significantly disrupt developer workflows. The article attributes much of this to the explosive growth of AI-driven automated development activity, a scale the platform was never designed to absorb. Microsoft folding GitHub into its CoreAI organization, weakening its independent leadership, has also hampered fixing the problems. GitHub has promised transparent incident analyses and architectural improvements, but the community's patience is nearly exhausted.

https://leaddev.com/software-quality/whats-gone-wrong-at-github

#github #ai #reliability #cicd #microsoft

For most of its life, GitHub has been remarkably reliable. Then, somewhere in the last year or so, that stopped being the case. What happened?
