The Box You Cannot Check

The clinic intake form on the clipboard at the front desk has two boxes next to the word “sex.” A patient who fits neither option has three choices: pick one box and lie, write something in the margin, or refuse the form. The receptionist will not read the margin. Data entry clerks will not transcribe it. Electronic health record (EHR) systems will not store anything outside the two values the form lists. The patient walks out of the clinic with a treatment plan based on a box that does not correspond to their body, their history, or their current endocrine state. The form has done its job, which is not the job it claimed to do.

The form claims to collect information. Its actual function is categorization. Information collection would mean recording what the patient told the clinic. Categorization means sorting the patient into one of the boxes the system already had, regardless of whether the patient fits. These are different operations. The form does the second while presenting itself as doing the first.
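A minimal sketch of that difference in Python. The category list, the function names, and the sample margin note are all hypothetical, chosen only to make the two operations visible side by side; no real EHR schema is implied.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the two boxes printed on the form.
FORM_CATEGORIES = ("female", "male")


def categorize(answer: str) -> Optional[str]:
    """Categorization: keep the answer only if it matches an existing box."""
    normalized = answer.strip().lower()
    return normalized if normalized in FORM_CATEGORIES else None


@dataclass(frozen=True)
class CollectedResponse:
    """Collection: store what the patient said; the box is a derived field."""
    verbatim: str
    category: Optional[str]


def collect(answer: str) -> CollectedResponse:
    return CollectedResponse(verbatim=answer, category=categorize(answer))


margin_note = "nonbinary"
print(categorize(margin_note))  # the answer itself is discarded
print(collect(margin_note))     # the answer survives; only the box is empty
```

The design choice is small: collection keeps the verbatim answer and treats the box as something computed from it, so a failed match loses the category but never the response.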

I focus on the sex/gender field because it is currently the most visible example of the categorization failure, but the pattern is general. Race fields on most forms still offer five or six options plus an “other” line that researchers routinely discard in aggregate analysis because “other” cannot be merged with the named categories without distorting the comparison the categories were built to support. Ethnicity fields on US forms famously split Hispanic into its own question while leaving Middle Eastern and North African respondents to choose between “White,” “Asian,” and “Other,” none of which describes them. The Census Bureau plans to add a MENA category in 2030, decades after the gap was identified. Respondents could check multiple race boxes for the first time in 2000, which was a real improvement on prior forms. The same census kept the sex question as a binary male/female, the way every US Census has since 1790, on the grounds that adding a third option would compromise the time series.

The time series argument is worth examining because it surfaces what is actually happening. A research instrument that has measured a binary for 230 years has produced 230 years of data that reads the population as binary. Adding a third or fourth or fifth option in 2030 would mean that comparisons between 2020 and 2030 require methodological accommodation. That accommodation is doable, well-documented in survey methodology, and routine when other categories shift. Choosing not to make the accommodation keeps the data legible to historians of a category system that is no longer the category system in use. The form preserves the past at the cost of misrepresenting the present.

What happens to the data after the form is the harder problem. A survey of 10,000 people that includes 9,200 binary-box-checkers, 600 “other” or write-in responses, and 200 refusals will, in most aggregate reports, appear as a clean 9,200-person dataset. The 600 “other” responses get coded as missing, recoded into the binary categories by an analyst making a judgment call, or dropped entirely under a methodology footnote that says “respondents who declined to specify were excluded from analysis.” The 200 refusals disappear under the same kind of clause. The final published table reads as if 9,200 people answered the question cleanly, when in fact 10,000 people interacted with the question and 800 of them produced data the analyst could not use.
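The arithmetic can be laid out in a few lines of Python. The counts come from the paragraph above; the disposition labels and the `clean` function are illustrative assumptions about what a typical cleaning step does, not a description of any particular survey pipeline.

```python
from collections import Counter

# Counts from the example above; the labels themselves are illustrative.
responses = ["binary"] * 9200 + ["other"] * 600 + ["refused"] * 200


def clean(rows):
    """Mimic the cleaning step: keep only rows that fit the predefined boxes."""
    return [r for r in rows if r == "binary"]


dispositions = Counter(responses)
analyzed = clean(responses)
excluded = len(responses) - len(analyzed)

print(f"interacted with the question: {len(responses)}")
print(f"reported in the table:        {len(analyzed)}")
print(f"excluded before reporting:    {excluded} ({excluded / len(responses):.0%})")
```

Only the middle number reaches the published table. The other two exist in the pipeline and nowhere else.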

The aggregate therefore summarizes only the inputs that fit the categories already chosen. This is the gap between what the form does and what the form claims to do. The form does classification work while presenting itself as a question. Its classification system was built before the form was printed, and respondents who do not fit that classification are removed from the dataset the form generates. The dataset reads as comprehensive because the cleaning happened before anyone with access to the aggregate could see what was removed.

The mathematical consequence of this should bother statisticians more than it currently does. A dataset that excludes 8 percent of respondents because their responses did not fit the categories is subject to selection bias whenever the excluded respondents differ from the included ones, and that bias propagates into every downstream analysis. Confidence intervals computed on the 9,200 do not account for the 800. P-values can look stronger than they should because the variance in the included data is smaller than the variance in the actual respondent pool. Models trained on the cleaned data fit the cleaned data well and fail in production when they encounter the kind of respondent the cleaning removed. Every machine learning system that classifies people on the basis of survey-derived training data carries this bias forward in ways the system’s documentation almost never describes.
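A small simulation makes the mechanism concrete. Everything numeric here is an assumption for illustration: the outcome variable is hypothetical, and the two groups are simply assumed to differ (means of 50 and 65, same spread). Real data may differ by more, less, or in another direction.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical outcome: the excluded group is ASSUMED to differ from the
# included group, which is exactly what the cleaned dataset cannot reveal.
included = [random.gauss(50.0, 10.0) for _ in range(9200)]
excluded = [random.gauss(65.0, 10.0) for _ in range(800)]
everyone = included + excluded


def summarize(values):
    """Return mean, sample standard deviation, and 95% CI half-width."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    half_width = 1.96 * sd / len(values) ** 0.5  # normal approximation
    return mean, sd, half_width


for label, values in (("cleaned (9,200)", included), ("all (10,000)", everyone)):
    mean, sd, hw = summarize(values)
    print(f"{label:>16}: mean={mean:.2f}  sd={sd:.2f}  95% CI=+/-{hw:.2f}")
```

Under these assumptions the cleaned data produce a tight interval around a value that is not the mean of the people who actually answered, and a smaller standard deviation than the full pool has. The interval is honest about sampling noise and silent about the exclusion.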

The political consequence is what I think interests the new readers who arrived after the elevator essay. A form that excludes a category of people from the dataset also excludes that category from the policy decisions the dataset informs. A health system that does not record nonbinary patients in a way its analytics engine can read does not know how many nonbinary patients it serves, does not allocate resources to nonbinary patient care, does not train staff to address nonbinary patient needs, and does not appear in funding requests for nonbinary patient programs because the funding agency requires headcount data the EHR cannot produce. The form is upstream of the spreadsheet, the spreadsheet upstream of the budget, the budget upstream of the clinic. By the time the missing patients show up at the front desk, the building has been designed for the patients the form was capable of recording.

A clinic that fixes its form does not solve the problem because the EHR vendor downstream still has a two-value field. Fixing the EHR fails because the state public health reporting system still requires data in the older format. Fixing the state system fails because the federal CDC reporting standard underneath still uses the binary. The categorization is layered. Each layer has a defensible local reason for the binary it inherited. The cumulative effect is a healthcare system that cannot count its actual patient population, and a healthcare system that cannot count cannot fund, and a healthcare system that cannot fund cannot serve. The form on the clipboard at the front desk is the bottom button of a fifteen-story panel where every button on every floor is wired to the same controller, and the controller only stops the elevator on floors the original engineer drew on the original blueprint.

The fix has the same structure as the placebo button fix. Recognize which boxes work and which do not. Refuse to mistake compliance for collection. Push for upstream rewiring of forms before adding more “other” lines downstream. Demand that aggregate reports publish the count of excluded responses in the same table as the included ones, with the same prominence, in the same font. Refuse to treat a survey that loses 8 percent of its respondents as a survey of the population it sampled. Insist on the difference between a question that asks and a question that classifies, and refuse to fill out the second one as if it were the first.

The form is not neutral. It encodes what its designers were willing to recognize, and it discards what its designers were not willing to recognize, and the discard happens silently in the data pipeline rather than visibly at the front desk. A patient who writes a third answer in the margin is doing the work the form refused to do. An aggregate report that publishes 9,200 clean responses hides 800 acts of refusal by people who would not lie to the clipboard. Counting is the claim. Selection is the politics. That politics rides downstream into every room the dataset enters, every dollar the budget allocates, every protocol the staff is trained on, and every body the building was built to serve.

#binary #categorization #category #education #ehrSystems #ethnicity #female #gender #human #male #medicine #MENA #nonbinary #race #sex #tech #timeSeries

Toto 2.0: Time series forecasting enters the scaling era

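Toto 2.0, released by Datadog, is a family of foundation models for time series forecasting ranging from 4 million to 2.5 billion parameters, and it demonstrates that performance improves as model size scales up. Toto 2.0 posts top results on major benchmarks including BOOM, GIFT-Eval, and TIME, and improves on the previous version by more than 7x in parameter efficiency and inference speed. The released models and distributed training library are available under the Apache 2.0 license, so they can be used right away for AI time series forecasting and infrastructure operations.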

https://www.datadoghq.com/blog/ai/toto-2/

#timeseries #forecasting #foundationmodels #distributedtraining #datadog

Toto 2.0: Time series forecasting enters the scaling era | Datadog

For the first time, a time series foundation model gets reliably better with scale: five open-weights sizes from 4M to 2.5B parameters, trained from a single recipe.

Datadog

Scaling As-Of Joins

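The Daft library now natively supports ASOF joins, which are essential for aligning time series data. Through three optimizations (hash grouping, binary search, and multithreaded parallelism) it achieves 6x faster performance and half the memory usage compared with the previous baseline. V3 introduced a streaming parallel approach that does not depend on data cardinality, maximizing multicore utilization. V4 applied range partitioning and a carryover mechanism to solve the data skew problem, yielding an ASOF join that stays efficient and scalable in a distributed environment. This can be of practical help when building large-scale time series AI pipelines and ML feature stores.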

https://www.daft.ai/blog/scaling-asof-joins

#asofjoin #timeseries #distributedcomputing #parallelprocessing #daft

Scaling As-of Joins

How we built, broke, and re-built our ASOF joins — 6x faster, half the memory of pandas, and scaled to a distributed cluster.

Daft
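For readers who have not met an as-of join: a rough sketch in plain Python of a backward as-of join using two of the optimizations the summary names, hash grouping and binary search. This is not Daft's implementation or API; the function name, the row format (lists of dicts), and the sample data are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def asof_join(left, right, key, on):
    """Backward as-of join: pair each left row with the latest right row in
    the same `key` group whose `on` value is <= the left row's `on` value."""
    # Hash grouping: bucket right rows by key so each lookup touches one group.
    groups = defaultdict(list)
    for row in right:
        groups[row[key]].append(row)
    times = {}
    for k, rows in groups.items():
        rows.sort(key=lambda r: r[on])
        times[k] = [r[on] for r in rows]

    joined = []
    for row in left:
        candidates = times.get(row[key], [])
        # Binary search: count of right timestamps <= the left timestamp.
        i = bisect_right(candidates, row[on])
        match = groups[row[key]][i - 1] if i > 0 else None
        joined.append((row, match))
    return joined


# Hypothetical sample data: sensor readings and calibration offsets.
readings = [
    {"sensor": "a", "ts": 1, "reading": 1.0},
    {"sensor": "b", "ts": 3, "reading": 2.0},
    {"sensor": "a", "ts": 5, "reading": 3.0},
]
offsets = [
    {"sensor": "a", "ts": 0, "offset": 0.1},
    {"sensor": "a", "ts": 2, "offset": 0.2},
    {"sensor": "b", "ts": 4, "offset": 0.3},
]

result = asof_join(readings, offsets, key="sensor", on="ts")
for reading, match in result:
    print(reading["sensor"], reading["ts"], match["offset"] if match else None)
```

Sensor "b" gets no match at ts=3 because its only offset arrives at ts=4, after the reading. The parallel and distributed parts of the post (streaming, range partitioning, carryover) are not attempted here.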

Lies, damned lies, and Elastic's benchmarks

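Elastic's published benchmark, which claims its TSDS time series engine is 30x faster than Prometheus, has been criticized as unreliable. The benchmark code and environment details were not released, so the results cannot be reproduced, and Elastic struggled to ingest data at high sampling rates. The author built their own benchmark harness and ran the tests, but Elastic's performance fell short of expectations, and the author stresses that benchmark results should be interpreted with care. The post also raises the need for a standard benchmarking tool for observability metrics.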

https://www.gouthamve.dev/lies-damned-lies-and-elastics-benchmarks/

#benchmarking #timeseries #prometheus #elasticsearch #observability

Lies, damned lies, and Elastic's benchmarks

It's a spicy title, but one that is warranted. Elastic recently published a blog post titled 30x faster than Prometheus, that claimed that Elastic's new TSDS timeseries engine is much better than Prometheus. When I started reading it, I was curious; maybe there are useful optimizations we can learn about.

Goutham City

Context parroting: A simple but tough-to-beat baseline for foundation models in scientific machine learning https://arxiv.org/abs/2505.11349

Context parroting relies on short stretches of time-series data (or context). As it moves through the time series, it scans for similar patterns or motifs that appeared earlier in the sequence, and uses those patterns to predict what might come

https://openreview.net/forum?id=EUAXc9Hlvm

https://www.santafe.edu/news-center/news/a-simple-baseline-for-ai-forecasting-in-machine-learning

#machineLearning #forecasting #timeseries #ML
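A rough sketch of the idea as described above, in plain Python: take the most recent stretch of the series as the context, find the earlier stretch that looks most like it, and copy what came next. The function name, the squared-distance similarity measure, and the single-best-match rule are simplifying assumptions; the paper's actual method may differ in all three.

```python
import math


def parrot_forecast(series, context_len, horizon):
    """Find the earlier window closest to the last `context_len` points and
    return the `horizon` values that followed that window."""
    if len(series) < context_len + horizon:
        raise ValueError("series too short for this context and horizon")
    context = series[-context_len:]
    best_start, best_dist = None, math.inf
    # A candidate window must leave `horizon` observed values after it.
    last_start = len(series) - context_len - horizon
    for start in range(last_start + 1):
        window = series[start:start + context_len]
        dist = sum((a - b) ** 2 for a, b in zip(window, context))
        if dist < best_dist:  # strict '<' keeps the earliest best match
            best_start, best_dist = start, dist
    follow = best_start + context_len
    return series[follow:follow + horizon]


# A perfectly periodic toy series: the motif [1, 2, 3] is always followed by 0, 1.
toy = [0, 1, 2, 3] * 5
print(parrot_forecast(toy, context_len=3, horizon=2))  # [0, 1]
```

On a periodic series the match is exact and the forecast is the true continuation; on noisy data the quality depends on how well the nearest earlier motif resembles the present one.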

Anomaly detection isn't just for metrics.

#VictoriaMetrics Anomaly Detection supports additional input data sources through the #VictoriaLogs reader, allowing you to monitor #log-derived and trace-derived metrics for anomalies.
This makes it more versatile: it can handle a wider range of data sources beyond #timeseries metrics from VictoriaMetrics or #Prometheus, including VictoriaLogs and #VictoriaTraces.

🕐 2026-04-04 12:04 UTC

📰 The full picture of FLAIR, a time series forecasting method that beat a 710M+ Foundation Model with 4 parameters (👍 35)

🇬🇧 FLAIR beats Amazon's 710M-parameter Chronos model with just 4 parameters. No GPU needed, pure numpy/scipy. Simple yet powerful time series forecast...
🇰🇷 FLAIR outperforms Amazon's 710-million-parameter Chronos model with just 4 parameters. No GPU needed; a time series forecasting technique implemented with numpy/scipy alone.

🔗 https://zenn.dev/t_honda/articles/flair-time-series-forecasting

#TimeSeries #MachineLearning #Zenn

📰 The full picture of FLAIR, a time series forecasting method that beat a 710M+ Foundation Model with 4 parameters (👍 35)

🇬🇧 FLAIR: 4-parameter time series method beats 710M foundation models. Only needs numpy/scipy, no GPU required
🇰🇷 FLAIR: time series forecasting that outperforms a 710M model with 4 parameters; needs only numpy/scipy, no GPU

🔗 https://zenn.dev/t_honda/articles/flair-time-series-forecasting

#TimeSeries #MachineLearning #Zenn

📰 The full picture of FLAIR, a time series forecasting method that beat a 710M+ Foundation Model with 4 parameters (👍 31)

🇬🇧 FLAIR: Just 4 parameters & numpy/scipy outperform Amazon's 710M-param Chronos on 25 time-series benchmarks. No GPU needed, 500-line Python file.
🇰🇷 FLAIR: outperforms Amazon's 710M-parameter Chronos with just 4 parameters and numpy/scipy. No GPU needed, a 500-line Python file.

🔗 https://zenn.dev/t_honda/articles/flair-time-series-forecasting

#TimeSeries #MachineLearning #Zenn

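📰 The full picture of FLAIR, a time series forecasting method that beat a 710M+ Foundation Model with 4 parameters (👍 30)

🇬🇧 FLAIR: A 4-parameter time series forecasting method that outperforms Amazon's 710M-parameter Chronos model with just numpy.
🇰🇷 FLAIR: a time series forecasting technique with just 4 parameters that beat Amazon's 700-million-parameter model using numpy alone.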

🔗 https://zenn.dev/t_honda/articles/flair-time-series-forecasting

#MachineLearning #TimeSeries #Zenn
