Metris Arts
Bloomberg Philanthropies
Create Today

These folks don't care about your privacy and are actively paying other companies to steal and sell your data.

#Avoid #evaluation #Privacy

Deeply disappointed and angry at arts evaluators who are seemingly okay with using surveillance tech like Placer.AI because they see it as a shortcut around more direct and relational data collection.

Yes, let's give money to these assholes who buy and sell your data, indiscriminately scraping up location and financial data from anyone with any sort of device.

But I guess I'm the stupid one, because I'm the one that's unemployed.

#evaluation #RealEvalTalk #privacy #ethics #surveillance

Registrations are open for the EDA's #Test & #Evaluation Community Days 2026 🇪🇺

Discussions will focus on artificial intelligence, autonomous systems and strengthening European cooperation.

➡️ https://eda.europa.eu/news-and-events/events/2026/09/29/default-calendar/test---evaluation-community-days

#EUdefence #AI #AutonomousSystems

---
https://nitter.net/EUDefenceAgency/status/2044302927179796979#m

Ivan Fioravanti ᯅ (@ivanfioravanti)

A new MLX-specific benchmark suite has been released and introduced, and it looks like a useful tool for measuring and comparing local AI performance. It should be of particular interest to developers who want to evaluate the performance of different models and settings in the MLX ecosystem.

https://x.com/ivanfioravanti/status/2045446687750009024

#mlx #benchmark #localai #ai #evaluation

Ivan Fioravanti ᯅ (@ivanfioravanti) on X

A new MLX specific benchmark suite from the mythical @ActuallyIsaak 🔥 I have to try it absolutely!

X (formerly Twitter)

Gaurav Vij (@Gaurav_vij137)

Reports that the evaluation of Qwen 3.6 35B A3B, the latest model from Alibaba Qwen, is complete. The performance was analyzed with the autonomous evaluation system withneo, and detailed results are to be shared later.

https://x.com/Gaurav_vij137/status/2045434731077173435

#qwen #benchmark #llm #evaluation #ai

Gaurav Vij (@Gaurav_vij137) on X

Just finished evaluating Qwen 3.6 35B A3B - the most recent release from @Alibaba_Qwen The evaluation was performed autonomously by @withneo - More on that later. Here is what we found about the model's performance: 1/n 👇


stevibe (@stevibe)

Introduces HermesAgent-20, a new benchmark pack for BenchLocal. It is designed to evaluate a model's real workload by running 20 scenarios extracted from the Hermes Agent source code against a real Hermes instance. The tool grew out of the concern that existing benchmarks are not realistic enough, and it is a meaningful step toward better agent evaluation.

https://x.com/stevibe/status/2045165824294658539

#benchmark #aiagents #evaluation #tooling #opensource

stevibe (@stevibe) on X

Introducing HermesAgent-20, a new Bench Pack for BenchLocal. 20 scenarios extracted straight from the Hermes Agent source code, run against a REAL Hermes instance. The actual workload you'd put your model through. Why I built BenchLocal in the first place: most benchmarks are


Ivan Fioravanti ᯅ (@ivanfioravanti)

Announces that a system which pulls each platform's recommended sampling settings from a central repository via API and uses them for benchmarking is nearly complete. By supplying default values for temperature, top-p, top-k, and min-p, it can improve model evaluation and experiment reproducibility.

https://x.com/ivanfioravanti/status/2045160329785598231

#sampling #benchmarking #api #llm #evaluation

Ivan Fioravanti ᯅ (@ivanfioravanti) on X

Yes it works! Centralized repo of best sampling settings suggested by providers and retrieved through API for benchmarking. Nearly ready! 🚀 "Sampling defaults from platform (family → qwen). Override with --temperature/--top-p/--top-k/--min-p."
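
The "defaults per family, overridable from the command line" idea in the post above can be sketched roughly as follows. This is only an illustration, not the author's tool: the `PLATFORM_DEFAULTS` table, the `FALLBACK` values, and the function names are all hypothetical, the numbers are placeholders rather than any provider's real recommendations, and an in-memory dict stands in for the API lookup the post describes. Only the four flag names come from the quoted message.

```python
import argparse

# Hypothetical defaults table standing in for the central repository.
# The values are placeholders, not real provider recommendations.
PLATFORM_DEFAULTS = {
    "qwen": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}

# Hypothetical fallback used when a model family is not in the table.
FALLBACK = {"temperature": 1.0, "top_p": 1.0, "top_k": 0, "min_p": 0.0}


def parse_overrides(argv):
    """Parse the override flags named in the post; unset flags stay None."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--temperature", type=float)
    parser.add_argument("--top-p", type=float)
    parser.add_argument("--top-k", type=int)
    parser.add_argument("--min-p", type=float)
    # argparse turns "--top-p" into the key "top_p", and so on.
    return vars(parser.parse_args(argv))


def resolve_sampling(family, overrides):
    """Start from the family's defaults, then apply any explicit overrides."""
    settings = dict(PLATFORM_DEFAULTS.get(family, FALLBACK))
    for name, value in overrides.items():
        if value is not None:
            settings[name] = value
    return settings


if __name__ == "__main__":
    print(resolve_sampling("qwen", parse_overrides(["--temperature", "0.2"])))
```

Keeping the overrides as `None` until a flag is actually passed is what lets the family defaults show through for every setting the user leaves alone.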


Sumeet Motwani (@sumeetrm)

LongCoT, a new benchmark for evaluating long reasoning ability, has been released. It is designed to measure long-horizon reasoning at the scale of tens to hundreds of thousands of tokens, and consists of 2.5K problems spanning chemistry, math, chess, logic, computer science, and more. Even frontier models scored below 10%.

https://x.com/sumeetrm/status/2044805806567104971

#longcot #benchmark #reasoning #llm #evaluation


AshutoshShrivastava (@ai_for_success)

Raises the problem that speech synthesis (TTS) benchmarks do not predict real production performance well. The tone and energy a voice needs depend on the use context, such as a healthcare voice versus a billing assistant voice, and the post mentions testing LLMs as judges to evaluate this. It points to the need for better ways of evaluating voice AI.

https://x.com/ai_for_success/status/2044772943302115445

#tts #llm #benchmark #voiceai #evaluation

AshutoshShrivastava (@ai_for_success) on X

The TTS benchmarks everyone uses don't predict what actually works in production. What works in production is whether it fits what it's supposed to do.. A healthcare voice needs different energy than a billing assistant. We tested LLMs as judges. They confidently score flawed
