A good day to trie-hard: saving compute 1% at a time

📌 Summary: This post describes how the Cloudflare team reduced CPU usage by optimizing request-processing code, centered on trie-hard, their new open-source Rust crate. With request volume growing steeply, the team found that clear_internal_headers, a function run on every request, accounted for 1.7% of total CPU time. By reworking the header-clearing logic and choosing a more efficient data structure, they cut the function's average runtime from 3.65 microseconds to 0.93 microseconds, reclaiming roughly 1.28% of CPU. The work was guided by benchmarking tools and validated with production performance monitoring, which confirmed the predicted gains.

🎯 Key Points:
- Cloudflare's Pingora framework serves up to 35 million requests per second globally.
- The original clear_internal_headers function consumed a disproportionate share of CPU time: 1.7% of the total.
- Reworking the data structure and the clearing logic significantly shortened the function's runtime.
- The new trie-based implementation, trie-hard, brought execution time under 1 microsecond.
- Production measurements matched the benchmark results, confirming the optimization's effectiveness.

🔖 Keywords: #Cloudflare #Rust #Optimization #PerformanceTesting #OpenSource

A good day to trie-hard: saving compute 1% at a time

Pingora handles 35M+ requests per second, so saving a few microseconds per request can translate to thousands of dollars saved on computing costs. In this post, we share how we freed up over 500 CPU cores by optimizing one function and announce trie-hard, the open source crate that we created to do it.

The Cloudflare Blog
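The actual trie-hard crate is written in Rust; the sketch below is only an illustration of the idea in Python, with hypothetical header names. Storing the set of internal header names in a trie lets each incoming header be accepted or rejected with a single character-by-character walk, failing as early as the first byte that cannot match.

```python
# Illustrative sketch (not Cloudflare's Rust implementation): store the
# internal header names in a trie so that deciding whether an incoming
# header must be removed takes one walk over its characters.

END = object()  # sentinel key marking the end of a stored word

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def contains(trie, word):
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False  # rejected as soon as no stored name can match
    return END in node

# Hypothetical internal header names, for illustration only.
internal = build_trie(["cf-internal-a", "cf-internal-b", "cf-secret"])

def clear_internal_headers(headers):
    return {k: v for k, v in headers.items() if not contains(internal, k)}
```

The benefit over a plain hash-set lookup is that non-matching headers are rejected after inspecting only a few bytes, without hashing the whole name; the Rust crate exploits this with a much more compact node representation than the dictionaries used here.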
🌘 Instrumenting the Python GIL with eBPF | Coroot
➤ Best practices for measuring the impact of the Python GIL
https://coroot.com/blog/instrumenting-python-gil-with-ebpf
This article examines how Python's Global Interpreter Lock (GIL) affects the performance of multithreaded applications and shows how to measure its impact with eBPF. The GIL guarantees thread safety but limits the performance of CPU-bound programs. By analyzing CPython's source code, the article shows how to trace GIL acquisition time and walks through the implementation step by step.
+ Very helpful for understanding Python's performance bottlenecks, especially for multithreaded developers.
+ Using eBPF to analyze the GIL sounds fascinating; looking forward to more real-world case studies!
#Python #GIL #eBPF #PerformanceTesting
Instrumenting Python GIL with eBPF

Learn how to measure the impact of the Python GIL on application latency using eBPF
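The eBPF side of the article cannot be reproduced portably here, but the effect it measures can be demonstrated in a few lines of plain Python: because of the GIL, CPU-bound threads take turns rather than running in parallel, so two threads do not finish a fixed workload faster than running it twice serially.

```python
# Minimal demonstration of the effect the article measures: CPU-bound
# Python threads are serialized by the GIL, so the threaded run takes
# about as long as the serial one on a standard CPython build.
import threading
import time

def burn(n):
    # Pure-Python CPU work; the thread holds the GIL while computing.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 2_000_000

start = time.perf_counter()
burn(N); burn(N)
serial = time.perf_counter() - start

start = time.perf_counter()
threads = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
threaded = time.perf_counter() - start

print(f"serial: {serial:.3f}s  two threads: {threaded:.3f}s")
```

What eBPF adds, per the article, is the ability to attribute that lost time to GIL acquisition in a running production process, without modifying or restarting it.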

🌘 GitHub - seddonm1/sqlite-bench: code to accompany a blog post
➤ A tool for testing SQLite transaction behavior, with multi-platform support
https://github.com/seddonm1/sqlite-bench
This project, sqlite-bench, is built to test SQLite transaction behavior; the accompanying blog post covers it in more detail. It exposes options for configuring test parameters, including the database path, output file, number of seed rows, number of concurrent threads, and the counts of scan and update operations. The author recommends running it on an in-memory filesystem first to spare your SSD. Multi-platform Docker images are also available.
+ Very useful for developers who need to tune SQLite performance.
+ Docker image support makes cross-platform testing much more convenient.
#SoftwareDevelopment #SQLite #PerformanceTesting
GitHub - seddonm1/sqlite-bench: Code to accompany blog post https://reorchestrate.com/posts/sqlite-transactions

Code to accompany blog post https://reorchestrate.com/posts/sqlite-transactions - seddonm1/sqlite-bench

GitHub
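This is not the sqlite-bench tool itself, but the kind of transaction behavior it probes can be sketched with Python's built-in sqlite3 module: committing every update individually versus batching all updates into one transaction. The table and row counts below are arbitrary, chosen only for illustration.

```python
# Sketch of a SQLite transaction micro-benchmark: per-update commits
# versus one enclosing transaction. Uses an in-memory database, in the
# spirit of the project's advice to avoid wearing out an SSD.
import sqlite3
import time

def setup():
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v INTEGER)")
    con.executemany("INSERT INTO kv VALUES (?, 0)", [(i,) for i in range(1000)])
    con.commit()
    return con

def update_autocommit(con, n):
    for i in range(n):
        con.execute("UPDATE kv SET v = v + 1 WHERE k = ?", (i,))
        con.commit()  # one transaction per update

def update_batched(con, n):
    with con:  # a single transaction wrapping all updates
        for i in range(n):
            con.execute("UPDATE kv SET v = v + 1 WHERE k = ?", (i,))

for fn in (update_autocommit, update_batched):
    con = setup()
    start = time.perf_counter()
    fn(con, 1000)
    print(f"{fn.__name__}: {time.perf_counter() - start:.4f}s")
    con.close()
```

On a real on-disk database the gap widens dramatically, because every commit forces the journal to be synced to storage.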
🌘 Sentinel errors and errors.Is() slow your code down by 500% | DoltHub Blog
➤ Benchmark results for different error-handling strategies
https://www.dolthub.com/blog/2024-05-31-benchmarking-go-error-handling/
In this blog post we benchmark different strategies for handling errors in Go and discuss their performance and other trade-offs. We were shocked to find that the sentinel-error pattern combined with errors.Is() slows code down by roughly 5x. We share these results and discuss them in depth.
+ A solid analysis of how different error-handling strategies affect code performance; very valuable for Go developers.
+ This post really drove home the importance of benchmarking your code; looking forward to more write-ups like this.
#Golang #ErrorHandling #PerformanceTesting
Sentinel errors and errors.Is() slow your code down by 500%

An exhaustive set of benchmarks on different ways to approach error handling in Golang. We demonstrate that common sentinel error idioms slow your code down by 5x.

🌗 Measuring terminal latency with Typometer
➤ Terminal emulator latency compared; Alacritty emerges as the best choice
https://beuke.org/terminal-latency/
The author measures terminal latency in search of a suitable replacement for Xterm. Benchmarks run with the Typometer tool reveal the latency characteristics of various terminal emulators, with Alacritty coming out on top. The post also shares the author's reasoning and conclusions about choosing a terminal emulator.
+ A thorough look at terminal emulator latency that clarifies the differences between the options.
+ Benchmarking with Typometer is a great approach; comparing latency across emulators makes the choice much easier for readers.
#Terminal #Latency #PerformanceTesting
beuke.org

A personal blog about functional programming, category theory, chess, physics and linux topics

🌘 On the costs of syscalls
➤ Measuring syscall overhead, with results
https://gms.tf/on-the-costs-of-syscalls.html
This post examines the cost of system calls. The author wrote micro-benchmarks to measure the minimal cost of a syscall and reports measured times for several different syscalls. The results show that a syscall costs on the order of a few hundred nanoseconds. The post covers the methodology and results, along with some interesting findings and observations.
+ Valuable information for anyone interested in syscall performance.
+ A detailed, insightful analysis and measurement of syscall costs.
#SyscallCosts #PerformanceTesting
On the Costs of Syscalls

It's well known that syscalls are expensive. And that software mitigations against CPU bugs (such as Meltdown) even have made them more expensive. But how expensive are they really? To begin to answer this question I wrote a small micro-benchmark in order to measure the minimal costs of a syscall …

Georg's Log
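The article's benchmarks are not reproduced here, but the shape of such a micro-benchmark can be sketched in Python: time a cheap syscall in a tight loop and report the mean cost per call. Interpreter overhead sits on top of the few hundred nanoseconds the article measures for the bare syscall, so treat the number printed as an upper bound, not a measurement of the syscall itself.

```python
# Rough analogue of a syscall-cost micro-benchmark: average the cost of
# a cheap syscall over many iterations. os.getppid() is used because it
# does almost no work in the kernel and, unlike getpid(), has not
# historically been cached by libc.
import os
import time

def bench(fn, iterations=100_000):
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations  # ns per call

cost = bench(os.getppid)
print(f"~{cost:.0f} ns per call (syscall plus interpreter overhead)")
```

A serious measurement, as in the article, would be written in C, pin the CPU frequency, and subtract loop overhead; this sketch only shows the timing structure.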
🌘 The H2O.ai db-benchmark has been updated! - DuckDB
➤ Updated H2O.ai db-benchmark results: DuckDB is the fastest library
https://duckdb.org/2023/11/03/db-benchmark-update.html
The H2O.ai db-benchmark has been updated with new results, and the AWS EC2 instance used for testing has been changed to improve repeatability and fairness. DuckDB is the fastest library for both join and group-by queries across the tested data sizes.
+ DuckDB consistently performs impressively in benchmarks.
+ Updating the test environment helps ensure fair, repeatable results.
#Databases #PerformanceTesting #DuckDB
Updates to the H2O.ai db-benchmark!

TL;DR: the H2O.ai db-benchmark has been updated with new results. In addition, the AWS EC2 instance used for benchmarking has been changed to a c6id.metal for improved repeatability and fairness across libraries. DuckDB is the fastest library for both join and group by queries at almost every data size.

The Benchmark Has Been Updated!

In April, DuckDB Labs published a blog post reporting updated H2O.ai db-benchmark results. Since then, the results haven't been updated. The original plan was to update the results with every DuckDB release. DuckDB 0.9.1 was recently released, and DuckDB Labs has updated the benchmark. While updating the benchmark, however, we noticed that our initial setup did not lend itself to being fair to all solutions. The machine used had network storage and could suffer from noisy neighbors. To avoid these issues, the whole benchmark was re-run on a c6id.metal machine.

New Benchmark Environment: c6id.metal Instance

Initially, updating the results of the benchmark showed strange results. Even using the same library versions from the prior update, some solutions regressed and others improved. We believe this variance came from the AWS EC2 instance we chose: an m4.10xlarge. The m4.10xlarge has 40 virtual CPUs and EBS storage. EBS storage is highly available network block storage for EC2 instances. When running compute-heavy benchmarks, a machine like the m4.10xlarge can suffer from the following issues:

Network storage is an issue for benchmarking solutions that interact with storage frequently. For the 500MB and 5GB workloads, network storage was not an issue on the m4.10xlarge since all solutions could execute the queries in memory. For the 50GB workload, however, network storage was an issue for the solutions that could not execute queries in memory. While the m4.10xlarge has dedicated EBS bandwidth, any read/write from storage is still happening over the network, which is usually slower than physically mounted storage. Solutions that frequently read and write to storage for the 50GB queries end up doing this over the network. This network time becomes a chunk of the execution time of the query. If the network has variable performance, the query performance is then also variable.

Noisy neighbors are a common issue when benchmarking on virtual CPUs. The previous machine most likely shared its compute hardware with other (neighboring) AWS EC2 instances. If these neighbors are also running compute-heavy workloads, the physical CPU caches are repeatedly invalidated/flushed by the neighboring instance and the benchmark instance. When the CPU cache is shared between two workloads on two instances, both workloads require extra reads from memory for data that would already be in the CPU cache on a non-virtual machine.

In order to be fair to all solutions, we decided to change the instance type to a metal instance with local storage. Metal instance types negate any noisy neighbor problems because the hardware is physical and not shared with any other AWS users/instances. Network storage problems are also fixed because solutions can read and write data to the local instance storage, which is physically mounted on the hardware. Another benefit of the c6id.metal box is that it stresses parallel performance. There are 128 cores on the c6id.metal. Performance differences between solutions that can effectively use every core and solutions that cannot are clearly visible. See the updated settings section on how settings were changed for each solution when run on the new machine.

Updating the Benchmark

Moving forward, we will update the benchmark when PRs with new performance numbers are provided. The PR should include a description of the changes to a solution script or a version update, plus new entries in the time.csv and logs.csv files. These entries will be verified using a different c6id.metal instance, and if there is limited variance, the PR will be merged and the results will be updated!

Updated Settings

ClickHouse - Storage: any data that gets spilled to disk also needs to be on the NVMe drive. This has been changed in the new format_and_mount.sh script and the clickhouse/clickhouse-mount-config.xml file.

Julia (juliadf & juliads) - Threads: the threads were hardcoded for juliadf/juliads to 20/40 threads. Now the max number of threads is used. No option was given to spill to disk, so this was not changed/researched.

DuckDB - Storage: the DuckDB database file was specified to run on the NVMe mount.

Spark - Storage: there is an option to spill to disk. I was unsure of how to modify the storage location so that it was on the NVMe drive. Open to a PR with storage location changes and improved results!

Many solutions do not spill to disk, so they did not require any modification to use the instance storage. Other solutions use parallel::ncores() or default to a maximum number of cores for parallelism. Solution scripts were run in their current form on github.com/duckdblabs/db-benchmark. Please read the Updating the Benchmark section on how to re-run your solution.

Results

The first results you see are the 50GB group by results. The benchmark runs every query twice per solution, and both runtimes are reported. The "first time" can be considered a cold run, and the "second time" can be considered a hot run. DuckDB and DuckDB-latest perform very well among all dataset sizes and variations. The team at DuckDB Labs has been hard at work improving the performance of the out-of-core hash aggregates and joins. The most notable improvement is the performance of query 5 in the advanced group by queries. The cold run is almost an order of magnitude better than every other solution! DuckDB is also one of only two solutions to finish the 50GB join query.

Some solutions are experiencing timeouts on the 50GB datasets. Solutions running the 50GB group by queries are killed after running for 180 minutes, meaning all 10 group by queries need to finish within the 180 minutes. Solutions running the 50GB join queries are killed after running for 360 minutes. Link to result page

DuckDB
🌖 eBPF-based auto-instrumentation outperforms manual instrumentation
➤ eBPF-based auto-instrumentation is over 20x faster than manual instrumentation
https://odigos.io/blog/ebpf-instrumentation-faster-than-manual
This post shares benchmark results comparing eBPF-based auto-instrumentation with traditional instrumentation for OpenTelemetry tracing. The results show that eBPF-based auto-instrumentation is more than 20x faster than manual instrumentation and requires no code changes.
+ A very useful technique that could make distributed tracing far more efficient.
+ Looking forward to seeing this technique applied in more programming languages.
#AutoInstrumentation #PerformanceTesting #DistributedTracing
Unlocking Speed: eBPF-Based Auto-Instrumentation Over 20x Faster Than Traditional Instrumentation

🌘 GitHub - ChimeHQ/TextViewBenchmark: a performance test suite for macOS text views
➤ Automated testing with XCTest's UI performance testing system and custom OSSignposts
https://github.com/ChimeHQ/TextViewBenchmark
A suite for testing the performance of macOS text views, automated with XCTest's UI performance testing system and custom OSSignposts. It supports both TextKit 1 and TextKit 2 and includes published results.
+ Very useful for macOS app developers who need to test text view performance.
+ A great tool for understanding how an application performs across different macOS versions.
#GitHub #macOS #PerformanceTesting
GitHub - ChimeHQ/TextViewBenchmark: A suite of performance tests for macOS text views

A suite of performance tests for macOS text views. Contribute to ChimeHQ/TextViewBenchmark development by creating an account on GitHub.

GitHub
🌘 Hot Chips 2023: how Zen 4 handles gaming workloads
➤ Benchmarking gaming workloads on Zen 4
https://chipsandcheese.com/2023/09/06/hot-chips-2023-characterizing-gaming-workloads-on-zen-4/
This post presents the author's benchmarks of gaming workloads on Zen 4, analyzing its front-end and back-end bottlenecks along with the behavior of the instruction cache and branch predictor.
+ A fascinating read that deepened my understanding of how Zen 4 performs.
+ Great performance testing; looking forward to more articles about Zen 4.
#Zen4 #GamingWorkloads #PerformanceTesting
Hot Chips 2023: Characterizing Gaming Workloads on Zen 4

AMD didn’t present a lot of new info about the Zen 4 core at their Hot Chips 2023 presentation. Uops.info has measured execution throughput and latency for instructions on Zen 4. We’ve …

Chips and Cheese