Mastodawn

https://blog.gslin.org/archives/2026/03/03/12911/2025-%e5%b9%b4%e7%88%ac%e5%8d%81%e5%84%84%e5%80%8b%e9%a0%81%e9%9d%a2%e7%9a%84%e6%88%90%e6%9c%ac/

2025 年爬十億個頁面的成本

#amazon #aws #cloud #crawler #crawling #engine #html #https #javascript #js #library #lxml #performance #search #selectolax #service #speed #ssl #tls #web #webpage

2025 年爬十億個頁面的成本

上禮拜看到的文章，作者在 AWS 上面只用 25.5 個小時就爬了 1B 個頁面，在 tune 過效能後的成本是 $462：「Crawling a billion web pages in just over 24 hours, in 2025 (via)」。作者裡面有提到一篇 2012 年的「How to crawl a quarter billion webpages in 40 hou...

Gea-Suan Lin's BLOG

GripNews Mar 28, 2025

🌗 祕密網誌 • Xee: 一個現代的Rust XPath和XSLT引擎
➤ 對XML技術歷史、XPath、XSLT和現代的Rust Xee引擎的深入探討
✤ https://blog.startifact.com/posts/xee/
近兩年來，作者一直在開發一種名為Xee的Rust程式語言實現，支援現代版本的XPath和XSLT。Xee是一種程式語言實現，成果包含一個命令行工具和一個Rust庫，用於發行XPath查詢和在Rust中發出XPath查詢。文章介紹了Xee的起源、XML技術歷史以及對於XML以及XPath和XSLT在基於開源堆上的現狀和未來的看法。
+ 讀完這篇文章後，對於XML技術的演進和Xee的出現更有深入瞭解。
+ 文章精簡而清晰地介紹了Xee的背景和價值，對於XML技術愛好者具有啟發性。
#XML #Rust #XPath #lxml

Xee: A Modern XPath and XSLT Engine in Rust

I announce Xee, the implementation of XPath and XSLT in Rust that I've been working on for the last two years.

Secret Weblog

Simon 🐮

Mar 6, 2025

Quite literally, #fuck all your searches!
It really helps!
Exhibit A: A search for a failed #python #lxml build on #DuckDuckGo returns a useless "AI" helptext.
Exhibit B: Change the search from "Faild to build" to "Failed to fucking build" returns normal search results.
#Fucking with your search queries is useful

BSI WID Advisories Feed Nov 20, 2024

#BSI WID-SEC-2024-3505: [NEU] [mittel] #lxml: Schwachstelle ermöglicht Cross-Site Scripting

Ein entfernter, anonymer Angreifer kann eine Schwachstelle in lxml ausnutzen, um einen Cross-Site Scripting Angriff durchzuführen.

https://wid.cert-bund.de/portal/wid/securityadvisory?name=WID-SEC-2024-3505

Warn- und Informationsdienst

Habr Dec 14, 2023

Бенчмарк HTML парсеров в Python: сравнение скорости

Привет, Хабр! Меня зовут Вадим Москаленко и я разработчик инновационных технологий Страхового Дома ВСК. В этой статье хочу поделиться с вами информацией по проведенному сравнению производительности нескольких популярных библиотек для простого HTML-парсинга. При необходимости сбора данных с HTML или XML, многим python-разработчикам сразу вспомнятся две популярные библиотеки «BeautifulSoup4» и «lxml» — они весьма удобны и стали широко применяемыми. Но что, если в нашем проекте важна скорость сбора данных? Возникает вопрос: кто из них быстрее и есть ли еще более быстрые библиотеки? При поиске данной информации на Хабре, я нашел подобные статьи, но им уже несколько лет. Так как прогресс не стоит на месте и появляются новые инструменты или те, о которых еще не слышали, мне было интересно провести личное исследование и поделиться информацией. Ремарка: выбор библиотеки зависит от конкретных требований проекта, также существует еще множество инструментов, которые не были освещены в данной статье, к примеру «Scrapy» — это мощный асинхронный фреймворк. В исследовании акцентируется внимание на более простой задаче, поэтому я не гарантирую что лидер бенчмарка подойдет именно вам. Помните о важности проведения собственных тестов и анализа требований вашего проекта перед принятием решения. В качестве задачи используем поисковик нашего любого habr.com , в который отправим запрос с ключевыми словами «html parsing python» и соберем следующие данные по каждой статье: имя автора, заголовок, дату создания статьи, количество просмотров и голоса (оценки).

https://habr.com/ru/companies/vsk_insurance/articles/780500/

#benchmark #бенчмарк #html #parsing #python #beautifulsoup4 #lxml #parsel #requestshtml #selectolax

Бенчмарк HTML парсеров в Python: сравнение скорости

Хабр

Show thread

adamghill Oct 12, 2023

Using https://pytest-benchmark.readthedocs.io to compare #regex vs #HTMLParser vs lxml.html vs #BeautifulSoup with a typical @unicorn #HTML component fragment.

Still haven't figured out how to use parsel. And still scared to use #regex even though it would be way faster. 🥺

#python #lxml #pytest #benchmark

Welcome to pytest-benchmark’s documentation! — pytest-benchmark 4.0.0 documentation

djr Aug 9, 2023

Great news. I "found" an RSS feed for the news site that stopped serving them.

#xpath #lxml #python

DeaDSouL

Nov 23, 2022

#Python #Frameworks #Libraries #numpy #tensorflow #theano #pandas #pytorch #keras #matplotlib #scipy #seaborn #django #flask #bottle #cherrypy #pyramid #web2py #turboGears #cubic #dash #falcon #pyunit #behave #splinter #robot #pytest #opencv #mahotas #pgmagick #simpletk $scikit #arcade #pyglet #pyopengl #pygame #panda3d #lxml #requests #selenium #scrapy #code #developing #programming #coding

xameer May 23, 2021

#Python #lxml lib grants access to tree as it is being constructed, if you don't delete a node it will later modify, it's safe to e.g. parse 1 element/step from a big serialized array, then deleting it from its parent container, so memory usage is const
Remainder of parsing:

Ronal Forero 🇻🇪 🐧Dec 12, 2019

#CNC #inkscape Estoy convertiendo imagenes a gcode.. pero para hacer eso se necesita de varias librerias de #python2 como #numpy #lxml