Mastodawn

dehora Dec 9, 2022

Back to basics:

The Unified Logging Infrastructure for Data Analytics at Twitter, Lee, Lin, Liu, Lorek, Ryaboy, 2012

In recent years, there has been a substantial amount of work on large-scale data analytics using Hadoop-based platforms running on large clusters of commodity machines. A less-explored topic is how those data, dominated by application logs, are collected and structured to begin with. In this paper, we present Twitter's production logging infrastructure and its evolution from application-specific logging to a unified "client events" log format, where messages are captured in common, well-formatted, flexible Thrift messages. Since most analytics tasks consider the user session as the basic unit of analysis, we pre-materialize "session sequences", which are compact summaries that can answer a large class of common queries quickly. The development of this infrastructure has streamlined log collection and data analysis, thereby improving our ability to rapidly experiment and iterate on various aspects of the service.

arXiv.org


dehora Dec 9, 2022

Batch-focused semantics and Hadoop-focused mechanics, perhaps natural for its time, but still relevant in that it treats the client side as a stream of things that arrive eventually. There's a lot of interest here, for example two clever ideas and a takeaway.


dehora

Representing the client interaction space as a tree grammar, where each possible interaction is a tuple. If your site or app has a design system, this property can fall out of that.
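
A rough sketch of the tuple idea, assuming a six-level event name (client, page, section, component, element, action) along the lines of what the paper describes. The field names, the example event, and the `matches` helper are illustrative assumptions, not the paper's code.

```python
from typing import NamedTuple


class ClientEvent(NamedTuple):
    """One node-path in the interaction tree, root first (assumed levels)."""
    client: str
    page: str
    section: str
    component: str
    element: str
    action: str


def parse_event(name: str) -> ClientEvent:
    """Split a colon-delimited event name into its six levels."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 levels, got {len(parts)}: {name!r}")
    return ClientEvent(*parts)


def matches(event: ClientEvent, pattern: str) -> bool:
    """Compare level by level; '*' in the pattern accepts any value."""
    return all(p == "*" or p == v for p, v in zip(parse_event(pattern), event))


event = parse_event("web:home:mentions:stream:avatar:profile_click")
print(matches(event, "web:*:*:*:*:profile_click"))
```

Because every interaction has the same shape, a query like "all profile clicks on web, wherever they happen" becomes a wildcard over the levels rather than a hunt through ad hoc log formats.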


dehora Dec 9, 2022

Representing data as Unicode code points to reduce the volume a MapReduce job has to rip through, then reversing the encoding post-reduce. That's just a very cool hack.
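
A minimal sketch of how that hack can work, assuming each distinct event name gets one code point and the most frequent names get the smallest ones (so they cost fewer bytes in UTF-8). The function names and the starting code point are assumptions for illustration, not the paper's implementation.

```python
from collections import Counter
from typing import Dict, Iterator, List, Tuple


def code_points(start: int = 0x21) -> Iterator[str]:
    """Yield usable characters in ascending order, skipping surrogates."""
    for cp in range(start, 0x110000):
        if not 0xD800 <= cp <= 0xDFFF:
            yield chr(cp)


def build_dictionary(
    sessions: List[List[str]],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Map event names to characters, most frequent first (ties by name)."""
    counts = Counter(name for session in sessions for name in session)
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    encoder = dict(zip(ordered, code_points()))
    decoder = {ch: name for name, ch in encoder.items()}
    return encoder, decoder


def encode(session: List[str], encoder: Dict[str, str]) -> str:
    """Turn a session (list of event names) into a compact string."""
    return "".join(encoder[name] for name in session)


def decode(encoded: str, decoder: Dict[str, str]) -> List[str]:
    """Reverse the encoding, e.g. after the reduce step."""
    return [decoder[ch] for ch in encoded]
```

A whole session then collapses into a short string, so the heavy job scans characters instead of long event names, and only the final results need to be mapped back through the dictionary.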


dehora Dec 9, 2022

One takeaway is it requires intentional design to get good data 'off of' clients. In that sense, The Unified Logging Infrastructure for Data Analytics at Twitter feels very much like a letter to a constant future reader.