Data Prep for Graphs (PyData Global 2022)
Graph technologies and use cases are growing in popularity in industry. Open source libraries are available for graph data science, and they integrate with the PyData stack and related practices. Tools such as graph databases, visualization libraries, and so on tend to take center stage in discussions about graph technologies.
However – and this is a relatively BIG "however" – similar to what was recognized a decade ago when data science became mainstream practice, much time, effort, and cost must go into _data preparation_ before these downstream tools can be used effectively.
In the early-ish days of Big Data, many commercial database vendors claimed to provide full suites for data science work. Practitioners found that, in contrast, they spent most of their time on data wrangling, often using tools such as pandas. This has become the proverbial 80% of data science.
Graph data science is no exception to this rule. Case in point: data visualization tools can render beautiful representations from nearly raw data. Unfortunately, without careful preparation, those beautiful renderings become expensive wallpaper, since they don't lead to meaningful outcomes. For example, if a large dataset contains many _cycles_ in a business process where cycles should not occur (e.g., supply networks), or it contains many duplicates (e.g., slight variations of vendor or author names), then we get pretty pictures but not meaningful analysis.
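To make the cycle problem concrete, here is a minimal sketch of the kind of check involved, using invented toy data and a plain depth-first search (the talk itself is not tied to this particular implementation):

```python
# Detect a cycle in a directed "supplies" relation, where a cycle
# would indicate a data error in a supply network.
# The edge list below is invented toy data for illustration.
from collections import defaultdict

edges = [("vendor_a", "plant_1"),
         ("plant_1", "warehouse"),
         ("warehouse", "vendor_a")]   # this edge closes an invalid cycle

adj = defaultdict(list)
for src, dst in edges:
    adj[src].append(dst)

def has_cycle(adj) -> bool:
    """DFS with node coloring: a back edge to a GRAY node means a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def dfs(node) -> bool:
        color[node] = GRAY
        for nxt in adj[node]:
            if color[nxt] == GRAY:
                return True            # back edge: cycle found
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in list(adj))

print(has_cycle(adj))  # True for the toy data above
```

At production scale this naive recursion gives way to iterative or distributed variants, which is exactly where the bottlenecks discussed below appear.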
Unfortunately, data preparation techniques for graphs such as _cycle detection_, _similarity analysis_, _transitive closure_, and _unique identifier assignment_ often involve graph algorithms or distributed data structures which are computationally hard, expensive to run, and not well supported at scale by commercial graph databases.
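As a sketch of what _transitive closure_ and _unique identifier assignment_ mean in the deduplication setting: pairwise "same entity" matches get collapsed into one canonical identifier per group. The vendor names and match pairs below are invented for illustration, and the union-find approach is one common technique, not necessarily the one used in the talk:

```python
# Union-find: collapse pairwise "same entity" matches into one group
# per entity, i.e., the transitive closure of the match relation.
parent: dict[str, str] = {}

def find(x: str) -> str:
    """Return the canonical representative for x, with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a: str, b: str) -> None:
    """Record that a and b refer to the same entity."""
    parent[find(a)] = find(b)

# Hypothetical output of a fuzzy name-matching step:
matches = [("Acme Inc", "ACME, Inc."),
           ("ACME, Inc.", "Acme Incorporated"),
           ("Globex", "Globex Corp")]
for a, b in matches:
    union(a, b)

# Group every name under its canonical identifier.
groups: dict[str, list[str]] = {}
for name in parent:
    groups.setdefault(find(name), []).append(name)
print(groups)  # two groups: the Acme variants, and the Globex variants
```

The in-memory dictionary here is the part that stops scaling first; libraries like Datasketch (approximate similarity) and Ray (distributed execution) are what the talk brings in to relieve that bottleneck.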
This talk shows examples of data preparation for graphs, along with an overview of typical industry graph use cases in which these techniques are needed. We'll walk through a progressive example based on recipe data (analogous to customer data in manufacturing), using the PyData stack and other open source integrations such as Ray, Keyvi, Datasketch, Arrow/Parquet, PSL, etc., which help alleviate bottlenecks when working with large graphs at scale.