@mdwaldman22 @alesegura
Hi Ale, and thank you Marilyn -
I should correct that I was the evangelist for Spark circa 2014 (our hypergrowth) and early at Databricks. Matei Zaharia (now at Stanford) created Spark at UC Berkeley. I'll have to see who from Databricks currently might be around on the Fediverse here ...
Meanwhile, great question about ontologies! That's much closer to what our team (Derwen) has been doing in industry these past few years (will make a mini thread to answer)
@alesegura @mdwaldman22
For context, our team (derwen.ai) works with large manufacturers in the EU, mostly chemists and material scientists, and definitely lots of OBO and lots of Databricks on their enterprise data lake :)
We focus on open source integration to support AI applications that use graph technologies.
A recent overview is at https://www.anyscale.com/ray-summit-2022/agenda/sessions/232
(free access, requires registration/email)
@alesegura @mdwaldman22
There are some good community resources for this general area of work.
I moderate the "Graph Data Science" group on LinkedIn, at https://www.linkedin.com/groups/6725785/
(~600 members)
Another is "Knowledge Graph Conference" which has a Slack board with ~3000 members (I can send you a link)
Another is "Connected Data World" and "Orchestrate All The Things" podcast by George Anadiotis https://www.linkedin.com/company/connecteddataworld/
We curate a list of related conferences at https://derwen.ai/events#watchlist
@alesegura @mdwaldman22
One of the core problems in this field is that it's quite complex, and the word "graph" means so many different things ...
For example, there's much theory based on random scale-free graphs (Barabási, et al.), which we see supported in graph algorithm libraries. That said, real-world graph data tends not to fit those models. There's also lots of theoretical work on social networks or, say, PageRank, which is also very different from what we encounter in science or industry.
OBO is based on OWL, so it falls within the W3C area of semantic graphs, which use SPARQL queries, SHACL, etc.
Many data-intensive problems in industry tend to use labeled property graphs (LPG) such as neo4j and Cypher queries.
There's much work with probabilistic graphs and statistical relational learning.
There's much work with GNNs (graph neural networks).
There's much work with graph visualization.
Unfortunately, these different camps do not align much. (Here's a tiny sketch contrasting two of them below.)
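To make that concrete: a toy sketch in Python contrasting two of the camps, with the same three-node graph handled as a W3C semantic graph queried via SPARQL (rdflib) and then as a NetworkX graph for algorithm-style work. The data and names here are made up, not from any real project:

```python
# Toy contrast of two "graph" camps: a semantic graph queried with
# SPARQL vs. the same edges as a NetworkX graph for traversal/metrics.
import rdflib
import networkx as nx

ttl = """
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob .
ex:bob   ex:knows ex:carol .
"""

# semantic-graph camp: parse Turtle, query with SPARQL
g = rdflib.Graph()
g.parse(data=ttl, format="turtle")

rows = g.query("""
    SELECT ?a ?b
    WHERE { ?a <http://example.org/knows> ?b }
""")

# graph-algorithm camp: load the same edges into NetworkX
nxg = nx.DiGraph()
for a, b in rows:
    nxg.add_edge(str(a), str(b))

print(nx.shortest_path(nxg, "http://example.org/alice", "http://example.org/carol"))
```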
@alesegura @mdwaldman22
Here's a talk (slides) that goes into more detail: https://derwen.ai/s/kcgh#35
and a recent video which goes with these slides
https://www.youtube.com/watch?v=dVjsBNXcg6U
We're tracking ~6 different camps that claim the word "graph", and they tend to be mutually exclusive.
Python offers excellent libraries for working with graphs: semantic technologies, graph queries, interactive visualization, graph algorithms, probabilistic graph inference, plus embeddings and other integrations with deep learning. However, most of these approaches share little common ground, and few of them integrate effectively with popular data science tools (pandas, scikit-learn, spaCy, PyTorch), or efficiently with popular data engineering infrastructure such as Spark, RAPIDS, Ray, Parquet, fsspec, etc.
The `kglab` open source project https://github.com/DerwenAI/kglab integrates most of the above, and moreover provides ways to leverage these disparate techniques so they complement each other. The talk also explores _graph thinking_ as a cognitive framework for approaching complex problem spaces: the missing piece between what the stakeholders, domain experts, and business use cases require, versus what comes from more "traditional" enterprise IT, which is probably focused on approaches such as the "data lakehouse" or similar topics, but not doing much yet with large graphs.
Since 2020, in an open source project called `kglab`, we've worked to build integration paths between these different camps, making them more compatible with PyData approaches and providing tutorials with examples.
https://github.com/DerwenAI/kglab
https://derwen.ai/docs/kgl/tutorial/
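Roughly what that looks like in practice, as a minimal sketch in the style of the kglab tutorial: the file name, namespace, and predicate below are made up, and method names may differ across kglab versions, so treat it as an outline rather than a recipe.

```python
# Minimal kglab sketch: load RDF, run SPARQL into pandas, then hand the
# same subgraph to NetworkX for algorithm work. File, namespace, and
# predicate names are hypothetical.
import kglab
import networkx as nx

kg = kglab.KnowledgeGraph(
    namespaces={"ex": "http://example.org/"},
)
kg.load_rdf("data/materials.ttl")   # hypothetical Turtle file

sparql = """
    PREFIX ex: <http://example.org/>
    SELECT ?s ?o
    WHERE { ?s ex:relatedTo ?o }
"""
df = kg.query_as_df(sparql)          # SPARQL results as a pandas DataFrame

# project the same relations into NetworkX for graph algorithms
subgraph = kglab.SubgraphMatrix(kg, sparql)
nx_graph = subgraph.build_nx_graph(nx.DiGraph())
print(nx.density(nx_graph))
```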
Spark / Databricks has a component called GraphX (later superseded by GraphFrames)
https://graphframes.github.io/graphframes/docs/_site/index.html
This was essentially a post-doc project circa mid-2010s, and hasn't had full support in Databricks.
It focuses on large-scale graph traversals based on Spark, especially for an implementation of Pregel and bulk synchronous parallel (BSP).
TL;DR: that's not going to be especially useful with use cases that leverage OBO :)
For example, in one of the GraphX tutorials I showed how to use bike share data and a single-source shortest path (SSSP) algorithm to approximate the transit times predicted in Google Maps.
Again, that's more toward the random scale-free graphs from theoretical math.
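For a sense of what SSSP means there, here's a tiny NetworkX sketch; the station names and travel times are made up, not the actual bike share data:

```python
# Toy SSSP example with NetworkX: shortest travel times from one
# station to all others. Station names and edge weights are invented.
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([
    ("station_a", "station_b", 4.0),   # minutes
    ("station_b", "station_c", 7.0),
    ("station_a", "station_c", 15.0),
    ("station_c", "station_d", 3.0),
])

# single-source shortest path lengths (Dijkstra) from station_a
times = nx.single_source_dijkstra_path_length(G, "station_a", weight="weight")
print(times)   # {'station_a': 0, 'station_b': 4.0, 'station_c': 11.0, 'station_d': 14.0}
```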
There are many reasons why tooling like Databricks is focused on tabular, SQL-ish kinds of data, and also why that approach tends to fail at scale for graph problems. We can start with the problem of self-joins; we could also talk about transitive closure. There are large-ish religious wars fought over these topics, so I'll avoid them here.
Some of my videos explore these operational topics in more detail.
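To make the self-join / transitive closure point concrete, here's a toy sketch: reachability computed via a recursive self-join in SQLite versus a one-line graph call in NetworkX. The table and column names are made up for illustration:

```python
# Toy illustration: reachability (transitive closure) via a recursive
# self-join in SQL vs. a native graph traversal. Names are invented.
import sqlite3
import networkx as nx

edges = [("a", "b"), ("b", "c"), ("c", "d")]

# SQL route: each "hop" is another self-join over the edges table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
con.executemany("INSERT INTO edges VALUES (?, ?)", edges)

reachable = con.execute("""
    WITH RECURSIVE reach(node) AS (
        VALUES ('a')
        UNION
        SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
    )
    SELECT node FROM reach
""").fetchall()
print(reachable)   # [('a',), ('b',), ('c',), ('d',)]

# graph route: the traversal is the native operation
G = nx.DiGraph(edges)
print(nx.descendants(G, "a"))   # {'b', 'c', 'd'}
```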
Long story short: in like 99% of industry usage, people will use Databricks for ETL and data preparation, pulling data out of a data warehouse / data lake and making it available to work with as graphs.
NB: data preparation in graphs is *significantly* different from data prep in other data science work. A core issue is that data prep with graphs is computationally expensive, and often must be performed *before* any of the commercial graph tools can be used.
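As a rough illustration of that kind of prep, here's a small pandas sketch going from a tabular export to an edge list before any graph tooling gets involved; the file path and column names are hypothetical:

```python
# Rough sketch of graph data prep: pull tabular records out of a data
# lake (Parquet here) and reshape them into edges before any graph tool
# sees them. File path and column names are hypothetical.
import pandas as pd
import networkx as nx

# e.g., a table exported from the lakehouse via Databricks/Spark ETL
df = pd.read_parquet("exports/material_tests.parquet")

# reshape rows into (source, target) style edges
edges = (
    df[["material_id", "supplier_id"]]
    .dropna()
    .drop_duplicates()
    .rename(columns={"material_id": "src", "supplier_id": "dst"})
)

G = nx.from_pandas_edgelist(edges, source="src", target="dst", create_using=nx.DiGraph())
print(G.number_of_nodes(), G.number_of_edges())
```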
There are ~50 vendors now for "graph databases" and I'm certain their respective sales people will try to refute most of what I've said above. However, if you talk privately with their large customers, you'll hear back most of what I've said above :) Caveat emptor.
Here's a public spreadsheet where we curate the graph database vendors, related open source projects, and also the smaller consultancies with graph experts
(spreadsheet columns: company, url, SPARQL, RDF-Star, Gremlin, Cypher, misc query, distrib, open source, parent, speaker, notes; entries include AgensGraph https://bitnine.net/agensgraph/, AllegroGraph https://allegrograph.com/products/allegrograph/, ...)
Yes, as you mentioned, the flattening approach aligns with Databricks practices (SQL-ish in general) but then leads to information loss. That's a key point.
For a better solution, I'd need to know more about data rates and use cases downstream, before I could make much more of a recommendation.
Happy to have a call, if that helps (we can arrange off the timeline)
FWIW, I spend a lot of time in Madrid, where there's much graph expertise!
As a rule of thumb, we find that open source graph libraries (e.g., in Py) can handle up to ~10 M nodes on a late-model laptop. No DB needed. But that depends on the use case (viz, query, graph algo, GNN, etc.)
We also find that most graph DBs have trouble beyond ~100 M nodes.
Our open source efforts scale out to billions of nodes on Kubernetes