Any #DataBricks experts around? I work for a company in the biology field and we use a lot of nomenclatures/ontologies. We are starting to move to #DataBricks for data management/governance and I would like to get some of our ontologies in there. What would be the best approach to store an ontology (originally an .obo file), which is inherently non-SQL, in a way that is reasonably compatible with the other Delta tables we have? #DataEngineering #database
@alesegura Hi Ale, ask @pacoid. Paco Nathan was a founder of Apache Spark. He knows everyone at DataBricks and is in my opinion the Chief Data Scientist if ever there was one.

@mdwaldman22 @alesegura
Hi Ale, and thank you Marilyn -
I should correct that I was the evangelist for Spark circa 2014 (our hypergrowth) and early at Databricks. Matei Zaharia @ Stanford created Spark. I'll have to see who from Databricks currently might be around on the Fediverse here ...

Meanwhile, great question about ontologies! That's much closer to what our team (Derwen) has been doing in industry these past few years (will make a mini thread to answer)

@pacoid @mdwaldman22 That is super kind of you! Thanks! I asked in my group and it seems we are all kind of flattening the obo files but missing part of the info. So it would be great to hear from the experts! 💜

@alesegura @mdwaldman22
For context, our team (derwen.ai) works with large manufacturers in the EU, mostly chemists and material scientists, and there's definitely lots of OBO and lots of Databricks on their enterprise data lakes :)

We focus on open source integration to support AI applications that use graph technologies.

A recent overview is at https://www.anyscale.com/ray-summit-2022/agenda/sessions/232
(free access, requires registration/email)


@alesegura @mdwaldman22
There are some good community resources for this general area of work.

I moderate the "Graph Data Science" group on LinkedIn, at https://www.linkedin.com/groups/6725785/
(~600 members)

Another is "Knowledge Graph Conference" which has a Slack board with ~3000 members (I can send you a link)

Another is "Connected Data World" and "Orchestrate All The Things" podcast by George Anadiotis https://www.linkedin.com/company/connecteddataworld/

We curate a list of related conferences at https://derwen.ai/events#watchlist


@alesegura @mdwaldman22
One of the core problems in this field is that it's quite complex, and the word "graph" means so many different things ...

For example, there's much theory based on random scale-free graphs (Barabási et al.), which we see supported in graph algorithm libraries. That said, real-world graph data tends not to fit those models. There's also lots of theoretical work on social networks or, say, PageRank, which is also very different from what we encounter in science or industry.

@alesegura @mdwaldman22

#graphthinking

OBO is based on OWL, so it falls within the W3C area of semantic graphs, which use SPARQL queries, SHACL, etc.

Many data-intensive problems in industry tend to use labeled property graphs (LPG), such as Neo4j with Cypher queries.

There's much work with probabilistic graphs and statistical relational learning.

There's much work with graph neural networks (GNNs).

There's much work with graph visualization.

Unfortunately, these different camps do not align much.

@alesegura @mdwaldman22
Here's a talk (slides) that goes into more detail: https://derwen.ai/s/kcgh#35

and a recent video which goes with these slides
https://www.youtube.com/watch?v=dVjsBNXcg6U

We're tracking ~6 different camps that claim the word "graph" which tend to be mutually exclusive.

#graphthinking

Graph Thinking (talk abstract)

Python offers excellent libraries for working with graphs: semantic technologies, graph queries, interactive visualizations, graph algorithms, probabilistic graph inference, as well as embedding and other integrations with deep learning. However, most of these approaches share little common ground, nor do many of them integrate effectively with popular data science tools (pandas, scikit-learn, spaCy, PyTorch), nor efficiently with popular data engineering infrastructure such as Spark, RAPIDS, Ray, Parquet, fsspec, etc. The `kglab` https://github.com/DerwenAI/kglab open source project integrates nearly all of the above, and moreover provides ways to leverage disparate techniques so that they complement each other. This talk also explores _graph thinking_ as a cognitive framework for approaching complex problem spaces. This is the missing part between what the stakeholders, domain experts, and business use cases require, versus what comes from more "traditional" enterprise IT, which is probably focused on approaches such as "data lakehouse" or similar topics, but not doing much yet with large graphs.

@alesegura @mdwaldman22

In an open source project called `kglab` (since 2020) we've worked to build integration paths between these different camps, making them more compatible with PyData approaches, and providing tutorials with examples.
https://github.com/DerwenAI/kglab
https://derwen.ai/docs/kgl/tutorial/

#graphthinking #graphdatascience

GitHub - DerwenAI/kglab: Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.


@alesegura @mdwaldman22

Spark / Databricks has a graph component called GraphX, succeeded by the DataFrame-based GraphFrames package
https://graphframes.github.io/graphframes/docs/_site/index.html

This was essentially a post-doc project circa mid-2010s, and hasn't had full support in Databricks.

It focuses on large-scale graph traversals based on Spark, especially for an implementation of Pregel and bulk synchronous parallel (BSP).

TL;DR: that's not going to be especially useful with use cases that leverage OBO :)
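For intuition, the BSP/Pregel model is easy to sketch in plain Python: each vertex holds state, exchanges messages with its neighbors, and a barrier between supersteps separates the rounds. Here's a toy label-propagation example (computing connected components; graph and names are made up, and a real Pregel runs this distributed, not in one process):

```python
from collections import defaultdict

def pregel_components(edges):
    """Toy Pregel-style BSP: each vertex repeatedly adopts the smallest
    label seen among its neighbors; supersteps run until no label changes."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    label = {v: v for v in neighbors}        # initial vertex state
    changed = True
    while changed:                           # one loop pass == one superstep
        changed = False
        # "messages" sent along edges during this superstep
        inbox = {v: [label[u] for u in neighbors[v]] for v in neighbors}
        for v, msgs in inbox.items():
            best = min(msgs + [label[v]])
            if best < label[v]:
                label[v] = best
                changed = True
    return label

# two components: {a,b,c} and {x,y}
print(pregel_components([("a", "b"), ("b", "c"), ("x", "y")]))
```

The point of the superstep barrier is that all messages are computed from the *previous* round's state, which is what makes the model trivially parallelizable across partitions.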


@alesegura @mdwaldman22

For example, in one of the GraphX tutorials I showed how to use bike share data and an SSSP algorithm to approximate transit times predicted in Google Maps.

Again, that's more toward the random scale-free graphs from theoretical math.
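The SSSP idea itself fits in a few lines; a minimal stdlib sketch (Dijkstra over a weighted adjacency dict, with made-up station names and ride times, not the actual tutorial code):

```python
import heapq

def sssp(graph, source):
    """Single-source shortest paths (Dijkstra) over a weighted adjacency dict.
    graph: {node: [(neighbor, weight), ...]} -- weights are ride minutes here."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                         # stale heap entry, skip
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# hypothetical bike-share stations, edges weighted by ride minutes
stations = {
    "market_st": [("embarcadero", 7), ("civic_center", 5)],
    "civic_center": [("mission", 9)],
    "embarcadero": [("mission", 4)],
}
print(sssp(stations, "market_st"))
# {'market_st': 0, 'embarcadero': 7, 'civic_center': 5, 'mission': 11}
```

GraphX's contribution was running this kind of computation across a cluster via BSP supersteps, rather than the sequential priority-queue version above.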

@alesegura @mdwaldman22

There are many reasons why tooling like Databricks is focused on tabular, SQL-ish kinds of data, and also why that approach tends to fail at scale for graph problems. We can start with the problem of self-joins, we could also talk about transitive closure. There are large-ish religious wars fought over these topics, so I'll avoid them here.
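The transitive-closure point is easy to see concretely: asking "all ancestors of a term" in SQL needs either one self-join per level of hierarchy depth, or a recursive CTE. A toy sketch with the stdlib sqlite3 module (the is_a rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE is_a (child TEXT, parent TEXT);
INSERT INTO is_a VALUES
  ('neuron', 'cell'),
  ('cell', 'anatomical_entity'),
  ('anatomical_entity', 'material_entity');
""")

# A fixed-depth query would need one self-join per level of the hierarchy;
# the recursive CTE is SQL's escape hatch for unbounded depth.
rows = conn.execute("""
WITH RECURSIVE ancestors(term) AS (
    SELECT parent FROM is_a WHERE child = 'neuron'
    UNION
    SELECT is_a.parent FROM is_a JOIN ancestors ON is_a.child = ancestors.term
)
SELECT term FROM ancestors;
""").fetchall()
print([r[0] for r in rows])
# ['cell', 'anatomical_entity', 'material_entity']
```

This works, but each recursion level is effectively another self-join pass, which is exactly where tabular engines start to hurt once the graph is large and deep.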

Some of my videos explore these operational topics in more detail.

@alesegura @mdwaldman22

Long story short: in ~99% of industry usage, people use Databricks for ETL and data preparation, pulling data out of a data warehouse / data lake and making it available for working with graphs.

NB: data preparation in graphs is *significantly* different from data prep in other data science work. A core issue is that data prep with graphs is computationally expensive, and often must be performed *before* any of the commercial graph tools can be used.

@alesegura @mdwaldman22

There are ~50 vendors now for "graph databases" and I'm certain their respective sales people will try to refute most of what I've said above. However, if you talk privately with their large customers, you'll hear back most of what I've said above :) Caveat emptor.

Here's a public spreadsheet where we curate the graph database vendors, related open source projects, and also the smaller consultancies with graph experts

https://derwen.ai/s/52hztjkknx6n

#graphthinking


@alesegura @mdwaldman22

Yes, as you mentioned, the flattening approach aligns with Databricks practices (SQL-ish in general) but then leads to information loss. That's a key point.
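To make the information-loss point concrete, here's a minimal stdlib sketch of flattening OBO [Term] stanzas into rows (deliberately simplified, not a full OBO parser). Note that a term can have multiple is_a parents, so squeezing the result into a single "parent" column necessarily drops edges:

```python
def parse_obo_terms(text):
    """Minimal parse of OBO [Term] stanzas into flat dicts.
    Keeps id, name, and ALL is_a parents; flattening to a single
    'parent' column is where the multi-parent info gets lost."""
    terms, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {"id": None, "name": None, "is_a": []}
            terms.append(current)
        elif current is not None and ": " in line:
            key, _, value = line.partition(": ")
            if key == "id":
                current["id"] = value
            elif key == "name":
                current["name"] = value
            elif key == "is_a":
                # strip the trailing "! comment" that OBO allows
                current["is_a"].append(value.split(" ! ")[0].strip())
    return terms

sample = """\
[Term]
id: GO:0000001
name: mitochondrion inheritance
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
"""
print(parse_obo_terms(sample))
```

One pragmatic middle ground is to keep an exploded child/parent edge table in Delta (one row per is_a edge) alongside the flattened term table, so the multi-parent structure survives even if most queries only touch the flat view.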

@alesegura @mdwaldman22

For a better solution, I'd need to know more about data rates and downstream use cases before I could make a firmer recommendation.

Happy to have a call, if that helps (we can arrange off the timeline)

FWIW, I spend a lot of time in Madrid, where there's much graph expertise!

@alesegura @mdwaldman22

As a rule of thumb, we find that open source graph libraries (e.g., in Python) can handle up to ~10 M nodes on a late-model laptop. No DB needed. But that depends on the use case (viz, query, graph algos, GNNs, etc.)

We also find that most graph DBs have trouble beyond ~100 M nodes.

Our open source efforts scale out to billions of nodes on Kubernetes.

@pacoid @mdwaldman22 Thanks Paco for all the resources and your time answering this question. I found your YouTube talk very enlightening! I think my ontology and use case are quite simple right now (I just want to be able to look up a node and find the name of its parent). I think flattening the graph would work for this use case, BUT I am trying to think beyond it: graphs would become more useful once I take into account multiple ontologies interacting with each other.
@pacoid @mdwaldman22 I'd love to have a short call if you are up for it. I am incredibly ignorant in this field, but I think I have to go along with where the field and the business cases are heading, not only with what I know now and feel safe with. Really, thanks! :)