What I learned working on NebulaGraph, an open source and distributed graph database
This article is based on a talk given by Dr. Min Wu, a senior expert at Vesoft Inc. Dr. Wu talked about the status quo of the global graph database market, the design and features of NebulaGraph as a distributed graph database, as well as NebulaGraph’s open-source community.
The Global Graph Database Market
Let’s start with some numbers. MarketsandMarkets projects that the graph database market will grow from $821.8 million in 2018 to $2.4 billion by 2023.
Graph databases are still on the rise, and graph technology was named one of Gartner’s top 10 data and analytics trends for 2021. This is because graph databases can be used in many more areas than traditional databases, including computing, processing, deep learning, and machine learning models.
Graph databases gained the most in popularity in the past 10 years, according to data compiled by DB-Engines. The data is based on social media mentions, Stack Overflow questions, and search trends.
Advantages of Graph Databases
One of the most significant advantages of graph databases is that they are intuitive. Suppose you want to model the character relationships in Game of Thrones. You can do so in either a traditional tabular database or a graph database, but as shown below, the graph representation is much more intuitive even though both express the same data model.
As another example, we can compare how a search is written for SQL databases and for graph databases. The task is to find how many posts and comments were created in a given time frame and rank the results. Here is the SQL version:
--PostgreSQL
WITH RECURSIVE post_all (
    psa_threadid, psa_thread_creatorid, psa_messageid,
    psa_creationdate, psa_messagetype
) AS (
    SELECT m_messageid AS psa_threadid
         , m_creatorid AS psa_thread_creatorid
         , m_messageid AS psa_messageid
         , m_creationdate
         , 'Post'
      FROM message
     WHERE 1=1
       AND m_c_replyof IS NULL -- post, not comment
       AND m_creationdate BETWEEN :startDate AND :endDate
  UNION ALL
    SELECT psa.psa_threadid AS psa_threadid
         , psa.psa_thread_creatorid AS psa_thread_creatorid
         , m_messageid
         , m_creationdate
         , 'Comment'
      FROM message p, post_all psa
     WHERE 1=1
       AND p.m_c_replyof = psa.psa_messageid
       AND m_creationdate BETWEEN :startDate AND :endDate
)
SELECT p.p_personid AS "person.id"
     , p.p_firstname AS "person.firstName"
     , p.p_lastname AS "person.lastName"
     , count(DISTINCT psa.psa_threadid) AS threadCount
     , count(DISTINCT psa.psa_messageid) AS messageCount
  FROM person p
  LEFT JOIN post_all psa ON (
       1=1
   AND p.p_personid = psa.psa_thread_creatorid
   AND psa_creationdate BETWEEN :startDate AND :endDate
  )
 GROUP BY p.p_personid, p.p_firstname, p.p_lastname
 ORDER BY messageCount DESC, p.p_personid
 LIMIT 100;
Here is the same query written in the Cypher graph query language:
// Cypher
MATCH (person:Person)<-[:HAS_CREATOR]-(post:Post)<-[:REPLY_OF*0..]-(reply:Message)
WHERE post.creationDate >= $startDate
  AND post.creationDate <= $endDate
  AND reply.creationDate >= $startDate
  AND reply.creationDate <= $endDate
RETURN
  person.id,
  person.firstName,
  person.lastName,
  count(DISTINCT post) AS threadCount,
  count(DISTINCT reply) AS messageCount
ORDER BY
  messageCount DESC,
  person.id ASC
LIMIT 100
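To make the traversal behind the two queries more concrete, here is a minimal in-memory sketch in Python. The toy data, the dictionary layout, and the `count_threads_and_messages` name are assumptions made for illustration and are not part of either query's schema. It follows the Cypher pattern: start from each post created inside the time window, walk REPLY_OF edges zero or more hops, and count the messages whose creation date also falls inside the window.

```python
from collections import defaultdict

# Hypothetical toy data:
# message id -> (creator id, creation day, id of the message it replies to, or None for a post)
MESSAGES = {
    "p1": ("alice", 3, None),   # post
    "c1": ("bob", 4, "p1"),     # comment on p1
    "c2": ("carol", 5, "c1"),   # comment on c1
    "p2": ("bob", 9, None),     # post outside the window used below
}

def count_threads_and_messages(messages, start, end):
    """Return {post creator: (thread_count, message_count)} for posts created in [start, end]."""
    # Reverse index: message id -> ids of its direct replies (incoming REPLY_OF edges).
    replies = defaultdict(list)
    for mid, (_, _, parent) in messages.items():
        if parent is not None:
            replies[parent].append(mid)

    threads = defaultdict(set)
    counted = defaultdict(set)
    for pid, (creator, day, parent) in messages.items():
        if parent is not None or not (start <= day <= end):
            continue  # only posts created inside the window start a thread
        threads[creator].add(pid)
        stack = [pid]  # zero hops: the post itself counts as a message
        while stack:
            mid = stack.pop()
            if start <= messages[mid][1] <= end:
                counted[creator].add(mid)
            stack.extend(replies[mid])  # keep walking the reply chain
    return {c: (len(threads[c]), len(counted[c])) for c in threads}

result = count_threads_and_messages(MESSAGES, start=1, end=7)
```

In the graph version the reply chain is followed directly through adjacency, which is what the `REPLY_OF*0..` pattern expresses in one line and what the recursive common table expression has to rebuild with repeated self-joins.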
In addition, the graph ecosystem is diverse. The following is the graph technology landscape in 2020, and we can expect more graph-related technologies to come along in 2021.
Now, let's take NebulaGraph, a distributed graph database, as an example to talk about the evolution of graph technology. I will also share the challenges the team faced when developing NebulaGraph and how we solved them.
When we started drawing up the blueprint of NebulaGraph in late 2018, the team set four goals for the database: scalability, production readiness, OLTP (online transaction processing), and open source. These four goals still shape NebulaGraph’s roadmap today.
Scalability is the No. 1 design principle of NebulaGraph. We believe the data that businesses will process in the future will be massive, far more than a single machine can handle. That’s why we designed NebulaGraph to be capable of handling graph data with trillions of vertices and edges.
NebulaGraph was also designed to be production-ready from day one, which covers the design of its query language, visualization, programmability, and DevOps.
OLTP (online transactional processing) enables the real-time execution of large numbers of database transactions by large numbers of users, typically over the internet. One of the priorities of NebulaGraph’s design is OLTP. This makes NebulaGraph an online, high-concurrency, and low-latency graph database.
NebulaGraph is also devoted to building an open-source community and integrating with the big data world, supporting graph computing and training frameworks like Tencent Plato and Spark GraphX.
The Nebula Core
The above figure shows the ecosystem built around NebulaGraph. The section with the red background is the NebulaGraph Core, which consists of three parts called Meta, Graph, and Storage.
NebulaGraph’s query language is our in-house nGQL, which is also compatible with openCypher. We have also developed clients in languages including Java, C++, Python, and Go. On top of that, we have a number of SDKs that work with frameworks like Spark, Flink, GraphX, and Tencent Plato.
Let’s dive into the Nebula core, which, as I mentioned above, consists of the meta, graph, and storage services. The meta service deals with metadata, the storage service stores the data, and the graph service is in charge of querying. The three modules run as independent processes, ensuring the separation of compute and storage.
The meta service manages the schema. NebulaGraph is not a schema-free database, and it requires the properties of vertices and edges to be pre-configured. The meta service also manages storage spaces, long-duration tasks, and data cleaning.
NebulaGraph can handle data with trillions of vertices and edges, which means the system must split the data for both storage and processing. NebulaGraph partitions the graph by cutting edges and stores vertices in partitions. Each partition may have several replicas that run on different machines. The query engine is stateless: all the data a query needs is retrieved from either the meta service or the storage service, and query services do not communicate with each other.
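The partitioning idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CRC32 hash, the partition and replica counts, the round-robin placement, and the host names are all hypothetical stand-ins, since the talk only says that data is split into partitions whose replicas run on different machines.

```python
import zlib

NUM_PARTITIONS = 8          # assumed partition count for the example
REPLICAS_PER_PARTITION = 3  # assumed replica count
HOSTS = ["storage-0", "storage-1", "storage-2", "storage-3"]  # hypothetical hosts

def partition_of(vid: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a vertex ID to a partition with a deterministic hash (a stand-in, not the real function)."""
    return zlib.crc32(vid.encode("utf-8")) % num_partitions

def replica_hosts(partition: int, hosts=HOSTS, replicas=REPLICAS_PER_PARTITION):
    """Place the replicas of one partition on distinct hosts, round-robin."""
    return [hosts[(partition + i) % len(hosts)] for i in range(replicas)]

vid = "player100"
part = partition_of(vid)
placement = replica_hosts(part)
```

Because the mapping depends only on the vertex ID, any stateless query process can work out which partition to ask without coordinating with other query processes.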
The above covers NebulaGraph’s separation of compute and storage. Now let’s talk about data characteristics. As mentioned, NebulaGraph is not a schema-free database, and all stored data is pre-defined through Data Definition Language (DDL) statements. We call a type of vertex a Tag and a type of edge an EdgeType. A vertex is identified by a 2-tuple consisting of the VID and the tag. An edge is identified by a 4-tuple consisting of its two endpoints (source and destination), the EdgeType, and the rank.
NebulaGraph supports primitive data types like boolean, int, and double, composite types like list and set, and graph data types like path and subgraph. Long string data stored in the database is usually indexed by Elasticsearch.
Here is some additional information about the storage engine. To the query engine graphd, the storage engine exposes a distributed graph service as its external interface, but it can also be used as a distributed key-value (KV) service if necessary. In the storage engine, partitions use the Raft consensus protocol. NebulaGraph stores vertices and edges in separate partitions. The following figure shows how KV is implemented on top of storage partitioning.
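As a rough illustration of this data model, the following Python sketch keys vertices by a 2-tuple and edges by a 4-tuple and rejects properties that were not declared up front. The class names, the `SCHEMA` dictionary, and the sample tag and edge type are assumptions for the example, not NebulaGraph's internal structures.

```python
from typing import NamedTuple

class VertexKey(NamedTuple):
    vid: str   # vertex ID
    tag: str   # vertex type (Tag)

class EdgeKey(NamedTuple):
    src: str        # source vertex ID
    dst: str        # destination vertex ID
    edge_type: str  # EdgeType name
    rank: int       # distinguishes parallel edges of the same type

# Hypothetical schema: property names must be declared before data is written.
SCHEMA = {"player": {"name", "age"}, "follow": {"degree"}}

def check_props(type_name: str, props: dict) -> None:
    """Reject properties that the pre-defined schema does not declare."""
    unknown = set(props) - SCHEMA[type_name]
    if unknown:
        raise ValueError(f"undeclared properties for {type_name}: {sorted(unknown)}")

vertices = {}
edges = {}

def insert_vertex(vid, tag, props):
    check_props(tag, props)
    vertices[VertexKey(vid, tag)] = props

def insert_edge(src, dst, edge_type, rank, props):
    check_props(edge_type, props)
    edges[EdgeKey(src, dst, edge_type, rank)] = props

insert_vertex("player100", "player", {"name": "Tim", "age": 42})
insert_vertex("player101", "player", {"name": "Tony", "age": 36})
insert_edge("player100", "player101", "follow", 0, {"degree": 95})
# Same endpoints and EdgeType, different rank: stored as a second, distinct edge.
insert_edge("player100", "player101", "follow", 1, {"degree": 80})
```

The rank is what allows two edges of the same type between the same pair of vertices to coexist, because the two 4-tuples differ.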
In NebulaGraph, each edge is stored as two pieces of data. As mentioned above, the storage layer relies on VID and guarantees strong consistency using the Raft protocol.
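A minimal sketch of storing each edge as two pieces, assuming one piece is kept with the source vertex and the other with the destination vertex. The key layout, the CRC32 hash, and the function names are hypothetical, and the plain dictionaries stand in for Raft-replicated partitions.

```python
import zlib

NUM_PARTITIONS = 4  # assumed for the example

def partition_of(vid: str) -> int:
    """Deterministic stand-in hash from vertex ID to partition."""
    return zlib.crc32(vid.encode("utf-8")) % NUM_PARTITIONS

# One key-value map per partition (stand-in for a replicated storage partition).
partitions = [dict() for _ in range(NUM_PARTITIONS)]

def put_edge(src, dst, edge_type, rank, props):
    """Write the edge twice: once next to its source, once next to its destination."""
    out_key = ("out", src, edge_type, rank, dst)
    in_key = ("in", dst, edge_type, rank, src)
    partitions[partition_of(src)][out_key] = props
    partitions[partition_of(dst)][in_key] = props

def neighbors(vid, direction):
    """Scan only the partition that owns `vid`; no other partition is touched."""
    part = partitions[partition_of(vid)]
    return sorted(k[4] for k in part if k[0] == direction and k[1] == vid)

put_edge("player100", "player101", "follow", 0, {"degree": 95})
put_edge("player102", "player101", "follow", 0, {"degree": 90})
```

With both pieces in place, `neighbors("player101", "in")` can answer "who follows player101" from a single partition, at the cost of writing every edge twice.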
NebulaGraph uses Elasticsearch for full-text indexing. Starting with NebulaGraph v2.x, our R&D team optimized the write performance of NebulaGraph's indexing. Since v2.5.0, NebulaGraph has supported combining data expiration (TTL) with indexing. And since v2.6.0, NebulaGraph has supported TOSS (transaction on storage side) to achieve eventual consistency of edges. That is to say, when an edge is inserted or modified, its two stored pieces are either both written successfully or both fail.
NebulaGraph uses its in-house nGQL as its query language. In June 2021, the International Organization for Standardization drafted the standard for the syntax and semantics of GQL, and there is consensus among major graph database vendors.
Since NebulaGraph v2.0, nGQL has been compatible with openCypher, an open-source version of Neo4j’s Cypher query language. nGQL now supports the Data Query Language (DQL) part of openCypher and has developed its own native nGQL syntax style for Data Manipulation Language (DML) and Data Definition Language (DDL).
We also mentioned that NebulaGraph was built to be production-ready. It therefore supports a wide range of operational features like data isolation, user permission and authentication, and replica configuration. NebulaGraph also supports clustering, and Nebula Operator, released in April 2021, brought support for Kubernetes.
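The all-or-nothing guarantee for the two pieces of an edge can be illustrated with a small Python sketch. It is a deliberate simplification: undoing the first write when the second one fails is an assumed stand-in to show the guarantee, not a description of how TOSS is implemented, and the `Partition` class and key layout are hypothetical.

```python
class WriteFailed(Exception):
    pass

class Partition:
    """Toy key-value partition that can be told to reject writes."""
    def __init__(self):
        self.data = {}
        self.healthy = True

    def put(self, key, value):
        if not self.healthy:
            raise WriteFailed(key)
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)

def put_edge_atomically(src_part, dst_part, src, dst, edge_type, rank, props):
    """Write both pieces of an edge; leave neither behind if either write fails."""
    out_key = ("out", src, edge_type, rank, dst)
    in_key = ("in", dst, edge_type, rank, src)
    try:
        src_part.put(out_key, props)
    except WriteFailed:
        return False  # nothing was written
    try:
        dst_part.put(in_key, props)
    except WriteFailed:
        src_part.delete(out_key)  # compensate so no half-written edge remains
        return False
    return True

a, b = Partition(), Partition()
ok = put_edge_atomically(a, b, "player100", "player101", "follow", 0, {"degree": 95})
b.healthy = False  # simulate a failure on the destination side
failed = put_edge_atomically(a, b, "player100", "player102", "follow", 0, {"degree": 70})
```

After the simulated failure, neither partition holds any piece of the second edge, so a traversal in either direction sees the same set of edges.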
As for the performance of NebulaGraph, most performance tests are carried out by users in the community, such as engineers from tech companies like Meituan, WeChat, 360 DigiTech, and WeBank. The figures below are performance reports compiled by users.
We have mentioned that one of the priorities of NebulaGraph’s design is OLTP. But that doesn’t mean it neglects analytical processing (AP). NebulaGraph has integrated AP frameworks like Apache Spark’s GraphX and Plato developed by Tencent.
Generally speaking, NebulaGraph’s performance gains are most significant for deep traversals.
NebulaGraph meets users' analytical processing (AP) needs by connecting to Spark's GraphX, and it also supports Plato, the graph computing engine developed by Tencent's WeChat team. Connecting to Plato is essentially a data hand-off between the two engines: NebulaGraph's internal data format is converted into Plato's, and the partitions of the two systems are mapped one to one.
The NebulaGraph Community
NebulaGraph became open source in May 2019. Its v1.0 GA was released in June 2020, even though some companies had applied NebulaGraph in production before that. NebulaGraph first entered DB-Engines’ graph database management system ranking two years ago and now it ranks 15th on the list.
NOTE: The screenshot shows the DB-Engines ranking in Apr. 2021, when this talk was given.
NebulaGraph is also one of the top open source players in China. The following report, published by the X-Lab of East China Normal University, ranks companies according to the community popularity of their open source products. Vesoft Inc., the maker of NebulaGraph, ranks eighth, ahead of TikTok parent ByteDance and just one position behind Huawei.
Here are some of my thoughts about open source graph databases. Open source is very common in the graph database industry because it is a relatively new area that only gained traction in recent years. This is why NebulaGraph chose to be open source from day one. Open source software can also attract more developers to use it and bring in valuable feedback from adopters.
If you encounter any problems while using NebulaGraph, please refer to the NebulaGraph Database Manual for troubleshooting. It documents in detail the concepts behind graph databases and the specific usage of NebulaGraph.
Join our Slack channel if you want to discuss with the rest of the NebulaGraph community!