Development Tools
Jun 2, 2022
Importing data into NebulaGraph using Nebula Importer
Reid
NebulaGraph is now a mature product with many ecosystem tools. It offers a wide range of options in terms of data import. There is the large and comprehensive Nebula Exchange, the small and compact Nebula Importer, and the Nebula Spark Connector and Nebula Flink Connector for Spark and Flink integrations.
But which of these import methods is the most convenient for a given job?
Here are my takes:
Nebula Exchange
If you need to import streaming data from Kafka and Pulsar into the NebulaGraph database
If you need to read batch data from relational databases (e.g. MySQL) or distributed file systems (e.g. HDFS)
If you need to generate SST files recognized by NebulaGraph from large batches of data
Nebula Importer
Nebula Importer is best for importing local CSV files into NebulaGraph
Nebula Spark Connector
Migrate data between different NebulaGraph clusters
Migrate data between different graph spaces within the same NebulaGraph cluster
Migrate data between NebulaGraph and other data sources
Run graph computation in combination with Nebula Algorithm
For more options on importing data from Spark, read: 4 different ways to work with NebulaGraph in Apache Spark
Nebula Flink Connector
Migrate data between different NebulaGraph clusters
Migrate data between different graph spaces within the same NebulaGraph cluster
Migrate data between NebulaGraph and other data sources
Overall, Nebula Exchange is large and comprehensive and can be combined with most storage engines to import data into NebulaGraph, but it requires a deployed Spark environment.
Nebula Importer is simple to use and has fewer dependencies. However, you need to generate the data files yourself in advance and configure the schema up front, and it does not support resuming an interrupted transfer. It is suitable for medium data volumes.
The Spark and Flink Connectors are meant to be used together with streaming or batch data jobs.
Choose different tools for different scenarios. For newcomers to NebulaGraph, Nebula Importer is the recommended choice because it is easy to use and quick to get started with.
Using Nebula Importer
When we first came across NebulaGraph, the ecosystem was still immature and only some of our businesses had migrated to it. At that time we imported all data into NebulaGraph, whether full or incremental, by pushing Hive tables to Kafka and consuming Kafka to write to NebulaGraph in batches. Later, as more and more data and businesses moved to NebulaGraph, import efficiency became a serious problem. Import times grew so long that full imports were still running during peak business hours, which was unacceptable.
To solve this, we tried Nebula Spark Connector and Nebula Importer. For ease of maintenance and migration, we decided to import the full data set through the pipeline Hive table → CSV → Nebula Server → Nebula Importer, and the overall time spent was significantly reduced.
Configuring Nebula Importer
System environment
Cluster Environment
Data Size
Nebula Importer configuration
We set up a crontab: Hive generates the tables, which are transferred to the NebulaGraph server, and the Nebula Importer tasks run at night when traffic is low:
In total, the full import took 2 hours and completed at 6 a.m.
Some of the logs are as follows. The import speed peaks at about 200,000/s.
Then, at 7:00, Kafka is re-consumed to import the incremental data from the early morning of that day up to 7:00, selected by timestamp. This prevents the full T+1 data from overwriting the incremental data of the current day.
The incremental consumption takes about 10 to 15 minutes.
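The timestamp-based replay described above can be sketched as a simple window filter. This is a minimal illustration, not the author's actual consumer: the record layout (a dict with `id` and `ts` fields), the function names, and the assumption that "early morning" means midnight of the run day are all hypothetical.

```python
from datetime import datetime, time


def in_replay_window(event_ts: datetime, run_at: datetime) -> bool:
    """Return True if the event falls between midnight of the run day and the run time."""
    # Assumption: the replay window starts at 00:00 on the day of the run.
    window_start = datetime.combine(run_at.date(), time.min)
    return window_start <= event_ts <= run_at


def select_for_replay(records, run_at):
    """Keep only the records whose timestamp lies inside the replay window."""
    return [r for r in records if in_replay_window(r["ts"], run_at)]


# Hypothetical records standing in for messages re-read from Kafka.
run_at = datetime(2022, 6, 2, 7, 0)
records = [
    {"id": "a", "ts": datetime(2022, 6, 1, 23, 59)},  # previous day: covered by the full import
    {"id": "b", "ts": datetime(2022, 6, 2, 0, 30)},   # inside the window
    {"id": "c", "ts": datetime(2022, 6, 2, 6, 59)},   # inside the window
    {"id": "d", "ts": datetime(2022, 6, 2, 7, 5)},    # after the run time: left to the real-time path
]
replay_ids = [r["id"] for r in select_for_replay(records, run_at)]
```

Only the records written after the full snapshot was cut, and before the replay started, are written again, so the older snapshot values do not win over newer updates.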
Real-time
The incremental data obtained from the MD5 comparison is pushed to Kafka, and the Kafka data is consumed in real time to ensure that the data delay is no more than 1 minute.
In addition, there may be unanticipated data issues that go undetected by the real-time path for a long time, so the full data set is re-imported every 30 days using the Nebula Importer process described above. We then set a TTL of 35 days on the vertices and edges of the space, so that any data not updated in time is filtered out and subsequently recycled.
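The article does not show how the MD5 comparison is done, so the following is only a sketch of one common approach: hash each row, compare the digest with the one stored from the previous snapshot, and forward only the rows whose digest differs. The row layout, key names, and function names are hypothetical.

```python
import hashlib


def row_digest(row: dict) -> str:
    """MD5 over a canonical rendering of the row (columns sorted by name)."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def changed_rows(previous_digests: dict, current_rows: dict) -> dict:
    """Return the rows that are new or whose digest differs from the previous snapshot."""
    changed = {}
    for key, row in current_rows.items():
        if previous_digests.get(key) != row_digest(row):
            changed[key] = row
    return changed


# Hypothetical snapshots keyed by a primary key.
yesterday = {
    "c1": {"name": "Acme", "status": "active"},
    "c2": {"name": "Globex", "status": "active"},
}
today = {
    "c1": {"name": "Acme", "status": "active"},      # unchanged
    "c2": {"name": "Globex", "status": "dissolved"},  # modified
    "c3": {"name": "Initech", "status": "active"},    # new
}
previous = {key: row_digest(row) for key, row in yesterday.items()}
delta = changed_rows(previous, today)
delta_keys = sorted(delta)
```

In this sketch only `c2` and `c3` would be sent on to Kafka. Rows deleted upstream are not detected this way, which is one reason a periodic full import plus TTL is still useful.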
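As a rough illustration of the TTL setting: NebulaGraph expresses `TTL_DURATION` in seconds, so 35 days has to be converted first. The helper below does the arithmetic and renders an `ALTER TAG` statement as a string. The tag name `company` and the column `update_time` are hypothetical; the article does not name the schema, and the TTL column must be an integer or timestamp property.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ttl_seconds(days: int) -> int:
    """Convert a TTL given in days to the seconds value expected by TTL_DURATION."""
    return days * SECONDS_PER_DAY


def alter_tag_ttl(tag: str, ttl_col: str, days: int) -> str:
    """Render an nGQL statement that sets a TTL on a tag (names are placeholders)."""
    return f'ALTER TAG {tag} TTL_DURATION = {ttl_seconds(days)}, TTL_COL = "{ttl_col}";'


# Hypothetical tag and column; substitute the real schema names.
statement = alter_tag_ttl("company", "update_time", 35)
print(statement)
```

With a 30-day full import cycle, a 35-day TTL leaves a 5-day margin: a vertex or edge refreshed by every full import never expires, while one that stops being refreshed ages out shortly after it misses a cycle. Edge types would need the equivalent `ALTER EDGE` statement.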
About the author
Reid is an engineer at Qichacha, China’s biggest corporate information platform.
