Using Ansible to Automate Deployment of NebulaGraph Cluster

Background

To test NebulaGraph, we frequently need to deploy it on servers, so we need a tool that makes deployment more efficient. The tool needs to meet these major requirements:

It allows a non-root account to deploy NebulaGraph, so that we can use cgroups to limit resource usage.
It can modify the configuration files on the control machine and then distribute them to the cluster, so that different parameters can be tested.
It can be called from scripts, so that it can later be integrated into a testing platform or tool.

There are several tools to choose from. Fabric and Puppet are veterans, while Ansible and SaltStack are rising stars.

The Ansible project on GitHub has earned more than 40 thousand stars. It was acquired by Red Hat in 2015, and its community is very active. A lot of open-source projects adopt Ansible for deployment, for example, kubespray for Kubernetes and tidb-ansible for TiDB. Therefore, we decided to use Ansible to deploy NebulaGraph.

Introduction to Ansible

Features

Ansible is an open-source project, but Ansible Tower, an automated deployment tool, is commercial. Ansible has these features:

SSH protocol by default. Compared with SaltStack, Ansible manages machines in an agentless manner.
Flexibility. Playbook, role, and module are used to define the deployment steps.
Idempotent behavior.
Supporting modular development with a variety of modules.

Some features have obvious advantages and disadvantages:

Using the SSH protocol: It enables Ansible to deploy a cluster on most machines by default with only password authentication. The disadvantage is that performance is compromised.
Using playbooks to define deployment steps and Jinja2 as the template engine: These are easy for those who are familiar with the technologies, but newcomers face a steep learning curve.

In summary, Ansible is suitable for agentless batch deployment on a small number of machines, which is exactly our situation.

How to Deploy

Usually, to deploy a cluster offline, the machines are assigned three roles:

Control node: Ansible is installed on the control node and manages machines over the SSH protocol.

Resource machine: The tasks that need Internet access such as downloading RPM packages are executed on the resource machine.

Managed node: The managed nodes run the services. They can be in an isolated network. The services are deployed by Ansible.

[Figure: Ansible workflow]

How to Execute Tasks

In Ansible, there are three levels of tasks:

Module
Role
Playbook

Modules are the basic units of Ansible tasks. There are two types of modules: core modules and custom modules.

A task is executed as follows:

On the control node, Ansible generates a new Python file based on the module code and the passed parameters.
The control node copies the Python file to the tmp folder on the managed nodes over SSH.
The control node executes the Python file over SSH.
The managed nodes return the execution result to the control node, and then the tmp folder is removed from the managed nodes.

[Figure: Ansible workflow]

# Do not delete the tmp folder.
export ANSIBLE_KEEP_REMOTE_FILES=1
# Run -vvv to view the debug information.
ansible -m ping all -vvv
<192.168.8.147> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=30m -o ConnectionAttempts=100 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="nebula"' -o ConnectTimeout=10 -o ControlPath=/home/vesoft/.ansible/cp/d94660cf0d -tt 192.168.8.147 '/bin/sh -c '"'"'/usr/bin/python /home/nebula/.ansible/tmp/ansible-tmp-1618982672.9659252-5332-61577192045877/AnsiballZ_ping.py && sleep 0'"'"''

The debug information is returned as above. The file named AnsiballZ_ping.py is the Python file generated from the module. We can run this file on the managed node to check it.

python3 AnsiballZ_ping.py
#{"ping": "pong", "invocation": {"module_args": {"data": "pong"}}}

The result is the standard output of the Python file. Ansible then processes this output further.
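The execution steps described above can be imitated on a single machine. The following is only a minimal sketch, not how Ansible itself is implemented: it skips SSH entirely, and MODULE_TEMPLATE, run_module_locally, and the file name module_ping.py are hypothetical names chosen for illustration.

```python
import json
import os
import shutil
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a module's source code: it reads its arguments
# from an embedded JSON string and prints a JSON result to standard output.
MODULE_TEMPLATE = '''\
import json
args = json.loads({args!r})
print(json.dumps({{"ping": args.get("data", "pong")}}))
'''


def run_module_locally(module_args):
    """Imitate the four steps on one machine, with no SSH involved."""
    # Step 1: generate a Python file from the module code and the parameters.
    source = MODULE_TEMPLATE.format(args=json.dumps(module_args))

    # Step 2: "copy" the file into a temporary folder
    # (in real Ansible, a tmp folder on the managed node).
    tmp_dir = tempfile.mkdtemp(prefix="ansible-tmp-")
    script_path = os.path.join(tmp_dir, "module_ping.py")
    try:
        with open(script_path, "w") as f:
            f.write(source)

        # Step 3: execute the file (in real Ansible, over SSH).
        completed = subprocess.run(
            [sys.executable, script_path],
            capture_output=True, text=True, check=True,
        )

        # Step 4: the result comes back as JSON on standard output.
        return json.loads(completed.stdout), tmp_dir
    finally:
        # The temporary folder is removed once the result has been read.
        shutil.rmtree(tmp_dir, ignore_errors=True)


result, used_dir = run_module_locally({"data": "pong"})
print(result)
```

The key point of the sketch is that the module communicates only through its standard output, which is why the generated file can be run by hand for debugging.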

A role contains a series of tasks that call modules. It can use register to pass the result of one task to the tasks that follow.

Here is a typical example:

Creating a directory.
Continuing the installation if the directory is created successfully; otherwise, canceling the deployment.

A playbook bridges the managed nodes and the roles.

Ansible can use inventories to group remote machines (managed nodes) and use different roles to perform deployment on the machines of different groups. This makes installation and deployment tasks flexible.

After a playbook is defined, if we need to adapt the deployment to different environments, only changes to the machine configuration in the inventory are necessary.
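Because only the inventory changes between environments, a script that calls Ansible can keep the playbook fixed and select the inventory by name. The following is a minimal sketch under stated assumptions: the inventory paths, the playbook name install.yml, and the helper build_playbook_command are hypothetical, while -i and -e are the standard ansible-playbook options for the inventory and extra variables.

```python
import shlex

# Hypothetical mapping from environment name to inventory file.
INVENTORIES = {
    "dev": "inventories/dev/hosts",
    "test": "inventories/test/hosts",
}


def build_playbook_command(playbook, environment, extra_vars=None):
    """Build an ansible-playbook command line for the given environment."""
    command = ["ansible-playbook", "-i", INVENTORIES[environment], playbook]
    # Each extra variable becomes one "-e key=value" pair.
    for key, value in sorted((extra_vars or {}).items()):
        command += ["-e", f"{key}={value}"]
    return command


command = build_playbook_command(
    "install.yml", "test", {"deploy_dir": "/home/nebula/nebula"}
)
print(shlex.join(command))
```

A testing tool could pass the returned list to subprocess.run(command, check=True) to start the deployment.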

Customizing Modules

Customizing filter

Ansible uses Jinja2 as the template engine, so Jinja2 filters can be used.

# Uses the default filter. By default, 5 is output.
ansible -m debug -a 'msg={{ hello | default(5) }}' all

Sometimes, we need to customize a filter to operate on variables. For NebulaGraph, a typical scenario is configuring the meta_server_addrs parameter with the addresses of the nebula-metad servers. For a cluster where only one nebula-metad process is deployed, the value of the parameter is in the metad:9559 format. For a cluster where three nebula-metad processes are deployed, the value of the parameter is in the metad1:9559,metad2:9559,metad3:9559 format.

In an Ansible playbook project, create a new directory named filter_plugins and then a Python file named map_format.py in it. Copy the following content to the .py file.

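# -*- encoding: utf-8 -*-
try:
    # soft_str is available in MarkupSafe 2.0 and later.
    from markupsafe import soft_str
except ImportError:
    # Fallback for older environments where Jinja2 still ships soft_unicode.
    from jinja2.utils import soft_unicode as soft_str


def map_format(value, pattern):
    """
    Apply a %-style pattern to a single value, e.g.
    "{{ groups['metad']|map('map_format', '%s:9559')|join(',') }}"
    """
    return soft_str(pattern) % (value)


class FilterModule(object):
    """ jinja2 filters """
    def filters(self):
        return {
            'map_format': map_format,
        }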

{{ groups['metad'] | map('map_format', '%s:9559') | join(',') }} is the value that we need.
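To see what this expression evaluates to without running Ansible, the same mapping and joining can be written in plain Python. This is only an illustrative sketch: meta_server_addrs is a hypothetical helper name, and the host names are the placeholder names used in the text above.

```python
def map_format(value, pattern):
    """Same idea as the custom filter: apply a %-style pattern to one value."""
    return str(pattern) % (value,)


def meta_server_addrs(metad_hosts, port=9559):
    """Plain-Python equivalent of map('map_format', '%s:9559') | join(',')."""
    pattern = "%s:" + str(port)
    return ",".join(map_format(host, pattern) for host in metad_hosts)


print(meta_server_addrs(["metad"]))
print(meta_server_addrs(["metad1", "metad2", "metad3"]))
```

The first call prints metad:9559 and the second prints metad1:9559,metad2:9559,metad3:9559, which matches the two formats described above.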

Customizing module

A custom module needs to comply with the Ansible framework, including how it obtains parameters, returns standard results, and returns errors. After customization, we need to set the ANSIBLE_LIBRARY environment variable or the library option in ansible.cfg so that Ansible can find the custom module. For more information, see https://ansible-docs.readthedocs.io/zh/stable-2.0/rst/developing_modules.html.

Practice of Ansible on NebulaGraph

Starting NebulaGraph is not complicated, so it is simple to use Ansible to deploy a NebulaGraph cluster. The deployment consists of these steps:

Downloading the RPM package.
Copying the RPM package to managed nodes. Unzipping the package and moving it to the target folder.
Updating the configuration files.
Running shell scripts to start the cluster.

Using Common Role

NebulaGraph has three process types: nebula-graphd, nebula-metad, and nebula-storaged. They can be named and started in the same manner, so the common role can be applied to these processes.

With this mechanism, maintenance becomes easier and the services can be managed at a finer granularity. For example, if we want to deploy the nebula-storaged process on machines A, B, and C and the nebula-graphd process on machine C only, the nebula-graphd configuration file is distributed to machine C, but not to machines A and B.

# install/tasks/main.yml, a common role. Variables are used.
- name: config {{ module }}.conf
  template:
    src: "{{ playbook_dir }}/templates/{{ module }}.conf.j2"
    dest: "{{ deploy_dir }}/etc/{{ module }}.conf"

# nebula-graphd/tasks/main.yml, the graphd role. A value is assigned to the variable.
- name: install graphd
  include_role:
    name: install
  vars:
    module: nebula-graphd
In the playbook, the machine group for the nebula-graphd process runs the graphd role. For example, no nebula-graphd process is deployed on machines A and B, so its configuration file is not distributed to them.
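The effect of binding roles to machine groups can be illustrated with a small Python sketch. It is not part of the deployment itself: GROUPS mirrors the machines A, B, and C from the example above, and configs_per_machine is a hypothetical helper that lists which configuration files each machine would receive.

```python
# Hypothetical inventory groups, matching the example in the text.
GROUPS = {
    "storaged": ["A", "B", "C"],
    "graphd": ["C"],
}


def configs_per_machine(groups):
    """Return the configuration files that each machine receives."""
    result = {}
    for group, machines in groups.items():
        for machine in machines:
            # Each group's role distributes only its own configuration file.
            result.setdefault(machine, []).append(f"nebula-{group}.conf")
    return result


for machine, configs in sorted(configs_per_machine(GROUPS).items()):
    print(machine, configs)
```

Only machine C is listed with nebula-graphd.conf, which is why a command that assumes every configuration file exists on every machine cannot be used on such a cluster.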

When a NebulaGraph cluster is deployed in this way, you cannot run nebula.service start all to start it, because some machines do not have the nebula-graphd.conf file. Instead, we can define machine groups in a playbook and pass parameters to them in a similar way.

# playbook start.yml
- hosts: metad
  roles:
    - op
  vars:
    module: metad
    op: start
- hosts: storaged
  roles:
    - op
  vars:
    module: storaged
    op: start
- hosts: graphd
  roles:
    - op
  vars:
    module: graphd
    op: start

This works like running the startup script over SSH several times. It is less efficient than start all, but the services can be started and stopped more flexibly.

[Figure: Ansible demo of using the common role]

Using vars_prompt to End Playbook

If we want to delete the binary files but keep the data directory, we can add vars_prompt to the remove playbook to ask for confirmation. If the task is confirmed, the binary files are deleted. Otherwise, the playbook ends without deleting anything.

# playbook remove.yml
- hosts: all
  vars_prompt:
    - name: confirmed
      prompt: "Are you sure you want to remove the Nebula-Graph? Will delete binary only  (yes/no)"

  roles:
    - remove

The role then checks the value entered for the confirmation.

# remove/tasks/main.yml
---
- name: Information
  debug:
    msg: "Must input 'yes', abort the playbook "
  when:
    - confirmed != 'yes'
- meta: end_play
  when:
    - confirmed != 'yes'

The playbook runs as shown in the following figure. In the confirmation step, if anything other than 'yes' is entered, the playbook is aborted. If 'yes' is entered, only the binary files are deleted, not the data of the NebulaGraph cluster.

[Figure: final result]