Elasticsearch with gluster-block

This is the updated version of my previous blog about using Gluster Block Storage with Elasticsearch.

In this blog, I will introduce the gluster-block utility and demonstrate how simple it is to use gluster block storage with the Elasticsearch engine.

Introduction to gluster block

gluster-block is a block device management framework which aims at making gluster-backed block storage creation and maintenance as simple as possible. gluster-block provisions block devices and exports them over iSCSI. More details follow as you run through the blog.

Read More about gluster-block here

Note:  I have used 4 Fedora 25 Machines for creating this howto.
Setup at a glance:

  1. We use Node1, Node2 and Node3 to create the gluster volume, and use the same nodes to export the block storage from the gluster volume as iSCSI targets.
  2. On Node 4, we enable and configure multipath, then discover and log in to the individual target portals exported from Node1, Node2 and Node3. Finally, we configure and run Elasticsearch.

Gluster block storage setup

Pre-requisites

  • A gluster volume in a trusted storage pool (3 nodes; we also use the same nodes to export the block device)
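
If you still need to create such a volume, a minimal sketch looks like the following, run from the first node (the brick paths here are placeholders; adjust them to your layout):
# gluster peer probe 10.70.35.104
# gluster peer probe 10.70.35.51
# gluster vol create sampleVol replica 3 10.70.35.109:/bricks/brick1 10.70.35.104:/bricks/brick1 10.70.35.51:/bricks/brick1 force
# gluster vol start sampleVol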

Creating a block device

Install gluster-block
# dnf config-manager --add-repo https://copr.fedorainfracloud.org/coprs/pkalever/gluster-block/repo/fedora-25/pkalever-gluster-block-fedora-25.repo 
# dnf install gluster-block

# systemctl start gluster-blockd
# systemctl status gluster-blockd

Create a block device of size 40GiB (using the same nodes as the gluster volume)
# gluster-block create sampleVol/elasticBlock ha 3 10.70.35.109,10.70.35.104,10.70.35.51 40GiB
IQN: iqn.2016-12.org.gluster-block:c1029cc3-7c40-48a0-94bf-16c1b4fad254
PORTAL(S): 10.70.35.109:3260 10.70.35.104:3260 10.70.35.51:3260
RESULT: SUCCESS

# gluster-block list sampleVol
elasticBlock 

# gluster-block info sampleVol/elasticBlock 
NAME: elasticBlock
VOLUME: sampleVol
GBID: c1029cc3-7c40-48a0-94bf-16c1b4fad254
SIZE: 42949672960
HA: 3
BLOCK CONFIG NODE(S): 10.70.35.109 10.70.35.51 10.70.35.104
# gluster-block help
gluster-block (0.1)
usage:
 gluster-block <command> <volname[/blockname]> [<args>]

commands:
 create <volname/blockname> [ha <count>] <host1[,host2,...]> <size>
 create block device.

 list <volname>
 list available block devices.

 info <volname/blockname>
 details about block device.

 delete <volname/blockname>
 delete block device.

 help
 show this message and exit.

 version
 show version info and exit.

Initiator side setup  (on Elasticsearch node) (NODE 4)

# dnf install iscsi-initiator-utils

Multipathing to achieve high availability
# mpathconf 
multipath is enabled
find_multipaths is enabled
user_friendly_names is enabled
dm_multipath module is not loaded
multipathd is not running

# modprobe dm_multipath
# lsmod | grep dm_multipath
dm_multipath 24576 0

# mpathconf --enable

# mpathconf 
multipath is enabled
find_multipaths is enabled
user_friendly_names is enabled
dm_multipath module is loaded
multipathd is running

# cat >> /etc/multipath.conf
# LIO iSCSI
devices {
        device {
                vendor "LIO-ORG"
                user_friendly_names "yes" # names like mpatha
                path_grouping_policy "failover" # one path per group
                path_selector "round-robin 0"
                path_checker "tur"
                prio "const"
                rr_weight "uniform"
        }
}
^C

# systemctl restart multipathd

Check existing block devices
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 40G 0 disk 
├─vda2 252:2 0 39G 0 part 
│ ├─fedora-swap 253:1 0 4G 0 lvm [SWAP]
│ └─fedora-root 253:0 0 15G 0 lvm /
└─vda1 252:1 0 1G 0 part /boot

Discovery and login to target
# iscsiadm --mode discovery --type sendtargets --portal 10.70.35.109 -l
# iscsiadm --mode discovery --type sendtargets --portal 10.70.35.104 -l
# iscsiadm --mode discovery --type sendtargets --portal 10.70.35.51 -l
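
Optionally confirm that all three sessions are established:
# iscsiadm --mode session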

# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 40G 0 disk 
└─mpatha 253:2 0 40G 0 mpath 
sdc 8:32 0 40G 0 disk 
└─mpatha 253:2 0 40G 0 mpath 
sda 8:0 0 40G 0 disk 
└─mpatha 253:2 0 40G 0 mpath 
[...]
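
To verify that all three paths were grouped under one multipath device, you can optionally check the map (one path group per portal is expected with the failover policy above):
# multipath -ll mpatha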

# mkfs.xfs /dev/mapper/mpatha 
meta-data=/dev/mapper/mpatha isize=512  agcount=4, agsize=2621440 blks
         =                   sectsz=512 attr=2, projid32bit=1
         =                   crc=1      finobt=1, sparse=0
data     =                   bsize=4096 blocks=10485760, imaxpct=25
         =                   sunit=0    swidth=0 blks
naming   =version 2          bsize=4096 ascii-ci=0 ftype=1
log      =internal log       bsize=4096 blocks=5120, version=2
         = sectsz=512        sunit=0    blks, lazy-count=1
realtime =none               extsz=4096 blocks=0, rtextents=0

# mount /dev/mapper/mpatha /mnt/

# df -Th
Filesystem Type Size Used Avail Use% Mounted on
[...]
/dev/mapper/mpatha xfs 40G 33M 40G 1% /mnt
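
To make the mount survive a reboot, a minimal /etc/fstab entry could look like the one below; the _netdev option delays the mount until the network (and hence iSCSI) is up. Adjust the mount point to your setup.
# cat >> /etc/fstab
/dev/mapper/mpatha /mnt xfs _netdev 0 0
^C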

Elasticsearch configuration (Node 4)

Get the latest release
# dnf install https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.0.2.rpm

# dnf install jq

# /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu

Configure Elasticsearch to use gluster block mount directory for storage
Uncomment and edit the below parameters as per your choice
# vi /etc/elasticsearch/elasticsearch.yml
cluster.name: gluster-block
node.name: blocktest-node
path.data: /mnt/data
path.logs: /mnt/logs

# mkdir /mnt/data /mnt/logs
# chown -R elasticsearch:elasticsearch /mnt/

# systemctl start elasticsearch.service 

Check the status
# systemctl status elasticsearch.service 
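
Once the service is up, a quick way to confirm Elasticsearch is reachable:
# curl http://localhost:9200/?pretty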


List the Indices
# curl -XGET http://localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size

Now let’s create an index named "bank"
# curl -XPUT http://localhost:9200/bank?pretty 
{
 "acknowledged" : true,
 "shards_acknowledged" : true
}

Note that the docs.count is 0.

# curl -XGET http://localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open bank hM-25KP6RvWrPJa2oG1o2g 5 1 0 0 650b 650b

Let’s now put something into our bank index.
In order to index a document, we must tell Elasticsearch which type in the index it should go to.
Let’s index a simple document into the bank index, "account" type, with an ID of 1 as follows:
# curl -XPUT http://localhost:9200/bank/account/1?pretty -d '
> {
> "account_number": "999120999",
> "name": "pkalever"
> }'
{
 "_index" : "bank",
 "_type" : "account",
 "_id" : "1",
 "_version" : 1,
 "result" : "created",
 "_shards" : {
 "total" : 2,
 "successful" : 1,
 "failed" : 0
 },
 "created" : true
}

By looking at the response we can say that a new bank document was successfully created.
# curl -XGET http://localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open bank hM-25KP6RvWrPJa2oG1o2g 5 1 1 0 4.3kb 4.3kb

Query a document
# curl -XGET http://localhost:9200/bank/account/1?pretty
{
 "_index" : "bank",
 "_type" : "account",
 "_id" : "1",
 "_version" : 1,
 "found" : true,
 "_source" : {
 "account_number" : "999120999",
 "name" : "pkalever"
 }
}
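
Beyond fetching by ID, you can also search the index; a quick example query (not part of the original run):
# curl -XGET 'http://localhost:9200/bank/_search?q=name:pkalever&pretty'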

Delete the entry
# curl -XDELETE http://localhost:9200/bank/account/1?pretty
{
 "found" : true,
 "_index" : "bank",
 "_type" : "account",
 "_id" : "1",
 "_version" : 2,
 "result" : "deleted",
 "_shards" : {
 "total" : 2,
 "successful" : 1,
 "failed" : 0
 }
}

If we study the above commands carefully, we can actually see a pattern of how we access data in Elasticsearch.
That pattern can be summarized as follows:
 <REST Verb> /<Index>/<Type>/<ID>

Also read about how to load Wikipedia's Search Index (covered in my previous post, appended below)

Conclusion

This blog showcases how block storage has been made simple with the gluster-block utility. More details will follow in further posts.

References

https://www.elastic.co/blog/loading-wikipedia

https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/_create_an_index.html

Previous posts on gluster block storage

https://pkalever.wordpress.com/2016/11/18/elasticsearch-with-gluster-block-storage/

https://pkalever.wordpress.com/2016/06/23/gluster-solution-for-non-shared-persistent-storage-in-docker-container/

https://pkalever.wordpress.com/2016/06/29/non-shared-persistent-gluster-storage-with-kubernetes/

https://pkalever.wordpress.com/2016/08/16/read-write-once-persistent-storage-for-openshift-origin-using-gluster/

https://pkalever.wordpress.com/2016/11/04/gluster-as-block-storage-with-qemu-tcmu/


Elasticsearch with Gluster Block Storage

In this blog we shall see

  1. Gluster block storage setup
  2. Elasticsearch Configuration with single node
  3. Testing
  4. Conclusion
  5. References

Before we begin,

  • In this post, I will try not to talk much about gluster block storage as that is not our main focus; one can look at my previous posts for more details on block storage terminology and architecture.
  • This post does not explain everything about Elasticsearch; it is just a POC that helps in setting up gluster block storage as the backend persistent storage for the Elasticsearch engine, and
  • Finally, be aware that gluster block storage is fresh and new and still in POC state.

All we need to perform this POC is 2 nodes with Fedora 24 installed, each having ~50G of disk space.

Setup at a glance:

On Node1:
1. Install and run gluster and create a volume
2. Mount the volume created in step 1 and create a file of size 40G in the volume
3. Install and run tcmu-runner, create and export LUN using targetcli user:glfs handler
On Node2:
1. Discover and login to the target device exported in Node1
2. Notice the block device (/dev/sda), format it with xfs and mount it
3. Install and configure Elasticsearch to use the mount point created in step 2 as the data path, and run it.
4. Play with the Elasticsearch engine by creating indices and querying.

Let's begin…

Gluster block storage setup

Installing glusterfs-server and configuring volume

Installing glusterfs 
# dnf install glusterfs-server
got glusterfs-server-3.8.5-1.fc24.x86_64.rpm

Run
# systemctl start glusterd
# systemctl status glusterd

Create a gluster volume
# gluster vol create block 10.70.42.151:/root/brick force
volume create: block: success: please start the volume to access data

Start the volume
# gluster vol start block
volume start: block: success

Mount the gluster volume
# mount.glusterfs localhost:/block /mnt/

Create a big file that will play the role of the target device
# fallocate -l 40G /mnt/elastic-media.img

# ls -l /mnt/
total 41943040
-rw-r--r--. 1 root root 42949672960 Nov 17 12:56 elastic-media.img

# df -Th
[...]
localhost:/block fuse.glusterfs 50G 41G 10G 81% /mnt

Tcmu-runner target emulation setup

Install tcmu-runner
# dnf install tcmu-runner

Run
# systemctl start tcmu-runner
# systemctl status tcmu-runner

Choose some iSCSI Qualified Name
# IQN=iqn.2016-11.org.gluster:10.70.42.151

Create the backend with glfs storage module
# targetcli /backstores/user:glfs create glfsLUN 40G block@10.70.42.151/elastic-media.img
Created user-backed storage object glfsLUN size 42949672960.

Create a target
# targetcli /iscsi create $IQN
Created target iqn.2016-11.org.gluster:10.70.42.151.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.

Share a glfs backed LUN without any auth checks
# targetcli /iscsi/$IQN/tpg1 set attribute generate_node_acls=1 demo_mode_write_protect=0
Parameter generate_node_acls is now '1'.
Parameter demo_mode_write_protect is now '0'.

Set/Export LUN
# targetcli /iscsi/$IQN/tpg1/luns create /backstores/user:glfs/glfsLUN
Created LUN 0.

Flush the firewall rules so the initiator can reach the target portal (port 3260)
# iptables -F

Initiator side setup (on Elasticsearch node) (NODE 2)

# dnf install iscsi-initiator-utils

Check existing block devices
# lsblk
NAME                       MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sr0                         11:0    1 1024M  0 rom  
vda                        252:0    0   40G  0 disk 
├─vda2                     252:2    0 39.5G  0 part 
│ ├─fedora_dhcp42--17-swap 253:1    0    4G  0 lvm  [SWAP]
│ └─fedora_dhcp42--17-root 253:0    0   15G  0 lvm  /
└─vda1                     252:1    0  500M  0 part /boot

Discovery and login to target
# iscsiadm -m discovery -t st -p 10.70.42.151 -l
10.70.42.151:3260,1 iqn.2016-11.org.gluster:10.70.42.151
Logging in to [iface: default, target: iqn.2016-11.org.gluster:10.70.42.151, portal: 10.70.42.151,3260] (multiple)
Login to [iface: default, target: iqn.2016-11.org.gluster:10.70.42.151, portal: 10.70.42.151,3260] successful.

Boom! got sda with 40G space 
# lsblk
NAME                       MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sr0                         11:0    1 1024M  0 rom  
sda                          8:0    0   40G  0 disk
vda                        252:0    0   40G  0 disk 
├─vda2                     252:2    0 39.5G  0 part 
│ ├─fedora_dhcp42--17-swap 253:1    0    4G  0 lvm  [SWAP]
│ └─fedora_dhcp42--17-root 253:0    0   15G  0 lvm  /
└─vda1                     252:1    0  500M  0 part /boot

Let's format the block device with xfs
#  mkfs.xfs /dev/sda

# mkdir /home/pkalever/block

Mount the block device
# mount /dev/sda /home/pkalever/block

# df -Th
Filesystem Type Size Used Avail Use% Mounted on
[...]
/dev/sda xfs 40G 0.2G 39.8G 1% /home/pkalever/block

Elasticsearch configuration with single node

Elasticsearch is an open-source, distributed, scalable, enterprise-grade search engine. Accessible through an extensive and elaborate API, Elasticsearch can power extremely fast searches that support your data discovery applications.

elasticsearch-2.3.4 (as it is the version compatible with the wiki dumps)

Download the rpm, this version is compatible with wiki indexes/dumps/docs
# wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.3.4/elasticsearch-2.3.4.rpm

Install Elasticsearch
# dnf install ./elasticsearch-2.3.4.rpm

Install Command-line JSON processor
# dnf install jq

Run
# sudo systemctl daemon-reload
# sudo systemctl enable elasticsearch.service
# sudo systemctl start elasticsearch.service

Check the status
# sudo systemctl status elasticsearch.service

Configure Elasticsearch to use gluster block mount directory for storage
Uncomment and edit the below parameters as per your choice
# sudo vi /etc/elasticsearch/elasticsearch.yml
cluster.name: gluster-block-17                 
node.name: node-17                             
path.data: /home/pkalever/block/data2     
path.logs: /home/pkalever/block/logs2

# mkdir ~/block/data2 ~/block/logs2

# /usr/share/elasticsearch/bin/plugin install analysis-icu

# sudo systemctl restart elasticsearch.service

Check the status
# sudo systemctl status elasticsearch.service

Testing

Simple test to make sure setup works

List the Indices
# curl -XGET http://localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size 

Now let’s create an index named "bank"
# curl -XPUT http://localhost:9200/bank?pretty 
{
 "acknowledged" : true
}

# curl -XGET http://localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size 
yellow open bank 5 1 0 0 650b 650b 

Note docs.count = 0 

Let’s now put something into our bank index.
In order to index a document, we must tell Elasticsearch which type in the index it should go to.
Let’s index a simple document into the bank index, "account" type, with an ID of 1 as follows:
# curl -XPUT http://localhost:9200/bank/account/1?pretty -d '
{
 "account_number": "999120999",
 "name": "pkalever"
}'

And the Response:
{
 "_index" : "bank",
 "_type" : "account",
 "_id" : "1",
 "_version" : 1,
 "_shards" : {
 "total" : 2,
 "successful" : 1,
 "failed" : 0
 },
 "created" : true
}

By looking at the response we can say that a new bank document was successfully created.
# curl -XGET http://localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size 
yellow open bank 5 1 1 0 3.7kb 3.7kb

And now, Note docs.count = 1
 
Query a document
# curl -XGET http://localhost:9200/bank/account/1?pretty
{
 "_index" : "bank",
 "_type" : "account",
 "_id" : "1",
 "_version" : 1,
 "found" : true,
 "_source" : {
 "account_number" : "999120999",
 "name" : "pkalever"
 }
}

If we study the above commands carefully, we can actually see a pattern of how we access data in Elasticsearch.
That pattern can be summarized as follows:
<REST Verb> /<Index>/<Type>/<ID>

Delete the entry
# curl -XDELETE http://localhost:9200/bank/account/1?pretty

So we have manually created the indices and added the documents. Let's now load some of the data sets/search indexes that Wikipedia provides.

Loading Wikipedia’s Search Index

In the very next script we:
1. Delete the index named 'enwikiquote' if it already exists
2. Fetch the settings that en.wikiquote.org uses for its index and
   use them to create a new index
3. Fetch the mapping for the content index and apply it
# cat > run1.sh 
export es=localhost:9200
export site=en.wikiquote.org
export index=enwikiquote

curl -XDELETE $es/$index?pretty

curl -s 'https://'$site'/w/api.php?action=cirrus-settings-dump&format=json&formatversion=2' |
  jq '{
    analysis: .content.page.index.analysis,
    number_of_shards: 1,
    number_of_replicas: 0
  }' |
  curl -XPUT $es/$index?pretty -d @-

curl -s 'https://'$site'/w/api.php?action=cirrus-mapping-dump&format=json&formatversion=2' |
  jq .content |
  curl -XPUT $es/$index/_mapping/page?pretty -d @-

# ./run1.sh
{
  "acknowledged" : true
}
{
  "acknowledged" : true
}
{
  "acknowledged" : true
}

Now let's download the wiki dumps (the JSON-formatted documents)
# wget https://dumps.wikimedia.org/other/cirrussearch/current/enwikiquote-20161114-cirrussearch-content.json.gz

Or you can browse https://dumps.wikimedia.org/other/cirrussearch/ and download whatever you need.

In the very next script we:
1. Create a directory with the name chunks and
2. Extract 500-line chunks from the dump (each chunk holds 250 lines of metadata and 250 lines of actual docs, as the bulk format interleaves them)
# cat > run2.sh 
export dump=enwikiquote-20161114-cirrussearch-content.json.gz
export index=enwikiquote

mkdir chunks
cd chunks
zcat ../$dump | split -a 10 -l 500 - $index


# ./run2.sh 
# ls chunks/
enwikiquoteaaaaaaaaaa  enwikiquoteaaaaaaaabd  enwikiquoteaaaaaaaacg  enwikiquoteaaaaaaaadj
enwikiquoteaaaaaaaaab  enwikiquoteaaaaaaaabe  enwikiquoteaaaaaaaach  enwikiquoteaaaaaaaadk
[...]
enwikiquoteaaaaaaaaba  enwikiquoteaaaaaaaacd  enwikiquoteaaaaaaaadg  enwikiquoteaaaaaaaaej
enwikiquoteaaaaaaaabb  enwikiquoteaaaaaaaace  enwikiquoteaaaaaaaadh  enwikiquoteaaaaaaaaek
enwikiquoteaaaaaaaabc  enwikiquoteaaaaaaaacf  enwikiquoteaaaaaaaadi

The loop in the script loads each file and deletes it after it's loaded. 
# cat > ./run3.sh
export es=localhost:9200
export index=enwikiquote
cd chunks
for file in *; do
  echo -n "${file}:  "
  took=$(curl -s -XPOST $es/$index/_bulk?pretty --data-binary @$file |
    grep took | cut -d':' -f 2 | cut -d',' -f 1)
  printf '%7s\n' $took
  [ "x$took" = "x" ] || rm $file
done

# ./run3.sh 
enwikiquoteaaaaaaaaaa:     9306
enwikiquoteaaaaaaaaab:    10607
enwikiquoteaaaaaaaaac:     6652
[...]
enwikiquoteaaaaaaaaaz:     4178
enwikiquoteaaaaaaaaba:     4800
enwikiquoteaaaaaaaabb:     4469
enwikiquoteaaaaaaaabc:     4349
[...]
enwikiquoteaaaaaaaabz:     8228
enwikiquoteaaaaaaaaca:     5152
enwikiquoteaaaaaaaacb:     4134
enwikiquoteaaaaaaaacc:     4510
[...]

List the indices 
# curl -XGET  http://localhost:9200/_cat/indices?v
health status index       pri rep docs.count docs.deleted store.size pri.store.size 
green  open   enwikiquote   1   0      28533            0      1.1gb          1.1gb

Query for page 1
# curl -XGET http://localhost:9200/enwikiquote/page/1?pretty

# curl -X GET  http://localhost:9200/enwikiquote/_search | less
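
To cross-check the document count reported above, a simple sketch using the _count API:
# curl -XGET http://localhost:9200/enwikiquote/_count?pretty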

Conclusion

This blog just showcases how Gluster block storage can be used as backend persistent storage for the Elasticsearch engine at POC level. More details will follow in further posts.

References

https://www.elastic.co/blog/loading-wikipedia

https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/_create_an_index.html

Previous posts on gluster block storage

https://pkalever.wordpress.com/2016/06/23/gluster-solution-for-non-shared-persistent-storage-in-docker-container/

https://pkalever.wordpress.com/2016/06/29/non-shared-persistent-gluster-storage-with-kubernetes/

https://pkalever.wordpress.com/2016/08/16/read-write-once-persistent-storage-for-openshift-origin-using-gluster/

https://pkalever.wordpress.com/2016/11/04/gluster-as-block-storage-with-qemu-tcmu/

Gluster as Block Storage with qemu-tcmu

In this blog we shall see

  1. Terminology and background
  2. Our approach
  3. Setting up
    • Gluster Setup
    • Tcmu-Runner
    • Qemu and Target Setup
    • iSCSI Initiator
  4. Conclusion
  5. Similar Topics

Terminology and background

Gluster is a well known scale-out distributed storage system, flexible in its design and easy to use. One of its key goals is to provide high availability of data.  Despite its distributed nature, Gluster is very easy to set up and use. Addition and removal of storage servers from a Gluster cluster is very easy. These capabilities along with other data services that Gluster provides makes it a very nice software defined storage platform.

We can access glusterfs via the FUSE module. However, to perform a single filesystem operation, various context switches are required, which leads to performance issues. Libgfapi is a userspace library for accessing data in GlusterFS. It can perform IO on gluster volumes without the FUSE module or the kernel VFS layer, and hence requires no context switches. It exposes a filesystem-like API for accessing gluster volumes. Samba, NFS-Ganesha, QEMU and now tcmu-runner all use libgfapi to integrate with GlusterFS.

A unique distributed storage solution built on traditional filesystems

The SCSI subsystem uses a sort of client-server model. The client/initiator requests IO through the target, which is a storage device. The SCSI target subsystem enables a computer node to behave as a SCSI storage device, responding to storage requests from other SCSI initiator nodes.

In simple terms SCSI is a set of standards for physically connecting and transferring data between computers and peripheral devices.

The most common implementation of the SCSI target subsystem is an iSCSI server. iSCSI transports block-level data between the iSCSI initiator and the target, which resides on the actual storage device. The iSCSI protocol wraps up the SCSI commands and sends them over the TCP/IP layer. Upon receiving the packets at the other end, it disassembles them to form the same SCSI commands; hence, to the OS it is seen as a local SCSI device.

In other words iSCSI is SCSI over TCP/IP.

The LIO project began with the iSCSI design as its core objective, and created a generic SCSI target subsystem to support iSCSI. LIO is the SCSI target in the Linux kernel. It is entirely kernel code, and allows exported SCSI logical units (LUNs) to be backed by regular files or block devices.

LIO (Linux IO target) is an implementation of an iSCSI target.

TCM is another name for LIO, an in-kernel iSCSI target (server). As we know, existing TCM targets run in the kernel. TCMU (TCM in Userspace) allows userspace programs to be written which act as iSCSI targets. This enables a wider variety of backstores without kernel code. Hence the TCMU userspace-passthrough backstore allows a userspace process to handle requests to a LUN. TCMU utilizes the traditional UIO subsystem, which is designed to allow device driver development in userspace.
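
On a node where TCMU-backed storage objects are configured, you can peek at the UIO devices TCMU registers through sysfs (a quick check; the device numbering will vary):
# ls /sys/class/uio/
# cat /sys/class/uio/uio0/name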

One such backstore with the best clustered network storage capabilities is GlusterFS.

Tcmu-runner utilizes the TCMU framework, handling the messy details of the TCMU interface (thanks to Andy Grover). Tcmu-runner itself has a glusterfs handler that can interact with the backing file in the gluster volume over the libgfapi interface and expose it as a target (over the network).

Some responsibilities for tcmu-runner include

  1. Discovering and configuring TCMU UIO devices
  2. Waiting for events on the device and
  3. Managing the command ring buffers

TargetCli is the general management platform for the LIO/TCM/TCMU. TargetCli with its shell interface is used to configure LIO.

Think of it as a shell which makes life easy when configuring the LIO core

QEMU (Quick Emulator) is a generic and open source machine emulator and virtualizer. It is a free and open source tool that allows users to create and manage virtual machines inside the host operating system. The resources of the host operating system, such as hard drive, RAM and processor, are divided and shared by the guest operating systems (virtual machines).

When used as a machine emulator, QEMU can run OSes and programs made for one machine (e.g. an ARM board) on a different machine (e.g. your own PC). By using dynamic translation, it achieves very good performance.

When used as a virtualizer, QEMU achieves near-native performance by executing the guest code directly on the host CPU. QEMU supports virtualization when executing under the Xen hypervisor or using the KVM kernel module in Linux.

QEMU can access the disk/drive/VMimage files not just from local directories but also from remote locations using various protocols (iSCSI, nfs, gluster, rbd, nbd, sheepdog etc.)

In one line, QEMU is a quick emulator and virtualizer that is capable of accessing storage locally and remotely using various protocol drivers. 

Qemu-tcmu is another utility/package from QEMU (thanks to Fam Zheng) that uses libtcmu to create and register the protocol handlers which help in exporting LUNs. The best part about qemu-tcmu is being able to export any format/protocol that QEMU supports, for local or remote access, examples being gluster, rbd, nbd, sheepdog, nfs, qed, qcow, qcow2, vdi, vmdk, vhdx and a few others.


Our Approach

With all the background discussed above, now let’s jump into the actual essence of this blog and explain how we can expose a file in a gluster volume as a block device using qemu-tcmu.

  1. Start glusterd and tcmu-runner, create a gluster volume
  2. Create a file in the gluster volume
  3. Register and start the gluster protocol handler with tcmu using qemu-tcmu.
  4. Create the iSCSI target and export the LUN
  5. From the client side discover and login to the target portal, play with the block device


Setting Up

You need 2 nodes for setting this up: one acts as the gluster node from where the iSCSI target is served, and the other machine acts as the iSCSI initiator/client where we play with the block device.

I’m using Fedora 24 on both the nodes.


Gluster Setup

For the simplicity of this blog I’m using a single-node gluster setup, which is a 1x1 plain distribute volume.

# dnf -y install git
# git clone https://github.com/gluster/glusterfs.git
# cd glusterfs
As we noticed a critical bug in the latest master, check out the v3.8.4 tag.
# git checkout -b tag-3.8.4 v3.8.4
# dnf -y install gcc automake autoconf libtool flex \
         bison openssl-devel libxml2-devel         \
         python-devel libaio-devel sqlite-devel    \
         libibverbs-devel librdmacm-devel          \
         readline-devel lvm2-devel glib2-devel     \
         userspace-rcu-devel libcmocka-devel       \
         libacl-devel sqlite-devel redhat-rpm-config
# ./autogen.sh && ./configure && make -j install

# systemctl start glusterd

# gluster vol status
No volumes present

# gluster vol create block-store NODE1:/brick force
volume create: block-store: success: please start the volume ...

# gluster vol start block-store
volume start: block-store: success

# gluster vol status
Status of volume: block-store
Gluster process     TCP Port RDMA Port Online Pid
-----------------------------------------------------
Brick Node1:/brick  49152    0         Y      13372
 
Task Status of Volume block-store
-----------------------------------------------------
There are no active volume tasks 

Tcmu-Runner Setup

# git clone https://github.com/open-iscsi/tcmu-runner.git
# cd tcmu-runner
# dnf -y install targetcli cmake "*kmod*" libnl3* zlib-devel
For libgfapi.so* gluster libraries 
# export LD_LIBRARY_PATH=/usr/local/lib/
# cmake -DSUPPORT_SYSTEMD=ON -DCMAKE_INSTALL_PREFIX=/usr 
# make -j 
# make -j install

Run tcmu-runner
# systemctl start tcmu-runner

 Qemu Setup

# git clone https://github.com/qemu/qemu.git
# cd qemu
copy and apply the qemu-tcmu RFC patch from 
https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg00711.html

# dnf -y install libiscsi-devel pixman-devel

For libgfapi.so* and libtcmu.so*
# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib64
# ./configure --target-list=x86_64-softmmu \
              --enable-glusterfs --enable-libiscsi \
              --enable-tcmu
# make -j 
# make -j install

Target Setup

Create a file in the gluster volume 
# qemu-img create -f qcow2 gluster://NODE1/block-store/storage.qcow2 10G
Formatting 'gluster://NODE1/block-store/storage.qcow2', 
fmt=qcow2 size=10737418240 encryption=off 
cluster_size=65536 lazy_refcounts=off refcount_bits=16

Check for the details/info
# qemu-img info gluster://NODE1/block-store/storage.qcow2
image: gluster://NODE1/block-store/storage.qcow2
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: 193K
cluster_size: 65536
Format specific information:
 compat: 1.1
 lazy refcounts: false
 refcount bits: 16
 corrupt: false

Register and start the gluster protocol handler
# qemu-tcmu gluster://NODE1/block-store/storage.qcow2 &
[scsi/tcmu.c:0298] tcmu start
[scsi/tcmu.c:0314] register

Should be able to see something like
# targetcli ls | grep user:qemu
| o- user:qemu ................... [Storage Objects: 0]

Define/set IQN
# IQN=iqn.2016-11.org.gluster:qemu-tcmu-glfs

Create a target
# targetcli /iscsi create $IQN
Created target iqn.2016-11.org.gluster:qemu-tcmu-glfs.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.

Share a qemu-tcmu backed LUN without any auth checks
# targetcli /iscsi/$IQN/tpg1 set attribute \
                             generate_node_acls=1 \
                             demo_mode_write_protect=0
Parameter generate_node_acls is now '1'.
Parameter demo_mode_write_protect is now '0'.

Create the backend with qemu-tcmu storage module
# targetcli /backstores/user:qemu create QemuLUN 10G @drive
Created user-backed storage object QemuLUN size 10737418240.

Set/Export LUN
# targetcli /iscsi/$IQN/tpg1/luns create /backstores/user:qemu/QemuLUN
Created LUN 0.

Check the configuration
# targetcli ls
o-/ ...................................................... [...]
 o- backstores ........................................... [...]
 | o- block ............................... [Storage Objects: 0]
 | o- fileio .............................. [Storage Objects: 0]
 | o- pscsi ............................... [Storage Objects: 0]
 | o- ramdisk ............................. [Storage Objects: 0]
 | o- user:glfs ........................... [Storage Objects: 0]
 | o- user:qcow ........................... [Storage Objects: 0]
 | o- user:qemu ........................... [Storage Objects: 1]
 |   o- QemuLUN ................... [@drive (10.0GiB) activated]
 o- iscsi ......................................... [Targets: 1]
 | o- iqn.2016-11.org.gluster:qemu-tcmu-glfs ......... [TPGs: 1]
 |   o- tpg1 ............................... [gen-acls, no-auth]
 |     o- acls ....................................... [ACLs: 0]
 |     o- luns ....................................... [LUNs: 1]
 |     | o- lun0 ................................ [user/QemuLUN]
 |     o- portals ................................. [Portals: 1]
 |       o- 0.0.0.0:3260 .................................. [OK]
 o- loopback ...................................... [Targets: 0]
 o- vhost ......................................... [Targets: 0]
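
Optionally persist the target configuration across reboots (the same saveconfig step used in my earlier posts):
# targetcli saveconfig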

Everything we have done till now was on the server side, i.e. Node1.

Initiator Setup

On the Client side (Node 2)

# dnf install iscsi-initiator-utils sg3_utils

Check existing block devices
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk 
├─sda1 8:1 0 4.7G 0 part /boot
└─sda2 8:2 0 472.3G 0 part 
 ├─fedora-root 253:0 0 328.8G 0 lvm /
 ├─fedora-swap 253:1 0 3.8G 0 lvm [SWAP]
 └─fedora-home 253:2 0 139.7G 0 lvm /home

# systemctl start iscsid 

Discovery and login to target
# iscsiadm -m discovery -t st -p NODE1 -l 
NODE1:3260,1 iqn.2016-11.org.gluster:qemu-tcmu-glfs
Logging in to [iface: default, target: ..., portal: NODE1,3260] (multiple)
Login to [iface: default, target: ..., portal: NODE1,3260] successful.

Troubleshooting tip!
If you see something like
# iscsiadm -m discovery -t st -p NODE1 -l
iscsiadm: cannot make connection to NODE1: No route to host
iscsiadm: cannot make connection to NODE1: No route to host
[...]
iscsiadm: connection login retries (reopen_max) 5 exceeded
iscsiadm: Could not perform SendTargets discovery: connection failure
Then flush the iptables rules with the "iptables -F" command on the server node

# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 477G 0 disk 
├─sda1 8:1 0 4.7G 0 part /boot
└─sda2 8:2 0 472.3G 0 part 
 ├─fedora-root 253:0 0 328.8G 0 lvm /
 ├─fedora-swap 253:1 0 3.8G 0 lvm [SWAP]
 └─fedora-home 253:2 0 139.7G 0 lvm /home
sdb 8:16 0 10G 0 disk 

Boom! got sdb with 10G space 
Check that sdb is the right one;
you should be able to see 'vendor specific' as 'qemu/@drive'
# sg_inq -i /dev/sdb 
VPD INQUIRY: Device Identification page 
 Designation descriptor number 1, descriptor length: 49 
 designator_type: T10 vendor identification, code_set: ASCII 
 associated with the addressed logical unit 
 vendor id: LIO-ORG 
 vendor specific: fe14a7d8-ca4d-4fa0-9646-cceb4961fd92 
 Designation descriptor number 2, descriptor length: 20 
 designator_type: NAA, code_set: Binary 
 associated with the addressed logical unit 
 NAA 6, IEEE Company_id: 0x1405 
 Vendor Specific Identifier: 0xfe14a7d8c 
 Vendor Specific Identifier Extension: 0xa4d4fa09646cceb4 
 [0x6001405fe14a7d8ca4d4fa09646cceb4] 
 Designation descriptor number 3, descriptor length: 16 
 designator_type: vendor specific [0x0], code_set: ASCII 
 associated with the addressed logical unit 
 vendor specific: qemu/@drive 

Let's format the block device with xfs
# mkfs.xfs /dev/sdb
meta-data=/dev/sdb      isize=512   agcount=4, agsize=655360 blks
         =              sectsz=512  attr=2, projid32bit=1
         =              crc=1       finobt=1, sparse=0
data     =              bsize=4096  blocks=2621440, imaxpct=25
         =              sunit=0     swidth=0 blks
naming   =version 2     bsize=4096  ascii-ci=0 ftype=1
log      =internal log  bsize=4096  blocks=2560, version=2
         =              sectsz=512  sunit=0 blks, lazy-count=1
realtime =none          extsz=4096  blocks=0, rtextents=0

# mount /dev/sdb /mnt

# df -Th
Filesystem Type Size Used Avail Use% Mounted on
[...]
/dev/sdb xfs 10G 33M 10G 1% /mnt
[...]

# cd /mnt

# touch like.{1..10}
# ls
like.1 like.2 like.4 like.6 like.8
like.10 like.3 like.5 like.7 like.9
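
As a quick sanity check of writes going over iSCSI to the gluster-backed file (ddtest.img is just a scratch file; throughput will vary with your network and volume layout):
# dd if=/dev/zero of=/mnt/ddtest.img bs=1M count=100 oflag=direct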

Wow! This is cool, isn’t it?

Qemu-tcmu is a fresh cake, still in POC, with the following work ongoing:

1. For now we have one qemu-tcmu process running per target, i.e. for 10 targets we would run 10 “qemu-tcmu … &” processes. This means more resource consumption, in exchange for better isolation. But don’t worry, Fam is working on this; he will shrink these down to one single process.

Hopefully we should be able to get something like

# qemu-tcmu -drive gluster://..../file1 \
            -drive gluster://..../file2 \
            -drive gluster://..../file3 \
            ...

so you can dynamically add the target just like any other extra disk in a VM.

2. How do you create a second target?
Well, we have “-x, --handler-name=NAME” for that 🙂

Example:

# qemu-tcmu -x qemu2  gluster://NODE1/block-store/storage2.qcow2 &
so now we have 2 handlers, one for each target
# targetcli ls | grep user:qemu
| o- user:qemu ........................... [Storage Objects: 1]
| o- user:qemu2 .......................... [Storage Objects: 0]

Though @drive is very generic, this is how the drive will be understood by qemu-tcmu for now.

3. Work related to snapshots, backup, mirroring etc. is in progress and can be expected in near-future releases.

Conclusion

With this approach of exporting a file in the gluster volume as an iSCSI target, we achieve easy and free snapshots for block storage. We will see more features/improvements landing in qemu-tcmu very soon, such as multiple targets within the same process, multiple objects within the same target via unique @drive IDs, and improvements in areas like snapshots/mirror/backup.

I shall keep you updated with the latest improvements in qemu-tcmu.

Similar Topics

https://pkalever.wordpress.com/2016/06/23/gluster-solution-for-non-shared-persistent-storage-in-docker-container/

https://pkalever.wordpress.com/2016/06/29/non-shared-persistent-gluster-storage-with-kubernetes/

https://pkalever.wordpress.com/2016/08/16/read-write-once-persistent-storage-for-openshift-origin-using-gluster/

Read Write Once Persistent Storage for OpenShift Origin using Gluster

In this blog we shall learn about:

  1. Containers and Persistent Storage
  2. About OpenShift Origin
  3. Terminology and background
  4. Our approach
  5. Setting up
    • Gluster and iSCSI target
    • iSCSI Initiator
    • Origin master and nodes
  6. Conclusion
  7. References

 

Containers and Persistent Storage

As we all know containers are stateless entities which are used to deploy  applications and hence need persistent storage to store  application data for availability across container incarnations.

Persistent storage in containers is of two types: shared and non-shared.
Shared storage:
Consider this as a volume/store where multiple Containers perform both read and write operations on the same data. Useful for applications like web servers that need to serve the same data from multiple container instances.

Non Shared/Read Write Once Storage:
Only a single container can perform write operations to this store at a given time.

This blog will explain about Non Shared Storage for OpenShift Origin using gluster.

 

About OpenShift Origin

OpenShift Origin is a distribution of Kubernetes optimized for continuous application development and multi-tenant deployment.

A few interesting features include multi-tenancy support, a web console, centralized administration, and the capability to automatically deploy applications on a new commit in the source repo.


Read More @ origin github

 

Terminology and background

Refer to ‘Terminology and background’ section from our previous post

 

Our Approach

With all the background discussed above, I shall now jump into the actual essence of this blog and explain how we can expose a file in a gluster volume as read-write-once persistent storage in OpenShift pods.

The current version of Kubernetes, v1.2.x, which Origin uses in my case, does not provide/understand multipathing; this patch got merged in the v1.3.alpha3 release.

Hence, in this blog I’m going with multipath disabled; once the ansible playbook is upgraded to the latest Origin which uses k8s v1.3.0, I shall update the blog with the multipath changes.

In our approach all the OpenShift Origin nodes initiate the iSCSI session, attaches iSCSI target as block device and serve it to pod where the application is running and requires persistent storage.


Now without any delay let me walk through the setup details…

 

Setting Up

You need 6 nodes for setting this up: 3 act as gluster nodes from where the iSCSI target is served, 1 acts as the OpenShift Origin master, and the other 2 act as the iSCSI initiators, which also serve as Origin nodes.

  • We create a gluster replica 3 volume using the 3 nodes {Node1, Node2 and Node3}.
  • Define iSCSI target using the same nodes, expose ‘LUN’ from each of them.
  • Use Node 4 and Node 5 as iSCSI initiators, by logging in to the iSCSI target sessions created above (no multipathing)
  • Setup OpenShift Origin cluster by using {Node4, Node5 and Node6}, Node 6 is master and other 2 are slave nodes
  • From Node 6 create the pod and examine the iSCSI target device mount inside it.

Gluster and iSCSI target Setup

Refer to ‘Gluster and iSCSI target Setup’ section from our previous post

iSCSI initiator Setup

Refer to ‘iSCSI initiator Setup’ section from our previous post

OpenShift Origin Master and Nodes Setup

Master -> Node6
Slaves -> Node5 & Node4

Clone the openshift ansible repo
[root@Node6 ~]# git clone https://github.com/openshift/openshift-ansible.git

Install ansible on all the nodes including master
# dnf install -y ansible pyOpenSSL python-cryptography

Configure nodes in inventory file,
all you need to do is replace the host addresses (Node6, Node5, Node4) with yours
[root@Node6 ~]# cat > /etc/ansible/hosts
# Create an OSEv3 group that contains the masters and nodes groups
[OSEv3:children]
masters
nodes

# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=root

# If ansible_ssh_user is not root, ansible_sudo must be set to true
#ansible_sudo=true

deployment_type=origin

# uncomment the following to enable htpasswd authentication; defaults to DenyAllPasswordIdentityProvider
#openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true',
# 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/htpasswd'}]

# host group for masters
[masters]
Node6

# host group for nodes, includes region info
[nodes]
Node6 openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
Node5 openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
Node4 openshift_node_labels="{'region': 'primary', 'zone': 'west'}"
^C

Enable passwordless (key-based) SSH logins to all the nodes

Generate ssh key 
# ssh-keygen

Share ssh key with all the nodes, to do so, execute below on master,
$HOSTS being all the addresses/ip including master's, one at a time
[root@Node6 ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub $HOSTS
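
For example, a small loop over the hosts used in this setup (hostnames here are the ones from the inventory above):
[root@Node6 ~]# for host in Node6 Node5 Node4; do ssh-copy-id -i ~/.ssh/id_rsa.pub $host; done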

Just as a matter of precaution, disable SELinux on all the hosts
# setenforce 0

Install some package dependencies, ignored by playbook
[root@Node6 ~]# ansible all -m shell -a "dnf install python2-dnf -y" 
[root@Node6 ~]# ansible all -m shell -a "dnf install python-dbus -y"
[root@Node6 ~]# ansible all -m shell -a "dnf install libsemanage-python -y"

Lets, execute the playbook
[root@Node6 ~]# cd $PATH/openshift-ansible
[root@Node6 openshift-ansible]# ansible-playbook playbooks/byo/config.yml
It takes ~40 minutes to finish this, at least that's what it took me. 

Check all nodes are ready
[root@Node6 ~]# oc get nodes
NAME STATUS AGE
Node4 Ready 1h
Node5 Ready 1h
Node6 Ready,SchedulingDisabled 1h

Check for pods
[root@Node6 ~]# oc get pods

 

login to the origin web console https://Node6:8443
Credentials: user->admin, passwd->admin


Create a new project, say “blockstore-gluster”


 

Switch to 'blockstore-gluster' project
[root@Node6 ~]# oc project blockstore-gluster
Now using project "blockstore-gluster" on server "https://Node6:8443".

Write a manifest/artifact for the pod
[root@Node6 ~]# cat > iscsi-pod.json
{
   "apiVersion": "v1",
   "kind": "Pod",
   "metadata": {
      "name": "glusterpod"
   },
   "spec": {
      "containers": [
         {
            "name": "iscsi-rw",
            "image": "fedora",
            "volumeMounts": [
               {
                  "mountPath": "/mnt/gluster-store",
                  "name": "iscsi-rw"
               }
            ],
            "command": [ "sleep", " 100000" ]
         }
      ],
      "volumes": [
         {
            "name": "iscsi-rw",
            "iscsi": {
               "targetPortal": "Node1:3260",
               "iqn": "iqn.2016-06.org.gluster:Node1",
               "lun": 0,
               "fsType": "xfs",
               "readOnly": false
            }
         }
      ]
   } 
}
^C

Create the pod
[root@Node6 ~]# oc create -f ~/iscsi-pod.json 
pod "glusterpod" created

Get the pod info
[root@Node6 ~]# oc get pods
NAME READY STATUS RESTARTS AGE
glusterpod 0/1 ContainerCreating 0 20s

Check events
[root@Node6 ~]# oc get events -w
FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
2016-08-16 16:16:10 +0530 IST 2016-08-16 16:16:10 +0530 IST 1 glusterpod Pod Normal Scheduled {default-scheduler } Successfully assigned glusterpod to dhcp43-73.lab.eng.blr.redhat.com
FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
2016-08-16 16:16:14 +0530 IST 2016-08-16 16:16:14 +0530 IST 1 glusterpod Pod spec.containers{iscsi-rw} Normal Pulling {kubelet Node5} pulling image "fedora"
2016-08-16 16:17:17 +0530 IST 2016-08-16 16:17:17 +0530 IST 1 glusterpod Pod spec.containers{iscsi-rw} Normal Pulled {kubelet Node5} Successfully pulled image "fedora"
2016-08-16 16:17:18 +0530 IST 2016-08-16 16:17:18 +0530 IST 1 glusterpod Pod spec.containers{iscsi-rw} Normal Created {kubelet Node5} Created container with docker id 0208911923f1
2016-08-16 16:17:18 +0530 IST 2016-08-16 16:17:18 +0530 IST 1 glusterpod Pod spec.containers{iscsi-rw} Normal Started {kubelet Node5} Started container with docker id 0208911923f1

[root@Node6 ~]# oc get pods
NAME READY STATUS RESTARTS AGE
glusterpod 1/1 Running 0 1m

Get into the pod
[root@Node6 ~]# oc exec -it glusterpod bash

[root@glusterpod /]# df -Th
Filesystem Type Size Used Avail Use% Mounted on
[...]
/dev/sda xfs 8G 33M 8G 1% /mnt/gluster-store
/dev/mapper/fedora_dhcp42--82-root xfs 15G 1.8G 14G 12% /etc/hosts
[...]

[root@glusterpod /]# cd /mnt/gluster-store/
[root@glusterpod gluster-store]# ls
1 10 2 3 4 5 6 7 8 9 
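
To confirm the mount is indeed writable from inside the pod, a quick test (the file name is arbitrary):
[root@glusterpod gluster-store]# touch written-from-pod
[root@glusterpod gluster-store]# ls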


 

Origin Web console with pod running:


Details of pod:


That’s cool, isn’t it?

 

Conclusion

This just showcases how Gluster can be used as a distributed block store with OpenShift Origin cluster. More details about multipathing, integration with Mesos etc. will come by in further posts.

 

References

https://docs.openshift.org/latest/welcome/index.html

https://github.com/openshift/openshift-ansible/

http://kubernetes.io/

http://severalnines.com/blog/installing-kubernetes-cluster-minions-centos7-manage-pods-services

http://rootfs.github.io/iSCSI-Kubernetes/

http://blog.gluster.org/2016/04/using-lio-with-gluster/

https://docs.docker.com/engine/tutorials/dockervolumes/

http://scst.sourceforge.net/scstvslio.html

http://events.linuxfoundation.org/sites/events/files/slides/tcmu-bobw_0.pdf

https://www.kernel.org/doc/Documentation/target/tcmu-design.txt

https://lwn.net/Articles/424004/

http://www.gluster.org/community/documentation/index.php/GlusterFS_Documentation

 

Non Shared Persistent Gluster Storage with Kubernetes

In this blog we shall learn about:

  1. Containers and Persistent Storage
  2. About Kubernetes
  3. Terminology and background
  4. Our approach
  5. Setting up
    • Gluster and iSCSI target
    • iSCSI Initiator
    • Kubernetes master and nodes
  6. Conclusion
  7. References

 

 Containers and Persistent Storage

As we all know containers are stateless entities which are used to deploy  applications and hence need persistent storage to store  application data for availability across container incarnations.

Persistent storage in containers is of two types: shared and non-shared.
Shared storage:
Consider this as a volume/store where multiple Containers perform both read and write operations on the same data. Useful for applications like web servers that need to serve the same data from multiple container instances.

Non Shared Storage:
Only a single container can perform write operations to this store at a given time.

This blog will explain about Non Shared Storage for Kubernetes using gluster.

 

About Kubernetes

Kubernetes (k8s) is an open source container cluster manager. It aims to provide a “platform for automating deployment, scaling, and operations of application containers across clusters of hosts”.

In simple words, it works on a server-client model: it clusterizes Docker and schedules containers (pods) for application deployment. k8s uses flannel to create networking between containers, has load balancing integrated, uses etcd for service discovery, and so on.

Pod:
A Pod is the smallest deployable unit that can be created and managed in Kubernetes.
This could be one or more containers working for an application.

Nodes/Minions:
Node is a slave machine in Kubernetes, previously known as Minion. Each node has the services necessary to run Pods and is managed by the k8s master. The services on a node include {docker, flannel,  kube-proxy and kubelet }.

Master:
The managing machine, which oversees one or more minions/nodes. The services on master include {etcd, kube-apiserver, kube-controller-manager and kube-scheduler}

Services:
kube-apiserver – Provides the API for Kubernetes orchestration.
kube-controller-manager – Enforces Kubernetes services.
kube-scheduler – Schedules pods on hosts.
kubelet – Parses PodSpecs and ensures that the containers are running as described.
kube-proxy – Provides network proxy services.
etcd – A highly available key-value store for shared configuration and service discovery.
flannel – An etcd-backed network fabric that creates networking between containers.

 

Terminology and background

Refer to ‘Terminology and background’ section from our previous post

 

Our Approach

With all the background discussed above, I shall now jump into the actual essence of this blog and explain how we can expose a file in a gluster volume as non-shared persistent storage in Kubernetes pods.

The current version of Kubernetes, v1.2.x, does not provide/understand multipathing; this patch got merged in the v1.3.alpha3 release, hence in order to use multipathing we need to wait for v1.3.0.

Hence, in this blog I’m going with multipath disabled; once k8s v1.3.0 is out I shall update the blog with the multipath changes.

In our approach all the kubernetes nodes initiate the iSCSI session, attaches iSCSI target as block device and serve it to Kubernetes pod where the application is running and requires persistent storage.


Now without any delay let me walk through the setup details…

 

Setting Up

You need 6 nodes for setting this up: 3 act as gluster nodes from where the iSCSI target is served, 1 as the k8s master, and the other 2 as the iSCSI initiators, which also act as k8s nodes.

  • We create a gluster replica 3 volume using the 3 nodes {Node1, Node2 and Node3}.
  • Define iSCSI target using the same nodes, expose ‘LUN’ from each of them.
  • Use Node 4 and Node 5 as iSCSI initiators, by logging in to the iSCSI target sessions created above (no multipathing)
  • Setup k8s cluster by using {Node4, Node5 and Node6}, Node 6 is master and other 2 are slaves
  • From Node 6 create the pod and examine the iSCSI target device mount inside it.

Gluster and iSCSI target Setup

On all the nodes I have installed the freshly baked Fedora 24 Beta Server.

Perform below on all the 3 gluster nodes:

# dnf upgrade
# dnf install glusterfs-server
got 3.8.0-0.2.rc2.fc24.x86_64
# iptables -F
# vi /etc/glusterfs/glusterd.vol
add 'option rpc-auth-allow-insecure on'; this is needed by libgfapi
# systemctl start glusterd

On Node 1 perform below:

# form a gluster trusted pool of nodes
[root@Node1 ~]# gluster peer probe Node2
[root@Node1 ~]# gluster peer probe Node3
[root@Node1 ~]# gluster pool list
UUID                                    Hostname        State
51023fac-edc7-4149-9e3c-6b8207a02f7e    Node1   *Connected 
ed9af9d6-21f0-4a37-be2c-5c23eff4497e    Node2    Connected 
052e681e-cdc8-4d3b-87a4-39de26371b0f    Node3    Connected
# create volume
[root@Node1 ~]# gluster vol create nonshared-store replica 3 Node1:/subvol1 Node2:/subvol2 Node3:/subvol3 force

# volume set to allow insecure port ranges
[root@Node1 ~]# gluster vol set nonshared-store server.allow-insecure on

# start the volume 
[root@Node1 ~]# gluster vol start nonshared-store
[root@Node1 ~]# gluster vol status
check the status

# mount volume and create a required target file of 8G in volume
[root@Node1 ~]# mount -t glusterfs Node1:/nonshared-store /mnt
[root@Node1 ~]# cd /mnt
[root@Node1 ~]# fallocate -l 8G app-store.img

# finally unmount the volume
[root@Node1 ~]# umount /mnt

Now we are done with the gluster setup; let’s expose the file ‘app-store.img’ as an iSCSI target by creating a LUN.

Again On Node 1:

[root@Node1 ~]# dnf install tcmu-runner targetcli

# enter the admin console
[root@Node1 ~]# targetcli
targetcli shell version 2.1.fb42
Copyright 2011-2013 by Datera, Inc and others.
For help on commands, type 'help'.
/> ls
o- / ...................................................... [...]
  o- backstores ........................................... [...]
  | o- block ............................... [Storage Objects: 0]
  | o- fileio .............................. [Storage Objects: 0]
  | o- pscsi ............................... [Storage Objects: 0]
  | o- ramdisk ............................. [Storage Objects: 0]
  | o- user:glfs ........................... [Storage Objects: 0]
  | o- user:qcow ........................... [Storage Objects: 0]
  o- iscsi ......................................... [Targets: 0]
  o- loopback ...................................... [Targets: 0]
  o- vhost ......................................... [Targets: 0]

# Create a backstore named "glfsLUN" of 8G, backed by the image file on the gluster volume
/> cd /backstores/user:glfs
/backstores/user:glfs> create glfsLUN 8G nonshared-store@Node1/app-store.img
Created user-backed storage object glfsLUN size 8589934592.

# create a target
/backstores/user:glfs> cd /iscsi/
/iscsi> create iqn.2016-06.org.gluster:Node1            
Created target iqn.2016-06.org.gluster:Node1.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.

# set LUN
/iscsi> cd /iscsi/iqn.2016-06.org.gluster:Node1/tpg1/luns
/iscsi/iqn.20...de1/tpg1/luns> create /backstores/user:glfs/glfsLUN 0
Created LUN 0.
/iscsi/iqn.20...de1/tpg1/luns> cd /
# set ACL (it's the IQN of an initiator you permit to connect)
# Copy the InitiatorName from the initiator host machine,
# which will be in ‘/etc/iscsi/initiatorname.iscsi’
# In my case it is iqn.1994-05.com.redhat:8277148d16b2, used in the next commands
/> cd /iscsi/iqn.2016-06.org.gluster:Node1/tpg1/acls 
/iscsi/iqn.20...de1/tpg1/acls> create iqn.1994-05.com.redhat:8277148d16b2
Created Node ACL for iqn.1994-05.com.redhat:8277148d16b2
Created mapped LUN 0.

# set UserID and password for authentication
/iscsi/iqn.20...de1/tpg1/acls> cd iqn.1994-05.com.redhat:8277148d16b2
/iscsi/iqn.20...:8277148d16b2> set auth userid=username 
Parameter userid is now 'username'.
/iscsi/iqn.20...:8277148d16b2> set auth password=password 
Parameter password is now 'password'.
/iscsi/iqn.20...:8277148d16b2> cd /

/> saveconfig
Last 10 configs saved in /etc/target/backup.
Configuration saved to /etc/target/saveconfig.json

/> exit
Global pref auto_save_on_exit=true
Last 10 configs saved in /etc/target/backup.
Configuration saved to /etc/target/saveconfig.json

Now On Node 2:

[root@Node2 ~]# dnf install tcmu-runner targetcli

# enter the admin console
[root@Node2 ~]# targetcli
targetcli shell version 2.1.fb42
Copyright 2011-2013 by Datera, Inc and others.
For help on commands, type 'help'.
/> ls
o- / ...................................................... [...]
  o- backstores ........................................... [...]
  | o- block ............................... [Storage Objects: 0]
  | o- fileio .............................. [Storage Objects: 0]
  | o- pscsi ............................... [Storage Objects: 0]
  | o- ramdisk ............................. [Storage Objects: 0]
  | o- user:glfs ........................... [Storage Objects: 0]
  | o- user:qcow ........................... [Storage Objects: 0]
  o- iscsi ......................................... [Targets: 0]
  o- loopback ...................................... [Targets: 0]
  o- vhost ......................................... [Targets: 0]

# create a backstore named "glfsLUN" of 8G, backed by the image file on the gluster volume
/> cd /backstores/user:glfs
/backstores/user:glfs> create glfsLUN 8G nonshared-store@Node2/app-store.img
Created user-backed storage object glfsLUN size 8589934592.

# create a target
/backstores/user:glfs> cd /iscsi/
/iscsi> create iqn.2016-06.org.gluster:Node2            
Created target iqn.2016-06.org.gluster:Node2.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.

# set LUN
/iscsi> cd /iscsi/iqn.2016-06.org.gluster:Node2/tpg1/luns
iscsi/iqn.20...de2/tpg1/luns> create /backstores/user:glfs/glfsLUN 0
Created LUN 0.
iscsi/iqn.20...de2/tpg1/luns> cd /

# set ACL (it's the IQN of an initiator you permit to connect)
# Copy the InitiatorName from the initiator node(s) (Node4/Node5, the Docker hosts),
# which will be in ‘/etc/iscsi/initiatorname.iscsi’
# In my case it is the IQN used in the commands below
/> cd /iscsi/iqn.2016-06.org.gluster:Node2/tpg1/acls 
/iscsi/iqn.20...de2/tpg1/acls> create iqn.1994-05.com.redhat:8277148d16b2
Created Node ACL for iqn.1994-05.com.redhat:8277148d16b2
Created mapped LUN 0.

# set UserID and password for authentication
/iscsi/iqn.20...de2/tpg1/acls> cd iqn.1994-05.com.redhat:8277148d16b2
/iscsi/iqn.20...:8277148d16b2> set auth userid=username 
Parameter userid is now 'username'.
/iscsi/iqn.20...:8277148d16b2> set auth password=password 
Parameter password is now 'password'.
/iscsi/iqn.20...:8277148d16b2> cd /

/> saveconfig
Last 10 configs saved in /etc/target/backup.
Configuration saved to /etc/target/saveconfig.json

/> exit
Global pref auto_save_on_exit=true
Last 10 configs saved in /etc/target/backup.
Configuration saved to /etc/target/saveconfig.json

Note: Replicate the Node2 setup on Node3 as well (with Node3's own IQN, iqn.2016-06.org.gluster:Node3); the steps are skipped here to avoid duplication.

iSCSI initiator Setup

On Node4 and Node 5:

[root@Node4 ~]# dnf upgrade
[root@Node4 ~]# dnf install iscsi-initiator-utils sg3_utils

# set the InitiatorName to the IQN you added to the ACLs on the iSCSI target servers
[root@Node4 ~]# vi /etc/iscsi/initiatorname.iscsi
InitiatorName=iqn.1994-05.com.redhat:8277148d16b2

[root@Node4 ~]# vi /etc/iscsi/iscsid.conf
# uncomment below line
node.session.auth.authmethod = CHAP
[...]
# uncomment and edit username and password if required
node.session.auth.username = username
node.session.auth.password = password

[root@Node4 ~]# systemctl restart iscsid 
[root@Node4 ~]# systemctl enable iscsid

# make sure multipath is disabled (Kubernetes <= 1.2.x doesn't support it)
[root@Node4 ~]# systemctl stop multipathd
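
Stopping the service only affects the running system; to keep multipath out of the picture across reboots as well, I would also disable it (a small addition of mine):

[root@Node4 ~]# systemctl disable multipathd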

# One way to discover and log in to the iSCSI targets
[root@Node4 ~]#  iscsiadm -m discovery -t st -p Node1 -l 
Node1:3260,1 iqn.2016-06.org.gluster:Node1
Logging in to [iface: default, target: iqn.2016-06.org.gluster:Node1, portal: Node1,3260] (multiple)
Login to [iface: default, target: iqn.2016-06.org.gluster:Node1, portal: Node1,3260] successful.

[root@Node4 ~]#  iscsiadm -m discovery -t st -p Node2 -l 
Node2:3260,1 iqn.2016-06.org.gluster:Node2
Logging in to [iface: default, target: iqn.2016-06.org.gluster:Node2, portal: Node2,3260] (multiple)
Login to [iface: default, target: iqn.2016-06.org.gluster:Node2, portal: Node2,3260] successful.

[root@Node4 ~]#  iscsiadm -m discovery -t st -p Node3 -l 
Node3:3260,1 iqn.2016-06.org.gluster:Node3
Logging in to [iface: default, target: iqn.2016-06.org.gluster:Node3, portal: Node3,3260] (multiple)
Login to [iface: default, target: iqn.2016-06.org.gluster:Node3, portal: Node3,3260] successful.

# here you see that the three paths appeared as three different devices,
# since we don't have multipathing
[root@Node4 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0  0 8G 0 disk 
sdb 8:16 0 8G 0 disk 
sdc 8:32 0 8G 0 disk 
[...]
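
As a quick sanity check (an extra step of my own), you can list the active sessions and confirm which gluster file backs each device; sg_inq from sg3_utils prints the vendor-specific designator, which for the user:glfs backstore includes the volume and file name:

# one session per target portal
[root@Node4 ~]# iscsiadm -m session

# the vendor-specific designator shows glfs/<volume>@<node>/<file>
[root@Node4 ~]# sg_inq -i /dev/sda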

Note: Replicate the Node 4 steps on Node 5 as well, since it will also be an iSCSI initiator.

Format the device with XFS.

Perform the following on any one of Node 4 and Node 5:

[root@Node5 ~]# mkfs.xfs -f /dev/sda 
meta-data=/dev/sda           isize=512   agcount=4, agsize=327680 blks
         =                   sectsz=512  attr=2, projid32bit=1
         =                   crc=1       finobt=1, sparse=0
data     =                   bsize=4096  blocks=1310720, imaxpct=25
         =                   sunit=0     swidth=0 blks
naming   =version 2          bsize=4096  ascii-ci=0 ftype=1
log      =internal log       bsize=4096  blocks=2560, version=2
         =                   sectsz=512  sunit=0 blks, lazy-count=1
realtime =none               extsz=4096  blocks=0, rtextents=0

[root@Node5 ~]# mount /dev/sda /mnt
[root@Node5 ~]# touch /mnt/{1..10}
[root@Node5 ~]# umount /mnt

 

Kubernetes Master and Nodes Setup

Installing a Kubernetes cluster with 2 nodes/minions.

On K8s Master:

[root@Node6 ~]# dnf upgrade

# Disable firewalld to avoid conflicts with Docker's iptables rules:
[root@Node6 ~]# systemctl stop firewalld
[root@Node6 ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
Removed symlink /etc/systemd/system/basic.target.wants/firewalld.service.

# Install NTP and make sure it is enabled and running
[root@Node6 ~]# dnf -y install ntp
[root@Node6 ~]# systemctl start ntpd
[root@Node6 ~]# systemctl enable ntpd

# Install etcd and Kubernetes
[root@Node6 ~]# dnf install etcd kubernetes
# got Kubernetes with GitVersion: "v1.2.0"

# Configure etcd 
# Ensure the following lines are uncommented, and assign the given values:
[root@Node6 ~]# vi /etc/etcd/etcd.conf
ETCD_NAME=default
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://localhost:2379"

# Configure Kubernetes API server
# Ensure the following lines are uncommented, and assign the given values:
[root@Node6 ~]# vi /etc/kubernetes/apiserver
KUBE_API_ADDRESS="--address=0.0.0.0"
KUBE_API_PORT="--port=8080"
KUBELET_PORT="--kubelet_port=10250"
KUBE_ETCD_SERVERS="--etcd_servers=http://127.0.0.1:2379"
KUBE_SERVICE_ADDRESSES="--service-cluster-ip-range=10.254.0.0/16"
KUBE_ADMISSION_CONTROL="--admission_control=NamespaceLifecycle,NamespaceExists,LimitRanger,ResourceQuota"
KUBE_API_ARGS=""

# start and enable etcd, kube-apiserver, kube-controller-manager and kube-scheduler
[root@Node6 ~]# \
for SERVICES in etcd kube-apiserver kube-controller-manager kube-scheduler; do
    systemctl restart $SERVICES
    systemctl enable $SERVICES
    systemctl status $SERVICES 
done

# Define flannel network configuration in etcd.
# This configuration will be pulled by flannel service on nodes
[root@Node6 ~]# etcdctl mk /atomic.io/network/config '{"Network":"172.17.0.0/16"}'
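
You can read the key back to confirm that the flannel service on the nodes will pick up the expected network (the value is exactly what we just wrote):

[root@Node6 ~]# etcdctl get /atomic.io/network/config
{"Network":"172.17.0.0/16"}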

# At this point the below should show nothing because we have not configured nodes yet
[root@Node6 ~]# kubectl get nodes

 

On K8s Nodes/Minions:

# Disable firewalld to avoid conflicts with Docker's iptables rules:
[root@Node4 ~]# systemctl stop firewalld
[root@Node4 ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
Removed symlink /etc/systemd/system/basic.target.wants/firewalld.service.

# Install NTP and make sure it is enabled and running
[root@Node4 ~]# dnf -y install ntp
[root@Node4 ~]# systemctl start ntpd
[root@Node4 ~]# systemctl enable ntpd

# put SELinux in permissive mode (not persistent across reboots)
[root@Node4 ~]# setenforce 0

# Install flannel and Kubernetes
[root@Node4 ~]# dnf install flannel kubernetes

# Configure etcd server for flannel service.
# Update this to connect to the respective master:
[root@Node4 ~]# vi /etc/sysconfig/flanneld
FLANNEL_ETCD="http://Node6:2379"

# Configure Kubernetes default config
[root@Node4 ~]# vi /etc/kubernetes/config
KUBE_MASTER="--master=http://Node6:8080"

# Configure kubelet service
[root@Node4 ~]# vi /etc/kubernetes/kubelet
KUBELET_ADDRESS="--address=0.0.0.0"
KUBELET_PORT="--port=10250"
# override the hostname with this kube node's hostname or IP address
KUBELET_HOSTNAME="--hostname_override=Node4"
KUBELET_API_SERVER="--api_servers=http://Node6:8080"
KUBELET_ARGS=""

# Start and enable kube-proxy, kubelet, docker and flanneld services:
[root@Node4 ~]# \
for SERVICES in kube-proxy kubelet docker flanneld; do
    systemctl restart $SERVICES
    systemctl enable $SERVICES
    systemctl status $SERVICES 
done

# On each minion/node, you should notice two new interfaces added,
# docker0 and flannel0. You should get a different IP range on the
# flannel0 interface on each minion/node, similar to below:
[root@Node4 ~]# ip a | grep flannel | grep inet
 inet 172.17.21.0/16 scope global flannel0

Note: Replicate the Node4 steps on Node 5 as well, since it is also a node/minion.

 

So far we are done with the Kubernetes cluster setup. Let's get onto the K8s master now and create a pod that mounts the file on the gluster volume as a block device for persistent storage.

Again on K8s Master:

[root@Node6 ~]# kubectl get nodes
NAME STATUS AGE
10.70.43.239 Ready    2m
10.70.43.242 NotReady 6s

# Don't worry, within a minute you should notice:
[root@Node6 ~]# kubectl get nodes
NAME STATUS AGE
10.70.43.239 Ready 2m
10.70.43.242 Ready 32s

[root@Node6 ~]# cat > iscsi-pod.json
{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "glusterpod"
     },
     "spec": {
         "containers": [
             {
                 "name": "iscsi-rw",
                 "image": "fedora",
                 "volumeMounts": [
                     {
                         "mountPath": "/mnt/gluster-store",
                         "name": "iscsi-rw"
                     }
                 ],
                 "command": [ "sleep", " 100000" ]
            }
        ],
        "volumes": [
            {
                "name": "iscsi-rw",
                "iscsi": {
                    "targetPortal": "Node1:3260",
                    "iqn": "iqn.2016-06.org.gluster:Node1",
                    "lun": 0,
                    "fsType": "xfs",
                    "readOnly": false
                }
            }
        ]
    }  
}
^C

# create the pod defined in the file
[root@Node6 ~]# kubectl create -f iscsi-pod.json 
pod "glusterpod" created

[root@Node6 ~]# kubectl get event -w
[...]
2016-06-28 16:47:56 +0530 IST 2016-06-28 16:47:56 +0530 IST 1 glusterpod Pod Normal Scheduled {default-scheduler } Successfully assigned glusterpod to 10.70.43.239
2016-06-28 16:48:06 +0530 IST 2016-06-28 16:48:06 +0530 IST 1 10.70.43.239 Node Warning MissingClusterDNS {kubelet 10.70.43.239} kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. pod: "glusterpod_default(f6c763e5-3d21-11e6-bdd3-00151e000014)". Falling back to DNSDefault policy.
2016-06-28 16:48:06 +0530 IST 2016-06-28 16:48:06 +0530 IST 1 glusterpod Pod Warning MissingClusterDNS {kubelet 10.70.43.239} kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to DNSDefault policy.
2016-06-28 16:48:07 +0530 IST 2016-06-28 16:48:07 +0530 IST 1 glusterpod Pod spec.containers{iscsi-rw} Normal Pulling {kubelet 10.70.43.239} pulling image "fedora"
[...]

[root@Node6 ~]# kubectl get pods
NAME       READY     STATUS        RESTARTS AGE
glusterpod  0/1  ContainerCreating     0    1m

# it takes some time to pull the image for the first time
[root@Node6 ~]# kubectl get pods
NAME       READY STATUS   RESTARTS AGE
glusterpod  1/1  Running     0     3m
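
If the pod gets stuck in ContainerCreating, 'kubectl describe' is the first place to look; it also shows the events for the iSCSI volume being attached and mounted (a generic troubleshooting step, not specific to this setup):

[root@Node6 ~]# kubectl describe pod glusterpod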

# get into the pod 
[root@Node6 ~]# kubectl exec -it glusterpod  bash

# Pod Shell
[root@glusterpod /]# df -Th
Filesystem Type Size Used Avail Use% Mounted on
[...]
/dev/sda    xfs 8.0G 33M   8.0G  1%  /mnt/gluster-store
[...]

[root@glusterpod /]# cd /mnt/gluster-store/
[root@glusterpod gluster-store]# ls
1 10 2 3 4 5 6 7 8 9 
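
Writes from inside the pod land on the gluster-backed LUN as well; a quick check:

[root@glusterpod gluster-store]# touch /mnt/gluster-store/written-from-pod
[root@glusterpod gluster-store]# ls /mnt/gluster-store/
1 10 2 3 4 5 6 7 8 9 written-from-pod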

Conclusion

This just showcases how Gluster can be used as a distributed block store with a Kubernetes cluster. More details about multipathing, integration with OpenShift, etc. will follow in future posts.

 

References

http://kubernetes.io/

http://severalnines.com/blog/installing-kubernetes-cluster-minions-centos7-manage-pods-services

http://rootfs.github.io/iSCSI-Kubernetes/

http://blog.gluster.org/2016/04/using-lio-with-gluster/

https://docs.docker.com/engine/tutorials/dockervolumes/

http://scst.sourceforge.net/scstvslio.html

http://events.linuxfoundation.org/sites/events/files/slides/tcmu-bobw_0.pdf

https://www.kernel.org/doc/Documentation/target/tcmu-design.txt

https://lwn.net/Articles/424004/

http://www.gluster.org/community/documentation/index.php/GlusterFS_Documentation

 

Gluster Solution for Non Shared Persistent Storage in Docker Container

In this blog we shall see

  1. Containers and Persistent Storage
  2. Terminology and background
  3. Our approach
  4. Setting up
    • Gluster and iSCSI target
    • iSCSI Initiator
    • Docker host and Container
  5. Conclusion
  6. References

Containers and Persistent Storage

As we all know, containers are stateless entities used to deploy applications; they therefore need persistent storage to keep application data available across container incarnations.

Persistent storage in containers is of two types, shared and non-shared.

Shared storage:
Consider this as a volume/store where multiple containers perform both read and write operations on the same data. Useful for applications like web servers that need to serve the same data from multiple container instances.

Non-shared storage:
Only a single container can perform write operations to this store at a given time.

This blog explains non-shared storage for containers using Gluster.

Terminology and background

Gluster is a well known scale-out distributed storage system, flexible in design and easy to use. One of its key goals is to provide high availability of data. Despite its distributed nature, Gluster is very easy to set up, and adding or removing storage servers from a Gluster cluster is straightforward. These capabilities, along with the other data services Gluster provides, make it a very nice software-defined storage platform.

We can access GlusterFS via the FUSE module. However, a single filesystem operation then requires several context switches, which leads to performance issues. libgfapi is a userspace library for accessing data in GlusterFS. It can perform I/O on gluster volumes without going through the FUSE module or the kernel VFS layer, avoiding those context switches. It exposes a filesystem-like API for accessing gluster volumes. Samba, NFS-Ganesha, QEMU and now tcmu-runner all use libgfapi to integrate with GlusterFS.
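
As a small illustration of libgfapi in action (an aside of mine, not part of the setup below): qemu-img can create an image directly on a gluster volume over libgfapi, with no FUSE mount involved, assuming your qemu-img build includes the gluster block driver (some distributions ship it as a separate package such as qemu-block-gluster). The host and volume names here are the ones used later in this post:

# gluster://<host>/<volume>/<path>, served over libgfapi
qemu-img create -f qcow2 gluster://Node1/nonshared-store/test.qcow2 1G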

The SCSI subsystem uses a sort of client-server model. The client/initiator sends I/O requests to the target, which is a storage device. The SCSI target subsystem enables a computer node to behave as a SCSI storage device, responding to storage requests from other SCSI initiator nodes.

In simple terms SCSI is a set of standards for physically connecting and transferring data between computers and peripheral devices.

The most common implementation of the SCSI target subsystem is an iSCSI server. iSCSI transports block-level data between the iSCSI initiator and the target, which resides on the actual storage device. The iSCSI protocol wraps up SCSI commands and sends them over the TCP/IP layer; upon receiving the packets at the other end, it reassembles the same SCSI commands, so the OS sees the target as a local SCSI device.

In other words iSCSI is SCSI over TCP/IP.

The LIO project began with the iSCSI design as its core objective, and created a generic SCSI target subsystem to support iSCSI. LIO is the SCSI target in the Linux kernel. It is entirely kernel code, and allows exported SCSI logical units (LUNs) to be backed by regular files or block devices.

TCM is another name for LIO, an in-kernel iSCSI target (server). As we know, existing TCM targets run in the kernel. TCMU (TCM in Userspace) allows userspace programs to be written which act as iSCSI targets. This enables a wider variety of backstores without kernel code; the TCMU userspace-passthrough backstore allows a userspace process to handle requests to a LUN.

One such backstore, with excellent clustered network storage capabilities, is GlusterFS.

tcmu-runner utilizes the TCMU framework, handling the messy details of the TCMU interface, and allows a file on a GlusterFS volume to be exported as a target over gluster's libgfapi interface.

targetcli is the general management platform for LIO/TCM/TCMU; its shell interface is used to configure LIO.

Our Approach

With all the background discussed above, I shall now jump into the actual essence of this blog and explain how we can expose a file on a gluster volume as non-shared persistent storage in Docker.

We use iSCSI, which is highly sensitive to network performance; jitter in the connection will cause iSCSI to perform poorly.

We also know that containers suffer from suboptimal network performance, because Docker NAT doesn't deliver good performance.

Hence, in our approach the Docker host initiates the iSCSI session, attaches the iSCSI target as a block device with multipath enabled, mounts it on a local directory, and shares that directory with the container via a bind mount. This approach doesn't go through Docker NAT (hence we don't lose performance).
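
In shell terms, the flow on the Docker host boils down to roughly the following (a condensed preview of my own; every step is covered in detail in the sections below, with the same hostnames):

# log in to the gluster-backed targets exported by the three nodes
iscsiadm -m discovery -t st -p Node1 -l
iscsiadm -m discovery -t st -p Node2 -l
iscsiadm -m discovery -t st -p Node3 -l

# multipath collapses the three paths into one device (e.g. /dev/mapper/mpatha);
# partition, format and mount it locally on the Docker host
mkfs.xfs /dev/mapper/mpatha1
mount /dev/mapper/mpatha1 /root/nonshared-store/

# finally, bind mount that directory into the container
docker run --name bindmount -v /root/nonshared-store/:/mnt:z -t -i fedora /bin/bash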


Now without any delay let me walk through the setup details…

Setting Up

You need 4 nodes for this setup: 3 act as gluster nodes from which the iSCSI targets are served, and 1 machine acts as the iSCSI initiator/Docker host where container deployment happens.

Gluster and iSCSI target Setup

On all the nodes I have installed a freshly baked Fedora 24 Beta Server.

Perform the following on all 3 gluster nodes:

# dnf upgrade
# dnf install glusterfs-server
# got glusterfs 3.8.0-0.2.rc2.fc24.x86_64
# iptables -F
# vi /etc/glusterfs/glusterd.vol
# add 'option rpc-auth-allow-insecure on'; this is needed by libgfapi
# systemctl start glusterd

On Node1 perform below:

# form a gluster trusted pool of nodes
[root@Node1 ~]# gluster peer probe Node2
[root@Node1 ~]# gluster peer probe Node3
[root@Node1 ~]# gluster pool list
UUID                                    Hostname        State
51023fac-edc7-4149-9e3c-6b8207a02f7e    Node1   *Connected 
ed9af9d6-21f0-4a37-be2c-5c23eff4497e    Node2    Connected 
052e681e-cdc8-4d3b-87a4-39de26371b0f    Node3    Connected
# create volume
[root@Node1 ~]# gluster vol create nonshared-store replica 3 Node1:/subvol1 Node2:/subvol2 Node3:/subvol3 force

# set the volume option that allows connections from insecure (non-privileged) ports, needed by libgfapi clients
[root@Node1 ~]# gluster vol set nonshared-store server.allow-insecure on

# start the volume 
[root@Node1 ~]# gluster vol start nonshared-store
[root@Node1 ~]# gluster vol status
# check the status

# mount volume and create a required target file of 8G in volume
[root@Node1 ~]# mount -t glusterfs Node1:/nonshared-store /mnt
[root@Node1 ~]# cd /mnt
[root@Node1 ~]# fallocate -l 8G app-store.img

# finally unmount the volume
[root@Node1 ~]# umount /mnt
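
Since this is a replica 3 volume, the backing file should now be present on every brick; a quick sanity check (brick paths as created above):

[root@Node1 ~]# ls -lh /subvol1/app-store.img
# and likewise /subvol2/app-store.img on Node2 and /subvol3/app-store.img on Node3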

Now that we are done with the gluster setup, let's expose the file ‘app-store.img’ as an iSCSI target by creating a LUN.

Again On Node1:

[root@Node1 ~]# dnf install tcmu-runner targetcli

# enter the admin console
[root@Node1 ~]# targetcli
targetcli shell version 2.1.fb42
Copyright 2011-2013 by Datera, Inc and others.
For help on commands, type 'help'.
/> ls
o- / ...................................................... [...]
  o- backstores ........................................... [...]
  | o- block ............................... [Storage Objects: 0]
  | o- fileio .............................. [Storage Objects: 0]
  | o- pscsi ............................... [Storage Objects: 0]
  | o- ramdisk ............................. [Storage Objects: 0]
  | o- user:glfs ........................... [Storage Objects: 0]
  | o- user:qcow ........................... [Storage Objects: 0]
  o- iscsi ......................................... [Targets: 0]
  o- loopback ...................................... [Targets: 0]
  o- vhost ......................................... [Targets: 0]

# Create an 8G user:glfs storage object named "glfsLUN", backed by app-store.img on the gluster volume
/> cd /backstores/user:glfs
/backstores/user:glfs> create glfsLUN 8G nonshared-store@Node1/app-store.img
Created user-backed storage object glfsLUN size 8589934592.

# create a target
/backstores/user:glfs> cd /iscsi/
/iscsi> create iqn.2016-06.org.gluster:Node1            
Created target iqn.2016-06.org.gluster:Node1.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.

# set LUN
/iscsi> cd /iscsi/iqn.2016-06.org.gluster:Node1/tpg1/luns
iscsi/iqn.20...de1/tpg1/luns> create /backstores/user:glfs/glfsLUN 0
Created LUN 0.
iscsi/iqn.20...de1/tpg1/luns> cd /

# set ACL (it's the IQN of an initiator you permit to connect)
# Copy InitiatorName from Docker host machine
# which will be in ‘/etc/iscsi/initiatorname.iscsi’
# In my case it is the IQN used in the commands below
/> cd /iscsi/iqn.2016-06.org.gluster:Node1/tpg1/acls 
/iscsi/iqn.20...de1/tpg1/acls> create iqn.1994-05.com.redhat:8277148d16b2
Created Node ACL for iqn.1994-05.com.redhat:8277148d16b2
Created mapped LUN 0.

# set UserID and password for authentication
/iscsi/iqn.20...de1/tpg1/acls> cd iqn.1994-05.com.redhat:8277148d16b2
/iscsi/iqn.20...:8277148d16b2> set auth userid=username 
Parameter userid is now 'username'.
/iscsi/iqn.20...:8277148d16b2> set auth password=password 
Parameter password is now 'password'.
/iscsi/iqn.20...:8277148d16b2> cd /

/> saveconfig
Last 10 configs saved in /etc/target/backup.
Configuration saved to /etc/target/saveconfig.json

/> exit
Global pref auto_save_on_exit=true
Last 10 configs saved in /etc/target/backup.
Configuration saved to /etc/target/saveconfig.json

Multipathing will help in achieving high availability of the LUN at the client side, so even when a node is down for maintenance your LUN remains accessible.

Using the same wwn across the nodes is what enables multipathing: we need to ensure that the LUNs exported by all three gateways/nodes share the same wwn. If they don't match, the client will see three devices, not three paths to the same device.

Copy the wwn from Node1, the one shown in the saveconfig.json excerpt below:

[root@Node1 ~]# cat /etc/target/saveconfig.json
{
 "fabric_modules": [],
 "storage_objects": [
 {
 "config": "glfs/nonshared-store@Node1/app-store.img",
 "name": "glfsLUN",
 "plugin": "user",
 "size": 8589934592,
 "wwn": "cdc1e292-c21a-41ce-aa3f-d49658633bdf"
 } 
],
"targets": [
[...]
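
If you prefer not to eyeball the JSON, the wwn can be pulled out directly (a one-liner of my own, assuming python3 is installed):

[root@Node1 ~]# python3 -c 'import json; print(json.load(open("/etc/target/saveconfig.json"))["storage_objects"][0]["wwn"])'
cdc1e292-c21a-41ce-aa3f-d49658633bdf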

Now On Node2:

[root@Node2 ~]# dnf install tcmu-runner targetcli

# enter the admin console
[root@Node2 ~]# targetcli
targetcli shell version 2.1.fb42
Copyright 2011-2013 by Datera, Inc and others.
For help on commands, type 'help'.
/> ls
o- / ...................................................... [...]
  o- backstores ........................................... [...]
  | o- block ............................... [Storage Objects: 0]
  | o- fileio .............................. [Storage Objects: 0]
  | o- pscsi ............................... [Storage Objects: 0]
  | o- ramdisk ............................. [Storage Objects: 0]
  | o- user:glfs ........................... [Storage Objects: 0]
  | o- user:qcow ........................... [Storage Objects: 0]
  o- iscsi ......................................... [Targets: 0]
  o- loopback ...................................... [Targets: 0]
  o- vhost ......................................... [Targets: 0]

# create an 8G user:glfs storage object named "glfsLUN", backed by app-store.img on the gluster volume
/> /backstores/user:glfs create glfsLUN 8G nonshared-store@Node2/app-store.img
Created user-backed storage object glfsLUN size 8589934592.

/> saveconfig
Last 10 configs saved in /etc/target/backup.
Configuration saved to /etc/target/saveconfig.json

/> exit
Global pref auto_save_on_exit=true
Last 10 configs saved in /etc/target/backup.
Configuration saved to /etc/target/saveconfig.json

[root@Node2 ~]# vi /etc/target/saveconfig.json
# edit the wwn to point to the one copied from Node1, then save
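
Instead of hand-editing, the same substitution can be scripted (a hedged one-liner; do double-check the file afterwards):

[root@Node2 ~]# sed -i 's/"wwn": "[^"]*"/"wwn": "cdc1e292-c21a-41ce-aa3f-d49658633bdf"/' /etc/target/saveconfig.json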

[root@Node2 ~]# systemctl restart target
[root@Node2 ~]# targetcli
targetcli shell version 2.1.fb42
Copyright 2011-2013 by Datera, Inc and others.
For help on commands, type 'help'.
# create a target
/> /iscsi/ create iqn.2016-06.org.gluster:Node2            
Created target iqn.2016-06.org.gluster:Node2.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.

# set LUN
/> cd /iscsi/iqn.2016-06.org.gluster:Node2/tpg1/luns
iscsi/iqn.20...de2/tpg1/luns> create /backstores/user:glfs/glfsLUN 0
Created LUN 0.
iscsi/iqn.20...de2/tpg1/luns> cd /

# set ACL (it's the IQN of an initiator you permit to connect)
# Copy InitiatorName from Docker host machine
# which will be in ‘/etc/iscsi/initiatorname.iscsi’
# In my case it is the IQN used in the commands below
/> cd /iscsi/iqn.2016-06.org.gluster:Node2/tpg1/acls 
/iscsi/iqn.20...de2/tpg1/acls> create iqn.1994-05.com.redhat:8277148d16b2
Created Node ACL for iqn.1994-05.com.redhat:8277148d16b2
Created mapped LUN 0.

# set UserID and password for authentication
/iscsi/iqn.20...de2/tpg1/acls> cd iqn.1994-05.com.redhat:8277148d16b2
/iscsi/iqn.20...:8277148d16b2> set auth userid=username 
Parameter userid is now 'username'.
/iscsi/iqn.20...:8277148d16b2> set auth password=password 
Parameter password is now 'password'.
/iscsi/iqn.20...:8277148d16b2> cd /

/> saveconfig
Last 10 configs saved in /etc/target/backup.
Configuration saved to /etc/target/saveconfig.json

/> exit
Global pref auto_save_on_exit=true
Last 10 configs saved in /etc/target/backup.
Configuration saved to /etc/target/saveconfig.json

Note: Replicate the Node2 setup on Node3 as well (with its own IQN, iqn.2016-06.org.gluster:Node3); the steps are skipped here to avoid duplication.

Setting up iSCSI initiator

On the fourth Machine:

[root@DkNode ~]# dnf install iscsi-initiator-utils sg3_utils 

# Multipathing to achieve high availability
[root@DkNode ~]# mpathconf 
multipath is enabled
find_multipaths is enabled
user_friendly_names is enabled
dm_multipath module is not loaded
multipathd is not running

[root@DkNode ~]# modprobe dm_multipath
[root@DkNode ~]# lsmod | grep dm_multipath
dm_multipath           24576  0

[root@DkNode ~]# cat /etc/multipath.conf
cat: /etc/multipath.conf: No such file or directory

[root@DkNode ~]# mpathconf --enable
[root@DkNode ~]# cat >> /etc/multipath.conf                                                                                     

# LIO iSCSI
devices { 
        device { 
                vendor "LIO-ORG"
                user_friendly_names "yes" # names like mpatha
                path_grouping_policy "failover" # one path per group
                path_selector "round-robin 0"
                path_checker "tur"
                prio "const"
                rr_weight "uniform"
        } 
}
^C

[root@DkNode ~]# systemctl start multipathd
[root@DkNode ~]# mpathconf 
multipath is enabled
find_multipaths is enabled
user_friendly_names is enabled
dm_multipath module is loaded
multipathd is running

[root@DkNode ~]# vi /etc/iscsi/iscsid.conf
# uncomment below line
node.session.auth.authmethod = CHAP
[...]
# uncomment and edit username and password if required
node.session.auth.username = username
node.session.auth.password = password

[root@DkNode ~]# systemctl restart iscsid 
[root@DkNode ~]# systemctl enable iscsid
# One way to discover and log in to the iSCSI targets
[root@DkNode ~]#  iscsiadm -m discovery -t st -p Node1 -l 
Node1:3260,1 iqn.2016-06.org.gluster:Node1
Logging in to [iface: default, target: iqn.2016-06.org.gluster:Node1, portal: Node1,3260] (multiple)
Login to [iface: default, target: iqn.2016-06.org.gluster:Node1, portal: Node1,3260] successful.

[root@DkNode ~]#  iscsiadm -m discovery -t st -p Node2 -l 
Node2:3260,1 iqn.2016-06.org.gluster:Node2
Logging in to [iface: default, target: iqn.2016-06.org.gluster:Node2, portal: Node2,3260] (multiple)
Login to [iface: default, target: iqn.2016-06.org.gluster:Node2, portal: Node2,3260] successful.

[root@DkNode ~]#  iscsiadm -m discovery -t st -p Node3 -l 
Node3:3260,1 iqn.2016-06.org.gluster:Node3
Logging in to [iface: default, target: iqn.2016-06.org.gluster:Node3, portal: Node3,3260] (multiple)
Login to [iface: default, target: iqn.2016-06.org.gluster:Node3, portal: Node3,3260] successful.

# Here you can see three paths to the same device
[root@DkNode ~]# multipath -ll
mpatha (36001405cdc1e292c21a41ceaa3fd4965) dm-2 LIO-ORG ,TCMU device     
size=8.0G features='0' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 2:0:0:0 sda 8:0  active ready running
  |- 3:0:0:0 sdb 8:16 active ready running
  `- 4:0:0:0 sdc 8:32 active ready running

[root@DkNode ~]# lsblk
NAME                       MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda                          8:0    0    8G  0 disk  
└─mpatha                   253:2    0    8G  0 mpath 
sdb                          8:16   0    8G  0 disk  
└─mpatha                   253:2    0    8G  0 mpath 
sdc                          8:32   0    8G  0 disk  
└─mpatha                   253:2    0    8G  0 mpath 
[...]

Some ways to check that multipathing is working as expected:

# From sda, sdb and sdc, check that the wwn's are the same and that the backing nodes are the ones expected

[root@DkNode ~]# sg_inq -i /dev/sda 
VPD INQUIRY: Device Identification page
  Designation descriptor number 1, descriptor length: 49
    designator_type: T10 vendor identification,  code_set: ASCII
    associated with the addressed logical unit
      vendor id: LIO-ORG 
      vendor specific: cdc1e292-c21a-41ce-aa3f-d49658633bdf
  Designation descriptor number 2, descriptor length: 20
    designator_type: NAA,  code_set: Binary
    associated with the addressed logical unit
      NAA 6, IEEE Company_id: 0x1405
      Vendor Specific Identifier: 0xcdc1e292c
      Vendor Specific Identifier Extension: 0x21a41ceaa3fd4965
      [0x6001405cdc1e292c21a41ceaa3fd4965]
  Designation descriptor number 3, descriptor length: 51
    designator_type: vendor specific [0x0],  code_set: ASCII
    associated with the addressed logical unit
      vendor specific: glfs/nonshared-store@Node1/app-store.img

[root@DkNode ~]# sg_inq -i /dev/sdb
VPD INQUIRY: Device Identification page
  Designation descriptor number 1, descriptor length: 49
    designator_type: T10 vendor identification,  code_set: ASCII
    associated with the addressed logical unit
      vendor id: LIO-ORG 
      vendor specific: cdc1e292-c21a-41ce-aa3f-d49658633bdf
  Designation descriptor number 2, descriptor length: 20
    designator_type: NAA,  code_set: Binary
    associated with the addressed logical unit
      NAA 6, IEEE Company_id: 0x1405
      Vendor Specific Identifier: 0xcdc1e292c
      Vendor Specific Identifier Extension: 0x21a41ceaa3fd4965
      [0x6001405cdc1e292c21a41ceaa3fd4965]
  Designation descriptor number 3, descriptor length: 52
    designator_type: vendor specific [0x0],  code_set: ASCII
    associated with the addressed logical unit
      vendor specific: glfs/nonshared-store@Node2/app-store.img

[root@DkNode ~]# sg_inq -i /dev/sdc
VPD INQUIRY: Device Identification page
  Designation descriptor number 1, descriptor length: 49
    designator_type: T10 vendor identification,  code_set: ASCII
    associated with the addressed logical unit
      vendor id: LIO-ORG 
      vendor specific: cdc1e292-c21a-41ce-aa3f-d49658633bdf
  Designation descriptor number 2, descriptor length: 20
    designator_type: NAA,  code_set: Binary
    associated with the addressed logical unit
      NAA 6, IEEE Company_id: 0x1405
      Vendor Specific Identifier: 0xcdc1e292c
      Vendor Specific Identifier Extension: 0x21a41ceaa3fd4965
      [0x6001405cdc1e292c21a41ceaa3fd4965]
  Designation descriptor number 3, descriptor length: 51
    designator_type: vendor specific [0x0],  code_set: ASCII
    associated with the addressed logical unit
      vendor specific: glfs/nonshared-store@Node3/app-store.img

[root@DkNode ~]# ls -l /dev/mapper/
total 0
[...]
lrwxrwxrwx. 1 root root       7 Jun 21 12:13 mpatha -> ../dm-2

[root@DkNode ~]# sg_inq -i /dev/mapper/mpatha 
VPD INQUIRY: Device Identification page
  Designation descriptor number 1, descriptor length: 49
    designator_type: T10 vendor identification,  code_set: ASCII
    associated with the addressed logical unit
      vendor id: LIO-ORG 
      vendor specific: cdc1e292-c21a-41ce-aa3f-d49658633bdf
  Designation descriptor number 2, descriptor length: 20
    designator_type: NAA,  code_set: Binary
    associated with the addressed logical unit
      NAA 6, IEEE Company_id: 0x1405
      Vendor Specific Identifier: 0xcdc1e292c
      Vendor Specific Identifier Extension: 0x21a41ceaa3fd4965
      [0x6001405cdc1e292c21a41ceaa3fd4965]
  Designation descriptor number 3, descriptor length: 51
    designator_type: vendor specific [0x0],  code_set: ASCII
    associated with the addressed logical unit
      vendor specific: glfs/nonshared-store@Node3/app-store.img

Here we partition the target device and mount it locally on the iSCSI initiator/Docker host.

# partition the disk file
[root@DkNode ~]# sgdisk -n 1:2048 /dev/mapper/mpatha 
Creating new GPT entries.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.

# now commit the changes to disk
[root@DkNode ~]# partprobe /dev/mapper/mpatha

# check if that appeared as expected
[root@DkNode ~]# lsblk
NAME                       MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda                          8:0    0    8G  0 disk  
└─mpatha                   253:2    0    8G  0 mpath 
  └─mpatha1                253:3    0    8G  0 part  
sdb                          8:16   0    8G  0 disk  
└─mpatha                   253:2    0    8G  0 mpath 
  └─mpatha1                253:3    0    8G  0 part  
sdc                          8:32   0    8G  0 disk  
└─mpatha                   253:2    0    8G  0 mpath 
  └─mpatha1                253:3    0    8G  0 part  
[...]

[root@DkNode ~]# sgdisk -p /dev/mapper/mpatha
Disk /dev/mapper/mpatha: 16777216 sectors, 8.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 71183D83-4290-41F4-8EF9-69B3D14495F8
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 16777182
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        16777182   8.0 GiB     8300  

[root@DkNode ~]# ls -l /dev/mapper/
total 0
[...]
lrwxrwxrwx. 1 root root       7 Jun 21 12:23 mpatha -> ../dm-2
lrwxrwxrwx. 1 root root       7 Jun 21 12:23 mpatha1 -> ../dm-3

# format the target device
[root@DkNode ~]#  mkfs.xfs /dev/mapper/mpatha1 
meta-data=/dev/mapper/mpatha1    isize=512    agcount=4, agsize=524223 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0
data     =                       bsize=4096   blocks=2096891, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount the iSCSI target

[root@DkNode ~]# mkdir /root/nonshared-store/

[root@DkNode ~]# mount /dev/mapper/mpatha1 /root/nonshared-store/

[root@DkNode ~]# df -Th
Filesystem                         Type      Size  Used Avail Use% Mounted on
[...]
/dev/mapper/mpatha1                xfs       8.0G   33M  8.0G   1% /root/nonshared-store/
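
If you want this mount to come back after a reboot, an fstab entry with the _netdev option (so it waits for the network and the iSCSI sessions) is the usual approach; a sketch of mine, assuming the mapping keeps the name mpatha1:

[root@DkNode ~]# cat >> /etc/fstab
/dev/mapper/mpatha1  /root/nonshared-store  xfs  _netdev  0 0
^C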

Let me point out that, for us, the iSCSI initiator and the Docker host are one and the same machine.

Setting up Docker host and container

[root@DkNode ~]# touch /root/nonshared-store/{1..10}

[root@DkNode ~]# ls /root/nonshared-store/
1 10 2 3 4 5 6 7 8 9

[root@DkNode ~]# cat > /etc/yum.repos.d/docker.repo
[dockerrepo]
name=Docker Repository
baseurl=https://yum.dockerproject.org/repo/main/fedora/$releasever/
enabled=1
gpgcheck=1
gpgkey=https://yum.dockerproject.org/gpg
^c

[root@DkNode ~]# dnf install docker-engine

[root@DkNode ~]# systemctl start docker

# create a container
[root@DkNode ~]# docker run --name bindmount -v /root/nonshared-store/:/mnt:z -t -i fedora /bin/bash

--name         Assign a name to the container
-v             Create a bind mount, specified as [[HOST-DIR:]CONTAINER-DIR[:OPTIONS]].
               Option 'z' sets the SELinux label (instead of ':z' you are free to run 'setenforce 0')
-t             Allocate a pseudo-TTY
-i             Keep STDIN open (Interactive)

# docker interactive tty is here for us
[root@5bbb1e4cb8f8 /]# ls /mnt/
1 10 2 3 4 5 6 7 8 9

[root@5bbb1e4cb8f8 /]# df -Th
Filesystem                               Type   Size  Used Avail Use% Mounted on
[...]
/dev/mapper/mpatha1                      xfs    8.0G   33M  8.0G   1% /mnt
/dev/mapper/fedora_dhcp42--88-root       xfs     15G  1.2G   14G   8% /etc/hosts
shm                                      tmpfs   64M     0   64M   0% /dev/shm
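
Back on the Docker host (in a separate shell), you can confirm how the bind mount was wired into the container; the Mounts section shows the host directory, the destination inside the container and the rw/SELinux options:

[root@DkNode ~]# docker inspect -f '{{ .Mounts }}' bindmount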

Conclusion

This just showcases how Gluster can be used as a distributed block store for containers. More details about high availability, integration with Kubernetes, etc. will follow in future posts.

References

http://domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf

http://rootfs.github.io/iSCSI-Kubernetes/

http://blog.gluster.org/2016/04/using-lio-with-gluster/

https://docs.docker.com/engine/tutorials/dockervolumes/

http://scst.sourceforge.net/scstvslio.html

http://events.linuxfoundation.org/sites/events/files/slides/tcmu-bobw_0.pdf

https://www.kernel.org/doc/Documentation/target/tcmu-design.txt

https://lwn.net/Articles/424004/

http://www.gluster.org/community/documentation/index.php/GlusterFS_Documentation