Bonding Fetishism
Glancing over my notes from the past few days, I see a lot of words like bonding, slave, master, … You might think we are organizing orgies here all the time. Unfortunately we are not; we are just crazy about HA – High Availability – in this case for Infiniband (IB) devices. How can we set up HA for these devices on OEL/OVS? This article covers the basic information needed to set up HA devices on Linux kernel based distros running openibd.
In the previous blog we talked about IB partition setup and skimped over the networking side a bit. Here we will do it in detail and correctly.
_.:Bonding… Everything… Everywhere:._
The basic idea behind bonding is to fail over to another slave device when the primary one goes down. Typically we bond something like ibX and ibY into bondZ. The bondZ device carries the IP address/netmask and is served by ibX or ibY respectively.
To show this in practice, first ssh to any compute node running OVS in the Exalogic machine and look at the ifconfig output:
bond0     Link encap:Ethernet  HWaddr 00:21:28:D6:00:18
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:3319758 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1714742 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:4944329941 (4.6 GiB)  TX bytes:117305870 (111.8 MiB)

bond1     Link encap:InfiniBand  HWaddr 80:00:05:4A:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:192.168.10.1  Bcast:192.168.10.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:2402 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1784 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:176368 (172.2 KiB)  TX bytes:266004 (259.7 KiB)

bond2     Link encap:InfiniBand  HWaddr 80:00:05:4C:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:192.168.20.1  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:2863 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3993 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:519096 (506.9 KiB)  TX bytes:330900 (323.1 KiB)
...snip...
And so on … Each of these bondX master devices always has slave devices behind it:
# ip link | grep '\<bond1\>'
6: ib0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond1 state UP qlen 256
7: ib1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond1 state UP qlen 256
17: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
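By the way, the bonding driver also tells us which slave is currently carrying the traffic and whether the MII link checks pass; this is a generic Linux bonding check (nothing IB-specific) via procfs:

# grep -E 'Bonding Mode|Currently Active Slave|MII Status' /proc/net/bonding/bond1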
In this case bond1, the master, has two slaves: the ib0 and ib1 devices. Let's take a look at bond2:
# ip link | grep '\<bond2\>'
8: ib0.8001: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond2 state UP qlen 256
12: ib1.8001: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond2 state UP qlen 256
18: bond2: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
Master device bond2 is slaved by ib0.8001 and ib1.8001. These are child devices of ib0 and ib1 respectively. How are these created at bootup? The answer is in /etc/sysconfig/network-scripts:
# cd /etc/sysconfig/network-scripts/
# ls ifcfg-*
ifcfg-bond0  ifcfg-bond2  ifcfg-bond4  ifcfg-eth0  ifcfg-eth2  ifcfg-ib0       ifcfg-ib0.8002  ifcfg-ib0.8004  ifcfg-ib1.8001  ifcfg-ib1.8003  ifcfg-lo
ifcfg-bond1  ifcfg-bond3  ifcfg-bond5  ifcfg-eth1  ifcfg-eth3  ifcfg-ib0.8001  ifcfg-ib0.8003  ifcfg-ib1       ifcfg-ib1.8002  ifcfg-ib1.8004  ifcfg-xenbr0
Now examine, for example, bond2 and its ib[01].8001 slaves:
# cat *bond2
DEVICE=bond2
BONDING_OPTS="mode=1 miimon=250 use_carrier=1 primary=ib0.8001"
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.20.1
NETMASK=255.255.255.0
We can see how the primary device is selected; now take a look at the slaves:
# cat *ib0.8001
DEVICE=ib0.8001
BOOTPROTO=none
ONBOOT=yes
MASTER=bond2
SLAVE=yes
This is the basic picture of how bonding is set up at boot. In this example we looked at the so-called Exalogic Admin partition. There are a couple of things we can investigate here:
- What is the magic behind the scripts that creates the IB child devices? The answer is simple: /etc/init.d/openibd, which does roughly the following:
# paraphrased from /etc/init.d/openibd – for every ibX.YYYY child found among the
# ifcfg files in the network-scripts directory ($i is the parent device, $ch_i the child):
pkey=0x${ch_i##*.}
if [ ! -e /sys/class/net/${i}.${ch_i##*.} ] ; then
    echo $pkey > /sys/class/net/${i}/create_child
fi
bring_up $ch_i
RC=$?
- How do we disable zero-conf (the 169.254.0.0 route)?
# grep -A2 ZEROCONF /etc/sysconfig/network-scripts/ifup-eth
if [ -z "${NOZEROCONF}" -a "${ISALIAS}" = "no" -a "${REALDEVICE}" != "lo" ]; then
    ip route replace 169.254.0.0/16 dev ${REALDEVICE}
fi
Meaning that setting NOZEROCONF to any non-empty value in ifcfg-* disables the zero config.
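For example, to switch it off for the bond we just inspected (the variable name comes straight from the ifup-eth snippet above; any non-empty value will do):

# echo 'NOZEROCONF=yes' >> /etc/sysconfig/network-scripts/ifcfg-bond2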
_.:Putting it All Together:._
The correct configuration for the IB partition from the previous blog would be (assuming bond2 is the first free bond device available on the system):
# cd /etc/sysconfig/network-scripts
# cat > ifcfg-bond2 <<EOS
DEVICE=bond2
BONDING_OPTS="mode=1 miimon=250 use_carrier=1 primary=ib0.8033"
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.33.15
NETMASK=255.255.255.0
EOS
# cat > ifcfg-ib0.8033 <<EOS
DEVICE=ib0.8033
BOOTPROTO=none
ONBOOT=yes
MASTER=bond2
SLAVE=yes
EOS
# cat > ifcfg-ib1.8033 <<EOS
DEVICE=ib1.8033
BOOTPROTO=none
ONBOOT=yes
MASTER=bond2
SLAVE=yes
EOS
That’s it! Now reboot … and pray 🙂 !
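If a reboot is not an option, roughly the same result can usually be reached by hand – a rough, untested sketch that mimics what openibd does at boot (create the child devices, then bring things up through the ifcfg files we just wrote):

# echo 0x8033 > /sys/class/net/ib0/create_child
# echo 0x8033 > /sys/class/net/ib1/create_child
# ifup bond2
# ifup ib0.8033
# ifup ib1.8033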
Rendezvous with Sun QDR Infiniband Gateway Switch inside Exalogic
This is the last article in the basic introduction to Exalogic components; in the next one we will look more under the hood. We can do a lot of things with this switch, and today we will show how to create Infiniband partitions to isolate network traffic at the IB layer.
_.:Introducing Infiniband Partitions:._
The basic idea behind IB partitions is secure isolation on the IB fabric. Similar to VLANs, we define IB partitions for ports – in this case gateway ports and also Host Channel Adapter (HCA) ports. The important parameter is the partition key, P_Key, which identifies the partition. In order to isolate something or someone, an IB partition needs members, and members are identified by port GUID. Each member has either full or limited membership: a full member can access all other members of the partition (full and limited), whereas a limited member cannot access other limited members. This is useful, e.g., for the Storage Appliance (SA) network, where the SA's HCA GUIDs get full membership while all the clients get limited membership in the partition. Such a partition really exists in Exalogic Virtual today and is called the Storage IB partition.

Limited and full membership does not apply to ports only; it applies to the partition key as well. Since the P_Key is itself a number, full membership is denoted by the Most Significant Bit (MSB) being set and limited membership by the MSB being unset. The most typical use for MSB/non-MSB P_Keys is in the virtual environment, where keys are propagated to dom0. If a P_Key with the MSB set is present in dom0, a VM can use it and become a full member of the partition; if a P_Key without the MSB is present, a VM can use it for limited membership. Both can also be present, in which case the VM can choose between limited and full membership in the partition denoted by that P_Key.
Here are two example P_Key identifiers: 0x8033 and 0x0033. They differ only in the MSB, which is set in the first and unset in the second.
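To make the relationship explicit, the full-membership key is simply the limited one with bit 15 set – plain shell arithmetic, nothing Infiniband-specific:

# printf '0x%04x\n' $(( 0x0033 | 0x8000 ))
0x8033
# printf '0x%04x\n' $(( 0x8033 & 0x7fff ))
0x0033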
_.:Creating IB Partitions for Physical Nodes:._
We will create an IB partition for two compute nodes running OEL 5.6 (physical) and for the SA; the compute nodes will be NFS clients of the SA on this partition. We choose full membership for the SA and limited membership for the compute nodes. This means all compute nodes can communicate with the SA (as NFS clients) but cannot talk to each other (at least not over this partition). The SA, on the other hand, can talk to anyone in this partition.
An IB partition is always created on the switch. We can determine how many switches we have in the Exalogic system by issuing the ibswitches command:
CN01# ibswitches
Switch : 0x002128548002c0a0 ports 36 "SUN IB QDR GW switch el01gw02 192.168.4.202" enhanced port 0 lid 15 lmc 0
Switch : 0x002128547b82c0a0 ports 36 "SUN IB QDR GW switch el01gw01 192.168.4.201" enhanced port 0 lid 6 lmc 0
This is a quarter rack with 2 IB switches. Both of them should be running the Subnet Manager (SM), and since they operate as a cluster, one of them is the master SM. To determine which one is the master, we ssh to one of them and issue the getmaster command:
CN01# ssh 192.168.4.201
Now on the switch …
[root@el01gw01-c ~]# getmaster
Local SM enabled and running
20120112 17:54:48 Master SubnetManager on sm lid 6 sm guid 0x2128547b82c0a0 : SUN IB QDR GW switch el01gw01 192.168.4.201
and we are in luck – this is the master one! Now we can start an smpartition transaction and create the partition:
# smpartition start
# smpartition create -n Mixaal_Storage -pkey 0x8033 -flag ipoib -m full
Before we commit we can use the command smpartition list modified to see if we created what we wanted:
# yes | smpartition list modified | tail -1
Mixaal_Storage = 0x8033,ipoib,defmember=full: ;
The ipoib flag is essential here: NFS runs on top of the TCP/IP stack, so we need IPoIB. The -m full option sets the default membership for newly added ports. I would recommend always adding members with explicit membership rather than relying on the default defined here. BTW, if we create the P_Key as 0x8033 on the switch, we should not use 0x0033 as the P_Key of another partition (and vice versa) – they are the same key, differing only in the membership bit.
Now commit the partition:
# smpartition commit
We can list all active partitions with the smpartition list active command. Now return to the compute node and determine which GUIDs we need to add. We will use setup-ssh.sh to set up password-less ssh, dcli to run a distributed cli command, and finally ibstat to provide the info we need:
CN01# /opt/exalogic.tools/tools/setup-ssh.sh -H 192.168.1.1 -P xxxxxxx
CN01# /opt/exalogic.tools/tools/setup-ssh.sh -H 192.168.1.2 -P xxxxxxx
CN01# /opt/exalogic.tools/tools/dcli -c 192.168.1.1,192.168.1.2 ibstat | grep 'Port GUID'
192.168.1.1: Port GUID: 0x0021280001a0a44d
192.168.1.1: Port GUID: 0x0021280001a0a44e
192.168.1.2: Port GUID: 0x0021280001a0a3dd
192.168.1.2: Port GUID: 0x0021280001a0a3de
We also need the ZFS SA port GUIDs. Ssh to the ZFS SA, go to configuration, net, devices, select ibp0 and show:
el01sn01:> configuration
el01sn01:configuration> net
el01sn01:configuration net> devices
el01sn01:configuration net devices> select ibp0
el01sn01:configuration net devices ibp0> show
Properties:
                         speed = 32000 Mbit/s
                            up = true
                        active = false
                         media = Infiniband
                   factory_mac = not available
                          port = 1
                          guid = 0x212800013e8fbf
Do the same for ibp1 device. These ports will be added with full membership. In general case (not the case here) we might need to add BridgeX Ports – on the switch, issue the following command and look for Bridge-* lines:
# showgwports -v
Now we add the compute node port GUIDs to the partition with limited membership, and the ibp0 and ibp1 GUIDs with full membership, so back on the switch:
# smpartition start
# smpartition add -pkey 0x8033 -port 0x0021280001a0a44d 0x0021280001a0a44e 0x0021280001a0a3dd 0x0021280001a0a3de -m limited
# smpartition add -pkey 0x8033 -port 0x212800013e8fbf 0x212800013e8fc0 -m full
# yes | smpartition list modified | grep -A10 Mixaal
Mixaal_Storage = 0x8033,ipoib,defmember=full:
        0x00212800013e8fbf=full,
        0x00212800013e8fc0=full,
        0x0021280001a0a44d=limited,
        0x0021280001a0a44e=limited,
        0x0021280001a0a3dd=limited,
        0x0021280001a0a3de=limited;
# smpartition commit
For now we are done on the switch, and we move on to the ZFS Storage Appliance BUI. In the previous blog we determined the EoIB IP address to be 10.240.13.231; we can access it as an https URL from the browser:
firefox https://10.240.13.231:215/
After providing credentials we go to Configuration >> Networking, where a screen showing the devices, datalinks and interfaces should appear.
Drag the ibp0 device to the Datalinks add widget (denoted by a plus symbol); a new configuration window will appear, where we enter the datalink name (ibp.8033) and the partition key (8033). Do the same for the ibp1 device.
The datalinks should show a chain symbol; a grey dot means we did not create the partition on the switch correctly, e.g. we specified the wrong port GUIDs. Click the Apply button in the top right corner.
For the datalinks we do the same thing we did for the devices: drag and drop the datalinks to create new interfaces, ib0.8033 and ib1.8033. The IP address will be 0.0.0.0/8, statically configured. The last step is to create an IP MultiPathing group from these two interfaces: click to add a new interface and fill in the information as follows:
We select a name that denotes the IB partition, assign the IP address the NFS server will bind to, and check the IP MultiPathing Group box so we can select the two interfaces created before. What remains? We need to configure the network on the compute nodes. Since this article is becoming a long one, we will do it the quick way and come back to it in the future.
Determine the active ib device:
# ifconfig ib0
# ifconfig ib1
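If the full ifconfig output is too noisy, a quick grep narrows it down to the counters we care about (a plain Linux check, nothing Exalogic-specific):

# for d in ib0 ib1; do echo "== $d =="; ifconfig $d | grep 'RX bytes'; done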
The active device will have non-zero Rx/Tx bytes. For the active device, configure the network:
CN01 # echo 0x8033 > /sys/class/net/ib0/create_child
CN01 # ifconfig -a | grep -A6 8033
ib0.8033  Link encap:InfiniBand  HWaddr 80:50:05:4C:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          BROADCAST MULTICAST  MTU:2044  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
CN01 # ifconfig ib0.8033 192.168.33.1
We can do the same for compute node 2. Now we should be able to ping the shared storage and mount it on the compute nodes:
CN01 # umount /mnt/mixaal
CN01 # mount 192.168.33.15:/export/mixaal /mnt/mixaal
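If the mount hangs or fails, verify the IPoIB connectivity first (e.g. ping -c 3 192.168.33.15 over the new partition). To make the mount survive a reboot, it can also go into /etc/fstab – a minimal sketch, tune the NFS options to your taste:

192.168.33.15:/export/mixaal  /mnt/mixaal  nfs  defaults  0 0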
Well, now we have isolated, secured network traffic carrying the NFS data, and no one else can sniff it. That's nice, isn't it 🙂 ?
Interview with Exalogic ZFS Storage Appliance
As in every soap opera, we should recall what happened in the last episode: we briefly introduced the Exalogic machine and promised a more detailed look at its components. Today we will chat with the ZFS SA, which is a Sun 7320 ZFS Storage Appliance.
_.:Intro:._
The Sun 7320 ZFS SA is based on Solaris 11. From a hardware point of view we have 2x Xeon E5620 CPUs running at 2.4GHz; each CPU has 4 cores, so with Intel HyperThreading enabled we see 16 logical cores, and there are 24 GB of memory per controller. The NFS server supports the v4, v3 and v2 protocols; on the block level we have e.g. iSCSI, Fibre Channel, IP over Infiniband and RDMA over Infiniband. We can manage the appliance over the HTTPS BUI (Browser User Interface), ssh (which we will take a look at today), SNMP and IPMI (console management).
_.:Appliance Management over SSH:._
The storage appliance has a sexy BUI, but today we use the appliance shell, which runs as a process on top of Solaris 11. After ssh-ing to the SA we should see the appliance prompt, where we can enter commands:
el01sn01:> help
Subcommands that are valid in this context:

  configuration  => Perform configuration actions
  maintenance    => Perform maintenance actions
  raw            => Make raw XML-RPC calls
  analytics      => Manage appliance analytics
  status         => View appliance status
  shares         => Manage shares
  help [topic]   => Get context-sensitive help. If [topic] is specified, it must be
                    one of "builtins", "commands", "general", "help", "script" or
                    "properties".
  show           => Show information pertinent to the current context
  get [prop]     => Get value for property [prop]. ("help properties" for valid
                    properties.) If [prop] is not specified, returns values for all
                    properties.
  set [prop]     => Set property [prop] to [value]. ("help properties" for valid
                    properties.) For properties taking list values, [value] should
                    be a comma-separated list of values.
The help is context sensitive. This is handy since we only get the information we can use in the given context. Let’s take a look at the configuration menu:
el01sn01:> configuration
el01sn01:configuration> show
Children:
        net => Configure networking
        services => Configure services
        version => Display system version
        users => Configure administrative users
        roles => Configure administrative roles
        preferences => Configure user preferences
        alerts => Configure alerts
        cluster => Configure clustering
        storage => Configure Storage
        san => Configure storage area networking
Let's see what interfaces are up and running; we will come back to the interfaces section from the BUI in one of the upcoming episodes when setting up Infiniband partitions:
el01sn01:configuration> net
el01sn01:configuration net> show
Children:
        datalinks => Manage datalinks
        devices => Manage physical devices
        interfaces => Manage IP interfaces
        routing => Manage routing configuration
el01sn01:configuration net> interfaces
el01sn01:configuration net interfaces> show
Interfaces:
INTERFACE   STATE    CLASS LINKS       ADDRS              LABEL
aggr1       up       ip    aggr1       10.240.13.231/21   DR
igb0        up       ip    igb0        192.168.1.15/24    igb0
igb1        offline  ip    igb1        192.168.1.16/24    igb1
ipmp1       up       ipmp  pffff_ibp1  192.168.10.15/24   ipmp1
                           pffff_ibp0
ipmp2       up       ipmp  p8001_ibp0  192.168.20.9/24    IB_IF_8001
                           p8001_ibp1
ipmp3       up       ipmp  p8002_ibp0  192.168.21.9/24    IB_IF_8002
                           p8002_ibp1
p8001_ibp0  up       ip    p8001_ibp0  0.0.0.0/8          ibp0.8001
p8001_ibp1  up       ip    p8001_ibp1  0.0.0.0/8          ibp1.8001
p8002_ibp0  up       ip    p8002_ibp0  0.0.0.0/8          ibp0.8002
p8002_ibp1  up       ip    p8002_ibp1  0.0.0.0/8          ibp1.8002
pffff_ibp0  up       ip    pffff_ibp0  0.0.0.0/8          ibp0
pffff_ibp1  up       ip    pffff_ibp1  0.0.0.0/8          ibp1
There are a couple of IB partitions created on the switch and served by the ibpN.8NNN interfaces (8001, 8002); the NFS server is also listening on a publicly accessible EoIB address (10.240.13.231, 10Gbps access) and on the private IB interfaces (40Gbps).
To avoid colliding with other users and polluting the shared space, we will create a project. The project has its own mountpoint, and we can define separate permissions, quotas and other attributes for it:
el01sn01:> shares
el01sn01:shares> show
Properties:
                          pool = exalogic

Projects:
      ExalogicControl
      NODE_4
      NODE_5
      NODE_6
      NODE_7
      OVM
      common
      default
      elp_dev
      elp_qa
      patches
      remote-zfs

Children:
        replication => Manage remote replication
        schema => Define custom property schema
Now we create our mixaal project and commit it:
el01sn01:shares> project mixaal
el01sn01:shares mixaal (uncommitted)> commit
Using select we move into the newly created project, where we must create a filesystem:
el01sn01:shares> select mixaal
el01sn01:shares mixaal> filesystem mixaal
el01sn01:shares mixaal/mixaal (uncommitted)> commit
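As a side note, the per-project attributes mentioned above (quotas etc.) can be set from the same shell. A hedged sketch from the project context – I am quoting the property name from memory, so verify it with "help properties" on your release:

el01sn01:shares mixaal> set quota=100G
el01sn01:shares mixaal> commit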
The show command will display our newly created project:
el01sn01:shares mixaal> show
Back on the compute node we can mount our newly created share (we left the default permissions for users):
CN01# mkdir /mnt/mixaal
CN01# mount 192.168.1.15:/export/mixaal /mnt/mixaal
CN01# df -h /mnt/mixaal
This shows the space we have on the storage. That's it for today; tomorrow we will take a look at the Sun QDR Infiniband Gateway Switch.
Exalogic – Amazingly Fast Animal
_.:Intro:._
Oracle released Exalogic 2.0 Physical in December and is about to release Exalogic 2.0 Virtual this calendar year. Let us have a look at what is inside this incredible rack.
Exalogic is a platform for customers who really need power. The full rack contains 30 compute nodes (X4170 M2) with Mellanox Infiniband cards, a shared storage appliance (ZFS, 7320), an Infiniband fabric for fast communication (40Gbps), a management network, and Sun QDR Infiniband gateway switches with fast 40Gbps Infiniband ports and 10Gbps Ethernet ports used for Ethernet over Infiniband communication. (There are more things in the rack, but for a high-level overview this is enough.)
_.:Dating Exalogic:._
I'm a guy who likes to jump right into things, so let's have a speed date with a compute node today. In upcoming blogs of the Exalogic intro series we will take a look at the ZFS storage appliance, the Sun QDR Infiniband switch, and others as well.
For Exalogic 2.0 Virtual we are running the Oracle VM Server (OVS) 3.x series:
# cat /etc/ovs-release
Oracle VM server release 3.0.3
and for Physical we are running OEL 5.6. The total memory of each compute node is 96GB:
# grep MemTotal /proc/meminfo
MemTotal:       99191272 kB
We have 2 Xeon X5670 CPUs, each with 6 cores, resulting in 12 physical cores; with Intel HyperThreading enabled by default, 24 logical cores are available to the system:
# grep -A4 "processor.*23" /proc/cpuinfo # CPUs indexed from 0 processor : 23 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
The OS boots from a fast SSD disk; we have two of them:
# dmesg | grep "Direct-Access.*ATA" scsi 0:0:8:0: Direct-Access ATA SSDSA2SH032G1SB 8855 PQ: 0 ANSI: 5 scsi 0:0:9:0: Direct-Access ATA SSDSA2SH032G1SB 8855 PQ: 0 ANSI: 5
We also have commands related to the IB stack:
# ibstat
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x0021280001a0a44c
        System image GUID: 0x0021280001a0a44f
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 31
                LMC: 0
                SM lid: 6
                Capability mask: 0x02510868
                Port GUID: 0x0021280001a0a44d
                Link layer: IB
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 32
                LMC: 0
                SM lid: 6
                Capability mask: 0x02510868
                Port GUID: 0x0021280001a0a44e
                Link layer: IB
This shows that the two 40Gb/sec IB ports are up.
Typically we want shared storage mounted for our project, and usually we want it mounted over Infiniband:
# df -h /mnt/sa/
Filesystem                      Size  Used Avail Use% Mounted on
192.168.21.9:/export/qa_images   15T  369G   15T   3% /mnt/sa
We also have a couple of Exalogic-specific tools located in /opt/exalogic.tools/tools. Here are a few examples; there are more tools in this directory:
# /opt/exalogic.tools/tools/CheckSWProfile
[SUCCESS]........Has supported operating system
[SUCCESS]........Has supported processor
[SUCCESS]........Kernel is at the supported version
[SUCCESS]........Has supported kernel architecture
[SUCCESS]........Software is at the supported profile
This one checks the software versions and kernel architecture, etc. The following investigates the hardware and ILOM firmware:
# /opt/exalogic.tools/tools/CheckHWnFWProfile
Verifying Hardware...
System product name: SUN FIRE X4170 M2 SERVER
System product manufacturer: Oracle Corporation
...snip... output is quite long ;-)
Another useful command is dcli, which lets us run the same command on multiple compute nodes at once – quite handy given that the full rack has 30 of them.
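For example, reusing the -c host-list syntax we saw earlier (a sketch; the two IPs are just the first couple of nodes and assume password-less ssh is already set up via setup-ssh.sh), we can check the kernel on several nodes in one go:

# /opt/exalogic.tools/tools/dcli -c 192.168.1.1,192.168.1.2 uname -r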
Let us stop here for today; we will take a look at other Exalogic components in upcoming blogs.
Export VYM URLs as \input for LaTeX
Vym (View Your Mind) is a tool to generate and manipulate maps which show your thoughts. Such maps can help you improve your creativity and effectiveness. You can use them for time management, to organize tasks, to get an overview of complex contexts, to sort your ideas, etc. (quoted from the vym site). I have found it useful on many occasions; recently, for example, I was writing a somewhat longer piece of documentation, and vym helped me find the proper granularity for chapters, sections, subsections, subsub…, etc. However, there was no way to generate links from the map to the chapters of the LaTeX document I used to produce the final output. So I hijacked the URL field for this purpose and translated the URLs into \input{} LaTeX commands. That let me keep each chapter in a separate file and juggle the sections separately.
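A minimal sketch of the idea in shell (my assumptions: a .vym file is a zip archive containing the map XML, each branch stores its link in a url="..." attribute, and the file name notes.vym is just an example):

# pull every url="..." attribute out of the map XML and turn it into an \input{} line
unzip -p notes.vym "*.xml" \
  | grep -o 'url="[^"]*"' \
  | sed -e 's/^url="//' -e 's/"$//' \
  | while read -r u; do
      printf '\\input{%s}\n' "$u"
    done > chapters.tex

The resulting chapters.tex can then be \input into the master document, so reordering branches in vym reorders the chapters in the final PDF.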