Private Cloud Lessons

2020/12/19


In March 2017 I implemented OpenStack for one of our customers. I had spent five months learning OpenStack and planning the implementation. Those five months of work made the deployment easier, reducing it to around 70 minutes. The deployment was minimal and used only three services: compute instances, block storage for compute instances, and object storage.

August 2018 - OpenStack failed for the customer.

I had moved on to a different company. One evening I started getting calls from the customer saying that the OpenStack instances were just not working. I wasn't the one they should have been calling first, but after a year they had decided they didn't need support for OpenStack. After a call with my previous employer and the customer, I got to look at OpenStack late that night to figure out what had happened.

I was exhausted by 2:30am and had to sleep. I woke up at 5am and started looking into the logs again. This time I decided to look at the Ceph storage, which was at 90% usage, something I had ignored earlier. The instantiation logs revealed I/O errors from the block storage backend, so I had to dig deeper into Ceph's health. After looking at the detailed health output and the configuration parameters, I was able to figure out that all OSDs had reached the full ratio, which prevented I/O operations on Ceph.

The solution was simply to add another equal-sized Ceph node and rebalance the OSDs. Other things followed later.
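The full-ratio condition is easier to picture with a small sketch. The snippet below classifies each OSD's utilization against three thresholds. The values 0.85, 0.90 and 0.95 are Ceph's documented defaults for the nearfull, backfillfull and full ratios, but I am only assuming them here; the customer's cluster may have been configured differently. The `Osd` class, the `classify` function and the sample numbers are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Assumed thresholds: Ceph's documented defaults, not values taken
# from the customer's cluster.
NEARFULL_RATIO = 0.85      # cluster starts warning
BACKFILLFULL_RATIO = 0.90  # backfill onto this OSD is refused
FULL_RATIO = 0.95          # writes are blocked


@dataclass
class Osd:
    """Hypothetical record of one OSD's capacity, in bytes."""
    name: str
    used_bytes: int
    total_bytes: int

    @property
    def utilization(self) -> float:
        return self.used_bytes / self.total_bytes


def classify(osd: Osd) -> str:
    """Return the most severe threshold this OSD has crossed."""
    u = osd.utilization
    if u >= FULL_RATIO:
        return "full"
    if u >= BACKFILLFULL_RATIO:
        return "backfillfull"
    if u >= NEARFULL_RATIO:
        return "nearfull"
    return "ok"


def cluster_report(osds: list[Osd]) -> dict[str, str]:
    """Map each OSD name to its state."""
    return {osd.name: classify(osd) for osd in osds}


if __name__ == "__main__":
    # Made-up sample data, in arbitrary units.
    sample = [
        Osd("osd.0", used_bytes=96, total_bytes=100),
        Osd("osd.1", used_bytes=91, total_bytes=100),
        Osd("osd.2", used_bytes=50, total_bytes=100),
    ]
    for name, state in cluster_report(sample).items():
        print(name, state)
```

On a real cluster the same information comes from `ceph health detail` and `ceph osd df`. The point of the sketch is only that the warning states arrive well before writes stop, so there is time to act if someone is watching.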

Lessons I learned

I learned from this project both as a vendor and as someone who got to witness the customer's business loss.

Plan for all phases (Architecture, Deployment, Maintenance).

I just wanted a working, highly available OpenStack. I paid so much attention to architecture and deployment that I missed the maintenance phase of the lifecycle.

Not a virtualization solution.

I didn't take it seriously when I first started noticing the customer's cloud usage habits. A private cloud is not a virtualization solution where the resources are free. The customer didn't have to pay a bill for keeping unused instances running, as they would on AWS, but this led to storage space being used up, which ended in a business loss.

Ignoring monitoring alerts.

WTF!

Don't discontinue support.

Just because you have hired an expert doesn't mean you can stop support from the vendor.

Learn to analyse and adapt your operations.

You don't always need a new product to solve a problem. You just need to look at your existing solution, find out what or where it has caused a problem, and fix it. Pull the lever, Kronk!

Individual goals of corporate world.

Use technology to reach business/organization goals, not to attain a technical advantage over other teams.

Other info about that project

Tools that I used: