
A day in the trenches with IT operations: How to create a more seamless practice

Traditionally, IT operators are responsible for ‘keeping the lights on’ in an IT organisation. This sounds simple, but the reality is harsh, with a great deal of complexity hidden behind the scenes. Furthermore, digital transformation trends are quickly shifting IT operations’ responsibility from ‘keeping the lights on’ to ‘keeping the business competitive’.

IT operators are now not only responsible for uptime, but also for the performance and quality of digital services provided by and to the business. To a large extent, maintaining available and high-performing digital services is precisely what it means to be digitally transformed.

I’ve spent my fair share of time as an MSP team lead, and on the operations floor in large IT organisations. The job of an enterprise IT operator is full of uncertainty. Let’s look at a typical day in the life of an IT operator, and how she addresses common challenges like:

  • Segregated monitoring and alerting tools causing confusion and unnecessary delays in troubleshooting
  • Resolving a critical issue quickly through creative investigations that go beyond analysing alert data
  • Legacy processes, such as those rooted in ITIL, working against the kind of open collaboration required to fix issues in the DevOps era

Starting the day with a critical application outage

Karen is a senior network analyst (L4 IT Operator) who works for a large global financial organisation. She is considered a subject matter expert (SME) in network load balancing, network firewalls, and application delivery. She is driving to the office when she gets a call informing her that a major banking application is down at her company. Every minute of downtime affects the bottom line of the business. She finds parking and rushes to her desk, only to find hundreds of alert emails queued in her inbox. The alerts are coming from an application monitoring tool she can’t access – more on that later.

The L1 operator walks to Karen’s desk in a distressed state. Due to the criticality of the app, the outage caused the various monitoring and logging tools to generate hundreds of incidents, all of which were assigned to Karen. She spends considerable time looking through the incidents with no end in sight. Karen logs on to her designated network connectivity, bandwidth analysis, load balancer and firewall uptime monitoring tools—none of which indicate any issues.

Yet the application is still down, so Karen decides that the best course of action is to ignore the alert flood and the monitoring metrics and tackle the problem head-on. She starts troubleshooting every link in the application chain, confirming that the firewall ports are open and that the load balancer is configured correctly. She crawls through dozens of long log files and finally, five hours later, discovers that the application servers behind the load balancer are unresponsive: bingo, the culprit has been identified.
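
In practice, that last check boils down to asking whether each pool member still accepts connections on the port the load balancer expects. Here is a minimal sketch of such a probe, assuming hypothetical hostnames and a hypothetical service port rather than anything from Karen’s environment:

```python
import socket

# Hypothetical backend pool and service port; substitute real values.
APP_SERVERS = ["app01.example.internal", "app02.example.internal"]
SERVICE_PORT = 8443
TIMEOUT_SECONDS = 3

def is_listening(host: str, port: int) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT_SECONDS):
            return True
    except OSError:
        return False

for server in APP_SERVERS:
    status = "OK" if is_listening(server, SERVICE_PORT) else "NOT RESPONDING"
    print(f"{server}:{SERVICE_PORT} -> {status}")
```

A probe like this would have surfaced the unresponsive servers in seconds; the point of the story is that Karen had to reach the same conclusion by manually crawling through logs instead.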

Root cause found: now more stalls

Next, Karen contacts the application team. The person responsible for the application is out of the office, so the application managers schedule a war room call for two hours later. Karen joins the call from home, along with 12 other individuals, most of whom she’s never worked with in her role.

The manager starts the call by tackling all angles of the issue. Karen, however, already knows that the issue was caused by two application servers. After a 30-minute discussion, Karen shares her screen and proves that the issue was caused by the app servers. After further investigation, the application team discovers that an approved change executed the night before had changed the application’s TCP port: a critical error on the application team’s part.

Later investigations showed that an APM (Application Performance Monitoring) tool had generated a relevant alert and an incident that could have helped solve the issue much more quickly. The alert was missed by the application team and, adding to the misery, the ITOps team didn’t have access to the APM system. Karen had no way of checking the APM tool’s telemetry (or its absence) directly.

A day later, the fix is applied

The application team requested approval for an emergency change so they could fix the application configuration file and restart the servers. The repair took less than 10 minutes, but the application had been down for almost 24 hours.

It is now 10pm on Monday. Karen is exhausted, having worked a 14-hour day with no breaks. How does the business measure the value of the time Karen spent resolving this outage? While her manager applauded her analytical skills, it wasn’t the best use of her specialised skill set and definitely not how she should have spent her day (and night).

Does this sound familiar?

I’m sure the story above resonates with IT operations professionals, and it is unfortunate that similar occurrences are so common.

Here are some takeaways:

  • The segregated monitoring and alerting tools did not provide operational value. That’s because the alerts and metrics are not centralised into a view shared by all the appropriate stakeholders, and aren’t mapped to the affected business service, in this case the banking application (see the sketch after this list)
  • Just because a tool generates alerts and incidents, it doesn’t necessarily help the user locate the root cause
  • A flood of uncorrelated alerts and incidents makes matters worse. Many operators spend a lot of time looking at irrelevant data, sifting through the noise by eye. Karen quickly decided to go to the source, the application that was down, but not all ITOps people will do that
  • Legacy processes (such as ITIL) are designed to restrain users from making abrupt changes by adding layers of process red tape. On the flip side, this prevents operators from fixing issues quickly when they arise. Karen did not have access to the application monitoring tool, nor was she allowed to communicate directly with the application team. She needed a manager to schedule a war room call. This hierarchy created costly delays which turned a five-to-ten-minute fix into an all-day outage
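
To make the first takeaway concrete, here is a minimal sketch, not any particular product, of what mapping raw alerts to a business service could look like: alerts arriving from separate tools are tagged with the service they affect and rolled up into a single consolidated view. The tool names, component names and service map below are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from monitored components to the business service they support.
SERVICE_MAP = {
    "lb-prod-01": "Online Banking",
    "fw-dmz-02": "Online Banking",
    "app-srv-11": "Online Banking",
    "app-srv-12": "Online Banking",
}

# Alerts as they might arrive from separate network, firewall and APM tools.
raw_alerts = [
    {"source": "netmon", "component": "lb-prod-01", "message": "health check flapping"},
    {"source": "apm",    "component": "app-srv-11", "message": "TCP connect refused on 8443"},
    {"source": "apm",    "component": "app-srv-12", "message": "TCP connect refused on 8443"},
]

def group_by_service(alerts):
    """Group raw alerts under the business service each component belongs to."""
    grouped = defaultdict(list)
    for alert in alerts:
        service = SERVICE_MAP.get(alert["component"], "Unmapped")
        grouped[service].append(alert)
    return grouped

for service, alerts in group_by_service(raw_alerts).items():
    print(f"{service}: {len(alerts)} related alerts")
    for a in alerts:
        print(f"  [{a['source']}] {a['component']}: {a['message']}")
```

Had something like this been in place, Karen and the application team would have seen one consolidated ‘Online Banking’ view instead of hundreds of disconnected incidents.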

Creating a better path for IT operators

Too many enterprise IT operations teams are living in the past: disconnected tools and antiquated processes that don’t map well to the pace of change and complexity in modern IT environments. Applications are going to live across on-premises and multiple public clouds for the foreseeable future. Coupled with the growing volume of event data and the rising velocity of deployments, complexity will keep growing, and with it the risk to user productivity and customer experience.

Here’s an action plan for 2020 to better manage IT performance and enable ITOps teams to be more productive:

  • It’s time to seriously consider machine learning alert and event correlation platforms: it is no longer humanly possible for operators to sift through the flood of alarm data. Machine-learning alert correlation products are maturing and providing tangible value to IT organisations (a simplified illustration follows this list)
  • It’s also time to restructure relic processes designed for mostly static infrastructure and applications: today’s application agility requires training IT operators so that they intuitively identify business risk and cooperate fluidly to keep digital services in an optimal state
  • Finally, it’s time to reconsider the traditional siloed approach to ITOps monitoring and alerting: keeping the observable data separated in different buckets does not provide much value unless we can correlate it to the respective business services
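
The sketch below is a deliberately simplified stand-in for what correlation platforms do: it collapses alerts that arrive close together in time into a single candidate incident. Real products apply machine learning over far richer signals (topology, text similarity, historical patterns); the alert stream and five-minute window here are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # alerts this close together are candidates for one incident

# Hypothetical alert stream, already sorted by timestamp.
alerts = [
    {"ts": datetime(2020, 1, 6, 8, 0, 5),  "component": "app-srv-11", "message": "connect refused"},
    {"ts": datetime(2020, 1, 6, 8, 0, 9),  "component": "app-srv-12", "message": "connect refused"},
    {"ts": datetime(2020, 1, 6, 8, 2, 30), "component": "lb-prod-01", "message": "pool member down"},
    {"ts": datetime(2020, 1, 6, 9, 15, 0), "component": "fw-dmz-02",  "message": "config backup ok"},
]

def correlate(alerts, window=WINDOW):
    """Group time-adjacent alerts into candidate incidents."""
    incidents, current = [], []
    for alert in alerts:
        if current and alert["ts"] - current[-1]["ts"] > window:
            incidents.append(current)
            current = []
        current.append(alert)
    if current:
        incidents.append(current)
    return incidents

for number, incident in enumerate(correlate(alerts), start=1):
    components = sorted({a["component"] for a in incident})
    print(f"Incident {number}: {len(incident)} alerts across {components}")
```

Even this naive time-window grouping reduces four alerts to two incidents; commercial platforms do the same kind of reduction across hundreds of thousands of events.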

In taking these three steps, we can create a new IT operations practice that supports and even enhances the elusive digital transformation that almost every company today would like to achieve.


The new hybrid cloud will transform IT operations: How the big three clouds are responding

CIOs have been telling their cloud partners for years now that they’re not ready to fully commit to the public cloud. They have existing infrastructure investments which they’re loath to jettison, and some applications and workloads can’t easily move off site. In other cases, IT needs workloads to run nearby in the data centre to satisfy the extreme low-latency requirements of services on the factory floor or in the emergency room.

Stringent regulatory requirements are also a factor when CIOs choose to keep some assets in the corporate data centre. The cloud giants are finally listening. And suddenly, hybrid cloud is red hot.

A majority of enterprises (86%) have more than a quarter of their IT infrastructure running in cloud environments, and more than half have 50-75% in the cloud, according to a recent OpsRamp survey. According to the RightScale 2019 State of the Cloud report, hybrid cloud is the dominant enterprise strategy, with 58% of respondents stating that it is their preferred approach. Organisations are using an average of five different clouds. Clearly, the appetite for hybrid and multi-cloud environments, which increasingly span multiple cloud vendors, is strong.

The big three cloud providers are responding

Amazon, Google and Microsoft have all been releasing new hybrid cloud offerings in recent months. The latest is Azure Arc from Microsoft. This bundled multi-cloud layer claims to be a single platform for extending Azure services to on-premises environments using Azure Stack, to competing cloud services, and to edge environments. Arc promises a hybrid automation framework for deploying and managing apps in all clouds.

Given its enterprise heritage, Microsoft placed the right bet when it announced Azure Stack in 2015. Amazon and Google quickly followed. If the recent Arc announcement is any indication, the complex multi-cloud world is here to stay. In this multi-cloud hybrid reality, software engineers have to design or re-factor applications to account for interoperability between all clouds, while releasing applications ever faster to stay competitive. Meanwhile, IT operations teams are challenged to manage assets, metrics, alerts, events and services from on-premises environments and the major cloud vendors. The greatest challenge for IT ops today is to find a cohesive framework for handling multi-cloud hybrid data and complexity without undue cost and pain.

Now, there are finally several options for doing just that.

Clouds competing for the middle

All three companies have competing solutions that aim to address the hybrid IT challenge and become the provider that solves hybrid woes, but none of them is a perfect solution. This is a new battleground where cloud vendors compete in the world of infrastructure management and orchestration.

Microsoft was the first to address enterprise IT cloud realities, when Azure Stack became generally available in 2017. Stack has been wildly successful because it delivers a safe runway to the cloud. CIOs can run the Azure cloud operating system on-premises and obtain the tangible benefits of cloud architecture and services without giving up the security and control that come from keeping IT inside. That’s especially important in companies with heavy compliance requirements or in traditional IT organisations and cultures.

Stack also supports an easier migration path to Azure, and simpler integration between on-premise and cloud workloads. Arc takes this a step further by adding an abstraction layer over Azure and Stack environments so that IT can orchestrate and manage all infrastructure (including, says Microsoft, a competing cloud environment) from one place.

In April, Google released its version of the hybrid play: Anthos. Google calls it an “open application modernisation platform that enables you to modernise your existing applications, build new ones, and run them anywhere.” Google’s service, as expected, focuses on open-source technologies Kubernetes, Istio, and Knative and allows IT to manage both on-premises and cloud environments.

AWS Outposts brings AWS infrastructure, services, and operating models to “virtually” any data centre or on-premises facility. AWS Outposts builds on the AWS Nitro system technologies that enable customers to launch EC2 instances and EBS volumes on the same AWS-designed infrastructure used in AWS data centres. Companies can run AWS Outposts on VMware Cloud or using AWS native technologies.
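
Because Outposts exposes the same EC2 API on-premises, launching an instance on an Outpost looks the same as a regional launch; the caller simply targets a subnet created on the Outpost. Here is a minimal sketch using boto3, where every identifier is a hypothetical placeholder and the Outpost is assumed to be already provisioned:

```python
import boto3

# The Outpost is anchored to its home AWS region; the same client works for both.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # hypothetical AMI ID
    InstanceType="m5.large",
    SubnetId="subnet-0123456789abcdef0",  # hypothetical subnet created on the Outpost
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```

The operational payoff is that existing AWS tooling and automation carry over to the on-premises footprint largely unchanged.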

Now for the outlier: Kubernetes. While not a cloud provider, this container orchestration platform has been a boon to IT because it decouples the underlying infrastructure (on-premises or public cloud) from the applications. This has never been done so elegantly before, and it has turned Kubernetes into a strategic, beloved abstraction layer that lets developers spread their wings across any cloud vendor or environment. This decoupling empowers application developers to focus solely on application development and to port applications between environments with ease. That said, Kubernetes is also an extension of the overall hybrid infrastructure picture, and one that needs to be tracked and monitored as well.
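
That decoupling extends to day-two operations: the same Kubernetes API calls work against an on-premises cluster and a managed cloud cluster. A small sketch using the official Python client, with hypothetical kubeconfig context names:

```python
from kubernetes import client, config

# Hypothetical kubeconfig contexts: one on-premises cluster, one managed cloud cluster.
CONTEXTS = ["onprem-dc1", "cloud-prod-europe"]

for ctx in CONTEXTS:
    # The same client code runs regardless of where the cluster lives.
    api = client.AppsV1Api(api_client=config.new_client_from_config(context=ctx))
    deployments = api.list_deployment_for_all_namespaces()
    print(f"{ctx}: {len(deployments.items)} deployments")
```

The operator’s code does not care which environment answers, which is exactly the portability the hybrid offerings above are trying to deliver for everything below the container layer.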

Impact on IT and IT ops

Most enterprises will end up operating in a hybrid cloud configuration spanning on-premises infrastructure and one or more cloud vendors. Products like Azure Arc, AWS Outposts and GCP Anthos will enable enterprises to speed cloud adoption, minimise internal risk, maintain consistent deployment and automation models across apps and infrastructure, and allow IT to migrate workloads between locations and clouds with ease.

There may also be significant benefits for IT operations in adopting a hybrid cloud service from Amazon, Google or Microsoft along with Kubernetes. Legacy, on-premises monitoring and management tools are still common in enterprises. A survey by Forrester found that some 33% of companies use more than 20 monitoring tools, and only 12% rely solely on modern tools. But as more workloads move to the cloud, legacy tools don’t really fit the bill. Infrastructure leaders will be looking to select a modern cloud management solution that doesn’t lock them into a particular cloud. In addition, many IT organisations still need a vendor-neutral system to monitor and optimise these environments. Those systems will of course need to integrate with the company’s hybrid cloud service of choice.

Enterprises can realise the following three benefits from a hybrid public cloud approach:

  • Use best of breed products from all cloud vendors
  • Use the most cost-effective products from each vendor
  • Distribute risks between cloud vendors

Will the customer be locked in?

It’s questionable whether any of the public cloud vendors will provide the optimal approach for operating other clouds. Regardless of which approach enterprises choose, 2020 will be a pivotal year for realising real-world benefits from the cloud as the market for hybrid cloud products and services matures.


Why it continues to make sense for IT ops to move to the cloud: A guide

There’s been a lot of movement in the IT operations management (ITOM) business lately, from the acquisition of SignalFx by Splunk to the PagerDuty IPO, and all signs point to a Datadog IPO in the future. What’s behind all this activity? I believe we’re seeing the rise of a future state of ITOM; that is to say, the rise of SaaS-based ITOM. And it’s easy to see why.

In my previous consulting career as a lead enterprise systems architect, our team had an impeccable record of designing and implementing well-architected hybrid infrastructure solutions, with near-flawless customer satisfaction. By project sign-off, our job was always done. And yet, returning to the same solutions six months later told a different story entirely.

Well-architected solutions are similar to human bodies: they are perfect when they’re born, but they need constant care and feeding. The same solutions that satisfied SLAs, exceeded expectations and transformed organisational efficiency can easily degenerate. Just as our bodies have nervous systems to monitor, brains to send alerts and tissue to self-heal, well-architected systems need operational maintenance to keep them humming.

The traditional approach to this eternal need was, and still is, to design and implement a well-architected ITOM solution around a well-architected infrastructure. And yet this is a circular problem, because the ITOM solution itself needs the same care and feeding.

There’s a problem on-premise

ITOM is a broad term encompassing application and operating system (OS) performance, alerts, log management, notification, asset configuration, incident management and more. It typically involves buying a suite of on-premises point tools to address each need, and then developing an internal framework to help those tools interoperate in a meaningful way. While that is possible conceptually, the facts on the ground reflect a very different reality:

  • Multi-vendor tools are often not designed to work together
  • Creating an internal logical framework that orchestrates various teams and technologies can be very complex in large enterprises
  • It’s near-impossible to create technical integrations flexible enough to keep up with the inevitable organisational and technological changes that will affect this logical framework
  • Predicting cost-of-ownership is nearly impossible since each tool is controlled by a different vendor, and the internal integration effort is often unknown
  • Predicting the cost of the manpower required is also very difficult, as each tool requires its own set of specialists, in addition to integration specialists to make it all work together
  • Upkeep is often overwhelming, as vendors offload software patches, upgrades, and on-premises hardware costs to the customer

In the face of all these challenges, the end result is often unrealised value, overwhelmed operational teams, loss of service, and an inability to accommodate new technologies, resulting in business service disruptions.

Why cloud? Why now?

It has historically been near-impossible to build a true ITOM platform on-premises. Vendors typically sell a collection of white-labelled tools cobbled together through acquisitions, and that is far from a platform. The complexity and the rate of technological change make it difficult to provide consistent quality and value across the various product lines. This puts the IT ops team squarely in a bind: how can they ride the wave of a changing environment without relying on static tool suites?

The future is flexible

Enterprise IT operations is stretched thinner than ever: there is a serious skills gap, a shortage of IT workers, and ever-increasing technical complexity. Time and resources are precious, and enterprise IT operations teams need simplicity and predictability along with flexibility and control.

Enter SaaS ITOM. By moving the ITOM function to a SaaS orientation, the responsibilities, workloads, and daily tasks can transform according to the needs of the organisation:

  • Keeping up with the business: SaaS ITOM can keep up with technological change, and keep pace with cloud, DevOps, artificial intelligence and more. In the world of SaaS, change is an accepted constant, not an inconvenience. What’s more, SaaS ITOM is far more consumable than legacy tool suites, which reduces the learning curves associated with running IT operations
     
  • Keeping up with industry needs: A SaaS ITOM platform can deliver a framework that’s both flexible and governed, and can accommodate technical and organisational complexities. This agility is a feature of modern SaaS. SaaS ITOM can also integrate features running on a single code base supported completely by the SaaS vendor, who absorbs maintenance and upgrade cycles, giving considerable and valuable time back to the operator. All of this results in a more predictable total cost of ownership, improved service quality and more value to the business user

It’s not news that the world of IT is moving to the cloud. It is news, however, that cloud can offer such transformational benefits in ways we’ve never seen before.
