{"id":40510,"date":"2020-03-05T10:14:45","date_gmt":"2020-03-05T10:14:45","guid":{"rendered":"https:\/\/www.cloudcomputing-news.net\/news\/2020\/mar\/05\/day-trenches-it-operations-how-make-more-seamless-practice\/"},"modified":"2020-03-05T10:14:45","modified_gmt":"2020-03-05T10:14:45","slug":"a-day-in-the-trenches-with-it-operations-how-to-create-a-more-seamless-practice","status":"publish","type":"post","link":"https:\/\/icloud.pe\/blog\/a-day-in-the-trenches-with-it-operations-how-to-create-a-more-seamless-practice\/","title":{"rendered":"A day in the trenches with IT operations: How to create a more seamless practice"},"content":{"rendered":"<p><img decoding=\"async\" src=\"http:\/\/www.cloudcomputing-news.net\/media\/img\/news\/soldiers-marching-picture-id182756526_MsT5a8l.jpg\"><\/p>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">Traditionally, IT operators are responsible for &lsquo;keeping the lights on&rsquo; in an IT organisation. This sounds simple, but the reality is harsh, with much complexity behind the scenes. Furthermore, digital transformation trends are quickly changing the IT operations responsibility&nbsp;from &lsquo;keeping the lights on&rsquo; to &lsquo;keeping the business competitive&rsquo;. <\/span><\/span><\/p>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">IT operators are now not only responsible for uptime, but also for the performance and quality of digital services provided by and to the business. To a large extent, maintaining available and high-performing digital services is precisely what it means to be digitally transformed.<\/span><\/span><\/p>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">I&rsquo;ve spent my fair share of time as an MSP team lead, and on the operations floor in large IT organisations. 
The job of an enterprise IT operator is full of uncertainty. Let&rsquo;s look at a typical day in the life of an IT operator, and how she addresses common challenges like: <\/span><\/span><\/p>\n<ul>\n<li><span style=\"color:null\"><span style=\"background-color:white\">Segregated monitoring and alerting tools causing confusion and unnecessary delays in troubleshooting<\/span><\/span><\/li>\n<li><span style=\"color:null\"><span style=\"background-color:white\">Resolving a critical issue quickly through creative investigations that go beyond analysing alert data<\/span><\/span><\/li>\n<li><span style=\"color:null\"><span style=\"background-color:white\">Legacy processes, such as from ITIL, working against the kind of open collaboration required to fix issues in the DevOps era<\/span><\/span><\/li>\n<\/ul>\n<h3 style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><strong><span style=\"background-color:white\">Starting the day with a critical application outage<\/span><\/strong><\/span><\/h3>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">Karen is a senior network analyst (L4&nbsp; IT Operator) who works for a large global financial organisation. She is considered a subject matter expert (SME)&nbsp;in network load balancing, network firewalls, and application delivery. She is driving to the office when&nbsp; she gets a call informing her that a major banking application is down at her company. Every minute of downtime affects the bottom line of the business. She finds parking and rushes to her desk, only to find hundreds of alert emails queued in her inbox. The alerts are coming from an application monitoring tool she can&rsquo;t access &#8211; more on that later.<\/span><\/span><\/p>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">The L1 operator walks to Karen&rsquo;s desk in a distressed state. 
Because the app is so critical, the outage has caused the various monitoring and logging tools to generate hundreds of incidents, all of which have been assigned to Karen. She spends considerable time looking through the incidents with no end in sight. Karen logs on to her designated network connectivity, bandwidth analysis, load balancer and firewall uptime monitoring tools, none of which indicate any issues. <\/span><\/span><\/p>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">Yet the application is still down, so Karen decides that the best course of action is to ignore the alert flood and the monitoring metrics and tackle the problem head-on. She starts troubleshooting every link in the application chain, confirming that the firewall ports are open and that the load balancer is configured correctly. She crawls through dozens of long log files and finally, five hours later, discovers that the application servers behind the load balancer are unresponsive: bingo, the culprit has been identified.<\/span><\/span><\/p>\n<h3 style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><strong><span style=\"background-color:white\">Root cause found: now more stalls<\/span><\/strong><\/span><\/h3>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\">Next, Karen contacts the application team. <span style=\"background-color:white\">The person responsible for the application is out of the office, so the application managers schedule a war room call for two hours later. Karen joins the call from home, along with 12 other individuals, most of whom she&rsquo;s never worked with in her role. 
<\/span><\/span><\/p>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">The manager starts the call by tackling all angles of the issue. Karen, however, knows that the issue is caused by two application servers. After a 30-minute discussion, Karen shares her screen and is able to prove that the issue is caused by the app servers. After further investigation, the application team discovers that an approved change executed the night before had changed the application&rsquo;s TCP port: a critical error on the application team&rsquo;s part.<\/span><\/span><\/p>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">Later investigations showed that an APM (Application Performance Monitoring) tool had generated a relevant alert and an incident that could have helped solve the issue much more quickly. The application team missed the alert and, adding to the misery, the ITOps team didn&rsquo;t have access to the APM system. Karen had no way of gathering telemetry (or noticing the lack of it) from the APM tool directly. <\/span><\/span><\/p>\n<h3 style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><strong><span style=\"background-color:white\">A day later, the fix is applied<\/span><\/strong><\/span><\/h3>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">The application team requested approval for an emergency change so they could fix the application configuration file and restart the servers. 
The repair took less than 10&nbsp; minutes, but the application had been down for almost 24 hours.&nbsp; <\/span><\/span><\/p>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">It is now 10pm&nbsp;on Monday. Karen is exhausted, having worked a 14-hour day with no breaks.&nbsp; <\/span><\/span><span style=\"color:null\"><span style=\"background-color:white\">How does the business measure the value of the time Karen spent resolving this outage? While her manager applauded her analytical skills, it wasn&rsquo;t the best use of her specialised skill set and definitely not how she should have spent her day (and night).<\/span><\/span><\/p>\n<h3 style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><strong><span style=\"background-color:white\">Does this sound familiar?<\/span><\/strong><\/span><\/h3>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">I&rsquo;m sure the story above resonates with IT operations professionals and it is unfortunate that similar occurrences are common. <\/span><\/span><\/p>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">Here are some takeaways: <\/span><\/span><\/p>\n<ul>\n<li><span style=\"color:null\"><span style=\"background-color:white\">The segregated monitoring and alerting tools did not provide operational value. 
That&rsquo;s because the alerts and metrics were not centralised for viewing by all the appropriate stakeholders, and weren&rsquo;t mapped to the business, in this case the banking application<\/span><\/span><\/li>\n<li><span style=\"color:null\"><span style=\"background-color:white\">A tool that generates alerts and incidents doesn&rsquo;t necessarily help the user locate the root cause<\/span><\/span><\/li>\n<li><span style=\"color:null\"><span style=\"background-color:white\">A flood of uncorrelated alerts and incidents makes matters worse. Many operators spend a lot of time looking at irrelevant data, sifting through the noise with their naked eyes. Karen quickly decided to go to the source, the application that was down, but not all ITOps people will do that<\/span><\/span><\/li>\n<li><span style=\"color:null\"><span style=\"background-color:white\">Legacy processes (such as ITIL) are designed to restrain the user from abrupt changes by implementing a lot of process red tape. On the flipside, this prevents the operators from fixing issues quickly when they arise. Karen did not have access to the application monitoring tool, nor was she allowed to communicate directly with the application team. She needed a manager to schedule a war room call. This hierarchy created costly delays which turned a five-to-10-minute fix into an all-day outage<\/span><\/span><\/li>\n<\/ul>\n<h3 style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><strong><span style=\"background-color:white\">Creating a better path for IT operators <\/span><\/strong><\/span><\/h3>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">Too many enterprise IT operations teams are living in the past, with disconnected tools and antiquated processes that don&rsquo;t map well to the pace of change and complexity in modern IT environments. 
Applications are going to be spread across on-premises infrastructure and multiple public clouds for the foreseeable future. Coupled with the growing volume of event data and the rising velocity of deployments, this means complexity will grow, and with it the risks to user productivity and customer experience. <\/span><\/span><\/p>\n<p style=\"margin-left:0cm; margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">Here&rsquo;s an action plan for 2020 to better manage IT performance and enable ITOps teams to be more productive: <\/span><\/span><\/p>\n<ul>\n<li><span style=\"color:null\"><strong><span style=\"background-color:white\">It&rsquo;s time to seriously consider machine-learning alert and event correlation platforms:<\/span><\/strong><span style=\"background-color:white\">&nbsp;It is no longer humanly possible for operators to sift through the flood of alarm data. Machine-learning alert correlation products are maturing and providing<\/span> tangible <span style=\"background-color:white\">value to IT organisations<\/span><\/span><\/li>\n<li><span style=\"color:null\"><strong><span style=\"background-color:white\">It&rsquo;s also time to restructure relic processes designed for mostly static infrastructure and applications:<\/span><\/strong><span style=\"background-color:white\">&nbsp;Today&rsquo;s application agility requires training IT operators so that they intuitively identify business risk and cooperate fluidly to keep digital services in an optimal state<\/span><\/span><\/li>\n<li><span style=\"color:null\"><strong><span style=\"background-color:white\">Finally, it&rsquo;s time to reconsider the traditional siloed approach to ITOps monitoring and alerting:<\/span><\/strong><span style=\"background-color:white\">&nbsp;Having the observable data separated into different buckets does not provide much value unless we can correlate it to the respective business services<\/span><\/span><\/li>\n<\/ul>\n<p style=\"margin-left:0cm; 
margin-right:0cm\"><span style=\"color:null\"><span style=\"background-color:white\">In taking these three steps, we can create a new IT operations practice that supports and even enhances the elusive digital transformation that almost every company today would like to achieve.<\/span><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Traditionally, IT operators are responsible for &lsquo;keeping the lights on&rsquo; in an IT organisation. This sounds simple, but the reality is harsh, with much complexity behind the scenes. 
Furthermore, digital transformation trends are quickly chan&#8230;<\/p>\n","protected":false},"author":594,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-40510","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/icloud.pe\/blog\/wp-json\/wp\/v2\/posts\/40510","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/icloud.pe\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/icloud.pe\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/icloud.pe\/blog\/wp-json\/wp\/v2\/users\/594"}],"replies":[{"embeddable":true,"href":"https:\/\/icloud.pe\/blog\/wp-json\/wp\/v2\/comments?post=40510"}],"version-history":[{"count":1,"href":"https:\/\/icloud.pe\/blog\/wp-json\/wp\/v2\/posts\/40510\/revisions"}],"predecessor-version":[{"id":40511,"href":"https:\/\/icloud.pe\/blog\/wp-json\/wp\/v2\/posts\/40510\/revisions\/40511"}],"wp:attachment":[{"href":"https:\/\/icloud.pe\/blog\/wp-json\/wp\/v2\/media?parent=40510"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/icloud.pe\/blog\/wp-json\/wp\/v2\/categories?post=40510"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/icloud.pe\/blog\/wp-json\/wp\/v2\/tags?post=40510"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}