
Is performance engineering still needed when it comes to cloud?

Opinion: Now that cloud vendors are constantly delivering features backed with hard data and solid specs, the question that comes to mind is: should we continue to measure, as we did in the days of the data centre, or should we blindly trust the vendor and save ourselves plenty of time and duplicated effort?

This is a question I asked myself some time ago – and it has taken me some time to come up with an answer I’m happy with.

Round one: The beginning

A few months ago, I was invited to a meeting whose aim was to decide and weigh the ‘need’ for measuring performance, or not, in the company cloud. The reason I was invited was two-fold: one, it is part of my role and within my circle of competence, and two, I am all for cloud-native philosophy, methodology and application, and I had been practising it for many years before Oracle Cloud Infrastructure was born.

The meeting started with some attendees asking my team to perform measurements and find out whether the infrastructure would or would not support our set of applications with the current network architecture. My response was: do we need to? In the cloud, we need to trust our vendor. We usually should not over-measure and stress-test a platform that is handed to us with clear features and metrics. There are SLIs/SLOs/SLAs in place to assure the client – us – that the systems will perform adequately.

So far, this meant performance engineering was not needed for this task. We agreed on that and called it a day. It was something the vendor had made clear in terms of specs, and we were clear about what we had, from how many VM cores and how much memory per VM, to load balancing bandwidth and latency, and so on. In conclusion, with all these specs in place, there is no need to go overboard with stress tests, smoke tests et al. in the way we were – and still are – doing in a data centre.

Round two: The revelation

After that meeting, some performance tasks we were used to became less necessary, especially as different clouds kept adding features and guaranteeing they would perform up to the levels expected. After all, it’s their responsibility.

But a few weeks back, I was called on in a different situation. The aim this time was not to ‘confirm’ what the vendor was saying; it was to use the skillset to go the extra mile where the vendor couldn’t, or where it wasn’t within the vendor’s scope.

In this case, it wasn’t to measure networking specs but to compare native versus paravirtualised launch modes, and other related areas. Although the vendor says one will be better or faster, nothing indicates how much better or how much faster, and opinions can be very subjective, especially when dealing with many components in a complex architecture. This case was justified: metrics were unclear, there was a grey area, and things got subjective quickly.

Round three: The conclusion

This means that with cloud, things are simplified, as they were meant to be, and we shouldn’t complicate them if we have a trusted vendor, because all those tasks were already carried out by them.

That being said, there are situations in which the vendor is not able, or not meant, to run certain performance tasks. These are very particular situations that may appear, and in them performance engineering will still be needed.

Now my circle was closed, and I understood when it was a good time to invest and when it wasn’t. However, in some situations, two things happen. Firstly, we might want that extra assurance that the specs are valid. There’s nothing wrong with that; we just need to pick those situations well to avoid wasting gunpowder. Secondly, management wants to do it; and even though engineers sometimes know better, occasionally the business just wins.

Performance engineering is far from dead, particularly so with new approaches such as failure injection, chaos engineering, and intuition engineering. New techniques, knowledge and tools are being created all the time – we just need to be able to set pride aside and acknowledge when that part of our role is not needed.


A guide to the key principles of chaos engineering

Chaos engineering can be defined as experimentation on a distributed system at scale, carried out to increase confidence that the system will behave as desired and expected under undesired and unexpected conditions.

The concept was popularised initially by Netflix and its Chaos Monkey approach. As the company put it as far back as 2010: "The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most  –  in the event of an unexpected outage."
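
To make the idea concrete, here is a minimal sketch of that pattern in Python using boto3 against AWS EC2. It is illustrative only, not Netflix’s implementation: the `chaos=opt-in` tag, the region, and the dry-run default are assumptions of the sketch.

```python
# Minimal sketch of a Chaos Monkey-style instance killer (illustrative only).
# Assumptions: AWS credentials are configured, instances opted in to the
# experiment carry a hypothetical tag chaos=opt-in, and dry_run stays True
# until you really mean it.
import random
import boto3

def kill_random_instance(region="eu-west-1", dry_run=True):
    ec2 = boto3.client("ec2", region_name=region)
    # Only consider running instances explicitly opted in to chaos experiments.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos", "Values": ["opt-in"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        print("No opted-in instances found; nothing to do.")
        return None
    victim = random.choice(instances)
    print(f"Selected victim: {victim} (dry_run={dry_run})")
    if not dry_run:
        ec2.terminate_instances(InstanceIds=[victim])
    return victim

if __name__ == "__main__":
    kill_random_instance()
```

The point is less the dozen lines of code than the discipline around them: only opted-in instances, a dry run by default, and a clear record of what was killed.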

The foundation of chaos engineering lies in controlled experiments; a simple approach follows.

Interim on controlled experiments with control and experimental groups

A controlled experiment is simply an experiment under controlled conditions. Unless strictly necessary, when performing an experiment it is important to change only one variable at a time; otherwise it becomes increasingly difficult to determine what caused the changes in the results.

One type of controlled experiment is the ‘control and experimental’ group experiment. In this kind of experiment, a control group is observed with no variables purposefully modified or affected, while the experimental group has one variable at a time modified or affected, with the output observed at each stage.

A simple approach

Defining a steady state: The main focus is to aim for output metrics and not for system behaviour; the goal is to find out whether the system can continue to provide the expected service, not how it is providing that service. It is useful to define thresholds that make for an easy comparison between the control group and the experimental group. This also allows for automated comparisons, which makes comparing large quantities of metrics easier.
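
As a sketch of what an automated comparison against thresholds can look like, the snippet below compares a handful of output metrics from the experimental group against the control group’s steady state. The metric names and tolerances are made up for illustration.

```python
# Sketch: compare experimental-group metrics against the control group's
# steady state. Metric names and relative tolerances are hypothetical.
STEADY_STATE_TOLERANCE = {
    "requests_per_second": 0.05,   # allow 5% relative deviation
    "error_rate": 0.50,            # allow 50% relative deviation on a tiny baseline
    "p99_latency_ms": 0.10,        # allow 10% relative deviation
}

def steady_state_holds(control: dict, experimental: dict) -> bool:
    """Return True if the experimental group stays within tolerance."""
    for metric, tolerance in STEADY_STATE_TOLERANCE.items():
        baseline = control[metric]
        observed = experimental[metric]
        deviation = abs(observed - baseline) / baseline if baseline else abs(observed)
        if deviation > tolerance:
            print(f"Steady state violated: {metric} deviated by {deviation:.2%}")
            return False
    return True

control = {"requests_per_second": 1200, "error_rate": 0.002, "p99_latency_ms": 180}
experimental = {"requests_per_second": 1180, "error_rate": 0.002, "p99_latency_ms": 195}
print("Steady state holds:", steady_state_holds(control, experimental))
```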

Building the hypothesis around control and experimental group: Due to the nature of chaos engineering, which is a mixture between science and engineering, the foundation is built around having two groups; a control group, which will be unaffected by injected events, and an experimental group, which will be the objective of the variable manipulation.

Introducing variables that correspond to undesired/unexpected events: Changing the state of the variables is what makes the experiment; however, those variables need to be significant and within reason. It is also of utmost importance to change one variable input at a time.

Try to disprove the hypothesis: The purpose of the experiment is not to validate the hypothesis, it is to disprove it; we must not fool ourselves, knowing that we are the easiest to fool.

Production means production

The only way of increasing confidence in a system running in production is to experiment on the system running in production, under live production traffic. This may seem odd at first glance, but it is absolutely necessary.

One important aspect that sometimes goes unnoticed is that we must not attack the point where we know the system will fail; speaking with upper management, I have had answers along the lines of ‘I know that if I unplug the DB the system will break’. Well, that is not chaos engineering – that is just plain foolishness. A chaos experiment injects failure in parts of the system we are confident will continue to provide the service. Be it by failing over, using HA, or recovering, we know that the service to the client will not be disrupted, and we try our best to prove ourselves wrong, so we can learn from it.

It is also absolutely necessary to minimise the impact of the experiment on real traffic; although we are looking for disruption, we are not pursuing interruptions or SLO/SLI/SLA violations. It is an engineering task to minimise the negative impact.

Interim on the blast radius

Chaos engineering or failure injection testing is not about causing outages, it is about learning from the system being managed; in order to do so, the changes injected into the system must go from small to big. Inject a small change, observe the output and what it has caused. If we have learned something, splendid; if not, we increase the change and consequently the blast radius. Rinse and repeat. Many people would argue that they know when and where the system will go down, but that is not the intention. The intention is to start small and improve the system incrementally. It is a granular approach, from small to large scale.
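
A rough sketch of that ‘start small, rinse and repeat’ loop follows; `inject_failure`, `check_steady_state` and `revert` are hypothetical hooks into your own tooling, and the blast radii are arbitrary percentages.

```python
# Sketch of incremental blast-radius escalation. The injected callables are
# hypothetical hooks into your own failure-injection and metrics tooling.
import time

def run_experiment(inject_failure, check_steady_state, revert,
                   blast_radii=(1, 5, 25, 100)):
    """Escalate the percentage of affected capacity until something is learned."""
    for radius in blast_radii:
        print(f"Injecting failure at blast radius {radius}%")
        inject_failure(percent=radius)
        time.sleep(60)  # let the system settle and metrics accumulate
        if not check_steady_state():
            print(f"Steady state broken at {radius}% - stop, revert, and learn")
            revert()
            return radius
        revert()  # clean up before widening the blast radius
    print("Steady state held at every radius tried")
    return None
```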

Automation

The importance of automation is undisputed, even more so in these experiments, where it is necessary to:

  • Be able to rollback fast enough without human interaction or with minimal HI
  • Be able to examine a large set of output metrics at first glance
  • Be able to pinpoint infrastructure weak spots visually

Other sources and good reads

The basics: https://principlesofchaos.org/
An extended introduction: https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/
A big list of resources: https://github.com/dastergon/awesome-chaos-engineering

The perils of not having disaster recovery – or, why we love a good reserve parachute

One of the most important but often missed steps in having a reliable infrastructure is disaster recovery (DR). Surprisingly, most companies decide either to not implement DR or to implement it halfway. Here, I intend to explore common terms and understandings in disaster recovery; how to leverage the cloud, different types, the plan and important considerations, as well as the economic impact.

Regional vs. zone/domain DR

DR can be implemented at regional or zone/domain level, depending on needs. I advocate and adopt having high availability (HA) at zone/domain level and DR at regional level; the cloud presents itself as a good alternative in terms of cost value for HA and DR – even more so with the plethora of providers that exist nowadays.

Levelling the field

First, some widely used terms:

RTO – recovery time objective. Essentially how long it will take to have the DR site operational and ready to accept traffic.

RPO – recovery point objective. Essentially, the point in the primary site’s past to which the secondary site will be restored. It is also an indicator of data loss: if data is synced every hour and site A crashes at 11:59am, site B has data up to 11am, so in the worst case about an hour of data is lost and the secondary site becomes operational in the state the primary was in at 11am. That is an RPO of one hour. The smaller the better – alas, the more costly the implementation will be.
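
A trivial sketch of that arithmetic can be useful when reasoning about sync intervals and worst-case data loss; the timestamps below simply mirror the example above and are otherwise arbitrary.

```python
# Sketch: worst-case data loss for a periodic-sync DR setup.
# Timestamps are illustrative, matching the 11:00 / 11:59 example above.
from datetime import datetime, timedelta

sync_interval = timedelta(hours=1)                 # RPO target: 1 hour
last_sync = datetime(2018, 10, 1, 11, 0)           # secondary has data up to 11:00
primary_failure = datetime(2018, 10, 1, 11, 59)    # primary crashes at 11:59

data_loss = primary_failure - last_sync
print(f"Worst-case data loss: {data_loss}")        # 0:59:00, bounded by the RPO
assert data_loss <= sync_interval
```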

Regional – how far is too far, and how close is too close? With a primary in the London region and a secondary in the Dublin region, an asteroid the size of Kent falling on Wales could make the solution unviable, but the likelihood of that happening is negligible.

Cost – it is always a factor, and in this case it can make a difference, since regions such as Ashburn (USA) are usually (significantly) cheaper than regions in Europe. Cost considerations aside, having a secondary site close to the primary is priceless. Now, can it be too far? It depends. If the nature of the business depends on millisecond transactions, then analysts and customers in Bangalore cannot use a site in Phoenix. If it does not, the savings of having a secondary site (temporarily) in a different region are worth it. Also, it is not something permanent – the system is in a degraded state.

An alternative approach is having three sites – the primary, a DR site with a given RTO/RPO in case it is needed, and a second DR site in the form of a pilot light only.

Hot, cold, warm standby

In some circles, DR is discussed in terms of a hot/cold/warm approach. I usually prefer these terms in high availability architectures, although I have seen DR sites referred to as warm. A hot site is usually a site that is up and running and to which I can fail over immediately – which, as mentioned, is something I would relate to HA. A warm site, however, can be a site that has the resources provisioned, with only the critical part running or ready to run; it may take a few minutes until things are in order and the failover into that DR site can take place.

A cold standby is one that receives updates, but not necessarily frequent ones, meaning that failing over may leave the RPO much larger than desired. And of course, the RTO and RPO are usually numbers bound to the SLA, so they need to be well thought out and taken care of.

Domain/zone DR – worth it or not?

DR at zone/domain level is a difficult decision for different reasons. Availability zones consist of sites within a region with independent network, power, cooling, and so on – i.e. isolated from each other. One or more data centres comprise a zone, and one or more zones (usually three) comprise a region. Zones are used frequently for high availability. Network connectivity between zones is usually very low latency – in the order of a few hundred microseconds – and the transfer rate is of such an order of magnitude that RPOs can become almost irrelevant, since data is replicated everywhere in an instant.

As HA within zones is sometimes a luxury, a DR solution can be necessary within the zones. In this case, it is usually an active/passive configuration, meaning the secondary site is stopped.

Economic impact

It is a given that the economic impact is a big factor with regard to RTO, RPO, compliance, security, and GDPR as well. It is not necessarily true that the more responsive the secondary site is, the more expensive it is. It will depend on the architecture, how it is implemented, and how it is operated when needed. Basically, the economic impact will be driven by the amount of information kept in the different sites – not so much by the size of the infrastructure, nor by the replication of that information, which can be automated and is nowadays done frequently enough to have almost the same data in two or more regions at any given moment.

Also, as long as the infrastructure is stopped, it is possible to resume operations in minutes without a large impact. Of course, this will depend on the cloud provider. Some providers will charge even for stopped VMs or stopped BMs, depending on the shape/family – for instance, Oracle Cloud will continue billing if the stopped instance uses NVMe and SSD, meaning any Dense/HighIO machine – so beware of these details.

Automation also falls within the economic impact. Failover can be automated, semi-automated, or manual. For the most part, in DR cases I prefer semi-automated, in a two-man-rule fashion. What this means is that even when everything indicates there is a massive outage that requires DR, it will take more than one person to say ‘go’ on the failover, and more than one person to actually activate the processes involved. The reason being: once the DR process is started, going back before completion can be a nightmare.
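
A minimal sketch of what a two-man-rule gate around the failover trigger might look like follows; the approver list and the runbook call are assumptions standing in for your own directory and automation.

```python
# Sketch of a two-man-rule gate: the failover only starts once two distinct,
# authorised people have said 'go'. AUTHORISED and the runbook call are
# hypothetical stand-ins for a real directory and real DR automation.
AUTHORISED = {"alice", "bob", "carol"}

class FailoverGate:
    def __init__(self, required_approvals=2):
        self.required = required_approvals
        self.approvals = set()

    def approve(self, person: str) -> bool:
        if person not in AUTHORISED:
            raise PermissionError(f"{person} is not authorised to approve DR failover")
        self.approvals.add(person)
        return len(self.approvals) >= self.required

    def start_failover(self):
        if len(self.approvals) < self.required:
            raise RuntimeError("Two-man rule not satisfied; refusing to fail over")
        print("Starting DR failover runbook...")  # real automation would go here

gate = FailoverGate()
gate.approve("alice")
if gate.approve("bob"):      # second distinct approver satisfies the rule
    gate.start_failover()
```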

Pilot light

Although it is strange to see pilot light within the economic impact, there is a reason for it: pilot light allows for a DR site with the minimum of infrastructure. Although a data replica must exist, the DR site needs only one or two VMs, and those VMs, when needed, will take care of spawning the necessary resources. As an engineer, I sometimes steer towards pilot light with an orchestration tool such as Terraform.

Having a virtual machine online that contains all the IaaC (infrastructure as code) files necessary to spin up an entire infrastructure is convenient, and usually it is a matter of a few minutes until the latest version of the infrastructure is back up and running, connected to all the necessary block devices. Remember, nowadays it is possible to handle even load balancers with IaaC, so there are no boundaries to this.
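
As a sketch of the pilot-light idea, the small VM can simply drive the orchestration tool. The repository path and the region variable below are assumptions; the commands themselves are standard Terraform CLI calls wrapped from Python.

```python
# Sketch: a pilot-light VM rebuilding the DR site from versioned IaaC.
# The repo path and the region variable are hypothetical; the terraform
# commands are the standard CLI.
import subprocess

IAAC_DIR = "/opt/dr/terraform"   # checked-out IaaC repository on the pilot-light VM

def rebuild_dr_site(region="us-ashburn-1"):
    subprocess.run(["terraform", "init"], cwd=IAAC_DIR, check=True)
    subprocess.run(
        ["terraform", "apply", "-auto-approve", f"-var=region={region}"],
        cwd=IAAC_DIR,
        check=True,
    )

if __name__ == "__main__":
    rebuild_dr_site()
```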

The DR plan

This is a critical part, not only because it describes the processes that will become active when the failover is a reality, but also because all the stakeholders have a part in the plan and all of them must know what to do when it is time to execute it. The plan must, of course, not simply be written and forgotten, but tested – not just once, but in a continuous-improvement manner.

Anything and everything necessary to measure its efficacy and efficiency should be included. It is worth testing the plan both with and without the stakeholders being aware, in order to see how they will behave in a real situation – and it is also advisable to repeat the exercise every six months, since infrastructure and processes can change.

Leaving the degraded state

Sometimes the plan does not cover going back to the primary site, and this is important: while failed over, the infrastructure is in a degraded state, and it is necessary to bring the systems back to the normal state so as to have DR again. Since going back to the normal state of things takes time as well, and all the data needs to be replicated back, this is something that needs to be done under a maintenance window – and surely all the customers will understand the need to do so. Just in case, when setting up SLOs and SLAs, bear in mind that this maintenance window may be necessary. It is possible to add it as ‘fine print’, of which I am not a fan, or to consider it within the calculations.

Conclusion

There are some considerations with regard to DR in different regions, specifically but not only for Europe, and these come in the form of data, security, compliance and GDPR. The GDPR requires companies to have any personal data available in the event of any technical or physical incident, so DR is no longer a wish-list item – it is required. What this basically means is that, under GDPR legislation, data held on a person must be available for deletion or freed up for transfer upon request. For those legally inclined, more information can be found in Article 32 of the GDPR. In case DR is found to be daunting, there are nowadays multiple vendors that offer DRaaS (disaster recovery as a service) as well.

How to leverage cloud architectures for high availability

Cloud architectures are used broadly nowadays; the cloud is a plethora of amazing alternatives in terms of services and solutions. However, with great power comes great responsibility, and the cloud presents itself as a place where failure can and eventually will occur – and when it does, it can spread across the entire architecture fast, possibly causing massive outages that can leave a business on its knees.

Okay, that’s not an overly optimistic scenario – more likely the opposite – but not to fear. This is the nature of almost any architecture – and why should cloud be any different?

Cloud architects face two different problems at scale when preparing for the worst. Firstly, if something unexpected and undesired happens, how do we continue business operations as if nothing had happened? Secondly, if something unexpected and undesired happens and we are unable to continue operations as usual, how can we bring the architecture up someplace else, within a reasonable window of time, and then resume operations as usual?

In these terms we can discuss:
– Continue business as usual in the face of an outage
– Resume business as usual in the shortest term possible in the face of an irrecoverable outage

The first is covered by high availability, and the second is covered by disaster recovery. Here, we will look at high availability.

The alternatives currently on the table

The cloud offers more than enough to face both scenarios. Most clouds are distributed in a geographic and technical way so as to avoid massive outage scenarios by themselves; at a small scale, clouds have what are known as Availability Zones (AZs) or Availability Domains (ADs). These are usually different buildings, or different clusters of buildings, in the same geographic area, interconnected but highly redundant, especially with regard to power, cooling and storage.

At a large scale, clouds are divided into regions – global regions, that is, with 10 or 15 of them if we look at giants such as Google Cloud and Amazon Web Services. These regions are spread geographically across the globe and serve two purposes: isolation in case of disaster, and performance. Customers in different countries and continents will be served by the nearest point of service, not rerouted to the main one. That is what makes latency lower and responses faster.

Taking all this into consideration, it is the task of the architect to design the service with availability zones and regions in mind, in order to serve customers properly and take advantage of the technologies at hand. Architectures are not replicated by cloud providers across different regions – that is something architects and engineering teams need to consider and tackle – and the same goes for availability domains unless the discussion is about storage; compute instances and virtual networks, to mention core services, are not replicated across ADs or AZs for the most part.

The alternatives for high availability involve avoiding single points of failure, testing the resilience of the architecture before deploying to production, and either constructing master/master, master/slave or active/passive solutions in order to be always available, or having automation able to reduce unavailability to a minimum.

What are considered best practices?

The following is a list of best practices for providing HA in the cloud. It is not completely comprehensive, and much of it may also apply, to a lesser degree, to data centre architectures.

  • Distributing load balancers across ADs and watching for single points of failure (SPOF) in the architecture: two is one and one is none
  • If the cloud provider is not providing redundancy across ADs and at least three copies of the same data automatically, it may be a good idea to re-evaluate the provider decision, or contemplate a service that does so
  • Easy to get in, easy to get out: it is necessary to have the certainty that, in case it becomes essential to move or redirect services, it is possible to do so with minimum effort
  • Implementing extra monitoring and metrics systems if possible, not to mention good integration – ideally off-the-shelf, through third parties that can provide timely alerts and rich diagnostic information. Platforms such as New Relic, or incident tools such as PagerDuty, can be extremely valuable
  • Keeping the architecture versioned, and in IaaC (infrastructure as code) form: if an entire region goes away, it will be possible to spawn the entire service in a different region, or even a different cloud, provided data has been replicated and DNS services are elastic
  • Keeping DNS services elastic: this goes without saying, especially after the previous step; flexibility is key in terms of pointing records in one direction or another
  • Some clouds do not charge for instances in a stopped state, especially VMs – e.g. Oracle only charges for stopped instances if those are Dense or HighIO shapes, otherwise it does not. It is easy to leverage this and keep a duplicated architecture in two regions; with IaaC, this is not unrealistic and it is also easy to maintain
  • Synchronising necessary and critical data across ADs constantly, in the form of block storage ready to use and often unattached; avoiding NVMe usage if that implies being billed for unused compute resources to which those NVMe drives are attached
  • Leveraging object storage in order to have replicated data in two or more regions
  • Leveraging cold storage (archive, such as Glacier) to retain critical data in several sparse regions; sometimes the price to pay to break the minimum retention policy and request a restore is worth it in order to bring a production environment up
  • Using the APIs and SDKs for automation: by creating HA and failover tools, automation can turn systems into autonomous systems that take care of failovers by themselves, and mixing this with anomaly detection can be a game changer (see the sketch after this list). Do not rely too heavily on dashboards – most things can be, and some must be, done behind the curtain
  • Nobody says it is necessary to stick to one cloud: with the power of orchestration and cloud providers, it is simple enough to have infrastructure in more than one cloud at the same time, running comparisons and if necessary switching providers
  • Using tools to test the resilience of the infrastructure and the readiness of the engineering team – faking important failures in the architecture can yield massive learning
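
Picking up the API/SDK and elastic-DNS points above, here is a minimal sketch of a failover helper that flips a DNS record to the standby region when a health check fails. It uses boto3’s Route 53 client; the hosted zone ID, record name, health URL and endpoint IPs are placeholders, not values from the article.

```python
# Sketch: health-check-driven DNS failover using Route 53 (boto3).
# HOSTED_ZONE_ID, RECORD_NAME, the health URL and the IPs are placeholders.
import boto3
import urllib.request

HOSTED_ZONE_ID = "ZXXXXXXXXXXXXX"
RECORD_NAME = "app.example.com."
PRIMARY_IP, STANDBY_IP = "203.0.113.10", "203.0.113.20"

def primary_healthy(url="https://primary.example.com/health", timeout=3) -> bool:
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except Exception:
        return False

def point_record_at(ip: str):
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": ip}],
                },
            }]
        },
    )

if not primary_healthy():
    point_record_at(STANDBY_IP)   # fail over; anomaly detection could gate this step
```

In a real setup this decision would be gated by more than a single health probe, but it illustrates how little code sits between the SDK and an autonomous failover.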

Conclusion

Although best practices are only best practices if applied, not all of them can be applied in the same architecture or at the same time, so the judgement of an experienced architecture and engineering team is always necessary.

That said, most of the points can be applied without significant effort. It only takes some hard work and disposition, but the results yielded will make it worth its weight in copper.

Happy architecting and keep up the good work.

Exploring cloud APIs – the unnoticed side of cloud computing

Nowadays, it is increasingly easy to lose oneself in dashboards, visualisation tools, nice graphics, and all sorts of button-like approaches to cloud computing. Over the last decade, UX work has improved tenfold; nevertheless, there is a side that usually goes unnoticed unless extremely necessary. That side belongs to application programming interfaces (APIs).

An API constitutes a set of objects, methods and functions that allows the user to build scripts, or even big applications, to make certain things happen in the cloud that usually cannot be made to happen through a dashboard. This can be for various reasons: the functionality not yet being available to the end user; it being very specific to the business, such as a competitive advantage; or it being an automated portion that is not widely needed and may also be specific to the business.

Of course, there are other reasons. The cloud may not have the maturity to enable the user to do certain things. In all those cases, APIs are most welcome and, contrary to popular belief, the learning curve to build something with it is usually not steep.

Widespread use throughout the industry

Virtually every cloud has an API. Sometimes, it is not offered as an API, but as an SDK. Although they are not the same thing, they are sometimes seen as equals, or at least close siblings. SDKs come in different flavours; it may be a Python SDK, a Java SDK, a Go SDK, or one of many others. The ones mentioned seem to be the ones broadest in use, but there are also many others, such as JavaScript/Node.js, that we see frequently in AWS, Oracle Cloud, Google Cloud Platform, and even Azure.

From luxurious to essential

Although there is little luxury in writing tools that make use of an API, I would like to think that some things can be done more elegantly when built from scratch, or from first principles, instead of simply using what is on the market. Nonetheless, although I am an advocate of not re-inventing the wheel, if off-the-shelf tools and technologies do not fit the bill, it is necessary to come up with something from scratch, be it proprietary or open source.

APIs can be used to write code that will create complementary features, such as high availability solutions that are not provided ad hoc, or certain automations that are intrinsic to the business case at hand. Sometimes these may not be indispensable, and in those cases they may be considered luxurious.

On the other hand, there are things that you will inevitably need, such as building a tool that creates havoc in your cloud architecture in order to test your incident response. Chaos Monkey, for AWS, is a key example of this. I believe these tend towards the essential end of the spectrum, and should not be taken lightly.

What is needed to use an API or SDK

This depends on the cloud used, of course, but it generally goes along the lines of the following:

  • An authentication key: This will be used to authenticate the tool against the endpoint that the cloud has proposed. Some cloud platforms allow for instance principals, in which case a key is not even needed
  • An SDK package: As an example, in Oracle Cloud Infrastructure, in order to use the Python SDK, it is necessary to ‘pip install’ the OCI package – something as simple as typing a one-liner
  • An initial read within the SDK documentation: Sometimes it takes less than an hour to come up with a prototype and scale from there. It is not necessary to be an expert in the inner workings of the SDK, although it helps
  • Some familiarity with the language of the SDK: This is a given, but nevertheless it would be unfair not to add to the list

The recommended approach

Usually SDKs come in the form of what is known as CRUD (Create, Read, Update, Delete). This is a simple approach, known since the dawn of time and as common-sensical as humanly possible, so it should not scare any seasoned engineer in the least.

The top-down approach: Starting in a very detailed manner can be daunting, so in this case it is my belief that a top-down approach is well suited, starting from a list of resources until familiarity with common structures is gained. Not to mention that in some cases the Pareto principle is in plain sight, meaning that in the first 20% of an SDK lies 80% of its power.

I found this when I started to build tools for Oracle Cloud; with some simple methods in Python, leveraging resource listing and drilling down into some objects, I was able to gather all the data necessary to start devising a data science model of resource usage. A major gain for our team.
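
The kind of starting point I mean is only a handful of lines. Here is a sketch along those lines, assuming a standard ~/.oci/config file and the OCI Python SDK’s compute client; using the tenancy root as the compartment is just a convenient default, and the fields printed will vary with what you actually need.

```python
# Sketch: listing compute instances with the OCI Python SDK as a first
# data-gathering step. Assumes a valid ~/.oci/config; the compartment used
# here (the tenancy root) is just a convenient default for illustration.
import oci

config = oci.config.from_file()               # reads ~/.oci/config
compute = oci.core.ComputeClient(config)

instances = compute.list_instances(compartment_id=config["tenancy"]).data
for inst in instances:
    print(inst.display_name, inst.shape, inst.lifecycle_state)
```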

From documentation to source code: Although documentation is helpful, it will never be as helpful as exploring the source code. In my experience, an hour going through the source code is worth around five hours going through the documentation. I understand that in some cases watching a YouTube video may be easier than reading a book on the same subject, but the depth acquired and the mental training are much greater when reading a book. With an SDK, or any program, it works the same way. Reading the documentation is nice, is useful and helps, but nothing will shed the same light as going through the source code.

The MVP: Since The Lean Startup hit the bookstores, the concept of the MVP has gained popularity significantly. I am referring to the minimum viable product: a product that contains only the basic features and that is not meant to cover all cases, only the standard base ones. The concept is well stated in The Lean Startup and, though often attached to Agile, it is an older one.

Although it would be ideal to have a product with a perfect set of functionalities from version 1.0.0, that is often, if not always, utopian. There is no perfect set of functionalities, and waiting to launch can lead to catastrophic failure. In IT, it is often significantly better to have a minimal product with a small client base that desperately wants the product than a big base that is relatively indifferent – if they have it they will use it, if not they can live without it.

Conclusion

At the end of the day this can be simplified into six points:

  • APIs are a significant advantage for automation and solutions such as HA or DR that are sometimes not provided by the cloud platform
  • Pareto applies everywhere – APIs are not only for big applications. A simple script can save several hours of weekly toil
  • Start with broad strokes and slowly go into the details
  • Do not be afraid to examine the source code – it is often a gold mine of knowledge
  • Start with a small minimal version, instead of several features that may or may not be used
  • Look for hidden treasures in the cloud platform that are not easily spotted through dashboards and UX

Happy cloud automation – and until next time.

Editor’s note: You can read more of Nazareno’s articles here.

Designing new cloud architectures: Exploring CI/CD – from data centre to cloud

Today, most companies are using continuous integration and delivery (CI/CD) in one form or another – and this is significant for various reasons:

  • It increases the quality of the code base and the testing of that code base
  • It greatly increases team collaboration
  • It reduces the time in which new features reach the production environment
  • It reduces the number of bugs that in turn reach the production environment

Granted, these reasons apply if – and only if – CI/CD is applied with more than 70% correctness. Although there is no single perfect way of doing CI/CD, there are best practices to follow, as well as caveats to avoid in order to prevent unwanted scenarios.

Some of the problems that might arise as a consequence include: the build being broken frequently; the velocity at which new features are pushed creating havoc in the testing teams or even in the client acceptance team; features being pushed to production without proper or sufficient testing; the difficulty in tracking, and even separating, big releases; and old-school engineers struggling to adapt to the style.

IaaC

A few years ago, the prevailing thinking was that CI/CD was only useful for the product itself; that it would only affect the development team, and that operations teams were only there to support the development lifecycle. This development-centric approach suddenly came to an end when different technologies appeared, spellbinding the IT market completely. The technologies I am referring to are those that allow infrastructure to be created as code.

CI/CD is no longer exclusive to development teams. Its umbrella has expanded across the entirety of engineering: software engineers, infrastructure, network and systems engineers, and so forth.

DevOps

Nobody knows what DevOps really is, but if you are not doing, using, breathing, dreaming – being? – DevOps, you’re doing it wrong. All teasing aside, with the advent of DevOps, the gap that existed between development teams and operations teams has narrowed, to the extent that some companies have merged the teams. Some have taken a different approach and have multidisciplinary teams where engineers work on the product throughout the lifecycle – coding, testing and deploying – on occasion including security as well, in what is now called DevSecOps.

As the DevOps movement becomes more popular, CI/CD does as well, since it is a major component. Not doing CI/CD means not doing DevOps.

From data centre to cloud

Having covered some terms and concepts, it is clear why CI/CD is so important. Since architectures and abstraction levels change when migrating a product from the data centre into the cloud, it has become necessary to evaluate what is needed in the new ecosystem, for two reasons:

  • To take advantage of what the cloud has to offer, in terms of the new paradigm and the plethora of options
  • To avoid making the mistake of treating the cloud as a data centre and building everything from scratch

Necessary considerations

The CI/CD implementation to use in the cloud must fulfil the majority of the following:

  • Provided as a service: The cloud is XaaS-centric, and avoiding building things from scratch is a must. If something is built from scratch and it is neither an in-house component nor a value-added product feature, I would suggest a review of the architecture in addition to a logical business justification
  • Easy to get in, easy to get out: A non-complicated process of in-out means that the inner workings of the implementation are likely to be non-complicated as well. Also, in case it does not work as expected, an easy way out is always a necessity
  • Portable configuration: This is a nice-to-have; to avoid reinventing the wheel and learning a given implementation’s details in depth, it should be easy to move from one system to another. Typical configurations are compatible with YAML or JSON formats; however, many providers allow the use of a familiar language such as Python, Java or JavaScript in order to fit the customer
  • Integration with VCS as a service: This is practically a given. As an example, Bitbucket provides pipelines within a repository. AWS does it differently with CodeCommit, which provides Git repositories as a service within. Different cloud providers will employ different ways and some will integrate with external repositories as well
  • Artifact store: It depends on the type of application, but having an artefact store to store the output of the build is often a good idea. Once the delivery part is done, deploying to production is significantly easier if everything is packaged neatly
  • Statistics and metric visualisation: This is in terms of what is occurring throughout the entire pipeline, which tests are failing, which features are ready, which pipeline is having problems, analogously for the code base, and not to mention the staging/testing/UAT or similar systems prior to production
  • No hidden fees: Although the technological part is important, the financial and economic part will be too. In the cloud, the majority of things turn into OpEx, and things that are running and unused can have a great impact. In terms of pipelines, it is important to focus on the cost of build minutes per month, the cost of storage per GB for the VCS and artefact store, the cost per parallel pipeline, and the cost of the testing infrastructure used for the given purpose, among other things. Being fully aware of the minutiae and reading the fine print pays off
  • Alerts and notifications: Mainly in case of failure, but also setting minimum and maximum thresholds for number of commits, for example, can yield substantial information; no-one committing frequently to the code base may mean breaking the DevOps chain
  • Test environments easy to create/destroy: The less manual intervention, the better. This needs to be automated and integrated
  • Easy ‘delivery to deployment’ integration: The signoff after the delivery stage will be a manual step, but only to afterwards trigger a set of automated steps. Long gone are the days in which an operator ran a code upgrade manually
  • Fast, error-free rollback: When problems arise after a deployment, the rollback must be easy, fast and, above all, automatic or at least semi-automatic. Human intervention at this stage is a recipe for disaster
  • Branched testing: Having a single pipeline and only performing CI/CD on the master branch is an unpopular idea – not to mention that if that is the case, breaking the build would mean affecting everyone else’s job
  • Extensive testing suite: This may not be necessarily cloud-only, but it is of significance. At minimum, four of the following must exist: unit testing, integration testing, acceptance, smoke, capacity, performance, UI/UX
  • Build environment as a service: Some cloud providers allow for virtualised environments; Bitbucket pipelines allow for integration with Docker and Docker Hub for the build environment

Monitoring, metrics, and continuous tracking of the production environment

The show is not over once deployment happens. It is at that moment, and after, when it is critical to keep track of what is occurring. Any glitch or problem can potentially snowball into an outage; thus it is important to extract as many metrics as possible and monitor as many sensors as possible without losing track of the important things. By this, I mean establishing priorities to avoid generating chaos between engineers on call and at their desks.

Most cloud providers will provide an XaaS for monitoring, metrics, logs and alerts, plus integration with other external systems. For instance, AWS provides CloudWatch that, in turn, provides everything as a service and integrated. Google Cloud provides Stackdriver, a similar service; Microsoft has a slightly more basic service in Azure Monitor. Another giant, Alibaba, provides Cloud Monitor at a similar level as the competition. Needless to say, every major cloud provides this as a service in one level or another.
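
As one example of what wiring the pipeline into such a service can look like, here is a sketch that pushes a custom post-deployment metric to CloudWatch with boto3; the namespace, metric name and dimension values are made up for illustration.

```python
# Sketch: publishing a custom post-deployment metric to CloudWatch.
# Namespace, metric name and dimension values are illustrative only.
import boto3

def report_deployment(service: str, success: bool):
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="CI-CD/Deployments",
        MetricData=[{
            "MetricName": "DeploymentFailed",
            "Dimensions": [{"Name": "Service", "Value": service}],
            "Value": 0.0 if success else 1.0,
            "Unit": "Count",
        }],
    )

report_deployment("checkout-api", success=True)
```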

This is an essential component and must not go unnoticed – I cannot emphasise this enough. Even if the cloud does not provide a service, it must provide integration with other monitoring services from other cloud-oriented service providers, such as Dynatrace, which integrates with the most popular enterprise cloud providers.

Conclusion

CI/CD is a major component of the technology process. It can make or break your product in the cloud, and in the data centre; however, evaluating the list above when designing a new cloud architecture can save time, money and effort on a significant level.

When designing a cloud architecture, it is fundamentally important to avoid copying the current architecture, and to approach the design as if the application were a cloud-native application, born to perform in the cloud together with its entire lifecycle. As I have mentioned previously, once a first architecture is proposed and initially peer reviewed, a list of important caveats must be brought to attention before moving on to a more solid version of the architecture.

As a final comment, doing CI/CD halfway is better than not doing it at all. Some engineers and authors may argue that it is a binary decision – either there is CI/CD or there is not. I rather think that every small improvement gained by adopting CI/CD, CI, or CD only, even in stages, is a win. In racing, whether it is by a mile or a metre, a win is a win.

Happy architecting and let us explore the cloud in depth.

Orchestration in the cloud: What is it all about?

Can orchestration be considered a better alternative to provisioning and configuration management, especially in the case of cloud-native applications? We can look at this from a variety of angles: comparing against data centre-oriented solutions; differentiating orchestration of infrastructure (in the cloud and out of the cloud) versus containers (focusing mostly on the cloud); and looking at best practices under different scenarios.

It’s worth noting here that this topic can span not only a plethora of articles but a plethora of books – but as the great Richard Feynman used to say, it is not only about reading or working through problems, but also about discussing ideas, talking about them, and communicating them to others.

I would like to start with my favourite definition of orchestration, found in the Webster dictionary. Orchestration is ‘harmonious organisation’.

Infrastructure or containers?

When discussing orchestration, inevitably, the first question we ask ourselves is: infrastructure orchestration or container orchestration?

These are two separate Goliaths to engage, but undoubtedly we will face them both in the current IT arena. It all depends on the level of abstraction we wish to attain, and also on how we organise the stack and which layers we want to take care of ourselves – or not.

If we have decided to manage at the infrastructure level, we will work with virtual machines and/or bare metal servers – in other words, either a multi-tenant or a single-tenant server. Say we hire our cloud in an IaaS fashion, then we are handed resources such as the aforementioned plus networking resources, storage, load balancer, databases, DNS, and so on. From there, we build our infrastructure as we prefer.

If we have decided to manage at the CaaS (sometimes seen as PaaS) level, we will be managing the lifecycle of containers or, as they are frequently referred to in the literature, workloads. For those unfamiliar with containers, they are a not-so-new way of looking at workloads. Some of the most popular are Docker, rkt, and LXC. Containers are extremely good for defining an immutable architecture, and also for microservice definition – not to mention they are lightweight, easily portable, and can be packed up to use another day.

There are pros and cons to each of these – but for now, let us proceed in discussing the orchestration aspect on these two endpoints.

Infrastructure

There are several choices to orchestrate infrastructure: here are the two that seem to be among the most popular in companies today.

Provisioning and configuration management: One way of doing this is the solid old-school combo of PXE/Kickstart files; it is slowly being replaced by more automated solutions, although some companies still stick to it, or to alternatives such as Cobbler. On the other side, we have tools such as Foreman. Foreman has support for BIOS and UEFI across different operating systems, and it integrates with configuration management tools such as Puppet and Chef. Foreman shines in data centre provisioning and leaves us with an easy-to-manage infrastructure, ready to be used or configuration-managed further.

Once the provisioning aspect is complete, we move on to configuration management, which will allow for management throughout the lifecycle. There are many flavours: Ansible, Chef, Puppet, Salt, even the old and reliable CFEngine. The last two are my favourites, along with Ansible, a Swiss army knife that has helped me many times given its simplicity and masterless way of working.

Orchestration and optional configuration management: Now, orchestration conceptually implies something different – as mentioned before, harmonious organisation – and the tool that is frequently used nowadays is Terraform. On the upside, it allows orchestration in a data centre or in the cloud, integrating with different clouds such as AWS, Oracle Cloud, Azure, and even AliCloud. Terraform has many providers, and sometimes the flexibility of the resource management lies in the underlying layer. Besides the cloud providers, it is also possible to integrate Terraform with third parties such as PagerDuty and handle all types of resources. From first-hand experience, that sort of integration was smooth and simple, although, granted, sometimes not mature enough.

Not all providers will yield the same flexibility. When I started to work with Terraform in Oracle Cloud, OCI did not have the maturity to do auto-scaling; hence the provider did not allow Terraform to create autoscaling groups – something so vital that I had taken it for granted from working with Terraform and AWS in the past. So another tip is to take a look at the capabilities of the provider, whether cloud or anything else. Sometimes our tools simply do not integrate well with each other, and when designing a proper architecture, that is an aspect which cannot be taken lightly.

Another plus of Terraform is that it allows orchestration of any piece of infrastructure, not only compute machines; it goes from virtual machines, bare metal and the like, to networking and storage resources. Again, it will depend on the cloud and the Terraform provider and plugins used.

What makes Terraform a new-generation tool is not only the orchestration, but the infrastructure as code (IaaC) aspect. The industry has steered towards IaaC everywhere, and Terraform is no exception. We can store our resource definitions in files in any VCS system – Git, SVN, or any other – and that is massive: it allows us to have a versioned infrastructure, teams can interact and everybody is up to speed, and it is possible to manage branches and define different releases, separating versions of infrastructure and environments such as production, staging, UAT, and so on. This is now considered a must: it is not wishful thinking, but the best-practice way of doing it.

Once the initial steps with Terraform are done, the provisioning can be completed with something such as Cloud-Init, although any bootstrapping will do. A popular alternative here seems to be Ansible: I have used it and as stated previously, it is a Swiss army knife for small, simple initial tasks. If we are starting to work on cloud, Cloud-Init will fit the bill. After that, other configuration management tools can take over.

That being said, I am an advocate of immutable infrastructure, so I limit configuration management to the minimum. My view is that in the future, configuration management tools will not be needed. If and when something fails, it should be destroyed and re-instantiated. System administrators should not need to know the names of resources and should only SSH into them as a last resort – if ever.

Container orchestration

Containers are not a new thing anymore; they have been around for a few years (or decades, depending on how we look at it), and they are stable enough and useful enough that we may choose them for our platform.

Although containers in a data centre are fun, containers in the cloud are amazing, especially because most clouds nowadays provide us with container orchestration, and a plethora of solutions exist in case we cannot get enough. Some examples include Amazon ECS; Azure Container Service (ACS); CoreOS Fleet; Docker Swarm; Google Container Engine (GKE); Kubernetes; and others.

Although I have left Kubernetes last, it has taken the spotlight. There are three reasons this tool has a future:

  • It was designed by Google and that has merit on its own, due to the humongous environment in which it was used and was able to thrive
  • It is the chosen project of the Cloud Native Computing Foundation (CNCF), and that means it has a bigger chance of staying afloat. The CNCF is very important for cloud-native applications and it is supported by many companies (such as Oracle)
  • The architecture is simple and easy to learn, can be deployed rapidly, and scaled easily

Kubernetes is a very promising tool that is already delivering results. If you are thinking about container orchestration at scale, starting to delve into something such as Minikube and slowly progressing to easy-to-use tools such as Rancher will significantly help to pave the road ahead.
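
For a first hands-on step against a Minikube (or any other) cluster, here is a tiny sketch using the official Kubernetes Python client, simply checking node readiness; it assumes a working kubeconfig on the machine running it.

```python
# Sketch: check node readiness on a Kubernetes cluster (e.g. Minikube)
# using the official Python client. Assumes a valid local kubeconfig.
from kubernetes import client, config

config.load_kube_config()                 # picks up ~/.kube/config
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"),
        "Unknown",
    )
    print(f"{node.metadata.name}: Ready={ready}")
```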

Conclusion

There are many solutions, as has been shown, depending on what sort of infrastructure is being managed; also where the infrastructure is located, the scale, and how it is currently being distributed.

Technologies can also be used jointly. Before Oracle Cloud had OKE (Oracle Kubernetes Engine), the way we implemented Kubernetes in the cloud was through a Terraform plugin that instantiated the necessary infrastructure and then deployed the Kubernetes cluster on top of it, for us to continue configuring and managing it and installing applications such as Elasticsearch on top.

The industry is moving towards the cloud, and that new paradigm means everything is delivered as XaaS (everything as a service). This in turn means that building distributed architectures that are reliable, performant, scalable and lower cost will be – and for some companies already is – a huge competitive advantage.

Nonetheless, there are many technologies to choose from. Often, aligning with the industry standard is a smart decision. It means it is proven, used by companies, in current development, and will be maintained for years ahead.

How to botch a cloud migration in three easy steps – and how to remedy it

Today, the cloud is where everyone wants to be; there is no place like it, and nothing makes your clients happier. Okay, that may be a bit of a stretch – but it’s also fair to say that not everyone can dive into the cloud with their current architecture. Many times, it takes a complete re-engineering process in order to carry a company – or even a simple product – into any cloud.

Here are three areas to think about to ensure you do not carry out a disastrous cloud migration.

Rush it and think ‘agile!’

Go with the flow and just migrate: how hard can that be? We are all familiar with the phrase ‘failure to plan is planning to fail’ – and most cloud migrations break in terrible ways due to failure to plan. All that is needed is a few engineers to move components from one place to the other, then everything is plugged together and – voila – it magically works, because it is the cloud, a magical place where everything seems to fall seamlessly into place. Right?

Perhaps not. There is another way. The old-school way, the real engineering way, where you do not just rely on a few system administrators, but instead run different teams with a central project manager who keeps things on track and acts as the glue that keeps the teams communicating and working with each other like a well-oiled machine.

Communication is a critical step. We all dislike meetings, but if we do not communicate, we will run into massive issues. It is important that at least one person from each team has some face time with a person from each other team, along with a once-a-week global meeting to see how things are moving.

Thinking from first principles – an age-old concept recently popularised by Elon Musk – is good, but learning what others did and how they worked is extremely useful. Using agile principles is nice; holding 10-minute stand-up meetings is nice; having a scrum master is good; but the old methodology of a PM with a Gantt chart keeping track of things and documenting everything is known to work wonders.

Above all, go slowly and design a solid foundation: an architecture on which the engineers can give their feedback and build on good ground. Trust the engineers. They will raise concerns about technical difficulties; if not, they will be the ones paying for it – trust them.

Lift and shift

Lift and shift is very popular nowadays. Why? Some people think moving to the cloud means moving from one data centre to another, cheaper and with more resources. It’s distributed, with nicer dashboards, but it’s just another data centre.

Needless to say, this is not the case. This is only re-hosting. This is how the process usually goes:

  • Create an inventory of resources
  • Instantiate the same resources in cloud X
  • Create the failover, high availability and/or disaster recovery solutions
  • Upload all the data and watch everything fall apart

A bigger problem is that it sometimes ‘works’, but there is no improvement. Moving to cloud is adopting a new paradigm: it means implementing cloud orchestration, automation, a different form of resilience, and of course everything as a service (XaaS); using third party components instead of implementing your own.

Once in the cloud, you do not need to install products such as Icinga or Nagios; monitoring is already there as a service. It’s the same for most LAMP stack components – it’s everything as a service! As the old quote goes – simplicity is the ultimate sophistication.

Move to an immature cloud

Don’t forget to run a checklist on things that might be needed. As an example:

  • How many X as a service do I have? What is my uptime?
  • Do I have redundancy? Regions?
  • How many availability zones and domains do I have? Where? Are they real?
  • What about security and restrictions?
  • Compliance and data regulations – is my data safe?
  • Popularity – is this cloud here to stay? Long term?
  • Third-party vendors – is there an app, support, solution, consultants’ market?

Conclusion

There are many things which need to be considered before moving to cloud. This gives a glimpse of what is needed, but do not be discouraged: the benefits are tenfold. Moving to the cloud is no longer being ahead but riding the wave; and not being in the cloud soon will certainly mean falling behind.