Updated: Apr 8
Thanks to the Sokube team for summarizing all those sessions!
Check out other days:
The passionate opening keynote by Priyanka Sharma, General Manager of the Cloud Native Computing Foundation, highlighted how the CNCF is positioning itself as the champion of end-user-driven open source solutions, uniting giants, small teams and companies in a community of excellence.
An interesting piece of information shared by Cheryl Hung, VP Ecosystem of the Linux Foundation, in her keynote on the CNCF End User Community: the CNCF will regularly publish its quarterly End User Technology Radars, which draw a clear picture of what end users are considering or have already adopted for a particular practice, such as Continuous Deployment. We’ll be looking out for the September edition!
CNCF projects mentioned in the CNCF Projects Update session by Splunk’s Principal Software Engineer Constance Caramanolis: Argo is definitely trending as a key player in the CD area. We’re looking forward to future releases and the unified (FluxCD / ArgoCD) GitOps Engine.
Other projects that were highlighted: SPIFFE, Contour, TiKV (a new distributed Key Value store), and Jaeger.
Cisco’s President and CTO of Cloud, Vijoy Pandey, highlighted in a sponsored keynote how their SD-WAN multi-cluster technology helps simplify multi-cluster implementations.
Another sponsored keynote, by Red Hat’s Sally Ann O'Malley and Urvashi Mohnani, celebrated the 5 years of the Open Container Initiative. Looking at what happened during the past 5 years and how the various ecosystems have evolved on top of the initial scope (unifying container runtimes), we see how beneficial the initiative has been. OCI allows us to choose our container runtime, the registries we want to use to store our container images and the tools we want to automate their creation and distribution. Long live the OCI!
The keynote from Liz Rice, Vice President of Open Source Engineering at Aqua Security, presented the CNCF Technical Oversight Committee and how the CNCF graduation process assesses submitted projects, with a measure of quality that focuses on user feedback about each project’s technical qualities and governance.
This short talk was about what Rook is and how this storage framework elegantly encapsulates multiple storage technologies (like Ceph and GlusterFS) which are sometimes quite difficult to deploy and operate. Its key features? Dedicated Custom Resource Definitions (like CephCluster, CephBlockPool) and Kubernetes Operators truly simplify the underlying lower-level storage technology by making it a first-class Kubernetes citizen, with a definable desired state and operators that perform the necessary operations.
The demo was unfortunately too short to highlight all the potential of the framework, especially in the case of bare metal clusters where it shows the highest benefits.
By Ying Chun Guo, Software Engineer, IBM
We were disappointed by the poor audio and video quality of the talk.
The demonstration was about using Tekton Triggers to build a canary release on Knative.
The speaker showed us how to send X% of the traffic to one route and Y% to the other route using Tekton Triggers.
Code is pushed to GitHub, for instance to get a “Hello World 2021” on a new release (vs “Hello World 2020” on release-1).
The Tekton trigger launches a Tekton pipeline, whose deploy stage adds a new route in Knative. This new route initially handles only 20% of the requests, not all of them. Later on, an operator “upgrades” the Knative route and the new release becomes the production release.
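To make the 80/20 split concrete, here is a minimal Python sketch of percentage-based routing between two revisions. It is not Knative or Tekton code: the revision names and the `pick_revision` helper are hypothetical, and only the 20% figure comes from the talk.

```python
import random
from collections import Counter


def pick_revision(weights, rng):
    """Pick one revision name according to its traffic percentage."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical revision names; the 20% share for the new release is from the talk.
traffic = {"release-1": 80, "release-2": 20}

rng = random.Random(42)  # fixed seed so the simulation is repeatable
total = 10_000
hits = Counter(pick_revision(traffic, rng) for _ in range(total))
share_new = hits["release-2"] / total
print(f"release-2 received {share_new:.1%} of the simulated requests")

# "Upgrading" the route later amounts to sending all traffic to the new release.
promoted = {"release-2": 100}
```

The point of the sketch is that promotion is just a change of weights: the new release is already running and serving a fraction of the requests before it becomes the production release.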
Product managers are not technical people, and it is sometimes difficult to involve them in technical considerations. This talk presents the key features of Kubernetes that are interesting for product managers. It answers the question: “How can Kubernetes help to get a more efficient delivery?”
Kubernetes helps at each stage of the product lifecycle:
Build: teams need to focus on development, not on metrics, configuration or alerts => Kubernetes will handle that for you, using well-known products like Prometheus, AlertManager, …
Deploy: release cycles are difficult to maintain, and deploying a new “application” was difficult in the past (you needed to build infrastructure around the app).
If your service is encapsulated in a Pod, then you “just” have to deploy this Pod in your Kubernetes cluster.
Runtime: every service (or application) can scale, using the HPA (Horizontal Pod Autoscaler) for instance, and so adapt to the current user volume.
The Kubernetes ecosystem grows every day, offering tooling around observability, which is the biggest pain point for product owners (what happened that made my app unavailable? why is it slower than before? …)
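For the Runtime point, the scaling rule documented for the Horizontal Pod Autoscaler can be sketched in a few lines of Python. This is a simplified illustration: it ignores the tolerance band and stabilization behaviour of the real controller, and the function name and default bounds are assumptions chosen for the example.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Simplified HPA rule: scale proportionally to metric / target, then clamp.

    min_replicas and max_replicas are illustrative defaults, not Kubernetes defaults.
    """
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 2 pods at 90% average CPU with a 50% target: scale up.
print(desired_replicas(2, current_metric=90, target_metric=50))
# 4 pods at 20% average CPU with a 50% target: scale down.
print(desired_replicas(4, current_metric=20, target_metric=50))
```

This is the kind of behaviour a product manager can relate to: capacity follows user volume, within bounds that cap the infrastructure cost.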
Is my application portable?
The key is infrastructure-as-code. Everything is YAML with Kubernetes, which is a good starting point for moving to infrastructure-as-code.
Don’t add anything you don’t need to your development repository, keep application configuration at the Kubernetes level, and have a Docker image that is independent of the rest of the world. If you respect these guidelines, your application/service should be fully portable with Kubernetes.
Continue delivering value and find ways to increase profits
Reducing infrastructure cost is a good way to do that
Adopt the platform-as-a-product principle, then you can have:
a platform that scales up and down according to the requests
a cheaper infrastructure overall
Find the right balance between:
managing reliability and responding to failure - DevOps
functionality and improvement - Development
This talk was clear, and it is a useful reminder of how to communicate about the features of Kubernetes to persuade your audience that it meets the requirements (especially if you are not talking to a technical person). Don’t forget that observability and scaling are very important features of Kubernetes, and the product owner should want them. Moreover, you can manage technical configuration at the Kubernetes level and update it easily. Product owners will approve of this too, because they don’t want the development team to lose time customizing technical configuration.
By Shane Lawrence, Senior Infrastructure Security Engineer, Shopify
A terrific session (followed later by an additional keynote) on Shopify's journey to implement intrusion detection on their clusters (which sometimes peak at >170,000 req/s). The presenter truly made a case for the practice and for the Falco open source technology behind it, highlighting how nicely it complements more traditional security practices (like network policies, RBAC, etc.).
Whereas these aspects belong to prevention, Falco focuses on the detection side. The technology, which relies on the extended Berkeley Packet Filter (eBPF) for low-level information extraction, also uses insights from container and Kubernetes cluster data to extract, detect and classify syscalls that match suspicious activities from known vulnerabilities defined in rules, and records each match as an intrusion.
We appreciated the tremendous level of detail, insight and honesty about the difficulties and challenges faced when you implement such a technology. In particular, its success depends highly on the active maintenance (typically by a staffed infrastructure security team) of the various warnings, rules and classification of false positives to eliminate noise and make the whole system truly relevant and effective.
The demo was using known vulnerabilities and real life use-cases, which made the whole session absolutely exceptional.
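The detection idea described above (events enriched with container context, matched against rules, with false positives tuned out) can be sketched in Python. This is not Falco's rule syntax or engine: the event fields, rule names and allowlist mechanism are all hypothetical and only illustrate the principle.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    priority: str
    condition: Callable[[dict], bool]


# Hypothetical rules over hypothetical event fields.
RULES = [
    Rule("shell_in_container", "WARNING",
         lambda e: e.get("syscall") == "execve"
         and e.get("container") is not None
         and e.get("proc") in {"sh", "bash"}),
    Rule("write_below_etc", "ERROR",
         lambda e: e.get("syscall") == "openat"
         and e.get("write", False)
         and e.get("path", "").startswith("/etc/")),
]


def detect(events, rules, allowlist=frozenset()):
    """Return (priority, rule, container) for each match not allowlisted."""
    alerts = []
    for event in events:
        for rule in rules:
            key = (rule.name, event.get("container"))
            if rule.condition(event) and key not in allowlist:
                alerts.append((rule.priority, rule.name, event.get("container")))
    return alerts


events = [
    {"syscall": "execve", "proc": "bash", "container": "web-1"},
    {"syscall": "execve", "proc": "bash", "container": "debug-toolbox"},
    {"syscall": "openat", "path": "/etc/passwd", "write": True, "container": "web-1"},
    {"syscall": "openat", "path": "/tmp/cache", "write": True, "container": "web-1"},
]
# Known false positive: shells are expected in the (hypothetical) debug container.
allow = frozenset({("shell_in_container", "debug-toolbox")})
alerts = detect(events, RULES, allow)
for alert in alerts:
    print(alert)
```

The allowlist stands in for the ongoing maintenance work the speaker insisted on: without it, expected activity drowns the real alerts in noise.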
By Laurent Bernaille, Staff Engineer, Datadog
Another incredible session on an often overlooked network aspect of our Kubernetes clusters: DNS resolution. Obviously, for Datadog, with 200,000 DNS queries/s received on their clusters, each call counts and DNS becomes a crucial element to consider carefully.
Through a very detailed journey, full of fun and less fun stories and findings, the speaker exposed this part of the Kubernetes network stack, what it really does and all the possible issues you can face when you run clusters at this size: from Alpine images ignoring IPv6 disabling, to Linux kernel limitations and patches, to AWS limits. The speaker turned a usually “boring” topic into something absolutely interesting from start to finish, demonstrating Datadog’s true mastery of networking.
We highly recommend watching it when the video recordings are released, if you’re curious about what’s under the DNS hood.
Deploy a new version in parallel
Once it is considered reliable (based on live production traffic), do the rollout
Keep in mind:
Limit the blast radius
You don’t really need to handle the “avoid downtime” part yourself, because Kubernetes does it for you with Deployments and their rollout strategy.
Progressive deployment strategies:
Blue-Green deployment: difficult, because rolling out your X deployments can be hard.
Canary deployment: who gets the new version rather than the stable one? The percentage is increased progressively, but we don’t control who sees what.
Feature flags: deliver new features that are off by default, then enable them one by one. If something goes wrong, just disable the feature again.
Everybody sees every enabled feature.
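The feature-flag strategy in the list above can be illustrated with a minimal in-memory Python sketch. The `FeatureFlags` class and the flag name are hypothetical; real setups usually rely on a flag service or configuration store rather than a local dictionary.

```python
class FeatureFlags:
    """In-memory flag store: every feature starts disabled."""

    def __init__(self, names):
        self._enabled = {name: False for name in names}

    def enable(self, name):
        self._set(name, True)

    def disable(self, name):
        self._set(name, False)

    def is_enabled(self, name):
        return self._enabled[name]

    def _set(self, name, value):
        if name not in self._enabled:
            raise KeyError(f"unknown feature flag: {name}")
        self._enabled[name] = value


def checkout_page(flags):
    # Same deployed build, two code paths selected at runtime.
    return "new checkout" if flags.is_enabled("new-checkout") else "old checkout"


flags = FeatureFlags(["new-checkout"])
before = checkout_page(flags)   # feature shipped but off by default
flags.enable("new-checkout")
during = checkout_page(flags)   # feature switched on for everybody
flags.disable("new-checkout")
after = checkout_page(flags)    # something went wrong: switch it off again
print(before, "->", during, "->", after)
```

Note that no redeployment happens between the three calls, which is what makes re-disabling a feature so cheap compared to a rollback.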
Monitoring is the new testing
See the issues encountered by users in production, and react to them automatically.
Using Istio and Flagger:
Flagger will check that your new deployment is OK.
Flagger’s Prometheus will show information about the old version and the new version.
Use the Canary object provided by Flagger (to check your canary deployment’s status, use kubectl get canaries).
The first time, Flagger creates a Deploy-primary in parallel to your deployment (it is your production version).
When there’s a new version of the deployment, the “Canary” Deployment is created. If it fails, it will never replace your production Deployment.
Watching kubectl get canaries, you will see that the traffic is progressively shifted to the new release (if the release is OK, i.e. if the check using curl passes). When all checks pass and 100% of the traffic is on the “canary” deployment, the status of the Canary becomes “promoting”, and your production deployment is updated.
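The progression described above can be summarized as a small loop. The following Python sketch is not Flagger's implementation: `step_weight`, `max_failures` and the phase names returned are assumptions used to illustrate a gated traffic shift.

```python
def run_canary(check, step_weight=20, max_failures=2):
    """Shift traffic to the canary step by step while checks pass.

    check(weight) is a hypothetical health check (e.g. a curl-based probe)
    called with the canary's current traffic percentage.
    Returns the final phase and the history of canary weights.
    """
    weight = 0
    failures = 0
    history = []
    while weight < 100:
        if check(weight):
            weight = min(100, weight + step_weight)
            history.append(weight)
        else:
            failures += 1
            if failures >= max_failures:
                # Roll back: the canary never replaces the production version.
                return "Failed", history + [0]
    return "Promoting", history


healthy = run_canary(lambda weight: True)
broken = run_canary(lambda weight: weight < 40)  # starts failing at 40% traffic
print(healthy)
print(broken)
```

In the failing case the production deployment is never touched, which is exactly the "limit the blast radius" goal mentioned earlier.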
A very interesting talk, and we bet these types of deployment with prechecks are the future of deploying applications in a Kubernetes environment. The setup doesn’t seem very hard, and it gives confidence that every build deployed in production is working. It is a very valid deployment alternative for organizations not yet comfortable enough with continuous delivery.
The authors presented Cloud Native Buildpacks and how they help developers to avoid the complexities of crafting correct and high quality Docker images. The custom tools in the solution are definitely worth a look but it seems heavily focused so far on Spring Boot apps.
The technology of buildpacks brings tons of advantages for cloud app developers, from faster builds to a Bill of Materials (detailing exactly what’s in the various image layers) and image rebasing capabilities (changing the base “OS” layer image) that minimize changes and the transfer of layer binaries.
The speakers detailed Intuit’s journey towards building a company-wide distributed tracing platform for their various applications. They made a case for observability and the significant impact it has on their MTTR / MTTD.
They explained in detail the implementation at Intuit with various tools and technologies like Jaeger and Zipkin, and how they finally aggregate the massive amount of span data into a set of aggregated transaction metrics for their various services.
If you’re considering implementing distributed tracing / analysis, this session is definitely worth a look, and the Q&A session was rich in implementation details.
By Priya Wadhwa, Software Engineer, Google (Minikube maintainer)
The session was about the performance improvements that were included in minikube, and how the performance issues were detected.
Need to get reliable measurements:
single process vs entire system
priyawadhwa/track-cpu - equivalent to docker ps
tstromberg/cstat => more precise than iostat
Performance tools used:
Using the USE method, the speaker found that:
Step 1: Blocking I/O
Using eBPF, biosnoop, and flame graphs, Priya identified etcd writes as an overhead contributor (many block device I/Os with high latency).
There’s a way to tune how often etcd writes to disk, so Priya tried it.
Tuning the snapshot count, which defines the number of committed transactions that trigger a snapshot to disk (default is 10k)
--snapshot-count => only 2% improvement
So maybe etcd is not the main overhead contributor.
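To see why the snapshot count was a plausible knob, a deliberately simplified model helps: one snapshot every `snapshot_count` committed transactions. The Python sketch below only illustrates that arithmetic; the default of 10,000 is the figure quoted above, the other values are arbitrary, and real etcd disk activity also includes write-ahead log writes that this model ignores.

```python
def snapshots_triggered(committed_transactions, snapshot_count=10_000):
    """Simplified model: one snapshot per snapshot_count committed transactions."""
    return committed_transactions // snapshot_count


workload = 100_000  # arbitrary number of committed transactions
for count in (10_000, 50_000, 100_000):
    print(f"snapshot_count={count}: {snapshots_triggered(workload, count)} snapshot(s)")
```

Fewer snapshots mean fewer large writes, yet the measured gain was only 2%, which is consistent with the conclusion that snapshots were not the dominant cost.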
Step 2 - some spikes on CPUs:
pidstat 1 60 -l => shows that the kubectl process is the main CPU consumer; kubectl apply -f /etc/kubernetes/addons identified => too many polls on the directory?
What is kube-addon-manager?
Minikube uses kube-addon-manager to enable/disable addons in the cluster. The addon manager runs kubectl apply every 5 seconds to ensure that desired state matches current state.
So Priya decided to remove the addon manager completely, since increasing its poll time had already shown some benefits
removing addon manager => 32% improvement
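The cost of a 5-second reconcile loop is easy to underestimate. This short Python sketch counts how many times kubectl apply would be spawned per hour for a given poll interval; only the 5-second value comes from the talk, the other intervals are illustrative.

```python
def applies_per_hour(poll_interval_seconds):
    """Number of kubectl apply invocations per hour for a fixed poll interval."""
    return 3600 // poll_interval_seconds


for interval in (5, 30, 60):
    print(f"every {interval}s -> {applies_per_hour(interval)} kubectl apply runs per hour")
```

Each run starts a new process and queries the API server even when nothing has changed, which helps explain why this idle polling showed up as CPU spikes.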
Step 3: What part of k8s is contributing to overhead?
kube-apiserver seems to consume a lot of CPU, even when we don’t use minikube…
Using pprof we see that the function using 50% of the time is syscall.Syscall
and using the associated pprof dashboard we can see which function calls which function.
By retrieving the kube-apiserver logs and filtering the “GET” requests in them, we can see that leader election happens frequently, coming from the kube-controller-manager and the kube-scheduler.
But we don’t need leader election in a minikube instance, since it’s a one-node cluster… so Priya turned off leader election by setting --leader-elect=false on kube-controller-manager and kube-scheduler
by setting the DNS replicas to 1 and removing leader election => 18% improvement
using pprof => 40% is syscall.Syscall
pprof graph shows:
Looking at Go’s http library => found httpproxy packages in the etcd code, with a refresh interval set by default to 30 seconds. Increasing this value implies that any endpoint will take longer to be proxied properly.
--proxy-refresh-interval: Time (in milliseconds) of the endpoints refresh interval (default: 30000)
--proxy-refresh-interval=70000 => only 4% improvement
This presentation was amazing, not only for the solutions given (although it was pretty interesting to understand how minikube has been improved since the beginning of the year), but also because it demonstrates how to assess a cluster’s performance, how to investigate it, and which tools can be used.
It also highlights that removing unnecessary components is a good idea (and Kubernetes distributions like k3s have taken this road already). However, never forget the user experience, in particular if you decide to increase poll times.
Infrastructure as Code is useful, but it has some limits and raises questions:
Is your Infrastructure as Code organized well enough to make changes easily?
Is it possible to share your Infrastructure as Code across multiple teams?
Do you spend too much time on Infrastructure as Code implementation compared to application business logic?
→ IaC must solve consistency and reusability for the infrastructure Manifest and its delivery Pipeline.
Code everything for Infrastructure
Defining everything as code is a core practice for making changes quickly and reliably. It fosters reusability, consistency and transparency.
Chaos of code: complicated and fragile code
Many specific projects are hard to share,
Too many templating languages: envsubst, Jinja…,
Tangled code: dependencies between different tools like Terraform, Kubernetes…
Rethink Infrastructure as Code
Rethink the approach to IaC with:
A software approach to sharing code: abstraction, modules, groups…,
A unified interface for every infrastructure provider,
Pipelines/operations taken into account from the design stage
To answer all these questions, Design Pattern as Code appears in the landscape. Design Pattern as Code is derived from IaC and powered by Cuelang and Tekton. It provides reusable and composable Cloud Native Architecture patterns written in a language that is powerful for software engineers.
What is Design Pattern as Code?
Design Pattern as Code addresses:
How to deliver a Cloud Native application,
How to integrate Cloud Native solutions.
A Design Pattern:
Declares all of the infrastructure provider Manifests,
Puts Manifest and Pipeline together,
Is composable with other Design Patterns
A Design Pattern written in Cuelang consists of:
Cuelang for the unified interface:
Powerful typed language focusing on declaring data,
Designed for scale, generating configs from multiple patterns (commutative and idempotent)
Tekton for the workflow engine
Tekton - Declarative pipeline to run the tasks
Each task is isolated from other tasks,
Can easily add a new task to a pipeline
The Design Pattern generates a Tekton Pipeline from the task definitions
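The "commutative and idempotent" property mentioned for Cuelang is what makes composing patterns order-independent. The Python sketch below imitates that behaviour with a much simplified `unify` function over plain dictionaries: it only handles concrete values, whereas CUE unifies types and constraints as well, so treat it as an analogy rather than CUE semantics. The field names are hypothetical.

```python
def unify(a, b):
    """Merge two configs; identical values agree, different values conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, value in b.items():
            merged[key] = unify(merged[key], value) if key in merged else value
        return merged
    if a == b:
        return a
    raise ValueError(f"conflicting values: {a!r} vs {b!r}")


# Two hypothetical patterns contributing to the same configuration.
api = {"service": {"name": "shop", "port": 8080}}
monitoring = {"service": {"name": "shop", "metrics": True}}

ab = unify(api, monitoring)
ba = unify(monitoring, api)
print(ab)
print("commutative:", ab == ba)
print("idempotent:", unify(api, api) == api)
```

Because conflicts raise an error instead of silently overriding a value, two patterns can never disagree without someone noticing, whatever the order in which they are composed.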
Example of Design Pattern as Code
Break the Manifest into reusable portions and improve agility:
Compose patterns:
Compose Pipeline tasks:
Put Manifest and Pipeline in the API Pattern:
API Pattern implementation example:
The domainName used for the resource configuration is consistent with the task to check if the API is active.
Design Pattern as Code helps us keep consistency between configuration and pipeline.
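The consistency argument can be shown with a small Python sketch in which a single `domain_name` parameter feeds both the resource configuration and the pipeline task that checks the API. The structure of the returned dictionaries, the task names and the example domain are all hypothetical; the talk's patterns are written in Cuelang and generate Tekton pipelines, which this sketch does not attempt to reproduce.

```python
def api_pattern(domain_name):
    """Build a manifest and its pipeline from one shared parameter."""
    manifest = {"kind": "Ingress", "host": domain_name}
    pipeline = [
        {"task": "deploy", "manifest": manifest},
        # The check task reuses the same domain, so it cannot drift from the manifest.
        {"task": "check-api", "url": f"https://{domain_name}/"},
    ]
    return {"manifest": manifest, "pipeline": pipeline}


pattern = api_pattern("api.example.com")
print(pattern["manifest"]["host"])
print(pattern["pipeline"][-1]["url"])
```

With separately maintained manifest and pipeline files, a renamed domain has to be changed in two places; here there is only one place to change.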
Finally, a Design Pattern can be used for:
Security and Compliance (for example, adding a security checkpoint like OSS static analysis of container vulnerabilities to a pipeline)
Observability and Analysis (installing and maintaining metrics agents to establish well-designed feedback)
Multi-Platform (it can leverage any public cloud specific features by just swapping the Design Pattern)
Improving the value proposition (Design Pattern as Code is a new interface for maximizing the product value stream)
But a Design Pattern doesn’t address state management yet; another layer is required.
This is a very interesting subject, and with these features Design Pattern as Code could become a standardized approach in the future.
By Anton Weiss (Principal Consultant, Otomato Software)
This talk was about how to start your journey toward basic CI/CD or Cloud Native CI/CD. The “jungle” refers to the proliferation of tools, methodologies, languages, frameworks, databases, and cloud and service providers for CI/CD.
This makes it difficult to make decisions and to integrate a new or an existing solution without pain. On the one hand: quality, supportability and clarity; on the other hand: velocity, autonomy and creativity. We shouldn’t have to choose between them.
Finally, Cloud Native CI/CD is a better way to escape from the jungle:
Integrated with the Platform (Kubernetes)
Dynamic Agents and Environment
Decouple Delivery from Release