Kubernetes Today
Kubernetes in the Enterprise
Kubernetes: it’s everywhere. To fully capture or articulate the prevalence and far-reaching impacts of this monumental platform is no small task — from its initial aims to manage and orchestrate containers to the more nuanced techniques to scale deployments, leverage data and AI/ML capabilities, and manage observability and performance — it’s no wonder we, DZone, research and cover the Kubernetes ecosystem at great length each year.

In our 2023 Kubernetes in the Enterprise Trend Report, we dive further into Kubernetes over the last year, its core usages as well as emerging trends (and challenges), and what these all mean for our developer and tech community. Featured in this report are actionable observations from our original research, expert content written by members of the DZone Community, and other helpful resources to help you go forth in your organizations, projects, and repos with deeper knowledge of and skills for using Kubernetes.
I spent my previous two posts on the difference between efficient versus effective software engineering, and then how it ties in with accidental versus essential complexity. I am curious how AI will change the programming profession in the coming decades, but I am critical of any hype-infused predictions for the short term. AI won’t dream up software that remains valuable over decades. That remains the truly hard problem. It can help us out fine at being more efficient but does a poor job at effectiveness. Better rephrase that as an unreliable job. Effectiveness is about building the right thing. A thing that is aligned with our human interests and doesn’t harm us. Self-driving cars, designed not to crash into other cars or pedestrians, are unreliable at best. It’s easier to specify safeguards, but fiendishly hard to implement. And it gets even harder. Once we have millions of them on the road, every day some of these will make life-or-death decisions between the lesser of two evils. The machine needs to judge what’s best for other humans, in a split second and with Vulcan detachment. The needs of the many outweigh the needs of the one, it will argue. When it comes to such existential decisions, we should remain firmly in the driver’s seat to shape the kind of machine future we want. Current AI is much better equipped to handle efficiency improvements. It can swap out alternatives, weigh their relative merits, and suggest the combination that leads to the most efficient solution. But the smarter it gets, the less we should trust it with controversial topics that require judgment. Because things might take a scary turn. Nick Bostrom’s famous paperclip maximizer is an amusing thought experiment with an important warning: AI will optimize for whatever you instruct it. If that happens to be making paperclips and provided it is infinitely powerful and infinitely selfless, it will strip entire galaxies of their metal to make more useless stationery. Even if AI were to become self-conscious, with or without a dark agenda, it would still be alien, and by definition so (it’s in the word artificial). Isaac Asimov predicted that a human creation with individual agency should probably have some hardcoded safeguards in place. His three laws of Robotics predated the ENIAC by only three years. But he couldn’t have predicted the evil genius who added some private exceptions to the “do no harm” principle through a sneaky firmware upgrade, like in the first Robocop movie. Enough gloomy gazing in the palantír. What I do predict (having no stock in any of the major stakeholders) is that the art of programming will transform into the art of expressing what you need clearly and unambiguously. Developers will become AI-savvy business analysts, accustomed to speaking to an AI, using the ultimate high-level programming language, i.e., English. It will always build working software, and if we’re lucky it will even be useful. Working Software Is Not Good Enough Isn’t it strange that the Agile Manifesto called for working software? As if broken software were ever an acceptable alternative! Is it too much to ask that prompt-generated code is also useful and valuable? Yes, it probably is too much to ask. The gap between working and valuable software is huge because the value is intangible and unpredictable. Perfectly fine software can lose its relevance through no fault of your own and in ways that no upgrade can fix. Here are a few examples. It's not the first time I mentioned the long-forgotten OS project Chandler. 
Its rocky path to version 1.0 is beautifully told in Scott Rosenberg’s 2007 book Dreaming in Code. It’s an enduring reminder that the best intentions, a team of dedicated top-notch developers, and a generous sponsor (Mitch Kapor, who created Lotus 1-2-3) are not guarantees for success. Chandler set out to be a free alternative to Microsoft Outlook and Exchange. It promised a radically different user experience. It was going to disrupt how we handled messages, agenda items, and to-do lists. And it meant to do so in a desktop app, communicating through a peer-to-peer protocol. Power to the people! But the team had taken too many wrong turns in their architectural roadmap. Like Icarus, they flew too close to the sun. The world caught up with them. More powerful browser features made a Python-based desktop app a poor choice. Cheap and easy hosting of your own server removed the need for a peer-to-peer protocol, a design choice that unleashed a torrent of accidental complexity. Now, all those could have been remedied if the community had wanted to. But it didn’t. The essential problem lay in the user experience. The ideas were too radical. They were not what the average office worker needed. I haven’t seen any of them implemented in other products (but I’ll gladly stand corrected). People are still using mail and agendas the way they did in 1995, only now on their phones and without beveled corners. The Unplanned Obsolescence of GWT Sometimes a great tool can become obsolete because its original unique selling point doesn’t sell anymore. Google Web Toolkit (GWT) had a compelling proposition in 2006. Desktop computers had enough horsepower to support the browser as an application platform. You could do your taxes without having to install anything. But browser incompatibilities were rife, especially for advanced stuff like drag and drop or double-clicking. GWT let you write backend and frontend code in the same project, with shared objects for data transfer and validation, and deploy them in a single web archive. GWT compiled Java to JavaScript, and you could even debug your client-side Java code with a local development server. I loved it and made serious money off it for a while. But compilation was notoriously expensive. Browser vendors reconciled their quirks. Front-end platforms like Angular and React quickly matured. Building frontends became a serious career, and these developers didn’t seem to shun JavaScript as a programming platform. GWT had lost its relevance, and there’s no way AI could have foreseen or fixed it. The problem was not about code, but a mismatch with the world around it. Keep Coding for Coding’s Sake Let none of this discourage you from writing code, by the way. There is no need for serious software to be effective in a commercial sense, or to have any practical benefit at all. I’m talking about amateur Open Source. I have written software that I’m proud of, but that had no business plan, no roadmap, and no other motivation than my own education and enjoyment. It was effective to the extent that it taught me new concepts, but I had zero appetite for my own dog food. There are many such projects on GitHub. I mean no disrespect. I speak from personal experience. There’s nothing wrong with coding for coding’s sake, but that’s like playing in a band that never performs for an audience: hard to keep up.
Java was the first language I used professionally and is the scale by which I measure other languages I learned afterward. It's an OOP, statically-typed language. Hence, Python feels a bit weird because of its dynamic typing approach. For example, Object offers the methods equals(), hashCode(), and toString(). Because all other classes inherit from Object, directly or indirectly, all objects have these methods by definition. Conversely, Python was not initially built on OOP principles and is dynamically typed. Yet, any language needs cross-cutting features on unrelated objects. In Python, these are specially-named methods: methods that the runtime interprets in a certain way but that you need to know about. You can call them magic methods. The documentation is pretty exhaustive, but it needs examples for beginners. The goal of this post is to list most of these methods and provide examples so that I can remember them. I've divided it into two parts to make it more digestible.

Lifecycle Methods

Methods in this section are related to the lifecycle of new objects.

object.__new__(cls[, ...])

The __new__() method is static, though it doesn't need to be explicitly marked as such. The method must return a new object instance of type cls; then, the runtime will call the __init__() method (see below) on the new instance. __new__() is meant to customize instance creation of subclasses of immutable classes.

Python

    class FooStr(str):                                   #1
        def __new__(cls, value):
            return super().__new__(cls, f'{value}Foo')   #2

    print(FooStr('Hello'))                               #3

1. Inherit from str.
2. Create a new str instance, whose value is the value passed to the constructor, suffixed with Foo.
3. Print HelloFoo.

object.__init__(self[, ...])

__init__() is the regular initialization method, which you probably know if you've read any basic Python tutorial. The most significant difference with Java is that the superclass __init__() method is never called implicitly. One can only wonder how many bugs were introduced because somebody forgot to call the superclass method. __init__() differs from a constructor in that the object is already created.

Python

    class Foo:
        def __init__(self, a, b, c):   #1
            self.a = a                 #2
            self.b = b                 #2
            self.c = c                 #2

    foo = Foo('one', 'two', 'three')
    print(f'a={foo.a}, b={foo.b}, c={foo.c}')   #3

1. The first parameter is the instance itself.
2. Initialize the instance.
3. Print a=one, b=two, c=three.

object.__del__(self)

If __init__() is akin to an initializer, then __del__() is its finalizer. As in Java, finalizers are unreliable, e.g., there's no guarantee that the interpreter finalizes instances when it shuts down.

Representation Methods

Python offers two main ways to represent objects: one "official" for debugging purposes, and the other "informal." You can use the former to reconstruct the object. The official representation is expressed via the object.__repr__(self) method. The documentation states that the representation must be "information-rich and unambiguous."

Python

    class Foo:
        def __init__(self, a, b, c):
            self.a = a
            self.b = b
            self.c = c

        def __repr__(self):
            return f'Foo(a={self.a}, b={self.b}, c={self.c})'

    foo = Foo('one', 'two', 'three')
    print(foo)   #1

1. Print Foo(a=one, b=two, c=three).

My implementation returns a constructor-like string, though that exact format is not required. Yet, you can reconstruct the object with the information displayed.

The object.__str__(self) method handles the informal representation. As its name implies, it must return a string. The default implementation calls __repr__().
Aside from the two methods above, the object.__format__(self, format_spec) method returns a string representation of the object. The second argument follows the rules of the Format Specification Mini-Language. Note that the method must return a string. It's a bit involved, so I won't implement it here.

Finally, the object.__bytes__(self) method returns a byte representation of the object.

Python

    from pickle import dumps                                     #1

    class Foo:
        def __init__(self, a, b, c):
            self.a = a
            self.b = b
            self.c = c

        def __repr__(self):
            return f'Foo(a={self.a}, b={self.b}, c={self.c})'

        def __bytes__(self):
            return dumps(self)                                   #2

    foo = Foo('one', 'two', 'three')
    print(bytes(foo))                                            #3

1. Use the pickle serialization library.
2. Delegate to the dumps() method.
3. Print the byte representation of foo.

Comparison Methods

Let's start with the similarities with Java: Python has two methods, object.__eq__(self, other) and object.__hash__(self), that work in the same way. If you define __eq__() for a class, you must define __hash__() as well. Contrary to Java, if you don't define the former, you must not define the latter.

Python

    class Foo:
        def __init__(self, a, b):
            self.a = a
            self.b = b

        def __eq__(self, other):
            if not isinstance(other, Foo):                    #1
                return False
            return self.a == other.a and self.b == other.b    #2

        def __hash__(self):
            return hash(self.a + self.b)                      #3

    foo1 = Foo('one', 'two')
    foo2 = Foo('one', 'two')
    foo3 = Foo('un', 'deux')

    print(hash(foo1))
    print(hash(foo2))
    print(hash(foo3))

    print(foo1 == foo2)   #4
    print(foo2 == foo3)   #5

1. Objects that are not of the same type are not equal by definition.
2. Compare the equality of the attributes.
3. The hash consists of the addition of the two attributes.
4. Print True.
5. Print False.

As in Java, __eq__() and __hash__() have plenty of gotchas. Some of them are the same, others not. I won't paraphrase the documentation; have a look at it.

Other comparison methods are pretty self-explanatory:

    Method                         Operator
    object.__lt__(self, other)     <
    object.__le__(self, other)     <=
    object.__ge__(self, other)     >=
    object.__ne__(self, other)     !=

Python

    class Foo:
        def __init__(self, a):
            self.a = a

        def __ge__(self, other):
            return self.a >= other.a   #1

        def __le__(self, other):
            return self.a <= other.a   #1

    foo1 = Foo(1)
    foo2 = Foo(2)

    print(foo1 >= foo1)   #2
    print(foo1 >= foo2)   #3
    print(foo1 <= foo1)   #4
    print(foo2 <= foo2)   #5

1. Compare the single attribute.
2. Print True.
3. Print False.
4. Print True.
5. Print True.

Note that comparison methods may return something other than a boolean. In this case, Python will transform the value into a boolean using the bool() function. I advise you not to rely on this implicit conversion.

Attribute Access Methods

As seen above, Python allows accessing an object's attributes via the dot notation. If the attribute doesn't exist, Python complains: 'Foo' object has no attribute 'a'. However, it's possible to define synthetic accessors on a class via the object.__getattr__(self, name) and object.__setattr__(self, name, value) methods. The rule for __getattr__() is that it acts as a fallback: if the attribute isn't found through the normal lookup, Python calls the method.

Python

    class Foo:
        def __init__(self, a):
            self.a = a

        def __getattr__(self, attr):
            if attr == 'a':
                return 'getattr a'   #1
            if attr == 'b':
                return 'getattr b'   #2

    foo = Foo('a')

    print(foo.a)   #3
    print(foo.b)   #4
    print(foo.c)   #5

1. Return the string if the requested attribute is a.
2. Return the string if the requested attribute is b.
3. Print a.
4. Print getattr b.
5. Print None.

For added fun, Python also offers the object.__getattribute__(self, name) method.
The difference is that it's called whether the attribute exists or not, effectively shadowing it.

Python

    class Foo:
        def __init__(self, a):
            self.a = a

        def __getattribute__(self, attr):
            if attr == 'a':
                return 'getattr a'   #1
            if attr == 'b':
                return 'getattr b'   #2

    foo = Foo('a')

    print(foo.a)   #3
    print(foo.b)   #4
    print(foo.c)   #5

1. Return the string if the requested attribute is a.
2. Return the string if the requested attribute is b.
3. Print getattr a.
4. Print getattr b.
5. Print None.

The dir() function returns an object's list of attributes and methods. You can set the list using the object.__dir__(self) method. Once you override it, the returned list is entirely up to you: you need to set it explicitly. Note that it's the developer's responsibility to ensure the list contains actual class members.

Python

    class Foo:
        def __init__(self, a):
            self.a = 'a'

        def __dir__(self):      #1
            return ['a', 'foo']

    foo = Foo('one')
    print(dir(foo))             #2

1. Implement the method.
2. Display ['a', 'foo']; Python sorts the list. Note that there's no foo member, though.

Descriptors

Python descriptors are accessor delegates, akin to Kotlin's delegated properties. The idea is to factor out a behavior somewhere so other classes can reuse it. In this way, they are the direct consequence of favoring composition over inheritance. They are available for getters, setters, and finalizers, respectively:

    object.__get__(self, instance, owner=None)
    object.__set__(self, instance, value)
    object.__delete__(self, instance)

Let's implement a lazy descriptor that caches the result of a compute-intensive operation.

Python

    class Lazy:                                                    #1
        def __init__(self):
            self.cache = {}                                        #2

        def __get__(self, obj, objtype=None):
            if obj not in self.cache:
                self.cache[obj] = obj._intensiveComputation()      #3
            return self.cache[obj]

    class Foo:
        lazy = Lazy()                                              #4

        def __init__(self, name):
            self.name = name
            self.count = 0                                         #5

        def _intensiveComputation(self):
            self.count = self.count + 1                            #6
            print(self.count)                                      #7
            return self.name

    foo1 = Foo('foo1')
    foo2 = Foo('foo2')

    print(foo1.lazy)   #8
    print(foo1.lazy)   #8
    print(foo2.lazy)   #9
    print(foo2.lazy)   #9

1. Define the descriptor.
2. Initialize the cache.
3. Call the intensive computation.
4. Declare the descriptor as a class attribute of Foo.
5. Keep a counter to show how many times the computation runs.
6. Increment the counter.
7. Print the counter.
8. Print foo1 twice, but the intensive computation runs only once.
9. Likewise for foo2.

Conclusion

This concludes the first part of Python magic methods. The second part will focus on class, container, and number-related methods.
This is an article from DZone's 2023 Kubernetes in the Enterprise Trend Report.

Cloud-native architecture is a transformative approach to designing and managing applications. This type of architecture embraces the concepts of modularity, scalability, and rapid deployment, making it highly suitable for modern software development. Though the cloud-native ecosystem is vast, Kubernetes stands out as its beating heart. It serves as a container orchestration platform that helps with automatic deployments and the scaling and management of microservices. Some of these features are crucial for building true cloud-native applications. In this article, we explore the world of containers and microservices in Kubernetes-based systems and how these technologies come together to enable developers in building, deploying, and managing cloud-native applications at scale.

The Role of Containers and Microservices in Cloud-Native Environments

Containers and microservices play pivotal roles in making the principles of cloud-native architecture a reality.

Figure 1: A typical relationship between containers and microservices

Here are a few ways in which containers and microservices turn cloud-native architectures into a reality:
- Containers encapsulate applications and their dependencies. This encourages the principle of modularity and results in rapid development, testing, and deployment of application components. Containers also share the host OS, resulting in reduced overhead and a more efficient use of resources.
- Since containers provide isolation for applications, they are ideal for deploying microservices. Microservices help in breaking down large monolithic applications into smaller, manageable services.
- With microservices and containers, we can scale individual components separately. This improves the overall fault tolerance and resilience of the application as a whole.

Despite their usefulness, containers and microservices also come with their own set of challenges:
- Managing many containers and microservices can become overly complex and create a strain on operational resources.
- Monitoring and debugging numerous microservices can be daunting in the absence of a proper monitoring solution.
- Networking and communication between multiple services running on containers is challenging. It is imperative to ensure a secure and reliable network between the various containers.

How Does Kubernetes Make Cloud Native Possible?

As per a survey by CNCF, more and more customers are leveraging Kubernetes as the core technology for building cloud-native solutions. Kubernetes provides several key features that utilize the core principles of cloud-native architecture: automatic scaling, self-healing, service discovery, and security.

Figure 2: Kubernetes managing multiple containers within the cluster

Automatic Scaling

A standout feature of Kubernetes is its ability to automatically scale applications based on demand. This feature fits very well with the cloud-native goals of elasticity and scalability. As a user, we can define scaling policies for our applications in Kubernetes. Then, Kubernetes adjusts the number of containers and Pods to match any workload fluctuations that may arise over time, thereby ensuring effective resource utilization and cost savings.

Self-Healing

Resilience and fault tolerance are key properties of a cloud-native setup. Kubernetes excels in this area by continuously monitoring the health of containers and Pods.
In case of any Pod failures, Kubernetes takes remedial actions to ensure the desired state is maintained. It means that Kubernetes can automatically restart containers, reschedule them to healthy nodes, and even replace failed nodes when needed.

Service Discovery

Service discovery is an essential feature of a microservices-based cloud-native environment. Kubernetes offers a built-in service discovery mechanism. Using this mechanism, we can create services and assign labels to them, making it easier for other components to locate and communicate with them. This simplifies the complex task of managing communication between microservices running on containers.

Security

Security is paramount in cloud-native systems, and Kubernetes provides robust mechanisms to ensure it. Kubernetes allows for fine-grained access control through role-based access control (RBAC). This certifies that only authorized users can access the cluster. In fact, Kubernetes also supports the integration of security scanning and monitoring tools to detect vulnerabilities at an early stage.

Advantages of Cloud-Native Architecture

Cloud-native architecture is extremely important for modern organizations due to the evolving demands of software development. In this era of digital transformation, cloud-native architecture acts as a critical enabler by addressing the key requirements of modern software development.

The first major advantage is high availability. Today's world operates 24/7, and it is essential for cloud-native systems to be highly available by distributing components across multiple servers or regions in order to minimize downtime and ensure uninterrupted service delivery.

The second advantage is scalability to support fluctuating workloads based on user demand. Cloud-native applications deployed on Kubernetes are inherently elastic, thereby allowing organizations to scale resources up or down dynamically.

Lastly, low latency is a must-have feature for delivering responsive user experiences. Otherwise, there can be a tremendous loss of revenue. Cloud-native design principles using microservices and containers deployed on Kubernetes enable the efficient use of resources to reduce latency.

Architecture Trends in Cloud Native and Kubernetes

Cloud-native architecture with Kubernetes is an ever-evolving area, and several key trends are shaping the way we build and deploy software. Let's review a few important trends to watch out for.

The use of Kubernetes operators is gaining prominence for stateful applications. Operators extend the capabilities of Kubernetes by automating complex application-specific tasks, effectively turning Kubernetes into an application platform. These operators are great for codifying operational knowledge, creating the path to automated deployment, scaling, and management of stateful applications such as databases. In other words, Kubernetes operators simplify the process of running applications on Kubernetes to a great extent.

Another significant trend is the rise of serverless computing on Kubernetes due to the growth of platforms like Knative. Over the years, Knative has become one of the most popular ways to build serverless applications on Kubernetes. With this approach, organizations can run event-driven and serverless workloads alongside containerized applications. This is great for optimizing resource utilization and cost efficiency. Knative's auto-scaling capabilities make it a powerful addition to Kubernetes.
Lastly, GitOps and Infrastructure as Code (IaC) have emerged as foundational practices for provisioning and managing cloud-native systems on Kubernetes. GitOps leverages version control and declarative configurations to automate infrastructure deployment and updates. IaC extends this approach by treating infrastructure as code.

Best Practices for Building Kubernetes Cloud-Native Architecture

When building a Kubernetes-based cloud-native system, it's great to follow some best practices:
- Observability is a key practice that must be followed. Implementing comprehensive monitoring, logging, and tracing solutions gives us real-time visibility into our cluster's performance and the applications running on it. This data is essential for troubleshooting, optimizing resource utilization, and ensuring high availability.
- Resource management is another critical practice that should be treated with importance. Setting resource limits for containers helps prevent resource contention and ensures stable performance for all the applications deployed on a Kubernetes cluster. Failure to manage resources properly can lead to downtime and cascading issues.
- Configuring proper security policies is equally vital as a best practice. Kubernetes offers robust security features like role-based access control (RBAC) and Pod Security Admission that should be tailored to your organization's needs. Implementing these policies helps protect against unauthorized access and potential vulnerabilities.
- Integrating a CI/CD pipeline into your Kubernetes cluster streamlines the development and deployment process. This promotes automation and consistency in deployments along with the ability to support rapid application updates.

Conclusion

This article has highlighted the significant role of Kubernetes in shaping modern cloud-native architecture. We've explored key elements such as observability, resource management, security policies, and CI/CD integration as essential building blocks for success in building a cloud-native system. With its vast array of features, Kubernetes acts as the catalyst, providing the orchestration and automation needed to meet the demands of dynamic, scalable, and resilient cloud-native applications.

As readers, it's crucial to recognize Kubernetes as the linchpin in achieving these objectives. Furthermore, the takeaway is to remain curious about exploring emerging trends within this space. The cloud-native landscape continues to evolve rapidly, and staying informed and adaptable will be key to harnessing the full potential of Kubernetes.

Additional Reading:
- CNCF Annual Survey 2021
- CNCF Blog: "Why Google Donated Knative to the CNCF" by Scott Carey
- Getting Started With Kubernetes Refcard by Alan Hohn
- "The Beginner's Guide to the CNCF Landscape" by Ayrat Khayretdinov
This is an article from DZone's 2023 Kubernetes in the Enterprise Trend Report.

Kubernetes celebrates its ninth year since its initial release this year, a significant milestone for a project that has revolutionized the container orchestration space. Over that time span, Kubernetes has become the de facto standard for managing containers at scale. Its influence can be found far and wide, evident from various architectural and infrastructure design patterns for many cloud-native applications.

As one of the most popular and successful open-source projects in the infrastructure space, Kubernetes offers a ton of choices for users to provision, deploy, and manage Kubernetes clusters and the applications that run on them. Today, users can quickly spin up Kubernetes clusters from managed providers or go with an open-source solution to self-manage them. The sheer number of these options can be daunting for engineering teams deciding what makes the most sense for them. In this Trend Report article, we will take a look at the current state of the managed Kubernetes offerings as well as options for self-managed clusters. With each option, we will discuss the pros and cons as well as recommendations for your team.

Overview of Managed Kubernetes Platforms

Managed Kubernetes offerings from the hyperscalers (e.g., Google Kubernetes Engine, Amazon Elastic Kubernetes Service, Azure Kubernetes Service) remain one of the most popular options for administering Kubernetes. The 2019 survey of the Kubernetes landscape from the Cloud Native Computing Foundation (CNCF) showed that these services from each of the cloud providers make up three of the top five options that enterprises use to manage containers. More recent findings from CloudZero illustrating increased cloud and Kubernetes adoption further solidify the popularity of managed Kubernetes services.

All of the managed Kubernetes platforms take care of the control plane components such as kube-apiserver, etcd, kube-scheduler, and kube-controller-manager. However, the degree to which other aspects of operating and maintaining a Kubernetes cluster are managed differs for each cloud vendor. For example, Google offers a more fully managed service with GKE Autopilot, where Google manages the cluster's underlying compute, creating a serverless-like experience for the end user. They also provide the standard mode where Google takes care of patching and upgrading of the nodes along with bundling autoscaler, load balancer controller, and observability components, but the user has more control over the infrastructure components. On the other end, Amazon's offering is more of a hands-off, opt-in approach where most of the operational burden is offloaded to the end user. Some critical components like the CSI driver, CoreDNS, VPC CNI, and kube-proxy are offered as managed add-ons but not installed by default.

Figure 1: Managed Kubernetes platform comparison

By offloading much of the maintenance and operational tasks to the cloud provider, managed Kubernetes platforms can offer users a lower total cost of ownership (especially when using something like a per-Pod billing model with GKE Autopilot) and increased development velocity. Also, by leaning into cloud providers' expertise, teams can reduce the risk of incorrectly setting Kubernetes security settings or fault-tolerance that could lead to costly outages.
Since Kubernetes is so complex and notorious for its steep learning curve, using a managed platform to start out can be a great option to fast-track Kubernetes adoption. On the other hand, if your team has specific requirements due to security, compliance, or even operating environment (e.g., bare metal, edge computing, military/medical applications), a managed Kubernetes platform may not fit your needs. Note that even though Google and Amazon have on-prem products (GKE on-prem and EKS Anywhere), the former requires VMware's server virtualization software, and the latter is an open-source, self-managed option. Finally, while Kubernetes lends itself to application portability, there is still some degree of vendor lock-in by going with a managed option that you should be aware of.

Overview of Self-Managed Kubernetes Options

Kubernetes also has a robust ecosystem of tools for self-managing Kubernetes clusters. First, there's the manual route of installing "Kubernetes the Hard Way," which walks through all the steps needed for bootstrapping a cluster step by step. In practice, most teams use a tool that abstracts some of the setup such as kops, kubeadm, kubespray, or kubicorn. While each tool behaves slightly differently, they all automate the infrastructure provisioning, support maintenance functions like upgrades or scaling, and integrate with cloud providers and/or bare metal.

The biggest advantage of going the self-managed route is that you have complete control over how you want your Kubernetes cluster to work. You can opt to run a small cluster without a highly available control plane for less critical workloads and save on cost. You can customize the CNI, storage, node types, and even mix and match across multiple cloud providers if need be. Finally, self-managed options are more prevalent in non-cloud environments, namely edge or on-prem. On the other hand, operating a self-managed cluster can be a huge burden for the infrastructure team. Even though open-source tools have come a long way to lower the burden, it still requires a non-negligible amount of time and expertise to justify the cost against going with a managed option.

PROS AND CONS OF MANAGED vs. SELF-MANAGED KUBERNETES

Managed
- Pros: Lower TCO; Increased development velocity; Lean on security best practices; Inherit cloud provider's expertise; Less maintenance burden
- Cons: May not be available on-prem or on the edge; Not open to modification; Requires support from service provider in case of outage

Self-managed
- Pros: Fully customizable to satisfy compliance requirements; Can use latest features; Flexible deployment schemes
- Cons: Requires significant Kubernetes knowledge and expertise; Maintenance burden can be high

Table 1

Considerations for Managed vs. Self-Managed Kubernetes

For most organizations running predominantly on a single cloud, going with the managed offering makes the most sense. While there is a cost associated with opting for the managed service, it is a nominal fee ($0.10 per hour per cluster) compared to the engineer hours that may be required for maintaining those clusters. The rest of the cost is billed the same way as using VMs, so cost is usually a non-factor. Also, note that there will still be a non-negligible amount of work to do if you go with a vendor who provides a less-managed offering.

There are a few use cases where going with a self-managed Kubernetes option makes sense. If you need to run on-prem or on the edge, you may decide that the on-prem offerings from the cloud providers may not fit your needs.
If you are running on-prem, this likely means that either cost was a huge factor or there is a tangible need to be on-prem (i.e., applications must run closer to where they are deployed). In these scenarios, you likely already have an infrastructure team with significant Kubernetes experience or the luxury of growing that team in-house.

Even if you are not running on-prem, you may consider going with a self-managed option if you are running on multiple clouds or are a SaaS provider that must offer a flexible Kubernetes-as-a-Service type of product. While you can run different variants of Kubernetes across clouds, it may be desirable to use a solution like Cluster API to manage multiple Kubernetes clusters in a consistent manner. Likewise, if you are offering Kubernetes as a Service, then you may need to support more than the managed Kubernetes offerings.

Also, as mentioned before, compliance may play a big role in the decision. You may need to support an application in regions where major US hyperscalers do not operate (e.g., China) or where a more locked-down version is required (e.g., military, banking, medical).

Finally, you may work in industries where there is a need for either cutting-edge support or massive modifications to fit the application's needs. For example, for some financial institutions, there may be a need for confidential computing. While the major cloud providers have some level of support for it at the time of writing, it is still limited.

Conclusion

Managing and operating Kubernetes at scale is no easy task. Over the years, the community has continually innovated and produced numerous solutions to make that process easier. On one hand, we have massive support from major hyperscalers for production-ready, managed Kubernetes services. Also, we have more open-source tools to self-manage Kubernetes if need be. In this article, we went through the pros and cons of each approach, breaking down the state of each option along the way. While most users will benefit from going with a managed Kubernetes offering, opting for a self-managed option is not only valid but sometimes necessary. Make sure your team either has the expertise or the resources required to build it in-house before going with the self-managed option.

Additional Reading:
- CNCF Survey 2019: Deployments Are Getting Larger as Cloud Native Adoption Becomes Mainstream
- "101+ Cloud Computing Statistics That Will Blow Your Mind (Updated 2023)" by Cody Slingerland, CloudZero
Backpressure is a critical concept in software development, particularly when working with data streams. It refers to the control mechanism that maintains the balance between data production and consumption rates. This article will explore the notion of backpressure, its importance, real-world examples, and how to implement it using Java code.

Understanding Backpressure

Backpressure is a technique employed in systems involving data streaming where the data production rate may surpass the consumption rate. This imbalance can lead to data loss or system crashes due to resource exhaustion. Backpressure allows the consumer to signal the producer when it's ready for more data, preventing the consumer from being overwhelmed.

The Importance of Backpressure

In systems without backpressure management, consumers may struggle to handle the influx of data, leading to slow processing, memory issues, and even crashes. By implementing backpressure, developers can ensure that their applications remain stable, responsive, and efficient under heavy loads.

Real-World Examples

Video Streaming Services

Platforms like Netflix, YouTube, and Hulu utilize backpressure to deliver high-quality video content while ensuring the user's device and network can handle the incoming data stream. Adaptive Bitrate Streaming (ABS) dynamically adjusts the video stream quality based on the user's network conditions and device capabilities, mitigating potential issues due to overwhelming data.

Traffic Management

Backpressure is analogous to traffic management on a highway. If too many cars enter the highway at once, congestion occurs, leading to slower speeds and increased travel times. Traffic signals or ramp meters can be used to control the flow of vehicles onto the highway, reducing congestion and maintaining optimal speeds.

Implementing Backpressure in Java

Java provides a built-in mechanism for handling backpressure through the Flow API, introduced in Java 9. The Flow API supports the Reactive Streams specification, allowing developers to create systems that can handle backpressure effectively.
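The whole contract fits in four nested interfaces of java.util.concurrent.Flow. The following is an abridged sketch of their shape (paraphrased from the JDK, with explanatory comments added) to make the request(n) demand signal explicit before the complete example that follows:

Java

    // Abridged shape of java.util.concurrent.Flow (JDK 9+); see the JDK for the authoritative definitions.
    public final class Flow {
        public interface Publisher<T> {
            void subscribe(Subscriber<? super T> subscriber);
        }
        public interface Subscriber<T> {
            void onSubscribe(Subscription subscription); // receives the handle used to signal demand
            void onNext(T item);                         // called at most as many times as requested
            void onError(Throwable throwable);
            void onComplete();
        }
        public interface Subscription {
            void request(long n);                        // the backpressure signal: "I can take n more items"
            void cancel();
        }
        public interface Processor<T, R> extends Subscriber<T>, Publisher<R> {}
    }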
Here's an example of a simple producer-consumer system using Java's Flow API:

Java

    import java.util.concurrent.*;
    import java.util.concurrent.Flow.*;

    public class BackpressureExample {

        public static void main(String[] args) throws InterruptedException {
            // Create a custom publisher
            CustomPublisher<Integer> publisher = new CustomPublisher<>();

            // Create a subscriber and register it with the publisher
            Subscriber<Integer> subscriber = new Subscriber<>() {
                private Subscription subscription;
                private ExecutorService executorService = Executors.newFixedThreadPool(4);

                @Override
                public void onSubscribe(Subscription subscription) {
                    this.subscription = subscription;
                    subscription.request(1);
                }

                @Override
                public void onNext(Integer item) {
                    System.out.println("Received: " + item);
                    executorService.submit(() -> {
                        try {
                            Thread.sleep(1000); // Simulate slow processing
                            System.out.println("Processed: " + item);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                        subscription.request(1);
                    });
                }

                @Override
                public void onError(Throwable throwable) {
                    System.err.println("Error: " + throwable.getMessage());
                    executorService.shutdown();
                }

                @Override
                public void onComplete() {
                    System.out.println("Completed");
                    executorService.shutdown();
                }
            };

            publisher.subscribe(subscriber);

            // Publish items
            for (int i = 1; i <= 10; i++) {
                publisher.publish(i);
            }

            // Wait for subscriber to finish processing and close the publisher
            Thread.sleep(15000);
            publisher.close();
        }
    }

Java

    class CustomPublisher<T> implements Publisher<T> {

        private final SubmissionPublisher<T> submissionPublisher;

        public CustomPublisher() {
            this.submissionPublisher = new SubmissionPublisher<>();
        }

        @Override
        public void subscribe(Subscriber<? super T> subscriber) {
            submissionPublisher.subscribe(subscriber);
        }

        public void publish(T item) {
            submissionPublisher.submit(item);
        }

        public void close() {
            submissionPublisher.close();
        }
    }

In this example, we create a CustomPublisher class that wraps the built-in SubmissionPublisher. The CustomPublisher can be further customized to generate data based on specific business logic or external sources.

The Subscriber implementation has been modified to process the received items in parallel using an ExecutorService. This allows the subscriber to handle higher volumes of data more efficiently. Note that the onComplete() method now shuts down the executorService to ensure proper cleanup. Error handling is also improved in the onError() method. In this case, if an error occurs, the executorService is shut down to release resources.

Conclusion

Backpressure is a vital concept for managing data streaming systems, ensuring that consumers can handle incoming data without being overwhelmed. By understanding and implementing backpressure techniques, developers can create more stable, efficient, and reliable applications. Java's Flow API provides an excellent foundation for building backpressure-aware systems, allowing developers to harness the full potential of reactive programming.
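As a brief postscript to the example above: backpressure can also be enforced on the producer side. The sketch below is a minimal, hedged illustration of SubmissionPublisher's bounded per-subscriber buffer together with its offer() drop handler; the class name, buffer size, and timings are illustrative choices rather than part of the original example.

Java

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.SubmissionPublisher;
    import java.util.concurrent.TimeUnit;

    public class BoundedPublisherSketch {
        public static void main(String[] args) throws InterruptedException {
            // Small per-subscriber buffer (16 items): submit() would block when it fills up,
            // while offer() lets the producer decide what to do about a slow subscriber.
            SubmissionPublisher<Integer> publisher =
                    new SubmissionPublisher<>(ForkJoinPool.commonPool(), 16);

            // Deliberately slow consumer
            publisher.consume(item -> {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            for (int i = 0; i < 100; i++) {
                publisher.offer(i, 10, TimeUnit.MILLISECONDS,
                        (subscriber, item) -> {
                            System.out.println("Dropped " + item + " (subscriber too slow)");
                            return false; // do not retry; drop the item
                        });
            }

            publisher.close();
            Thread.sleep(2000); // give the common-pool thread time to drain what was accepted
        }
    }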
When doing unit tests, you have probably found yourself in the situation of having to create objects over and over again. To do this, you must call the class constructor with the corresponding parameters. So far, nothing unusual, but most probably, there have been times when the values of some of these fields were irrelevant for testing or when you had to create nested "dummy" objects simply because they were mandatory in the constructor. All this has probably generated some frustration at some point and made you question whether you were doing it right or not; if that is really the way to do unit tests, then it would not be worth the effort.

Typically, a test must have a clear objective. Therefore, it is expected that within the SUT (system under test) there are fields that really are the object of the test and, on the other hand, others that are irrelevant. Let's take an example. Let's suppose that we have the class "Person" with the fields Name, Email, and Age. On the other hand, we want to do the unit tests of a service that, receiving a Person object, tells us if this one can travel for free by bus or not. We know that this calculation only depends on the age. Children under 14 years old travel for free. Therefore, in this case, the Name and Email fields are irrelevant.

In this example, creating Person objects would not involve too much effort, but let's suppose that the fields of the Person class grow or nested objects start appearing: Address, Relatives (a list of Person objects), a list of phones, etc. Now, there are several issues to consider:
- It is more laborious to create the objects.
- What happens when the constructor or the fields of the class change?
- When there are lists of objects, how many objects should I create?
- What values should I assign to the fields that do not influence the test? Is it good if the values are always the same, without any variability?

Two well-known design patterns are usually used to solve this situation: Object Mother and Builder. In both cases, the idea is to have "helpers" that facilitate the creation of objects with the characteristics we need. Both approaches are widespread, are adequate, and favor the maintainability of the tests. However, they still do not resolve some issues:
- When changing the constructors, the code will stop compiling even if the changed fields do not affect the tests.
- When new fields appear, we must update the code that generates the objects for testing.
- Generating nested objects is still laborious.
- Mandatory but unused fields are hard-coded and assigned by default, so the tests have no variability.

One of the Java libraries that can solve these problems is "EasyRandom." Next, we will see details of its operation.

What is EasyRandom?

EasyRandom is a Java library that facilitates the generation of random data for unit and integration testing. The idea behind EasyRandom is to provide a simple way to create objects with random values that can be used in tests. Instead of manually defining values for each class attribute in each test, EasyRandom automates this process, automatically generating random data for each attribute. This library handles primitive data types, custom classes, collections, and other types of objects. It can also be configured to respect specific rules and data generation restrictions, making it quite flexible.
Here is a basic example of how EasyRandom can be used to generate a random object:

Java

    public class EasyRandomExample {
        public static void main(String[] args) {
            EasyRandom easyRandom = new EasyRandom();
            Person randomPerson = easyRandom.nextObject(Person.class);
            System.out.println(randomPerson);
        }
    }

In this example, Person is a dummy class, and easyRandom.nextObject(Person.class) generates an instance of Person with random values for its attributes. As can be seen, the generation of these objects does not depend on the class constructor, so the test code will continue to compile even if there are changes in the SUT. This solves one of the biggest problems in maintaining an automatic test suite.

Why Is It Interesting?

Using the EasyRandom library for testing your applications has several advantages:
- Simplified random data generation: It automates generating random data for your objects, saving you from writing repetitive code for each test.
- Facilitates unit and integration testing: By automatically generating test objects, you can focus on testing the code's behavior instead of worrying about manually creating test data.
- Data customization: Although it generates random data by default, EasyRandom also allows you to customize certain fields or attributes if necessary, allowing you to adjust the generation according to your needs.
- Reduced human error: Manual generation of test data can lead to errors, especially when dealing with many fields and combinations. EasyRandom helps minimize human errors by generating consistent random data.
- Simplified maintenance: If your class requirements change (new fields, types, etc.), you do not need to manually update your test data, as EasyRandom will generate them automatically.
- Improved readability: Using EasyRandom makes your tests cleaner and more readable since you do not need to define test values explicitly in each case.
- Faster test development: By reducing the time spent creating test objects, you can develop tests faster and more effectively.
- Ease of use: Adding this library to our Java projects is practically immediate, and it is extremely easy to use.

Where Can You Apply It?

This library will allow us to simplify the creation of objects for our unit tests, but it can also be of great help when we need to generate a set of test data. This can be achieved by using the DTOs of our application and generating random objects to later dump them into a database or file.

Where it is not recommended: this library may not be worthwhile in projects where object generation is not complex or where we need precise control over all the fields of the objects involved in the test.

How To Use EasyRandom

Let's see EasyRandom in action with a real example, the environment used, and the prerequisites.

Prerequisites
- Java 8+
- Maven or Gradle

Initial Setup

Inside our project, we must add a new dependency. The pom.xml file would look like this:

XML

    <dependency>
        <groupId>org.jeasy</groupId>
        <artifactId>easy-random-core</artifactId>
        <version>5.0.0</version>
    </dependency>

Basic Use Case

The most basic use case has already been seen before. In that example, values are assigned to the fields of the Person class in a completely random way. Obviously, when testing, we will need to have control over some specific fields. Let's see this with an example. Recall that EasyRandom can also be used with primitive types. Therefore, our example could look like this.
Java

    public class PersonServiceTest {

        private final EasyRandom easyRandom = new EasyRandom();
        private final PersonService personService = new PersonService();

        @Test
        public void testIsAdult() {
            Person adultPerson = easyRandom.nextObject(Person.class);
            adultPerson.setAge(18 + easyRandom.nextInt(80));

            assertTrue(personService.isAdult(adultPerson));
        }

        @Test
        public void testIsNotAdult() {
            Person minorPerson = easyRandom.nextObject(Person.class);
            minorPerson.setAge(easyRandom.nextInt(17));

            assertFalse(personService.isAdult(minorPerson));
        }
    }

As we can see, this way of generating test objects protects us from changes in the "Person" class and allows us to focus only on the field we are interested in. We can also use this library to generate lists of random objects.

Java

    @Test
    void generateObjectsList() {
        EasyRandom generator = new EasyRandom();

        // Generate a list of 5 Persons
        List<Person> persons = generator.objects(Person.class, 5)
                .collect(Collectors.toList());

        assertEquals(5, persons.size());
    }

This test, in itself, is not very useful. It simply demonstrates the ability to generate lists, which could be used to dump data into a database.

Generation of Parameterized Data

Let's see now how to use this library to have more precise control over the generation of the object itself. This can be done through parameterization.

Set the value of a field. Let's imagine that, for our tests, we want to keep certain values constant (an ID, a name, an address, etc.). To achieve this, we would have to configure the initialization of objects using "EasyRandomParameters" and locate the parameters by their name. Let's see how:

Java

    EasyRandomParameters params = new EasyRandomParameters();

    // Assign a value to the field by means of a lambda function
    params.randomize(named("age"), () -> 5);

    EasyRandom easyRandom = new EasyRandom(params);

    // The object will always have an age of 5
    Person person = easyRandom.nextObject(Person.class);

Of course, the same could be done with collections or complex objects. Let's suppose that our Person class contains an Address class inside and that, in addition, we want to generate a list of two persons. Let's see a more complete example:

Java

    EasyRandomParameters parameters = new EasyRandomParameters()
            .randomize(Address.class, () -> new Address("Random St.", "Random City"));

    EasyRandom easyRandom = new EasyRandom(parameters);

    return Arrays.asList(
            easyRandom.nextObject(Person.class),
            easyRandom.nextObject(Person.class)
    );

Suppose now that a person can have several addresses. This would mean the "Address" field will be a list inside the "Person" class. With this library, we can also make our collections have a variable size. This is something that we can also do using parameters.

Java

    EasyRandomParameters parameters = new EasyRandomParameters()
            .randomize(Address.class, () -> new Address("Random St.", "Random City"))
            .collectionSizeRange(2, 10);

    EasyRandom easyRandom = new EasyRandom(parameters);

    // The object will have a list of between 2 and 10 addresses
    Person person = easyRandom.nextObject(Person.class);

Setting Pseudo-Random Fields

As we have seen, setting values is quite simple and straightforward. But what if we want to control the randomness of the data? We want to generate random names of people, but still names and not just strings of unconnected characters. This same need is perhaps clearer when we are interested in having randomness in fields such as email, phone number, ID number, card number, city name, etc.
For this purpose, it is useful to use other data generation libraries. One of the best-known is Faker. Combining both libraries, we could get code like this:

Java

    EasyRandomParameters params = new EasyRandomParameters();

    // Generate a number between 0 and 17
    params.randomize(named("age"), () -> Faker.instance().number().numberBetween(0, 17));

    // Generate random "real" names
    params.randomize(named("name"), () -> Faker.instance().name().fullName());

    EasyRandom easyRandom = new EasyRandom(params);
    Person person = easyRandom.nextObject(Person.class);

There are a multitude of parameters that allow us to control the generation of objects.

Closing

EasyRandom is a library that should be part of your backpack if you develop unit tests, as it helps keep them maintainable. In addition, and although it may seem strange, establishing some controlled randomness in tests may not be a bad thing. In a way, it is a way to generate new test cases automatically and will increase the probability of finding bugs in code.
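One practical follow-up to that last point: if a randomly generated value does expose a bug, you will want to replay the exact same data. EasyRandomParameters accepts a fixed seed for that purpose. The snippet below is a minimal, hedged sketch; the seed and ranges are illustrative, and it assumes the same Person class used in the examples above.

Java

    EasyRandomParameters parameters = new EasyRandomParameters()
            .seed(42L)                    // same seed => same sequence of "random" objects on every run
            .stringLengthRange(5, 20)     // keep generated strings within a readable range
            .collectionSizeRange(1, 3);   // keep nested collections small

    EasyRandom easyRandom = new EasyRandom(parameters);

    // Re-running the test with the same seed reproduces the exact same data
    Person person = easyRandom.nextObject(Person.class);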
Building complex container-based architectures is not very different from programming in terms of applying design best practices and principles. The goal of this article is to present three popular extensibility architectural patterns from a developer's perspective using well-known programming principles. Let's start with the Single Responsibility Principle. According to R. Martin, "A class should have only one reason to change." But classes are abstractions used to simplify real-world problems and represent software components. Hence, a component should have only one reason to change over time. Software services and microservices in particular are also components (runtime components) and should have only one reason to change. Microservices are supposed to be a single deployable unit, meaning they are deployed independently of other components and can have as many instances as needed. But is that always true? Are microservices always deployed as a single unit? In Kubernetes, the embodiment of a microservice is a Pod. A Pod is defined as a group of containers that share resources like file systems, kernel namespaces, and an IP address. The Pod is the atomic unit of scheduling in a Kubernetes cluster and each Pod is meant to run a single instance of a given application. According to the documentation, "Pods are designed to support multiple cooperating processes (as containers) that form a cohesive unit of service. The containers in a Pod are automatically co-located and co-scheduled on the same physical or virtual machine in the cluster. Scaling an application horizontally means replicating Pods. According to the Kubernetes documentation, Pods can be configured using two strategies: Pods that run a single container: The "one-container-per-Pod" model is the most common Kubernetes use case; the Pod is a wrapper around a single container and Kubernetes manages Pods rather than managing the containers directly. Pods that run multiple containers working together: A Pod can encapsulate an application composed of multiple co-located containers that are tightly coupled and need to share resources. These co-located containers form a single cohesive unit of service—for example, one container serving data stored in a shared volume to the public, while a separate sidecar container refreshes or updates those files. The Pod wraps these containers, storage resources, and an ephemeral network identity together as a single unit." The answer is: NO! Microservices are NOT always deployed as a single unit! Next to some popular architectural patterns for the cloud like scalability patterns, deployment and reliability patterns are extensibility architectural patterns. We will have a closer look at the three most popular extensibility patterns for cloud architectures: Sidecar pattern Ambassador pattern Adapter pattern Sidecar Pattern Problem Each deployable service/application has its own "reason to change," or responsibility. However, in addition to its core functionality it needs to do other things called in the software developer terminology "cross-cutting concerns." One example is the collection of performance metrics that need to be sent to a monitoring service. Another one is logging events and sending them to a distributed logging service. I called them cross-cutting concerns, as they do not directly relate to business logic and are needed by multiple services, they basically represent reusable functionality that needs to be part of each deployed unit. 
Solution The solution to that problem is called the sidecar pattern and imposes the creation of an additional container called a sidecar container. Sidecar containers are an extension of the main container following the Open-Closed design principle (opened for extension, closed for modification). They are tightly coupled with the "main" container in terms of deployment as they are deployed as part of the same Pod but are still easy to replace and do not break the single responsibility of the extended container. Furthermore, the achieved modularity allows for isolated testing of business-related functionality and additional helper services like event logging or monitoring. The communication of the two containers is fast and reliable and they share access to the same resources enabling the helper component to provide reusable infrastructure-related services. In addition, it is applicable to many types of services solving the issue with heterogeneity in terms of different technologies used. The upgrade of the sidecar components is also straightforward as it usually means the upgrade of a Docker container version and redeploying using for example the no-down-time Kubernetes strategies. Ambassador Containers Problem Deployed services do not function in isolation. They usually communicate over the network to other services even outside the application or software platform controlled by a single organization. Integrations between components in general imply integration with external APIs and also dealing with failures or unavailability of external systems. A common practice for external systems integration is to define the so-called API Facade, an internal API that hides the complexity of external system APIs. The role of the API Facades is to isolate the external dependencies providing an implementation of the internal API definition and taking care of security and routing if needed. In addition, failures and unavailability of external systems are usually handled using some common patterns like the Retry Pattern, Circuit Breaker Pattern, and sometimes backed by Local Caching. All these technicalities would complicate the main service and appear to be candidates for a helper container. Solution The solution to that problem is called Ambassador Pattern and implies the creation of an additional container called an Ambassador container. Ambassador containers proxy a local connection to the world, they are basically a type of Sidecar container. This composition of containers is powerful, not just because of the separation of concerns and the fact that different teams can easily own the components but it also allows for an easy mocking of external services for local development environments. Adapter Containers Problem There are still many monolith systems planned for migration to more lightweight architectures. Migrations, though, can not happen in one pass, and it is also risky to wait for the rewriting of a whole system for years while also supporting the addition of new features in both versions of the system. Migrations should happen in small pieces publishing separate services and integrating them one by one. That process repeats until the legacy monolith system is gone. So we have a new part of the system supporting new APIs and an old part that still supports old APIs. For example, we might have newly implemented REST services and still have some old SOAP-based services. 
We need something that takes care of exposing the old functionality as if all the services were already migrated, so that client systems can integrate with it.

Solution

The solution to this problem is called the Adapter (or Anti-Corruption) pattern. The adapter container takes care of translating from one communication protocol to another and from one data model to another, while hiding the actual service from the external world. Furthermore, the adapter container can provide two-way communication: if the legacy system needs to communicate with the new services, the adapter can also handle that direction, serving as a kind of ambassador container until the migration is finalized.

In this article, we saw how container composition provides an extensibility mechanism without any change to the main application container. It provides stability and reusability by allowing the composite Pod to be treated like any other simple Pod exposing a single, simple service in a microservice architecture. One might ask: why not use a library and share it across many containers? That is also a solution, but then we face the shared-responsibility problem of introducing coupling between all the services using it. In addition, heterogeneous services would require rewriting the library in every supported language. That also breaks the Single Responsibility Principle, which we would in any case like to keep.
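To make the container composition concrete, here is a minimal sketch using the official Kubernetes Python client: a Pod that pairs a main application container with a log-shipping sidecar, sharing logs through an emptyDir volume. The image names, namespace, and paths are illustrative assumptions, not something prescribed by the patterns themselves.

```python
# Minimal sketch: a Pod composing a main container with a log-shipping sidecar.
# Image names, namespace, and volume paths are illustrative assumptions.
from kubernetes import client, config


def build_sidecar_pod() -> client.V1Pod:
    shared_logs = client.V1Volume(
        name="logs",
        empty_dir=client.V1EmptyDirVolumeSource(),
    )
    main = client.V1Container(
        name="app",
        image="example.com/orders-service:1.0.0",  # hypothetical image
        volume_mounts=[client.V1VolumeMount(name="logs", mount_path="/var/log/app")],
    )
    sidecar = client.V1Container(
        name="log-shipper",
        image="example.com/log-shipper:0.3.0",  # hypothetical helper image
        volume_mounts=[
            client.V1VolumeMount(name="logs", mount_path="/var/log/app", read_only=True)
        ],
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name="orders", labels={"app": "orders"}),
        spec=client.V1PodSpec(containers=[main, sidecar], volumes=[shared_logs]),
    )


if __name__ == "__main__":
    config.load_kube_config()  # assumes a local kubeconfig
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=build_sidecar_pod())
```

An ambassador or adapter container would slot into the same Pod spec in exactly the same way; only the helper container's role changes, from shipping logs to proxying outbound calls or translating protocols.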
Meltdown has definitely taken the internet by storm. The attack seems quite simple and elegant, yet the whitepaper leaves out critical details on the specific vulnerability. It relies mostly on a combination of cache timing side-channels and speculative execution that accesses globally mapped kernel pages. This deep dive assumes some familiarity with CPU architecture and OS kernel behavior; read the background section further down first for a primer on paging and memory protection.

Simplified Version of the Attack

1. Speculatively read from a kernel-mapped (supervisor) page and perform a calculation on the value.
2. Conditionally issue a load to some other non-cached memory location based on the result of that calculation.
3. While the second load will be nuked from the pipeline when the faulting instruction retires, it has already issued a load request out to the L2$ and beyond, ensuring the outstanding memory request still brings the line into the cache hierarchy, like a prefetch.
4. Finally, a separate process issues loads to those same memory locations and measures the time each load takes. A cache hit will be much quicker than a cache miss, which can be used to represent binary 1s (i.e., hits) and binary 0s (i.e., misses).

Parts 1 and 2 have to do with speculative execution of instructions, while parts 3 and 4 enable the microarchitectural state (i.e., in cache or not) to be committed to an architectural state.

Is the Attack Believable?

What is not specified in the Meltdown whitepaper is what specific x86 instruction sequence or CPU state enables the memory access to be speculatively executed AND allows the vector or integer unit to consume that value. In modern Intel CPUs, when a fault such as a page fault happens, the pipeline is not squashed/nuked until the retirement of the offending instruction. However, memory permission checks for page protection, segmentation limits, and canonical address checks are done in what are called the address generation (AGU) and TLB lookup stages, before the load even looks up the L1D$ or goes out to memory. More on this below.

Performing Memory Permission Checks

Intel CPUs implement physically tagged L1D$ and L1I$ caches, which requires translating the linear (virtual) address to a physical address before the L1D$ can determine whether it hits or misses via a tag match. This means the CPU will attempt to find the translation in the Translation Lookaside Buffer (TLB) cache. The TLB caches these translations along with the page table or page directory permissions (the privileges required to access a page are stored along with the physical address translation in the page tables). A TLB entry may contain the following:

Valid
Physical address (minus the page offset)
Read/Write
User/Supervisor
Accessed
Dirty
Memtype

Thus, even for a speculative load, the permissions required to access the page are already known; they can be compared against the Current Privilege Level (CPL) and the privilege required by the op, and the load can therefore be blocked before any arithmetic unit ever consumes the speculatively loaded value. Such permission checks include:

Segment limit checks
Write faults
User/Supervisor faults
Page-not-present faults

This is in fact what many x86 CPUs are designed to do. The load would be rejected until the fault is later handled by software/uCode when the op reaches retirement, and the load data would be zeroed out on the way to the integer/vector units.
In other words, a User/Supervisor protection fault would be treated similarly to a page-not-present fault or another page translation issue, meaning the line read out of the L1D$ should be thrown away immediately and the uOp simply put into a waiting state. Preventing the integer/floating-point units from consuming faulting loads is beneficial not just for preventing such leaks; it can actually boost performance. Loads that fault won't train the prefetchers with bad data, allocate buffers to track memory ordering, or allocate a fill buffer to fetch data from the L2$ on an L1D$ miss. These are limited resources in modern CPUs and shouldn't be consumed by loads that are no good anyway. In fact, if the load missed the TLBs and had to perform a page walk, some Intel CPUs will even kill the page walk in the PMH (Page Miss Handler) if a fault happens during the walk. Page walks perform a lot of pointer chasing and consume precious load cycles, so it makes sense to cancel one whose result will be thrown away later anyway. In addition, the PMH finite state machine can usually handle only a few page walks simultaneously. In other words, aborting the L1D load uOp can actually be a good thing from a performance standpoint. The press narrative that Intel slipped because it was trying to extract as much performance as possible at the cost of security isn't accurate, unless one wants to claim that the basic concepts of speculation and caching are themselves the tradeoff.

The Fix

This doesn't mean the Meltdown vulnerability doesn't exist; there is simply more to the story than what the whitepaper and most news posts discuss. Most posts claim that the mere combination of speculative memory accesses and cache timing attacks creates the vulnerability, and that Intel now has to completely redesign its CPUs or eliminate speculative execution. Meltdown is more of a logic bug that slipped Intel CPU validation than a "fundamental breakdown in modern CPU architecture," as the press is currently saying. The fix would probably be a few gate changes to add the correct rejection logic in the L1D$ pipelines to mask the load hit. Intel CPUs certainly have the information already, as the address generation and TLB lookup stages have to complete before an L1D$ cache hit can be determined anyway. It is not yet known what all the scenarios are that cause the vulnerability. Is it certain CPU designs that missed validation of this architectural behavior? Is it a special x86 instruction sequence that bypasses these checks, or some additional steps to set up the state of the CPU to ensure the load is actually executed? Project Zero believes the attack can only occur if the faulting load hits in the L1D$. Maybe Intel had the logic on the miss path but had a logic bug on the hit path? I wouldn't be surprised if certain Intel OoO designs are immune to Meltdown, as it's a specific CPU design and validation problem rather than a general CPU architecture problem. Unfortunately, x86 has many different flows through the Memory Execution Unit. For example, certain instructions like MOVNTDQA have different memory ordering and flows in the L1D$ than a standard cacheable load. Haswell Transactional Synchronization Extensions and locks add even more complexity to validating correctness. Instruction fetches go through a different path than D-side loads. The validation state space is very large.
Throw in all the bypass networks, and you can see how many different places fault checks need to be validated in. One thing is certain: caching and speculation are not going away anytime soon. If it is a logic bug, it may be a simple fix for future Intel CPUs.

Why Is This Attack Easier Today?

Suppose there is an instruction sequence that enables faulting loads to hit and be consumed, or that I am wrong about the checks above. Why is this surfacing now rather than having been discovered decades ago?

Today's CPUs have much deeper pipelines (cough, Prescott, cough), which provides a wider window between the speculative memory access and the actual nuke/squash of those faulting accesses. Faulting instructions are not handled until the instruction is up for retirement/commit to architectural state, and only at retirement is the pipeline nuked. Long pipelines allow for a large window between the execution of a faulting instruction and its retirement, letting other speculative instructions race ahead.

Larger cache hierarchies and a memory fabric that is slow relative to CPU-only operations such as cache hits and integer ops provide a much larger time difference, in cycles, between a cache hit and a cache miss that goes to memory, enabling more robust cache timing attacks. Today's large multi-core server CPUs, with elaborate mesh fabrics connecting tens or hundreds of cores, exacerbate this.

Performance features for fine-granularity cache control, such as the x86 CLFLUSH and PREFETCHTx instructions, give attackers more control for cache timing attacks.

Wider-issue processors enable parallel integer, floating-point, and memory ops to execute simultaneously. One could place long floating-point operations such as divide or sqrt right before the faulting instruction to keep the core busy while keeping the integer and memory pipelines free for the attack. Since the faulting instruction will not nuke the pipeline until retirement, it has to wait for all earlier instructions in the sequence to be committed, including long-running floating-point ops.

Virtualization and PaaS. Many web-scale companies now run workloads on cloud providers like AWS and Azure. Before the cloud, Fortune 500 companies ran their own trusted applications on their own hardware, so applications from different companies were physically separated, unlike today. While it is unknown whether Meltdown allows a guest OS to break into the hypervisor or host OS, what is known is that many virtualization techniques are more lightweight than full-blown VT-x. For example, multiple apps in Heroku, AWS Elastic Beanstalk, or Azure Web Apps, along with Docker containers, run within the same VM; companies no longer spin up a separate VM for each application. This could allow a rogue application to read kernel memory of that specific VM. Shared resources were not a thing in the '90s, when OoO execution became mainstream with the Pentium Pro/Pentium III.

Finally, the Global and User/Supervisor bits in x86 paging entries let the kernel memory space be mapped into every user process (while being protected from Ring 3 code execution) to reduce pressure on the TLBs and avoid slow context switches into a separate kernel address space. This performance optimization has been in use since the 1990s.

Is This x86 Specific?

First of all, cache timing attacks and speculative execution are not specific to Intel or x86 CPUs. Most modern CPUs implement multi-level caches and heavy speculation, outside of a few embedded microprocessors for your watch or microwave.
The underlying ingredients aren't an Intel-specific or x86-specific problem but properties of general CPU architecture. There are now claims that specific OoO ARM CPUs, such as those in iPhones and other smartphones, exhibit this flaw as well. Out-of-order execution has been around since the Tomasulo algorithm introduced it. Likewise, cache timing attacks have been known for decades, since it has long been understood that data may be loaded into caches when it shouldn't be. However, cache timing attacks have traditionally been used to find the location of kernel memory rather than to actually read it. It's more of a race condition, and the window that makes it exploitable depends on the microarchitecture. Some CPUs have shallower pipelines than others, causing the nuke to happen sooner. Modern desktop/server CPUs like x86 have more elaborate features, from CLFLUSH to PREFETCHTx, that can serve as additional tools to make the attack more robust.

Background on Memory Paging

Since the introduction of paging with the 386 and Windows 3.0, operating systems have used this feature to isolate the memory space of one process from another. A process is mapped to its own virtual address space, which is independent from any other running process's address space. These virtual address spaces are backed by physical memory (pages can also be swapped out to disk, but that's beyond the scope of this post).

For example, let's say Process 1 needs 4KB of memory, so the OS allocates a virtual memory space of 4KB with a byte-addressable range from 0x0 to 0xFFF. This range is backed by physical memory starting at location 0x1000, meaning Process 1's [0x0-0xFFF] is "mounted" at the physical range [0x1000-0x1FFF]. If another running process also needs 4KB, the OS will map a second virtual address space for this Process 2, again with the range 0x0 to 0xFFF. This virtual memory space also needs to be backed by physical memory. Since Process 1 is already using [0x1000-0x1FFF], the OS will allocate the next block of physical memory, [0x2000-0x2FFF], to Process 2. Given this setup, if Process 1 issues a load from linear address 0x0, it will be translated to physical location 0x1000, whereas if Process 2 issues a load from linear address 0x0, it will be translated to physical location 0x2000. Notice how there needs to be a translation; that is the job of the page tables. An analogy in the web world would be how two different Docker containers running on a single host can mount the same /data dir inside the container to two different physical locations on the host machine, /data/node0 and /data/node1.

A range of mapped memory is referred to as a page, and CPU architectures define a page size such as 4KB. Paging allows memory to be fragmented across the physical memory space. In the example above, we assumed a page size of 4KB, so each process mapped only one page. Now, let's say Process 1 performs a malloc() and forces the kernel to map a second 4KB region. Since the next page of physical memory, [0x2000-0x2FFF], is already utilized by Process 2, the OS allocates a free block of physical memory, [0x3000-0x3FFF], to Process 1. (Note: modern OSes use deferred/lazy memory allocation, which means virtual memory may be created before being backed by any physical memory until the page is actually accessed, but that's beyond the scope of this post. See x86 Page Accessed/Dirty Bits for more.)
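As a toy illustration of the translation just described (not how a real MMU is implemented), here is a short sketch that splits a virtual address into a page number and an offset and resolves it through a per-process page table; the addresses mirror the example above.

```python
# Toy model of per-process page-table translation with 4KB pages.
# This mirrors the example above; it is an illustration, not a real MMU.
PAGE_SIZE = 4096  # 4KB pages

# virtual page number -> physical page base address
PAGE_TABLES = {
    "process1": {0: 0x1000, 1: 0x3000},  # second page added after the malloc()
    "process2": {0: 0x2000},
}


def translate(process: str, virtual_addr: int) -> int:
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)  # split into page number + offset
    phys_base = PAGE_TABLES[process][vpn]          # a missing entry is akin to a page fault
    return phys_base + offset


assert translate("process1", 0x0) == 0x1000
assert translate("process2", 0x0) == 0x2000
assert translate("process1", 0x1234) == 0x3234     # second page, fragmented in physical memory
```

The same per-page lookup is also where a real CPU finds the User/Supervisor bit discussed earlier, which is why permission information is available as early as the TLB lookup.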
The address space appears contiguous to the process but in reality is fragmented across the physical memory space:

Process 1
  Virtual memory      Physical memory
  [0x0-0xFFF]         [0x1000-0x1FFF]
  [0x1000-0x1FFF]     [0x3000-0x3FFF]

Process 2
  Virtual memory      Physical memory
  [0x0-0xFFF]         [0x2000-0x2FFF]

There is an additional translation step before this that converts a logical address to a linear address using x86 segmentation. However, most operating systems today do not use segmentation in the classical sense, so we'll ignore it for now.

Memory Protection

Besides creating virtual address spaces, paging is also used as a form of protection. The translations above are stored in a structure called a page table. Each 4KB page can have specific attributes and access rights stored along with the translation data itself. For example, pages can be defined as read-only: if a memory store is executed against a read-only page, the CPU triggers a fault. Straight from the x86 reference manual, the following non-exhaustive list of attribute bits (which behave like boolean flags) is stored with each page table entry:

  P    Present          Must be 1 to map a 4-KByte page
  R/W  Read/write       If 0, writes may not be allowed to the page referenced by this entry
  U/S  User/supervisor  If 0, user-mode accesses are not allowed to the page referenced by this entry
  A    Accessed         Indicates whether software has accessed the page referenced by this entry
  D    Dirty            Indicates whether software has written to the page referenced by this entry
  G    Global           If CR4.PGE = 1, determines whether the translation is global; ignored otherwise
  XD   Execute Disable  If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 4-KByte page controlled by this entry); otherwise reserved (must be 0)

Minimizing Context Switching Cost

We showed how each process has its own virtual address mapping. The kernel is a process just like any other and also has a virtual memory mapping. When the CPU switches context from one process to another, there is a high switching cost, as much of the architectural state needs to be saved to memory so that the suspended process can resume with that saved state when it runs again. However, many system calls, such as I/O and interrupt handling, need to be performed by the kernel, which means a CPU would constantly be switching between a user process and the kernel to service them. To minimize this cost, kernel engineers and computer architects map the kernel pages directly into each user process's virtual memory space, avoiding the context switch. This is done via the User/Supervisor access rights bit: the OS maps the kernel space but designates it as supervisor-only (a.k.a. Ring 0) access, so that user code cannot access those pages. Those pages therefore appear invisible to any code running at user privilege level (a.k.a. Ring 3). While running in user mode, if the CPU sees an instruction access a page that requires supervisor rights, a page fault is triggered. In x86, page access rights are one of the paging-related reasons that can trigger a #PF (page fault).

The Global Bit

Most translations are private to the process that owns them. This ensures Process 1 cannot access Process 2's data, since there is no mapping to the [0x2000-0x2FFF] physical memory range in Process 1's page tables.
However, many system calls, interrupt handlers, and similar kernel services are shared across all processes. Normally, this would mean each process replicates the kernel mapping, putting pressure on the caches that hold these translations and raising the cost of context switching between processes. The Global bit enables certain translations (i.e., the kernel memory space) to be visible across all processes.

Closing Thoughts

It's always interesting to dig into security issues. Systems are now expected to be secure, unlike in the '90s, and security is only becoming more critical with the growth of crypto, biometric verification, mobile payments, and digital health. A large breach is much scarier for consumers and businesses today than it was in the '90s. At the same time, we also need to keep the discussion going when new reports come out. The steps to trigger the Meltdown vulnerability have been reproduced by various parties. However, it probably isn't the mere act of having speculation and cache timing attacks that caused Meltdown, nor is it a fundamental breakdown in CPU architecture; rather, it looks like a logic bug that slipped validation. That means speculation and caches are not going away anytime soon, nor will Intel need an entirely new architecture to fix Meltdown. Instead, the change needed in future x86 CPUs is likely a few small gate changes to the combinational logic that determines whether a hit in the L1D$ (or any temporary buffers) is good.
Delivering new features and updates to users without causing disruptions or downtime is a crucial challenge in the fast-paced world of software development. Rolling out changes is risky: there is always a chance of introducing bugs or causing downtime. An answer to this problem can be found in the DevOps movement's popular blue-green deployment strategy, which enables uninterrupted software delivery by utilizing parallel environments and careful traffic routing. In this article, we will explore the principles, benefits, and best practices of blue-green deployment, shedding light on how it can empower organizations to release software with confidence.

Understanding Blue-Green Deployment

Blue-green deployment is a software deployment strategy for reducing risk and downtime when releasing new versions or updates of an application. It entails running two parallel instances of the same production environment, with the "blue" environment representing the current stable version and the "green" environment hosting the new one. With this configuration, switching between the two environments can be done without disrupting end users.

The fundamental idea behind blue-green deployment is to keep user traffic routed to the blue environment, protecting the production system's stability and dependability, while the green environment is set up and thoroughly tested. Developers and QA teams can validate the new version there before it is made available to end users. The deployment process typically involves the following steps:

Initial Deployment: The blue environment is the initial production environment running the stable version of the application. Users access the application through this environment, and it serves as the baseline for comparison with the updated version.

Update Deployment: The updated version of the application is deployed to the green environment, which mirrors the blue environment in terms of infrastructure, configuration, and data. The green environment remains isolated from user traffic initially.

Testing and Validation: The green environment is thoroughly tested to ensure that the updated version functions correctly and meets the desired quality standards. This includes running automated tests, performing integration tests, and potentially conducting user acceptance testing or canary releases.

Traffic Switching: Once the green environment passes all the necessary tests and validations, the traffic routing mechanism is adjusted to start directing user traffic from the blue environment to the green environment. This switch can be accomplished using techniques such as DNS changes, load balancer configuration updates, or reverse proxy settings (a minimal Kubernetes-based sketch appears at the end of this article).

Monitoring and Verification: Throughout the deployment process, both the blue and green environments are monitored to detect any issues or anomalies.
Monitoring tools and observability practices help identify performance problems, errors, or inconsistencies in real time and ensure the health and stability of the application in the green environment.

Rollback and Cleanup: In the event of unexpected issues or unsatisfactory results, a rollback strategy can be employed to switch the traffic back to the blue environment, reverting to the stable version. Additionally, any resources or changes made in the green environment during the deployment process may need to be cleaned up or reverted.

The advantages of blue-green deployment are numerous. By maintaining parallel environments, organizations can significantly reduce downtime during deployments. They can also mitigate risks by thoroughly testing the updated version before exposing it to users, allowing for quick rollbacks if issues arise. Blue-green deployment also supports scalability testing, continuous delivery practices, and experimentation with new features. Overall, it is a valuable approach for organizations seeking seamless software updates, minimal user disruption, and a reliable, efficient deployment process.

Benefits of Blue-Green Deployment

Blue-green deployment offers several significant benefits for organizations looking to deploy software updates with confidence and minimize the impact on users:

Minimized Downtime: Blue-green deployment significantly reduces downtime during the deployment process. By maintaining parallel environments, organizations can prepare and test the updated version (green environment) alongside the existing stable version (blue environment). Once the green environment is deemed stable and ready, the switch from blue to green can be accomplished seamlessly, resulting in minimal or no downtime for end users.

Rollback Capability: Blue-green deployment provides the ability to roll back quickly to the previous version (blue environment) if issues arise after the deployment. In the event of unforeseen problems or performance degradation in the green environment, organizations can redirect traffic back to the blue environment, ensuring a swift return to a stable state without impacting users.

Risk Mitigation: By maintaining two identical environments, the green environment can undergo thorough testing, validation, and user acceptance testing before live traffic is directed to it. This reduces the risk of exposing users to faulty or unstable software and increases overall confidence in the deployment process.

Scalability and Load Testing: Blue-green deployment facilitates load testing and scalability validation in the green environment without affecting production users. Organizations can simulate real-world traffic and user loads in the green environment to evaluate the performance, scalability, and capacity of the updated version. This helps identify potential bottlenecks before they reach the entire user base, ensuring a smoother user experience.

Continuous Delivery and Continuous Integration: Blue-green deployment aligns well with continuous delivery and continuous integration (CI/CD) practices. By automating the deployment pipeline and integrating it with version control and automated testing, organizations can achieve a seamless and streamlined delivery process.
CI/CD practices enable faster and more frequent releases, reducing time-to-market for new features and updates.

Flexibility for Testing and Experimentation: Blue-green deployment provides a controlled environment for testing and experimentation. Organizations can use the green environment to test new features, conduct A/B testing, or gather user feedback before fully rolling out changes. This allows for data-driven decision-making and the ability to iterate and improve software based on user input.

Improved Reliability and Fault Tolerance: By maintaining two separate environments, blue-green deployment enhances reliability and fault tolerance. If infrastructure or environment failures occur in one environment, the other can continue to handle user traffic seamlessly. This redundancy keeps the overall system available and minimizes the impact of failures on users.

Implementing Blue-Green Deployment

To successfully implement blue-green deployment, organizations need to follow a series of steps and considerations. The process involves setting up parallel environments, managing infrastructure, automating deployment pipelines, and establishing efficient traffic routing mechanisms. Here is a step-by-step guide:

Duplicate Infrastructure: Duplicate the infrastructure required to support the application in both the blue and green environments. This includes servers, databases, storage, and any other components necessary for the application's functionality. Ensure that the environments are identical to minimize compatibility issues.

Automate Deployment: Implement automated deployment pipelines to ensure consistent and repeatable deployments. Automation tools such as Jenkins, Travis CI, or GitLab CI/CD can help automate the deployment process. Create a pipeline that includes steps for building, testing, and deploying the application to both the blue and green environments.

Version Control and Tagging: Adopt proper version control practices to manage different releases effectively. Use a version control system like Git to track changes and create clear tags or branches for each environment. This helps in identifying and managing the blue and green versions of the software.

Automated Testing: Implement comprehensive automated testing to validate the functionality and stability of the green environment before routing traffic to it. Include unit tests, integration tests, and end-to-end tests in your testing suite. Automated tests help catch issues early in the deployment process and ensure a higher level of confidence in the green environment.

Traffic Routing Mechanisms: Choose appropriate traffic routing mechanisms to direct user traffic between the blue and green environments. Popular options include DNS switching, reverse proxies, or load balancers. Configure the routing mechanism to gradually shift traffic from the blue environment to the green environment, allowing for a controlled transition.

Monitoring and Observability: Implement robust monitoring and observability practices to gain visibility into the performance and health of both environments. Monitor key metrics, logs, and user feedback to detect any anomalies or issues. Utilize monitoring tools like Prometheus, Grafana, or the ELK Stack to ensure real-time visibility into the system.

Incremental Rollout: Adopt an incremental rollout approach to minimize risks and ensure a smoother transition.
Gradually increase the percentage of traffic routed to the green environment while monitoring the impact and collecting feedback. This allows for early detection of issues and a quick response before the entire user base is affected.

Rollback Strategy: Have a well-defined rollback strategy in place to revert to the stable blue environment if issues arise in the green environment. This includes updating the traffic routing mechanism to redirect traffic back to the blue environment. Ensure that the rollback process is well documented and can be executed quickly to minimize downtime.

Continuous Improvement: Regularly review and improve your blue-green deployment process. Collect feedback from the deployment team, users, and stakeholders to identify areas for enhancement. Analyze metrics and data to optimize the deployment pipeline, automate more processes, and enhance the overall efficiency and reliability of the blue-green deployment strategy.

By following these implementation steps and considering key aspects such as infrastructure duplication, automation, version control, testing, traffic routing, monitoring, and continuous improvement, organizations can successfully implement blue-green deployment. This approach allows for seamless software updates, minimized downtime, and the ability to roll back if necessary.

Best Practices for Blue-Green Deployment

Blue-green deployment is a powerful strategy for seamless software delivery and minimizing risks during the deployment process. To make the most of this approach, consider the following best practices:

Version Control and Tagging: Implement proper version control practices to manage different releases effectively. Clearly label and tag the blue and green environments to ensure easy identification and tracking of each version. This helps maintain a clear distinction between the stable and updated versions of the software.

Automated Deployment and Testing: Leverage automated deployment pipelines to ensure consistent and repeatable deployments and reduce the chances of human error. Implement automated testing at different levels, including unit tests, integration tests, and end-to-end tests, to verify the functionality and stability of the green environment before routing traffic to it.

Infrastructure Duplication: Duplicate the infrastructure and set up identical environments for blue and green. This includes replicating servers, databases, and any other dependencies required by the application. Keeping the environments as similar as possible ensures a smooth transition without compatibility issues.

Traffic Routing Mechanisms: Choose appropriate traffic routing mechanisms to direct user traffic from the blue environment to the green environment seamlessly. Popular techniques include DNS switching, reverse proxies, or load balancers. Carefully configure and test these mechanisms to ensure they handle traffic routing accurately and efficiently.

Incremental Rollout: Consider adopting an incremental rollout approach rather than switching all traffic from blue to green at once. Gradually increase the percentage of traffic routed to the green environment while closely monitoring the impact. This allows for real-time feedback and rapid response to any issues that may arise, minimizing the impact on users.
Canary Releases: Implement canary releases by deploying the new version to a subset of users or a specific geographic region before rolling it out to the entire user base. Canary releases allow you to collect valuable feedback and perform additional validation in a controlled environment. This approach helps mitigate risks and ensures a smoother transition to the updated version.

Rollback Strategy: Always have a well-defined rollback strategy in place. Despite thorough testing and validation, issues may still occur after the deployment. Having a rollback plan ready allows you to quickly revert to the stable blue environment if necessary, ensuring minimal disruption to users and continuity of service.

Monitoring and Observability: Implement comprehensive monitoring and observability practices to gain visibility into the performance and health of both the blue and green environments. Monitor key metrics, logs, and user feedback to identify any anomalies or issues. This allows for proactive detection and resolution of problems, enhancing the overall reliability of the deployment process.

By following these best practices, organizations can effectively leverage blue-green deployment to achieve rapid and reliable software delivery. The careful implementation of version control, automation, traffic routing, and monitoring ensures a seamless transition between versions while minimizing the impact on users and mitigating risks.

Conclusion

Blue-green deployment is a potent method for ensuring smooth and dependable releases. By maintaining two parallel environments and shifting user traffic gradually, organizations can minimize risks, cut down on downtime, and boost confidence in their new releases. The approach enables thorough testing, validation, and scalability evaluation, and it aligns well with continuous delivery principles and CI/CD practices. With the appropriate infrastructure in place, automated deployment pipelines, and effective traffic routing mechanisms, teams can seamlessly release new features and updates while retaining the ability to roll back quickly. Adopting blue-green deployment, along with the best practices discussed in this article, can be a game-changer for businesses looking to offer their users top-notch experiences while lowering the risk of deployment-related disruptions.
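To ground the traffic-switching step referenced above, here is a minimal sketch assuming a Kubernetes setup in which the blue and green Deployments are distinguished by a version label and exposed behind a single Service; the service name, namespace, and labels are hypothetical, and a DNS- or load-balancer-based switch would follow the same idea.

```python
# Minimal sketch: blue/green cutover by repointing a Kubernetes Service selector.
# Service/namespace names and the "version" label are illustrative assumptions.
from kubernetes import client, config


def switch_traffic(service_name: str, namespace: str, target_version: str) -> None:
    """Repoint the Service at the Deployment labeled with target_version."""
    config.load_kube_config()  # assumes a local kubeconfig
    core = client.CoreV1Api()
    patch = {"spec": {"selector": {"app": service_name, "version": target_version}}}
    core.patch_namespaced_service(name=service_name, namespace=namespace, body=patch)


if __name__ == "__main__":
    # Cut over from blue to green once the green environment has passed validation.
    switch_traffic(service_name="storefront", namespace="default", target_version="green")
    # Rolling back is the same call with target_version="blue".
```

Because the cutover is a single selector change, rolling back is the same operation pointed at the blue label, which is exactly the rollback property the strategy relies on.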
The Wiz Research team recently discovered that an overprovisioned SAS token had been lying exposed on GitHub for nearly three years. This token granted access to a massive 38-terabyte trove of private data. The Azure storage account contained additional secrets, such as private SSH keys, hidden within the disk backups of two Microsoft employees. This revelation underscores the importance of robust data security measures.

What Happened?

Wiz Research disclosed a data exposure incident found on Microsoft's AI GitHub repository on June 23, 2023. The researchers managing the repository used an Azure Storage sharing feature, an SAS token, to give access to a bucket of open-source AI training data. The token was misconfigured, giving access to the account's entire cloud storage rather than the intended bucket. That storage comprised 38TB of data, including a disk backup of two employees' workstations containing secrets, private keys, passwords, and more than 30,000 internal Microsoft Teams messages.

SAS (Shared Access Signature) tokens are signed URLs for sharing Azure Storage resources. They are configured with fine-grained controls over how a client can access the data: which resources are exposed (full account, container, or selection of files), with what permissions, and for how long. See the Azure Storage documentation. After the incident was disclosed to Microsoft, the SAS token was invalidated. From its first commit to GitHub (July 20, 2020) to its revocation, nearly three years elapsed (see the timeline presented by the Wiz Research team). As the researchers emphasized, the root cause was a misconfiguration of the Shared Access Signature (SAS).

Data Exposure

The token allowed anyone to access an additional 38TB of data, including sensitive data such as secret keys, personal passwords, and over 30,000 internal Microsoft Teams messages from hundreds of Microsoft employees. As highlighted by the researchers, this could have allowed an attacker to inject malicious code into the storage blob, code that would then automatically execute with every download by a user (presumably an AI researcher) trusting Microsoft's reputation, potentially leading to a supply chain attack.

Security Risks

According to the researchers, Account SAS tokens such as the one presented in their research pose a high security risk, because they are highly permissive, long-lived tokens that escape the monitoring perimeter of administrators. When a user generates a new token, it is signed on the client side and doesn't trigger any Azure event. To revoke a token, an administrator needs to rotate the signing account key, which revokes all the other tokens signed with that key at once. Ironically, the security risk of a Microsoft product feature (Azure SAS tokens) caused an incident for a Microsoft research team, a risk recently referenced in the second version of the Microsoft threat matrix for storage services.

Secrets Sprawl

This example perfectly underscores the pervasive issue of secrets sprawl within organizations, even those with advanced security measures. It highlights how an AI research team, or any data team, can independently create tokens that could jeopardize the organization, sidestepping the security safeguards designed to shield the environment.
Mitigation Strategies

For Azure Storage Users:

1 - Avoid Account SAS Tokens

The lack of monitoring makes this feature a security hole in your perimeter. A better way to share data externally is to use a Service SAS with a Stored Access Policy, which binds a SAS token to a policy and provides the ability to manage token policies centrally (a minimal sketch appears at the end of this article). Better still, if you don't need this Azure Storage sharing feature at all, simply disable SAS access for each account you own.

2 - Enable Azure Storage Analytics

Active SAS token usage can be monitored through the Storage Analytics logs for each of your storage accounts. Azure Metrics allows the monitoring of SAS-authenticated requests and identifies storage accounts that have been accessed through SAS tokens, for up to 93 days.

For All:

1 - Audit Your GitHub Perimeter for Sensitive Credentials

With around 90 million developer accounts, 300 million hosted repositories, and 4 million active organizations, including 90% of Fortune 100 companies, GitHub holds a much larger attack surface than meets the eye. Last year, GitGuardian uncovered 10 million leaked secrets on public repositories, up 67% from the previous year. GitHub must be actively monitored as part of any organization's security perimeter. Incidents involving leaked credentials on the platform continue to cause massive breaches for large companies, and this hole in Microsoft's protective shell is reminiscent of the Toyota data breach from a year ago. On October 7, 2022, Toyota, the Japanese automotive manufacturer, revealed it had accidentally exposed a credential allowing access to customer data in a public GitHub repo for nearly five years; the code was public from December 2017 through September 2022. If your company has development teams, it is likely that some of your company's secrets (API keys, tokens, passwords) have ended up on public GitHub. It is therefore highly recommended to audit your GitHub attack surface as part of your attack surface management program.

Final Words

Every organization, regardless of size, needs to be prepared to tackle a wide range of emerging risks. These risks often stem from insufficient monitoring of the extensive software operations within today's modern enterprises. In this case, an AI research team inadvertently created and exposed a misconfigured cloud storage sharing link, bypassing security guardrails. But how many other departments, such as support, sales, operations, or marketing, could find themselves in a similar situation? The increasing dependence on software, data, and digital services amplifies cyber risks on a global scale. Combatting the spread of confidential information and its associated risks requires reevaluating security teams' oversight and governance capabilities.
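To ground the first mitigation, here is a minimal sketch, assuming the azure-storage-blob Python SDK, of defining a stored access policy on a container and issuing a Service SAS bound to it; the account URL, container name, policy id, and key handling are hypothetical placeholders.

```python
# Minimal sketch: a Service SAS bound to a stored access policy (azure-storage-blob).
# Account URL, container name, policy id, and key handling are hypothetical placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import (
    AccessPolicy,
    BlobServiceClient,
    ContainerSasPermissions,
    generate_container_sas,
)

ACCOUNT_URL = "https://example.blob.core.windows.net"  # hypothetical account
ACCOUNT_KEY = "<account-key>"                          # keep this out of source control
CONTAINER = "training-data"                            # hypothetical container
POLICY_ID = "read-only-share"

service = BlobServiceClient(account_url=ACCOUNT_URL, credential=ACCOUNT_KEY)
container = service.get_container_client(CONTAINER)

# 1. Define a stored access policy: read/list only, short-lived.
policy = AccessPolicy(
    permission=ContainerSasPermissions(read=True, list=True),
    start=datetime.now(timezone.utc),
    expiry=datetime.now(timezone.utc) + timedelta(days=7),
)
container.set_container_access_policy(signed_identifiers={POLICY_ID: policy})

# 2. Issue a Service SAS that references the policy instead of embedding
#    permissions and expiry in the token itself.
sas_token = generate_container_sas(
    account_name=service.account_name,
    container_name=CONTAINER,
    account_key=ACCOUNT_KEY,
    policy_id=POLICY_ID,
)
print(f"{ACCOUNT_URL}/{CONTAINER}?{sas_token}")
```

Revocation then becomes a matter of editing or deleting the stored access policy, rather than rotating the account key and invalidating every token signed with it.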