Trouble in IT Paradise with Darkening Clouds:
If you ask an enterprise CIO over a couple of drinks what his or her biggest hurdle is today in delivering the right resources to the business at the right time and at the right price, the answer would be that “the IT is too darn complex.” Over a long period of time, infrastructure vendors have hijacked Information Technology with their complex silos, and expediency has spawned myriad tools and point solutions overlaid with a management web. In addition, Venture Capitalists looking for quick “insertion points,” with no overarching architectural framework, have proliferated tools and appliances that contribute to the current complexity and tool fatigue.
After a couple more drinks, if you press the CIO on why his or her mission-critical applications are not migrating to the cloud, which claims less complexity, the CIO laments that no cloud provider is willing to sign a warranty assuring the service levels for mission-critical applications, one that guarantees application availability, performance and security. “Every cloud provider talks about infrastructure service levels but is not willing to step up to assure application availability, performance and security. There are myriad off-Main-Street providers that claim to offer orchestration to provide the service levels, but no one yet is signing on the dotted line.” The situation is more complicated when the resources span multiple infrastructure providers.
Decoupling the strong binding between application management and infrastructure management is key for the CIO as more applications are developed with ever-shorter time to market. The CIO’s top five priorities are transnational applications demanding distributed resources, security, cost, compliance and uptime. A Gartner report claims that CIOs spend 74% of the IT budget on keeping the application “lights on” and another 18% on “changing the bulbs” and other maintenance activities. (It is interesting to recall that before Strowger’s switch eliminated the many operators sitting in long rows plugging countless jacks into countless plugs, the cost of adding and managing new subscribers was rising geometrically. According to the Bell System chronicles, the general manager of one large-city telephone company at the time wrote that he could see the day coming soon when he would go broke merely by adding a few more subscribers, because the cost of adding and managing a subscriber was far greater than the corresponding revenue. The only difference between today’s IT datacenter and the central office before Strowger’s switch is that “very expensive consultants, countless hardware appliances, and countless software systems that manage them” have replaced “many operators, countless plugs and countless jacks.”)
In order to utilize commodity infrastructure while maintaining high security, and mobility for performance and availability, CIOs are looking for solutions that let them focus on application quality of service (QoS); they are willing to outsource infrastructure management to providers who can assure application mobility, availability and security, albeit with end-to-end service visibility and control at their disposal.
While the public clouds seem to offer a way to leverage commodity infrastructure with on-demand Virtual Machine provisioning, four hurdles are preventing CIOs from embracing the clouds for mission-critical applications:
- Current mission-critical, and even non-mission-critical, applications and services (groups of applications) are used to highly secure, low-latency infrastructures that have been hardened and carefully managed, and CIOs are loath to spend more money to reproduce the same SLAs in public clouds.
- Dependence on a particular service provider’s infrastructure APIs, on Virtual Machine image management (nested or not) and its infrastructure dependencies, and on the cost and complexity of added self-healing, auto-scaling and live-migration services creates lock-in on that provider’s infrastructure and management services. This defeats the purpose of leveraging the commodity infrastructure offered by different service providers.
- The increasing scope creep as infrastructure providers move “up the stack” to provide application awareness and insert their APIs into application development, in the name of satisfying non-functional requirements (availability, security, performance optimization) at run-time, has started to increase the complexity and cost of application and service development. The resulting proliferation of tools and point solutions, without a global architectural framework for using resources from multiple service providers, has increased integration and troubleshooting costs.
- Global communications, collaboration and commerce at the speed of light have increased the scale of computing, and distributed resource management has fallen short in meeting both that scale and the fluctuations caused by demand, as well as fluctuations in resource availability, performance and security.
The Inadequacy of Ad-hoc Programming to Solve Distributed Computing Complexity:
Unfortunately, the complexity is a structural issue rather than an operational or infrastructure-technology issue, and it cannot be resolved with ad-hoc programming techniques for managing resources. Cockshott et al. conclude their book “Computation and its Limits” with the paragraph: “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” While the success of IT in modeling and executing business processes has evolved into today’s distributed datacenters and cloud computing infrastructures that provide on-demand computing resources to model and execute business processes, the structure and fluctuations that dictate the evolution of a computation have introduced complexity in dealing with real-time changes in the interaction between the infrastructure and the computations it performs. The complexity manifests in the following ways:
- In a distributed computing environment, ensuring that the right computing resources (CPU, memory, network bandwidth, latency, storage capacity, throughput and IOPs) are available to the right software component contributing to a service transaction requires orchestration and management of myriad computing infrastructures, often owned by different providers with different profit motives and incentives. The resulting complexity of resource management to assure the availability, performance and security of service transactions adds to the cost of computing; for example, it is estimated that up to 70% of the current IT budget is consumed in assuring service availability, performance and security. The complexity is compounded in distributed computing environments supported by heterogeneous infrastructures with disparate management systems.
- In a large-scale dynamic distributed computation supported by myriad infrastructure components, increased component-failure probabilities introduce a non-determinism (for example, Google observes emergent behavior in its scheduling of distributed computing resources when dealing with large numbers of resources) that must be addressed by a service control architecture that decouples the functional and non-functional aspects of computing.
- Fluctuations in computing resource requirements dictated by changing business priorities, workload variations that depend on service consumption profiles, and real-time latency constraints dictated by the affinity of service components all demand a run-time response that dynamically adjusts the computing resources. The current dependence on myriad orchestrators and management systems cannot scale in a distributed infrastructure without either vendor lock-in on infrastructure access methods or a universal standard, which often stifles the innovation and competition needed to meet fast-changing business needs.
Thus the function, structure and fluctuations involved in the dynamic processes delivering a service transaction are driving the search for new computation, management and programming models that address the unification of the computer and the computed, and that decouple service management from infrastructure management at run-time.
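The decoupling argued for above can be made concrete with a minimal sketch: the service component reports only its run-time vital signs, while a separate manager compares them against a policy and decides on resource actions, so no infrastructure API leaks into the service logic. All names and thresholds here are hypothetical illustrations, not an existing framework.

```python
from dataclasses import dataclass

@dataclass
class VitalSigns:
    cpu_pct: float         # observed CPU utilization of the component
    latency_ms: float      # observed response latency

@dataclass
class Policy:
    max_cpu_pct: float     # scale out above this utilization
    max_latency_ms: float  # migrate closer to consumers above this latency

def manage(vitals: VitalSigns, policy: Policy) -> list[str]:
    """Pure non-functional decision logic, kept outside the service code."""
    actions = []
    if vitals.cpu_pct > policy.max_cpu_pct:
        actions.append("scale-out")
    if vitals.latency_ms > policy.max_latency_ms:
        actions.append("migrate")
    return actions or ["no-op"]

print(manage(VitalSigns(cpu_pct=92.0, latency_ms=30.0),
             Policy(max_cpu_pct=80.0, max_latency_ms=50.0)))  # ['scale-out']
```

The point of the sketch is the separation of concerns: the service emits vital signs; only the manager knows the policy, and only the (unshown) executor knows any infrastructure API.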
It is the Architecture, Stupid:
A business process is defined both by functional requirements, which dictate the business domain functions and logic, and by non-functional requirements, which define operational constraints related to service availability, reliability, performance, security and cost dictated by business priorities, workload fluctuations and resource latency constraints. A non-functional requirement specifies criteria that can be used to judge the operation of a system, rather than specific behaviors. The plan for implementing functional requirements is detailed in the system design; the plan for implementing non-functional requirements is detailed in the system architecture. While much progress has been made in system design and development, the architecture of distributed systems falls short of addressing the non-functional requirements for two reasons:
- Current distributed systems architecture, with its server-centric and low-bandwidth origins, has created layers of resource-management-centric ad-hoc software to address the various uncertainties that arise in a distributed environment. The lack of support for concurrency, synchronization, parallelism and mobility of applications in the current serial von Neumann stored-program control has given rise to ad-hoc software layers that monitor and manage distributed resources. While this approach may have been adequate when distributed resources were owned by a single provider and controlled by a framework that provided architectural support for implementing non-functional requirements, the proliferation of commodity distributed resource clouds offered by different service providers with different management infrastructures adds scaling and complexity issues. Current OpenStack and AWS API discussions are a clear example: they force a choice between one and the other, or increased complexity to use both.
- The resource-centric view of IT currently demotes application and service management to second-class citizenship: the QoS of an application/service is monitored and managed by myriad resource management systems, overlaid with multiple correlation and analysis layers, used to manipulate the distributed resources to adjust the CPU, memory, bandwidth, latency, and storage IOPs, throughput and capacity, which is all that is required to keep the application/service meeting its quality of service. Obviously, this approach cannot scale unless a single set of standards evolves or a single-vendor lock-in occurs.
Unless an architectural framework evolves to decouple application/service management from myriad infrastructure management systems owned and operated by different service providers with different profit motives, the complexity and cost of management will only increase.
A Not So Cool Metaphor to Deliver Very Cool Services Anywhere, Anytime and On-demand:
A lesson in an architectural framework that addresses non-functional requirements while connecting billions of users anywhere, anytime, on demand is found in the Plain Old Telephone System (POTS). From the beginnings of AT&T to today’s remaking of at&t, much has changed, but two things remain constant: universal service (on a global scale) and the telecom-grade “trust” that is taken for granted. Very recently, Mark Zuckerberg proclaimed at the largest mobile technology conference in Barcelona that his very cool service, Facebook, wants to be the dial tone for the Internet. Originally, the dial tone was introduced to assure the telephone user that the exchange was functioning when the telephone was taken off-hook, breaking the silence (before an operator responded) with an audible tone. Later, the automated exchanges provided a benchmark for telecom-grade trust: managed resources on demand with high availability, performance and security. Today, as soon as the user goes off-hook, the network recognizes the profile based on the calling telephone number. As soon as the destination number is dialed, the network recognizes the destination profile and provisions all the network resources required to make the desired connection, commence billing, and monitor and assure the connection until one of the parties initiates a disconnect. During the call, if the connection experiences any changes that impact the non-functional requirements, the network intelligence takes appropriate action based on policies. The resulting resiliency (availability, performance and security), efficiency and ability to scale to connect billions of users on demand have come to be known as “telecom-grade trust.” An architectural flaw in the original service design (exploited by Steve Jobs by building a blue box) was fixed by an architectural change that separated the data path from the control path.
The resulting 800-service call model enabled a new class of services such as call forwarding, call waiting and conference calls.
The Internet, on the other hand, evolved to connect billions of computers anywhere, anytime, from the prophetic statement made by J. C. R. Licklider: “A network of such (computers), connected to one another by wide-band communication lines [which provided] the functions of present-day libraries together with anticipated advances in information storage and retrieval and [other] symbiotic functions.” The convergence of voice over IP, data and video networks has given rise to a new generation of services enabling communication, collaboration and commerce at the speed of light. The result is that the datacenter has replaced the central office as the hub from which myriad voice, video and data services are created and delivered on a global scale. However, the management of these services, which determines their resiliency, efficiency and scaling, is another matter. In order to provide on-demand services, anywhere, anytime, with prescribed quality of service, in an environment of wildly fluctuating workloads, changing business priorities and latency constraints dictated by the proximity of service consumers and suppliers, resources have to be managed in real-time across distributed pools to match the service QoS to resource SLAs. The telephone network is designed to share resources on a global scale and to connect them as required in real-time to meet the non-functional service requirements; current datacenters (whether privately owned or publicly provided as cloud services) are not. There are three structural deficiencies in the current distributed datacenter architecture that prevent it from matching telecom-grade resiliency, efficiency and scaling:
- The data path and the service control path are not decoupled, giving rise to the same kind of problems that Steve Jobs exploited, which forced a re-architecting of the telephone network.
- Service management is strongly coupled with resource management systems and does not scale as resources become distributed and are supplied by multiple service providers with different profit motives and incentives. Since resources are becoming a commodity, every service provider wants to move up the stack to create lock-in.
- The current trend of infusing resource-management APIs into service logic to provide resource management at run-time, and of application-aware architectures that seek intimacy with applications, only increases complexity and makes service composition from reusable service components all the more difficult, because of the increased lock-in with resource management systems.
Resource-management-based datacenter operations miss an important feature of service/application management: all services are not created equal. They have different latency and throughput requirements, different business priorities, and different workload characteristics and fluctuations; one size does not fit all. In addition to the current complexity and cost of resource management to assure service availability, reliability, performance and security, an even more fundamental issue plagues the current distributed systems architecture. A distributed transaction that spans multiple servers, networks and storage devices in multiple geographies uses resources that span multiple datacenters. The fault, configuration, accounting, performance and security (FCAPS) management of a distributed transaction requires end-to-end connection management, much like a telecommunication service spanning distributed resources. Therefore, focusing only on resource management in a single datacenter, without visibility and control of all the resources participating in the transaction, will not assure service availability, reliability, performance and security at run-time.
New Dial Tones for Application/Service Development, Deployment and Operation:
Current Web-scale applications are distributed transactions that span multiple resources widely scattered across multiple locations, owned and managed by different providers. In addition, the transactions are transient, making connections with various components to fulfill an intent and closing them, only to reconnect when they are needed again. This is in sharp contrast to the always-on distributed computing paradigm of yesterday.
In creating, deploying and operating these services, there are three key stakeholders and associated processes:
- Resource providers deliver the vital resources required to create, deploy and operate these services on demand, anywhere, anytime (the resource dial tone). The vital resources are just the CPU, memory, network latency and bandwidth, and storage capacity, throughput and IOPs required to execute the application or service that has been compiled to “1”s and “0”s (the Turing machine). The resource consumers could not care less how these are provided, as long as the service levels agreed to when the application or service requested the resources at provisioning time are maintained (matching the QoS request with the SLA and maintaining it during the application/service lifetime). The resource dial tone that assures QoS with a resource SLA is offered to two different types of consumers: first, the application developer, who uses these resources to develop service components and composes them into more complex services with their own QoS requirements; second, the service operators, who use the SLAs to manage QoS at run-time and deliver the services to end users.
- Application developers want to use their own tools and best practices without constraints from resource providers, and the run-time vital signs required to execute their services should be transparent as to where, or by whom, the vital resources are provided. The resources must support the QoS specified by the developer or service composer, depending on the context, communication, control and constraint needs. Developers do not care how they get the CPU, memory, bandwidth, or storage capacity, throughput and IOPs, or how the latency constraints are met. This model is a major departure from the current SDN route of giving applications control of resources, which is not a scalable solution and does not allow decoupling resource management from service management.
- The service operators provide run-time QoS assurance by brokering QoS demands to match the best available resource pool that meets the cost and quality constraints (the management dial tone that assures non-functional requirements). The brokering function is a network service, à la service switching, that matches applications/services to the right resources.
The brokering service must then manage the non-functional requirements at run-time, just as in POTS.
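The brokering function in the list above can be sketched minimally: given a QoS demand, select the cheapest resource pool whose SLA satisfies it. The provider names, SLA fields and costs below are illustrative assumptions, not real offerings.

```python
def broker(demand: dict, pools: list[dict]) -> dict:
    """Return the cheapest pool whose SLA satisfies the QoS demand."""
    candidates = [
        p for p in pools
        if p["sla"]["availability"] >= demand["availability"]
        and p["sla"]["latency_ms"] <= demand["latency_ms"]
    ]
    if not candidates:
        raise LookupError("no pool meets the QoS demand")
    return min(candidates, key=lambda p: p["cost_per_hour"])

# Hypothetical resource pools offered by different infrastructure providers.
pools = [
    {"name": "provider-A", "cost_per_hour": 0.9,
     "sla": {"availability": 0.999, "latency_ms": 20}},
    {"name": "provider-B", "cost_per_hour": 0.5,
     "sla": {"availability": 0.99, "latency_ms": 40}},
    {"name": "provider-C", "cost_per_hour": 0.7,
     "sla": {"availability": 0.999, "latency_ms": 35}},
]
demand = {"availability": 0.999, "latency_ms": 50}
print(broker(demand, pools)["name"])  # provider-C: cheapest pool meeting 99.9%
```

A real broker would also re-run this match whenever vital signs drift, moving service components between pools at run-time; the essential point is that only the broker, not the service, talks to the providers.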
The New Service Operations Center (SOC) with End-to-end Service Visibility and Control Independent of Distributed Infrastructure Management Centers Owned by Different Infrastructure Providers:
The new Telco model that the broker facilitates allows enterprises and other infrastructure users to focus on services architecture and management, and to use infrastructure as a commodity from different infrastructure providers, just as Telcos provide shared resources through network services.
Figure 1: The Telco-grade services architecture that decouples end-to-end service transaction management from infrastructure management systems at run-time
The service broker matches the QoS of a service and its components with the service levels offered by different infrastructure providers, based on the service blueprint, which defines the context, constraints, communications and control abstractions of the service at hand. The service components are provided with the desired CPU, memory, bandwidth, latency, and storage IOPs, throughput and capacity. Decoupling service management from distributed infrastructure management systems puts the safety and survival of services first, and allows sectionalization, isolation, diagnosis and fixing of infrastructure at leisure, as is the case today with POTS.
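One way to picture such a service blueprint is as a structured document capturing the four abstractions named above, plus a check a broker might run to test an infrastructure offer against a component's constraints. The field names here are hypothetical illustrations, not a published schema.

```python
# A hypothetical service blueprint: context, constraints, communications, control.
blueprint = {
    "service": "order-processing",
    "context": {"business_priority": "high", "region_affinity": ["us-east"]},
    "constraints": {            # vital signs each component must be given
        "web": {"cpu_cores": 4, "memory_gb": 8, "latency_ms": 50},
        "db":  {"cpu_cores": 8, "memory_gb": 32, "iops": 5000},
    },
    "communications": [("web", "db")],   # component affinity pairs
    "control": {"on_sla_breach": "migrate", "on_failure": "restart"},
}

def meets(offer: dict, need: dict) -> bool:
    """True if an infrastructure offer covers every constrained vital sign
    (latency is an upper bound; every other quantity is a lower bound)."""
    for key, value in need.items():
        if key == "latency_ms":
            if offer.get(key, float("inf")) > value:
                return False
        elif offer.get(key, 0) < value:
            return False
    return True

offer = {"cpu_cores": 4, "memory_gb": 16, "latency_ms": 20}
print(meets(offer, blueprint["constraints"]["web"]))  # True
```

The blueprint, not the service code, is what the broker reads, which is precisely what keeps infrastructure APIs out of the service logic.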
It is important to note that the service dial tone Zuckerberg is talking about is not the resource dial tone or the management dial tone required to provide service connections and management at run-time; he is talking about the application end user receiving content. Facebook application developers do not care how the computing resources are provided, as long as their service QoS is maintained to meet the business priorities, workloads and latency constraints of delivering their service on a global scale. Facebook’s CIO would rather spend time maintaining service QoS by getting the resources wherever they are available to meet the service needs at reasonable cost. In fact, most CIOs would gladly shed the infrastructure management burden if they had QoS assurance and end-to-end service visibility and control (they could not care less about access to the resources or their management systems) to manage the non-functional requirements at run-time. After all, Facebook’s Open Compute Project is a side effect of trying to fill a gap left by infrastructure providers, not their main line of business. The crash that followed Zuckerberg’s announcement of the WhatsApp acquisition was not the “cool” application’s fault. They probably could have used a service broker/switch providing the old-fashioned resource dial tone, so that they could provide the service dial tone to their users.
This is similar to a telephone company assuring appropriate resources to connect different users based on their profiles, or the Internet connecting devices based on their QoS needs at run-time. The broker acts as a service switch that connects various service components at run-time and matches their QoS demands with appropriate resources.
With the right technology, well-established carriers with money and muscle may yet use the service broker/switch to provide the service-level warranties that enterprise CIOs require.
Will at&t and other Telcos have the last laugh by incorporating this brokering service switch in the network, making current distributed datacenters (cloud or otherwise, with physical or virtual infrastructure) a true commodity?