Many-core Servers, Solid State Drives, and High-Bandwidth Networks: The Hardware Upheaval, The Software Shortfall, The Future of IT and all that Jazz
August 2, 2014

The “gap between the hardware and the software of a concrete computer and even greater gap between pure functioning of the computer and its utilization by a user, demands description of many other operations that lie beyond the scope of a computer program, but might be represented by a technology of computer functioning and utilization”

Mark Burgin, “Super-Recursive Algorithms”, Springer, New York, 2005

Introduction

According to Holbrook (Holbrook 2003), “Specifically, creativity in all areas seems to follow a sort of dialectic in which some structure (a thesis or configuration) gives way to a departure (an antithesis or deviation) that is followed, in turn, by a reconciliation (a synthesis or integration that becomes the basis for further development of the dialectic). In the case of jazz, the structure would include the melodic contour of a piece, its harmonic pattern, or its meter…. The departure would consist of melodic variations, harmonic substitutions, or rhythmic liberties…. The reconciliation depends on the way that the musical departures or violations of expectations are integrated into an emergent structure that resolves deviation into a new regularity, chaos into a new order, surprise into a new pattern as the performance progresses.”

In this jazz metaphor, current IT evolved from a thesis, is now experiencing an antithesis, and is ripe for a synthesis that blends the old and the new into a harmonious melody: a new generation of highly scalable, distributed, secure services with the availability, cost and performance characteristics required to meet changing business priorities, highly fluctuating workloads and latency constraints.

The Hardware Upheaval and the Software Shortfall

There are three major factors driving datacenter traffic and its patterns:
1. A multi-tier architecture which determines the availability, reliability, performance, security and cost of initiating a user transaction to an end-point and delivering that service transaction to the user. The composition and management of the service transaction involve both the north-south traffic from the end-user to the end-point (most often over the Internet) and the east-west traffic that flows through various service components such as DMZ servers, web servers, application servers and databases. Most often these components exist within the datacenter or are connected through a WAN to other datacenters. Figure 1 shows a typical configuration.

Figure 1: Service Transaction Delivery Network

The transformation from client-server architectures to the “composed service” model, along with server virtualization that allows the mobility of Virtual Machines at run-time, is introducing new traffic patterns in which east-west traffic inside the datacenter grows by orders of magnitude compared to the north-south traffic going from the end-user to the service end-point or vice-versa. Traditional applications that evolved from client-server architectures use TCP/IP for all the traffic that goes across servers. While some optimizations attempt to improve performance for traffic that goes across servers using high-speed network technologies such as InfiniBand and Ethernet, TCP/IP and socket communications still dominate even among virtual servers within the same physical server.

2. The advent of many-core servers with tens and even hundreds of computing cores with high-bandwidth communication among them drastically alters the traffic patterns. When two applications are using two cores within a processor, the communication between them is not very efficient if it uses socket communication and the TCP/IP protocol instead of shared memory. When the two applications are running in two processors within the same server, it is more efficient to use PCI Express or other high-speed bus protocols instead of socket communication using TCP/IP. If the two applications are running in two servers within the same datacenter, it is more efficient to use Ethernet or InfiniBand. With the advent of application mobility using containers or even Virtual Machines, it is more efficient to switch the communication mechanism based on the context of where the applications are running. This context-sensitive switching is a better alternative to replicating current VLAN and socket communications inside the many-core server. It is important to recognize that the many-core servers and processors constitute a network where each node itself is a sub-network with different bandwidths and protocols (socket-based low-bandwidth communication between servers, InfiniBand or PCI Express bus based communication across processors in the same server, and shared-memory based low-latency communication across the cores inside the processor). Figure 2 shows the network of networks using many-core processors.

Figure 2: A Network of Networks with Multiple Protocols
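
The context-sensitive switching described in item 2 can be illustrated with a minimal sketch. The Endpoint placement record, the transport names and the selection rules below are illustrative assumptions, not an existing API; the point is simply that the choice of communication mechanism follows from where the two communicating components happen to be running at the moment.

```python
# A minimal sketch of context-sensitive transport selection: pick the most
# efficient communication mechanism based on the relative placement of two
# application components. All names here are hypothetical.
from dataclasses import dataclass
from enum import Enum


class Transport(Enum):
    SHARED_MEMORY = "shared memory (cores on the same processor)"
    PCIE_BUS = "PCI Express / high-speed bus (processors in the same server)"
    RDMA_FABRIC = "Ethernet or InfiniBand (servers in the same datacenter)"
    TCP_IP = "TCP/IP (across datacenters or over the Internet)"


@dataclass(frozen=True)
class Endpoint:
    datacenter: str
    server: str
    processor: int
    core: int


def select_transport(a: Endpoint, b: Endpoint) -> Transport:
    """Choose the cheapest transport the two endpoints' placement allows."""
    if a.datacenter != b.datacenter:
        return Transport.TCP_IP
    if a.server != b.server:
        return Transport.RDMA_FABRIC
    if a.processor != b.processor:
        return Transport.PCIE_BUS
    return Transport.SHARED_MEMORY


if __name__ == "__main__":
    app1 = Endpoint("dc-east", "server-01", processor=0, core=3)
    app2 = Endpoint("dc-east", "server-01", processor=0, core=7)
    print(select_transport(app1, app2).value)   # shared memory
    # After app2 live-migrates to another server, the choice changes:
    app2_moved = Endpoint("dc-east", "server-02", processor=1, core=0)
    print(select_transport(app1, app2_moved).value)  # Ethernet or InfiniBand
```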

3. The many-core servers with a new class of flash memory and high-bandwidth networks offer a new architecture for service creation, delivery and assurance going far beyond the current infrastructure-centric service management systems that have evolved from single-CPU and low-bandwidth origins. Figure 3 shows a potential architecture where many-core servers are connected with high-bandwidth networks that obviate the need for the current complex web of infrastructure technologies and their management systems. The many-core servers, each with huge solid-state drives, SAS-attached inexpensive disks, and optical switching interfaces connected to WAN routers, offer a new class of services architecture if only the current software shortfall is plugged to match the hardware advances in server, network and storage devices.

Figure 3: If Server is the Cloud, What is the Service Delivery Network?

This would eliminate much of the current complexity, which mainly involves dealing with TCP/IP across east-west traffic and with infrastructure-based service delivery and management systems to assure availability, reliability, performance, cost and security. For example, current security mechanisms that have evolved from TCP/IP communications do not make sense for east-west traffic, while emerging container-based architectures with layer 7 switching and routing, independent of server and network security, offer new efficiencies and security compliance.

The current evolution of commodity clouds and distributed virtual datacenters, while providing on-demand resource provisioning, auto-failover, auto-scaling and live-migration of Virtual Machines, is still tied to the IP address and the associated complexity of dealing with infrastructure management in distributed environments to assure end-to-end service transaction quality of service (QoS).

Figure 4: The QoS Gap

This introduces either vendor lock-in, which precludes the advantages of commodity hardware, or complexity in dealing with a multitude of distributed infrastructures and their management to tune the service transaction QoS. Figure 4 shows the current state of the art. One can quibble whether it includes every product available or whether each is depicted correctly to represent its functionality, but the general picture describes the complexity and/or vendor lock-in dilemma. The important point to recognize is that the service transaction QoS depends on tuning the SLAs of distributed resources at run-time across multiple infrastructure owners with disparate management systems and incentives. The QoS tuning of service transactions is not scalable without increasing cost and complexity if it depends on tuning the distributed infrastructure with a multitude of point solutions and myriad infrastructure management systems.

What the Enterprise IT Wants:

There are three business drivers that are at the heart of the Enterprise Quest for an IT framework:

  • Compression of Time-to-Market: Proliferation of mobile applications, social networking, and web-based communication, collaboration and commerce are increasing the pressure on enterprise IT to support a rapid service development, deployment and management processes. Consumer facing services are demanding quick response to rapidly changing workloads and the large-scale computing, network and storage infrastructure supporting service delivery requires rapid reconfiguration to meet the fluctuations in workloads and infrastructure reliability, availability, performance and security.
  • Compression of Time-to-Fix: With consumers demanding “always-on” services supporting choice, mobility and collaboration, the availability, performance and security of end to end service transaction is at a premium and IT is under great pressure to respond by compressing the time to fix the “service” regardless of which infrastructure is at fault. In essence, the business is demanding the deployment of reliable services on not so reliable distributed infrastructure.
  • Cost reduction of IT operation and management, which currently consumes about 60% to 70% of the IT budget just to keep the “service lights” on: The current service administration and management paradigm, which originated with server-centric and low-bandwidth network architectures, is resource-centric and assumes that the resources (CPU, memory, network bandwidth, latency, storage capacity, throughput and IOPs) allocated to an application at install time can be changed to meet rapidly changing workloads and business priorities in real-time. The current state of the art uses virtual servers, networks and storage that can be dynamically provisioned using software APIs. Thus the application and service (a group of applications providing a service transaction) QoS (quality of service defining the availability, performance, security and cost) can be tuned by dynamically reconfiguring the infrastructure. There are three major issues with this approach:

With a heterogeneous, distributed and multi-vendor infrastructure, tuning the infrastructure requires myriad point solutions, tools and integration packages to monitor current utilization of the resources by the service components, correlate and reason to define the actions required and coordinate many distributed infrastructure management systems to reconfigure the resources.

In order to provide high availability and disaster recovery (HA/DR), recent attempts to move Virtual Machines (VMs) introduce additional issues with IP mobility, firewall reconfiguration, VM sprawl and the associated runaway VM images, bandwidth and storage management.

The introduction of public clouds and the availability of software as a service have worked well for new application development, non-mission-critical applications, and applications that can be re-architected to leverage available application/service components and optimize for the cloud APIs. However, they also add cost for IT when migrating the many existing mission-critical applications that demand high security, high performance and low latency. The suggested hybrid solutions require either adopting the new cloud architecture in the datacenter or using myriad orchestration packages that add complexity and tool fatigue.

In order to address the need to compress time to market and time to fix and to reduce the complexity, enterprises small and big are desperately looking for solutions.

The lines of business owners want:

  • End-to-end visibility and control of service QoS independent of who provides the infrastructure
  • Availability, performance and security governance based on policies
  • Accounting of resource utilization and dynamic resolution of contention for resources
  • Application architecture decoupled from infrastructure while still enabling continuous availability (or decouple functional requirements execution from non-functional requirement compliance)

IT wants to provide the application developers:

  • Application architecture decoupled from infrastructure by separating functional and non-functional requirements so that the application developers focus on business functions while deployment and operations are adjusted at run-time based on business priorities, latency constraints and workload fluctuations
  • Provide cloud-like services (on-demand provisioning of applications, self-repair, auto-scaling, live-migration and end-to-end security) at the service level instead of at the infrastructure level so that they can leverage their own datacenter resources or the commodity resources abundant in public clouds without depending on cloud architectures, vendor APIs and cloud management systems.
  • Provide a suite of applications as a service (databases, queues, web servers etc.)
  • Service composition schemes that allow developers to reuse components
  • Instrumentation to monitor service component QoS parameters (independent from infrastructure QoS parameters) to implement policy compliance
  • When problems occur, provide component run-time QoS history to developers

IT wants to have the:

  • Ability to use local infrastructure or on demand cloud or managed out-sourced infrastructure
  • Ability to use secure cloud resources without cloud management API or cloud architecture dependence
  • Ability to provide end to end service level security independent of server and network security deployed to manage distributed resources
  • Ability to provide end-to-end service QoS visibility and control (on-demand service provisioning, auto-failover, auto-scaling, live migration and end-to-end security) across distributed physical or virtual servers in private or public infrastructure
  • Ability to reduce complexity and eliminate point solutions and myriad tools to manage distributed private and public infrastructure

Application Developers want:

  • To focus on developing service components, test them in their own environments and publish them in a service catalogue for reuse
  • Ability to compose services, test and deploy them in their own environments and publish them in the service catalogue ready to deploy anywhere
  • Ability to specify the intent, context, constraints, communication, and control aspects of the service at run-time for managing non-functional requirements (a hypothetical sketch of such a specification follows this list)
  • An infrastructure that uses that specification to manage run-time QoS with on-demand service provisioning on appropriate infrastructure (a physical or virtual server with appropriate service level assurance, SLA), and that manages run-time policies for fail-over, auto-scaling, live-migration and end-to-end security to meet run-time changes in business priorities, workloads and latency constraints.
  • Separation of run-time safety and survival of the service from sectionalizing, isolating, diagnosing and fixing at leisure
  • Get run-time history of service component behavior and ability to conduct correlated analysis to identify problems when they occur.
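
A minimal, hypothetical sketch of the kind of run-time specification referred to above (the intent, context, constraints, communication and control of a service component) might look like the following. The field names, thresholds and policies are illustrative assumptions, not an existing schema or product API.

```python
# A hypothetical service blueprint: functional intent plus the non-functional
# requirements (QoS constraints, communication peers, control policies) that a
# run-time management layer could use. All names are illustrative.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QoSConstraints:
    min_cpu_cores: int = 2
    min_memory_gb: int = 4
    max_latency_ms: float = 20.0
    min_storage_iops: int = 5000
    availability_pct: float = 99.95


@dataclass
class ServiceBlueprint:
    name: str
    intent: str                       # what the component is for
    context: Dict[str, str]           # e.g. business priority, region affinity
    constraints: QoSConstraints       # non-functional requirements
    communicates_with: List[str]      # peer components in the service transaction
    control_policies: Dict[str, str]  # e.g. scaling and fail-over policies


web_tier = ServiceBlueprint(
    name="web-tier",
    intent="serve customer-facing HTTP requests",
    context={"priority": "high", "region_affinity": "us-east"},
    constraints=QoSConstraints(max_latency_ms=10.0),
    communicates_with=["app-tier", "session-cache"],
    control_policies={"on_overload": "scale-out", "on_host_failure": "live-migrate"},
)
```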

We need to discover a path to bridge the current IT to the new IT without changing the applications, the OSs or the current infrastructure, while providing a way to migrate to a new IT where service transaction QoS management is truly decoupled from myriad distributed infrastructure management systems. This is not going to happen with current ad-hoc programming approaches. We need a new, or at least an improved, theory of computing.

As Cockshott et al. (2012) point out, current computing, management and programming models fall short when you try to include computers and the computed in the same model:

“the key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.”

There are emerging technologies that might just provide the synthesis (the reconciliation depends on the way that the architectural departures or violations of expectations are integrated into an emergent structure that resolves deviation into a new regularity, chaos into a new order, surprise into a new pattern as the transformation progresses) required to build the harmony by infusing cognition into computing. Only the future will tell whether this architecture is expressive and efficient enough, as Mark Burgin claims in his elegant book “Super-Recursive Algorithms” quoted above.

Is Information Technology poised for a renaissance (a synthesis) that takes us beyond the current distributed-cloud-management antithesis, now that the great masters (Turing, von Neumann, Shannon, etc.) have developed the original thesis?

The IEEE WETICE2015 International conference track on “the Convergence of Distributed Clouds, GRIDs and their Management” to be held in Cyprus next June (15 – 18) will address some of these emerging trends and attempt to bridge the old and the new.

24th IEEE International Conference on Enabling Technologies: Infrastructures for Collaborative Enterprises (WETICE2015) in Larnaca, Cyprus

 

For the state of the emerging science, please see “Infusing Self-Awareness into Turing Machine – A Path to Cognitive Distributed Computing”, presented at WETICE2014 in Parma, Italy.

 

References:

Holbrook, Morris B. (2003). “Adventures in Complexity: An Essay on Dynamic Open Complex Adaptive Systems, Butterfly Effects, Self-Organizing Order, Coevolution, the Ecological Perspective, Fitness Landscapes, Market Spaces, Emergent Beauty at the Edge of Chaos, and All That Jazz.” Academy of Marketing Science Review [Online] 2003 (6). Available: http://www.amsreview.org/articles/holbrook06-2003.pdf

Cockshott, P., MacKenzie, L. M., and Michaelson, G. (2012). Computation and its Limits, Oxford University Press, Oxford.

Is Distributed IT Infrastructure Brokering (With or Without Virtual Machines) a Network Service that will be offered by a Network Carrier in the Future?
March 22, 2014

Trouble in IT Paradise with Darkening Clouds:

If you ask an enterprise CIO over a couple of drinks what his/her biggest hurdle is today in delivering to the business the right resources at the right time at the right price, the answer would be that “the IT is too darn complex.” Over a long period of time, the infrastructure vendors have hijacked Information Technologies with their complex silos, and expediency has given way to myriad tools and point solutions that overlay a management web. In addition, Venture Capitalists looking for quick “insertion points” with no overarching architectural framework have proliferated tools and appliances that have contributed to the current complexity and tool fatigue.

After a couple more drinks, if you press the CIO on why his/her mission-critical applications are not migrating to the cloud, which claims lesser complexity, the CIO laments that there is no cloud provider willing to sign a warranty that assures the service levels for mission-critical applications and guarantees application availability, performance and security. “Every cloud provider talks about infrastructure service levels but is not willing to step up to assure application availability, performance and security. There are myriad off-the-main-street providers that claim to offer orchestration to provide the service levels, but no one yet is signing on the dotted line.” The situation is more complicated when the resources span multiple infrastructure providers.

The decoupling of the strong binding between the management of applications and the infrastructure management is key for the CIO as more applications are developed with shorter time to market. The CIO's top five priorities are transnational applications demanding distributed resources, security, cost, compliance and uptime. A Gartner report claims that CIOs spend 74% of the IT budget on keeping the application “lights on” and another 18% on “changing the bulbs” and other maintenance activities. (It is interesting to recall that before Strowger's switch eliminated many operators sitting in long rows plugging countless plugs into countless jacks, the cost of adding and managing new subscribers was rising in geometric proportion. According to the Bell System chronicles, one large-city general manager of a telephone company at that time wrote that he could see the day coming soon when he would go broke merely by adding a few more subscribers, because the cost of adding and managing a subscriber was far greater than the corresponding revenue generated. The only difference between today's IT datacenter and the central office before Strowger's switch is that “very expensive consultants, countless hardware appliances, and countless software systems that manage them” replace “many operators, countless plugs and countless jacks”.)

In order to utilize commodity infrastructure while maintaining high security, and mobility for performance and availability, CIOs are looking for solutions that let them focus on application quality of service (QoS); they are willing to outsource infrastructure management to providers who can assure application mobility, availability and security, provided end-to-end service visibility and control remain at their disposal.

While the public clouds seem to offer a way to leverage commodity infrastructure with on-demand Virtual Machine provisioning, there are four hurdles that are preventing CIOs from embracing the clouds for mission-critical applications:

  1. Current mission-critical and even non-mission-critical applications and services (groups of applications) are accustomed to highly secure, low-latency infrastructures that have been hardened and managed, and CIOs are loath to spend more money to reproduce the same level of SLAs in public clouds.
  2. The dependence on particular service providers' infrastructure APIs, the infrastructure dependencies of Virtual Machine image management (nested or not), and the added cost and complexity of self-healing, auto-scaling and live-migration services create service provider lock-in on their infrastructure and their management services. This defeats the intent to leverage the commodity infrastructure offered by different service providers.
  3. The increasing scope creep from infrastructure providers “up the stack” to provide application awareness and insert their APIs into application development, in the name of satisfying non-functional requirements (availability, security, performance optimization) at run-time, has started to increase the complexity and cost of application and service development. The resulting proliferation of tools and point solutions without a global architectural framework to use resources from multiple service providers has increased integration and troubleshooting costs.
  4. Global communications, collaboration and commerce at the speed of light have increased the scale of computing, and distributed computing resource management has fallen short in meeting that scale and the fluctuations caused both by demand and by changes in resource availability, performance and security.

The Inadequacy of Ad-hoc Programming to Solve Distributed Computing Complexity:

Unfortunately, the complexity is more a structural issue than an operational or infrastructure technology issue, and it cannot be resolved with ad-hoc programming techniques to manage the resources. Cockshott et al. conclude their book “Computation and its Limits” with the paragraph “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” While the success of IT in modeling and executing business processes has evolved into the current distributed datacenters and cloud computing infrastructures that provide on-demand computing resources to model and execute business processes, the structure and fluctuations that dictate the evolution of computation have introduced complexity in dealing with real-time changes in the interaction of the infrastructure and the computations they perform. The complexity manifests itself in the following ways:

  1. In a distributed computing environment, ensuring that the right computing resources (CPU, memory, network bandwidth, latency, storage capacity, throughput and IOPs) are available to the right software component contributing to the service transaction requires orchestration and management of myriad computing infrastructures, often owned by different providers with different profit motives and incentives. The resulting complexity in resource management to assure availability, performance and security of service transactions adds to the cost of computing. For example, it is estimated that up to 70% of the current IT budget is consumed in assuring service availability, performance and security. The complexity is compounded in distributed computing environments that are supported by heterogeneous infrastructures with disparate management systems.
  2. In a large-scale dynamic distributed computation supported by myriad infrastructure components, the increased component failure probabilities introduce a non-determinism (for example, Google observes emergent behavior in its scheduling of distributed computing resources when dealing with large numbers of resources) that must be addressed by a service control architecture that decouples the functional and non-functional aspects of computing.
  3. Fluctuations in the computing resource requirements dictated by changing business priorities, workload variations that depend on service consumption profiles and real-time latency constraints dictated by the affinity of service components, all demand a run-time response to dynamically adjust the computing resources. Current dependence on myriad orchestrators and management systems cannot scale in a distributed infrastructure without either a vendor lock-in on infrastructure access methods or a universal standard that often stifles innovation and competition to meet fast changing business needs.

Thus the function, structure and fluctuations involved in the dynamic processes delivering service transactions are driving the need to search for new computation, management and programming models that address the unification of the computer and the computed and decouple service management from infrastructure management at run-time.

It is the Architecture, Stupid:

A business process is defined both by functional requirements that dictate the business domain functions and logic and by non-functional requirements that define operational constraints related to service availability, reliability, performance, security and cost, dictated by business priorities, workload fluctuations and resource latency constraints. A non-functional requirement specifies criteria that can be used to judge the operation of a system, rather than specific behaviors. The plan for implementing functional requirements is detailed in the system design. The plan for implementing non-functional requirements is detailed in the system architecture. While much progress has been made in system design and development, the architecture of distributed systems falls short in addressing the non-functional requirements for two reasons:

  1. The current distributed systems architecture, from its server-centric and low-bandwidth origins, has created layers of resource-management-centric ad-hoc software to address the various uncertainties that arise in a distributed environment. Lack of support for concurrency, synchronization, parallelism and mobility of applications in the current serial von Neumann stored program control has given rise to ad-hoc software layers that monitor and manage distributed resources. While this approach may have been adequate when distributed resources were owned by a single provider and controlled by a framework that provided architectural support for implementing non-functional requirements, the proliferation of commodity distributed resource clouds offered by different service providers with different management infrastructures adds scaling and complexity issues. Current OpenStack and AWS API discussions are a clear example: one is forced to choose one or the other, or accept the increased complexity of using both.
  2. The resource-centric view of IT currently demotes application and service management to second-class citizenship, where the QoS of an application/service is monitored and managed by myriad resource management systems overlaid with multiple correlation and analysis layers that manipulate the distributed resources to adjust the CPU, memory, bandwidth, latency, storage IOPs, throughput and capacity, which is all that is required to keep the application/service meeting its quality of service. Obviously, this approach cannot scale unless a single set of standards evolves or a single-vendor lock-in occurs.

Unless an architectural framework evolves to decouple application/service management from myriad infrastructure management systems owned and operated by different service providers with different profit motives, the complexity and cost of management will only increase.

A Not So Cool Metaphor to Deliver Very Cool Services Anywhere, Anytime and On-demand:

A lesson on an architectural framework that addresses non-functional requirements while connecting billions of users anywhere, anytime, on demand is found in the Plain Old Telephone System (POTS). From the beginnings of AT&T to today's remaking of at&t, much has changed, but two things remain constant: universal service (on a global scale) and the telecom-grade “trust” that is taken for granted. Very recently, Mark Zuckerberg proclaimed at the largest mobile technology conference in Barcelona that his very cool service Facebook wants to be the dial tone for the Internet. Originally, the dial tone was introduced to assure the telephone user that the exchange was functioning when the telephone was taken off-hook, by breaking the silence (before an operator responded) with an audible tone. Later on, the automated exchanges provided a benchmark for telecom-grade trust that assures managed resources on demand with high availability, performance and security. Today, as soon as the user goes off-hook, the network recognizes the profile based on the dialing telephone number. As soon as the destination number is dialed, the network recognizes the destination profile and provisions all the network resources required to make the desired connection, commence billing, and monitor and assure the connection until one of the parties initiates a disconnect. During the call, if the connection experiences any changes that impact the non-functional requirements, the network intelligence takes appropriate action based on policies. The resulting resiliency (availability, performance, and security), efficiency and ability to scale to connect billions of users on demand have come to be known as “telecom-grade trust”. An architectural flaw in the original service design (exploited by Steve Jobs by building a blue box) was fixed by introducing an architectural change to separate the data path and the control path. The resulting 800-service call model provided a new class of services such as call forwarding, call waiting and conference calling.

The Internet, on the other hand, evolved to connect billions of computers anywhere, anytime, from the prophetic statement made by J. C. R. Licklider: “A network of such (computers), connected to one another by wide-band communication lines [which provided] the functions of present-day libraries together with anticipated advances in information storage and retrieval and [other] symbiotic functions.” The convergence of voice over IP, data and video networks has given rise to a new generation of services enabling communication, collaboration and commerce at the speed of light. The result is that the datacenter has replaced the central office as the hub from which myriad voice, video and data services are created and delivered on a global scale. However, the management of these services, which determines their resiliency, efficiency and scaling, is another matter. In order to provide on-demand services, anywhere, anytime, with prescribed quality of service in an environment of wildly fluctuating workloads, changing business priorities and latency constraints dictated by the proximity of service consumers and suppliers, resources have to be managed in real-time across distributed pools to match the service QoS to resource SLAs. The telephone network is designed to share resources on a global scale and to connect them as required in real-time to meet the non-functional service requirements, while current datacenters (whether privately owned or publicly provided as cloud services) are not. There are three structural deficiencies in the current distributed datacenter architecture that prevent it from matching telecom-grade resiliency, efficiency and scaling:

  1. The data path and service control path are not decoupled, giving rise to the same problems that Steve Jobs exploited, which forced a re-architecting of the network.
  2. Service management is strongly coupled with the resource management systems and does not scale as the resources become distributed and multiple service providers provide those resources with different profit motives and incentives. Since the resources are becoming a commodity, every service provider wants to go up the stack to provide lock-in.
  3. The current trend to infuse resource management APIs into service logic to provide resource management at run-time, and application-aware architectures that want to establish intimacy with applications, only increase complexity and make service composition with reusable service components all the more difficult because of their increased lock-in with resource management systems.

Resource-management-based datacenter operations miss an important feature of service/application management: all services are not created equal. They have different latency and throughput requirements. They have different business priorities and different workload characteristics and fluctuations. What works for the goose does not work for the gander. In addition to the current complexity and cost of resource management to assure service availability, reliability, performance and security, there is an even more fundamental issue that plagues the current distributed systems architecture. A distributed transaction that spans multiple servers, networks and storage devices in multiple geographies uses resources that span multiple datacenters. The fault, configuration, accounting, performance and security (FCAPS) management of a distributed transaction requires end-to-end connection management, much like a telecommunication service spanning distributed resources. Therefore, focusing only on resource management in a datacenter, without visibility and control of all the resources participating in the transaction, will not assure service availability, reliability, performance and security at run-time.

New Dial Tones for Application/Service Development, Deployment and Operation:

Current web-scale applications are distributed transactions that span multiple resources widely scattered across multiple locations owned and managed by different providers. In addition, the transactions are transient, making connections with various components to fulfill an intent and closing them, only to reconnect when they are needed again. This is very much in contrast to the always-on distributed computing paradigm of yesterday.

In creating, deploying and operating these services, there are three key stakeholders and associated processes:

  1. Resource providers deliver the vital resources required to create, deploy and operate these services on demand, anywhere, anytime (the resource dial tone). The vital resources are just the CPU, memory, network latency, bandwidth and storage capacity, throughput and IOPs required to execute the application or service that has been compiled to “1”s and “0”s (the Turing Machine). The resource consumers could not care less how these are provided as long as the service levels the resource providers agree to are maintained when the application or service requests the resources at provisioning time (matching the QoS request with the SLA and maintaining it during the application/service life-time). The resource dial tone that assures the QoS with the resource SLA is offered to two different types of consumers. First, the application developers who use these resources to develop service components and compose them to create more complex services with their own QoS requirements. Second, the service operators who use the SLAs to manage QoS at run-time in delivering the services to end users.
  2. The application developers want to use their own tools and best practices without any constraints from resource providers, and the run-time vital signs required to execute their services should be transparent to where or by whom the vital resources are provided. The resources must support the QoS specified by the developer or service composer depending on the context, communication, control and constraint needs. The developers do not care how they get the CPU, memory, bandwidth, storage capacity, throughput or IOPs, or how the latency constraints are met. This model is a major departure from the current SDN route of giving applications control of resources, which is not a scalable solution and does not allow the decoupling of resource management from service management.
  3. The service operators provide run-time QoS assurance by brokering the QoS demands to match the best available resource pool that meets the cost and quality constraints (the management dial tone that assures non-functional requirements). The brokering function is a network service, à la service switching, that matches the applications/services to the right resources.

The brokering service must then provide the non-functional requirements management at run-time just as in POTS.
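
A minimal sketch of such a brokering function is shown below (an assumption for illustration, not the author's implementation): a service component's QoS demand is matched against the SLAs offered by competing resource pools, and the cheapest pool that satisfies the demand wins. The class and field names are hypothetical.

```python
# Hypothetical broker: match a QoS demand against resource-pool SLAs and pick
# the cheapest pool that satisfies it. All names are illustrative.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QoSDemand:
    cpu_cores: int
    memory_gb: int
    max_latency_ms: float
    storage_iops: int


@dataclass
class ResourcePoolSLA:
    provider: str
    cpu_cores: int
    memory_gb: int
    latency_ms: float
    storage_iops: int
    cost_per_hour: float

    def satisfies(self, demand: QoSDemand) -> bool:
        return (self.cpu_cores >= demand.cpu_cores
                and self.memory_gb >= demand.memory_gb
                and self.latency_ms <= demand.max_latency_ms
                and self.storage_iops >= demand.storage_iops)


def broker(demand: QoSDemand, pools: List[ResourcePoolSLA]) -> Optional[ResourcePoolSLA]:
    """Return the cheapest pool whose SLA meets the QoS demand, if any."""
    candidates = [p for p in pools if p.satisfies(demand)]
    return min(candidates, key=lambda p: p.cost_per_hour) if candidates else None


if __name__ == "__main__":
    demand = QoSDemand(cpu_cores=8, memory_gb=32, max_latency_ms=5.0, storage_iops=20000)
    pools = [
        ResourcePoolSLA("private-dc", 16, 64, 2.0, 50000, cost_per_hour=3.50),
        ResourcePoolSLA("public-cloud-a", 8, 32, 4.0, 25000, cost_per_hour=2.10),
        ResourcePoolSLA("public-cloud-b", 8, 16, 8.0, 10000, cost_per_hour=1.20),
    ]
    chosen = broker(demand, pools)
    print(chosen.provider if chosen else "no pool meets the demand")  # public-cloud-a
```

In this sketch the broker would re-run the match whenever business priorities, workloads or latency constraints change, which is the run-time non-functional management the text attributes to the management dial tone.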

The New Service Operations Center (SOC) with End-to-end Service Visibility and Control Independent of Distributed Infrastructure Management Centers Owned by Different Infrastructure Providers:

The new Telco model that the broker facilitates allows the enterprises and other infrastructure users to focus on services architecture and management and use infrastructure as a commodity from different infrastructure providers just as Telcos provide shared resources with network services.

Figure 1: The Telco Grade Services Architecture that decouples end-to-end service transaction management from infrastructure management systems at run-time

The service broker matches the QoS of the service and its service components with the service levels offered by different infrastructure providers, based on the service blueprint which defines the context, constraints, communications and control abstraction of the service at hand. The service components are provided with the desired CPU, memory, bandwidth, latency, storage IOPs, throughput and capacity. The decoupling of service management from distributed infrastructure management systems puts the safety and survival of services first and allows sectionalizing, isolating, diagnosing and fixing the infrastructure at leisure, as is the case today with POTS.

It is important to note that the service dial tone Zuckerberg is talking about is not related to the resource dial tone or the management dial tone required for providing service connections and management at run-time. He is talking about the application end user receiving the content. Facebook application developers do not care how the computing resources are provided as long as their service QoS is maintained to meet the business priorities, workloads and latency constraints to deliver their service on a global scale. The Facebook CIO would rather spend time maintaining the service QoS by getting the resources wherever they are available to meet the service needs at reasonable cost. In fact, most CIOs would gladly shed the infrastructure management burden if they had QoS assurance and end-to-end service visibility and control (they could not care less about access to resources or their management systems) to manage the non-functional requirements at run-time. After all, Facebook's Open Compute project is a side effect trying to fill a gap left by infrastructure providers, not their main line of business. The crash that resulted after Zuckerberg's announcement of the WhatsApp acquisition was not the “cool” application's fault. They probably could have used a service broker/switch providing the old-fashioned resource dial tone so that they could provide the service dial tone to their users.

This is similar to a telephone company assuring appropriate resources to connect different users based on their profiles or the Internet connecting devices based on their QoS needs at run-time. The broker acts as service switch that connects various service components at run-time and matches the QoS demands with appropriate resources.

With the right technology, the service broker/switch may yet provide the required service level warranties to the enterprise CEOs from well-established carriers with money and muscle.

Will at&t and other Telcos have the last laugh by incorporating this brokering service switch in the network and make current distributed datacenters (cloud or otherwise with physical or virtual infrastructure) a true commodity?

A Path Toward Intelligent Services using Dumb Infrastructure on Stupid, Fat Networks?
November 3, 2013

“The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.”

—  P. Cockshott, L. M. MacKenzie and  G. Michaelson, Computation and its Limits, Oxford University Press, Oxford 2012, p 215

“The test of a framework, however, is not what one can say about it in principle but what one can do with it”

—  Andrew Wells, Rethinking Cognitive Computation: Turing and the Science of the Mind, Palgrave Macmillan, 2006.

Summary

The “Convergence of Clouds, Grids and their Management” conference track is devoted to discussing current and emerging trends in virtualization, cloud computing, high-performance computing, Grid computing and cognitive computing. The tradition that started in WETICE2009 “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short term profit driven motives of a particular corporate entity” has resulted in a new computing model that was included in the Turing Centenary Conference proceedings in 2012. More recently, a product based on these ideas was discussed in the 2013 Open Server Summit (www.serverdesignsummit.com), where many new ideas and technologies were presented to exploit the new generation of many-core servers, high-bandwidth networks and high-performance storage. We present here some thoughts on current trends which we hope will stimulate further research to be discussed in the WETICE 2014 conference track in Parma, Italy (http://wetice.org).

Introduction

Current IT datacenters have evolved from their server-centric, low-bandwidth origins to distributed and high-bandwidth environments where resources can be dynamically allocated to applications using computing, network and storage resource virtualization. While Virtual machines improve resiliency and provide live migration to reduce the recovery time objectives in case of service failures, the increased complexity of hypervisors, their orchestration, Virtual Machine images and their movement and management adds an additional burden in the datacenter.

Further automation trends continue to move toward static applications (locked-in-a-virtual machine, often as one application in one virtual machine) in a dynamic infrastructure (virtual servers, virtual networks, virtual storage, Virtual Image managers etc.). The safety and survival of applications and end to end service transactions delivered by a group of applications are managed by dynamically monitoring and controlling the resources at run-time in real-time. As services migrate to distributed environments where applications contributing to a service transaction are deployed in different datacenters and public or private clouds often owned by different providers, resource management across distributed resources is provided using myriad point solutions and tools that monitor, orchestrate and control these resources. A new call for application-centric infrastructure proposes that the infrastructure provide (http://blogs.cisco.com/news/application-centric-infrastructure-a-new-era-in-the-data-center/ ):

  • Application Velocity (Any workload, anywhere): Reducing application deployment time through a fully automated and programmatic infrastructure for provisioning and placement. Customers will be able to define the infrastructure requirements of the application, and then have those requirements applied automatically throughout the infrastructure.
  • A common platform for managing physical, virtual and cloud infrastructure: The complete integration across physical and virtual, normalizing endpoint access while delivering the flexibility of software and the performance, scale and visibility of hardware across multi-vendor, virtualized, bare metal, distributed scale out and cloud applications
  • Systems Architecture: A holistic approach with the integration of infrastructure, services and security along with the ability to deliver simplification of the infrastructure, integration of existing and future services with real time telemetry system wide.
  • Common Policy, Management and Operations for Network, Security, Applications: A common policy management framework and operational model driving automation across Network, Security and Application IT teams that is extensible to compute and storage in the future.
  • Open APIs, Open Source and Multivendor: A broad ecosystem of partners who will be empowered by a comprehensive published set of APIs and innovations contributed to open source.
  • The best of Custom and Merchant Silicon: To provide highly scalable, programmatic performance, low-power platforms and optics innovations that protect investments in existing cabling plants, and optimize capital and operational expenditures.

Perhaps this approach will work in a utopian IT landscape where either the infrastructure is provided by a single vendor or universal standards force all infrastructures to support a common API. Unfortunately, the real world evolves in a diverse, heterogeneous and competitive environment, and what we are left with is a strategy that cannot scale and lacks end-to-end service visibility and control. End-to-end security becomes difficult to assure because of the myriad security management systems that control distributed resources. The result is open source systems that attempt to fill this niche. Unfortunately, in a highly networked world where multiple infrastructure providers offer a plethora of diverse technologies that evolve at a rapid rate to absorb high-paced innovations, orchestrating the infrastructure to meet the changing workload requirements that applications must deliver is a losing battle. The complexity and tool fatigue resulting from layers of virtualization and the orchestration of orchestrators is crippling the operation and management of datacenters (virtualized or not), with 70% of current IT budgets going toward keeping the lights on. An explosion of tools, special-purpose appliances (for disaster recovery, IP security, performance optimization etc.) and administrative controls has escalated operation and management costs. A Gartner report estimates that for every $1 spent on development of an application, another $1.31 is spent on assuring safety and survival. While all vendors agree upon open source, open APIs and multi-vendor support, reality is far from it. An example is the recent debate about whether OpenStack should include Amazon AWS API support while the leading cloud provider conveniently ignores the competing API.

The Strategy of Dynamic Virtual Infrastructure

The following picture, presented at the Open Server Summit, shows a vision of a future datacenter with a virtual switch network overlaid on the physical network.

Figure 1: Network Virtualization: What It Is and Why It Matters – Presented at the Open Server Summit 2013, Bruce Davie, Principal Engineer, VMware

In addition to the physical network connecting physical servers, there is an overlay virtual network inside each physical server to connect the virtual machines it hosts. In addition, a plethora of virtual machines is being introduced to replace the physical routers and switches that control the physical network. The quest to dynamically reconfigure the network at run-time to meet changing application workloads, business priorities and latency constraints has introduced layers of additional network infrastructure, albeit software-defined. While applications are locked in a virtual server, the infrastructure is evolving to dynamically reconfigure itself to meet changing application needs. Unfortunately this strategy cannot scale in a distributed environment where different infrastructure providers deploy myriad heterogeneous technologies and management strategies; it results in orchestrators of orchestrators, contributing to complexity and tool fatigue in both datacenters and cloud environments (private or public).

Figure 2 shows a new storage management architecture also presented in the Open Server Summit.

Figure 2: PCI Express Supersedes SAS and SATA in Storage – Presented at the Open Server Summit 2013, Akber Kazmi, Senior Marketing Director, PLX Technology

The PCIe switch allows a converged physical storage fabric at half the cost and half the power of the current infrastructure. In order to leverage these benefits, the management infrastructure has to accommodate it, which adds to the complexity.

In addition, it is estimated that the data traffic inside the datacenter is about 1000 times the data that is sent to and received from the users outside. This completely changes the role of TCP/IP traffic inside the datacenter and consequently the communication architecture between applications inside the datacenter. It no longer makes sense for virtual machines running inside a many-core server to use TCP/IP as long as they are within the datacenter. In fact, it makes more sense for them to communicate via shared memory when they are executed on different cores within a processor, via a high-speed bus when they are executed on different processors in the same server, and via a high-speed network when they are executed in different servers in the same datacenter. TCP/IP is only needed when communicating with users outside the datacenter who can only be accessed via the Internet.

Figure 3 shows the server evolution.

Figure 3: Servers for the New Style of IT – Presented at the Open Server Summit 2013, Dwight Barron, HP Fellow and Chief Technologist, Hyper-scale Server Business Segment, HP Servers Global Business Unit, Hewlett-Packard

As the following picture shows, the current evolution of the datacenter is designed to provide dynamic control of resources to address workload fluctuations at run-time, changing business priorities and real-time latency constraints. The applications are static in a virtual or physical server, and the software-defined infrastructure dynamically adjusts to changing application needs.

Figure 4: Macro Trends, Complexity, and SDN – Presented at the Open Server Summit 2013, David Meyer, CTO/Chief Architect, Brocade

Cognitive Containers & Self-Managing Intelligent Services on Static Infrastructure

With the advent of many-core servers, high-bandwidth technologies connecting these servers, and a new class of high-performance storage devices that can be optimized to meet workload needs (IOPs intensive, throughput sensitive or capacity hungry), is it time to look at a static infrastructure with dynamic application/service management to reduce IT complexity in both datacenters and clouds (public or private)? This is possible if we can virtualize the applications inside a server (physical or virtual) and decouple the safety and survival of the applications, and of the groups of applications that contribute to a distributed transaction, from the myriad resource management systems that provision and control the plethora of distributed resources supporting these applications.

The Cognitive Container discussed in the Open Server Summit (http://lnkd.in/b7-rfuK) presents the decoupling required between application and service management and underlying distributed resource management systems. Cognitive Container is specially designed to decouple the management of an application and service transactions that a group of distributed applications execute from the infrastructure management systems, at run-time, controlling their resources that are often owned or operated by different providers. The safety and survival of the application at run-time is put ahead by infusing the knowledge about the application (such as the intent, non-functional attributes, run-time constraints, connections and communication behaviors) into the container and using this information to monitor and manage the application at run-time. The Cognitive Container is instantiated and managed by a Distributed Cognitive Transaction Platform (DCTP) that sits between the applications and the OS facilitating the run-time management of Cognitive Containers. The DCTP does not require any changes to the application, OS or the infrastructure and uses the local OS in a physical or virtual server. A network of Cognitive Containers infused with similar knowledge about the service transaction they execute also is managed at run-time to assure the safety and survival based on policies dictated by business priorities, run-time workload fluctuations and real-time latency constraints. The Cognitive Container network using replication, repair, recombination and reconfiguration properties provide dynamic service management independent of infrastructure management systems at run-time. The Cognitive Containers are designed to use the local operating system to monitor the application vital signs (CPU, memory, bandwidth, latency, storage capacity, IOPs and throughput) and run-time behavior to manage the application to conform to the policies.
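
The text above does not give an implementation, but the run-time loop it attributes to a Cognitive Container can be sketched roughly as follows: compare the application's vital signs (read through the local OS) against the policies infused into the container and decide whether to stay put, scale, or migrate. The class names, thresholds and actions are illustrative assumptions, not the DCTP itself.

```python
# A minimal, hypothetical sketch of a Cognitive Container's policy check.
# All names, thresholds and actions are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class VitalSigns:
    cpu_pct: float
    memory_pct: float
    latency_ms: float
    storage_iops: int


@dataclass
class Policy:
    max_cpu_pct: float = 80.0
    max_memory_pct: float = 85.0
    max_latency_ms: float = 20.0
    min_storage_iops: int = 5000


class CognitiveContainer:
    def __init__(self, app_name: str, intent: str, policy: Policy):
        self.app_name = app_name
        self.intent = intent      # knowledge about the application's purpose
        self.policy = policy      # non-functional requirements expressed as policy

    def evaluate(self, vitals: VitalSigns) -> str:
        """Return the management action implied by the current vital signs."""
        if vitals.latency_ms > self.policy.max_latency_ms:
            return "live-migrate"       # move closer to consumers or to a faster host
        if (vitals.cpu_pct > self.policy.max_cpu_pct
                or vitals.memory_pct > self.policy.max_memory_pct):
            return "scale-out"          # replicate the component on another host
        if vitals.storage_iops < self.policy.min_storage_iops:
            return "relocate-storage"   # find a server with adequate IOPs
        return "steady-state"


if __name__ == "__main__":
    container = CognitiveContainer("order-service", "process customer orders", Policy())
    print(container.evaluate(VitalSigns(45.0, 60.0, 8.0, 12000)))   # steady-state
    print(container.evaluate(VitalSigns(92.0, 60.0, 8.0, 12000)))   # scale-out
    print(container.evaluate(VitalSigns(45.0, 60.0, 35.0, 12000)))  # live-migrate
```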

The Cognitive Container can be deployed in a physical or virtual server and does not require any changes to the applications, OSs or the infrastructure. Only the knowledge about the functional and non-functional requirements has to be infused into the Cognitive Container. The following figure shows a Cognitive Container network deployed in a distributed infrastructure. The Cognitive Container and the service management are designed to provide auto-scaling, self-repair, live-migration and end-to-end service transaction security independent of the infrastructure management systems.

Figure 5: End-to-End Service Visibility and Control in a Distributed Datacenter (Virtualized or Not) – Presented at the Open Server Summit, Rao Mikkilineni, Chief Scientist, C3 DNA

Using the Cognitive Container network, it is possible to create a federated service creation, delivery and assurance platform that transcends physical and virtual server boundaries and geographical locations, as shown in the figure below.

Figure 6: Federated Services Fabric with service creation, delivery and assurance processes decoupled from resource provisioning, management and control.

This architecture provides an opportunity to simplify the infrastructure: a tiered server, storage and network infrastructure that is static and hardwired provides various servers (physical or virtual) with the specified service levels (CPU, memory, network bandwidth, latency, storage capacity and throughput) that the cognitive containers are looking for based on their QoS requirements. It does not matter what technology is used to provision these servers with the required service levels. The Cognitive Containers monitor these vital signs using the local OS and, if they are not adequate, migrate to other servers where they are adequate, based on policies determined by business priorities, run-time workload fluctuations and real-time latency constraints.

The infrastructure provisioning then becomes a simple matter of matching a Cognitive Container to a server based on its QoS requirements. Thus the Cognitive Container services network provides a mechanism to deploy intelligent (self-aware, self-reasoning and self-controlling) services using dumb infrastructure with limited intelligence about services and applications (matching the application profile to the server profile) on stupid pipes that are designed to provide appropriate performance based on different technologies, as discussed in the Open Server Summit.
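
A hedged sketch of what that matching step could look like: each Cognitive Container publishes a QoS profile, each server (physical or virtual) publishes a capability profile, and provisioning reduces to a simple comparison. The field names and the place() helper are illustrative assumptions, not an existing API.

```python
# Illustrative sketch: provisioning as profile matching. Field names are
# assumptions chosen to mirror the service levels mentioned in the text
# (CPU, memory, network bandwidth, latency, storage capacity and IOPs).

def satisfies(server, container_qos):
    """True if a server profile meets or exceeds a container's QoS profile."""
    return (server["cpu_cores"]  >= container_qos["cpu_cores"] and
            server["memory_gb"]  >= container_qos["memory_gb"] and
            server["net_gbps"]   >= container_qos["net_gbps"] and
            server["latency_ms"] <= container_qos["max_latency_ms"] and
            server["storage_gb"] >= container_qos["storage_gb"] and
            server["iops"]       >= container_qos["iops"])

def place(container_qos, servers):
    """Return the first server whose published profile satisfies the QoS,
    or None so the caller can fall back on policy (wait, queue, or relax)."""
    return next((s for s in servers if satisfies(s, container_qos)), None)

servers = [
    {"name": "s1", "cpu_cores": 8,  "memory_gb": 32,  "net_gbps": 10,
     "latency_ms": 5, "storage_gb": 500,  "iops": 20000},
    {"name": "s2", "cpu_cores": 32, "memory_gb": 256, "net_gbps": 40,
     "latency_ms": 1, "storage_gb": 4000, "iops": 200000},
]
web_tier_qos = {"cpu_cores": 16, "memory_gb": 64, "net_gbps": 10,
                "max_latency_ms": 2, "storage_gb": 1000, "iops": 50000}
print(place(web_tier_qos, servers))   # -> the profile for "s2"
```

The infrastructure side only needs to publish honest profiles; everything else stays with the service.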

The managing and safekeeping of applications required to cope with non-deterministic impacts on workloads from changing demands, business priorities, latency constraints, limited resources and security threats is very similar to how cellular organisms manage life in a changing environment. The efficient managing and safekeeping of life at the lowest level of biological architecture, which provides the resiliency, was on von Neumann's mind when he presented his Hixon lecture (Von Neumann, J. (1987) Papers of John von Neumann on Computing and Computer Theory, Hixon Symposium, September 20, 1948, Pasadena, CA, The MIT Press, Massachusetts, p. 474). “The basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately.” Comparing computing machines and living organisms, he points out that computing machines are not as fault tolerant as living organisms. He goes on to say “It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond.” Perhaps the Cognitive Container bridges this gap by infusing self-management into computing machines that manage the external world while also managing themselves with self-awareness, reasoning, and control based on policies and best practices.

Cognitive Containers or not, the question is how we address the problem of ever-increasing complexity and cost in current datacenter and cloud offerings. This will be a major theme of the 4th conference track on the Convergence of Distributed Clouds, Grids and their Management at WETICE 2014 in Parma, Italy.

WETICE 2014 and The Conference Track on the Convergence of Clouds, Grids and Their Management
October 30, 2013

WETICE is an annual IEEE International conference on state-of-the-art research in enabling technologies for collaboration, consisting of a number of cognate conference tracks. The “Convergence of Clouds, Grids and their Management” conference track is devoted to discussing current and emerging trends in virtualization, cloud computing, high performance computing, Grid computing and Cognitive Computing. The tradition that started in WETICE2009 “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short term profit driven motives of a particular corporate entity” has resulted in a new computing model that was included in the Turing Centenary Conference proceedings in 2012. The 2013 conference track discussed Virtualization, Cloud Computing and the Emerging Datacenter Complexity Cliff in addition to conventional cloud and grid computing solutions.

The WETICE 2014 conference, to be held in Parma, Italy during June 23rd-25th, 2014, will continue the tradition of discussions on the convergence of clouds, grids and their management. In addition, it will solicit papers on new computing models, cognitive computing platforms and strong AI resulting from recent efforts to inject cognition into computing (Turing Machines).

All papers are refereed by the Scientific Review Committee of each conference track. All accepted papers will be published in the electronic proceedings by the IEEE Computer Society and submitted to the IEEE digital library. The proceedings will be submitted for indexing through INSPEC, Compendex, Thomson Reuters, DBLP, Google Scholar and EI Index.

http://wetice.org

Changing Landscape of Backup and Disaster Recovery
September 16, 2012

“Consumers need to drive vendors to deliver what they really need, and not what the vendors want to sell them.”

——  Jon Toigo (http://www.datastorageconnection.com/doc.mvc/Jon-Toigo-Exposes-More-About-Data-Storage-Ven-0001 )

Starting from the mainframe datacenters, where applications were accessed using narrow-bandwidth networks and dumb terminals, and evolving to the client-server and peer-to-peer distributed computing architectures that exploit higher-bandwidth connections, business process automation has contributed significantly to reducing the TCO. With the Internet, global e-commerce was enabled, and the resulting growth in commerce led to an explosion of storage. Storage networking and the resulting NAS (network attached storage) and SAN (storage area network) technologies have further changed the dynamics of the enterprise IT infrastructure in a significant way to meet business process automation needs. Storage backup and recovery technologies have further improved the resiliency of service delivery processes by improving the time it takes to respond in case of service failure. Figure 1 shows the evolution of the data recovery time objective. (The recovery point objective (RPO) is the point in time to which you must recover data as dictated by business needs; the recovery time objective (RTO) is the period of time after an outage within which the application and its data must be restored to a predetermined state defined by the RPO.) The RTO has dropped from days to minutes and seconds. While the productivity, flexibility and global connectivity made possible by this evolution have radically transformed the business economics of information systems, the complexity of heterogeneous and multi-vendor solutions has created a high dependence on specialized training and service expertise to assure the availability, reliability, performance and security of various business applications.

Figure 1: The evolution of Recovery Time Objective. Virtualization of server technology provides an order of magnitude improvement in the way applications are backed-up, recovered and protected against disasters.
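
To keep the two recovery metrics straight, here is a small worked example with invented timestamps: with a nightly backup, the worst-case RPO is roughly the backup interval, while the RTO is the time from the outage until the service is restored.

```python
# Toy illustration of RPO vs RTO with invented numbers.
from datetime import datetime

last_backup  = datetime(2012, 9, 16, 2, 0)    # nightly backup at 02:00
failure_time = datetime(2012, 9, 16, 14, 30)  # outage at 14:30
restore_done = datetime(2012, 9, 16, 15, 10)  # service restored at 15:10

rpo = failure_time - last_backup   # data written after 02:00 is lost
rto = restore_done - failure_time  # how long the service was unavailable

print(f"RPO (max data loss window): {rpo}")   # 12:30:00
print(f"RTO (time to restore):      {rto}")   # 0:40:00
```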

A successful implementation must integrate various server-, network- and storage-centric products, each with its own local optimization best practices, into an end-to-end optimization strategy.  While each vendor attempts to assure its success with more software and services, small and medium enterprises often cannot afford the escalating software and service expenses associated with these optimization strategies and become vulnerable.  The exponential growth in demand for voice, data and video services in the consumer market has also introduced severe strains on current IT infrastructures.  There are three main issues currently driving distributed computing solutions to seek new approaches:

  1. Current IT datacenters have evolved to meet business services needs in an evolutionary fashion from server-centric application design to client-server networking to storage area networking, without an end-to-end optimized architectural transformation along the way.  The server, network and storage vendors optimized management in their own local domains, often duplicating functions from other domains, to compete in the marketplace.  For example, cache memory is used to improve the performance of service transactions by improving response time. However, the redundancy of cache management in servers, storage and even network switches makes tuning the response time a complex task requiring multiple management systems. Application developers have also started to introduce server, storage and network management within their applications.  For example, Oracle is not just a database application.  It is also a storage manager and a network manager as well as an application manager.  It tries to optimize all its resources for performance tuning.  No wonder it takes an army of experts to keep it going.  The result is an over-provisioned datacenter with multiple functions duplicated many times by the server, storage and networking vendors.  Large enterprises with big profit margins throw human bodies, tons of hardware and a host of custom software and shelf-ware packages at their needs.  Some datacenter managers do not even know what assets they have — of course, yet another opportunity for vendors to sell an asset management system to discover what is available, and services to provide asset management using such an asset manager.  Another example is de-duplication software that finds multiple copies of the same files and removes the duplicates.  This shows how expensive it is to clean up after the fact.
  2. Heterogeneous technologies from multiple vendors that are supposed to reduce IT costs actually increase the complexity and management costs.  Today, many CFOs consider IT a black hole that sucks in expensive human consultants and continually demands capital and operational expenses to add hardware and software, which often end up as shelf-ware because of their complexity.  Even for mission-critical business services, enterprise CFOs are starting to question the productivity and effectiveness of current IT infrastructures.  It becomes even more difficult to justify the costs and complexity to support the massive scalability and wild fluctuations in workloads demanded by consumer services.  The price point is set low for the mass market but the demand is high for massive scalability (a relatively simple, but massive, service like Facebook is estimated to use about 40,000 servers and Google is estimated to run a million servers to support its business).
  3. More importantly, Internet-based consumer services such as social networking, e-mail and video streaming applications have introduced new elements: wild fluctuations in demand and massive scale of delivery to a divergent set of customers.  The result is an increased sensitivity to the economics of service creation, delivery and assurance. Unless the cost structure of the IT management infrastructure is addressed, the mass-market needs cannot be met profitably.  Large service providers such as Amazon, Google, Facebook etc., have understandably implemented alternatives to meet wildly fluctuating workloads, massive scaling of customers and latency constraints to meet demanding response time requirements.

Cloud computing technology has evolved to meet the needs of massive scaling, wild fluctuations in consumer demand and response time control of distributed transactions spanning multiple systems, players and geographies.  More importantly, cloud computing changes the backup and Disaster Recovery (DR) strategies drastically, reducing the RTO to minutes and seconds and doing much better than SAN/NAS-based server-less backup and recovery strategies. Live migration is accomplished as follows:

  1. The entire state of a virtual machine is encapsulated by a set of files stored on shared storage such as Fibre Channel or iSCSI Storage Area Network (SAN) or Network Attached Storage (NAS).
  2. The active memory and precise execution state of the virtual machine is rapidly transferred over a high-speed network, allowing the virtual machine to instantaneously switch from running on the source host to the destination host. This entire process could take less than a few seconds on a Gigabit Ethernet network.
  3. The networks being used by the virtual machine are virtualized by the underlying host. This ensures that even after the migration, the virtual machine network identity and network connections are preserved.
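
Step 2 is typically done with an iterative "pre-copy" scheme: copy memory while the VM keeps running, re-copy the pages it dirties, and pause it only for the final small delta plus the execution state. The toy simulation below illustrates that loop with invented numbers; it is a schematic of the idea, not any vendor's implementation.

```python
# Schematic pre-copy live migration, simulated with a fake VM whose memory
# pages keep changing while it runs. Illustrative only.
import random

class FakeVM:
    """Toy stand-in for a running VM: a dict of memory pages plus a dirty set."""
    def __init__(self, n_pages=1000):
        self.memory = {i: 0 for i in range(n_pages)}
        self.running = True
        self.dirty = set(self.memory)          # everything needs a first copy

    def run_a_bit(self):                       # the workload dirties some pages
        if self.running:
            for p in random.sample(sorted(self.memory), 50):
                self.memory[p] += 1
                self.dirty.add(p)

def live_migrate(vm, max_rounds=10, stop_threshold=64):
    dest_memory = {}
    for _ in range(max_rounds):                # pre-copy rounds while the VM runs
        to_copy, vm.dirty = vm.dirty, set()
        for p in to_copy:
            dest_memory[p] = vm.memory[p]
        vm.run_a_bit()                         # VM keeps dirtying pages meanwhile
        if len(vm.dirty) <= stop_threshold:    # delta small enough to pause
            break
    vm.running = False                         # brief stop-and-copy phase
    for p in vm.dirty:
        dest_memory[p] = vm.memory[p]
    return dest_memory                         # destination resumes from this image

vm = FakeVM()
copied = live_migrate(vm)
assert copied == vm.memory                     # destination holds a consistent image
print("migrated", len(copied), "pages")
```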

While virtual machines improve resiliency and live migration reduces the RTO, the increased complexity of hypervisors, their orchestration, virtual machine images and their management adds an additional burden in the datacenter. Figure 2 shows the evolution of current datacenters from the mainframe days to the cloud computing transformation.  The cost of creating and delivering a service has continuously decreased with the increased performance of hardware and software technologies. What used to take months and years to develop and deliver as new services now takes only weeks and hours. On the other hand, as service demand increased with ubiquitous access using the Internet and broadband networks, the need for resiliency (availability, reliability, performance and security management), efficiency and scaling also put new demands on service assurance and hence on the need for continuous reduction of RTO and RPO. The introduction of SAN server-less backup and virtual machine migration has in turn increased the complexity, and hence the cost, of managing service transactions during delivery while reducing the RTO and RPO.

Figure 2: Cost of Service Creation, Delivery and Assurance with the Evolution of Datacenter Technologies. The management cost has exploded because a myriad of point-solution appliances, software and shelf-ware are cobbled together from multiple vendors. Any future solution that addresses the datacenter management conundrum must provide end-to-end service visibility and control transcending multiple service provider resource management systems. Future datacenter focus will be on a transformation from resources management to services switching to provide telecom-grade “trust”.

The increased complexity of management of services implemented using the von Neumann serial computing model executing a Turing machine turns out to be more a fundamental architectural issue related to Gödel’s prohibition of self-reflection in Turing machines than a software design issue. Cockshott et al. conclude their book “Computation and its limits” with the paragraph “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” While the last statement is not strictly correct (for example, current operating systems facilitate incorporating computing resources and their management, interspersed with the computations that attempt to model a physical system, to be executed in a Turing machine), it still points to a fundamental limitation of current Turing machine implementations of computations using the serial von Neumann stored program control computing model. The universal Turing machine allows a sequence of connected Turing machines to synchronously model a physical system as a description specified by a third party (the modeler). The context, constraints, communication abstractions and control of various aspects during the execution of the model (which specifies the relationship between the computer acting as the observer and the computed acting as the observed) cannot also be included in the same description of the model because of Gödel’s theorems of incompleteness and undecidability. Figure 3 shows the evolution of computing from mainframe/client-server computing, where the management was labor-intensive, to the cloud computing paradigm, where the management services (which include the computers themselves in the model controlling the physical world) are automated.

 Figure 3: Evolution of Computing with respect to Resiliency, Efficiency and Scaling.

The first phase (of conventional computing) depended on manual operations and served well when the service transaction times and service management times could be very far apart and did not affect the service response times. As the service demands increased, service management automation helped reduce the gap between the two transaction times at the expense of increased complexity and resulting cost of management. It is estimated that 70% of today’s IT budget goes to self-maintenance and only 30% goes to new service development. Figure 4 shows current layers of systems contributing to cloud management.

Figure 4: Services and their management complexity

The origin of the complexity is easy to understand. Current ad-hoc distributed service management practices originated from server-centric operating systems and narrow-bandwidth connections. The need to address end-to-end service transaction management, and the resource allocation and contention resolution required to address changing circumstances (which depend on business priorities, latency and workload fluctuations), was accommodated as an afterthought. In addition, an open, competitive marketplace has driven server-centric, network-centric and storage-centric devices and appliances to multiply. The resulting duplication of many of the management functions in multiple devices, without an end-to-end architectural view, has largely contributed to the cost and complexity of management. For example, storage volume management is duplicated in server, network and storage devices, leading to a complex web of performance optimization strategies. Special-purpose appliance solutions have sprouted to provide application, network, storage, and server security, often duplicating many of the functions. The lack of an end-to-end architectural framework has led to point solutions that have dominated the service management landscape, often negating the efficiency improvements in service development and delivery made possible by hardware performance improvements (Moore’s law) and by software technologies and development frameworks.

The escape from this conundrum is to re-examine the computation models and circumvent the computational limit to go beyond Turing machines and the serial von Neumann computing model. A recently proposed computing model implemented in the DIME network architecture (Designing a New Class of Distributed Systems, Springer 2011) attempts to provide a new approach based on the old Turing o-machine proposed by Turing in his thesis. Phase 3 in figure 3 shows the new computing model implementing a non-von Neumann managed Turing machine to provide hierarchical self-management of temporal computing processes. The implementation exploits the parallel threads and high bandwidth available with many-core processors and provides auto-scaling, live-migration, performance optimization and end-to-end transaction security by providing FCAPS (fault, configuration, accounting, performance and security) management of each Linux process, and a network of such Linux processes provides a distributed service transaction. This eliminates the need for hypervisors and virtual machines and their management while reducing complexity. Since a Linux process is virtualized instead of a virtual machine, the backup and DR are at the process level and also cover the network of processes providing the service. Hence it is much more light-weight than VM-based backup and DR.
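
Because the unit of recovery in this model is a managed process rather than a VM image, a backup is conceptually just a checkpoint of the service's state plus the knowledge needed to restart it elsewhere. The sketch below illustrates only that idea; the class and its methods are invented, and real process-level checkpointing (and the DIME implementation) involves far more than serializing a state dictionary.

```python
# Conceptual sketch of process-level backup/DR: checkpoint the service state
# of a managed process and restart it on another node from that checkpoint.
import json, pathlib, tempfile

class ManagedService:
    def __init__(self, name, state=None):
        self.name = name
        self.state = state or {"requests_served": 0}

    def handle_request(self):
        self.state["requests_served"] += 1

    def checkpoint(self, directory):
        """Persist the service state: the per-process analogue of a VM backup."""
        path = pathlib.Path(directory) / f"{self.name}.ckpt.json"
        path.write_text(json.dumps(self.state))
        return path

    @classmethod
    def restore(cls, name, path):
        """Recreate the service on another node from the last checkpoint."""
        return cls(name, state=json.loads(pathlib.Path(path).read_text()))

svc = ManagedService("orders")
for _ in range(3):
    svc.handle_request()
ckpt = svc.checkpoint(tempfile.gettempdir())            # lightweight per-process backup
replacement = ManagedService.restore("orders", ckpt)    # recovery on a new node
assert replacement.state == svc.state
```

The checkpoint is tiny compared with a VM image, which is where the "light-weight" claim above comes from.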

In its simplest form the DIME computing model modifies the Turing machine SPC implementation by exploiting the parallelism and high bandwidth available in today’s infrastructure.

Figure 5: The DIME Computing Model – A Managed Turing Machine with Signaling incorporates the spirit of the Turing oracle machine proposed in his thesis.

Figure 5 shows the transition from the TM to a managed TM by incorporating three attributes:

  1. Before any read or write, the computing element checks the fault, configuration, accounting, performance and security (FCAPS) policies assigned to it,
  2. Self-management of the computing element is endowed by introducing parallel FCAPS management that sets the FCAPS policies that the computing element obeys, and
  3. An overlay of a signaling network provides an FCAPS monitoring and control channel which allows the composition of managed networks of TMs implementing managed workflows.
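
A minimal sketch of a managed computing element with these three attributes, written purely for illustration (none of the names correspond to an actual DIME or DNA API): the compute loop consults its FCAPS policies before every write, a parallel management thread owns those policies, and a queue stands in for the signaling overlay.

```python
# Illustrative managed computing element (not the DIME implementation):
# 1. the compute loop checks FCAPS policies before each write,
# 2. a parallel management thread owns and updates those policies,
# 3. a queue acts as the signaling channel from a network-level manager.
import queue, threading, time

class ManagedElement:
    def __init__(self):
        self.policies = {"writes_allowed": True, "max_items": 100}
        self.signals = queue.Queue()          # signaling overlay (stand-in)
        self.store = []                       # the "tape" the worker writes to

    def manager(self):                        # parallel FCAPS management thread
        while True:
            signal = self.signals.get()       # e.g. {"writes_allowed": False}
            if signal == "stop":
                break
            self.policies.update(signal)

    def worker(self, items):                  # the computing element itself
        for item in items:
            # attribute 1: check policies before any write
            if self.policies["writes_allowed"] and len(self.store) < self.policies["max_items"]:
                self.store.append(item)
            time.sleep(0.01)

element = ManagedElement()
threading.Thread(target=element.manager, daemon=True).start()
work = threading.Thread(target=element.worker, args=(range(50),))
work.start()
element.signals.put({"writes_allowed": False})   # a manager intervenes mid-run
work.join()
element.signals.put("stop")
print(len(element.store), "items written before the policy change")
```

Composing many such elements over the same signaling channel is what the text means by a managed network of TMs implementing managed workflows.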

Figure 6 shows the services architecture with DIME network management providing end-to-end service FCAPS management.

Figure 6: Service Management with DIME Networks

The resulting decoupling of services management from infrastructure management provides a new approach to service management, including backup and DR. While the DIME computing model is in its infancy, two prototypes have already demonstrated its usefulness: one with a LAMP stack and another with a new native OS designed for many-core servers. Unlike virtual machine based backup and DR, the DIME network architecture supports auto-provisioning, auto-scaling, self-repair, live-migration, secure service isolation, and end-to-end distributed transaction security across multiple devices at the process level in an operating system. Therefore, this approach not only avoids the complexity of hypervisors and virtual machines (although it still works with virtual servers) but also allows adding live-migration to existing applications without requiring changes to their code. In addition, it offers a new approach where the hardware infrastructure is simpler, without the burden of anticipating service level requirements, and the intelligence of services management resides in the services infrastructure, leading to the deployment of intelligent self-managing services using a dumb infrastructure on stupid networks.

In conclusion, we emphasize that the DIME network architecture works with or without hypervisors and the associated virtual machine, IaaS and PaaS complexity, and allows uniform service assurance across hybrid clouds independent of the service provider management systems. Only the virtual server provisioning commands are required to configure just enough OS and DIMEX libraries and to execute service components using DNA.

The power of DIME network architecture is easy to understand. By introducing parallel management to the Turing machine, we are converting a computing element to a managed computing element. In current operating systems, it is at the process level. In the new native operating system (parallax-OS) we have demonstrated, it is the Core in a many-core processor. A managed element provides plug-in dynamism to service architecture.

Figure 7 shows a service deployment in a Hybrid cloud with integrated service assurance across the private and public clouds without using service provider management infrastructure. Only the local operating system is utilized in DIME service network management.

Figure 7: A DNA based services deployment and assurance in a Hybrid Cloud. The decoupling of dynamic service provisioning and management from infrastructure resource provisioning and management (server, network and storage administration) enabled by DNA makes static provisioning of resource pools possible, and the dynamic migration of services allows them to seek the right resources at the right time based on workloads, business priorities and latency constraints.

As mentioned earlier, the DIME network architecture is still in its infancy, and researchers are developing both the theory and the practice to validate its usefulness in mission-critical environments. Hopefully, in this year of the Turing centenary celebration, some new approaches will address the computation and its limits pointed out by Cockshott et al. in their book. Paraphrasing Turing (who was unimpressed by Wilkes’s EDSAC design, commenting that it was “much more in the American tradition of solving one’s difficulties by means of much equipment rather than by thought”), a lot of appliances or code may not often be a sustainable substitute for thoughtful architecture.

Is the Software Defined Network (SDN) Another Detour to a Datacenter Dead-end?
August 6, 2012

Introduction

Frustrated by the inability to fiddle with Internet routing in the real world, Stanford computer scientist Nick McKeown and colleagues developed a standard called OpenFlow that essentially opens up the Internet to researchers, allowing them to define data flows using software: a sort of “software-defined networking.” Installing a small piece of OpenFlow firmware (software embedded in hardware) gives engineers access to flow tables, the rules that tell switches and routers how to direct network traffic, yet it protects the proprietary routing instructions that differentiate one company’s hardware from another. SDN is nothing more than the separation of network data traffic processing from the logic and rules controlling the flow, inspection, and modification of that data. Traditional network hardware, i.e. switches and routers, implements these functions in proprietary firmware partitioned respectively into what are known as the data and control planes. While this is a fine research project, as the major vendors start to take it seriously and attempt to introduce it into real-world datacenters, one must ask whether it will add to or reduce the complexity of the already complex datacenter, where a host of piecemeal solutions are offered by mega-corporations seeking to continually increase their revenues, with no incentive to reduce complexity by reducing the number of hardware and software components deployed, since that would cut into their product sales.

Systems theory tells us that as the number of components in a system increases, the cost of complexity could outweigh the benefits unless an architectural reorganization provides a way out.  We argue that the management complexity in current IT infrastructure design, based on the serial von Neumann stored program control implementation of the universal Turing machine, is a more fundamental architectural issue related to the lack of resiliency of the computing model than a software design issue. Cockshott et al. (2012) conclude their book “Computation and its limits” with the paragraph “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” Current generation distributed systems are implemented using a network of Turing machines in which the service and its management are intermixed, as shown in figure 1. The resources utilized by the nodes in a network are often controlled by a plethora of management systems which are outside the purview of the service workflow that is utilizing the resources.  Thus the end-to-end service transaction response is controlled by these management systems, which introduce a layer of complexity in coordination and contention resolution, making the service much simpler than its management.

Figure 1: Serial von Neumann implementation of Turing Machines

The limitations of the SPC computing architecture were clearly on his mind when von Neumann gave his lecture at the Hixon symposium in 1948 in Pasadena, California (von Neumann, 1987, p. 408). “The basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately.” Comparing the computing machines and living organisms, he points out that the computing machines are not as fault tolerant as the living organisms.  He goes on to say “It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond” (von Neumann, 1987,p. 408). It is clear that von Neumann recognized a problem in the way we design computing systems.

“Normally, a literary description of what an automaton is supposed to do is simpler than the complete diagram of the automaton. It is not true a priori that this always will be so. There is a good deal in formal logic which indicates that when an automaton is not very complicated the description of the function of the automaton is simpler than the description of the automaton itself, as long as the automaton is not very complicated, but when you get to high complications, the actual object is much simpler than the literary description.” (von Neumann, 1987,pp. 454-457). He remarked, “It is a theorem of Gödel that the description of an object is one class type higher than the object and is therefore asymptotically infinitely longer to describe.” (von Neumann, 1987,pp. 454-457). The conjecture of von Neumann leads to the fact that “one cannot construct an automaton which will predict the behavior of any arbitrary automaton” (von Neumann, 1987,p. 456). This is so with the Turing machine implemented by the SPC model.

In simpler terms, the management complexity is related to the classical Russell paradox, which can be paraphrased as: “Who manages the managers?” Gödel’s prohibition of self-reflection in a Turing Machine mandates a hierarchy of Turing machines acting as managers managing other Turing machines implementing the computations described as a sequence of instructions that are compiled into a sequence of 1s and 0s. The universal Turing machine (or the general purpose computer) implements these TMs in a synchronous workflow, thus prohibiting changes to computations at run-time in any Turing machine while the computation is in progress in that machine (i.e., you cannot change the behavior of that computation (compiled code) until its execution is interrupted).

Current generation server, networking, and storage equipment and their management systems have evolved from server-centric and bandwidth-limited network architectures to today’s cloud computing architecture with virtual servers and broadband networks. During the last six decades, many layers of computing abstraction have been introduced to map the execution of complex computational workflows to a sequence of 1s and 0s that eventually get stored in memory and operated upon by the CPU to achieve the desired result.  These include process definition languages, programming languages, file systems, databases, operating systems, etc. While this has helped in automating many business processes, the exponential growth in services in the consumer market has also introduced severe strains on current IT infrastructure. In order to meet the need to rapidly manage the distributed computing resources demanded by changing workloads, business priorities and latency constraints, new layers of resource management have been added with the introduction of hypervisors, virtual machines (VMs) and their management. While these layers have made application or service management more agile, they have introduced a new layer of issues related to their own management. For example, new layers of virtual machine-level clustering, intrusion detection and performance management are being introduced in addition to the already existing clustering, intrusion detection and performance management systems at the infrastructure, operating system and distributed resource management layers.

However, this approach is completely unsuited to exploiting the new generation of many-core servers and high-bandwidth networks now available. The advent of many-core servers with tens and even hundreds of computing cores and high-bandwidth communication among them makes the current generation of server, networking and storage equipment and their management systems, which have evolved from server-centric and bandwidth-limited architectures, ill-suited for efficient use in the next generation computing infrastructure.  It is hard to imagine replicating current TCP/IP-based socket communication, “isolate and fix” diagnostic procedures, and the multiple operating systems (which do not have end-to-end visibility or control of business transactions that span multiple cores, multiple chips, multiple servers and multiple geographies) inside the next generation many-core servers without addressing their shortcomings.  The many-core servers and processors constitute a network where each node itself is a sub-network with different bandwidths and protocols (socket-based low-bandwidth communication between servers, InfiniBand or PCI Express bus based communication across processors in the same server, and shared-memory based low-latency communication across the cores inside a processor).

Figure 2 shows the many-core server network supporting multiple bandwidths.

In order to cope with the scaling issues and utilize the hierarchical many-core network of networks effectively, next generation service architecture has to emulate the architectural resiliency of cellular organisms that tolerate faults and implement command and control structures which enable execution of self-configuring, self-monitoring, self-protecting, self-healing and self-optimizing (in short self-*) business processes. This requires new computing models that break the Turing machine barrier to computation by allowing the computer and the computed to be treated in the same model.

Papers Solicited to Address Next Generation Datacenter Infrastructure and Technologies:

The conference track on “Convergence of Distributed Clouds, Grids and their Management”, sponsored under the aegis of WETICE 2013, is devoted to addressing next generation computing models which support real-time resource reconfiguration of distributed business workflow execution based on latency constraints, changing workloads and business priorities. It is devoted to addressing the assurance of reliability, availability, performance, account management and security of distributed business process execution with appropriate visibility and control.

The objective of the Conference was first stated in WETICE 2009: “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short-term profit driven motives of a particular corporate entity.” We are glad to report that the discussions started in 2009 have directly resulted in an alternative approach to self-managing distributed computing systems, totally different from the current industry trend, showing a way to eliminate the complexity of virtual machines and hypervisors. If this approach is proven to be theoretically sound (as a paper in WETICE 2012 investigated) and its usefulness (demonstrated through two proofs of concept in the last conference) extends to mission-critical environments, the DIME network architecture may yet prove to be an important contribution to computer science.

Following the tradition, the target of WETICE 2013 is to transform current complex, redundant, costly and knowledge-intensive IT management into self-configuring, self-monitoring, self-healing and self-optimizing distributed workflow implementations with service management limited only by the speed of light. We identify another emerging area, software defined networks (SDN), as a potential candidate for further investigation, without the bias that often surrounds commercial profit motives, to see whether the overall complexity of the datacenter will be reduced or whether SDNs are yet another layer of complexity.

Papers are solicited to advance the next generation distributed computing and its management infrastructure that leverages the new hardware innovations.  The goals of the conference include (but are not limited to):
  1. Discovering new application scenarios, proposing new operating systems, programming abstractions and tools
  2. Identifying the challenging problems that still need to be solved, such as parallel programming, scaling and management of distributed computing elements, and
  3. Reporting results and experiences gained by researchers in building dynamic Grid-based middleware, computing clouds (distributed or otherwise) and workflow management systems.
Submission of papers: March 10, 2013
Notification to authors: April 1, 2013
Final papers to IEEE-CS: April 25, 2013
Paper authors' registration deadline: May 10, 2013
WETICE 2013 Conference: June 17-20, 2013

References:

P. Cockshott, L. M. MacKenzie and G. Michaelson, “Computation and its Limits”, Oxford University Press, Oxford, 2012.

J. von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components”, in Automata Studies, edited by C. E. Shannon and J. McCarthy, Princeton University Press, 1956, pp. 43-98.

W. Aspray and A. Burks (eds.), “Papers of John von Neumann on Computing and Computer Theory”, Cambridge, MA: MIT Press, 1987.

Path to Self-managing Services: A Case for Deploying Managed Intelligent Services Using Dumb Infrastructure in a Stupid Network
February 2, 2012

“WETICE 2012 Convergence of Distributed Clouds, Grids and their Management Conference Track is devoted to transform current labor intensive, software/shelf-ware-heavy, and knowledge-professional-services dependent IT management into self-configuring, self-monitoring, self-protecting, self-healing and self-optimizing distributed workflow implementations with end-to-end service management by facilitating the development of a Unified Theory of Computing.”

“In recent history, the basis of telephone company value has been the sharing of scarce resources — wires, switches, etc. – to create premium-priced services. Over the last few years, glass fibers have gotten clearer, lasers are faster and cheaper, and processors have become many orders of magnitude more capable and available. In other words, the scarcity assumption has disappeared, which poses a challenge to the telcos’ “Intelligent Network” model. A new type of open, flexible communications infrastructure, the “Stupid Network,” is poised to deliver increased user control, more innovation, and greater value.”

                     —–Isenberg, D. S., (1998). “The dawn of the stupid network”. ACM netWorker 2, 1, 24-31.

Much has changed since the late ’90s that drove the telcos to essentially abandon their drive for supremacy in the intelligent services creation, delivery and assurance business and take a back seat in the information services market, managing the ‘stupid network’ that merely carries the information services.  You have only to look at the demise of major R&D companies such as AT&T Bell Labs, Lucent, Nortel and Alcatel, and the rise of a new generation of services platforms from Apple, Amazon, Google, Facebook, Twitter, Oracle and Microsoft, to notice the sea change that has occurred in a short span of time. The data center has replaced the central office to become the hub from which myriad voice, video and data services are created and delivered on a global scale. However, the management of these services, which determines their resiliency, efficiency and scaling, is another matter.

While the data center’s value has been the sharing of expensive resources – processor speed, memory, network bandwidth, storage capacity, throughput and IOPs – to create premium-priced services, over the last couple of decades the complexity of the infrastructure and its management has exploded. It is estimated that up to 70% of the total IT budget now goes to the management of infrastructure rather than to developing new services (www.serverdesignsummit.com). It is important to define what TCO (total cost of ownership) we are talking about here because it is often used to justify different solutions, as the following picture showing three different TCO representations of a data center illustrates. Figure 1 shows three different TCO views presented by three different speakers at the Server Design Summit in November 2011.  Each graph, while accurate, represents a different view. For example, the first view represents the server infrastructure and its management cost. The second one represents the power infrastructure and its management. The third view shows both the server infrastructure and power management. As you can see, the total power and its management, while steadily increasing, is only a small fraction of the total infrastructure management cost.  In addition, these views do not even show the network and storage infrastructure and their management. It is also interesting to see the explosion of management cost shown in figure 3 over the last two decades. Automation has certainly improved the number of servers that can be managed by a single person by orders of magnitude. This is borne out by the labor cost in the left picture by Intel, which shows it is about 13% of the TCO from a server viewpoint. But this does not tell the whole story.

Figure 1: Three different views of Data center TCO presented in the Server Design Summit conference in November 2011 (http://www.serverdesignsummit.com/English/Conference/Proceedings_Chrono.html). These views do not touch the storage, network and application/service management costs both in terms of software systems and labor.

A more revealing picture can be obtained by using the TCO calculator from one of the virtualization infrastructure vendors. Figure 2 shows the percentage Total Cost of Ownership (TCO) (for a 1500-server data center) over five years by each component, with and without virtualization.

Figure 2: Five Year TCO of Virtualization According to a Vendor ROI Calculator. While virtualization reduces the TCO from 35% to 25%, it is almost offset by the software, services and training costs.

While virtualization introduces many benefits such as consolidation, multi-tenancy in a physical server, real-time business continuity and elastic scaling of resources to meet wildly fluctuating workloads, it adds another layer of management systems in addition to the current computing, network, storage and application management systems. Figure 3 shows a reduction of about 50% in the five-year TCO with virtualization. The virtual machine density of about 13 allows a great saving in hardware costs, which is somewhat offset by the new software, training and services costs of virtualization.

Figure 3: TCO over 5 Years with virtualization of 1500 servers using 13 VMs per Server. While the infrastructure and administration costs drop, it is almost offset by the software, services and training costs.
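
As a back-of-the-envelope check on the hardware side of those figures (using only the server count and VM density quoted above, and ignoring software, services and training costs), consolidating 1500 servers at roughly 13 VMs per host implies about 116 physical hosts:

```python
# Back-of-the-envelope consolidation arithmetic using the figures in the text.
import math

physical_servers_before = 1500
vm_density = 13                     # ~13 VMs per virtualized host

hosts_after = math.ceil(physical_servers_before / vm_density)
hardware_reduction = 1 - hosts_after / physical_servers_before

print(f"hosts needed after virtualization: {hosts_after}")             # 116
print(f"reduction in physical servers:     {hardware_reduction:.0%}")  # ~92%
```

The physical server count falls by over 90%, yet the five-year TCO only halves, which is consistent with the caption above: the new software, services and training costs absorb much of the hardware saving.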

In addition, there is the cost of new complexity in optimizing the 13 or so VMs within each server in order to match the resources (network bandwidth, storage capacity, IOPs and throughput) to application workload characteristics, business priorities and latency constraints. According to a storage consultant, Jon Toigo “Consumers need to drive vendors to deliver what they really need, and not what the vendors want to sell them. They need to break with the old ways of architecting storage infrastructure and of purchasing the wrong gear to store their bits: Deploying a “SAN” populated with lots of stovepipe arrays and fabric switches that deliver less than 15% of optimal efficiency per port is a waste of money that bodes ill for companies in the areas of compliance, continuity, and green IT.”

Resource-management-based data center operations miss an important feature of services/applications management: all services are not created equal. They have different latency and throughput requirements. They have different business priorities and different workload characteristics and fluctuations. What works for the goose does not work for the gander. Figure 4 shows a classification of different services based on their throughput and latency requirements, presented by Dell at the Server Design Summit. The applications are characterized by their need for throughput, latency and storage capacity. In order to take advantage of the differing priorities and characteristics of the applications, additional layers of services management are introduced which focus on service-specific resource management. Various appliance- or software-based solutions are added to the already complex resource management suites that address server, network and storage to provide service-specific optimization. While this approach is well suited to generating recurring revenues for vendors, it is not ideally suited for customers to lower the final TCO when all the piece-wise TCOs are added up. Over a period of time, most of these appliances and software end up as shelf-ware while the vendors tout more new TCO-reducing solutions. For example, a well-known solution vendor makes more annual revenue from maintenance and upgrades than from new products or services that help their customers really reduce the TCO.

Figure 4: Various services/applications characterized by their throughput and latency requirements. Current resource management based data centers do not optimally exploit the resources based on application/service priority, workload variations and latency constraints. It is easy to see the inefficiency in deploying a “one size fits all” infrastructure. It will be more efficient to tailor “dumb” infrastructure and “Stupid Network” pools specialized to cater to different latency and throughput characteristics and let intelligent services provision themselves with the right resources based on their own business priorities, workload characteristics and latency constraints. This requires the visibility and control of service specification, management and execution available at run time, which necessitates a search for new computing models.
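
The caption above suggests tailoring infrastructure pools to different latency and throughput classes and letting services pick the pool that fits. The sketch below illustrates that bucketing idea; the pool names, thresholds and example services are invented for illustration and do not come from the Dell presentation.

```python
# Illustrative bucketing of services into infrastructure pools by their
# latency and throughput requirements (names and thresholds are invented).

POOLS = [
    # (pool name, latency the pool can deliver (ms), throughput it can sustain (MB/s))
    ("low-latency flash pool",   2,  500),
    ("balanced pool",           20,  200),
    ("capacity/archive pool",  200,   50),
]

def assign_pool(required_latency_ms, required_throughput_mbps):
    """Pick the least capable (cheapest) pool that still meets the service's needs."""
    for name, pool_latency, pool_throughput in reversed(POOLS):
        if pool_latency <= required_latency_ms and pool_throughput >= required_throughput_mbps:
            return name
    return "no suitable pool - escalate per policy"

services = {
    "online transaction processing": (5, 300),    # tight latency, high throughput
    "video streaming":               (100, 150),  # throughput-sensitive
    "nightly backup":                (1000, 40),  # neither is critical
}
for svc, (lat, thr) in services.items():
    print(f"{svc:32s} -> {assign_pool(lat, thr)}")
```

A "one size fits all" pool would force every service onto the most capable, and most expensive, tier.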

In addition to the current complexity and cost of resource management to assure service availability, reliability, performance and security, there is an even more fundamental issue that plagues the current distributed systems architecture. A distributed transaction that spans multiple servers, networks and storage devices in multiple geographies uses resources that span multiple data centers. The fault, configuration, accounting, performance and security (FCAPS) management of a distributed transaction’s behavior requires end-to-end connection management, more like a telecommunication service spanning distributed resources. Therefore, focusing only on resource management in a data center, without visibility and control of all the resources participating in the transaction, will not provide assurance of service availability, reliability, performance and security.

Distributed transactions transcend the current stored program control implementation of the Turing machine, which is at the heart of the atomic computing element in current computing infrastructure.  The communication and control are not an integral part of this atomic computing unit in the stored program control implementation of the Turing machine. Distributed transactions require interaction, which integrates computing, control and communication to provide the ability to specify and execute highly temporal and hierarchical event flows. According to Goldin and Wegner, interactive computation is inherently concurrent, where the computation of interacting agents or processes proceeds in parallel. Hoare, Milner and other founders of concurrency theory long realized that Turing Machines (TMs) do not model all of computation (Wegner and Goldin, 2003). However, when their theory of concurrent systems was first developed in the late ’70s, it was premature to openly challenge TMs as a complete model of computation. Their theory positions interaction as orthogonal to computation, rather than a part of it. By separating interaction from computation, the question of whether the models for CCS and the Pi-calculus went beyond Turing Machines and algorithms was avoided. The resulting divide between the theory of computation and concurrency theory runs very deep. The theory of computation views computation as a closed-box transformation of inputs to outputs, completely captured by Turing Machines. By contrast, concurrency theory focuses on the communication aspect of computing systems, which is not captured by Turing Machines – referring both to the communication between computing components in a system, and to the communication between the computing system and its environment. As a result of this division of labor, there has been little in common between these fields and their communities of researchers. According to Papadimitriou (Papadimitriou, 1995), such a disconnect within the theory community is a sign of a crisis and of a need for a Kuhnian paradigm shift in our discipline.

Kuhnian paradigm shift or not, a new computing model called the DIME computing model (discussed in WETICE 2010) provides a convergence of these two disciplines by addressing computing and communications in a single computing entity that is a managed Turing machine. The DIME network architecture provides a fractal (recursive) composition scheme to create an FCAPS-managed network of DIMEs implementing business workflows as DAGs supporting both hierarchical and temporal event flows. The DIME computing model supports only those computations that can be specified as managed DAGs, where a management signaling network overlay allows execution of managed computing tasks (executed by a computing unit called the MICE) in each Turing machine node that is endowed with self-management using parallel computing threads. The MICE (see the video referenced in this blog for a description of the DIME and its use in distributed computing and its management) constitutes the atomic Turing machine that is controlled by the FCAPS manager in a DIME, which allows configuring, executing and managing the MICE to load and execute a well-specified computing workflow and its FCAPS management. The MICE, under parallel real-time control of the DIME FCAPS manager aided by a signaling network overlay, provides control over the start, stop, read and write abstractions of the Turing machine. Two implementations have provided an existence proof for the DIME network architecture.
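
To summarize the structure described above in executable form, here is a toy sketch (Python, purely illustrative and not the DIME implementation): an FCAPS manager thread controls a MICE worker through a signaling channel, and two such nodes are chained into a small managed workflow, a two-node DAG. All class and method names are invented.

```python
# Toy DIME node: an FCAPS manager controlling a MICE worker over a signaling
# channel, and two nodes chained into a managed workflow. Illustrative only.
import queue, threading, time

class MICE:
    """Managed Intelligent Computing Element: executes a task when told to."""
    def __init__(self, task):
        self.task = task

    def execute(self, payload):
        return self.task(payload)

class DIME:
    def __init__(self, name, task, downstream=None):
        self.name = name
        self.mice = MICE(task)
        self.downstream = downstream          # next node in the workflow DAG
        self.signals = queue.Queue()          # signaling overlay (stand-in)
        self.fcaps = {"enabled": True}        # fault/config/accounting/perf/security knobs

    def manage(self):                         # parallel FCAPS manager
        while True:
            sig = self.signals.get()
            if sig == "shutdown":
                break
            self.fcaps.update(sig)

    def submit(self, payload):
        if not self.fcaps["enabled"]:         # manager gates the MICE's execution
            return None
        result = self.mice.execute(payload)
        return self.downstream.submit(result) if self.downstream else result

# A two-node managed workflow: parse an order, then total it.
totaler = DIME("totaler", lambda items: sum(items))
parser  = DIME("parser",  lambda text: [int(x) for x in text.split(",")], downstream=totaler)
for node in (parser, totaler):
    threading.Thread(target=node.manage, daemon=True).start()

print(parser.submit("3,4,5"))                 # -> 12
parser.signals.put({"enabled": False})        # an FCAPS signal gates further execution
time.sleep(0.1)
print(parser.submit("3,4,5"))                 # -> None (node disabled by its manager)
```

The essential point is that the management channel is parallel to, and independent of, the data path the workflow uses.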

Figure 5 shows a DIME network implementing Linux, Apache, MySQL and PHP/Perl/Python web services delivery and assurance infrastructure.

Figure 5: The GUI showing the configuration of a LAMP Cloud (Mikkilineni, Morana, Zito, Di Sano, 2012). Each Apache and DNS instance is DIME-aware, running in a DIME-aware Linux operating system which transforms a process into a managed element in the DIME network. A video describes the implementation of auto-failover, auto-scaling and performance management of the DIME-aware LAMP cloud.

Look Ma! No Hypervisor or VM in My Cloud (See Video)

The prototype implementations demonstrate a side effect of the DIME network architecture, which combines the computing and communication abstractions at an atomic level: it decouples services management from the underlying hardware infrastructure management. This makes it possible to implement highly resilient distributed transactions with auto-scaling, self-repair, state-aware migration, and self-protection – in short, end-to-end transaction FCAPS management – based on business priorities, workload fluctuations and latency constraints.  No hypervisors or VMs are required. The intelligent management of services workflows with resilient distributed transactions offers a new architecture for the data center infrastructure. For the first time it will be possible to stop embedding service management in the infrastructure management intelligence using myriad expensive appliances and software systems. It will be possible to design new tiers of dumb infrastructure pools (of servers, storage and network devices) with different latency and throughput characteristics, and the services will be able to manage themselves based on policies by requesting appropriate resources based on their specifications. They will be able to self-migrate when quality of service levels are not met. The case for dumb infrastructure on a stupid network with intelligent services management puts forth the following advantages:

  1. Separation of concerns: The network, storage and server hardware provides infrastructure management with signaling-enabled FCAPS management. It does not encapsulate service management as the current generation of equipment does.
  2. Specialization: The hardware is designed to meet specific latency and throughput characteristics to simplify its design through specialization. Different hardware with FCAPS management and signaling will provide plug-and-play components at run time.
  3. End-to-end service connection FCAPS management using the signaling network overlay allows dynamic service FCAPS management, facilitating self-repair, auto-scaling, self-protection, state-aware migration and end-to-end transaction security assurance.

Figure 6 shows an example design of a possible storage device using a simple storage architecture enabled with FCAPS management over a signaling overlay. It can easily be built with commercial off-the-shelf (COTS) hardware. This design allows separation of the services management from the storage device management and eliminates a host of storage management software systems, thus simplifying the data center infrastructure.

Figure 6: A gedanken design of an autonomic storage device and autonomic storage service deployment using the new DIME network architecture. The signaling overlay and FCAPS management are used to provide dynamic service management. Each service can request, using standard Linux OS services at run time, services from the storage device based on business priorities, workload fluctuations and latency constraints.

It is easy to see that the service connection model eliminates the need for clustering and provides new ways to provide transaction resilience with features such as service call forwarding, service call waiting, data broadcast, an 800-service call model, etc. It is equally easy to see, with many-core servers, how the DIME network architecture eliminates the inefficiencies of communication between Linux images within the same container (e.g., TCP/IP), and how simple SAS storage and flash storage can replace current generation appliance-based storage strategies and their myriad management systems. Looking at the trends, it is easy to see that a paradigm shift will soon be in play to transform the data centers from their current role of being just managed server, networking, and storage hosting centers (whether physical or virtual) to true service switching centers with telecom-grade trust. The emphasis will shift from resource switching and resource connection management to services switching and service connection management, thus avoiding the current efforts to replicate, inside the many-core servers, the complexity that exists inside the data center today. With the resulting decoupling of services management from the infrastructure management, the next generation data centers will perhaps be more like the central offices of the old telcos, switching service connections.

Obviously, the new computing model is in its infancy and requires participation from academicians who can validate or reject its theoretical foundation, VCs who can see beyond current approaches and are not satisfied with measuring data center efficiency by how many servers can be managed by a single administrator (as one Silicon Valley VC claimed as progress at the Server Design Summit), and architects who exploit new paradigms to disrupt the status quo. The DIME computing model, by allowing Linux processes to be converted into a DIME network transcending physical boundaries, allows easy migration from the current infrastructure to the new one without abandoning legacy applications, as the prototype of the LAMP cloud demonstrates.

In closing, I would like to point out that there have been many calls for a new computing model that combines computing and communication at the atomic computing element level, where the Turing machine falls short as discussed above. However, without high-bandwidth communication and exploitation of the parallelism that is abundant in the new generation of hardware, it is not very practical to seriously pursue such new computing models. It seems that the hardware advances have outpaced the software advances, and perhaps it is about time for computer scientists to take a second look at addressing the software shortfall in dealing with distributed transactions. As the following fable illustrates, it may be futile to look for parallel breakthrough solutions in a serial boat.

“When Master Foo and his student Nubi journeyed among the sacred sites, it was the Master’s custom in the evenings to offer public instruction to UNIX neophytes of the towns and villages in which they stopped for the night.  On one such occasion, a methodologist was among those who gathered to listen.  “If you do not repeatedly profile your code for hot spots while tuning, you will be like a fisherman who casts his net in an empty lake,” said Master Foo.
“Is it not, then, also true,” said the methodology consultant, “that if you do not continually measure your productivity while managing resources, you will be like a fisherman who casts his net in an empty lake?”
“I once came upon a fisherman who just at that moment let his net fall in the lake on which his boat was floating,” said Master Foo. “He scrabbled around in the bottom of his boat for quite a while looking for it.”  “But,” said the methodologist, “if he had dropped his net in the lake, why was he looking in the boat?”  “Because he could not swim,” replied Master Foo.
Upon hearing this, the methodologist was enlightened”        — Master Foo and the Methodologist
                                                                   (http://www.catb.org/esr/writings/unix-koans/methodology-consultant.html)

If you have transformational research results, or want to make a real difference in computer science research, see Call for Papers at:

www.workshop.kawaobjects.com and http://WETICE.org

Protected: Can Today’s Systems Administration Paradigm, Albeit Automation, Move Us To Telecom Grade “Trust” In The Clouds?
January 12, 2010
