From Being to Becoming: Function, Structure and Fluctuations – Incremental versus Leap-frog Innovation in Datacenters
February 16, 2015

“We grow in direct proportion to the amount of chaos we can sustain and dissipate” ― Ilya Prigogine, Order out of Chaos: Man’s New Dialogue with Nature

Abstract

According to Gartner “Alpha organizations aggressively focus on disruptive innovation to achieve competitive advantage. Characterized by unknowns, disruptive innovation requires business and IT leaders to go beyond traditional management techniques and implement new ground rules to enable success.”

While there is a lot of buzz about “game changing” technologies and “disruptive innovation”, real game changers and disruptive innovators are few and far between. Leap-frog innovation is more like a “phase transition” in physics. A system is composed of individual elements with a well-defined function that interact with each other and the external world through a well-defined structure. The system usually exhibits normal equilibrium behavior that is predictable, and when there are small fluctuations, incremental innovation allows the system to adjust itself and maintain the equilibrium predictably. Only when external forces inflict large or wildly unexpected fluctuations on the system is the equilibrium threatened, and the system then exhibits emergent behavior in which an unstable equilibrium introduces unpredictability into the evolution dynamics of the system. A phase transition occurs: the structure of the system is reconfigured through an architectural transformation, and order emerges from chaos.

The difference between “Kaizen” (incremental improvement) and “disruptive innovation” lies in dealing with a stable equilibrium subject to small fluctuations versus a meta-stable equilibrium subject to large-scale fluctuations. The current datacenter is in a similar transition from “being” to “becoming”, driven both by its hyper-scale structure and by the fluctuations that the hardware and software systems delivering business processes experience from rapidly changing business priorities on a global scale, workload fluctuations and latency constraints. Is the current von Neumann stored-program-control implementation of the Turing machine reaching its limit? Is the datacenter poised for a phase transition from current ad-hoc distributed computing practices to a new theory-driven self-* architecture? In this blog we discuss a non-von Neumann managed Turing oracle machine network with a control architecture as an alternative.

“From Being to Becoming” – What Does It Mean?

The representation of the dynamics of a physical system as a linear, reversible (hence deterministic), temporal order of states requires that, in a deep sense, physical systems never change their identities through time; hence they can never become anything radically new (e.g., they must at most merely rearrange their parts, parts whose being is fixed). However, as elements interact with each other and their environment, the system dynamics can change dramatically when large fluctuations in the interactions induce a structural transformation leading to chaos and the eventual emergence of a new order out of chaos. This is denoted as “becoming”. In short, the dynamics of near-equilibrium states with small-scale fluctuations in a system represent the “being”, while large deviations from the equilibrium, the emergence of an unstable equilibrium and the final restoration of order in a new equilibrium state represent the “becoming”. According to Plato, “being” is absolute, independent, and transcendent. It never changes and yet causes the essential nature of things we perceive in the world of “becoming”. The world of becoming is the physical world we perceive through our senses. This world is always in movement, always changing. The two aspects – the static structures and the dynamics of their evolution – are two sides of the same coin. Dynamics (becoming) represents time, and the static configuration at any particular instant represents the “being”. Prigogine applied this concept to understand the chemistry of matter, phase transitions and the like. Individual elements represent function, and the groups (constituting a system) represent structure with dynamics. Fluctuations caused by the interactions within the system and between the system and its environment drive the dynamics of the system and induce transitions from being to becoming. Thus, function, structure and fluctuations determine the system and its dynamics, defining the complexity, chaos and order.

Why is it Relevant to Datacenters?

Datacenters are dynamic systems where software working with hardware delivers information processing services that allow modeling, interaction, reasoning, analysis and control of the environment external to them. Figure 1 shows the hardware and software and their interactions with each other and with the external world. There are two distinct systems interacting with each other to deliver the intent of the datacenter, which is to execute specific computational workflows that model, monitor and control the external-world processes using the computing resources:

  1. Service workflows modeling the process dynamics of the system depicting the external world and its interactions. Usually this consists of the functional requirements of the system under consideration, such as business logic, sensor and actuator monitoring and control (the computed), etc. The model consists of various functions captured in a structure (e.g., a directed acyclic graph, DAG) and its evolution in time; a minimal sketch of such a workflow model appears after this list. This model does not include the computing resources required to execute the process dynamics. It is assumed that the resources (CPU, memory, time, etc.) will be available for the computation.
  2. The non-functional requirements that address the resources needed to execute the functions as a function of time and of fluctuations, both in the interactions in the external world and in the computing resources available to accomplish the intent defined in the functional requirements. The computation as implemented in the von Neumann stored-program-control model of the Turing machine requires time (impacted by CPU speed, network latency, bandwidth, storage IOPs, throughput and capacity) and memory. The computing model assumes unbounded resources, including time, for completing the computation. Today, these resources are provided by a cluster of servers and other devices containing multi-core CPUs and memory, networked with different types of storage. The computations are executed in the server or device by allocating the resources using an operating system, which is itself software that mediates the resources to the various computations.
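As referenced in item 1 above, the separation between the functional model (a DAG of functions) and the non-functional resource requirements can be made concrete with a small sketch. This is a minimal illustration using hypothetical Python structures, not an implementation from this post; all class and field names are assumptions.

```python
# Minimal sketch (hypothetical names): a service workflow as a DAG of functions,
# with the non-functional resource requirements kept separate from the business logic.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ResourceSpec:                      # non-functional needs of one node
    cpu_cores: float = 1.0
    memory_gb: float = 1.0
    storage_iops: int = 0
    max_latency_ms: float = 100.0


@dataclass
class WorkflowNode:                      # functional requirement: what to compute
    name: str
    compute: Callable[[dict], dict]      # the business logic ("the computed")
    resources: ResourceSpec = field(default_factory=ResourceSpec)


@dataclass
class Workflow:                          # the DAG: nodes plus directed edges
    nodes: Dict[str, WorkflowNode]
    edges: Dict[str, List[str]]          # node name -> downstream node names

    def topological_order(self) -> List[str]:
        """Order in which the functions can be executed (simple DFS-based sort)."""
        visited, order = set(), []

        def visit(name: str) -> None:
            if name in visited:
                return
            visited.add(name)
            for downstream in self.edges.get(name, []):
                visit(downstream)
            order.append(name)

        for name in self.nodes:
            visit(name)
        return list(reversed(order))
```

The only point of the sketch is that the business logic (compute) and the resource specification (ResourceSpec) live in separate structures, which is what allows a separate mechanism to reason about resources without touching the workflow itself.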

On the right-hand side of Figure 1, we depict the computing resources required to execute the functions in a given structure, whether distributed or not. In the middle, we represent the application workflows composed of various components constituting an application area network (AAN) that is executed on a distributed computing cluster (DCC) made up of the hardware resources with specified service levels (CPU, memory, network bandwidth, cluster latency, storage IOPs, throughput and capacity). The left-hand side shows a desired end-to-end process configuration and evolution monitoring and control mechanism. When all is said and done, the process workflows need to execute various functions using the computing resources made available in the form of a distributed cluster providing the required CPU, memory, network bandwidth, latency, storage IOPs, throughput and capacity. The structure is determined by the non-functional requirements such as resource availability, performance, security and cost. Fluctuations evolve the process dynamics and require adjusting the resources to meet the needs of applications coping with those fluctuations.

Figure 1: Decoupling service orchestration and infrastructure orchestration to deliver function, structure and dynamic process flow to address the fluctuations both in resource availability and service demand


There are two ways to match the available resources to the computing nodes, connected by links, that execute the business process dynamics. The first approach is the current state of the art, and the second is an alternative approach based on extensions to the current von Neumann stored-program implementation of the Turing machine.

Current State of the Art

The infrastructure is infused with intelligence about various applications and their evolving needs and adjusts the resources (the time of computation, affected by CPU, network bandwidth, latency, storage capacity, throughput and IOPs, and the memory required for the computation). Current IT has evolved from a model where the resources are provisioned anticipating the peak workloads and the structure of the application network is optimized for coping with deviations from equilibrium. Conventional computing models using physical servers (often referred to as bare metal) cannot cope with wild fluctuations when new-server provisioning times are much longer than the time it takes for fluctuations to set in, and when their magnitude is not predictable enough to pre-plan the provisioning of additional resources. Virtualization of the servers and on-demand provisioning of virtual machines reduce provisioning times substantially, enabling auto-scaling, auto-failover and live migration across distributed resources using virtual machine image mobility. However, this comes at a price:

    1. The virtual image is still tied to the infrastructure (the network, storage and computing resources supporting the VM), and moving a VM involves manipulating a multitude of distributed resources, often owned or operated by different owners, and touching many infrastructure management systems, which increases the complexity and cost of management.
    2. If the distributed infrastructure is homogeneous and supports VM mobility, the solution is simpler but forces vendor lock-in and does not allow taking advantage of commodity infrastructure offered by multiple suppliers.
    3. If the distributed infrastructure is heterogeneous, VM mobility must depend on myriad management systems, and most often these management systems themselves need other management systems to manage their resources.
    4. VM mobility and management also increase bandwidth and storage requirements and drive a proliferation of point solutions and tools for moving across heterogeneous distributed infrastructure, which increases operational complexity and adds cost.

The current state of the art, based on the mobility of VMs and infrastructure orchestration, is summarized in Figure 2.


Figure 2: The infrastructure orchestration based on second guessing the application quality of service requirements and its dynamic behavior

Figure 2 clearly shows the futility of orchestrating service availability, performance, compliance, cost and security in a very distributed and heterogeneous environment where scale and fluctuations dominate. The cost and complexity of navigating multiple infrastructure service offerings often outweigh the benefits of commodity computing. It is one reason enterprises complain that 70% of their budget is often spent on keeping the service lights on.

Alternative Approach: A Clean Separation of Business Logic Implementation and the Operational Realization of Non-functional Requirements

Another approach is to decouple application and business process workflow management from distributed infrastructure mobility by placing the applications in the right infrastructure with the right resources, monitoring the evolution of the applications, and proactively managing the infrastructure to add or remove resources with predictability based on history. Based on the recovery point objective (RPO) and recovery time objective (RTO), the application structure is adjusted to create active/passive or active/active nodes to manage application QoS and workflow/business process QoS. This approach requires a top-down method of business process implementation, with a specification of the business process intent followed by a hierarchical and temporal specification of process dynamics with the context, constraints, communication and control of the group and its constituents, and the initial conditions for the equilibrium quality of service (QoS). The details include:

  1. Non-functional requirements that specify availability, performance, security, compliance and cost constraints, with the policies specified as hierarchical and temporal process flows. The intent at a higher level is translated into the downstream intents of the computing nodes contributing to the workflow.
  2. A structure, distributed or otherwise, of a network of networks providing the computing nodes with specified SLAs for resources (CPU, memory, network bandwidth, latency, storage IOPs, throughput and capacity).
  3. A method to implement autonomic behavior with visibility and control of application components so that they can be managed with the defined policies (a minimal sketch of such a policy loop appears after this list). When scale and fluctuations demand a change in the structure to transition to a new equilibrium state, the policy implementation processes proactively add or subtract computing nodes, or find existing nodes, to replicate, repair, recombine or reconfigure the application components. The structural change implements the transition from being to becoming.
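A minimal sketch of what such a hierarchical intent specification and policy check might look like, using hypothetical Python structures. The field names, thresholds and the mapping of deviations to the replicate/repair/recombine/reconfigure abstractions named above are assumptions for illustration, not an implementation of the approach described here.

```python
# Minimal sketch (hypothetical structures and thresholds): a hierarchical intent and a
# policy check that maps an observed QoS deviation to a structural action.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Intent:                        # non-functional requirements at one level of the hierarchy
    availability: float              # e.g., 0.999
    max_latency_ms: float
    max_cost_per_hour: float
    children: List["Intent"] = field(default_factory=list)  # downstream intents of contributing nodes


@dataclass
class ObservedQoS:
    availability: float
    latency_ms: float
    cost_per_hour: float


def reconcile(intent: Intent, observed: ObservedQoS) -> str:
    """Return the structural action implied by the policy for this level."""
    if observed.availability < intent.availability:
        return "replicate"           # add an active/passive or active/active node
    if observed.latency_ms > intent.max_latency_ms:
        return "reconfigure"         # re-place components or resize their resources
    if observed.cost_per_hour > intent.max_cost_per_hour:
        return "recombine"           # consolidate onto fewer nodes
    return "steady"                  # equilibrium: no structural change needed
```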

A New Architecture to Accommodate Scale and Fluctuations: Toward the Oneness of the Computer and the Computed

There is a fundamental reason why the current Turing/von Neumann stored-program computing model cannot address large-scale distributed computing with fluctuations, both in resources and in computation workloads, without increasing complexity and cost (Mikkilineni et al. 2012). As von Neumann put it, “It is a theorem of Gödel that the description of an object is one class type higher than the object.” An important implication of Gödel’s incompleteness theorem is that it is not possible to have a finite description with the description itself as a proper part. In other words, it is not possible to read yourself or process yourself as a process. In short, Gödel’s theorems prohibit “self-reflection” in Turing machines. According to Alan Turing, Gödel’s theorems show “that every system of logic is in a certain sense incomplete, but at the same time it indicates means whereby from a system L of logic a more complete system L′ may be obtained. By repeating the process we get a sequence L, L1 = L′, L2 = L1′ … each more complete than the preceding. A logic Lω may then be constructed in which the provable theorems are the totality of theorems provable with the help of the logics L, L1, L2, … Proceeding in this way we can associate a system of logic with any constructive ordinal. It may be asked whether such a sequence of logics of this kind is complete in the sense that to any problem A there corresponds an ordinal α such that A is solvable by means of the logic Lα.”

This observation, along with his introduction of the oracle machine, influenced many theoretical advances, including the development of generalized recursion theory that extended the concept of an algorithm. “An o-machine is like a Turing machine (TM) except that the machine is endowed with an additional basic operation of a type that no Turing machine can simulate.” Turing called the new operation the ‘oracle’ and said that it works by ‘some unspecified means’. When the Turing machine is in a certain internal state, it can query the oracle for an answer to a specific question and act accordingly depending on the answer. The o-machine is a generalization of the Turing machine for exploring ways to address the impact of Gödel’s incompleteness theorems and problems that are not explicitly computable but are limit-computable using relative reducibility and relative computability.

According to Mark Burgin, an information processing system (IPS) “has two structures—static and dynamic. The static structure reflects the mechanisms and devices that realize information processing, while the dynamic structure shows how this processing goes on and how these mechanisms and devices function and interact.”

The software contains the algorithms (à la the Turing machine) that specify the information processing tasks, while the hardware provides the resources required to execute the algorithms. The static structure is defined by the association of software and hardware devices, and the dynamic structure is defined by the execution of the algorithms. The meta-knowledge of the intent of the algorithm, the association of a specific algorithm execution with a specific device, and the temporal evolution of information processing and exception handling when the computation deviates from the intent (be it because of software behavior, hardware behavior or their interaction with the environment) lies outside the software and hardware design and is expressed in the non-functional requirements. Mark Burgin calls this the Infware, which contains the description and specification of the meta-knowledge and can also be implemented using hardware and software to enforce the intent with appropriate actions.

Implementing the Infware using Turing machines reintroduces the same dichotomy mentioned by Turing and leads to the manager-of-managers conundrum. This is consistent with the observation of Cockshott et al. (2012): “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.”

The goals of the distributed system determine the resource requirements and computational process definition of the individual service components based on their priorities, workload characteristics and latency constraints. The overall system resiliency, efficiency and scalability depend upon the individual service component workloads and the latency characteristics of their interconnections, which in turn depend on the placement of these components (configuration) and the available resources. Resiliency (fault, configuration, accounting, performance and security, often denoted FCAPS) is measured with respect to a service’s tolerance to faults, fluctuations in contention for resources, performance fluctuations, security threats and changing system-wide priorities. Efficiency depicts optimal resource utilization. Scaling addresses end-to-end resource provisioning and management with respect to increasing the number of computing elements required to meet service needs.

A possible solution to address resiliency with respect to scale and fluctuations is an application network architecture, based on increasing the intelligence of the computing nodes, which was presented at the Turing Centenary Conference (2012) for improving the resiliency, efficiency and scaling of information processing systems. In its essence, the distributed intelligent managed element (DIME) network architecture extends the conventional computational model of information processing networks, allowing improvement of the efficiency and resiliency of computational processes. This approach is based on organizing the process dynamics under the supervision of intelligent agents. The DIME network architecture utilizes the DIME computing model, a non-von Neumann parallel implementation of a managed Turing machine with a signaling network overlay, and adds cognitive elements to evolve super-recursive information processing. The DIME network architecture introduces three key functional constructs to enable process design, execution, and management to improve both the resiliency and the efficiency of application area networks delivering distributed service transactions using both software and hardware (Burgin and Mikkilineni):

  1. Machines with an Oracle: Executing an algorithm, the DIME basic processor P performs the {read -> compute -> write} instruction cycle or its modified version, the {interact with a network agent -> read -> compute -> interact with a network agent -> write} instruction cycle (a minimal sketch of this managed cycle appears after this list). This allows different network agents to influence the further evolution of the computation while the computation is still in progress. We consider three types of network agents: (a) a DIME agent, (b) a human agent, and (c) an external computing agent. It is assumed that a DIME agent knows the goal and intent of the algorithm (along with the context, constraints, communications and control of the algorithm) that the DIME basic processor is executing, and has visibility of the available resources and of the needs of the basic processor as it executes its tasks. In addition, the DIME agent has knowledge of the alternate courses of action available to facilitate the evolution of the computation toward its goal and the realization of its intent. Thus, every algorithm is associated with a blueprint (analogous to a genetic specification in biology), which provides the knowledge required by the DIME agent to manage the process evolution. An external computing agent is any computing node in the network with which the DIME unit interacts.
  2. Blueprint- or policy-managed fault, configuration, accounting, performance and security monitoring and control (FCAPS): The DIME agent, which uses the blueprint to configure, instantiate, and manage the DIME basic processor executing the algorithm, employs concurrent DIME basic processors (with their own blueprints specifying their evolution) to monitor the vital signs of the managed DIME basic processor, and implements various policies to assure non-functional requirements such as availability, performance, security and cost management while the managed DIME basic processor is executing its intent. This approach integrates the evolution of the execution of an algorithm with concurrent management of the available resources to assure the progress of the computation.
  3. DIME network management control overlay over the managed Turing oracle machines: In addition to read/write communication of the DIME basic processor (the data channel), other DIME basic processors communicate with each other using a parallel signaling channel. This allows the external DIME agents to influence the computation of any managed DIME basic processor in progress based on the context and constraints. The external DIME agents are DIMEs themselves. As a result, changes in one computing element could influence the evolution of another computing element at run time without halting its Turing machine executing the algorithm. The signaling channel and the network of DIME agents can be programmed to execute a process, the intent of which can be specified in a blueprint. Each DIME basic processor can have its own oracle managing its intent, and groups of managed DIME basic processors can have their own domain managers implementing the domain’s intent to execute a process. The management DIME agents specify, configure, and manage the sub-network of DIME units by monitoring and executing policies to optimize the resources while delivering the intent.
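To make the first construct concrete, here is a minimal sketch of the modified instruction cycle {interact with a network agent -> read -> compute -> interact with a network agent -> write}, written as hypothetical Python classes. It only illustrates the idea that an agent holding the blueprint is consulted before and after each compute step over a channel separate from the data channel; it is not the DNA implementation.

```python
# Minimal sketch (hypothetical classes, not the DNA implementation): the modified cycle
# {interact with a network agent -> read -> compute -> interact with a network agent -> write},
# where the agent holding the blueprint can influence the computation while it is in progress.
class DimeAgent:
    """Holds the blueprint: intent, context, constraints, communication and control."""

    def pre_check(self, task):
        # e.g., verify resources are available and decide whether the task should proceed
        return {"proceed": True}

    def post_check(self, task, result):
        # e.g., verify the result conforms to the intent before it is committed
        return {"commit": True}


class DimeBasicProcessor:
    def __init__(self, agent, read, compute, write):
        self.agent, self.read, self.compute, self.write = agent, read, compute, write

    def cycle(self, task):
        if not self.agent.pre_check(task)["proceed"]:      # interact with a network agent
            return None
        data = self.read(task)                             # read
        result = self.compute(data)                        # compute
        if self.agent.post_check(task, result)["commit"]:  # interact with a network agent
            self.write(task, result)                       # write
        return result
```

The essential design choice the sketch tries to show is that the agent interactions happen outside the read/write data channel, so the computation can be redirected or halted without changing the algorithm itself.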

The result is a new computing model, a management model and a programming model which infuse self-awareness, via an intelligent Infware, into a group of software components deployed on a distributed cluster of hardware devices, while enabling the monitoring and control of the dynamics of the computation so that it conforms to the intent of the computational process. The DNA-based control architecture configures the software and hardware components appropriately to execute the intent. As the computation evolves, the control agents monitor the evolution and make appropriate adjustments to maintain an equilibrium conforming to the intent. When the fluctuations create conditions for an unstable equilibrium, the control agents reconfigure the structure to create a new equilibrium state that conforms to the intent, based on policies.

Figure 3 shows the Infware, hardware and software executing a web service using DNA.


Figure 3: Hardware and software networks with a process control Infware orchestrating the life-cycle evolution of a web service deployed on a Distributed Computing Cluster

The hardware components are managed dynamically to configure an elastic distributed computing cluster (DCC) to provide the required resources to execute the computations. The software components are organized as managed Turing oracle machines with a control architecture to create AANs that can be monitored and controlled to execute the intent using the network management abstractions of replication, repair, recombination and reconfiguration. With DNA, the datacenters are able to evolve from being to becoming.

It is important to note that DNA has been implemented (Mikkilineni et al. 2012, 2014) to demonstrate a couple of functions that cannot be accomplished today with the current state of the art:

  1. Migrating a workflow being executed in a physical server (a web service transaction including a web server, application server and a database) to another physical server without a reboot or losing transactions to maintain recovery time and recovery point objectives. No virtual machines are required although they can be used just as if they were bare-metal servers.
  2. Providing workflow auto-scaling, auto-failover and live migration with retention of application state using distributed computing clusters with heterogeneous infrastructure (bare-metal servers, private and public clouds etc.) without requiring infrastructure orchestration to accomplish them (e.g., without moving virtual machine images or LXC container-based images).

The approach using DNA allows the implementation of the above functions without requiring changes to existing applications, OSs or current infrastructure because the architecture non-intrusively extends the current Turing computing model to a managed Turing oracle machine network with control network overlay. It is not a coincidence that similar abstractions are present in how cellular organisms, human organizations and telecommunication networks self-govern and deliver the intent of the system (Mikkilineni 2012).

Only time will tell if the DNA implementation of Infware is an incremental or leap-frog innovation.

Acknowledgements

This work originated from discussions started in IEEE WETICE 2009 to address the complexity, security and compliance issues in Cloud Computing. The work of Dr. Giovanni Morana, the C3DNA Team and the theoretical insights from Professor Eugene Eberbach, Professor Mark Burgin and Pankaj Goyal are behind the current implementation of DNA.

Many-core Servers, Solid State Drives, and High-Bandwidth Networks: The Hardware Upheaval, The Software Shortfall, The Future of IT and all that Jazz
August 2, 2014

The “gap between the hardware and the software of a concrete computer and even greater gap between pure functioning of the computer and its utilization by a user, demands description of many other operations that lie beyond the scope of a computer program, but might be represented by a technology of computer functioning and utilization”

Mark S. Burgin, “Super Recursive Algorithms”, Springer, New York, 2005

Introduction

According to Holbrook (Holbrook 2003), “Specifically, creativity in all areas seems to follow a sort of dialectic in which some structure (a thesis or configuration) gives way to a departure (an antithesis or deviation) that is followed, in turn, by a reconciliation (a synthesis or integration that becomes the basis for further development of the dialectic). In the case of jazz, the structure would include the melodic contour of a piece, its harmonic pattern, or its meter…. The departure would consist of melodic variations, harmonic substitutions, or rhythmic liberties…. The reconciliation depends on the way that the musical departures or violations of expectations are integrated into an emergent structure that resolves deviation into a new regularity, chaos into a new order, surprise into a new pattern as the performance progresses.”

Current IT, in this jazz metaphor, evolved from a thesis, is currently experiencing an antithesis, and is ripe for a synthesis that would blend the old and the new into a harmonious melody to create a new generation of highly scalable, distributed, secure services with the desired availability, cost and performance characteristics to meet changing business priorities, highly fluctuating workloads and latency constraints.

The Hardware Upheaval and the Software Shortfall

There are three major factors driving the datacenter traffic and their patterns:
1. A multi-tier architecture which determines the availability, reliability, performance, security and cost of initiating a user transaction to an end-point and delivering that service transaction to the user. The composition and management of the service transaction involves both the north-south traffic from end-user to the end-point (most often over the Internet) and the east-west traffic that flows through various service components such as DMZ servers, web servers, application servers and databases. Most often these components exist within the datacenter or connected through a WAN to other datacenters. Figure 1 shows a typical configuration.

Figure 1: Service Transaction Delivery Network

The transformation from client-server architectures to a “composed service” model, along with virtualization of servers allowing the mobility of virtual machines at run-time, is introducing new traffic patterns in which east-west traffic inside the datacenter grows by orders of magnitude compared to the north-south traffic going from the end-user to the service end-point and vice versa. Traditional applications that evolved from client-server architectures use TCP/IP for all the traffic that goes across servers. While some optimizations attempt to improve performance for applications that span servers using high-speed network technologies such as InfiniBand, Ethernet etc., TCP/IP and socket communications still dominate, even among virtual servers within the same physical server.

2. The advent of many-core servers with tens and even hundreds of computing cores, with high-bandwidth communication among them, drastically alters the traffic patterns. When two applications are using two cores within a processor, the communication between them is not very efficient if it uses socket communication and the TCP/IP protocol instead of shared memory. When the two applications are running in two processors within the same server, it is more efficient to use PCI Express or other high-speed bus protocols instead of socket communication using TCP/IP. If the two applications are running in two servers within the same datacenter, it is more efficient to use Ethernet or InfiniBand. With the advent of mobility of applications using containers or even virtual machines, it is more efficient to switch the communication mechanism based on the context of where they are running (a minimal sketch of such context-sensitive transport selection follows Figure 2). This context-sensitive switching is a better alternative to replicating current VLAN and socket communications inside the many-core server. It is important to recognize that the many-core servers and processors constitute a network where each node is itself a sub-network with different bandwidths and protocols (socket-based, lower-bandwidth communication between servers; InfiniBand or PCI Express bus based communication across processors in the same server; and shared-memory-based, low-latency communication across the cores inside a processor). Figure 2 shows the network of networks using many-core processors.

Figure 2: A Network of Networks with Multiple Protocols
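As noted above, the context-sensitive switching can be reduced to a simple selection rule. The sketch below uses hypothetical placement metadata; the transport names and fields are assumptions used only to illustrate the selection logic described in item 2.

```python
# Minimal sketch (hypothetical placement metadata): pick the communication mechanism
# based on where the two communicating components are actually running.
from dataclasses import dataclass


@dataclass
class Placement:
    datacenter: str
    server: str
    processor: int


def select_transport(a: Placement, b: Placement) -> str:
    if a.datacenter != b.datacenter:
        return "tcp/ip"                  # only needed to reach another datacenter or the Internet
    if a.server != b.server:
        return "ethernet-or-infiniband"  # different servers in the same datacenter
    if a.processor != b.processor:
        return "pcie-bus"                # different processors in the same server
    return "shared-memory"               # different cores in the same processor


# Example: two components on different processors of the same server -> "pcie-bus"
print(select_transport(Placement("dc1", "s1", 0), Placement("dc1", "s1", 1)))
```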

3. Many-core servers with a new class of flash memory and high-bandwidth networks offer a new architecture for services creation, delivery and assurance, going far beyond the current infrastructure-centric service management systems that have evolved from single-CPU and low-bandwidth origins. Figure 3 shows a potential architecture where many-core servers are connected with high-bandwidth networks that obviate the need for the current complex web of infrastructure technologies and their management systems. The many-core servers, each with huge solid-state drives, SAS-attached inexpensive disks, and optical switching interfaces connected to WAN routers, offer a new class of services architecture if only the current software shortfall is addressed to match the hardware advances in server, network and storage devices.

Figure 3: If Server is the Cloud, What is the Service Delivery Network?

This would eliminate the current complexity, much of which involves dealing with TCP/IP for east-west traffic and with infrastructure-based service delivery and management systems to assure availability, reliability, performance, cost and security. For example, current security mechanisms that evolved from TCP/IP communications make little sense for east-west traffic, while emerging container-based architectures with layer 7 switching and routing, independent of server and network security, offer new efficiencies and better security compliance.

The current evolution of commodity clouds and distributed virtual datacenters, while providing on-demand resource provisioning, auto-failover, auto-scaling and live migration of virtual machines, is still tied to the IP address and the associated complexity of dealing with infrastructure management in distributed environments to assure end-to-end service transaction quality of service (QoS).

Figure 4: The QoS Gap

This introduces either vendor lock-in, which precludes the advantages of commodity hardware, or complexity in dealing with a multitude of distributed infrastructures and their management to tune the service transaction QoS. Figure 4 shows the current state of the art. One can quibble whether it includes every product available or whether they are depicted correctly to represent their functionality, but the general picture describes the complexity and/or vendor lock-in dilemma. The important point to recognize is that the service transaction QoS depends on tuning the SLAs of distributed resources at run-time across multiple infrastructure owners with disparate management systems and incentives. QoS tuning of service transactions is not scalable without increasing cost and complexity if it depends on tuning the distributed infrastructure with a multitude of point solutions and myriad infrastructure management systems.

What Enterprise IT Wants

There are three business drivers that are at the heart of the Enterprise Quest for an IT framework:

  • Compression of Time-to-Market: Proliferation of mobile applications, social networking, and web-based communication, collaboration and commerce are increasing the pressure on enterprise IT to support a rapid service development, deployment and management processes. Consumer facing services are demanding quick response to rapidly changing workloads and the large-scale computing, network and storage infrastructure supporting service delivery requires rapid reconfiguration to meet the fluctuations in workloads and infrastructure reliability, availability, performance and security.
  • Compression of Time-to-Fix: With consumers demanding “always-on” services supporting choice, mobility and collaboration, the availability, performance and security of the end-to-end service transaction are at a premium, and IT is under great pressure to respond by compressing the time to fix the “service” regardless of which infrastructure is at fault. In essence, the business is demanding the deployment of reliable services on not-so-reliable distributed infrastructure.
  • Cost reduction of IT operation and management, which currently consumes about 60% to 70% of the IT budget just to keep the “service lights” on: The current service administration and management paradigm, which originated with server-centric and low-bandwidth network architectures, is resource-centric and assumes that the resources (CPU, memory, network bandwidth, latency, storage capacity, throughput and IOPs) allocated to an application at install time can be changed to meet rapidly changing workloads and business priorities in real time. The current state of the art uses virtual servers, networks and storage that can be dynamically provisioned using software APIs. Thus the application and service (a group of applications providing a service transaction) QoS (quality of service, defining the availability, performance, security and cost) can be tuned by dynamically reconfiguring the infrastructure. There are three major issues with this approach:

  1. With a heterogeneous, distributed and multi-vendor infrastructure, tuning the infrastructure requires myriad point solutions, tools and integration packages to monitor current utilization of the resources by the service components, correlate and reason to define the actions required, and coordinate many distributed infrastructure management systems to reconfigure the resources.

  2. In order to provide high availability and disaster recovery (HA/DR), recent attempts to move virtual machines (VMs) introduce additional issues with IP mobility, firewall reconfiguration, VM sprawl and the associated runaway VM images, bandwidth and storage management.

  3. Public clouds and software as a service have worked well for new application development, for non-mission-critical applications, and for applications that can be re-architected to optimize for the cloud APIs and leverage available application/service components; however, they add cost when IT migrates existing mission-critical applications that demand high security, performance and low latency. The suggested hybrid solutions require adopting new cloud architectures in the datacenter or using myriad orchestration packages that add complexity and tool fatigue.

In order to address the need to compress time to market and time to fix and to reduce the complexity, enterprises small and big are desperately looking for solutions.

The lines of business owners want:

  • End-to-end visibility and control of service QoS independent of who provides the infrastructure
  • Availability, performance and security governance based on policies
  • Accounting of resource utilization and dynamic resolution of contention for resources
  • Application architecture decoupled from infrastructure while still enabling continuous availability (or decouple functional requirements execution from non-functional requirement compliance)

IT wants to provide the application developers:

  • Application architecture decoupled from infrastructure by separating functional and non-functional requirements so that the application developers focus on business functions while deployment and operations are adjusted at run-time based on business priorities, latency constraints and workload fluctuations
  • Provide cloud-like services (on-demand provisioning of applications, self-repair, auto-scaling, live-migration and end-to-end security) at the service level instead of at the infrastructure level, so that they can leverage their own datacenter resources or the commodity resources abundant in public clouds without depending on cloud architectures, vendor APIs and cloud management systems.
  • Provide a suite of applications as a service (databases, queues, web servers etc.)
  • Service composition schemes that allow developers to reuse components
  • Instrumentation to monitor service component QoS parameters (independent from infrastructure QoS parameters) to implement policy compliance
  • When problems occur, provide component run-time QoS history to developers

IT wants to have the:

  • Ability to use local infrastructure or on demand cloud or managed out-sourced infrastructure
  • Ability to use secure cloud resources without cloud management API or cloud architecture dependence
  • Ability to provide end to end service level security independent of server and network security deployed to manage distributed resources
  • Ability to provide end-to-end service QoS visibility and control (on-demand service provisioning, auto-failover, auto-scaling, live migration and end-to-end security) across distributed physical or virtual servers in private or public infrastructure
  • Ability to reduce complexity and eliminate point solutions and myriad tools to manage distributed private and public infrastructure

Application Developers want:

  • To focus on developing service components, test them in their own environments and publish them in a service catalogue for reuse
  • Ability to compose services, test and deploy them in their own environments, and publish them in the service catalogue ready to deploy anywhere
  • Ability to specify the intent, context, constraints, communication, and control aspects of the service at run-time for managing non-functional requirements
  • An infrastructure that uses the specification to manage the run-time QoS with on-demand service provisioning on appropriate infrastructure (a physical or virtual server with appropriate service level assurance, SLA), manage run-time policies for fail-over, auto-scaling, live-migration and end-to-end security to meet run-time changes in business priorities, workloads and latency constraints.
  • Separation of run-time safety and survival of the service from sectionalizing, isolating, diagnosing and fixing at leisure
  • Get run-time history of service component behavior and ability to conduct correlated analysis to identify problems when they occur.

We need to discover a path to bridge current IT to the new IT without changing applications, OSs or the current infrastructure, while providing a way to migrate to a new IT where service transaction QoS management is truly decoupled from the myriad distributed infrastructure management systems. This is not going to happen with current ad-hoc programming approaches. We need a new, or at least an improved, theory of computing.

As Cockshott et al. (2012) point out, current computing, management and programming models fall short when you try to include the computer and the computed in the same model.

“the key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.”

There are emerging technologies that might just provide the synthesis (the reconciliation depends on the way the architectural departures or violations of expectations are integrated into an emergent structure that resolves deviation into a new regularity, chaos into a new order, surprise into a new pattern as the transformation progresses) required to build the harmony by infusing cognition into computing. Only the future will tell if this architecture is expressive enough, and as efficient, as Mark Burgin claims in his elegant book “Super Recursive Algorithms” quoted above.

Is information technology poised for a renaissance (with a synthesis), now that the great masters (Turing, von Neumann, Shannon etc.) have developed the original thesis, that will take us beyond the current distributed-cloud-management antithesis?

The IEEE WETICE2015 International conference track on “the Convergence of Distributed Clouds, GRIDs and their Management” to be held in Cyprus next June (15 – 18) will address some of these emerging trends and attempt to bridge the old and the new.

24th IEEE International Conference on Enabling Technologies: Infrastructures for Collaborative Enterprises (WETICE2015) in Larnaca, Cyprus

 

For the state of the emerging science, please see “Infusing Self-Awareness into Turing Machine – A Path to Cognitive Distributed Computing”, presented at WETICE2014 in Parma, Italy.

 

References:

Holbrook, Morris B. (2003). “Adventures in Complexity: An Essay on Dynamic Open Complex Adaptive Systems, Butterfly Effects, Self-Organizing Order, Coevolution, the Ecological Perspective, Fitness Landscapes, Market Spaces, Emergent Beauty at the Edge of Chaos, and All That Jazz.” Academy of Marketing Science Review [Online] 2003 (6). Available: http://www.amsreview.org/articles/holbrook06-2003.pdf

Cockshott, P., MacKenzie, L. M., and Michaelson, G. (2012). Computation and its Limits, Oxford University Press, Oxford.

A Path Toward Intelligent Services using Dumb Infrastructure on Stupid, Fat Networks?
November 3, 2013

“The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.”

—  P. Cockshott, L. M. MacKenzie and  G. Michaelson, Computation and its Limits, Oxford University Press, Oxford 2012, p 215

“The test of a framework, however, is not what one can say about it in principle but what one can do with it”

—  Andrew Wells, Rethinking Cognitive Computation: Turing and the Science of the Mind, Palgrave, Macmillan, 2006.

Summary

The “Convergence of Clouds, Grids and their Management” conference track is devoted to discussing current and emerging trends in virtualization, cloud computing, high-performance computing, Grid computing and cognitive computing. The tradition that started in WETICE2009 “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short term profit driven motives of a particular corporate entity” has resulted in a new computing model that was included in the Turing Centenary Conference proceedings in 2012. More recently, a product based on these ideas was discussed in the 2013 Open Server Summit (www.serverdesignsummit.com), where many new ideas and technologies were presented to exploit the new generation of many-core servers, high-bandwidth networks and high-performance storage. We present here some thoughts on current trends which we hope will stimulate further research to be discussed in the WETICE 2014 conference track in Parma, Italy (http://wetice.org).

Introduction

Current IT datacenters have evolved from their server-centric, low-bandwidth origins to distributed and high-bandwidth environments where resources can be dynamically allocated to applications using computing, network and storage resource virtualization. While Virtual machines improve resiliency and provide live migration to reduce the recovery time objectives in case of service failures, the increased complexity of hypervisors, their orchestration, Virtual Machine images and their movement and management adds an additional burden in the datacenter.

Further automation trends continue to move toward static applications (locked in a virtual machine, often as one application per virtual machine) on a dynamic infrastructure (virtual servers, virtual networks, virtual storage, virtual image managers etc.). The safety and survival of applications, and of the end-to-end service transactions delivered by groups of applications, are managed by dynamically monitoring and controlling the resources at run-time in real time. As services migrate to distributed environments where the applications contributing to a service transaction are deployed in different datacenters and public or private clouds, often owned by different providers, resource management across distributed resources is provided using myriad point solutions and tools that monitor, orchestrate and control these resources. A new call for application-centric infrastructure proposes that the infrastructure provide (http://blogs.cisco.com/news/application-centric-infrastructure-a-new-era-in-the-data-center/):

  • Application Velocity (Any workload, anywhere): Reducing application deployment time through a fully automated and programmatic infrastructure for provisioning and placement. Customers will be able to define the infrastructure requirements of the application, and then have those requirements applied automatically throughout the infrastructure.
  • A common platform for managing physical, virtual and cloud infrastructure: The complete integration across physical and virtual, normalizing endpoint access while delivering the flexibility of software and the performance, scale and visibility of hardware across multi-vendor, virtualized, bare metal, distributed scale out and cloud applications
  • Systems Architecture: A holistic approach with the integration of infrastructure, services and security along with the ability to deliver simplification of the infrastructure, integration of existing and future services with real time telemetry system wide.
  • Common Policy, Management and Operations for Network, Security, Applications: A common policy management framework and operational model driving automation across Network, Security and Application IT teams that is extensible to compute and storage in the future.
  • Open APIs, Open Source and Multivendor: A broad ecosystem of partners who will be empowered by a comprehensive published set of APIs and innovations contributed to open source.
  • The best of Custom and Merchant Silicon: To provide highly scalable, programmatic performance, low-power platforms and optics innovations that protect investments in existing cabling plants, and optimize capital and operational expenditures.

Perhaps this approach will work in a utopian IT landscape where either the infrastructure is provided by a single vendor or universal standards force all infrastructures to support a common API. Unfortunately, the real world evolves in a diverse, heterogeneous and competitive environment, and what we are left with is a strategy that cannot scale and lacks end-to-end service visibility and control. End-to-end security becomes difficult to assure because of the myriad security management systems that control distributed resources. The result is open-source systems that attempt to fill this niche. Unfortunately, in a highly networked world where multiple infrastructure providers offer a plethora of diverse technologies that evolve at a rapid rate to absorb high-paced innovations, orchestrating the infrastructure to meet the changing workload requirements that applications must deliver is a losing battle. The complexity and tool fatigue resulting from layers of virtualization and orchestration of orchestrators are crippling the operation and management of datacenters (virtualized or not), with 70% of current IT budgets going toward keeping the lights on. An explosion of tools, special-purpose appliances (for disaster recovery, IP security, performance optimization etc.) and administrative controls has escalated operation and management costs. A Gartner report estimates that for every $1 spent on development of an application, another $1.31 is spent on assuring its safety and survival. While all vendors agree upon open source, open APIs, and multi-vendor support, reality is far from it. An example is the recent debate about whether OpenStack should include Amazon AWS API support while the leading cloud provider conveniently ignores the competing API.

The Strategy of Dynamic Virtual Infrastructure

The following picture, presented at the Open Server Summit, presents a vision of a future datacenter with a virtual switch network overlaid on the physical network.

Figure 1: Network Virtualization: What It Is and Why It Matters – Presented in the Open Server Summit 2013, Bruce Davie, Principal Engineer, VMware

In addition to the physical network connecting physical servers, an overlay virtual network is created inside each physical server to connect the virtual machines it hosts. In addition, a plethora of virtual machines is being introduced to replace the physical routers and switches that control the physical network. The quest to dynamically reconfigure the network at run-time to meet changing application workloads, business priorities and latency constraints has introduced layers of additional network infrastructure, albeit software-defined. While applications are locked in a virtual server, the infrastructure is evolving to dynamically reconfigure itself to meet changing application needs. Unfortunately this strategy cannot scale in a distributed environment where different infrastructure providers deploy myriad heterogeneous technologies and management strategies, and it results in orchestrators of orchestrators contributing to complexity and tool fatigue in both datacenters and cloud environments (private or public).

Figure 2 shows a new storage management architecture also presented in the Open Server Summit.


Figure 2: PCI Express Supersedes SAS and SATA in Storage – Presented in Open Server Summit 2013, Akber Kazmi, Senior Marketing Director, PLX Technology

The PCIe switch allows a converged physical storage fabric at half the cost and half the power of the current infrastructure. In order to leverage these benefits, the management infrastructure has to accommodate it, which adds to the complexity.

In addition, it is estimated that the data traffic inside the datacenter is about 1000 times the traffic sent to and received from users outside. This completely changes the role of TCP/IP traffic inside the datacenter and, consequently, the communication architecture between applications inside the datacenter. It no longer makes sense for virtual machines running inside a many-core server to use TCP/IP as long as they are within the datacenter. In fact, it makes more sense for them to communicate via shared memory when they are executed on different cores within a processor, via a high-speed bus when they are executed on different processors in the same server, and via a high-speed network when they are executed in different servers in the same datacenter. TCP/IP is only needed when communicating with users outside the datacenter who can only be reached via the Internet.

Figure 3 shows the server evolution.


Figure 3: Servers for the New Style of IT – Presented in Open Server Summit 2013, Dwight Barron, HP Fellow and Chief Technologist, Hyper-scale Server Business Segment, HP Servers Global Business Unit, Hewlett-Packard

As the following picture presents, the current evolution of the datacenter is designed to provide dynamic control of resources for addressing workload fluctuations at run-time, changing business priorities and real-time latency constraints. The applications are static in a virtual or physical server, and the software-defined infrastructure dynamically adjusts to changing application needs.


Figure 4: Macro Trends, Complexity, and SDN – Presentation in the Open Server Summit 2013, David Meyer, CTO/Chief Architect, Brocade

Cognitive Containers & Self-Managing Intelligent Services on Static Infrastructure

With the advent of many-core servers, high-bandwidth technologies connecting these servers, and a new class of high-performance storage devices that can be optimized to meet workload needs (IOPs-intensive, throughput-sensitive or capacity-hungry), is it time to look at a static infrastructure with dynamic application/service management to reduce IT complexity in both datacenters and clouds (public or private)? This is possible if we can virtualize the applications inside a server (physical or virtual) and decouple the safety and survival of the applications, and of the groups of applications that contribute to a distributed transaction, from the myriad resource management systems that provision and control the plethora of distributed resources supporting these applications.

The Cognitive Container discussed in the Open Server Summit (http://lnkd.in/b7-rfuK) provides the required decoupling between application and service management and the underlying distributed resource management systems. The Cognitive Container is specifically designed to decouple the management of an application, and of the service transactions that a group of distributed applications execute, from the infrastructure management systems controlling their resources at run-time, which are often owned or operated by different providers. The safety and survival of the application at run-time is put first by infusing knowledge about the application (such as its intent, non-functional attributes, run-time constraints, connections and communication behaviors) into the container and using this information to monitor and manage the application at run-time. The Cognitive Container is instantiated and managed by a Distributed Cognitive Transaction Platform (DCTP) that sits between the applications and the OS, facilitating the run-time management of Cognitive Containers. The DCTP does not require any changes to the application, the OS or the infrastructure and uses the local OS in a physical or virtual server. A network of Cognitive Containers infused with similar knowledge about the service transaction they execute is also managed at run-time to assure safety and survival based on policies dictated by business priorities, run-time workload fluctuations and real-time latency constraints. The Cognitive Container network, using replication, repair, recombination and reconfiguration, provides dynamic service management independent of the infrastructure management systems at run-time. The Cognitive Containers are designed to use the local operating system to monitor the application’s vital signs (CPU, memory, bandwidth, latency, storage capacity, IOPs and throughput) and run-time behavior, and to manage the application so that it conforms to the policies.

The Cognitive Container can be deployed in a physical or virtual server and does not require any changes to the applications, operating systems or the infrastructure. Only the knowledge about the functional and non-functional requirements has to be infused into the Cognitive Container. The following figure shows a Cognitive Container network deployed in a distributed infrastructure. The Cognitive Container and the service management are designed to provide auto-scaling, self-repair, live migration and end-to-end service transaction security independent of the infrastructure management systems.


Figure 5: End-to-End Service Visibility and Control in a Distributed Datacenter (Virtualized or Not) – Presented in the Open Server Summit

Rao Mikkilineni, Chief Scientist, C3 DNA

Using the Cognitive Container network, it is possible to create federated service creation, delivery and assurance platforms that transcend physical and virtual server boundaries and geographical locations, as shown in the figure below.


Figure 6: Federated Services Fabric with service creation, delivery and assurance processes decoupled from resource provisioning, management and control.

This architecture provides an opportunity to simplify the infrastructure: a tiered server, storage and network infrastructure that is static and hardwired provides various servers (physical or virtual) with the specified service levels (CPU, memory, network bandwidth, latency, storage capacity and throughput) that the Cognitive Containers are looking for based on their QoS requirements. It does not matter what technology is used to provision these servers with the required service levels. The Cognitive Containers monitor these vital signs using the local OS and, if the levels are not adequate, they migrate to other servers where they are adequate, based on policies determined by business priorities, run-time workload fluctuations and real-time latency constraints.

The infrastructure provisioning then becomes a simple matter of matching the Cognitive Container to a server based on its QoS requirements. Thus the Cognitive Container service network provides a mechanism to deploy intelligent (self-aware, self-reasoning and self-controlling) services on dumb infrastructure, which needs only limited intelligence about services and applications (matching the application profile to the server profile), over stupid pipes that are designed to provide appropriate performance using different technologies, as discussed in the Open Server Summit.
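
As a toy illustration of this matching step (the profile fields, tier values and function names are my own assumptions, not the mechanism described above), a container's QoS profile can be checked against statically provisioned server profiles and placed on the least over-provisioned match:

```python
# Illustrative sketch of matching a Cognitive Container's QoS profile to a
# statically provisioned server profile; names and numbers are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Profile:
    cpu_cores: int
    memory_gb: int
    iops: int
    latency_ms: float    # worst-case latency the tier guarantees / the app tolerates


def satisfies(server: Profile, container: Profile) -> bool:
    """A server satisfies a container if every guaranteed level meets the requirement."""
    return (server.cpu_cores >= container.cpu_cores
            and server.memory_gb >= container.memory_gb
            and server.iops >= container.iops
            and server.latency_ms <= container.latency_ms)


def place(container: Profile, servers: list[Profile]) -> Optional[Profile]:
    """Pick the least over-provisioned server that still meets the QoS requirement."""
    candidates = [s for s in servers if satisfies(s, container)]
    return min(candidates, key=lambda s: (s.cpu_cores, s.iops), default=None)


if __name__ == "__main__":
    tiers = [Profile(4, 16, 5_000, 10.0), Profile(16, 128, 100_000, 1.0)]
    app = Profile(cpu_cores=2, memory_gb=8, iops=4_000, latency_ms=20.0)
    print(place(app, tiers))   # -> the 4-core tier, the least over-provisioned match
```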

Managing and safeguarding an application so that it can cope with non-deterministic impacts on its workload from changing demands, business priorities, latency constraints, limited resources and security threats is very similar to how cellular organisms manage life in a changing environment. The efficient managing and safekeeping of life at the lowest level of biological architecture, which provides this resiliency, was on von Neumann's mind in his Hixon lecture (von Neumann, J. (1987) Papers of John von Neumann on Computing and Computing Theory, Hixon Symposium, September 20, 1948, Pasadena, CA, The MIT Press, Massachusetts, p. 474). ‘‘The basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately.’’ Comparing computing machines and living organisms, he points out that computing machines are not as fault-tolerant as living organisms. He goes on to say ‘‘It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond.’’ Perhaps the Cognitive Container bridges this gap by infusing self-management into computing machines that manage the external world while also managing themselves with self-awareness, reasoning and control based on policies and best practices.

Cognitive Containers or not, the question is how we address the problem of ever-increasing complexity and cost in current datacenter and cloud offerings. This will be a major theme of the 4th conference track on the Convergence of Distributed Clouds, Grids and their Management at WETICE 2014 in Parma, Italy.

WETICE 2014 and The Conference Track on the Convergence of Clouds, Grids and Their Management
October 30, 2013

WETICE is an annual IEEE international conference on state-of-the-art research in enabling technologies for collaboration, consisting of a number of cognate conference tracks. The “Convergence of Clouds, Grids and their Management” conference track is devoted to discussing current and emerging trends in virtualization, cloud computing, high-performance computing, grid computing and cognitive computing. The tradition that started at WETICE 2009 “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short term profit driven motives of a particular corporate entity” has resulted in a new computing model that was included in the Turing Centenary Conference proceedings in 2012. The 2013 conference track discussed Virtualization, Cloud Computing and the Emerging Datacenter Complexity Cliff in addition to conventional cloud and grid computing solutions.

The WETICE 2014 conference, to be held in Parma, Italy during June 23-25, 2014, will continue the tradition with discussions on the convergence of clouds, grids and their management. In addition, it will solicit papers on new computing models, cognitive computing platforms and strong AI resulting from recent efforts to inject cognition into computing (Turing machines).

All papers are refereed by the scientific review committee of each conference track. All accepted papers will be published in the electronic proceedings by the IEEE Computer Society and submitted to the IEEE digital library. The proceedings will be submitted for indexing through INSPEC, Compendex, Thomson Reuters, DBLP, Google Scholar and EI Index.

http://wetice.org

Cloud Computing, Management Complexity, Self-Organizing Fractal Theory, Non Equilibrium Thermodynamics, DIME networks, and all that Jazz
May 5, 2012

“There are two kinds of creation myths: those where life arises out of the mud, and those where life falls from the sky. In this creation myth, computers arose from the mud and code fell from the sky.”

— George Dyson, “Turing’s Cathedral: The Origins of the Digital Universe”, New York: Random House, 2012.

“The DIME network architecture arose out of the need to manage the ephemeral nature of life in the Digital Universe”

— Rao Mikkilineni (2012)

Abstract:

The explosion of cloud computing software offerings (both open source and proprietary) to create public, private and hybrid clouds raises a question. Is it resulting in higher resiliency, efficiency and scaling of service offerings, or is it increasing complexity by introducing more components into an already crowded datacenter deploying myriad appliances, management frameworks, tools and people, all claiming to help lower the total cost of operation? As the reliability, availability, performance, security and efficiency of the total system depend both on the number of components and on their configuration, the architecture of a system plays an important role in defining the overall system resiliency, efficiency and scaling. We discuss the current cloud computing architecture and the resulting complexity, and investigate possible solutions using the self-organizing fractal theory and non-equilibrium thermodynamics. Evolution has taught us that when complexity increases, an architectural transformation often occurs to lower the overall system entropy. Is a phase transition about to occur in our data centers, seeded by the new many-core servers and high-bandwidth communications?

Introduction:

According to Holbrook (Holbrook 2003), “Specifically, creativity in all areas seems to follow a sort of dialectic in which some structure (a thesis or configuration) gives way to a departure (an antithesis or deviation) that is followed, in turn, by a reconciliation (a synthesis or integration that becomes the basis for further development of the dialectic). In the case of jazz, the structure would include the melodic contour of a piece, its harmonic pattern, or its meter…. The departure would consist of melodic variations, harmonic substitutions, or rhythmic liberties…. The reconciliation depends on the way that the musical departures or violations of expectations are integrated into an emergent structure that resolves deviation into a new regularity, chaos into a new order, surprise into a new pattern as the performance progresses.” He goes on to explain exquisitely what “all that jazz” means and what it has to do with Dynamic Open Complex Adaptive System or DOCAS.

I borrow the jazz metaphor to understand the current state of affairs in cloud computing. Cloud computing started innocently enough as an attempt to automate the systems administration tasks of computing systems to improve the resiliency (availability, reliability, performance and security), efficiency and scaling of services provided by web-hosting data centers. Before the advent of global web e-commerce enabled by broadband networks and ubiquitous access to high-powered computing, workload fluctuations were not wild enough to demand very fast provisioning responses. While enterprise datacenters were not pushed to deal with the wild fluctuations that some web-services companies were, companies such as Amazon, Google, Facebook and Twitter, dealing with uncertain (non-deterministic) workload fluctuations, took a different approach to improving resiliency and scaling. They took advantage of the increased power of blade servers, high-bandwidth networks and virtualization technologies to create virtual machine (VM) based systems administration, with multiple VMs in a physical device consolidating workloads that are managed with dynamic resource provisioning. This has become known as cloud computing. Strictly speaking, VMs are not essential for automation that improves scaling, auto-failover and live migration of applications and their data; companies such as Google have chosen their own automation strategies without using VMs. On the other hand, many other enterprises have taken a more conservative approach by not adopting a cloud strategy, avoiding the risk of impacting the availability, performance and security of their highly tuned mission-critical applications. They are probably correct, given the continued occasional outages, security breaches and cost escalation in managing complexity with many public clouds.

Amazon and Google went one step further by offering their flexible infrastructures to developers outside their companies, who could rent the resources to develop, deploy and operate their own applications, thus unleashing a new class of developers. Startups could substitute OPEX for CAPEX to obtain the resources required for developing their new products and services. The resulting explosion of applications and services has created new demand for more clouds and more automation of systems administration, both to extend resiliency and to provide a high degree of isolation between multiple tenants sharing resources while resolving the resulting contentions. The result is a complex web of Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) offerings to meet the needs of developers, service providers and service consumers. To be sure, these offerings are not independent. On the contrary, each layer influences the others in a complex set of interactions, often in non-deterministic ways, based on workloads, business priorities and latency constraints. Figure 1 shows an example of these relationships.

Figure 1: Complex relationships of information flow between nested layers and information flows between components in each layer. The complexity is only compounded by multi-vendor offerings in each layer (not shown here)

The origin of the complexity is easy to understand. While attempting to solve the issues of multi-tenancy and agility, the introduction of virtual machines gives rise to the added complexity of virtual image management and sprawl control. In order to address the VM mobility issue, recent efforts to introduce application-level mobility using other container constructs, such as Gears and Cartridges in the case of the Red Hat PaaS (or Dynos in the case of Heroku, the Salesforce PaaS), introduce yet another layer of management for Gears and Cartridges (or Dynos). Another example is the Eucalyptus Infrastructure as a Service, which goes to great lengths to provide high availability (HA) of the infrastructure platform but fails to guarantee HA of the applications, which are left to fend for themselves. These ad-hoc approaches to automating management have multiplied the software required, steepened the learning curve and made operation and maintenance even more complex. While all platforms demonstrate drag-and-drop software with pretty displays that allow developers to easily create new services, there is no guarantee that, if something goes wrong, one will be able to debug the system and find the root cause. Nor is there assurance that, when multiple services and applications are deployed on the same platform, the feature interactions and shared resource management provided by a plethora of independently designed management systems will cooperate to provide the required reliability, availability, performance and security at the service level. More importantly, when services cross server, data-center and geographical boundaries, there is no visibility into, or control of, end-to-end service connections and their FCAPS (fault, configuration, accounting, performance and security) management. Obviously, the platform vendors are only too eager to provide professional services and additional software to resolve these issues, but without end-to-end service connection visibility and control spanning multiple modules, systems, geographies and management systems, troubleshooting expenses could outweigh the realized benefits. What we probably need is not more “code” but an intelligent architecture that results in a synthesis of computing services and their management, and a decoupling of end-to-end service connection and service component management from the underlying resource (server, network and storage) management.

Self-organizing Fractals and Non-equilibrium Thermodynamics:

Fortunately, the self-organizing fractal theory (SOFT) and non-equilibrium thermodynamics (NET) (Kurakin 2011) provide a way to analyze complex systems and identify solutions. A very good glimpse into the theory can be found in the video (http://www.scivee.tv/node/4994). According to the SOFT-NET theory, the process of self-organization is scale-invariant and proceeds through sequential organizational state transitions, in a manner characteristic of far-from-equilibrium systems, with macrostructure-processes emerging via phase transition and self-organization of microstructure-processes. Once they have emerged as a result of an organizational transition, newborn structure-processes strive to persist and expand, growing in size/number, diversity, complexity, and order, while feeding on pre-existing energy/matter gradients. Economic competition among alternatively organized structure-processes feeding on the same energy/matter gradients leads to the elimination of economically deficient or inferior structure-processes and the improvement, diversification, and specialization of survivors, who are forced to fill and exploit all the available resource niches (the Darwinian phase of self-organization) (Kurakin 2007). Promoted by mutually profitable exchanges of energy/matter, the self-organization of specializing survivors (structure-processes) into larger-scale structure-processes transforms (mostly) competing alternatives into (mostly) cooperating complements. As a result, Darwinian competition is transferred onto a larger spatiotemporal scale, where it commences among alternative organizations of self-organized survivors (the organizational phase) (Kurakin 2007). Such an economy-driven, scale-invariant process of self-organization leads to the emergence of increasingly long-lived, multi-scale, hierarchical organizations (structures-processes) that expand over increasingly larger scales of space and time, feeding on available energy/matter gradients and eventually destroying them. Yet because energy/matter exists as a non-equilibrium system of interdependent gradients and conjugated fluxes of interconverting energy/matter forms, new gradients and fluxes are created and become dominant as old gradients and fluxes are consumed and destroyed. Such processes are responsible for the continuous birth, death, and transformation of energy/matter forms.

Obviously, cloud computing systems (or, for that matter, distributed computing systems in general, based on Turing machines) are not living organisms and thus are not susceptible to self-organization. However, if you substitute information for energy/matter, there are many similarities between the structure and dynamics of computing systems and those of living self-organizing systems. The nested computing layers, the meta-stable organizational patterns (both macro- and micro-structures) in each layer, and the process evolution through inter-layer interaction are the same features that contribute to self-organization. So one can ask what is missing for cloud computing environments to become self-organizing. The answer lies in two observations:

  1. The first is Gödel’s prohibition of self-reflection by computing elements that form the fundamental building block of the computing domain, the Turing machine (TM) (Samad and Cofer, 2001).
  2. The second is the lack of the scale-invariant macro- and micro-structure-processes mentioned above for organizing computing components and their management across the various nested layers, a consequence of the current ad-hoc implementation of computing processes using the serial von Neumann implementation of the Turing machine.

I have discussed both of these deficiencies elsewhere (Mikkilineni 2011, 2012); the DIME network architecture proposed there attempts to address them.

The DIME Network Architecture:

In its simplest form, a DIME comprises a policy manager (handling the fault, configuration, accounting, performance and security aspects, often denoted by FCAPS); a computing element called the MICE (Managed Intelligent Computing Element); and two communication channels. The FCAPS elements of the DIME provide setup, monitoring, analysis and reconfiguration based on workload variations, policy-driven system priorities and latency constraints. They are interconnected and controlled using a signaling channel which overlays a computing channel that provides I/O connections to the MICE (the computing element) (Mikkilineni 2011). The DIME computing element acts like the oracle machine introduced in Turing’s thesis and circumvents the halting and undecidability issues by separating the computation from its management and pushing the management to a higher level. Figure 2 shows the DIME computing model.

Figure 2: The DIME Computing Model. For details on the different implementations of DIME networks (a LAMP stack without VMs and a native Parallax OS) visit http://www.youtube.com/kawaobjects
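
For readers who prefer code to diagrams, the following structural sketch captures the anatomy described above: an FCAPS policy manager, a MICE that runs the task, and separate signaling and computing channels. The names are purely illustrative; this is not the reference implementation shown in the video.

```python
# Minimal structural sketch of a DIME node (illustrative names only): an FCAPS
# manager, a MICE computing element, and two channels realized as queues.
import queue
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class FCAPSPolicy:
    fault: str = "restart-on-error"
    configuration: dict = field(default_factory=dict)
    accounting: bool = True
    performance: dict = field(default_factory=lambda: {"max_latency_ms": 100})
    security: str = "local-only"


class MICE:
    """Managed Intelligent Computing Element: runs one well-specified task."""

    def __init__(self, task: Callable[[Any], Any]):
        self.task = task

    def run(self, payload: Any) -> Any:
        return self.task(payload)


class DIME:
    """A DIME pairs a MICE with an FCAPS manager and two channels."""

    def __init__(self, policy: FCAPSPolicy, task: Callable[[Any], Any]):
        self.policy = policy
        self.mice = MICE(task)
        self.signaling = queue.Queue()   # management commands (setup/monitor/reconfigure)
        self.computing = queue.Queue()   # I/O for the computing element

    def handle_signal(self, command: str) -> None:
        # The FCAPS manager reacts to signals without touching the MICE's task code.
        if command == "reconfigure":
            self.policy.configuration["reconfigured"] = True

    def compute(self, payload: Any) -> Any:
        self.computing.put(payload)
        return self.mice.run(self.computing.get())


if __name__ == "__main__":
    node = DIME(FCAPSPolicy(), task=lambda x: x * 2)
    node.handle_signal("reconfigure")
    print(node.compute(21), node.policy.configuration)   # 42 {'reconfigured': True}
```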

In addition, the introduction of signaling in the DIME network architecture allows a fractal composition scheme for the DIME network, creating a recursive distributed computing engine with scale-invariant FCAPS management of the computing workflow at the node, sub-network and network levels. Figure 3 shows the comparison between living organisms with self-organizing fractal attributes and a cloud computing infrastructure organized to exhibit self-managing fractal attributes.

Figure 3: Comparison of the nested hierarchical organization of living organisms and DIME network architecture.

While both models exhibit the genetic transactions of replication, repair, recombination and reconfiguration (Stanier and Moore, 2006) (Mikkilineni 2011), there is a fundamental difference between the two. The DIME network architecture is not self-organizing but self-managing, based on initial policies and constraints defined at the root levels of the hierarchies. These policies can be modified at run time, but only through the influence of agents external to the computing element whose behavior is being modified (at the DIME node, sub-network and network levels).

At each level, the FCAPS management defines the initial conditions and policy constraints (a meta-model, if you will, denoting the context and defining the destiny of the ensuing process workflow) that determine the information flows and workflows executed by the DIME network downstream. The resulting metastable configurations are monitored and managed by the managers upstream. This model exhibits the three-step process that provides self-management in living organisms: establish a routine, monitor cues, and respond with corrective action based on FCAPS parameters at every level. Figure 4 shows the metastable configuration entropy of the whole system: the monitored FCAPS parameters provide a measure of the system entropy, and reconfiguration moves the system from a higher-entropy state to a lower-entropy one, providing a “measure” of the stable pattern.

Figure 4: System Entropy as a function of time
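
The sketch below illustrates the three-step loop just described (establish a routine, monitor cues, respond with corrective action). The printed "disorder" value, a sum of squared deviations from the target utilization, is only a stand-in I chose for the entropy measure plotted in Figure 4; the numbers and the rebalancing rule are assumptions.

```python
# Sketch of the three-step self-management loop: establish a routine, monitor
# cues, respond with corrective action.  "Disorder" here is a simple stand-in
# for the entropy measure in Figure 4, not the formula used there.
from statistics import mean


def disorder(observed: list[float], target: float) -> float:
    """Aggregate deviation of node utilizations from the policy target."""
    return sum((u - target) ** 2 for u in observed)


def rebalance(observed: list[float]) -> list[float]:
    """Corrective action: move each node's load partway toward the average."""
    avg = mean(observed)
    return [u + 0.5 * (avg - u) for u in observed]   # damped step toward balance


if __name__ == "__main__":
    target = 0.6                       # step 1: establish the routine (policy target)
    load = [0.95, 0.20, 0.65, 0.60]    # step 2: monitor cues (observed utilization)
    for step in range(4):              # step 3: respond with corrective action
        print(f"step {step}: disorder = {disorder(load, target):.3f}")
        load = rebalance(load)
```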

The SOFT-NET theories provide a path to re-examine the way we design distributed computing systems. Perhaps living organisms, with their self-organizing properties, can show us a way to bring self-management to cloud computing configurations to improve resiliency, efficiency and scaling. The DIME network architecture is a baby step toward a recursive distributed computing engine that executes managed workflows consisting of hierarchical and temporal sequences of events implementing business processes.

The DIME network architecture raises some interesting questions about Turing machines and their management. How is it related to the Universal Turing Machine (UTM)? It is important to point out that I do not claim that DIME networks are the answer to cloud computing woes, or that the UTM can or cannot do what a DIME network does. While communicating Turing machines are modeled by a UTM (Penrose 1989), can managed Turing machine networks also be modeled by the UTM? Are the scale-invariant organizational macro- and micro-structure-processes discussed in the SOFT-NET theory essential for self-organizing systems? What are the differences between living self-organizing systems and self-managing networks? I leave this to the experts. I only point out that the DIME is inspired by the oracle machine discussed by Turing in his thesis and implements the architectural resiliency of cellular organisms in a distributed computing infrastructure by introducing parallel management of both the computing elements and the networks. While its feasibility has been demonstrated (Mikkilineni, Morana and Seyler, 2012), the DIME network architecture is still in its infancy and presents an opportunity, on the eve of Turing’s centenary celebration, to investigate its usefulness and theoretical soundness. Only time will tell if the DIME network architecture is useful in mission-critical environments. Figure 5 shows a comparison of physical-server-based computing, virtual-machine-based cloud computing and a DIME network implementation in a Linux server that eliminates the hypervisors and virtual machines.

Figure 5: Comparison of conventional, cloud and DIME network computing paradigms. The DIME network architecture requires no hypervisors, virtual machines, IaaS or PaaS. Linux processes are FCAPS-managed and networked using a middleware library without any changes to the operating system.

The DIME network architecture, with its self-management, parallel signaling network overlay and recursive distributed computing engine model, supports all the features that current cloud computing provides, and more, while eliminating the need for hypervisors, virtual machines, IaaS and PaaS. The DNA offers simplicity by providing FCAPS management of a Linux process through a middleware library, using the standard services of the Linux operating system and the parallelism available in a multi-core/many-core processor.
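
A minimal sketch of what FCAPS-style fault management of an unmodified Linux process might look like using only standard OS services is shown below. This is not the DNA middleware; the command, the restart budget and the messages are invented for illustration.

```python
# Minimal sketch (not the DNA middleware) of fault management for an ordinary
# Linux process using only standard OS services: supervise it and restart it
# on failure, without modifying the application itself.
import subprocess
import time

COMMAND = ["sleep", "2"]        # stands in for any unmodified application
MAX_RESTARTS = 3                # hypothetical fault-management policy


def supervise(command: list[str], max_restarts: int) -> None:
    restarts = 0
    while restarts <= max_restarts:
        proc = subprocess.Popen(command)
        code = proc.wait()                       # monitor: block until the process exits
        if code == 0:
            print("clean exit; nothing to repair")
            return
        restarts += 1                            # corrective action: restart (self-repair)
        print(f"process exited with {code}; restart {restarts}/{max_restarts}")
        time.sleep(1)
    print("restart budget exhausted; escalate to the next FCAPS level")


if __name__ == "__main__":
    supervise(COMMAND, MAX_RESTARTS)
```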

Conclusion:

I conclude with one lesson from the past (Mikkilineni and Sarathy, 2009) that I take away from working in POTS (the Plain Old Telephone System), PANS (Pretty Amazing New Services enabled by the Internet), SANs and clouds. It is that wherever there is networking, switching always trumps other approaches. When services are executed by a network of distributed components, service switching and end-to-end service connection management are the ultimate meta-stable structure-processes, and it seems that cellular organisms, telephone networks, and human network eco-systems have figured this out. Signaling and nested FCAPS management structure-processes seem to be the common ingredients. Therefore, I predict that the data centers that are currently computing-resource management centers will eventually transform themselves into service switching centers, just as in telephony. Perhaps computer scientists should look to telephony, neuroscience and organizational dynamics for answers rather than engaging in hackathons and coding ad-hoc complex systems to manage distributed computing resources. The SOFT-NET theories seem to be pointing in the right direction. The solution may lie in discovering scale-invariant micro- and macro-structure-processes that provide nested FCAPS management and self-managed local and global policy enforcement. Perhaps Holbrook’s “All that Jazz” metaphor is an appropriate one for cloud computing research. The time may be ripe for the reconciliation (the synthesis of the thesis of implementing services and the antithesis of services management).

References:

Holbrook, Morris B. (2003). “Adventures in Complexity: An Essay on Dynamic Open Complex Adaptive Systems, Butterfly Effects, Self-Organizing Order, Coevolution, the Ecological Perspective, Fitness Landscapes, Market Spaces, Emergent Beauty at the Edge of Chaos, and All That Jazz.” Academy of Marketing Science Review [Online] 2003 (6). Available: http://www.amsreview.org/articles/holbrook06-2003.pdf

Kurakin, A., Theoretical Biology and Medical Modelling, 2011, 8:4. http://www.tbiomed.com/content/8/1/4

Kurakin, A., (2007). The universal principles of self-organization and the unity of Nature and knowledge. [http://www.alexeikurakin.org/text/thesoft.pdf]

Mikkilineni, R., Sarathy, V., (2009), “Cloud Computing and the Lessons from the Past,” Enabling Technologies: Infrastructures for Collaborative Enterprises, 2009. WETICE ’09. 18th IEEE International Workshops on , vol., no., pp.57-62, June 29 2009-July 1 2009. doi: 10.1109/WETICE.2009.

Mikkilineni, R., (2011). Designing a New Class of Distributed Systems. New York, NY: Springer. (http://www.springer.com/computer/information+systems+and+applications/book/978-1-4614-1923-5)

Mikkilineni, R., (2012). Turing Machines, Architectural Resilience of Cellular Organisms and DIME Network Architecture (http://www.computingclouds.wordpress.com)

Mikkilineni, R., Morana, G., and Seyler, I., (2012), “Implementing Distributed, Self-managing Computing Services Infrastructure using a Scalable, Parallel and Network-centric Computing Model” Chapter in a Book edited by Villari, M., Brandic, I., & Tusa, F., Achieving Federated and Self-Manageable Cloud Infrastructures: Theory and Practice (pp. 1-374). doi:10.4018/978-1-4666-1631-8

Penrose, R., (1989) “The Emperor’s New Mind: Concerning Computers, Minds, And The Laws of Physics” New York, Oxford University Press, p. 48

Samad, T., Cofer, T., (2001). Autonomy and Automation: Trends, Technologies, Tools. In Gani, R., Jørgensen, S. B., (Eds.) European Symposium on Computer Aided Process Engineering, volume 11, Amsterdam, Netherlands: Elsevier Science B. V., p. 10

Stanier, P., Moore, G., (2006) “Embryos, Genes and Birth Defects”, (2nd Edition), Edited by Patrizia Ferretti, Andrew Copp, Cheryll Tickle, and Gudrun Moore, London, John Wiley & Sons

Path to Self-managing Services: A Case for Deploying Managed Intelligent Services Using Dumb Infrastructure in a Stupid Network
February 2, 2012

“WETICE 2012 Convergence of Distributed Clouds, Grids and their Management Conference Track is devoted to transform current labor intensive, software/shelf-ware-heavy, and knowledge-professional-services dependent IT management into self-configuring, self-monitoring, self-protecting, self-healing and self-optimizing distributed workflow implementations with end-to-end service management by facilitating the development of a Unified Theory of Computing.”

“In recent history, the basis of telephone company value has been the sharing of scarce resources — wires, switches, etc. – to create premium-priced services. Over the last few years, glass fibers have gotten clearer, lasers are faster and cheaper, and processors have become many orders of magnitude more capable and available. In other words, the scarcity assumption has disappeared, which poses a challenge to the telcos’ “Intelligent Network” model. A new type of open, flexible communications infrastructure, the “Stupid Network,” is poised to deliver increased user control, more innovation, and greater value.”

                     —–Isenberg, D. S., (1998). “The dawn of the stupid network”. ACM netWorker 2, 1, 24-31.

Much has changed since the late ’90s that drove the telcos to essentially abandon their drive for supremacy in the intelligent services creation, delivery and assurance business and to take a back seat in the information services market, managing the ‘stupid network’ that merely carries the information services. You only have to look at the demise of major R&D companies such as AT&T Bell Labs, Lucent, Nortel and Alcatel, and the rise of a new generation of services platforms from Apple, Amazon, Google, Facebook, Twitter, Oracle and Microsoft, to notice the sea change that has occurred in a short span of time. The data center has replaced the central office as the hub from which myriad voice, video and data services are created and delivered on a global scale. However, the management of these services, which determines their resiliency, efficiency and scaling, is another matter.

While the data center’s value has been in the sharing of expensive resources (processor speed, memory, network bandwidth, storage capacity, throughput and IOPS) to create premium-priced services, over the last couple of decades the complexity of the infrastructure and its management has exploded. It is estimated that up to 70% of the total IT budget now goes to the management of infrastructure rather than to developing new services (www.serverdesignsummit.com). It is important to define what TCO (total cost of ownership) we are talking about here, because TCO is often used to justify different solutions. Figure 1 shows three different TCO views presented by three different speakers at the Server Design Summit in November 2011. Each graph, while accurate, represents a different view. For example, the first view represents the server infrastructure and its management cost. The second represents the power infrastructure and its management. The third view shows both the server infrastructure and power management. As you can see, the total power and its management, while steadily increasing, is only a small fraction of the total infrastructure management cost. In addition, these views do not even show the network and storage infrastructure and their management. It is also interesting to see the explosion of management cost over the last two decades shown in the third view. Automation has certainly improved the number of servers that can be managed by a single person by orders of magnitude. This is borne out by the labor cost in the left picture (from Intel), which shows it at about 13% of the TCO from the server viewpoint. But this does not tell the whole story.

Figure 1: Three different views of Data center TCO presented in the Server Design Summit conference in November 2011 (http://www.serverdesignsummit.com/English/Conference/Proceedings_Chrono.html). These views do not touch the storage, network and application/service management costs both in terms of software systems and labor.

A more revealing picture can be obtained by using the TCO calculator from one of the virtualization infrastructure vendors. Figure 2 shows the percentage of the total cost of ownership (TCO) contributed by each component, for a 1500-server data center over five years, with and without virtualization.

Figure 2: Five Year TCO of Virtualization According to a Vendor ROI Calculator. While virtualization reduces the TCO from 35% to 25%, it is almost offset by the software, services and training costs.

While virtualization introduces many benefits, such as consolidation, multi-tenancy within a physical server, real-time business continuity and elastic scaling of resources to meet wildly fluctuating workloads, it adds another layer of management systems on top of the current computing, network, storage and application management systems. Figure 3 shows a reduction of about 50% in the five-year TCO with virtualization. A virtual machine density of about 13 allows a large saving in hardware costs, which is somewhat offset by the new software, training and services costs of virtualization.

Figure 3: TCO over 5 years with virtualization of 1500 servers using 13 VMs per server. While the infrastructure and administration costs drop, the reduction is almost offset by the software, services and training costs.

In addition, there is the cost of the new complexity in optimizing the 13 or so VMs within each server in order to match the resources (network bandwidth, storage capacity, IOPS and throughput) to application workload characteristics, business priorities and latency constraints. According to storage consultant Jon Toigo, “Consumers need to drive vendors to deliver what they really need, and not what the vendors want to sell them. They need to break with the old ways of architecting storage infrastructure and of purchasing the wrong gear to store their bits: Deploying a “SAN” populated with lots of stovepipe arrays and fabric switches that deliver less than 15% of optimal efficiency per port is a waste of money that bodes ill for companies in the areas of compliance, continuity, and green IT.”

Resource-management-based data center operations miss an important feature of services/application management: not all services are created equal. They have different latency and throughput requirements. They have different business priorities and different workload characteristics and fluctuations. What works for the goose does not work for the gander. Figure 4 shows a classification of different services based on their throughput and latency requirements, presented by Dell at the Server Design Summit. The applications are characterized by their need for throughput, latency and storage capacity. In order to take advantage of the differing priorities and characteristics of the applications, additional layers of services management are introduced which focus on service-specific resource management. Various appliance- or software-based solutions are added to the already complex resource management suites that address server, network and storage, to provide service-specific optimization. While this approach is well suited to making recurring revenues for vendors, it is not well suited to lowering the customers’ final TCO when all the piece-wise TCOs are added up. Over a period of time, most of these appliances and software end up as shelf-ware while the vendors tout yet more TCO-reducing solutions. For example, a well-known solution vendor makes more annual revenue from maintenance and upgrades than from new products or services that help its customers really reduce the TCO.

Figure 4: Various services/applications characterized by their throughput and latency requirements. The current resource-management-based data center does not optimally exploit the resources based on application/service priority, workload variations and latency constraints. It is easy to see the inefficiency in deploying a “one size fits all” infrastructure. It will be more efficient to tailor “dumb” infrastructure and “Stupid Network” pools specialized to cater to different latency and throughput characteristics and let intelligent services provision themselves with the right resources based on their own business priorities, workload characteristics and latency constraints. This requires the visibility and control of service specification, management and execution available at run time, which necessitates a search for new computing models.
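
As a toy rendering of the tailoring argument in the caption above (the tier names and thresholds are my own assumptions, not taken from the Dell classification), each service profile can be routed to a "dumb" pool specialized for its latency and throughput needs rather than to a one-size-fits-all infrastructure:

```python
# Toy illustration of tailoring pools to service profiles; tier names and
# thresholds are hypothetical.
from dataclasses import dataclass


@dataclass
class ServiceProfile:
    name: str
    latency_ms: float        # required response latency
    throughput_mbps: float   # required sustained throughput


def pool_for(svc: ServiceProfile) -> str:
    """Pick a specialized pool instead of a one-size-fits-all infrastructure."""
    if svc.latency_ms <= 5:
        return "low-latency pool (flash storage, high-clock cores)"
    if svc.throughput_mbps >= 1000:
        return "throughput pool (wide interconnect, streaming storage)"
    return "capacity pool (dense, low-cost storage)"


if __name__ == "__main__":
    for svc in [ServiceProfile("trading", 1.0, 200.0),
                ServiceProfile("video-encode", 50.0, 4000.0),
                ServiceProfile("archive", 500.0, 50.0)]:
        print(f"{svc.name:12s} -> {pool_for(svc)}")
```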

In addition to the current complexity and cost of resource management to assure service availability, reliability, performance and security, there is an even more fundamental issue that plagues the current distributed systems architecture. A distributed transaction that spans multiple servers, networks and storage devices in multiple geographies uses resources that span multiple data centers. The fault, configuration, accounting, performance and security (FCAPS) management of a distributed transaction requires end-to-end connection management, much like a telecommunication service spanning distributed resources. Therefore, focusing only on resource management in a data center, without visibility into and control of all the resources participating in the transaction, will not assure service availability, reliability, performance and security.

Distributed transactions transcend the current stored-program-control implementation of the Turing machine, which is at the heart of the atomic computing element in the current computing infrastructure. Communication and control are not an integral part of this atomic computing unit in the stored-program-control implementation of the Turing machine. Distributed transactions require interaction that integrates computing, control and communication to provide the ability to specify and execute highly temporal and hierarchical event flows. According to Goldin and Wegner, interactive computation is inherently concurrent, where the computation of interacting agents or processes proceeds in parallel. Hoare, Milner and other founders of concurrency theory have long realized that Turing Machines (TM) do not model all of computation (Wegner and Goldin, 2003). However, when their theory of concurrent systems was first developed in the late ’70s, it was premature to openly challenge TMs as a complete model of computation. Their theory positions interaction as orthogonal to computation, rather than a part of it. By separating interaction from computation, the question whether the models for CCS and the Pi-calculus went beyond Turing Machines and algorithms was avoided. The resulting divide between the theory of computation and concurrency theory runs very deep. The theory of computation views computation as a closed-box transformation of inputs to outputs, completely captured by Turing Machines. By contrast, concurrency theory focuses on the communication aspect of computing systems, which is not captured by Turing Machines, referring both to the communication between computing components in a system and to the communication between the computing system and its environment. As a result of this division of labor, there has been little in common between these fields and their communities of researchers. According to Papadimitriou (Papadimitriou, 1995), such a disconnect within the theory community is a sign of a crisis and a need for a Kuhnian paradigm shift in our discipline.

Kuhnian paradigm shift or not, a new computing model, the DIME computing model (discussed at WETICE 2010), provides a convergence of these two disciplines by addressing the computing and the communication in a single computing entity: a managed Turing machine. The DIME network architecture provides a fractal (recursive) composition scheme to create an FCAPS-managed network of DIMEs implementing business workflows as DAGs (directed acyclic graphs) supporting both hierarchical and temporal event flows. The DIME computing model supports only those computations that can be specified as managed DAGs, where a management signaling network overlay allows execution of managed computing tasks (executed by a computing unit called the MICE) in each Turing machine node, endowing it with self-management using parallel computing threads. The MICE (see the video referenced in this blog for a description of the DIME and its use in distributed computing and its management) constitutes the atomic Turing machine, controlled by the FCAPS manager in a DIME, which allows configuring, executing and managing the MICE to load and execute a well-specified computing workflow along with its FCAPS management. The MICE, under the parallel real-time control of the DIME FCAPS manager aided by a signaling network overlay, provides control over the start, stop, read and write abstractions of the Turing machine. Two implementations have provided an existence proof for the DIME network architecture.
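
The start/stop/read control just described can be pictured with a small threading sketch (illustrative only; not the WETICE prototype): a manager drives a computing task through signals carried on a separate channel, so management runs alongside the computation rather than inside it.

```python
# Illustrative sketch (not the WETICE prototype): a manager controls a computing
# task through start/stop/read signals carried on a separate signaling channel.
import queue
import threading
import time


def mice(signals: "queue.Queue[str]", results: "queue.Queue[int]") -> None:
    """The managed computing element: counts while it is allowed to run."""
    running, value = False, 0
    while True:
        try:
            cmd = signals.get(timeout=0.05)   # poll the signaling channel
        except queue.Empty:
            cmd = None
        if cmd == "start":
            running = True
        elif cmd == "stop":
            running = False
        elif cmd == "read":
            results.put(value)                # expose state on the computing channel
        elif cmd == "halt":
            return
        if running:
            value += 1


if __name__ == "__main__":
    signals, results = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=mice, args=(signals, results))
    worker.start()
    signals.put("start")                      # the FCAPS manager role, inlined here
    time.sleep(0.2)
    signals.put("stop")
    signals.put("read")
    print("value read by manager:", results.get())
    signals.put("halt")
    worker.join()
```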

Figure 5 shows a DIME network implementing a Linux, Apache, MySQL and PHP/Perl/Python (LAMP) web services delivery and assurance infrastructure.

Figure 5: The GUI showing the configuration of a LAMP cloud (Mikkilineni, Morana, Zito, Di Sano, 2012). Each Apache and DNS instance is DIME-aware, running in a DIME-aware Linux operating system which transforms a process into a managed element of the DIME network. A video describes the implementation of auto-failover, auto-scaling and performance management of the DIME-aware LAMP cloud.

Look Ma! No Hypervisor or VM in My Cloud (See Video)

The prototype implementations demonstrate a side effect of the DIME network architecture, which combines the computing and communication abstractions at an atomic level: it decouples services management from the underlying hardware infrastructure management. This makes it possible to implement highly resilient distributed transactions with auto-scaling, self-repair, state-aware migration and self-protection (in short, end-to-end transaction FCAPS management) based on business priorities, workload fluctuations and latency constraints. No hypervisors or VMs are required. The intelligent management of service workflows with resilient distributed transactions offers a new architecture for the data center infrastructure. For the first time, it will be possible to stop embedding service management in the infrastructure management intelligence of myriad expensive appliances and software systems. It will be possible to design new tiers of dumb infrastructure pools (of servers, storage and network devices) with different latency and throughput characteristics, and the services will be able to manage themselves based on policies, requesting appropriate resources based on their specifications. They will be able to self-migrate when quality-of-service levels are not met. The case for dumb infrastructure on a stupid network with intelligent services management puts forth the following advantages:

  1. Separation of concerns: the network, storage and server hardware provides hardware infrastructure management with signaling-enabled FCAPS management. It does not encapsulate service management as current-generation equipment does.
  2. Specialization: the hardware is designed to meet specific latency and throughput characteristics, simplifying its design through specialization. Different hardware with FCAPS management and signaling will provide plug-and-play components at run time.
  3. End-to-end service connection FCAPS management using the signaling network overlay allows dynamic service FCAPS management, facilitating self-repair, auto-scaling, self-protection, state-aware migration and end-to-end transaction security assurance (see the sketch below).
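
The sketch below gives one possible shape for item 3 (the event names and the policy table are assumptions on my part): a service-level FCAPS handler maps monitored events to corrective actions such as self-repair, scaling or migration, without involving the infrastructure management systems.

```python
# Sketch of a service-level FCAPS handler; event names and the policy table are
# hypothetical, chosen only to illustrate the dispatch idea.
from enum import Enum


class FCAPSEvent(Enum):
    FAULT = "fault"
    CONFIGURATION = "configuration"
    ACCOUNTING = "accounting"
    PERFORMANCE = "performance"
    SECURITY = "security"


# Policy table: which corrective action each event class triggers at the service level.
POLICY = {
    FCAPSEvent.FAULT: "self-repair: restart the service component",
    FCAPSEvent.PERFORMANCE: "auto-scale or migrate to a pool meeting the QoS profile",
    FCAPSEvent.SECURITY: "re-key the end-to-end transaction and isolate the component",
    FCAPSEvent.CONFIGURATION: "re-read policies from the root FCAPS manager",
    FCAPSEvent.ACCOUNTING: "record usage against the business priority of the transaction",
}


def handle(event: FCAPSEvent) -> str:
    """Dispatch a monitored event to its policy-defined corrective action."""
    return POLICY[event]


if __name__ == "__main__":
    print(handle(FCAPSEvent.PERFORMANCE))
```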

Figure 6 shows an example design of a possible storage device using a simple storage architecture enabled with FCAPS management over a signaling overlay. It can easily be built with commercial off-the-shelf (COTS) hardware. This design separates services management from storage device management and eliminates a host of storage management software systems, thus simplifying the data center infrastructure.

Figure 6: A gedanken design of autonomic storage and autonomic storage service deployment using the new DIME network architecture. The signaling overlay and FCAPS management are used to provide dynamic service management. Each service can request, using standard Linux OS services at run time, services from the storage device based on business priorities, workload fluctuations and latency constraints.

It is easy to see that the service connection model eliminates the need for clustering and provides new ways to deliver transaction resilience, with features such as service call forwarding, service call waiting, data broadcast and an 800-number service call model. It is equally easy to see how, with many-core servers, the DIME network architecture eliminates the inefficiencies of communication (e.g., TCP/IP) between Linux images within the same container, and how simple SAS and flash storage can replace current-generation appliance-based storage strategies and their myriad management systems. Looking at the trends, it is easy to see that a paradigm shift will soon be in play to transform data centers from their current role as managed server, networking and storage hosting centers (whether physical or virtual) into true service switching centers with telecom-grade trust. The emphasis will shift from resource switching and resource connection management to service switching and service connection management, replacing the current efforts to replicate the complexity of today's data center inside the many-core servers as well. With the resulting decoupling of services management from infrastructure management, next-generation data centers will perhaps be more like the central offices of the old telcos, switching service connections.
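
A toy version of the "service call forwarding" idea mentioned above (the addresses and health flags are invented for illustration): a service connection is switched to a healthy standby instance the way a telephone call is forwarded, with no cluster manager in the path.

```python
# Toy illustration of "service call forwarding"; instance addresses and health
# flags are hypothetical.
from dataclasses import dataclass


@dataclass
class ServiceInstance:
    address: str
    healthy: bool


def connect(primary: ServiceInstance, forward_to: list[ServiceInstance]) -> str:
    """Return the address the service connection is switched to."""
    if primary.healthy:
        return primary.address
    for standby in forward_to:               # forwarding chain, like call forwarding
        if standby.healthy:
            return standby.address
    raise RuntimeError("no healthy instance; connection cannot be completed")


if __name__ == "__main__":
    primary = ServiceInstance("10.0.0.1:8080", healthy=False)
    standbys = [ServiceInstance("10.0.1.1:8080", healthy=True)]
    print(connect(primary, standbys))        # -> 10.0.1.1:8080
```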

Obviously the new computing model is in its infancy and requires participation from academics who can validate or reject its theoretical foundation, VCs who can see beyond current approaches and are not satisfied with measuring data center efficiency by how many servers a single administrator can manage (as one Silicon Valley VC claimed as progress at the Server Design Summit), and architects who exploit new paradigms to disrupt the status quo. The DIME computing model, by allowing Linux processes to be converted into a DIME network transcending physical boundaries, allows easy migration from the current infrastructure to the new one without abandoning legacy applications, as the LAMP cloud prototype demonstrates.

In closing, I would like to point out that there have been many calls for a new computing model that combines computing and communication at the atomic computing element level, where, as discussed above, the Turing machine falls short. However, without high-bandwidth communication and exploitation of the parallelism that is abundant in the new generation of hardware, such new computing models are not of much practical use. It seems that hardware advances have outpaced software advances, and perhaps it is time for computer scientists to take a serious second look at addressing the software shortfall in dealing with distributed transactions. As the following fable illustrates, it may be futile to look for parallel breakthrough solutions in a serial boat.

“When Master Foo and his student Nubi journeyed among the sacred sites, it was the Master’s custom in the evenings to offer public instruction to UNIX neophytes of the towns and villages in which they stopped for the night.  On one such occasion, a methodologist was among those who gathered to listen.  “If you do not repeatedly profile your code for hot spots while tuning, you will be like a fisherman who casts his net in an empty lake,” said Master Foo.
“Is it not, then, also true,” said the methodology consultant, “that if you do not continually measure your productivity while managing resources, you will be like a fisherman who casts his net in an empty lake?”
“I once came upon a fisherman who just at that moment let his net fall in the lake on which his boat was floating,” said Master Foo. “He scrabbled around in the bottom of his boat for quite a while looking for it.”  “But,” said the methodologist, “if he had dropped his net in the lake, why was he looking in the boat?”  “Because he could not swim,” replied Master Foo.
Upon hearing this, the methodologist was enlightened”        — Master Foo and the Methodologist
                                                                   (http://www.catb.org/esr/writings/unix-koans/methodology-consultant.html)

If you have transformational research results, or want to make a real difference in computer science research, see Call for Papers at:

www.workshop.kawaobjects.com and http://WETICE.org