Changing Landscape of Backup and Disaster Recovery

September 16, 2012

“Consumers need to drive vendors to deliver what they really need, and not what the vendors want to sell them.”

- Jon Toigo (http://www.datastorageconnection.com/doc.mvc/Jon-Toigo-Exposes-More-About-Data-Storage-Ven-0001)

Enterprise computing began in mainframe datacenters, where applications were accessed over narrow-bandwidth networks from dumb terminals, and evolved into client-server and peer-to-peer distributed architectures that exploit higher-bandwidth connections. Along the way, business process automation has contributed significantly to reducing the total cost of ownership (TCO). The Internet enabled global e-commerce, and the resulting growth in commerce led to an explosion of storage. Storage networking, and the resulting NAS (network attached storage) and SAN (storage area network) technologies, further changed the dynamics of enterprise IT infrastructure to meet business process automation needs. Storage backup and recovery technologies have in turn improved the resiliency of service delivery by shortening the time it takes to respond to a service failure. Two measures capture this. The recovery point objective (RPO) is the point in time to which data must be recovered, as dictated by business needs. The recovery time objective (RTO) is the period of time after an outage within which the application and its data must be restored to the predetermined state defined by the RPO. Figure 1 shows the evolution of the recovery time objective, which has dropped from days to minutes and seconds. While the productivity, flexibility and global connectivity made possible by this evolution have radically transformed the business economics of information systems, the complexity of heterogeneous, multi-vendor solutions has created a high dependence on specialized training and service expertise to assure the availability, reliability, performance and security of business applications.
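To make the two objectives concrete, here is a minimal Python sketch that checks a single outage against an RPO and an RTO. The function name `meets_objectives` and the incident timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup, outage_start, service_restored, rpo, rto):
    """Return (rpo_met, rto_met) for a single outage.

    Data loss is the time between the last good backup and the outage;
    downtime is the time between the outage and restoration of service.
    """
    data_loss = outage_start - last_backup
    downtime = service_restored - outage_start
    return data_loss <= rpo, downtime <= rto


# Hypothetical incident: backup at 02:00, outage at 09:30, restored at 10:15.
last_backup = datetime(2012, 9, 16, 2, 0)
outage_start = datetime(2012, 9, 16, 9, 30)
service_restored = datetime(2012, 9, 16, 10, 15)

rpo_met, rto_met = meets_objectives(
    last_backup, outage_start, service_restored,
    rpo=timedelta(hours=4), rto=timedelta(hours=1),
)
print(rpo_met, rto_met)  # False True
```

In this made-up incident the service is back within the hour, so the RTO is met, but 7.5 hours of data were at risk, so a 4-hour RPO would call for more frequent backups or replication.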

Figure 1: The evolution of Recovery Time Objective. Virtualization of server technology provides an order of magnitude improvement in the way applications are backed up, recovered and protected against disasters.

A successful implementation must integrate various server-, network- and storage-centric products, each with its own local optimization best practices, into an end-to-end optimization strategy. While each vendor attempts to assure its success with more software and services, small and medium enterprises often cannot afford the escalating software and service expenses associated with these optimization strategies, and they become vulnerable. The exponential growth in demand for voice, data and video services in the consumer market has also introduced severe strains on current IT infrastructures. Three main issues are currently driving distributed computing solutions to seek new approaches:

  1. Current IT datacenters have evolved incrementally to meet business service needs, moving from server-centric application design to client-server networking to storage area networking without an end-to-end architectural transformation along the way. Server, network and storage vendors optimized management within their own local domains, often duplicating functions from other domains in order to compete in the marketplace. For example, cache memory is used to improve the response time of service transactions. However, redundant cache management in servers, storage and even network switches makes tuning the response time a complex task requiring multiple management systems. Application developers have also started to build server, storage and network management into their applications. For example, Oracle is not just a database application. It is also a storage manager and a network manager as well as an application manager, and it tries to optimize all of its resources for performance tuning. No wonder it takes an army of experts to keep it going. The result is an over-provisioned datacenter with functions duplicated many times over by the server, storage and networking vendors. Large enterprises with big profit margins throw people, tons of hardware and a host of custom software and shelf-ware packages at their needs. Some datacenter managers do not even know what assets they have, which is of course yet another opportunity for vendors to sell an asset management system to discover what is available, along with services to manage assets using that system. Another example is de-duplication software, which finds multiple copies of the same files and removes the duplicates. This shows how expensive it is to clean up after the fact.
  2. Heterogeneous technologies from multiple vendors that are supposed to reduce IT costs actually increase complexity and management costs. Today, many CFOs consider IT a black hole that sucks in expensive consultants and continually demands capital and operational expenses for additional hardware and software, much of which ends up as shelf-ware because of its complexity. Even for mission-critical business services, enterprise CFOs are starting to question the productivity and effectiveness of current IT infrastructures. It becomes even more difficult to justify the cost and complexity of supporting the massive scalability and wild fluctuations in workload demanded by consumer services. The price point is set low for the mass market, but the demand for massive scalability is high (a relatively simple but massive service like Facebook is estimated to use about 40,000 servers, and Google is estimated to run a million servers to support its business).
  3. More importantly, Internet-based consumer services such as social networking, e-mail and video streaming have introduced new elements: wild fluctuations in demand and a massive scale of delivery to a diverse set of customers. The result is increased sensitivity to the economics of service creation, delivery and assurance. Unless the cost structure of the IT management infrastructure is addressed, mass-market needs cannot be met profitably. Large service providers such as Amazon, Google and Facebook have understandably implemented alternatives to cope with wildly fluctuating workloads, massive scaling of customers and the latency constraints imposed by demanding response time requirements.

Cloud computing technology has evolved to meet the needs of massive scaling, wild fluctuations in consumer demand and response time control of distributed transactions spanning multiple systems, players and geographies. More importantly, cloud computing drastically changes backup and disaster recovery (DR) strategies, reducing the RTO to minutes and seconds and doing much better than SAN/NAS-based server-less backup and recovery strategies. A key enabler is the live migration of virtual machines, which is accomplished as follows:

  1. The entire state of a virtual machine is encapsulated by a set of files stored on shared storage such as Fibre Channel or iSCSI Storage Area Network (SAN) or Network Attached Storage (NAS).
  2. The active memory and precise execution state of the virtual machine are rapidly transferred over a high-speed network, allowing the virtual machine to switch near-instantaneously from running on the source host to running on the destination host. The entire process can take less than a few seconds on a Gigabit Ethernet network.
  3. The networks being used by the virtual machine are virtualized by the underlying host. This ensures that even after the migration, the virtual machine network identity and network connections are preserved.
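The "few seconds" figure for the memory transfer step can be sanity-checked with simple arithmetic. The sketch below is a rough estimate only: the 256 MiB of active memory and the 90% link efficiency are assumptions for illustration, and it ignores memory pages that change while the copy is in progress, so treat the result as a lower bound.

```python
def transfer_seconds(memory_bytes, link_bits_per_second, efficiency=0.9):
    """Estimate the time to copy a VM's active memory over a network link.

    `efficiency` is an assumed fraction of the raw link rate that is
    actually usable for payload after protocol overhead.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    usable_bits_per_second = link_bits_per_second * efficiency
    return memory_bytes * 8 / usable_bits_per_second


GIGABIT = 1_000_000_000  # bits per second

# Hypothetical VM with 256 MiB of active memory.
active_memory = 256 * 1024 * 1024
print(f"{transfer_seconds(active_memory, GIGABIT):.2f} s")  # 2.39 s
```

Because the disk state already lives on shared storage (step 1), only memory and execution state cross the network, which is why the estimate stays in the range of seconds rather than minutes.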

While virtual machines improve resiliency and use live migration to reduce the RTO, the increased complexity of hypervisors, their orchestration, and virtual machine images and their management places an additional burden on the datacenter. Figure 2 shows the evolution of datacenters from the mainframe days to the cloud computing transformation. The cost of creating and delivering a service has decreased continuously with the increasing performance of hardware and software technologies. Developing and delivering new services used to take months or years; it now takes only hours or weeks. On the other hand, as service demand increased with ubiquitous access over the Internet and broadband networks, the need for resiliency (availability, reliability, performance and security management), efficiency and scaling put new demands on service assurance, and hence created the need for continuous reduction of RTO and RPO. The introduction of SAN server-less backup and virtual machine migration has reduced the RTO and RPO, but it has in turn increased the complexity, and hence the cost, of managing service transactions during delivery.

Figure 2: Cost of Service Creation, Delivery and Assurance with the Evolution of Datacenter Technologies. The management cost has exploded because a myriad of point-solution appliances, software and shelf-ware have been cobbled together from multiple vendors. Any future solution that addresses the datacenter management conundrum must provide end-to-end service visibility and control transcending multiple service provider resource management systems. Future datacenter focus will be on a transformation from Resources Management to Services Switching to provide telecom-grade “trust”.

The increased complexity of managing services implemented using the serial von Neumann computing model executing a Turing machine turns out to be more a fundamental architectural issue, related to Gödel’s prohibition of self-reflection in Turing machines, than a software design issue. Cockshott et al. conclude their book “Computation and its limits” with the paragraph “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” While the last statement is not strictly correct (for example, current operating systems intersperse the management of computing resources with the computations that model a physical system on a Turing machine), it still points to a fundamental limitation of current Turing machine implementations of computations using the serial von Neumann stored program control computing model. The universal Turing machine allows a sequence of connected Turing machines to synchronously model a physical system from a description specified by a third party (the modeler). The context, constraints, communication abstractions and control of various aspects during the execution of the model (which specify the relationship between the computer acting as the observer and the computed acting as the observed) cannot also be included in the same description of the model because of Gödel’s theorems of incompleteness and decidability. Figure 3 shows the evolution of computing from mainframe/client-server computing, where the management was labor-intensive, to the cloud computing paradigm, where the management services (which include the computers themselves in the model controlling the physical world) are automated.

Figure 3: Evolution of Computing with respect to Resiliency, Efficiency and Scaling.

The first phase (of conventional computing) depended on manual operations and served well when the service transaction times and service management times could be very far apart and did not affect the service response times. As the service demands increased, service management automation helped reduce the gap between the two transaction times at the expense of increased complexity and resulting cost of management. It is estimated that 70% of today’s IT budget goes to self-maintenance and only 30% goes to new service development. Figure 4 shows current layers of systems contributing to cloud management.

Figure 4: Services and their management complexity

The origin of the complexity is easy to understand. Current ad hoc distributed service management practices originated from server-centric operating systems and narrow-bandwidth connections. End-to-end service transaction management, together with the resource allocation and contention resolution needed to respond to changing circumstances (which depend on business priorities, latency and workload fluctuations), was accommodated as an afterthought. In addition, the open competitive marketplace has driven server-centric, network-centric and storage-centric devices and appliances to multiply. The resulting duplication of many management functions in multiple devices, without an end-to-end architectural view, has contributed greatly to the cost and complexity of management. For example, storage volume management is duplicated in server, network and storage devices, leading to a complex web of performance optimization strategies. Special-purpose appliances have sprouted to provide application, network, storage and server security, often duplicating many of the same functions. The lack of an end-to-end architectural framework has led to point solutions that dominate the service management landscape, often negating the efficiency improvements in service development and delivery made possible by hardware performance improvements (Moore’s law) and by software technologies and development frameworks.

The escape from this conundrum is to re-examine the computation models and circumvent the computational limit by going beyond Turing machines and the serial von Neumann computing model. A recently proposed computing model implemented in the DIME network architecture (Designing a New Class of Distributed Systems, Springer 2011) attempts to provide a new approach based on the O-machine (oracle machine) proposed by Turing in his thesis. Phase 3 in Figure 3 shows the new computing model, which uses a non-von Neumann managed Turing machine to implement hierarchical self-management of temporal computing processes. The implementation exploits the parallel threads and high bandwidth available in many-core processors. It provides auto-scaling, live migration, performance optimization and end-to-end transaction security by applying FCAPS (fault, configuration, accounting, performance and security) management to each Linux process, and a network of such managed Linux processes provides a distributed service transaction. This eliminates the need for hypervisors and virtual machines and their management, and so reduces complexity. Since a Linux process is virtualized instead of a virtual machine, backup and DR operate at the process level and also cover the network of processes providing the service. Hence they are much more lightweight than VM-based backup and DR.

In its simplest form, the DIME computing model modifies the stored program control (SPC) implementation of the Turing machine by exploiting the parallelism and high bandwidth available in today’s infrastructure.

Figure 5: The DIME Computing Model – A Managed Turing Machine with Signaling, which incorporates the spirit of the oracle machine proposed by Turing in his thesis.

Figure 5 shows the transition from the Turing machine (TM) to a managed TM through three additions:

  1. Before any read or write, the computing element checks the fault, configuration, accounting, performance and security (FCAPS) policies assigned to it,
  2. Self-management of the computing element is endowed by introducing parallel FCAPS management that sets the FCAPS policies that the computing element obeys, and
  3. A signaling network overlay provides an FCAPS monitoring and control channel, which allows managed TMs to be composed into a network that implements managed workflows.
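The three attributes above can be illustrated with a toy Python sketch. This is not the DIME implementation: the class names `Policy`, `ManagedElement` and `Manager` are hypothetical, only two policy checks stand in for the full FCAPS set, and the manager runs in the same thread for simplicity, whereas the model described here has it running in parallel.

```python
import queue
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical stand-in for FCAPS policies; only two checks are modeled."""
    allow_io: bool = True   # security/configuration: may the element read or write?
    max_steps: int = 100    # performance/accounting: budget of I/O operations


class ManagedElement:
    """A computing element that consults its policy before every read or write."""

    def __init__(self, name, tape):
        self.name = name
        self.tape = list(tape)
        self.head = 0
        self.steps = 0
        self.policy = Policy()
        self.signals = queue.Queue()  # signaling channel, separate from the tape

    def _apply_signals(self):
        # Drain policy updates sent by the manager over the signaling channel.
        while not self.signals.empty():
            self.policy = self.signals.get()

    def _check(self):
        self._apply_signals()
        if not self.policy.allow_io:
            raise PermissionError(f"{self.name}: I/O blocked by policy")
        if self.steps >= self.policy.max_steps:
            raise RuntimeError(f"{self.name}: step budget exhausted")
        self.steps += 1

    def read(self):
        self._check()
        return self.tape[self.head]

    def write(self, symbol):
        self._check()
        self.tape[self.head] = symbol


class Manager:
    """Sets policies for elements through their signaling channels."""

    def set_policy(self, element, policy):
        element.signals.put(policy)


element = ManagedElement("tm-1", "0101")
manager = Manager()
element.write("1")  # allowed under the default policy
manager.set_policy(element, Policy(allow_io=False))
try:
    element.read()
except PermissionError as err:
    print(err)  # tm-1: I/O blocked by policy
```

The point of the sketch is the separation of channels: the computation only touches the tape, while policy changes arrive over a different path and take effect before the next read or write, without modifying the computation's own code.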

Figure 6 shows the services architecture with DIME network management providing end-to-end service FCAPS management.

Figure 6: Service Management with DIME Networks

The resulting decoupling of services management from infrastructure management provides a new approach to service management, including backup and DR. While the DIME computing model is in its infancy, two prototypes have already demonstrated its usefulness: one with a LAMP stack and another with a new native OS designed for many-core servers. Unlike virtual machine-based backup and DR, the DIME network architecture supports auto-provisioning, auto-scaling, self-repair, live migration, secure service isolation and end-to-end distributed transaction security across multiple devices at the level of a process in an operating system. This approach therefore not only avoids the complexity of hypervisors and virtual machines (although it still works with virtual servers) but also allows live migration to be applied to existing applications without changes to their code. In addition, it offers a new approach in which the hardware infrastructure is simpler, freed from the burden of anticipating service-level requirements, and the intelligence of services management resides in the services infrastructure. This leads to the deployment of intelligent self-managing services using a dumb infrastructure on stupid networks.

In conclusion, we emphasize that the DIME network architecture (DNA) works with or without hypervisors and the associated virtual machine, IaaS and PaaS complexity, and allows uniform service assurance across hybrid clouds independent of the service providers’ management systems. Only the virtual server provisioning commands are required to configure just enough OS and the DIMEX libraries and to execute service components using DNA.

The power of the DIME network architecture is easy to understand. By introducing parallel management to the Turing machine, we convert a computing element into a managed computing element. In current operating systems, the managed element is a process. In the new native operating system we have demonstrated (parallax-OS), it is a core in a many-core processor. A managed element provides plug-in dynamism to the service architecture.

Figure 7 shows a service deployment in a Hybrid cloud with integrated service assurance across the private and public clouds without using service provider management infrastructure. Only the local operating system is utilized in DIME service network management.

Figure 7: A DNA-based services deployment and assurance in a Hybrid Cloud. The decoupling of dynamic service provisioning and management from infrastructure resource provisioning and management (server, network and storage administration) enabled by DNA makes static provisioning of resource pools possible, while dynamic migration of services allows them to seek the right resources at the right time based on workloads, business priorities and latency constraints.

As mentioned earlier, the DIME network architecture is still in its infancy, and researchers are developing both the theory and the practice to validate its usefulness in mission-critical environments. Hopefully, in this year of the Turing centenary celebration, some new approaches will address the limits of computation pointed out by Cockshott et al. in their book. To paraphrase Turing (who was unimpressed by Wilkes’s EDSAC design, commenting that it was “much more in the American tradition of solving one’s difficulties by means of much equipment rather than by thought”), a lot of appliances or code is often not a sustainable substitute for thoughtful architecture.

Is the Software Defined Network (SDN) Another Detour to a Datacenter Dead-end?

August 6, 2012

Introduction

Frustrated by the inability to fiddle with Internet routing in the real world, Stanford computer scientist Nick McKeown and colleagues developed a standard called OpenFlow that essentially opens up the Internet to researchers, allowing them to define data flows using software, a sort of “software-defined networking.” Installing a small piece of OpenFlow firmware (software embedded in hardware) gives engineers access to flow tables, the rules that tell switches and routers how to direct network traffic, yet it protects the proprietary routing instructions that differentiate one company’s hardware from another’s. SDN is nothing more than the separation of the processing of network data traffic from the logic and rules controlling the flow, inspection and modification of that data. Traditional network hardware, i.e. switches and routers, implements these functions in proprietary firmware partitioned into what are known as the data and control planes, respectively. This is a fine research project. However, as the major vendors start to take it seriously and attempt to introduce it into real-world datacenters, one must ask whether it will add to or reduce complexity in an already complex datacenter, where a host of piecemeal solutions are offered by mega corporations that seek to continually increase their revenues and have no incentive to reduce complexity by cutting the number of hardware and software components deployed, since that would cut into their product sales.
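The separation of data and control planes can be sketched in a few lines of Python. This is an illustration of the idea only, not the OpenFlow protocol or any real controller API: the classes `FlowRule`, `Switch` and `Controller`, the header field names and the action strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FlowRule:
    match: dict        # header fields that must match, e.g. {"dst": "10.0.0.2"}
    action: str        # what the switch does, e.g. "forward:2" or "drop"
    priority: int = 0


@dataclass
class Switch:
    """Data plane: looks up packets in a table it does not compute itself."""
    table: list = field(default_factory=list)

    def handle(self, packet):
        # Highest-priority matching rule wins; unmatched packets go to the controller.
        for rule in sorted(self.table, key=lambda r: r.priority, reverse=True):
            if all(packet.get(k) == v for k, v in rule.match.items()):
                return rule.action
        return "send_to_controller"


class Controller:
    """Control plane: decides the rules and installs them on switches."""

    def install(self, switch, rule):
        switch.table.append(rule)


switch = Switch()
controller = Controller()
controller.install(switch, FlowRule({"dst": "10.0.0.2"}, "forward:2", priority=10))
controller.install(switch, FlowRule({"src": "10.0.0.66"}, "drop", priority=20))

print(switch.handle({"src": "10.0.0.1", "dst": "10.0.0.2"}))   # forward:2
print(switch.handle({"src": "10.0.0.66", "dst": "10.0.0.2"}))  # drop
print(switch.handle({"src": "10.0.0.1", "dst": "10.0.0.9"}))   # send_to_controller
```

Note that the controller is one more component to deploy and manage, which is exactly the question this post raises about whether SDN adds to or reduces datacenter complexity.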

Systems theory tells us that as the number of components in a system increases, the cost of complexity can outweigh the benefits unless an architectural reorganization provides a way out. We argue that the management complexity in current IT infrastructure design, which is based on the serial von Neumann stored program control (SPC) implementation of the universal Turing machine, is more a fundamental architectural issue related to the lack of resiliency of the computing model than a software design issue. Cockshott et al. (2012) conclude their book “Computation and its limits” with the paragraph “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” Current-generation distributed systems are implemented using a network of Turing machines in which the service and its management are intermixed, as shown in Figure 1. The resources used by the nodes in a network are often controlled by a plethora of management systems that are outside the purview of the service workflow using those resources. Thus the end-to-end service transaction response is controlled by these management systems, which introduce a layer of complexity in coordination and contention resolution and leave the service much simpler than its management.

Figure 1: Serial von Neumann implementation of Turing Machines

The limitations of the SPC computing architecture were clearly on von Neumann’s mind when he gave his lecture at the Hixon Symposium in 1948 in Pasadena, California (von Neumann, 1987, p. 408): “The basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately.” Comparing computing machines with living organisms, he points out that computing machines are not as fault tolerant as living organisms. He goes on to say, “It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond” (von Neumann, 1987, p. 408). It is clear that von Neumann recognized a problem in the way we design computing systems.

He also observed: “Normally, a literary description of what an automaton is supposed to do is simpler than the complete diagram of the automaton. It is not true a priori that this always will be so. There is a good deal in formal logic which indicates that when an automaton is not very complicated the description of the function of the automaton is simpler than the description of the automaton itself, as long as the automaton is not very complicated, but when you get to high complications, the actual object is much simpler than the literary description.” (von Neumann, 1987, pp. 454-457). He remarked, “It is a theorem of Gödel that the description of an object is one class type higher than the object and is therefore asymptotically infinitely longer to describe.” (von Neumann, 1987, pp. 454-457). The conjecture of von Neumann leads to the fact that “one cannot construct an automaton which will predict the behavior of any arbitrary automaton” (von Neumann, 1987, p. 456). This is so with the Turing machine implemented by the SPC model.

In simpler terms, the management complexity is related to the classical Russell’s paradox, which can be paraphrased as follows: “Who manages the managers?” Gödel’s prohibition of self-reflection in a Turing machine mandates a hierarchy of Turing machines (TMs) acting as managers that manage other Turing machines, which in turn implement computations described as a sequence of instructions compiled into a sequence of 1s and 0s. The universal Turing machine (or the general-purpose computer) implements these TMs in a synchronous workflow, thus prohibiting changes to a computation at run-time while it is in progress in any Turing machine (i.e., you cannot change the behavior of the compiled code until its execution is interrupted).

Current-generation server, networking and storage equipment and their management systems have evolved from server-centric and bandwidth-limited network architectures to today’s cloud computing architecture with virtual servers and broadband networks. During the last six decades, many layers of computing abstractions have been introduced to map the execution of complex computational workflows to a sequence of 1s and 0s that eventually gets stored in memory and operated upon by the CPU to achieve the desired result. These include process definition languages, programming languages, file systems, databases and operating systems. While this has helped in automating many business processes, the exponential growth in services in the consumer market has also introduced severe strains on current IT infrastructure. To respond rapidly in managing the distributed computing resources demanded by changing workloads, business priorities and latency constraints, new layers of resource management have been added with the introduction of hypervisors, virtual machines (VMs) and their management. While these layers have made application or service management more agile, they have introduced a new layer of issues related to their own management. For example, new layers of virtual machine-level clustering, intrusion detection and performance management are being introduced in addition to the clustering, intrusion detection and performance management systems that already exist at the infrastructure, operating system and distributed resource management layers.

However, this approach is completely unsuited to exploiting the new generation of many-core servers and high-bandwidth networks now available. With the advent of many-core servers containing tens or even hundreds of computing cores and high-bandwidth communication among them, the current generation of server, networking and storage equipment and management systems, which evolved from server-centric and bandwidth-limited architectures, cannot use the next-generation computing infrastructure efficiently. It is hard to imagine replicating current TCP/IP-based socket communication, “isolate and fix” diagnostic procedures, and the multiple operating systems (which do not have end-to-end visibility or control of business transactions that span multiple cores, chips, servers and geographies) inside next-generation many-core servers without addressing their shortcomings. Many-core servers and processors constitute a network in which each node is itself a sub-network with different bandwidths and protocols: socket-based low-bandwidth communication between servers, InfiniBand or PCI Express bus-based communication across processors in the same server, and shared-memory low-latency communication across the cores inside a processor.

Figure 2 shows the many-core server network supporting multiple bandwidths.

In order to cope with the scaling issues and utilize the hierarchical many-core network of networks effectively, next generation service architecture has to emulate the architectural resiliency of cellular organisms that tolerate faults and implement command and control structures which enable execution of self-configuring, self-monitoring, self-protecting, self-healing and self-optimizing (in short self-*) business processes. This requires new computing models that break the Turing machine barrier to computation by allowing the computer and the computed to be treated in the same model.

Papers Solicited to Address Next Generation Datacenter Infrastructure and Technologies:

The conference on “Convergence of Distributed Clouds, Grids and their Management” sponsored under the aegis of WETICE 2013 is devoted to addressing next generation computing models which support real-time resource reconfiguration of distributed business workflow execution based on latency constraints, changing workloads and business priorities. It is devoted to addressing the assurance of reliability, availability, performance, account management and security of distributed business process execution with appropriate visibility and control.

The objective of the conference was first stated at WETICE 2009: “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short-term profit driven motives of a particular corporate entity.” We are glad to report that the discussions started in 2009 have directly resulted in an alternative approach to self-managing distributed computing systems that is totally different from the current industry trend and shows a way to eliminate the complexity of virtual machines and hypervisors. If this approach proves to be theoretically sound (as a paper at WETICE 2012 investigated) and its usefulness (demonstrated in the form of two proofs of concept at the last conference) extends to mission-critical environments, the DIME network architecture may yet prove to be an important contribution to computer science.

Following this tradition, the aim of WETICE 2013 is to transform current complex, redundant, costly and knowledge-intensive IT management into self-configuring, self-monitoring, self-healing and self-optimizing distributed workflow implementations, with service management limited only by the speed of light. We identify the emerging area of software defined networks (SDN) as a candidate for further investigation, free of the bias that often surrounds commercial profit motives, to see whether SDNs will reduce the overall complexity of the datacenter or are yet another layer of complexity.

Papers are solicited to advance the next generation distributed computing and its management infrastructure that leverages the new hardware innovations.  The goals of the conference include (but are not limited to):
  1. Discovering new application scenarios, proposing new operating systems, programming abstractions and tools
  2. Identifying the challenging problems that still need to be solved, such as parallel programming, scaling and management of distributed computing elements, and
  3. Reporting results and experiences gained by researchers in building dynamic Grid-based middleware, computing clouds (distributed or otherwise) and workflow management systems.
Submission of papers March 10, 2013
Notification to authors April 1, 2013
Final papers to IEEE-CS April 25, 2013
Paper author’s registration deadline May 10, 2013
 WETICE-2013 Conference June 17-20, 2013

References:

P. Cockshott, L. M. MacKenzie and  G. Michaelson, “Computation and its Limits”, Oxford University Press, Oxford 2012.

J. von Neumann, “Probabilistic logics and the synthesis of reliable organisms from unreliable components”, in “Automata Studies”, edited by C. E. Shannon and J. McCarthy, Princeton University Press, 1956, pp. 43-98.

W. Aspray, and A. Burks, “Papers of John von Neumann on Computing and Computer Theory.” Cambridge, MA: MIT Press. 1989.

Alan Turing’s Legacy and the Emerging Computing Models for Distributed Clouds, Grids and Their Management

July 21, 2012 - Leave a Response

The concepts of a Turing machine and the Turing-Church thesis have been the foundations of Computer Science for more than 70 years. However, much has changed since Turing compared “a man in the process of computing a real number to a machine which is only capable of a finite number of conditions.” The Internet, the hardware upheaval brought by many-core processors with parallel hardware threads, and high bandwidth communications, none of which existed in Turing’s time, are forcing a new look at computing paradigms. Dr. Eberbach’s talk presents an overview of computing models for a very important class of distributed systems: autonomic clouds and grids. He presents the DIME network architecture as a representative of this still relatively new class of computing and attempts to capture its potential by formal modelling and by examining its emerging properties.

Paper (WETICE2012 IEEE International Conference on Convergence of Distributed Clouds, Grids and Their Management, Toulouse, France)

Computing Models for Distributed Autonomic Clouds and Grids in the Context of the DIME Network Architecture (Click Here for the Paper)

Biography of Prof. Eugene Eberbach:

With more than 150 publications, Professor Eberbach has contributed in the areas of process algebras, resource bounded optimization, autonomous agents and mobile robotics. His recent topics of interest are new computing paradigms, languages and architectures, distributed computing, concurrency and interaction, evolutionary computing and neural nets. In the late 1980s he worked on new non-von Neumann 5th Generation Computer Architectures at University College London. In the 1990s and 2000s he worked on distributed autonomous underwater vehicles with the support of ONR. In Canada and the USA he introduced the Calculus of Self-modifiable Algorithms and the $-Calculus process algebra for automatic problem solving under bounded resources with the support of NSERC. His work with Professors Wegner and Goldin on the fundamental limitations of Turing machines as a foundation of computability has contributed to the recent resurgence of investigation into new computing models. He proposed two super-Turing models of computation: $-Calculus and Evolutionary Turing Machines. He recently presented a paper (with Wegner and Burgin) on “Computational Completeness of Interaction Machines and Turing Machines” at the Turing Centenary Conference in Manchester, proving that Interaction Machines are more expressive than Turing Machines.

He has held many academic positions; an Associate Professor at Rensselaer; an Associate Professor at Computer and Information Science Department and Intercampus Graduate School of Marine Sciences and Technology, University of Massachusetts Dartmouth, USA; a tenured Professor at School of Computer Science, Acadia University and an Adjunct Professor at Faculty of Graduate Studies, Dalhousie University, Canada; Senior Scientist at Applied Research Lab, The Pennsylvania State University, USA; Visiting Professor at The University of Memphis, USA; Research Scientist at University College London, U.K.; Assistant Professor at Rzeszow University of Technology, Poland, and he also has industrial experience – WSK “PZL-Rzeszow” and Applied Research Lab, Penn State. (http://www.ewp.rpi.edu/~eberbe

Papers Related to New Computing Models (Turing Centenary Conference, Manchester, June 24 – 26, 2012)

Computational Completeness of Interaction Machines and Turing Machines

Turing O-Machine and the DIME Network Architecture: Injecting the Architectural Resiliency into Distributed Computing

Cloud Computing, Management Complexity, Self-Organizing Fractal Theory, Non Equilibrium Thermodynamics, DIME networks, and all that Jazz

May 5, 2012 - Leave a Response

“There are two kinds of creation myths: those where life arises out of the mud, and those where life falls from the sky. In this creation myth, computers arose from the mud and code fell from the sky.”

– George Dyson, “Turing’s Cathedral: The Origins of the Digital Universe”, New York: Random House, 2012.

“The DIME network architecture arose out of the need to manage the ephemeral nature of life in the Digital Universe”

– Rao Mikkilineni (2012)

Abstract:

The explosion of current cloud computing software offerings (both open-sourced and proprietary) to create public, private and hybrid clouds raises a question. Is it resulting in higher resiliency, efficiency and scaling of service offerings, or is it increasing the complexity by introducing more components in an already crowded datacenter deploying myriad appliances, management frameworks, tools and people, all claiming to help lower total cost of operation? As the reliability, availability, performance, security and efficiency of the total system depend both on the number of components and their configuration, the architecture of a system plays an important role in defining the overall system resiliency, efficiency and scaling. We discuss current cloud computing architecture and the resulting complexity, and investigate possible solutions using the self-organizing fractal theory and non-equilibrium thermodynamics. Evolution has taught us that when complexity increases, an architectural transformation often occurs to lower the overall system entropy. Is a phase transition about to occur in our data centers, seeded by the new many-core servers and high bandwidth communications?

Introduction:

According to Holbrook (Holbrook 2003), “Specifically, creativity in all areas seems to follow a sort of dialectic in which some structure (a thesis or configuration) gives way to a departure (an antithesis or deviation) that is followed, in turn, by a reconciliation (a synthesis or integration that becomes the basis for further development of the dialectic). In the case of jazz, the structure would include the melodic contour of a piece, its harmonic pattern, or its meter…. The departure would consist of melodic variations, harmonic substitutions, or rhythmic liberties…. The reconciliation depends on the way that the musical departures or violations of expectations are integrated into an emergent structure that resolves deviation into a new regularity, chaos into a new order, surprise into a new pattern as the performance progresses.” He goes on to explain exquisitely what “all that jazz” means and what it has to do with Dynamic Open Complex Adaptive System or DOCAS.

I borrow the jazz metaphor to understand the current state of affairs in cloud computing. Cloud computing started innocently enough as an attempt to automate systems administration tasks of computing systems to improve the resiliency (availability, reliability, performance and security), efficiency and scaling of services provided by web-hosting data centers. Before the advent of global web e-commerce enabled by broadband networks and ubiquitous access to high-powered computing, the workload fluctuations were not wild enough to demand very fast response in provisioning to meet them. While enterprise datacenters were not pushed to deal with the wild fluctuations that some web-services companies were, companies such as Amazon, Google, Facebook and Twitter, dealing with uncertain (non-deterministic) workload fluctuations, took a different approach to improve resiliency and scaling. They took advantage of the increased power in blade servers, high bandwidth networks and virtualization technologies to create virtual machine (VM) based systems administration, with multiple VMs in a physical device consolidating workloads that are managed with dynamic resource provisioning. This has become known as cloud computing. Strictly speaking, a VM is not essential for automation to improve scaling, auto-failover and live migration of applications and their data; companies such as Google have chosen their own automation strategies without using VMs. On the other hand, many other enterprises have taken a more conservative approach, not adopting the cloud strategy in order to avoid the risk of impacting the availability, performance and security of their highly tuned mission critical applications. They are probably correct, given the continued occasional outages, security breaches and cost escalation in managing complexity with many public clouds.

Amazon and Google went one step further by offering their flexible infrastructures to developers outside their company to rent the resources with which they could develop, deploy and service their own applications, thus unleashing a new class of developers. Startups could substitute OPEX for CAPEX to obtain the resources required for their new product and services development. The resulting explosion of applications and services has created a new demand for more clouds and more automation of systems administration to extend resiliency and provide a high degree of isolation among multiple tenants sharing resources while resolving the resulting contentions. The result is a complex web of Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) offerings to meet the needs of developers, service providers and service consumers. To be sure, these offerings are not independent. On the contrary, each layer influences the others in a complex set of interactions, often in a non-deterministic way, based on workloads, business priorities and latency constraints. Figure 1 shows an example of these relationships.

Figure 1: Complex relationships of information flow between nested layers and information flows between components in each layer. The complexity is only compounded by multi-vendor offerings in each layer (not shown here)

The origin of complexity is easy to understand. While attempting to solve the issue of multi-tenancy and agility, the introduction of virtual machines gives rise to another complexity: virtual image management and sprawl control. To address the VM mobility issue, recent efforts introduce application level mobility using other container constructs, such as Gears and Cartridges in the case of the Red Hat PaaS (or Dynos in the case of Heroku, the Salesforce PaaS), and thereby add yet another layer of management for Gears and Cartridges (or Dynos). Another example is the Eucalyptus Infrastructure as a Service, which goes to great lengths to provide High Availability (HA) of the infrastructure platform but fails to guarantee HA of applications. It is left to the applications to fend for themselves. These ad-hoc approaches to automating management have mushroomed the software required, steepened the learning curve and made operation and maintenance even more complex. While all platforms demonstrate drag and drop software with pretty displays that allow developers to easily create new services, there is no guarantee that if something goes wrong, one will be able to debug and find the root cause. Nor is there any assurance that when multiple services and applications are deployed on the same platform, the feature interactions and shared resource management provided by a plethora of independently designed management systems will cooperate to provide the required reliability, availability, performance and security at the service level. More importantly, when the services cross server, data-center and geographical boundaries, there is no visibility and control of end to end service connections and their FCAPS management.
Obviously, the platform vendors are only too eager to provide professional services and additional software to resolve the issues, but without end to end service connection visibility and control that spans multiple modules, systems, geographies and management systems, troubleshooting expenses could outweigh the realized benefits. What we probably need is not more “code” but an intelligent architecture that results in a synthesis of computing services and their management, and a decoupling of end to end service connection and service component management from underlying resource (server, network and storage) management.

Self-organizing Fractals and Non-equilibrium Thermodynamics:

Fortunately, the self-organizing fractal theory (SOFT) and non-equilibrium thermodynamics (NET) (Kurakin 2011) provide a way to analyze complex systems and identify solutions. A very good glimpse into the theory can be found in the video (http://www.scivee.tv/node/4994). According to the SOFT-NET theory, the process of self-organization is scale-invariant and proceeds through sequential organizational state transitions, in a manner characteristic of far-from-equilibrium systems, with macrostructure-processes emerging via phase transition and self-organization of microstructure-processes. Once they have emerged as a result of an organizational transition, newborn structure-processes strive to persist and expand, growing in size/number, diversity, complexity, and order, while feeding on pre-existing energy/matter gradients. Economic competition among alternatively organized structure-processes feeding on the same energy/matter gradients leads to the elimination of economically deficient or inferior structure-processes and the improvement, diversification, and specialization of survivors, who are forced to fill and exploit all the available resource niches (the Darwinian phase of self-organization) (Kurakin 2007). Promoted by mutually profitable exchanges of energy/matter, the self-organization of specializing survivors (structure-processes) into larger scale structure-processes transforms (mostly) competing alternatives into (mostly) cooperating complements. As a result, Darwinian competition is transferred onto a larger spatiotemporal scale, where it commences among alternative organizations of self-organized survivors (the organizational phase) (Kurakin 2007).
Such an economy-driven, scale-invariant process of self-organization leads to the emergence of increasingly long-lived, multi-scale, hierarchical organizations (structures-processes) that expand over increasingly larger scales of space and time, feeding on available energy/matter gradients and eventually destroying them. Yet because energy/matter exists as a non-equilibrium system of interdependent gradients and conjugated fluxes of interconverting energy/matter forms, new gradients and fluxes are created and become dominant as old gradients and fluxes are consumed and destroyed. Such processes are responsible for the continuous birth, death, and transformation of energy/matter forms.

Obviously, cloud computing systems (or for that matter, distributed computing systems in general based on Turing machines) are not living organisms and thus are not susceptible to self-organization. However, if you substitute information to replace energy/matter, there are many similarities between the structure and dynamics of computing systems and living self-organizing systems. The nested computing layers, meta-stable organizational patterns (both macro- and micro- structures) in each layer, and process evolution through inter-layer interaction are the same features that contribute to self-organization. So one can ask what is missing for the cloud computing environments to become self-organizing. The answer lies in two observations:

  1. The first is Gödel’s prohibition of self-reflection by the computing elements that form the fundamental building block in the computing domain, the Turing machine (TM) (Samad and Cofer, 2001).
  2. The second is the lack of the scale invariant macro and micro structure-processes mentioned above for the organization of computing components and their management across various nested layers, a consequence of the current ad-hoc implementation of computing processes using the serial von Neumann implementation of the Turing machine.

I have discussed both these deficiencies elsewhere (Mikkilineni 2011, 2012). The DIME network architecture proposed there attempts to address both these deficiencies.

The DIME Network Architecture:

In its simplest form, a DIME comprises a policy manager (determining the fault, configuration, accounting, performance, and security aspects, often denoted by FCAPS); a computing element called MICE (Managed Intelligent Computing Element); and two communication channels. The FCAPS elements of the DIME provide setup, monitoring, analysis and reconfiguration based on workload variations, system priorities based on policies, and latency constraints. They are interconnected and controlled using a signaling channel which overlays a computing channel that provides I/O connections to the MICE (or the computing element) (Mikkilineni 2011). The DIME computing element acts like the oracle machine that Turing introduced in his thesis and circumvents Gödel’s halting and undecidability issues by separating computing from its management and pushing the management to a higher level. Figure 2 shows the DIME computing model.

Figure 2: The DIME Computing Model. For details on the different implementations of DIME networks (a LAMP stack without VMs and a native Parallax OS) visit http://www.youtube.com/kawaobjects
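
To make the separation between the two channels concrete, here is a minimal Python sketch of a single DIME node. It is an illustration under stated assumptions, not the implementation referenced above: the class name `Dime`, the command names (`start`, `stop`, `configure`, `report`) and the counters are all hypothetical. The only idea taken from the text is that management commands travel on a signaling channel that is kept apart from the computing channel carrying I/O to the MICE.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Dime:
    """Hypothetical DIME node: a MICE wrapped by a small FCAPS manager."""
    name: str
    mice: Callable[[Any], Any]  # the managed computing element
    running: bool = False
    stats: Dict[str, float] = field(
        default_factory=lambda: {"calls": 0, "faults": 0, "seconds": 0.0}
    )

    def signal(self, command: str, payload: Any = None) -> Any:
        """Signaling channel: management commands only, never application data."""
        if command == "start":
            self.running = True
        elif command == "stop":
            self.running = False
        elif command == "configure":   # configuration: swap the MICE at run time
            self.mice = payload
        elif command == "report":      # fault, accounting and performance counters
            return dict(self.stats)
        else:
            raise ValueError(f"unknown signal: {command}")
        return None

    def compute(self, value: Any) -> Any:
        """Computing channel: I/O to the MICE, observed by the manager."""
        if not self.running:
            raise RuntimeError(f"{self.name} is not running")
        self.stats["calls"] += 1
        started = time.perf_counter()
        try:
            return self.mice(value)
        except Exception:
            self.stats["faults"] += 1
            raise
        finally:
            self.stats["seconds"] += time.perf_counter() - started


node = Dime("node-1", mice=lambda x: x * 2)
node.signal("start")
first = node.compute(3)                      # uses the original MICE
node.signal("configure", lambda x: x + 1)    # reconfigured from outside the MICE
second = node.compute(3)                     # uses the replacement MICE
```

In this sketch the MICE never inspects or modifies itself; every change to its behavior arrives through `signal`, which is one simple way to read the idea of pushing management to a higher level.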

In addition, the introduction of signaling in the DIME network architecture allows a fractal composition scheme of the DIME network to create a recursive distributed computing engine with scale invariant FCAPS management of the computing workflow at node, sub-network and network level. Figure 3 shows the comparison between living organisms with self-organizing fractal attributes and Cloud computing infrastructure organized to exhibit self-management fractal attributes.

Figure 3: Comparison of the nested hierarchical organization of living organisms and DIME network architecture.
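
The scale invariant composition described above can be sketched as a recursive structure in which a node, a sub-network and a network all expose the same management interface. The Python below is only an illustration; `ManagedUnit` and its fields are hypothetical names, and the two commands and three counters are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ManagedUnit:
    """Hypothetical unit: a node when it has no children, a (sub-)network otherwise."""
    name: str
    faults: int = 0
    running: bool = False
    children: List["ManagedUnit"] = field(default_factory=list)

    def signal(self, command: str) -> None:
        """Apply a management command here, then propagate it downstream."""
        if command not in ("start", "stop"):
            raise ValueError(f"unknown signal: {command}")
        self.running = command == "start"
        for child in self.children:
            child.signal(command)

    def report(self) -> Dict[str, int]:
        """Aggregate the same counters at every level of the hierarchy."""
        total = {"units": 1, "running": int(self.running), "faults": self.faults}
        for child in self.children:
            for key, value in child.report().items():
                total[key] += value
        return total


a, b, c = ManagedUnit("a"), ManagedUnit("b"), ManagedUnit("c")
subnet = ManagedUnit("subnet", children=[a, b])
network = ManagedUnit("network", children=[subnet, c])

network.signal("start")     # one signal at the root reaches every level
a.faults = 2                # a fault recorded at a leaf
summary = network.report()  # the same call works on a node, a sub-network or the network
```

Because `signal` and `report` have the same shape at every level, a manager written against one unit can manage an entire sub-network without change, which is the property the word "fractal" points to here.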

While both models exhibit the genetic transactions of replication, repair, recombination and reconfiguration (Stanier and Moore, 2006) (Mikkilineni 2011), there is a fundamental difference between the two. The DIME network architecture is not self-organizing but it is self-managing based on initial policies and constraints defined at the root levels of the hierarchies. These policies can be modified during run time but only with the influence of agents external to the computing element whose behavior is under modification (at the DIME node, sub-network and network level).

At each level, the FCAPS management defines the initial conditions and policy constraints (a meta-model, if you will, denoting the context and defining the destiny of the ensuing process workflow) that will define the information flows and workflows executed by the DIME network downstream. The resulting metastable configurations are monitored and managed by the managers upstream. This model exhibits the three-step process that provides self-management in living organisms: establish a routine, monitor cues and respond with corrective action based on FCAPS parameters at every level. Figure 4 shows the metastable configuration entropy of the whole system. The monitored FCAPS parameters provide a measure of the system entropy shown, and reconfiguration moves the state from higher entropy to lower entropy, providing a “measure” of the stable pattern.

Figure 4: System Entropy as a function of time
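
The three-step process (establish a routine, monitor cues, respond with corrective action) can be illustrated with a toy management loop. Everything specific in this Python sketch is an assumption made for illustration: the `Policy` fields, the latency model in `latency_ms`, and the choice of adding replicas as the corrective action. It does not compute the entropy shown in Figure 4; it only shows a policy set upstream driving reconfiguration downstream.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Policy:
    """Hypothetical policy defined upstream before the workflow starts."""
    max_latency_ms: float
    max_replicas: int


def latency_ms(load: float, replicas: int) -> float:
    """Toy performance model (an assumption): latency grows with load per replica."""
    return 10.0 * load / replicas


def manage(policy: Policy, loads: List[float], replicas: int = 1) -> List[int]:
    """Return the replica count chosen after each monitoring interval."""
    history = []
    for load in loads:  # monitor cues
        # Respond with corrective action until the policy holds or the budget runs out.
        while (latency_ms(load, replicas) > policy.max_latency_ms
               and replicas < policy.max_replicas):
            replicas += 1
        history.append(replicas)
    return history


policy = Policy(max_latency_ms=50.0, max_replicas=4)   # establish the routine
history = manage(policy, loads=[2, 8, 12, 30])
```

With these numbers the loop settles on 1, 2, 3 and then 4 replicas; at the last load the policy cannot be met within the replica budget, which is the kind of condition a manager one level up would have to handle.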

The SOFT-NET theories provide a path to reexamine the way we design distributed computing systems. Perhaps the living organisms with their self-organizing properties could provide us a way to bring self-management to cloud computing configurations to improve resiliency, efficiency and scaling. The DIME network architecture is a baby-step to implement a recursive distributed computing engine to execute managed workflows that constitute hierarchical and temporal sequences of events executing business workflows.

The DIME network architecture raises some interesting questions about Turing machines and their management. How is it related to the Universal Turing Machine (UTM)? It is important to point out that I do not claim that DIME networks are the answer to Cloud computing woes or that the UTM can or cannot do what a DIME network does. While communicating Turing machines are modeled by a UTM (Penrose 1989), can managed Turing machine networks also be modeled by the UTM? Are the scale-invariant organizational macro and micro structure-processes discussed in SOFT-NET theory essential for self-organizing systems? What are the differences between living self-organizing systems and self-managing networks? I leave this to the experts. I only point out that the DIME is inspired by the oracle machine discussed by Turing in his thesis and implements the architectural resiliency of cellular organisms in distributed computing infrastructure by introducing parallel management of both the computing elements and networks. While its feasibility has been demonstrated (Mikkilineni, Morana and Seyler, 2012), the DIME network architecture is still in its infancy and presents an opportunity on the eve of Turing’s centenary celebration to investigate its usefulness and theoretical soundness. Only time will tell if the DIME network architecture is useful in mission critical environments. Figure 5 shows a comparison of physical server based computing, Virtual Machine based cloud computing and a DIME network implementation on a Linux server that eliminates the Hypervisors and Virtual Machines.

Figure 5: Comparison between conventional, cloud and DIME network computing paradigms. The DIME network architecture requires no Hypervisors, Virtual Machines, IaaS or PaaS. Linux processes are FCAPS managed and networked using a middleware library without any changes to the Operating System.

The DIME network architecture, with its self-management, parallel signaling network overlay and recursive distributed computing engine model, supports all the features that current cloud computing provides and more, while eliminating the need for Hypervisors, Virtual Machines, IaaS and PaaS. The DIME network architecture offers simplicity by providing FCAPS management of a Linux process through a middleware library that uses standard services of the Linux operating system and the parallelism available in a multi-core/many-core processor.
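
As a small illustration of managing an ordinary operating system process with nothing but standard services, the Python sketch below supervises a child process and restarts it on failure. It is not the middleware library mentioned above; the function name `supervise` and the restart budget are hypothetical, and it covers only a sliver of the fault-management part of FCAPS.

```python
import subprocess
import sys
from typing import List


def supervise(argv: List[str], max_restarts: int = 2) -> List[int]:
    """Run argv, restarting after a non-zero exit until it succeeds or the budget is spent.

    Returns the exit code of every attempt, which doubles as a simple fault log.
    """
    codes = []
    for _ in range(max_restarts + 1):
        completed = subprocess.run(argv)
        codes.append(completed.returncode)
        if completed.returncode == 0:   # healthy exit: no corrective action needed
            break
    return codes


# A child that always fails is restarted once and then given up on.
failing = supervise([sys.executable, "-c", "import sys; sys.exit(3)"], max_restarts=1)

# A child that succeeds is run exactly once.
healthy = supervise([sys.executable, "-c", "pass"])
```

The managed program is unchanged and unaware of the supervisor; the policy (how many restarts to allow) lives entirely outside it.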

Conclusion:

I conclude with one lesson from the past (Mikkilineni and Sarathy, 2009) that I take away from working on POTS (Plain Old Telephone System), PANS (Pretty Amazing New Services enabled by the Internet), SANs and Clouds: wherever there is networking, switching always trumps other approaches. When services are executed by a network of distributed components, service switching and end-to-end service connection management are the ultimate meta-stable structure-processes, and it seems that cellular organisms, telephone networks, and human network eco-systems have figured this out. Signaling and nested FCAPS management structure-processes seem to be the common ingredients. Therefore, I predict that eventually the data centers, which are currently computing resource management centers, will transform themselves into services switching centers just as in telephony. Perhaps computer scientists should look to telephony, neuroscience and organizational dynamics for answers rather than engaging in hackathons and coding ad-hoc complex systems to manage distributed computing resources. SOFT-NET theories seem to be pointing in the right direction. The solution may lie in discovering scale invariant micro- and macro-structure processes that provide nested FCAPS management and self-managed local and global policy enforcement. Perhaps Holbrook’s “All that Jazz” is an appropriate metaphor for cloud computing research. The time may be ripe for the reconciliation (the synthesis of the thesis of implementing services and the antithesis of services management).

References:

Holbrook, Morris B. 2003. “Adventures in Complexity: An Essay on Dynamic Open Complex Adaptive Systems, Butterfly Effects, Self-Organizing Order, Coevolution, the Ecological Perspective, Fitness Landscapes, Market Spaces, Emergent Beauty at the Edge of Chaos, and All That Jazz.” Academy of Marketing Science Review [Online] 2003 (6) Available: http://www.amsreview.org/articles/holbrook06-2003.pdf

Kurakin, A., Theoretical Biology and Medical Modelling, 2011, 8:4. http://www.tbiomed.com/content/8/1/4

Kurakin A: The universal principles of self-organization and the unity of Nature and knowledge. 2007 [http://www.alexeikurakin.org/text/thesoft.pdf ].

Mikkilineni, R., Sarathy, V., (2009), “Cloud Computing and the Lessons from the Past,” Enabling Technologies: Infrastructures for Collaborative Enterprises, 2009. WETICE ’09. 18th IEEE International Workshops on, pp. 57-62, June 29 2009-July 1 2009. doi: 10.1109/WETICE.2009.

Mikkilineni, R., (2011). Designing a New Class of Distributed Systems. New York,NY: Springer. (http://www.springer.com/computer/information+systems+and+applications/book/978-1-4614-1923-5)

Mikkilineni (2012) Turing Machines, Architectural Resilience of Cellular Organisms and DIME Network Architecture (http://www.computingclouds.wordpress.com )

Mikkilineni, R., Morana, G., and Seyler, I., (2012), “Implementing Distributed, Self-managing Computing Services Infrastructure using a Scalable, Parallel and Network-centric Computing Model” Chapter in a Book edited by Villari, M., Brandic, I., & Tusa, F., Achieving Federated and Self-Manageable Cloud Infrastructures: Theory and Practice (pp. 1-374). doi:10.4018/978-1-4666-1631-8

Penrose, R., (1989) “The Emperor’s New Mind: Concerning Computers, Minds, And The Laws of Physics” New York, Oxford University Press, p. 48

Samad, T., Cofer, T., (2001). Autonomy and Automation: Trends, Technologies, Tools. In Gani, R., Jørgensen, S. B., (Ed.) European Symposium on Computer Aided Process Engineering volume 11, Amsterdam, Netherlands: Elsevier Science B. V., p. 10

Stanier, P., Moore, G., (2006) “Embryos, Genes and Birth Defects”, (2nd Edition), Edited by Patrizia Ferretti, Andrew Copp, Cheryll Tickle, and Gudrun Moore, London, John Wiley & Sons

Turing Machines, Architectural Resilience of Cellular Organisms and DIME Network Architecture

March 27, 2012 - One Response

“Cellular biology has evolved to capture dynamic representations of self and its surroundings and a systemic view of monitoring and control of both the self and the surroundings to optimize the organism’s chances of survival. Signaling plays a key role in shaping the structure and behavior of cellular organisms to exhibit a high degree of resiliency by monitoring and controlling its own activity and its interactions with the outside environment with a Zen-like one-ness of the observer and the observed. Evolution has invented the genetic transactions of replication, repair, recombination and reconfiguration to support the survival of living cells by organizing themselves to execute a coordinated set of activities and signaling provides a vehicle for managing the system-wide behavior.

By introducing signaling and self-management in a Turing node and a signaling network as an overlay over the computing network, the current von Neumann computing model is evolved to bring the architectural resiliency of cellular organisms to computing infrastructure. The new approach introduces the genetic transactions of replication, repair, recombination and reconfiguration to program self-resiliency in distributed computing systems executing a managed workflow. Perhaps the injection of parallelism, signaling and network based recursive (fractal-like) composition of “Self” identity are the first steps in introducing the elements of homeostasis and self-management required for developing consciousness in the computing infrastructure.”

Dr. Rao Mikkilineni and Dr. Giovanni Morana

Some of these ideas will be discussed in WETICE 2012 and CISIS-2012

Introduction:

“Only two requests for reprints came in. Engineers avoided Turing’s paper because it appeared entirely theoretical, and theoreticians avoided it because of the references to paper tape and machines”

– George Dyson, “Turing’s Cathedral: The Origins of the Digital Universe”, Random House, New York, 2012, p. 249

Alan Turing was born in London on June 23, 1912, and the Alan Turing Centenary Conference will be held in Manchester on June 22-25, 2012, hosted by The University of Manchester, where Turing worked from 1948 to 1954. According to the conference organizers, the conference has the following aims:

  • to celebrate the life and research of Alan Turing;
  • to bring together the most distinguished scientists, to understand and analyze the history and development of Computer Science and Artificial Intelligence.

The conference includes two special public lectures (90 minutes each), 17 lectures (60 minutes each) by invited speakers, including lectures presenting the work of Alan Turing, one dinner lecture, two panel discussions, the presentation of awards to the research competition winners and short presentations from the selected research competition winners.

One can only wonder what Turing himself would say at this conference if he were alive today. The computers of his day were radically different from today’s (except for the human computer, which he was trying to imitate with a machine) and the discipline of computer science itself did not exist. For sure, there are at least three major areas that would be of interest to him:

  1. The many-core processors and associated potential parallel computing paradigms;
  2. The DNA based biological information models that architect and regulate complex biological processes with precision providing the architectural resilience, and;
  3. The advances in neuroscience and the emerging models of consciousness.

The connection between consciousness and computing models is succinctly summarized by Samad and Cofer (Samad & Cofer, 2001). While there is no accepted precise definition of the term consciousness, “it is generally held that it is a key to human (and possibly other animal) behavior and to the subjective sense of being human. Consequently, any attempt to design automation systems with humanlike autonomous characteristics requires designing in some elements of consciousness. In particular, the property of being aware of one’s multiple tasks and goals within a dynamic environment and of adapting behavior accordingly.” They point to two theoretical limitations of formal systems that may inhibit the implementation of computational consciousness and hence limit our ability to design human-like autonomous systems. “First, we know that all digital computing machines are “Turing-equivalent”; they differ in processing speeds, implementation technology, input/output media, etc., but they are all (given unlimited memory and computing time) capable of exactly the same calculations. More importantly, there are some problems that no digital computer can solve. The best known example is the halting problem; we know that it is impossible to realize a computer program that will take as input another, arbitrary, computer program and determine whether or not the program is guaranteed to always terminate.

Second, by Gödel’s proof, we know that in any mathematical system of at least a minimal power there are truths that cannot be proven. The fact that we humans can demonstrate the incompleteness of a mathematical system has led to the claims that Gödel’s proof does not apply to humans.”
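
The halting problem mentioned in the quotation rests on a diagonal argument that is easy to sketch in code. The Python below is an illustration only, with hypothetical names throughout (`make_paradox`, `halts_within`, `refute`). Programs are modeled as generator functions so that a loop can be observed one step at a time, and the step budget is merely a finite stand-in for "runs forever": by construction the looping branch never ends.

```python
from typing import Callable, Iterator, Tuple

Program = Callable[[], Iterator[None]]   # a program; each yield is one step of work
Decider = Callable[[Program], bool]      # a claimed halting test: True means "halts"


def make_paradox(halts: Decider) -> Program:
    """Build a program that does the opposite of whatever `halts` predicts for it."""
    def paradox() -> Iterator[None]:
        if halts(paradox):
            while True:      # predicted to halt, so loop forever
                yield
        # predicted to loop, so fall through and halt immediately
    return paradox


def halts_within(program: Program, budget: int) -> bool:
    """Observe a program for at most `budget` steps; True if it finished."""
    steps = program()
    for _ in range(budget):
        try:
            next(steps)
        except StopIteration:
            return True
    return False


def refute(halts: Decider, budget: int = 1000) -> Tuple[bool, bool]:
    """Return (what the decider predicted, what was observed) for its own paradox."""
    paradox = make_paradox(halts)
    return halts(paradox), halts_within(paradox, budget)


optimist = refute(lambda program: True)    # claims every program halts
pessimist = refute(lambda program: False)  # claims no program halts
```

For both candidate deciders the prediction and the observation disagree, and the same construction defeats any other total decider passed in, which is the heart of the impossibility result.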

An important implication of Gödel’s incompleteness theorem is that it is not possible to have a finite description with the description itself as a proper part. In other words, it is not possible to read yourself or process yourself as a process. In short, Gödel’s theorems prohibit “self-reflection” in Turing machines. Louise Barrett (Barrett, 2011) highlights the difference between Turing machines implemented using the von Neumann architecture and biological systems. “Although the computer analogy built on von Neumann architecture has been useful in a number of ways, and there is also no doubt that work in classic artificial intelligence (or, as it is often known, Good Old Fashioned AI: GOFAI) has had its successes, these have been somewhat limited, at least from our perspective here as students of cognitive evolution.” She argues that Turing machines based on algorithmic symbol manipulation using the von Neumann architecture gravitate toward those aspects of cognition, like natural language, formal reasoning, planning, mathematics and playing chess, that involve processing abstract symbols in a logical fashion, and leave out other aspects of cognition that deal with producing adaptive behavior in a changeable environment. Unlike the approach where perception, cognition and action are clearly separated, she suggests that the dynamic coupling between the various elements of the system, where each change in one element continually influences every other element’s direction of change, has to be accounted for in any computational model that includes the system’s sensory and motor functions along with analysis. To be fair, such couplings in the observed can be modeled and managed using a Turing machine network, and the Turing network itself can be managed and controlled by another serial Turing network. What is not possible is the tight integration of the models of the observer and the observed with a description of the “self” using parallelism and signaling, which are the norm and not the exception in biology.

A more interesting controversy regarding the need for new computing models (Wegner & Eberbach, 2004; Cockshott & Michaelson, 2007; Goldin & Wegner, 2008) throws new light on the need to re-examine Turing machines, Gödel’s prohibition of self-reflection and von Neumann’s conjecture. An even more recent discussion of the need for new computing models was presented in the Ubiquity symposium (ACM Ubiquity, 2011). As we describe later, these authors are attempting to address how to model computational problems that cannot be solved by a single Turing machine but can be solved using a set of Turing machines interacting with each other. In particular, the property of being aware of one’s multiple tasks and goals within a dynamic environment and of adapting behavior accordingly, which is related to consciousness as mentioned earlier, is one such problem that a single Turing machine cannot solve. The insights from biology suggest that modeling the temporal dynamics of the observer and the observed, while also assuring the safe-keeping of the observer (with a “self”-identity), requires modifications to the Turing machine to accommodate changes to the behavior while computation is still in progress.

Self-Identity, Self-Reflection, and Self-Management – The Dynamic Representation of the Observer and the Observed:

Self-identity, self-reflection, setting expectations, monitoring the deviations and taking corrective action are essential for managing the business of life through homeostasis. Evolution has figured out how to encapsulate the right descriptions to execute life’s processes using the genetic transactions of replication, repair, recombination and reconfiguration by exploiting parallelism and signaling. As Jonah Lehrer (Lehrer, 2010) describes in his book “How We Decide”, “Dopamine neurons automatically detect the subtle patterns that we would otherwise fail to notice; they assimilate all the data that we can’t consciously comprehend. And, then, once they come up with a set of refined predictions about how the world works, they translate these predictions to emotions.” Emotions, it seems, are the instinctual, localized, component-level suggestions for corrective actions based on local experience. Consciousness, on the other hand, is the adult who correlates the instinctual suggestions with a much larger perspective and makes decisions based on global priorities.

It is becoming clear from recent advances in neuroscience that self-reflection is a key component in living organisms. Homeostasis is not possible without a dynamic and active representation of the observer and the observed.

A cellular organism is the simplest form of life that maintains an internal environment that supports its essential biochemical reactions, despite changes in the external environment. Therefore, a selectively permeable plasma membrane surrounding a concentrated aqueous solution of chemicals is a feature of all cells. In addition, a cell is capable of self-replication and self-repair. Organisms may be unicellular or multicellular. Unicellular organisms perform all the functions of life within a single cell. Multicellular organisms contain several different cell types that are specialized to perform specific functions. The cell adapts to its environment by recognition and transduction of a broad range of environmental signals, which in turn activate response mechanisms by regulating the expression of proteins that take part in the corresponding processes. The nucleus of the cell houses deoxyribonucleic acid (DNA), the genetic blueprint of the organism, which determines the structure and function of the organism as a whole. The DNA serves two functions. First, it contains instructions for assembling the structural and enzymatic proteins of the cell. Cellular enzymes in turn control the formation of other cellular structures and also determine the functional activity of the cell by regulating the rate at which metabolic reactions proceed. Second, by replicating (making copies of itself), DNA perpetuates the genetic blueprint within all new cells formed within the body and is responsible for passing on genetic information from survivors to their successors.

A gene is a stretch of DNA that contains instructions or code for a particular function such as synthesizing a protein or dictating the assembly of amino acids. In complex organisms, a unique set of genes is packaged as chromosomes. A gene regulatory network represents relationships between genes that can be established by measuring how the expression level of each one affects the expression level of the others. In any global cellular network, genes do not interact directly with other genes. Instead, gene induction or repression occurs through the action of specific proteins, which are in turn products of certain genes. In essence, gene networks are abstract models that display causal relationships between gene activities and are represented by directed graphs. Nearly all of the cells of a multicellular organism contain the same DNA. Yet this same genetic information yields a large number of different cell types. The fundamental difference between a neuron and a liver cell, for example, is which genes are expressed. The regulatory gene network forms a cellular control circuitry defining the overall behavior of the various cells. According to Antonio Damasio (Damasio, 2010), the brain architecture is an evolutionary aid to the business of managing life, which consists of managing the body, and this management gains precision and efficiency with the presence of circuits of neurons assisting it. In describing the role of neurons, he says that “neurons are about life and managing life in other cells of the body, and that aboutness requires two-way signaling. Neurons act on other body cells, via chemical messages or excitation of muscles, but in order to do their job, they need inspiration from the very body they are supposed to prompt, so to speak. In simple brains, the body does its prompts simply by signaling to subcortical nuclei. Nuclei are filled with “dispositional know-how,” the sort of knowledge that does not require detailed mapped representations. But in complex brains, the map-making cerebral cortices describe the body and its doings in so much explicit detail that the owners of those brains become capable, for example, of “imaging” the shape of their limbs and their positions in space, or the fact that their elbows hurt or their stomach does”.

The complex network of neural connections and signaling mechanisms collaborate to create a dynamic, active and temporal representation of both the observer and the observed with myriad patterns, associations and constraints among their components. It seems that the business of managing life is more than mere book-keeping that is possible with a Turing machine. It involves the orchestration of an ensemble with a self-identity both at the group and the component level contributing to the system’s biological value. It is a hierarchy of individual components where each node itself is a sub-network with its own identity and purpose which is consistent with the system-wide purpose. To be sure, each component is capable of book-keeping and algorithmic manipulation of symbols. In addition, identity and representations of the observer and the observed at both the component and group level make system-wide self-reflection possible. As recent advances in neuroscience throw new light on the process of evolution of the cellular computing models, it is becoming clear that communication and collaboration mechanisms of distributed computing elements and end-to-end distributed transaction management played a crucial role in the development of self-resiliency, efficiency and scaling which are exhibited by diverse forms of life from the cellular organisms to highly evolved human beings. According to Antonio Damasio (Damasio, 2010), managing and safe keeping life is the fundamental premise of biological value and this biological value has influenced the evolution of brain structures. “Life regulation, a dynamic process known as homeostasis for short, begins in unicellular living creatures, such as bacterial cell or a simple amoeba, which do not have a brain but are capable of adaptive behavior. It progresses in individuals whose behavior is managed by simple brains, as in the case with worms, and it continues its march in individuals whose brains generate both behavior and mind (insects and fish being examples)….” Homeostasis is the property of a system that regulates its internal environment and tends to maintain a stable, constant condition of properties like temperature or chemical parameters that are essential to its survival. System-wide homeostasis goals are accomplished through a representation of current state, desired state, a comparison process and control mechanisms.

He goes on to say that “consciousness came into being because of biological value, as a contributor to more effective value management. But consciousness did not invent biological value or the process of valuation. Eventually, in human minds, consciousness revealed biological value and allowed the development of new ways and means of managing it.” The governance of life’s processes is present even in single-celled organisms that lack a brain and it has evolved to the conscious awareness which is the hallmark of highly evolved human behavior. “Deprived of conscious knowledge, deprived of access to the byzantine devices of deliberation available in our brains, the single cell seems to have an attitude: it wants to live out its prescribed genetic allowance. Strange as it may seem, the want, and all that is necessary to implement it, precedes the explicit knowledge and deliberation regarding life conditions, since the cell clearly has neither. The nucleus and the cytoplasm interact and carry out complex computations aimed at keeping the cell alive. They deal with the moment-to-moment problems posed by the living conditions and adapt the cell to the situation in a survivable manner. Depending on the environmental conditions, they rearrange the position and distribution of molecules in their interior, and they change the shape of sub-components, such as microtubules, in an astounding display of precision. They respond under duress and under nice treatment too. Obviously, the cell components carrying out those adaptive adjustments were put into place and instructed by the cell’s genetic material.” This vivid insight brings to light the cellular computing model that:

  1. Spells out the computational workflow components as a stable sequence of patterns that accomplishes a specific purpose,
  2. Implements a parallel management workflow with another sequence of patterns that assures the successful execution of the system’s purpose (the computing network to assure biological value with management and safekeeping),
  3. Uses a signaling mechanism that controls the execution of the workflow for gene expression (the regulatory network) and
  4. Assures real-time monitoring and control (homeostasis) to execute genetic transactions of replication, repair, recombination and reconfiguration (Stanier & Moore, 2006).
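
The homeostatic loop described above (a representation of the current state, a desired state, a comparison process and a control mechanism) can be sketched in a few lines of Python. This is only an illustrative sketch: the `Homeostat` class, its `gain` parameter and the body-temperature numbers are assumptions made for the example, not part of any biological or DIME specification.

```python
from dataclasses import dataclass


@dataclass
class Homeostat:
    """Hypothetical regulator holding a desired state and a tolerance."""
    desired: float          # desired state (set point)
    tolerance: float        # deviation that is considered acceptable
    gain: float = 0.5       # fraction of the error corrected per step (assumed)

    def compare(self, current: float) -> float:
        """Comparison process: signed deviation from the desired state."""
        return self.desired - current

    def control(self, current: float) -> float:
        """Control mechanism: correct only when the deviation is too large."""
        error = self.compare(current)
        if abs(error) <= self.tolerance:
            return current
        return current + self.gain * error


def regulate(homeostat: Homeostat, current: float, steps: int) -> list:
    """Run the monitor-compare-correct loop and return the state history."""
    history = [current]
    for _ in range(steps):
        current = homeostat.control(current)
        history.append(current)
    return history
```

For example, `regulate(Homeostat(desired=37.0, tolerance=0.1), 40.0, 20)` starts from a disturbed state of 40.0 and settles within the tolerance band around 37.0, after which the loop leaves the state unchanged.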

The efficient management and safekeeping of life is evident at the lowest level of biological architecture, which provides the resiliency that von Neumann discussed in his Hixon lecture (von Neumann, 1987). “The basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately.” Comparing computing machines and living organisms, he points out that computing machines are not as fault tolerant as living organisms. He goes on to say “It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond.”

In short, the business of managing life is implemented by a system consisting of a network of networks with multiple parallel links that transmit both control information and the mission critical data required to sense and to control the observed by the observer. The data and control networks provide the capabilities to develop an internal representation of both the observer and the observed along with the processes required to implement the business of managing life. The organism is made up of autonomic components making up an ensemble collaborating and coordinating a complex set of life’s processes that are executed to sense and control both the observer and the observed. In this sense, the brain and the body are part of a collaborating system that has a unique identity and a structure that preserves the interrelationships. The system consists of:

  1. Components each with a purpose within a larger system (specialization),
  2. All of a system’s component parts must be present for the system to carry out its purpose optimally,
  3. A system’s parts must be arranged in a specific way for the system to carry out its purpose (separation of concerns),
  4. Systems change in response to feedback (collect information, analyze information and control environment using specialized resources), and
  5. Systems maintain their stability (in accomplishing their purpose) by making adjustments based on feedback (homeostasis).

Figure 1 shows the model of core consciousness, its relationship to the observed and the extended consciousness, as proposed by Damasio (Damasio, 1999) based on his studies in neuroscience.

Figure 1: The mapping of the observer, the observed and myriad models, associations and processes executed using parallel signaling and data exchange networks.  Each component itself is a sub-network with a purpose defined by its own internal models.

Literature is filled with discussion about Gödel’s prohibition of self-reflection in Turing machines and why consciousness cannot emerge from the brain models that depend on Turing machines. There are many theories on how the human brain is unique and may even involve quantum phenomena or gravity waves (Scott, 1995 and Davis, 1992). However Damasio (Damasio, 2010) takes the evolutionary approach to discuss genomic unconsciousness, the feeling of conscious will, educating the cognitive conscious, the reflective self and its consequences. He goes on to say “in one form or another, the cultural developments manifest the same goal as the form of automated homeostasis.” “They respond to a detection of the imbalance in the life process, and seek to correct it within the constraints of human biology and of the physical and social environment.”

Instead of adding to the already existing controversy (Scott, 1995) on consciousness, let us take a different route using Damasio’s emphasis on homeostasis along with the dynamic representation of the observer and the observed. Let us apply them to extend the Turing machine and its von Neumann serial computing implementation. Let us ask how we can utilize the abstractions that assist in the business of managing life in cellular organisms, discussed above, to enhance the resiliency of distributed computing systems. In the next section, we analyze the current implementation of Turing machines and suggest adding some of the abstractions that have proven useful in managing life’s processes, in order to develop a computing model that addresses the problem of being aware of one’s multiple tasks and goals within a dynamic environment and of adapting behavior accordingly.

Turing Machines, Super Turing Machines and DIME Networks:

While a single stored program control (SPC) node lacks the self-reflection that Gödel’s theorems prohibit, networks of Turing machines have been used successfully to implement business workflows that observe and manage the external world. This is accomplished by modeling the observed (external to the computing infrastructure) and orchestrating the temporal dynamics of the observed. This has helped us develop complex control systems that can be monitored and controlled with the resiliency of cellular organisms.

However, what is missing is the same resiliency in the infrastructure (the observer) that implements the control of the observed. In order to introduce consciousness, we must also introduce the “self” identity of the observer, together with its awareness of its multiple tasks and goals within a dynamic environment and its ability to adapt behavior accordingly.

The evolution of computing seems to follow a path similar to that of cellular organisms, in the sense that it emerged as an individual computing element (the von Neumann stored program control (SPC) implementation of the Turing machine) and evolved into today’s networks of managed computing elements executing complex workflows that monitor and control the external environment. The Turing machine originally started as a static closed system (Goldin & Wegner, 2008), analogous to a single cell. It was designed for computing algorithms that correspond to a mathematical world view. This is the case with assembly language programming, where a CPU is programmed and the Turing machine is implemented using the von Neumann stored program control computing model, as shown in Figure 2.

Figure 2: A Turing machine with von Neumann Stored Program Control implementation in its simplest form.

The Church-Turing thesis stipulates that “Turing machines can compute any effective (partially recursive) functions over naturals (strings).” Goldin and Wegner argue that the Church-Turing thesis applies only to effective computations rather than computation by arbitrary physical machines, dynamical systems or humans.

Another way is to stipulate that “all computations can be represented as workflows specified by a directed acyclic graph (DAG). Algorithms are a sub set of all computations. An algorithm can be viewed as a workflow of instructions executed by a stored program control (SPC) computing unit (constituting an atomic unit of computation). Then, based on the programming paradigm of one’s choice, one can compose other computing units such as procedures, functions, objects etc., to execute the specified workflow.” This can reconcile the operating system conundrum that states that the operating systems do not terminate as required by the Turing machines. As soon as an operating system is introduced, the Turing machine SPC implementation immediately becomes a workflow of computations to implement a process, where each process now behaves as a new Turing machine with SPC implementation. It is as if the operating system is a manager (implementing a management workflow using a group of management Turing machines dedicated for this purpose) controlling a series of computing Turing machines based on policies set in the operating system. The operating system instructions and the computational flow dependent instructions are mixed to serially execute the process and a sequence of processes. This is analogous to the evolution of multi-cellular organisms where individual cells establish a common management protocol to execute their goals with shared resources. The individual processes may or may not have a common goal but they share the same resources. The operating system communicates with the processes to exert its role using shared memory as shown in Figure 3. While the individual processes do not have fault, configuration, accounting, performance and security (FCAPS) management of self, the operating system provides these functions using the signaling abstractions of addressing, alerting, mediation and supervision.

Figure 3: Operating system implements the managed Turing processes with Stored Program Control Computing Model (Serial von-Neumann execution of service and its management)
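
The workflow view described above, in which a manager controls a series of computing units arranged in a directed acyclic graph, can be illustrated with a short Python sketch. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) to order the tasks. The function name `run_workflow` and the example task names are hypothetical, and the serial loop stands in for the manager only in the loosest sense.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, deps):
    """Execute a workflow given as a DAG.

    tasks: maps a task name to a callable taking a dict of upstream results.
    deps:  maps a task name to the set of task names it depends on.
    Returns the execution order and the result of every task.
    """
    # Every task becomes a node, even if it has no dependencies.
    graph = {name: set(deps.get(name, ())) for name in tasks}
    results = {}
    order = []
    # static_order() yields each task only after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        inputs = {dep: results[dep] for dep in graph[name]}
        results[name] = tasks[name](inputs)
        order.append(name)
    return order, results


# A three-step workflow: read a value, transform it, report it.
example_tasks = {
    "read": lambda inputs: 3,
    "double": lambda inputs: inputs["read"] * 2,
    "report": lambda inputs: f"value={inputs['double']}",
}
example_deps = {"double": {"read"}, "report": {"double"}}
```

Calling `run_workflow(example_tasks, example_deps)` runs the three tasks in dependency order. A cycle in `deps` would make `TopologicalSorter` raise `graphlib.CycleError`, which is one way to see why the workflow is required to be acyclic.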

Since then, multi-threading in a single processor and networked and interactive computing have influenced the computations. In a network, concurrency and the influence of one node on another (the impact of the environment on the computation) are new elements that have to be addressed. The Pi calculus and super Turing models (Eberbach, Wegner & Goldin, 2011) are attempts to address these aspects. While these attempts are embroiled in controversy (Cockshott & Michaelson, 2007), what is not in dispute is that a network of computers represents a network of organized Turing machines where each node is a group of Turing machines managed locally.

In such a network, the local operating systems cannot provide FCAPS management of the system as a whole. The disciplines of distributed computing and distributed systems management evolved to address the FCAPS management of the system in an ad-hoc manner, without a formal computing model for the system as a whole. This is even more complicated when the system as a whole acts in unison with a system-wide purpose, where one element can influence other elements, as pointed out by Louise Barrett (Barrett, 2011).

In this case, the description of the functions performed and the influence of one computation on another have to be encoded at compile time, and each computing element does not have the ability to change its behavior at run time. In addition, the operating system’s function is to allocate resources appropriately to the consumers (processes running applications), and the applications themselves do not have any influence on the resources during run time. For example, if the workload fluctuates, the application has no way of monitoring and controlling the resources. It has to depend on external agents.

Figure 4: A network of Turing machines implementing a service workflow that manages the external environment (the observed). The management of the observer is also implemented using the same serial Turing machines where in some nodes the management of the observer and the observed are mixed in serial fashion and some other nodes are exclusively devoted to managing the observer.

Taking the cue from cellular biology, we can introduce self-management into the Turing machine to assure the resiliency of the Turing node. This requires a parallel monitoring and control mechanism to observe and control the Turing node, and a control or signaling channel to collaborate with other Turing nodes in system-wide FCAPS management. This assures that all the Turing nodes participating in the workflow that manages the observed are also managing themselves, which in turn assures the resiliency of the observer network. We stipulate that “The DIME computing model allows the specification and execution of a recursive composition model where each computing unit at any level specifies and executes the workflow at the lower level. The specification at a higher level eliminates the self-reflection prohibition of Gödel’s theorems on computational units. The parallel implementation of the management workflow and the computational workflow at each level allows the influence of one component in the workflow to influence another component at the lower level.

At any level, the computational unit specifies and assures the execution of the lower level workflow; thus it becomes the observer observing and controlling the workflow execution at the lower level (which is the observed).

This stipulation eliminates the problem of separation of communication between the computing system components in a system and the communication between the computing system and its environment. In current computing models of systems design, treating them as two separate issues has created the current disconnect in the distributed systems theories (Goldin, Wegner p 22)”

Figure 5 shows the new computing model, which we call the Distributed Intelligent Managed Element (DIME) network computing model; the resulting computing infrastructure is designed with the DIME network architecture.

Figure 5: A Distributed Intelligent Managed Element (DIME) with local management of the Turing computing node and signaling channel. The FCAPS attributes of the Turing node are continuously monitored and controlled based on local policies. In addition the signaling channel allows coordination with global policies.

The DIME network architecture (Mikkilineni, 2011) consists of four components:

  1. A DIME node which encapsulates the von Neumann computing element with self-management of FCAPS,
  2. Signaling capability that allows intra-DIME and inter-DIME communication and control,
  3. An infrastructure that allows implementing distributed service workflows as a set of tasks, arranged or organized in a DAG and executed by a managed network of DIMEs, and
  4. An infrastructure that assures DIME network management using the signaling network overlay over the computing workflow.

The self-management and task execution (using the DIME component called MICE, the managed intelligent computing element) are performed in parallel using stored program control computing devices. The DIME encapsulates the “dispositional know-how.” Each DIME is programmable to control the MICE and to provide continuous supervision of the programs executed by the MICE. The DIME FCAPS management makes it possible to model and represent the dynamic behavior of each DIME, the state of the MICE and its evolution as a function of time based on both internal and external stimuli. The parallel management architecture allows the observer (a network or sub-network) that forms a group to monitor and control itself while facilitating the implementation of monitoring and control of the observed in the external environment. Parallelism allows dynamic information flow both in the signaling channel and in the external I/O channels of the Turing computing nodes.

There are three special features of the DIME network architecture that contribute to self-resiliency:

  1. Each Turing computing node is controlled by the FCAPS policies set in each DIME. Each read and write is dynamically configurable based on the FCAPS policies.
  2. Each node itself can be a sub-network of DIMEs with goals set by the sub-network policies.
  3. The signaling allows dynamic connection management to reconfigure the DIME network thus changing the policies and behavior.
  3. The signaling allows dynamic connection management to reconfigure the DIME network thus changing the policies and behavior.
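
The first and third features can be made concrete with a small Python sketch of a node whose every read and write is checked against a policy that a signaling call can change at run time. This is a sequential simulation under stated assumptions, not the DIME implementation: the `Dime` class, its `signal` method and the `allow_read`/`allow_write` policy keys are hypothetical names chosen for the example, and true parallel management is not modeled.

```python
class Dime:
    """Hypothetical DIME-like node: a store plus a policy consulted on each access."""

    def __init__(self, name, policy):
        self.name = name
        self.policy = dict(policy)   # e.g. {"allow_read": True, "allow_write": True}
        self.memory = {}             # state manipulated by the computing element
        self.events = []             # record kept by the management side

    def signal(self, **changes):
        """Signaling channel: update the policy while the node is running."""
        self.policy.update(changes)
        self.events.append(("signal", dict(changes)))

    def _check(self, action, key):
        """Consult the policy before a read or write and record the decision."""
        allowed = self.policy.get(f"allow_{action}", False)
        self.events.append((action, key, allowed))
        if not allowed:
            raise PermissionError(f"{self.name}: {action} of {key!r} denied by policy")

    def read(self, key):
        self._check("read", key)
        return self.memory[key]

    def write(self, key, value):
        self._check("write", key)
        self.memory[key] = value
```

A node created with `Dime("dime-1", {"allow_read": True, "allow_write": True})` accepts writes until `signal(allow_write=False)` is called, after which further writes raise `PermissionError` while reads continue to work. The point of the sketch is only that the behavior of the computation changes in mid-run without changing the computation's own code.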

It is easy to show that the DIME network architecture supports the genetic transactions of replication, repair, recombination and rearrangement.

In summary, the dynamic configuration at the DIME node level and the ability to implement, at each node, a managed directed acyclic graph using a DIME sub-network provide a powerful paradigm for designing and deploying managed services that are decoupled from the hardware infrastructure management. Figure 6 shows a workflow implementation of monitoring and controlling an external environment (temperature monitoring and fan control to maintain the temperature in a range) using a self-managed DIME network with a signaling network overlay.

Figure 6: A workflow implementation using a DIME network. There are two FCAPS management workflows, one managing the observer (computing infrastructure) and the other managing the observed (Thermometer and the Fan)
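
The two management workflows named in the caption can be sketched as two small functions that run side by side in one loop: one keeps the observed (temperature and fan) in range, the other keeps the observer (the computing nodes) healthy. All names, thresholds and the repair-by-replacement rule below are assumptions made for illustration; the sketch is sequential and does not reproduce the actual DIME workflow shown in the figure.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical computing node belonging to the observer."""
    name: str
    healthy: bool = True


def manage_observed(temperature, fan_on, low=19.0, high=25.0):
    """Choose the fan state that keeps the temperature within [low, high]."""
    if temperature > high:
        return True        # too warm: switch the fan on
    if temperature < low:
        return False       # too cool: switch the fan off
    return fan_on          # in range: leave the fan as it is


def manage_observer(nodes):
    """Self-repair: replace any unhealthy node with a fresh one of the same name."""
    return [node if node.healthy else Node(node.name) for node in nodes]


def run(readings, nodes):
    """Apply both management workflows for each temperature reading."""
    fan_on = False
    fan_history = []
    for temperature in readings:
        nodes = manage_observer(nodes)                 # manage the observer
        fan_on = manage_observed(temperature, fan_on)  # manage the observed
        fan_history.append(fan_on)
    return fan_history, nodes
```

With readings of 20, 26, 24 and 18 degrees and the default range, the fan stays off, switches on, stays on, then switches off, while a node that starts out unhealthy is replaced before the first reading is processed.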

While the DIME network architecture provides food for thought about Turing machines, new computing models and the role of the representations of the observer and the observed in consciousness, it also has practical utility in developing software that exploits the parallelism and performance of many-core servers (Mikkilineni et al., 2011). Some results demonstrating self-repair and auto-scaling to control the response time of a web server were presented at the Server Design Summit (Mikkilineni, 2011).

Conclusion:

The limitation of Turing machines as a complete model of computation was pointed out by Wegner and Eberbach (2004). Their argument was challenged by Cockshott and Michaelson (2007), and the challenge was in turn rebutted by Goldin and Wegner (2008). The main argument for a new computing model is the need to account for the interaction between conventional algorithmic computation and the environment outside the computing element. The Turing model is closed and static and does not address the changes affecting the computation from outside while the computation is in progress. In order to account for networked systems in which each change in one element continually influences every other element’s direction of change, more expressive computing models are required. The von Neumann implementation of the Turing machine, with its serial processing and its mixing of algorithmic computation and interaction across a network of von Neumann computing nodes, has given rise to a complex management infrastructure that makes it difficult to implement the architectural resiliency of cellular organisms in our IT infrastructure.

The DIME computing model has two components, borrowed from the genetic computing model: a computing function specification and its management specification. The management specification describes what resources are needed, when to start, how to set up the needed resources, how to stop and how to reconfigure resources based on the monitored behavior of the computing element. The computing function specification uses the resources (CPU, memory, bandwidth, etc.) to complete the computation. During computation, the computing element pauses to check policies before each read or write so that it can follow management policies based on external stimuli.

The DIME network architecture, by implementing parallel management infrastructure to monitor and control the Turing machine, allows the read and write functions of the conventional Turing machine to be influenced by external interaction. The hierarchical composition model of DIME network architecture allows the identification of “self” (the observer) at various levels and the representation of the interaction between the observer and the observed very similar to the biological “self” described by Damasio.

The beauty of the DIME computing model is that it does not impact the current implementation of the service workflow using von-Neumann SPC nodes (monitoring and control of the observed external systems). But by introducing parallel control and management of the service workflow, the DIME network architecture provides the required scaling, agility and resilience both at the node level and at the network level (integrating the management and control of self, the observer). The signaling based network level control of a service workflow that spans across multiple nodes allows the end-to-end connection level quality of service management independent of the hardware infrastructure management systems that do not provide any meaningful visibility or control to the end-to-end service transaction implementation at run time. The only requirement for the DIME infrastructure provider is to assure that the node OS provides the required services for the service controller to load the Service Regulator and the Service Execution Packages to create and execute the DIME.

The network management of DIME services allows hierarchical scaling using the network composition of sub-networks. Each DIME with its autonomy on local resources through FCAPS management and its network awareness through signaling can keep its own history to provide negotiated services to other DIMEs thus enabling a collaborative workflow execution.

Each node has a unique identity and supports local behavior and its control using local policies that are programmable using the conventional von Neumann SPC Turing machines. Each sub-network and network allows a group identity (group self) and supports group behavior and control. The resulting network of networks enables a system-wide, resilient business of managing both the self and the services to monitor and control external behavior. The parallel control network allows dynamic connection management of component functions to create dynamic workflows that accommodate a changing environment.

The cellular implementation of the business of managing life may also show us the way to the business of managing our computing infrastructure, which has already proven valuable in implementing the business of managing our lives and our environment, transcending the body and mind of a single individual. As von Neumann remarked (von Neumann, 1966), “A theorem of Gödel that the next logical step, the description of an object, is one class type higher than the object and is therefore asymptotically longer to describe.” He admitted to twisting the theorem a little while describing the evolution of diversifying computational ecology from simple strings of ‘0’s and ‘1’s (von Neumann, 1987). Perhaps the recursive nature of a network containing sub-networks as nodes, along with FCAPS management at both the node and the network level, offers the definition of “self-identity” at various levels. While self-reflection at any level is prohibited by Gödel, a higher level “self” provides the required management and control to lower levels. A parallel signaling network that allows dynamic replication, repair, recombination and reconfiguration provides a degree of resiliency, efficiency and scaling that is not possible with a network of serial von Neumann implementations of Turing machines. This may well be a prescription for injecting the property of being aware of one’s multiple tasks and goals within a dynamic environment and of adapting behavior accordingly.

References:

ACM Ubiquity Symposium, (2011) http://ubiquity.acm.org/symposia.cfm

Barrett, L., (2011). Beyond the Brain: How Body and Environment Shape Animal and Human Minds. Princeton, New Jersey: Princeton University Press, p 116, 122

Cockshott, P., Michaelson, G., (2007). Are There New Models of Computation? Reply to Wegner and Eberbach. The Computer Journal, vol 50, no. 2, 232-247.

Damasio, A., (1999). The Feeling of What Happens: Body and Emotion in the Making of Consciousness. New York, NY: Harcourt & Company.

Damasio, A. (2010). Self Comes to Mind: Constructing the Conscious Brain. New York: Pantheon Books, p. 25 and p. 35.

Dyson, G. B., (1997). Darwin among the Machines: the evolution of global intelligence. Massachusetts: Helix books, p. 189.

Eberbach, E., Wegner, P., Goldin, D., (2011) Our Thesis: Turing Machines Do Not Model All Computations. (Private communication of an unpublished paper)

Goldin, D., Wegner, P., (2008). Refuting the Strong Church-Turing Thesis: the Interactive Nature of Computing, Minds and Machines, 18:1, March, pp. 17-38.

Lehrer, J., (2010) How We Decide. Boston, MA: Mariner Books, p. 50

Mikkilineni, R., (2011). Designing a New Class of Distributed Systems. New York,NY: Springer. (http://www.springer.com/computer/information+systems+and+applications/book/978-1-4614-1923-5)

Mikkilineni, R., Morana, G., Zito, D., Di Sano, M., (2011). Service Virtualization using a non-von Neumann Parallel, Distributed & Scalable Computing Model: Fault, Configuration, Accounting, Performance and Security Management of Distributed Transactions, (Preprint)

Mikkilineni, R., (2011). Service Virtualization using a non-von Neumann Computing Model, Server Design Summit (www.serverdesignsummit.com), San Jose, November 29. (A video of the presentation is available at http://www.kawaobjects.com/presentations/ServerDesignSummitVideo.wmv.)

Samad, T., Cofer, T., (2001). Autonomy and Automation: Trends, Technologies, In Gani, R., Jørgensen, S. B., (Ed.) Tools in European Symposium on Computer Aided Process Engineering volume 11, Amsterdam, Netherlands: Elsevier Science B. V., p. 10

Stanier, P., & Moore, G., (2006). The Relationship Between Genotype and Phenotype: Some Basic Concepts. In Ferretti, P., Copp, A., Tickle, C., & Moore, G., (Ed.), Embryos, Genes and Birth Defects, London: John Wiley, p. 5

Scott, A., (1995). The Controversial New Science of Consciousness: Stairway to the Mind. New York, NY: Copernicus, Springer-Verlag. p. 184.

“At the hierarchical level of human conscience it is not possible to report a consensus of the scientific community because there is none. Materialists, functionalists, and dualists are - according to a recent issue of the popular science magazine Omni (October 1993) - engaged in

Slinging mud and hitting low like politicians arguing about tax hikes. Although the epithets are more rarified - here it is “obscurantist” and “crypto-Cartesian” rather than “liberal” and “right wing” - recent exchanges between neuroscientists and philosophers of mind (and in each group among themselves) feature the same sort of relentless defensiveness and stark opinionated name calling we expect from irate congressmen or trash-talking linebackers.

To the extent that this is a true appraisal of the current status of consciousness, it is unfortunate. Like life, the phenomenon of consciousness is intimately related to several levels of the scientific hierarchy, so the appropriate scientists - cytologists, electrophysiologists, neuroscientists, anesthesiologists, sociologists and ethnologists - should be working together. It is difficult to see how this elusive phenomenon might otherwise be understood.”

Davies, P., (1992). The Mind of God: The Scientific Basis for a Rational World. New York, NY: Simon and Schuster.

von Neumann, J., (1966). Theory of Self-Reproducing Automata. Burks, A. W. (Ed.) Chicago, Illinois. University of Illinois Press.

von Neumann, J., (1987). Papers of John von Neumann on Computing and Computing Theory, Hixon Symposium, September 20, 1948, Pasadena, CA, The MIT Press, p454, p457

Wegner, P., Eberbach, E., (2004). New Models of Computation. The Computer Journal, vol 47, No. 1, 4-9.

Wegner, P., Goldin, D., (2003). Computation beyond Turing Machines: Seeking appropriate methods to model computing and human thought. Communications of the ACM, Vol. 46, No. 4, pp. 100

Path to Self-managing Services: A Case for Deploying Managed Intelligent Services Using Dumb Infrastructure in a Stupid Network

February 2, 2012 - One Response

“WETICE 2012 Convergence of Distributed Clouds, Grids and their Management Conference Track is devoted to transform current labor intensive, software/shelf-ware-heavy, and knowledge-professional-services dependent IT management into self-configuring, self-monitoring, self-protecting, self-healing and self-optimizing distributed workflow implementations with end-to-end service management by facilitating the development of a Unified Theory of Computing.”

“In recent history, the basis of telephone company value has been the sharing of scarce resources — wires, switches, etc. – to create premium-priced services. Over the last few years, glass fibers have gotten clearer, lasers are faster and cheaper, and processors have become many orders of magnitude more capable and available. In other words, the scarcity assumption has disappeared, which poses a challenge to the telcos’ “Intelligent Network” model. A new type of open, flexible communications infrastructure, the “Stupid Network,” is poised to deliver increased user control, more innovation, and greater value.”

                     —–Isenberg, D. S., (1998). “The dawn of the stupid network”. ACM netWorker 2, 1, 24-31.

Much has changed since the late ’90s, and the changes have driven the Telcos to essentially abandon their drive for supremacy in the intelligent services creation, delivery and assurance business and to take a back seat in the information services market, managing the ‘stupid network’ that merely carries the information services.  You have only to look at the demise of major R&D companies such as AT&T Bell Labs, Lucent, Nortel and Alcatel, and the rise of a new generation of services platforms from Apple, Amazon, Google, Facebook, Twitter, Oracle and Microsoft, to notice the sea change that has occurred in a short span of time. The data center has replaced the central office to become the hub from which myriad voice, video and data services are created and delivered on a global scale. However, the management of these services, which determines their resiliency, efficiency and scaling, is another matter.

While the data center’s value has been the sharing of expensive resources (processor speed, memory, network bandwidth, storage capacity, throughput and IOPs) to create premium-priced services, the complexity of the infrastructure and its management has exploded over the last couple of decades. It is estimated that up to 70% of the total IT budget now goes to the management of infrastructure rather than to the development of new services (www.serverdesignsummit.com). It is important to define which TCO (total cost of ownership) we are talking about here, because TCO is often used to justify different solutions. Figure 1 shows three different TCO views of a data center presented by three different speakers at the Server Design Summit in November 2011.  Each graph, while accurate, represents a different view. For example, the first view represents the server infrastructure and its management cost. The second represents the power infrastructure and its management. The third shows both the server infrastructure and power management. As you can see, the total cost of power and its management, while steadily increasing, is only a small fraction of the total infrastructure management cost.  In addition, these views do not even show the network and storage infrastructure and their management. It is also interesting to see the explosion of management cost shown in figure 3 over the last two decades. Automation has certainly improved the number of servers that can be managed by a single person by orders of magnitude. This is borne out by the labor cost in the left picture by Intel, which shows it is about 13% of the TCO from the server viewpoint. But this does not tell the whole story.

Figure 1: Three different views of Data center TCO presented in the Server Design Summit conference in November 2011 (http://www.serverdesignsummit.com/English/Conference/Proceedings_Chrono.html). These views do not touch the storage, network and application/service management costs both in terms of software systems and labor.

A more revealing picture can be obtained by using the TCO calculator by one of the Virtualization infrastructure vendors. Figure 2 shows percentage Total Cost of Ownership (TCO) (for a 1500 server data center) over five years by each component with and without virtualization.

Figure 2: Five Year TCO of Virtualization According to a Vendor ROI Calculator. While virtualization reduces the TCO from 35% to 25%, it is almost offset by the software, services and training costs.

While virtualization introduces many benefits such as consolidation, multi-tenancy in a physical server, real-time business continuity and elastic scaling of resources to meet wildly fluctuating workloads, it adds another layer of management systems in addition to current computing, network, storage and application management systems. Figure 3 shows a reduction by 50% of the five-year TCO with virtualization. The Virtual Machine density of about 13 allows a great saving in hardware costs which is somewhat off-set by the new software, training and services costs of virtualization.

Figure 3: TCO over 5 Years with virtualization of 1500 servers using 13 VMs per Server. While the infrastructure and administration costs drop, it is almost offset by the software, services and training costs.
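The hardware side of this consolidation is simple arithmetic and can be checked directly. The sketch below assumes, purely for illustration, that each of the 1500 original servers becomes exactly one VM and that every host is filled to the quoted density of 13; the function name `physical_hosts_needed` is hypothetical, and no cost figures are implied.

```python
import math


def physical_hosts_needed(workloads: int, vms_per_host: int) -> int:
    """Number of physical hosts required when each workload runs in one VM."""
    if workloads < 0 or vms_per_host <= 0:
        raise ValueError("workloads must be >= 0 and vms_per_host must be > 0")
    return math.ceil(workloads / vms_per_host)


servers_before = 1500          # size of the data center in the vendor calculator
density = 13                   # VMs per physical server quoted above

hosts_after = physical_hosts_needed(servers_before, density)
host_reduction_pct = round(100 * (1 - hosts_after / servers_before), 1)

print(hosts_after)             # 116
print(host_reduction_pct)      # 92.3
```

Under these assumptions the physical server count drops by roughly 92%, so a five-year TCO reduction well below that figure is consistent with the point made above: new software, training and services costs offset a large share of the hardware saving.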

In addition, there is the cost of new complexity in optimizing the 13 or so VMs within each server in order to match the resources (network bandwidth, storage capacity, IOPs and throughput) to application workload characteristics, business priorities and latency constraints. According to storage consultant Jon Toigo, “Consumers need to drive vendors to deliver what they really need, and not what the vendors want to sell them. They need to break with the old ways of architecting storage infrastructure and of purchasing the wrong gear to store their bits: Deploying a “SAN” populated with lots of stovepipe arrays and fabric switches that deliver less than 15% of optimal efficiency per port is a waste of money that bodes ill for companies in the areas of compliance, continuity, and green IT.”

Resource management based data center operations miss an important feature of services/applications management, which is that all services are not created equal. They have different latency and throughput requirements. They have different business priorities and different workload characteristics and fluctuations. What works for the goose does not work for the gander. Figure 4 shows a classification of different services based on their throughput and latency requirements, presented by Dell at the Server Design Summit. The applications are characterized by their need for throughput, latency and storage capacity. In order to take advantage of the differing priorities and characteristics of the applications, additional layers of services management are introduced which focus on service specific resource management. Various appliance or software based solutions are added to the already complex resource management suites that address server, network and storage to provide service specific optimization. While this approach is well suited for making recurring revenues for vendors, it is not ideally suited for customers to lower the final TCO when all piece-wise TCOs are added up. Over a period of time, most of these appliances and software end up as shelf-ware while the vendors tout more new TCO reducing solutions. For example, a well known solution vendor makes more annual revenue from maintenance and upgrades than from new products or services that help their customers really reduce the TCO.

Figure 4: Various services/applications characterized by their throughput and latency requirements. The current resource management based data center does not optimally exploit the resources based on application/service priority, workload variations and latency constraints. It is easy to see the inefficiency in deploying a “one size fits all” infrastructure. It will be more efficient to tailor “dumb” infrastructure and “Stupid Network” pools specialized to cater to different latency and throughput characteristics and let intelligent services provision themselves with the right resources based on their own business priorities, workload characteristics and latency constraints. This requires the visibility and control of service specification, management and execution available at run time, which necessitates a search for new computing models.

In addition to the current complexity and cost of resource management to assure service availability, reliability, performance and security, there is an even more fundamental issue that plagues the current distributed systems architecture. A distributed transaction that spans multiple servers, networks and storage devices in multiple geographies uses resources that span multiple data centers. The fault, configuration, accounting, performance and security (FCAPS) management of distributed transaction behavior requires end-to-end connection management, more like a telecommunication service spanning distributed resources. Therefore, focusing only on resource management in a data center, without the visibility and control of all resources participating in the transaction, will not provide assurance of service availability, reliability, performance and security.

Distributed transactions transcend the current stored program control implementation of the Turing machine which is at the heart of the atomic computing element in current computing infrastructure.  The communication and control are not an integral part of this atomic computing unit in the stored program control implementation of the Turing machine. The distributed transactions require interaction which integrates computing, control and communication to provide the ability to specify and execute highly temporal and hierarchical event flows. According to Goldin and Wegner, “Interactive computation is inherently concurrent, where the computation of interacting agents or processes proceeds in parallel. Hoare, Milner and other founders of concurrency theory have long realized that Turing Machines (TM) do not model all of computation (Wegner and Goldin, 2003). However, when their theory of concurrent systems was first developed in the late ’70s, it was premature to openly challenge TMs as a complete model of computation. Their theory positions interaction as orthogonal to computation, rather than a part of it. By separating interaction from computation, the question whether the models for CCS and the Pi-calculus went beyond Turing Machines and algorithms was avoided. The resulting divide between the theory of computation and concurrency theory runs very deep. The theory of computation views computation as a closed-box transformation of inputs to outputs, completely captured by Turing Machines. By contrast, concurrency theory focuses on the communication aspect of computing systems, which is not captured by Turing Machines – referring both to the communication between computing components in a system, and the communication between the computing system and its environment. As a result of this division of labor, there has been little in common between these fields and their communities of researchers. According to Papadimitriou (Papadimitriou, 1995), such a disconnect within the theory community is a sign of a crisis and a need for a Kuhnian paradigm shift in our discipline.”

Kuhnian paradigm shift or not, a new computing model called the DIME computing model (discussed in WETICE2010) provides a convergence of these two disciplines by addressing the computing and the communications in a single computing entity that is a managed Turing machine. The DIME network architecture provides a fractal (recursive) composition scheme to create an FCAPS managed network of DIMEs implementing business workflows as DAGs supporting both hierarchical and temporal event flows. The DIME computing model supports only those computations that can be specified as managed DAGs, where a management signaling network overlay allows execution of managed computing tasks (executed by a computing unit called MICE) in each Turing machine node that is endowed with self-management using parallel computing threads. The MICE (see the video referenced in this blog for a description of DIME and its use in distributed computing and its management) constitutes the atomic Turing machine that is controlled by the FCAPS manager in a DIME, which allows configuring, executing and managing the MICE to load and execute a well specified computing workflow and its FCAPS management. The MICE, under parallel real-time control of the DIME FCAPS manager aided by a signaling network overlay, provides control over the start, stop, read and write abstractions of the Turing machine. Two implementations provide an existence proof for the DIME network architecture.
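The idea of a workflow specified as a managed DAG can be illustrated with a small Python sketch. It is not the DIME signaling protocol: the function `run_managed_dag`, the `may_start` callback standing in for the FCAPS manager, and the task names are assumptions made for this example. It relies on `graphlib.TopologicalSorter` from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_managed_dag(
    tasks: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Set[str]],
    may_start: Callable[[str], bool],
) -> List[str]:
    """Run tasks in dependency order, asking a manager before starting each one.

    A task the manager vetoes is skipped, and so is every task downstream of it.
    """
    executed: List[str] = []
    skipped: Set[str] = set()
    # static_order() yields each node after all of its predecessors.
    for name in TopologicalSorter(depends_on).static_order():
        blocked = any(dep in skipped for dep in depends_on.get(name, set()))
        if blocked or not may_start(name):
            skipped.add(name)
            continue
        tasks[name]()
        executed.append(name)
    return executed


log: List[str] = []
tasks = {name: (lambda n=name: log.append(n)) for name in ("dns", "apache", "mysql", "php")}
depends_on = {"dns": set(), "mysql": set(), "apache": {"dns"}, "php": {"apache", "mysql"}}

# Hypothetical management decision: the manager refuses to start "mysql".
executed = run_managed_dag(tasks, depends_on, may_start=lambda name: name != "mysql")
print(executed)  # ['dns', 'apache']
```

Because the management decision is taken outside the tasks themselves, the same workflow definition can be re-run under a different policy without touching the task code.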

Figure 5 shows a DIME network implementing Linux, Apache, MySQL and PHP/Perl/Python web services delivery and assurance infrastructure.

Figure 5: The GUI showing the configuration of a LAMP Cloud (Mikkilineni, Morana, Zito, Di Sano, 2012). Each Apache and DNS instance is DIME aware and runs in a DIME-aware Linux operating system, which transforms a process into a managed element in the DIME network. A video describes the implementation of auto-failover, auto-scaling and performance management of the DIME-aware LAMP cloud.

Look Ma! No Hypervisor or VM in My Cloud (See Video)

The prototype implementations demonstrate a side effect of the DIME network architecture, which combines the computing and communication abstractions at an atomic level: it decouples the services management from the underlying hardware infrastructure management. This makes it possible to implement highly resilient distributed transactions with auto-scaling, self-repair, state-aware migration and self-protection (in short, end-to-end transaction FCAPS management) based on business priorities, workload fluctuations and latency constraints.  No Hypervisors or VMs are required. The intelligent management of services workflow with resilient distributed transactions offers a new architecture for the data center infrastructure. For the first time it will be possible to stop embedding service management in the infrastructure management intelligence using myriad expensive appliances and software systems. It will be possible to design new tiers of dumb infrastructure pools (of servers, storage and network devices) with different latency and throughput characteristics, and the services will be able to manage themselves based on policies by requesting appropriate resources based on their specifications. They will be able to self-migrate when quality of service levels are not met. The case for dumb infrastructure on a stupid network with intelligent services management puts forth the following advantages:

  1. Separation of concerns: The network, storage and server hardware provides hardware infrastructure management with signaling enabled FCAPS management. They do not encapsulate service management as the current generation equipment does.
  2. Specialization: The hardware is designed to meet specific latency and throughput characteristics to simplify its design through specialization. Different hardware with FCAPS management and signaling will provide plug and play components at run time.
  3. End-to-end service connection FCAPS management using the signaling network overlay allows dynamic service FCAPS management facilitating self-repair, auto-scaling, self-protection, state-aware migration and end to end transaction security assurance.
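How a service might pick its own infrastructure tier from a set of specialized "dumb" pools can be sketched as follows. Everything here is an assumption for illustration: the `Pool` and `ServiceSpec` types, the pool names, and all latency, throughput and cost numbers are invented, and the selection rule (cheapest pool that satisfies the specification) is just one plausible policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Pool:
    """A specialized infrastructure pool advertising what it can deliver."""
    name: str
    latency_ms: float
    throughput_mbps: float
    cost_per_hour: float


@dataclass(frozen=True)
class ServiceSpec:
    """What a service requires, taken from its own specification."""
    name: str
    max_latency_ms: float
    min_throughput_mbps: float


def choose_pool(spec: ServiceSpec, pools: List[Pool]) -> Optional[Pool]:
    """Return the cheapest pool meeting the latency and throughput needs, or None."""
    candidates = [
        p for p in pools
        if p.latency_ms <= spec.max_latency_ms
        and p.throughput_mbps >= spec.min_throughput_mbps
    ]
    return min(candidates, key=lambda p: p.cost_per_hour, default=None)


# Illustrative pools; the numbers are made up for the example.
pools = [
    Pool("flash", latency_ms=0.5, throughput_mbps=2000, cost_per_hour=4.0),
    Pool("sas", latency_ms=5.0, throughput_mbps=800, cost_per_hour=1.0),
    Pool("archive", latency_ms=50.0, throughput_mbps=200, cost_per_hour=0.2),
]

oltp = ServiceSpec("oltp", max_latency_ms=1.0, min_throughput_mbps=500)
backup = ServiceSpec("backup", max_latency_ms=100.0, min_throughput_mbps=100)

print(choose_pool(oltp, pools).name)    # flash
print(choose_pool(backup, pools).name)  # archive
```

A service whose quality-of-service levels are no longer met could call the same function again with the current pool removed from the list, which is one simple way to picture the self-migration mentioned above.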

Figure 6 shows an example design of a possible storage device using a simple storage architecture enabled with FCAPS management over a signaling overlay. It can be easily built with commercial off-the-shelf (COTS) hardware. This design allows separation of the services management from storage device management and eliminates a host of storage software management systems, thus simplifying the data center infrastructure.

Figure 6: A gedanken design of autonomic storage and autonomic storage service deployment using the new DIME network architecture. The signaling overlay and FCAPS management are used to provide dynamic service management. Each service can request, using standard Linux OS services during run time, services from the storage device based on business priorities, workload fluctuations and latency constraints.

It is easy to see that the service connection model eliminates the need for clustering and provides new ways to provide transaction resilience with features such as service call forwarding, service call waiting, data broadcast, the 800 service call model, etc. It is equally easy to see how, with many-core servers, the DIME network architecture eliminates the inefficiencies of communication between Linux images within the same container (e.g., TCP/IP), and how simple SAS storage and Flash storage can replace current generation appliance based storage strategies and their myriad management systems. Looking at the trends, it is easy to see that a paradigm shift will soon be in play to transform the data centers from their current role of being just managed server, networking and storage hosting centers (whether physical or virtual) to true service switching centers with telecom grade trust. The emphasis will shift from resource switching and resource connection management to services switching and service connection management, thus replacing the current efforts to replicate the complexity of today’s data center inside the many-core servers as well. With the resulting decoupling of services management from the infrastructure management, the next generation data centers will perhaps be more like central offices of the old Telcos, switching service connections.

Obviously the new computing model is in its infancy and requires participation from academicians who can validate or reject its theoretical foundation, from VCs who can see beyond current approaches and are not satisfied with measuring data center efficiency by how many servers a single administrator can manage (as one Silicon Valley VC claimed as progress at the Server Design Summit), and from architects who exploit new paradigms to disrupt the status quo. The DIME computing model, by allowing Linux processes to be converted into a DIME network transcending physical boundaries, allows easy migration from the current infrastructure to the new one without abandoning legacy applications, as the prototype of the LAMP cloud demonstrates.

In closing, I would like to point out that there have been many calls for a new computing model that combines computing and communication at the level of an atomic computing element, where the Turing machine falls short as discussed above. Without high bandwidth communication and exploitation of the parallelism that is abundant in the new generation hardware, however, such new computing models were of little practical use. It now seems that the hardware advances have outpaced the software advances, and perhaps it is about time for computer scientists to take a serious second look at addressing the software shortfall in dealing with distributed transactions. As the following fable illustrates, it may be futile to look for parallel breakthrough solutions in a serial boat.

“When Master Foo and his student Nubi journeyed among the sacred sites, it was the Master’s custom in the evenings to offer public instruction to UNIX neophytes of the towns and villages in which they stopped for the night.  On one such occasion, a methodologist was among those who gathered to listen.  “If you do not repeatedly profile your code for hot spots while tuning, you will be like a fisherman who casts his net in an empty lake,” said Master Foo.
“Is it not, then, also true,” said the methodology consultant, “that if you do not continually measure your productivity while managing resources, you will be like a fisherman who casts his net in an empty lake?”
“I once came upon a fisherman who just at that moment let his net fall in the lake on which his boat was floating,” said Master Foo. “He scrabbled around in the bottom of his boat for quite a while looking for it.”  “But,” said the methodologist, “if he had dropped his net in the lake, why was he looking in the boat?”  “Because he could not swim,” replied Master Foo.
Upon hearing this, the methodologist was enlightened”        — Master Foo and the Methodologist
                                                                   (http://www.catb.org/esr/writings/unix-koans/methodology-consultant.html)

If you have transformational research results, or want to make a real difference in computer science research, see Call for Papers at:

www.workshop.kawaobjects.com and http://WETICE.org

Will Virtual Machine Technology Go the Way of COBOL Programming, Frame Relays and Asynchronous Transfer Mode Switching into Oblivion?

December 21, 2011 - Leave a Response

“WETICE 2012 Convergence of Distributed Clouds, Grids and their Management Conference Track is devoted to transform current labor intensive, software/shelf-ware-heavy, and knowledge-professional-services dependent IT management into self-configuring, self-monitoring, self-protecting, self-healing and self-optimizing distributed workflow implementations with end-to-end service management by facilitating the development of a Unified Theory of Computing.”

“Reliability engineering is a bit like alchemy. The field swirls with competing schools of thought. Profound arguments erupt over obscure issues, and there is little consensus on how to proceed even to the extent that we know how to solve many of the hard problems.” (K. P. Birman, Reliable Distributed Systems: Technologies, Web Services, and Applications (Springer, NY, 2005), p. xix).

With the calls for no less than a Kuhnian paradigm shift in computer science and ensuing controversy over computing models and Turing machines, the current state of the art and science of distributed systems discipline is under scrutiny. With about 70% of Information Technology budget being spent on managing the complexity of distributed systems in an enterprise, the pressure is mounting to seek alternatives.

Here is some food for thought…

Evolution, it seems, assists systems to strive constantly to improve productivity and optimize their chance of survival. Most of the time, productivity improvements are incremental.  Occasionally, however, orders of magnitude improvements occur, often preceded by a major disruption in the status quo. When the dust settles, things are never the same as before. In the last sixty years, information technology has seen such changes routinely. I choose COBOL programming, Frame Relays and Asynchronous Transfer Mode Switching (ATM) as examples that demonstrate the changes in computer programming, networking bandwidth and information services switching technologies that have shaped communication, collaboration and commerce on a global scale. In the late ’90s, with billions of lines of entrenched code, COBOL programming and programmers were considered invincible. Today, very few new lines of code are written in COBOL, although some companies still make money on legacy maintenance. The Frame Relay technology provided high bandwidth communication for businesses and made the Telcos reign supreme for a while. The introduction of SONET, optical broadband networks and high-speed Ethernet drastically altered the landscape, initiating the demise of Frame Relay technology. Similarly, the ATM switching technology was going to alter the telecommunication world, but along the way came the Internet and IP technology, which wiped out a host of high-flying companies in the ’90s.

Today, Virtualization technologies and the Virtual Machine (VM) infrastructure are attempting to create a new wave that is altering how we develop, deploy and operate software based services. With the increase in performance of computing processors, and broadband ubiquity, it is now possible to deploy multiple Operating Systems (OS) in a physical container thus increasing the granularity of multi-tenancy (with multiple Virtual Servers contained in a physical server) of end users with fault, configuration, accounting, performance and security (FCAPS) management isolation for each tenant. The virtualization of the physical server abstraction allows auto-scaling, self-repair, live migration and cloud based deployment of services bringing the advantages of scale and provisioning flexibility.

On the other hand, this introduces a new layer of complexity by bringing networking and storage management to the virtual server inside a physical server. The introduction of many cores in a processor creates a hierarchy of networks with different bandwidths that now needs to be managed to provide end-to-end transaction FCAPS management. Figure 1 shows the hierarchy of networks.

Figure 1: The hierarchy of networks with different bandwidths requires new management strategies for assuring end-to-end transaction availability, reliability, performance and security.

The VM based service architecture follows current ad-hoc practices of distributed computing. The current discipline of distributed systems, with its server-centric and bandwidth-limited origins, is not ideally suited to exploit the parallelism and performance of many-core processors and hierarchies of networks with different bandwidths, or to cope with rapidly changing business priorities, workload fluctuations and latency constraints. The lack of resiliency in distributed systems affects the availability, reliability, performance and security of end-to-end distributed transactions, where changes in one component influence other components. It traces back to the fundamental computing model that underlies computer science and is the basis for the current implementation of IT infrastructure: the von Neumann Stored Program Control (SPC) implementation of the Turing machine.

The Turing machine is an atomic computing unit. According to the Church-Turing thesis, “A function is effective computable by a physical system iff it is Turing machine (TM) compatible.” The action of a Turing machine is determined completely by (1) the current state of the machine, (2) the symbol in the cell currently being scanned by the head and (3) a table of transition rules, which serves as the “program” for the machine. The von Neumann SPC implementation forms the basis of the computing infrastructure as we know it. Since the first von Neumann implementation, the computing infrastructure has undergone many changes that may or may not be limited by the Turing model. The Turing machine, according to Wegner et al. and Eberbach et al., is a closed world model; dynamic changes to the world outside the TM that occur during the computation have no bearing on the computation itself. The TM’s computation is completely determined by its input, which has to be defined in advance; it cannot be altered from the outside once the computation begins. TMs compute functions from these inputs to some output; once the output’s value is determined, the computation stops. Current implementations of operating system controlled, process based computations already breach this simple Turing model. Wegner et al. argue that the model does not capture current services and computer applications that work with unknown or dynamic inputs, or programs like operating systems and database servers that by design never terminate. This is a topic of controversy (Cockshott et al.) and a fertile ground for research on computation models.
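To make the three ingredients concrete, here is a minimal sketch of a Turing machine simulator. The function name, the rule table format and the bit-inverting example machine are illustrative assumptions, not taken from any of the cited works. Note how the whole tape is fixed before the run starts and nothing from the outside can alter it until the machine halts, which is the closed world property discussed above.

```python
from typing import Dict, Tuple

BLANK = "_"

# Transition table: (state, scanned symbol) -> (next state, symbol to write, head move)
Rules = Dict[Tuple[str, str], Tuple[str, str, int]]

# Hypothetical example machine: flip every bit, halt at the first blank cell.
INVERT_RULES: Rules = {
    ("q0", "0"): ("q0", "1", +1),
    ("q0", "1"): ("q0", "0", +1),
    ("q0", BLANK): ("halt", BLANK, 0),
}


def run_turing_machine(
    rules: Rules,
    tape_input: str,
    start: str = "q0",
    halt: str = "halt",
    max_steps: int = 10_000,
) -> str:
    """Run the machine on a tape that is fully defined before the run begins."""
    tape = dict(enumerate(tape_input))  # sparse tape: cell index -> symbol
    state, head, steps = start, 0, 0
    while state != halt:
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted before halting")
        symbol = tape.get(head, BLANK)               # (2) the scanned symbol
        state, write, move = rules[(state, symbol)]  # (1) state + (3) rule table
        tape[head] = write
        head += move
        steps += 1
    cells = "".join(tape[i] for i in sorted(tape))
    return cells.strip(BLANK)


if __name__ == "__main__":
    print(run_turing_machine(INVERT_RULES, "1011"))  # prints 0100
```

The step budget is only a practical guard for the sketch; a real Turing machine has no such limit.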

Be that as it may, the operating system allows multiple Turing machines (to be precise, von Neumann Stored Program Control (SPC) implementations executing processes or threads) to share the available resources (memory, CPU etc.) based on policies determined external to the computations carried out by the TMs. The physical server with an operating system therefore offers signaling and management services that let an application compose an orchestrated workflow as a directed acyclic graph (DAG). The operating system services manage the application processes/threads using signaling (alerting, addressing, supervision and mediation) and FCAPS management at run time. They instantiate, load and orchestrate a set of Turing machines to execute a computational workflow defined by the application, alongside other applications, based on external policies. However, the computational workflow defined by the application and the management workflow offered by the operating system are both implemented by atomic Turing machine computations, which are static and closed as discussed above; the workflows have to be predefined at compile time. In spite of this restriction, the flexibility of implementing a computational workflow as a DAG has allowed implementing business workflows where changes in one part of the workflow can influence other parts of the workflow.
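As an illustration of a workflow expressed as a DAG, the following sketch runs a set of tasks in dependency order using Python's standard `graphlib` module (Python 3.9 or later). The task names and the `run_workflow` helper are hypothetical; a real operating system schedules processes and threads with far more machinery. The point is only that the graph is fixed before execution starts and that each node sees the results of its predecessors.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Callable, Dict, List, Sequence, Tuple


def run_workflow(
    dependencies: Dict[str, Sequence[str]],
    tasks: Dict[str, Callable[..., object]],
) -> Tuple[List[str], Dict[str, object]]:
    """Execute tasks in an order that respects the DAG.

    dependencies maps each task to the tasks it depends on.
    Raises graphlib.CycleError if the graph is not acyclic.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results: Dict[str, object] = {}
    for name in order:
        # Feed each task the results of its predecessors, in declared order.
        inputs = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = tasks[name](*inputs)
    return order, results


if __name__ == "__main__":
    # Hypothetical four-step workflow: fetch -> parse -> store -> report.
    deps = {
        "parse": ("fetch",),
        "store": ("parse",),
        "report": ("parse", "store"),
    }
    steps = {
        "fetch": lambda: "raw",
        "parse": lambda raw: raw.upper(),
        "store": lambda parsed: len(parsed),
        "report": lambda parsed, stored: f"{parsed}:{stored}",
    }
    order, results = run_workflow(deps, steps)
    print(order)              # ['fetch', 'parse', 'store', 'report']
    print(results["report"])  # RAW:3
```

Because the dependency map is passed in whole, the workflow here is predefined before it runs, mirroring the compile-time restriction described above.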

While the operating system provides process management in a physical server, the operating system and the physical server itself have evolved to be managed by other applications to provide FCAPS management and optimize availability, reliability, accounting, performance and security management of the system as a whole. It is important to note that layers of management are required, albeit hardcoded at compile time of these services, because Gödel’s theorem prohibits self-reflection and the TM does not allow dynamic influence from external agents while computation is in progress. The operating system circumvents the self-reflection prohibition by using managed reflection external to the computing units. As von Neumann mentions (von Neumann J (1966) The theory of self-reproducing automata (edited and completed by AW Burks) University of Illinois Press, Urbana, IL. pp. 47, 51-56) “I am a little twisting a logical theorem, but it’s perfectly good logical theorem. It’s a theorem of Gödel that the next logical step, the description of an object, is one class type higher than the object and is therefore asymptotically longer to describe.” In a reply to a question from Burks regarding this comment, Gödel wrote that the theorem mentioned should be “the fact that a complete epistemological description of a language A cannot be given in the same language A.”

Virtual machine technology introduces another layer of management to allow multiple operating systems to share the same resources in a physical container, thus allowing more flexibility in sharing the resources. This flexibility comes at the price of complexity. Figure 2 shows the current state of affairs in a data center, where multiple layers of management are involved in implementing the computing workflows across distributed computing systems. While this architecture successfully executes distributed managed workflows, the resiliency of end-to-end transaction management suffers from the ad-hoc distributed system architectures that have evolved over time, constrained by the serial and closed nature of the atomic computing unit, the Turing machine discussed above. As von Neumann pointed out in his lectures given at the Hixon Symposium, “the basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately.”

Comparing computing machines and living organisms, he points out that computing machines are not as fault tolerant as living organisms. He goes on to say “It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond.” The IT infrastructure lacks the architectural resiliency of cellular organisms, and this shows up as complexity that contributes to 70% of the total IT budget in an organization going into self-maintenance, leaving little room for new services development and deployment. While automation of service management has attempted to reduce labor costs, the labor is often replaced by expensive software, shelf-ware and very expensive knowledge professionals devoted to keeping the systems operational by isolating, diagnosing and fixing problems when they occur.

Figure 2: A computing workflow and operating system based management workflow instructions are executed serially by a Turing machine while other management workflows control services that span across multiple physical devices connected by various networks.

Statistics dictate that problems become visible more frequently in a data center as the number of components grows to meet the ever-increasing demand for communication, collaboration and commerce at the speed of light on a global scale. It is estimated that one company alone, Google, will need to manage 10 million servers in the future to meet its business needs. End-to-end transaction resiliency (FCAPS) management in such cases becomes extremely critical when service transaction times, service failure times and service repair times are of the same order of magnitude.

Figure 3 shows the worldwide data center spending breakdown presented by IDC at the Server Design Summit (www.ServerDesignSummit.com). The breakdown shows power and cooling costs, management costs and server costs.

The key point to note is that, while server costs are going down with an accompanying performance increase, management costs now dominate data center expenses. Power and cooling expenses are also increasing (some estimates put them at up to 30% of total data center operational expense), but they are now being addressed by redesigning data centers and migrating to a new generation of many-core processors and servers designed to drastically reduce power consumption. That leaves management costs as the next major productivity improvement opportunity. There are two components to this management cost:

  1. Hardware infrastructure resiliency (operation & management) assurance cost and
  2. End-to-end service transaction resiliency spanning across multiple devices (operation & management) assurance cost.

Figure 3: Breakdown of Costs in a datacenter

While the reliability and manageability of the hardware infrastructure (servers, routers and storage devices) have drastically improved over the past few decades, a new degree of complexity was introduced to address the reliability of distributed transactions that span multiple hardware devices. Clustering; server, storage and network resource provisioning, monitoring and optimization appliances; and software systems managing application reliability, availability, performance, security, disaster recovery etc. have all contributed to the drastic increase in services management cost. The root cause of the resulting complexity is the ad-hoc nature of distributed systems management and end-to-end transaction service resiliency assurance. As mentioned earlier, the operating systems in individual servers (or virtual servers) do not have visibility and control of end-to-end transaction resilience, and their role of resource management to assure service resilience has been usurped by myriad ad-hoc layers of management. There is no operating system that provides end-to-end resource management of distributed resources. In a distributed system, the computation workflow utilizing the resources and the resource management workflow providing the resources interact with each other to allow for implementing end-to-end visibility and control to meet changing business priorities, workload fluctuations and latency constraints. It is as if an end-to-end distributed resource manager (the observer) is observing the dynamic service workflows and controlling the resources to accomplish the goals of the overall service workflow (the observed).

There are four key factors contributing to the lack of resiliency (or contributing to the high cost of resiliency) in distributed systems:

  1. When multiple computations compete for resources, management that resolves contention for resources based on global policies external to the individual computations is essential. Operating systems provide this function by executing workflows that manage the local resources and allocate them to appropriate services based on external policies. While operating systems provide this function, implementing both signaling and FCAPS management services in the physical device executing the computations, they cannot control resources across multiple devices.
  2. A lack of rigorous computing models to define a composition scheme that defines a distributed computing unit made up of individual computing units (defining a larger “self” or the observer/manager consisting of a group of distributed resources under its control) to appropriately define management hierarchies to implement managed visibility, reflection, and control based on external policies.
  3. A lack of dynamic representation of both the observer/manager and the observed/managed and their interactions in a distributed system which has led to ad-hoc implementations contributing to the complexity (of orchestrators orchestrating managers managing agents implementing workflows using TMs) and
  4. The propensity of system vendors to sell maintenance intensive products to increase their profit margins through professional services and more software to maintain their products. For example, a particular large enterprise software vendor’s annual maintenance revenue is orders of magnitude higher than their new product revenue.

As the cost of maintenance grows, the search is on, for orders of magnitude improvement with a paradigm shift. The advent of many-core servers with parallelism at the core, performance that scales with the number of cores in a processor, and the hierarchy of high bandwidth networks mentioned earlier provides a new opportunity to look at computing models, distributed systems and their management. One such attempt is the Distributed Intelligent Managed Element (DIME) computing model. Derived from the observations of resilient, efficient and highly scalable distributed systems such as cellular organisms and human network organization, the DIME computing model exploits both the parallelism and high bandwidth to define a new distributed system architecture that integrates the computing workflows and the management workflows by introducing the management of a TM and a parallel signaling network overlay for FCAPS management of a group of managed TMs. The DIME computing model is based on the observation that all computations and workflows can be specified as a directed acyclic graph (DAG). Vertices in a DAG often have a natural ordering – for example, vertices may represent events ordered in time or ordered by hierarchy. The ordering makes the results and algorithms for DAGs relatively simple. DIME network computing executes managed directed acyclic graphs using a set of managed von Neumann Turing machines organized as a network of computing elements with an overlay of signaling enabled control workflow. The DIME computing model allows the specification and execution of a recursive composition model where each computing unit at any level specifies and controls the execution of the DAG at the lower level. The hierarchical composition of “self” representing the manager and the dynamic representation of the observer and the observed allows precise specification of the managed DAG and avoids the thorny problem of halting associated with Turing machines.  
Perhaps DIME networks parallel embryology in biological systems, unlike the genetic computing models, which embrace mutation. Only theorists can tell.

Whether it is theoretically possible or not, and whether or not it is blessed by academicians and venture capitalists, Figure 4 shows an implementation of the DIME computing model that encapsulates a Linux process as an FCAPS managed DIME, with a network of DIMEs providing LAMP (Linux, Apache, MySQL and PHP/Perl/Python) based web services.

The prototype was implemented without using hypervisors, VM infrastructure or a plethora of interfaces to various management systems. Each DIME uses parallel threads to control the computing unit, called the Managed Intelligent Computing Element (MICE). Each DIME supports a signaling overlay that allows monitoring and control of each DIME’s FCAPS parameters in a distributed DIME network. A novel feature of the DIME network architecture is the decoupling of services management in implementing workflows from the underlying hardware infrastructure management. The services monitor and manage their own response times and other FCAPS parameters using the local operating system, and if these parameters fall outside the watermarks set by system-wide policies, the workflow is reconfigured. This may involve moving the affected service components to where the required infrastructure services are available. This allows the infrastructure providers and service providers to negotiate appropriate service levels a priori and establish policies on how to manage fluctuations.
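The watermark idea can be sketched in a few lines. This is not the prototype's code; the `Watermarks` class, the `decide` function, the threshold values and the action names are all assumptions chosen for illustration. A service compares its own measured response times with limits set by a system-wide policy and reports whether the workflow should be reconfigured.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Watermarks:
    """Policy limits on mean response time, in milliseconds (hypothetical)."""
    low_ms: float
    high_ms: float


def decide(samples_ms: Sequence[float], marks: Watermarks) -> str:
    """Return a reconfiguration decision for a non-empty window of samples."""
    average = mean(samples_ms)
    if average > marks.high_ms:
        # Too slow: add capacity or move the component to better infrastructure.
        return "reconfigure"
    if average < marks.low_ms:
        # Far below the low watermark: resources can be released.
        return "release"
    return "steady"


if __name__ == "__main__":
    policy = Watermarks(low_ms=20.0, high_ms=150.0)
    print(decide([120.0, 180.0, 200.0], policy))  # reconfigure
    print(decide([50.0, 60.0], policy))           # steady
```

Keeping the policy object separate from the measurement reflects the point made above: the limits are negotiated a priori, while the service itself does the monitoring.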

Figure 4: The DIME network architecture implementing LAMP services with auto-scaling, self-repair and live-migration without using VM infrastructure.

In addition, the same DIME computing model is also implemented on bare metal, using a native operating system called Parallax. The Parallax OS converts each core in a many-core processor into a self-managed computing element (DIME) with signaling capability. A signaling network overlay over the computational service delivery network allows an orchestrator to develop, deploy and manage distributed service workflows. A video shows auto-scaling, self-repair, stateful live migration and end-to-end transaction FCAPS management implemented in multi-core servers (http://youtu.be/K0AxJPaA_RI). This approach seems to suggest that it is possible to use current generation OSs and development environments to create service workflow executables that can be run under the new OS, with run-time FCAPS management programmed to endow self-resilience.

Whether or not these are good answers for eliminating the complexity of VM management by eliminating the VMs themselves, the DIME network architecture provides an existence proof of an alternative to current complexity. It also points away from current ad-hoc implementations of distributed systems, toward a direction where more productivity improvements are possible by searching for computation models that are firmly based on solid theoretical foundations (as Goldin and Wegner suggest) and by exploiting the parallelism and high bandwidth now available in hardware. Kuhnian paradigm shift or not, hopefully there are alternatives that will change the direction of IT from the current “Buy now and pay-forever for maintenance software and services” philosophy, which is contributing to the cost and complexity. Hopefully, this will also end our current strategy of throwing human resources and more software at diagnosing and fixing problems rather than building self-correcting systems as cellular organisms do. As von Neumann was seeking, the decoupling of hardware infrastructure management from services management may point to a direction where building reliable services using not so reliable hardware is possible, as the cellular organisms do.

If you have transformational research results, or want to make a real difference in computer science research, see Call for Papers at:

www.workshop.kawaobjects.com and http://WETICE.org

Turing Machines, Cognition, Parallel Loosely Coupled Processes, and DIME Networks:

August 22, 2011

Louise Barrett [1], making a case that animals and humans depend on their bodies and environment, not just their brains, to behave intelligently, highlights the difference between Turing machines implemented using the von Neumann architecture and biological systems. “Although the computer analogy built on von Neumann architecture has been useful in a number of ways, and there is also no doubt that work in classic artificial intelligence (or, as it is often known, Good Old Fashioned AI: GOFAI) has had its successes, these have been somewhat limited, at least from our perspective here as students of cognitive evolution.” She argues that Turing machines based on algorithmic symbolic manipulation using the von Neumann architecture gravitate toward those aspects of cognition, like natural language, formal reasoning, planning, mathematics and playing chess, that involve the processing of abstract symbols in a logical fashion, and leave out other aspects of cognition that deal with producing adaptive behavior in a changeable environment. Unlike the approach where perception, cognition and action are clearly separated, she suggests that the dynamic coupling between various elements of the system, where each change in one element continually influences every other element’s direction of change, has to be accounted for in any computational model that includes the system’s sensory and motor functions along with analysis.

This emphasis on the sensory monitoring of the environment, dynamic coupling, connectivity and system-wide coordination is also confirmed by observations on cell communication.  According to biologist Sean B. Carroll [2], “cells communicate with one another by sending signals in the form of proteins that are exported and travel away from their source.  Those proteins then bind to receptors on other cells, where they trigger a cascade of events, including changes in cell shape, migration, the beginning or cessation of cell multiplication, and the activation or repression of genes.”

Cellular organisms developed very sophisticated computing models well before their brains evolved. The architectural resiliency of cellular organisms stems from their ability to manage highly temporal phenomena. System-wide connectivity and coordination require a sense of time, history and synchronization between various tasks performed by a group of loosely coupled elements, which, as Louise Barrett points out, the Turing machine implemented using stored program control lacks. Discussing the nature of temporal phenomena, she writes “This means simply that the actual rates and rhythms that characterize a particular process play an important and central role in getting the job done.  This could be the way that the underlying physical processes of the brain work (how long it takes for a neurotransmitter, like nitric oxide or glutamate, to diffuse through the brain, for example, or how long it takes for such neurotransmitters to modulate neuronal activity), which in turn could affect the specific duration or rates of change in other physiological processes.  Similar intrinsic rhythms in the body may also be important, as will other aspects of the body dynamics that relate to, for example, the mechanical properties of the muscle, which dictate where and how fast an animal can move.  These bodily processes may, in turn, need to be synchronized precisely with temporal processes occurring outside of the animal in the environment.” She also points out that this coordination and synchronization require system-wide information processing and routing, which the brain provides.

Compare this with the quest for real-time information processing currently being driven by global communication, collaboration and commerce at the speed of light. Whether it is high frequency trading, web-based commerce, social networking or federated enterprise computing, the ability to manage highly temporal phenomena in real time is becoming critical. System-wide connectivity, high availability, security and performance management require coordination with a sense of time, history and synchronization between various tasks performed by a group of loosely coupled elements. There are two drivers behind the search for new computing models that go beyond the current von Neumann computing model:

  1. Poor end-to-end distributed transaction reliability, availability, performance and security as recent episodes at Sony, Amazon, Google, and RSA [3, 4, 5 and 6] demonstrate.
  2. The hardware upheaval caused by the new class of many-core processors that allow parallelism which cannot be fully exploited with current state of software innovation.

The DIME network architecture introduced at WETICE 2010 [7] exploits parallelism and addresses end-to-end distributed transaction management using a signaling network overlay over a network of von Neumann stored program control (SPC) computing nodes to implement dynamic fault, configuration, accounting, performance, and security management of both the nodes and the network based on business priorities, workload variations and latency constraints. The following video explains the differences between DIME networks implementing a network of managed Turing machines and the von Neumann computing architecture.

Figure 1: Screenshot of the DIME network architecture implemented in the Linux Operating System (OS)

Two implementations of DIME networks demonstrate the feasibility of this architecture [8, 9]. In one implementation, a Linux process is encapsulated as a DIME to create a DIME network. Figure 1 shows a screenshot of a 6-DIME network executing a simple C program in parallel. Each DIME is programmed to self-manage by monitoring heartbeat, performance etc. A supervisor DIME implements recovery policies and end-to-end workflow management. In the second implementation, a native OS called Parallax is implemented from scratch in assembler, with C and C++ interfaces, to encapsulate each core as a DIME. The following video shows the self-repair feature of the DIME network architecture implemented using the Parallax operating system in a multi-core server network.
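A rough sketch of heartbeat supervision with a simple recovery policy follows. It is not the code from [8] or [9]; the `Supervisor` class, its method names, the timeout value and the DIME identifiers are assumptions made for illustration. Timestamps are passed in explicitly so the behavior is easy to follow and test; a real supervisor would read a clock and receive heartbeats over the signaling channel.

```python
from typing import Callable, Dict, List


class Supervisor:
    """Tracks heartbeats from worker DIMEs and restarts the silent ones."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, dime_id: str, now: float) -> None:
        """Record that a DIME reported in at time `now` (seconds)."""
        self.last_seen[dime_id] = now

    def silent(self, now: float) -> List[str]:
        """Return the DIMEs whose last heartbeat is older than the timeout."""
        return sorted(
            dime_id
            for dime_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )

    def recover(self, now: float, restart: Callable[[str], None]) -> List[str]:
        """Apply the recovery policy: restart every silent DIME."""
        failed = self.silent(now)
        for dime_id in failed:
            restart(dime_id)
            self.last_seen[dime_id] = now  # treat the restart as a fresh heartbeat
        return failed


if __name__ == "__main__":
    supervisor = Supervisor(timeout_s=2.0)
    supervisor.heartbeat("dime-1", now=0.0)
    supervisor.heartbeat("dime-2", now=0.0)
    supervisor.heartbeat("dime-1", now=3.0)  # dime-2 stays silent
    restarted: List[str] = []
    print(supervisor.recover(now=3.5, restart=restarted.append))  # ['dime-2']
```

The restart action is injected as a callback, so the same supervisor logic could drive different recovery policies, such as respawning a process or migrating the work elsewhere.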

As multiple reviewers of the DIME papers noted, this approach is “novel and interesting”.  It attempts to bring the architectural resiliency of cellular organisms by exploiting the parallelism offered by the multi-core architecture and the signaling abstractions that are essential for implementing dynamic temporal systems.  However, as pointed out in one of the papers [8], “The history of the evolution of current OSs is filled with lessons on wasted billions (does anyone remember Multics or OS2?), unmet expectations (who would have thought UNIX, the original System V, would vanish), surprise winners (Windows and Linux), and stealthy survivors (Mach in a Mac).”

Figure 2 shows the DIME network architecture compared with conventional computing and cloud/grid computing architectures in terms of resiliency, efficiency and scaling.  The resiliency is measured with respect to a service’s tolerance to faults, fluctuations in contention for resources, performance fluctuations, security threats and changing business priorities.  Efficiency is measured in terms of total cost of ownership and return on investment.  Scaling addresses end-to-end resource provisioning and management with respect to increasing number of computing elements required to meet service needs. Current operating systems do not easily scale because of their dependence on current lock mechanisms, thread technology implementations and their size and complexity.

Figure 2: Resiliency, Efficiency and Scaling of Distributed Systems

As the picture depicts, the grid and cloud computing paradigms automate many of the administrative tasks in assuring service management while the DIME (Distributed Intelligent Managed Element) network architecture attempts to provide dynamic self-management of service workflows.  The cloud and grid architectures do not scale easily and increase complexity with layers of management systems to compensate for the serial von Neumann implementation of the service nodes that do not address temporal dynamics of distributed systems.

The DIME network architecture is a departure from the conventional wisdom currently being pursued by universities and corporate R&D. It adds monitoring and control of each Turing computing node and a signaling enabled network to implement the management of temporal behavior of workflows executed as directed acyclic graphs using a network of managed Turing machines. The concept of a parallel signaling channel is foreign to the current generation of IT professionals, except for those with telecommunications or voice over IP experience. (Unfortunately, the dismantling of research organizations such as AT&T Bell Labs, which were the guardians of institutionalized technical knowledge, the increasing commercialization of university research and a Goldman-Sachs-like investment flipping philosophy in pursuit of instantaneous profits have put a dent in disruptive innovation and the preservation of institutionalized know-how.) Signaling allows establishing equilibrium patterns and monitoring and controlling exceptions system-wide. It allows contention resolution based on a system-wide view and eliminates race conditions and other common issues found in current distributed computing practice. In systems with strong dynamic coupling between various elements of the system, where each change in one element continually influences the other elements’ direction of change, signaling in the computational model helps implement system-wide coordination and control based on system-wide priorities, workload fluctuations and latency constraints. The DIME network architecture is either profoundly disruptive (disruptive to the vendors of high maintenance software, shelf-ware and recurring services, but not to their customers, who will benefit from the simplicity of services management with temporal dynamics at both the node and network level) or it is like many other ideas that will go nowhere for many odd reasons. The history of IT is filled with many forgotten or mismanaged innovations [10, 11 and 12].
It is hard to predict which solution evolution prefers in its continuous quest for lower entropy. Sometimes it favors revolutions and at other times it is satisfied with incremental solutions.

But again, as Mitchell Waldrop points out, revolutions are not revolutions if they are believed in at the start. Are they?

“How did it go in Berkeley? Did they like your ideas?”
“It was the pits,” said Arthur. “Nobody there believes in increasing returns.”
Susan Arthur had seen her husband returning from the academic wars before. “Well,” she said, trying to find something comforting to say, “I guess it wouldn’t be a revolution, would it, if everybody believed in it at the start?”

-- Waldrop, M. M., “Complexity: The Emerging Science at the Edge of Order and Chaos”, New York, Simon and Schuster, (1992) p 19.

References:

  1. Barrett, L., “Beyond the Brain: How Body and Environment Shape Animal and Human Minds,” Princeton University Press, Princeton, 2011, p 116, 122
  2. Carroll, S. B., “The New Science of Evo Devo - Endless Forms Most Beautiful”, New York: W. W. Norton & Co. 2005, p74.
  3. Morris, C. (2011). Sony PlayStation Facing Yet Another Security Breach, New York, CNBC.com (http://www.cnbc.com/id/43079509 )
  4. Thibodeau, P. and Vijayan, J. (2011). Amazon EC2 service outage reinforces cloud doubts. Computerworld (http://www.computerworld.com/s/article/356212/Amazon_Service_Outage_Reinforces_Cloud_Doubts)
  5. http://www.searchenginejournal.com/googles-downtime-affected-5-of-the-internet/10463/
  6. Moscaritolo, A. “RSA confirms Lockheed hack linked to SecurID breach,” 2011, SC MAGAZINE, June 07
    (http://www.scmagazineus.com/rsa-confirms-lockheed-hack-linked-to-securid-breach/article/204744/ )
  7. Rao Mikkilineni and Giovanni Morana, “Is the Network-centric Computing Paradigm for  Multi-core, the Next Big Thing?” ( http://computingclouds.wordpress.com )
  8. Morana, G., and Mikkilineni, R., “Scaling and Self-repair of Linux Based Applications Using a Novel Distributed Computing Model Exploiting Parallelism”. IEEE proceedings, WETICE2011, Paris, 2011
  9. Mikkilineni R., and Seyler, I. “Parallax – A New Operating System Prototype Demonstrating Service Scaling and Self-Repair in Multi-core Servers”, IEEE proceedings, WETICE2011, Paris, 2011
  10. Can Cisco Sustain Competitive Differentiation on Operational Excellence Alone? (www.metooeconomist.wordpress.com)
  11. Acquisitions, Innovation and the Economics of the Invisible Hand (www.metooeconomist.wordpress.com)
  12. Are the Short-Term Profit Motives and Wall Street-like Investing under the Influence, Trumping Long Term Innovation in the Silicon Valley?

WETICE2011 – Paris, June 27 – 29, 2011: Convergence of Distributed Clouds, Grids and Their Management – Toward a Unified Theory of Computing with Telecom Grade Trust

May 22, 2011

Track Summary And Agenda

Dr. Rao Mikkilineni

IEEE Member

Kawa Objects Inc.,

Los Altos, California, USA

rao@kawaobjects.com

and

Dr. Giovanni Morana

DIEEI

University of Catania

Catania, Italy

giovanni.morana@dieei.unict.it

The stated objective of the first workshop on “Collaboration and Cloud Computing” in WETICE 2009 was “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short-term profit driven motives of a particular corporate entity.”

This track discusses the progress made over three conferences, which helped develop a new approach to the convergence of distributed clouds, grids and their management. It presents a new computing model and its implementation, which resulted directly from the collaboration between two workshops sponsored under the aegis of WETICE: the Collaboration and Cloud Computing workshop and the Emerging Technologies for Next-Generation Grids workshop. We also discuss one of the key issues that still needs to be addressed in order to improve efficiency and fully utilize the new generation of many-core servers that are transforming the information technology landscape.

Current Information Technology solutions have become silos of server, storage and network infrastructure with poor end-to-end distributed transaction reliability, availability, performance and security, as recent episodes at Sony, Amazon, Google, and RSA [1, 2, 3 and 4] demonstrate.  We believe that there is a need to reexamine the fundamental architectural foundation of Information Technologies in order to transform data centers from their current role as managed server, networking, and storage hosting centers (whether physical or virtual) into true service switching centers with telecom grade trust.  We need a paradigm shift from resource switching and connection management to services switching and service connection management.  We also believe that new approaches are essential to replace current efforts that replicate today's data center complexity inside the many-core servers as well.  We hope WETICE2011 will continue the tradition of forging new collaborations that lead to innovation, free of a short-term agenda that is based on immediate profit motives and proposes incremental solutions which do not address fundamental cost and complexity issues. An opportunity exists to match the hardware upheaval unleashed by the many-core chips with equally innovative software solutions.

We conclude by quoting von Neumann [5]: “The basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately.” Comparing computing machines and living organisms, he points out that computing machines are not as fault tolerant as living organisms. He goes on to say, “It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond.”

—  Neumann, J. v., General and Logical Theory of Automata. edited and compiled by William Aspray and Arthur Burks, MIT Press, 1987, p408

We need the architectural resilience of cellular organisms at the core of our IT infrastructure to support global communication, collaboration and commerce at the speed of light with telecom grade trust.  Replicating the current complexity of our data centers inside the many-core servers and devices as well would be equivalent to creating quicksand on which service castles will be built!

15 Papers received: 6 rejected, 6 accepted as FULL papers and 3 accepted as SHORT papers

For the Agenda, the Venue and accommodation, please visit

http://events.telecom-sudparis.eu/wetice/program/index_cdcgm.php

http://events.telecom-sudparis.eu/wetice/
 
Social Event: http://events.telecom-sudparis.eu/wetice/social_event/
 
See you in Paris.

References:

  1. Morris, C. (2011). Sony PlayStation Facing Yet Another Security Breach, New York, CNBC.com (http://www.cnbc.com/id/43079509 )
  2. Thibodeau, P. and Vijayan, J. (2011). Amazon EC2 service outage reinforces cloud doubts. Computerworld (http://www.computerworld.com/s/article/356212/Amazon_Service_Outage_Reinforces_Cloud_Doubts )
  3. http://www.searchenginejournal.com/googles-downtime-affected-5-of-the-internet/10463/
  4. Moscaritolo, A. RSA confirms Lockheed hack linked to SecurID breach, 2011, SC MAGAZINE, June 07 (http://www.scmagazineus.com/rsa-confirms-lockheed-hack-linked-to-securid-breach/article/204744/ )
  5. Neumann, J. v. (1987). General and Logical Theory of Automata. edited and compiled by William Aspray and Arthur Burks, MIT Press, p408

“Look Ma! No Hypervisor in My Clouds!!” and Other Future IT Trends in WETICE 2011

January 3, 2011

  Dr. Rao Mikkilineni and Dr. Giovanni Morana

Co-Chairs

  1st Track on Convergence of Distributed Clouds, Grids and their Management;
A combined track of the 3rd CCC and the 8th ETNGRID

 WETICE 2011 – Conference: 20th IEEE International Conference on Collaboration Techniques and Infrastructure 

Summary:

The WETICE 2011 track on Convergence of Distributed Clouds, Grids and their Management, to be held in Paris (June 27 – 29), has already received some interesting papers.  They include papers on new computing models, Clouds, Grids and their management, as well as some proof-of-concept demonstrations.  The Proceedings will be published by the IEEE CS Press and distributed at the conference.  All published papers are refereed.  The last date for submission of papers is March 5, 2011.

This post previews the ideas that will be discussed in June, in the hope of attracting further papers that could make a major impact on next generation clouds, grids and their management.

Master Foo and the Old Hand [1]

An experienced UNIX programmer, hearing of Master Foo’s wisdom, came to him for guidance. Approaching the Master, he bowed three times and said:

“Master Foo, I am gravely troubled. In my youth, those who followed the Great Way of Unix used software that was simple and unaffected, like ed and mailx. Today, they use vim and mutt. Tomorrow I fear they will use KMail and Evolution, and Unix will have become like Windows — bloated and covered over with GUIs.”

Master Foo said: “But what software do you use when you want to draw a poster?”

The programmer replied: “I…have never done that. But I am sure that I could use LaTeX or pic to accomplish it without GUIs, in the proper UNIX way.”

Master Foo then said: “Which one will reach the other side of the river: The one who dreams of a raft, or the one that hitchhikes to the next bridge?”

Upon hearing this, the programmer was enlightened.

From POTS, PANS, and SANs to Clouds, Grids and Their Management:

Ma Bell (known as AT&T) introduced the standard for a voice service in a cloud in 1907, when Theodore Vail, then president of AT&T, unveiled a vision for providing universal service and telecom grade trust (a reliable, secure and high performance connection at a reasonable cost).  The service was analog.  The operation was manual.  The dial tone was introduced to assure the telephone user that the exchange was functioning: when the telephone was taken off-hook, an audible tone broke the silence before an operator responded. It has become a symbol for universal service and telecom grade trust.  Later on, the automated exchanges provided a benchmark for telecom grade trust that assures managed resources on demand with high availability, performance and security.  Today, as soon as the user goes off hook, an intelligent network recognizes the user profile based on the calling telephone number.  As soon as the destination party number is dialed, the network recognizes the destination profile, provisions all the network resources required to make a secure connection, authenticates service usage, commences billing, and monitors and assures the connection until one of the parties initiates a disconnect.  The history, from the days of (upper case) AT&T’s inception to the days when it was transformed to (lower case) at&t in 2005, is filled with lessons on regulation, brilliant Bell Labs innovation, the rise, missed or mismanaged business opportunities, deregulation, unscrupulous competitors such as WorldCom who cooked the books, the fall and the rebuilding of a great corporation in the United States.

A century later, the computing industry is rediscovering the same lessons in offering cloud based computing services.  Virtual computing services offered in the cloud are transforming the way consumers and businesses communicate, collaborate and conduct commerce at the speed of light.  The infrastructure makers, the service developers and the service operators are striving to capture a large share of a big market by racing to provide universal access to virtual computing services with telecom grade trust.  There is a battle brewing between the “Appliances for High Performance Camp” and the “Open System Software and Services Approach for Everything Camp” of the current IT infrastructure vendors.  Figure 1 presents a reference model that shows the various stakeholders and their stakes in the current cloud computing market.  The products and services that have evolved bottom up in the services stack from the server, network, storage or application and infrastructure software domains are expanding their reach into other domains to gain market share.

Current Cloud Market Landscape

Today, computing virtualization is provided with Hypervisor technology to create virtual servers, network virtualization is provided through multi-protocol routers and switches, and storage virtualization is provided through specialized appliances supporting NAS and SAN.  New appliances are being rolled out for databases and storage transaction management.  Different virtualization platforms, and the orchestrators that integrate them, are flooding the market.  The costs of associated services are skyrocketing.  Figure 1 shows the various layers of management that provide application-specific availability, reliability, performance, security and billing functions.  Various vendors play at various levels in each layer.  The complexity of heterogeneity, of multiple vendor solutions and of the orchestrators that integrate them has been overwhelming the service developers, operators and consumers.

The situation is very similar to the days before Strowger’s switch, which eliminated the many operators sitting in long rows plugging countless plugs into countless jacks and reduced the cost of adding new subscribers, a cost that had been rising in geometric proportion.  It was estimated that the management cost of a new subscriber was more than the revenue that the subscriber contributed.  According to the Bell System chronicles, one large-city general manager of a telephone company at that time wrote that he could see the day coming soon when he would go broke merely by adding a few more subscribers [2].  The only difference between today’s IT data center and the central office before Strowger’s switch is that “fewer, but very expensive, consultants, countless hardware appliances, and countless software systems that manage them” replace “many operators, countless plugs and countless jacks”.  In addition, we have to account for the shelf-ware, the latency introduced by today’s systems administration paradigm (notwithstanding automation à la RightScale and Scalr) and the costs incurred through the services business that has come to dominate IT vendor revenue streams, a business that helps manage complexity which was itself sold in the guise of improving productivity and lowering Total Cost of Ownership.  It is estimated that 60% to 70% of IT data center cost is in operation and management, with or without virtualization, in spite of a 10X improvement in hardware, space and energy savings with the new class of servers available today [3].  Figure 2 shows the percentage Total Cost of Ownership (TCO) over five years, by component, for a 1500-server data center with and without virtualization.

Five Year TCO of Virtualization According to a Vendor ROI Calculator

While virtualization introduces many benefits such as consolidation, real-time business continuity and elastic scaling of resources to meet wildly fluctuating workloads, it adds another layer of management systems on top of the current computing, network, storage and application management systems.  Figure 3 shows a 50% reduction of the five-year TCO with virtualization.  A Virtual Machine density of about 13 allows a great saving in hardware costs, which is somewhat offset by the new software, training and services costs of virtualization.

TCO over 5 Years with virtualization of 1500 servers using 13 VMs per Server

In addition, there is the cost of new complexity in optimizing the 13 VMs within each server in order to match the resources (network bandwidth, storage capacity, IOPS and throughput) to application workload characteristics, business priorities and latency constraints.  According to storage consultant Jon Toigo [4], “Consumers need to drive vendors to deliver what they really need, and not what the vendors want to sell them. They need to break with the old ways of architecting storage infrastructure and of purchasing the wrong gear to store their bits: Deploying a “SAN” populated with lots of stovepipe arrays and fabric switches that deliver less than 15% of optimal efficiency per port is a waste of money that bodes ill for companies in the areas of compliance, continuity, and green IT.”

Figure 4 shows the transition from managed physical server farms to managed virtual server farms:

Transition From Managed Server Farms to Managed Virtual Server Farms

The cost per VM is estimated to be around $2500, with minor variation across VMware, Microsoft, Red Hat or Citrix solutions.  This is consistent with a 2X improvement over a managed physical server cost of about $5000.  In spite of vendor claims, there is not much difference between different Hypervisors, just as there was no big difference between DB2, Oracle and Sybase in the 1980s.  They are all equally knowledge intensive and require expensive services solutions to maintain.  Further improvements in TCO have to come from new approaches that drastically reduce the complexity of the current layers of management systems.  These approaches may have to leverage hardware assisted management functionality in the computing infrastructure and new software technologies that support self-configuring, self-monitoring, self-securing, self-healing and self-optimizing architectures. Perhaps it is time to design a new class of autonomic systems that satisfy their eight defining characteristics [5].  Suffice it to say that none of the systems in our data centers today come close to satisfying these defining characteristics.  They are worth looking up!
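The figures quoted above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope illustration: the function name is hypothetical, it assumes each of the 1500 physical servers becomes exactly one VM, and it treats the roughly $5000 and $2500 figures as comparable per-unit costs over the same period, which the vendor calculator may not do.

```python
import math

def consolidation_estimate(workloads, vms_per_host, cost_per_physical, cost_per_vm):
    """Return (hosts needed, physical total, virtual total, fractional saving).

    Hypothetical helper: assumes one workload per VM and per-unit costs
    that cover the same time period.
    """
    hosts_needed = math.ceil(workloads / vms_per_host)
    physical_total = workloads * cost_per_physical
    virtual_total = workloads * cost_per_vm
    saving = 1 - virtual_total / physical_total
    return hosts_needed, physical_total, virtual_total, saving

# Figures taken from the text: 1500 servers, 13 VMs per server,
# about $5000 per managed physical server, about $2500 per VM.
hosts, physical, virtual, saving = consolidation_estimate(
    workloads=1500, vms_per_host=13, cost_per_physical=5000, cost_per_vm=2500)
print(f"{hosts} hosts, ${physical:,} vs ${virtual:,}, saving {saving:.0%}")
```

Under these assumptions the 1500 workloads fit on 116 hosts and the total falls from $7,500,000 to $3,750,000, which matches the 50% reduction cited for Figure 3.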

With huge maintenance and service revenues at stake, current IT vendors do not have incentives to invest in R&D that develops autonomic systems [6, 7 and 8], which would destroy their recurring revenue model.  One IT vendor makes four to five times more revenue annually from accumulated maintenance and services contracts than from new product sales with a captive customer base.  It seems that it does pay to design high maintenance systems.  Without questioning, we pay for security software that fixes problems with operating systems that should not be there in the first place.  History again provides clues to this behavior.  Ma Bell was very reluctant to deploy fiber because it cannibalized its revenues from investments in copper.  It took deregulation to unleash the fiber. AT&T was dragged kicking and screaming to digital switching by the introduction of DMS switches by competing Nortel.  If history is any guide, perhaps this is an opportunity for new players to eat the current IT players’ lunch, breakfast and dinner with fresh ideas and new R&D. Or perhaps we have to wait for a Dragon Warrior [9].

Dragon Warrior or not, fortunately, high bandwidth wire-line and wireless networks and multi-core, multi-CPU servers with hardware assisted virtualization and management features incorporated at the chip level are offering new alternatives which can potentially simplify the IT computing architecture and further reduce the operation and management costs.  The next conference on “Convergence of Distributed Clouds, Grids and their Management”, sponsored under the aegis of WETICE 2011 and to be held at “Institut Telecom, Telecom Sud Paris”, Paris, France (www.telecom-sudparis.eu) from June 27th to 29th, 2011, is devoted to discussing new approaches that go beyond the current state of the art.  We are inviting bold and new ideas from corporate R&D (if there still exists Bell Labs class innovation in fundamental computer science to create the next generation von Neumann or Master Foo), universities and entrepreneurs to propose and demonstrate potential next generation systems and ideas without limits on imagination.  What sets WETICE apart from larger conferences is that the conference tracks are kept small enough to promote fruitful discussions on the latest technology developments, directions, problems, and requirements. Each track includes paper presentations and group discussions, while the keynote sessions and summary of discussions take place in joint sessions. WETICE welcomes papers on “work-in-progress” from Ph.D. students, whose imagination is not often limited by commercial feasibility or financial agendas [7 and 8].

We are already receiving some interesting papers.  For example, a proof-of-concept implementation of a new Distributed Intelligent Managed Element (DIME) Network computing model [3] will be demonstrated.  DIME networks could provide a network computing model to create distributed computing clouds and execute distributed managed workflows with a high degree of agility, and with managed reliability, availability, performance, security and utilization of distributed computing, storage and network resources.  While this computing model can be implemented using current generation IT infrastructure, the paradigm is ideally suited to fully utilize the new generation of multi-core, multi-CPU servers to implement highly scalable, parallel and distributed systems transcending physical infrastructure or geographical boundaries.

Three different implementations of the DIME computing model will be demonstrated:

  1. A bare-metal implementation using Parallax, an OS built from the ground up to leverage parallelism and multi-core chips in 64-bit architectures, on which the DIME network executes managed workflows.  This implementation removes the need for a Hypervisor-based virtual server and allows distributed managed computing services to be delivered with universal access and telecom grade reliability.
  2. An implementation on Virtual Servers loaded with a just-enough Linux OS, compatible with current cloud architectures.
  3. An implementation using a just-enough Linux OS to create DIME networks on current generation servers.

Figure 5 shows the Parallax version of the DIME network implementation, which eliminates the need for a Hypervisor, together with the service creation, delivery and assurance platform architecture.

DIME Network Computing Model and Distributed Parallel Computing Architecture

The DIME network computing model proposes a signaling network overlay over the computing network and allows parallelism in resource monitoring, analysis and reconfiguration based on workload variations, business priorities and latency constraints of the distributed software components.  A workflow is implemented as a set of tasks organized in a directed acyclic graph (DAG) and executed by a managed network of distributed computing elements (called Distributed Intelligent Managed Elements, DIMEs). These tasks, depending on user requirements, are programmed and executed as loadable modules in each DIME.  Figure 6 shows the DIME network based services creation, deployment and assurance framework.

Potential Distributed Services Creation, Delivery and Assurance Framework
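The idea of a workflow as a DAG of tasks executed by a set of DIMEs can be illustrated with a small sketch. This is not the DIME implementation: the function `run_workflow`, the task names and the round-robin assignment are all hypothetical, the "DIMEs" are just labels, and the signaling overlay and management functions are left out entirely. It only shows tasks being dispatched in dependency order, using Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

def run_workflow(dependencies, tasks, dimes):
    """Run tasks in dependency order, assigning each to a DIME label round-robin.

    dependencies: task name -> set of task names it depends on
    tasks:        task name -> callable taking its dependencies' results
                  (passed in alphabetical order of dependency name)
    dimes:        list of labels standing in for computing elements
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                      # raises CycleError if not a DAG
    results, log, turn = {}, [], 0
    while sorter.is_active():
        for name in sorter.get_ready():   # tasks whose inputs are all done
            dime = dimes[turn % len(dimes)]
            turn += 1
            inputs = [results[d] for d in sorted(dependencies.get(name, ()))]
            results[name] = tasks[name](*inputs)
            log.append((dime, name))
            sorter.done(name)
    return results, log

# Hypothetical four-task workflow: fetch feeds parse and count, which feed report.
dependencies = {
    "fetch": set(),
    "parse": {"fetch"},
    "count": {"fetch"},
    "report": {"count", "parse"},
}
tasks = {
    "fetch": lambda: "a b c",
    "parse": lambda text: text.split(),
    "count": lambda text: len(text),
    "report": lambda n_chars, words: f"{len(words)} words, {n_chars} chars",
}
results, log = run_workflow(dependencies, tasks, ["dime-1", "dime-2"])
print(results["report"])
```

In this toy version the tasks run sequentially in one process; in a real deployment the tasks that become ready together ("parse" and "count" here) are the ones that could be executed in parallel on different elements.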

In addition, WETICE 2011 will present papers on the state of the art and future trends in understanding the convergence of clouds, grids and their management.

We are seeking more new ideas to discuss in the conference.

The goals of the conference include (but are not limited to):

  1. Discovering new application scenarios, proposing new operating systems, programming abstractions and tools
  2. Identifying the challenging problems that still need to be solved, such as parallel programming, scaling and management of distributed computing elements, and
  3.  Reporting results and experiences gained by researchers in building dynamic Grid-based middleware, computing clouds (distributed or otherwise) and workflow management systems.

Various deadlines to participate in the conference are as follows:

  • Paper submissions deadline: March 5, 2011
  • Decision to paper authors: April 4, 2011
  • Camera Ready papers to IEEE: April 29, 2011
  • WETICE-2011 conference: June 27-29, 2011

All papers submitted will be reviewed by peers and selected papers will be published in WETICE2011 Conference Proceedings by IEEE.

All papers must be submitted to workshop@kawaobjects.com.

Details of the conference and past conference archives can be accessed at www.workshop.kawaobjects.com, http://etngrid.diit.unict.it and http://wetice.org.

References:

  1. http://catb.org/esr/writings/unixkoans/oldhand.html
  2. http://www.telephonetribute.com/switches.html#Mr. Almon B. Strowger and His Electric Telephone Switch
  3. Rao Mikkilineni and Giovanni Morana, “Is the Network-centric Computing Paradigm for Multi-core, the Next Big Thing?” (http://computingclouds.wordpress.com)
  4. Jon Toigo, http://www.datastorageconnection.com/article.mvc/Jon-Toigo-Exposes-More-About-Data-Storage-Ven-0001
  5. http://www.research.ibm.com/autonomic/overview/elements.html
  6. Acquisitions, Innovation and the Economics of the Invisible Hand
  7. Are the Short-Term Profit Motives and Wall Street-like Investing under the Influence, Trumping Long Term Innovation in the Silicon Valley?
  8. Organizational DNA, Disruptive Innovation, Economics at the Speed of Light and the Impact on Investment in Future
  9. Kung Fu Panda, The Movie, DreamWorks, 2008