Data Management for SOA
by
Fred Cummins
In a Service Oriented Architecture (SOA), services are loosely coupled and are accessed across organizational boundaries. The service units (the business activities that perform the services and manage the associated capabilities) are finer grained than traditional business functions and their supporting enterprise applications. The decoupling of finer-grained capabilities is key to enterprise agility and economies of scale. The implementations of service units both outside and within the enterprise should be independent of other service units so that they can be independently optimized and adapted to new requirements with minimal impact on their users. However, this decoupling and autonomy conflict with the use of shared databases.
Those who focus on data management have, for decades, driven the industry toward consolidation of databases under a philosophy that tighter coupling means greater efficiency and consistency. Putting all the data in one place eliminates duplication and synchronization problems. The data gurus are now struggling to reconcile data management with the loose coupling of SOA.
Jill Dyche asserts that "SOA Starts with Data." She advocates creating data services: data hubs exposed as services that manage and provide access to master data. Starting with data services appeals to IT organizations that feel the need to adopt SOA.
However, if the data services duplicate data that exists in the operational service units, the problems of latency, inconsistency and synchronization are not eliminated, and accountability for data integrity may be fragmented among multiple business organizations. If the data services are expected to provide shared data storage for future services, this raises concerns about the performance of the services that use that data, and it undermines the autonomy of the service units. Finally, the creation of these data services achieves minimal, if any, business benefit.
Dan Gardner in "SOA and compute clouds point to rethinking data entirely: roles and permissions, not rows and tables" observes that much of an enterprise's data is no longer controlled by the IT organization and exists in many forms on PCs, in PDAs, in stakeholder systems and various services on the internet. With SOA and cloud computing, the data stores may be scattered over multiple, distributed computers.
Of course, there has always been data outside the confines of the IT systems, but now the volume of data is exploding, and connectivity and the Internet have made more data accessible from diverse sources.
The mass of data outside the control of the enterprise is not a SOA issue. These data should be viewed as a source of insights about the ecosystem, market trends and opinions that may affect the enterprise. These sources must be selected and filtered to obtain meaningful results, but they can't be controlled any more than they ever were.
The potential for exposure of proprietary or confidential data is a security risk, but it is not a fundamental change requiring a rethinking of data management. The mechanisms and models for management of access control do need some rethinking to deal with the multitude of system and user interactions both within the enterprise and with external stakeholders. The consequential risks are increased by the Internet and the portability of mass storage media.
The data that must remain the primary focus of attention for SOA are the data produced, consumed and managed by business systems that represent the past, present or future state of the enterprise. From a business perspective, the concerns are not a matter of distributed storage but how the data are validated, managed and protected.
Steve Karlovitz proposes development of a data service layer in "SOAs and Data Management: Understanding the Data Service Layer." It is not clear from his blog how he defines the Data Service Layer that he characterizes as "a single entry point" and "centralized." I see three different interpretations: (1) the data service layer is a data access facility that supports database access by all applications using a canonical view of a shared database, similar to an object-relational transformation facility; (2) data from heterogeneous application databases is replicated and integrated in an enterprise database with a canonical data schema; (3) access to heterogeneous databases is provided through requests expressed as queries on a canonical, virtual database.
The first approach is the traditional shared database that includes data edits and access controls. While isolation of the physical data structures from the application views is helpful, it raises concerns similar to those for Jill Dyche's data services. In addition, many services will continue to use legacy or COTS systems that incorporate their own databases. Heterogeneity of service unit implementation technologies is fundamental to SOA agility. It enables localized service adaptation and adoption of new technologies.
Replication of data in a shared database (option 2) is useful for providing an enterprise view of the state of the enterprise. This is essentially an operational data store or reporting database. Inconsistencies can be reconciled in the loading process. However, there will be delays in the updates from various sources, so achieving a fully consistent view may still be difficult. This replicated data should be used only for queries, because it would be very difficult to manage updates. The master data, "the single version of the truth," is still in the source databases and must be controlled by their owners.
The third approach is the EII (Enterprise Information Integration) solution. A canonical, enterprise data model defines a virtual database that is the target for queries. The queries and responses are translated to obtain an integrated result from heterogeneous databases. EII did not gain much market acceptance when it was introduced several years ago, but with SOA, its time has come. While some EII tools support updates to the heterogeneous databases, updates should still be controlled by the service units that own those databases.
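The translation step behind this third approach can be illustrated with a minimal sketch. It is not any EII product's API: the SourceAdapter class, the federated_query function, the field names and the sample rows are all invented for illustration, and the "databases" are plain in-memory lists. Each adapter maps one service unit's native rows onto an assumed canonical model, and the query runs against the canonical view only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# Hypothetical canonical record; the field names stand in for an
# assumed enterprise logical data model.
Canonical = Dict[str, object]


@dataclass
class SourceAdapter:
    """Illustrative wrapper around one service unit's database."""
    name: str
    fetch: Callable[[], Iterable[dict]]        # rows in the source's own schema
    to_canonical: Callable[[dict], Canonical]  # native row -> canonical record


def federated_query(
    adapters: List[SourceAdapter],
    predicate: Callable[[Canonical], bool],
) -> List[Canonical]:
    """Run one canonical query over every source and merge the results.

    Read-only by design: there is no update path, so changes still go
    through the service unit that owns each source database.
    """
    results: List[Canonical] = []
    for adapter in adapters:
        for row in adapter.fetch():
            record = adapter.to_canonical(row)
            if predicate(record):
                # Tag each result with its origin so the owner is traceable.
                results.append({**record, "source": adapter.name})
    return results


# Invented sample data in two different native schemas.
order_entry_rows = [{"cust_no": 7, "nm": "Acme", "region": "EU"}]
billing_rows = [
    {"customerId": 9, "customerName": "Globex", "area": "US"},
    {"customerId": 10, "customerName": "Initech", "area": "EU"},
]

adapters = [
    SourceAdapter(
        "order_entry",
        lambda: order_entry_rows,
        lambda r: {"customer_id": r["cust_no"], "name": r["nm"], "region": r["region"]},
    ),
    SourceAdapter(
        "billing",
        lambda: billing_rows,
        lambda r: {"customer_id": r["customerId"], "name": r["customerName"], "region": r["area"]},
    ),
]

eu_customers = federated_query(adapters, lambda c: c["region"] == "EU")
```

A real EII tool would push the query down to each source rather than fetching every row, but the shape is the same: one canonical query, several schema translations, one merged answer.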
So the Data Service Layer (assuming approach 2 or 3) provides a solution for an enterprise view, but it does not provide a solution for management of the data that is shared by multiple service units.
In "The Case for Enterprise Data Services in SOA" Jeff Pollock also defines a layered data services approach, but he is particularly concerned that web services technology (incorporating XML, WSDL, etc.) has too much overhead for large volume data transfers. This is true, but SOA does not demand the universal use of web services technology. Conventional techniques for Extract, Transform and Load (ETL) are still appropriate for bulk data transfers.
Many of these concerns about data management arise as a result of viewing SOA as a technology instead of a business architecture. A service is provided by an organization (a service unit) that includes not only one or more applications and databases, but also people, intellectual property, and other facilities and resources that are necessary to produce business value, which is the product of the service. The value of SOA comes from the ability to integrate these service unit capabilities in multiple business contexts, and the ability to optimize and adapt them with minimal impact on their users.
Ideally, each service unit has its own database that defines the state of its operation and supports its activities. This will result in some of the same data occurring in the databases of multiple service units. This duplication may be resolved (1) by exchanging updates or (2) by consolidation.
There are business trade-offs to be considered. For each data element there must be one master source, one service unit that is responsible for the integrity of that data element. Of particular concern is accountability for critical business records. Most often, the responsible service unit is the service unit that creates the data or does the most updates. For example, customer records are typically captured in association with order entry because that is where most updates originate. Updates from other service units must be validated and controlled by the service unit responsible for the master data. The data could be stored and controlled in a separate data service, but that just means there is one more database to synchronize.
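The rule that other service units' updates pass through the owner of the master data can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed design: the MasterDataOwner class, the propose_update method, the validation rules and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class MasterDataOwner:
    """Hypothetical service unit accountable for one kind of master record."""
    name: str
    validate: Callable[[dict], Optional[str]]  # returns an error message, or None if valid
    records: Dict[int, dict] = field(default_factory=dict)
    audit: List[Tuple[str, int, bool]] = field(default_factory=list)

    def propose_update(self, requester: str, key: int, changes: dict) -> bool:
        """Apply changes only if the merged record passes the owner's rules."""
        candidate = {**self.records.get(key, {}), **changes}
        accepted = self.validate(candidate) is None
        # Every attempt is logged, accepted or not, to support accountability.
        self.audit.append((requester, key, accepted))
        if accepted:
            self.records[key] = candidate
        return accepted


def validate_customer(record: dict) -> Optional[str]:
    """Invented validation rules for the illustration."""
    if not record.get("name"):
        return "name is required"
    if "@" not in record.get("email", ""):
        return "email looks invalid"
    return None


# Order entry is assumed to be the master source for customer records.
order_entry = MasterDataOwner("order_entry", validate_customer)

accepted = order_entry.propose_update(
    "order_entry", 42, {"name": "Acme", "email": "ap@acme.example"}
)
# A change from another service unit that fails validation is refused,
# and the master record is left untouched.
rejected = order_entry.propose_update("billing", 42, {"email": "not-an-address"})
```

The same gatekeeping would apply if the master data lived in a separate data service; the sketch only shows that one party decides what enters the master record.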
On the other hand, consolidation of databases is a trade-off between flexibility and performance. Changes to the database schema must be coordinated among all participating service units. This calls for those service units to be closely affiliated organizationally so that their concerns can be balanced. Organizational affiliation can further constrain the autonomy of the service units, and thus their agility and optimization, through issues such as the priorities and funding of changes.
Data is but one resource managed by a service unit. There are other resources such as people, machines and materials that are managed and exchanged by service units, and some resources may be shared. These resources are not exchanged using XML; they are exchanged through mechanisms appropriate to the nature of the resource and the time, cost and distance for transfer. Different protocols may be employed for exchange of data. What is important is that the data exchanged must be consistent with a shared logical data model and in a form compatible with both the sender and receiver.
SOA as a business architecture is similar to traditional business architectures except that the communications are faster and the service units are smaller, providing greater efficiency and agility. Traditionally, service requests (orders) were communicated on paper, and each department had its own files and tracked its own work. SOA is an approach to optimization of the design of the enterprise leveraging new technology.
Data management for SOA should be approached as requiring an enterprise logical data model, mechanisms for federation and sharing of data among relatively autonomous service units, and a data management plan that defines responsibilities, flows, master data stores, latency of updates, synchronization strategies and accountability for data integrity and protection. This plan must align with the organizational responsibilities of service units and their data needs, and it must ultimately support an integrated representation of the state of the enterprise: history, current state and future plans.
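Parts of such a plan can be written down as data and checked mechanically. The sketch below is one possible shape, offered as an assumption rather than a standard: the ElementPolicy fields, the strategy labels and the sample entries are invented. It checks a single rule taken from the discussion above, that each data element has exactly one master source.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ElementPolicy:
    """One line of a hypothetical data management plan."""
    element: str                 # logical data element, e.g. "customer"
    master_unit: str             # service unit accountable for its integrity
    consumers: Tuple[str, ...]   # service units that receive updates
    max_latency_seconds: int     # agreed upper bound on update delay
    sync_strategy: str           # invented labels: "publish_updates", "batch_etl"


def masters_by_element(plan: List[ElementPolicy]) -> Dict[str, Set[str]]:
    """Collect every service unit named as master for each element."""
    masters: Dict[str, Set[str]] = defaultdict(set)
    for policy in plan:
        masters[policy.element].add(policy.master_unit)
    return dict(masters)


def conflicting_elements(plan: List[ElementPolicy]) -> List[str]:
    """Return the elements that have more than one claimed master source."""
    return sorted(
        element
        for element, units in masters_by_element(plan).items()
        if len(units) > 1
    )


# Invented sample plan; the third entry deliberately contradicts the first.
plan = [
    ElementPolicy("customer", "order_entry", ("billing", "shipping"), 60, "publish_updates"),
    ElementPolicy("invoice", "billing", ("order_entry",), 3600, "batch_etl"),
    ElementPolicy("customer", "billing", ("shipping",), 60, "publish_updates"),
]

conflicts = conflicting_elements(plan)
```

Here `conflicts` lists "customer", because both order entry and billing claim to be its master. Similar checks could cover latency limits or consumers with no feed, but those would rest on further assumptions about the plan's contents.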