Centralized Event Consumer
How can multiple consumers subscribe to the same event and effectively process the event, without overloading the underlying infrastructure, if similar data enrichment requirements exist?
Event subscribers often need to enrich the event data for further processing down the line. When many subscribers to the same event have similar data enrichment requirements, the resulting concurrent calls to the same services, or to services exposing data from the same back end, can have cascading effects. These can affect the autonomy of the event handlers, of services deployed in the same containers, and of other services accessing the same back end. In the worst case, this effect can cripple parts of a service inventory.
Introduce a cascading event consumer architecture, whereby each time an event subscriber enriches the event data, the enriched data is published as a new event. Any consumers that have similar event data enrichment requirements can then subscribe to the republished event with enriched data, instead of all consumers subscribing to the same event, each of them solving the data enrichment requirements on their own.
Each event consumer that enriches event data must republish the enriched event data.
The key is to apply enriched event republishing up front, so that new initiatives find it attractive to consume the republished and enriched events.
Additional considerations apply when consumers have requirements that no events may be lost. In these situations asynchronous queuing and reliable messaging can be applied jointly with this pattern.
The application of this pattern creates an event architecture which can dramatically increase the autonomy of the service inventory and of applicable back ends by decreasing the concurrent access on back end systems. Instead of accessing the back end multiple times for every event, now the back end access happens only once or a few times for every event.
Discoverability may decrease as one event can appear in many forms.
The order of events may be affected, as no guarantees are given that republished events still have the same order. If an earlier event in a sequence takes longer to process than a later one, the republished events may arrive out of order. Special considerations apply if the order of events must be maintained. These will, however, affect scalability, as ordered processing prevents concurrent processing.
Architecture: Service Composition Architecture, Event Architecture
In a large-scale enterprise environment where many services and back ends exist in a complex infrastructure, event-driven messaging is often used to save resources, because it is lightweight and fast.
When event driven messaging is used, it is very attractive to add more subscribers to the same event to create a pluggable environment where new functionality is plugged into the infrastructure, by creating new event subscribers that each deal with a specific functional problem.
Often in these scenarios, event subscribers created later have similar event data enrichment requirements. As a result, a single event causes multiple consumers to access the same back-end environment or resource.
As soon as multiple consumers compete for the same resource, autonomy problems arise, either for the services accessing the resource or for the resource itself. If the resource itself gets into trouble, even consumers external to the service inventory and legacy users may be affected.
The resulting issue is that even for one single event, many concurrent services can start competing for the same resource. As a result the resource is overloaded and this can cripple parts of the service inventory.
Figure 1 – In a situation with many subscribers to the same event, these subscribers often have similar data enrichment requirements. As such, each event subscriber will effectively run the same queries on the underlying data stores. As a consequence, the load on the underlying infrastructure will increase tremendously as more and more subscribers query for the same information. This can effectively cripple (parts of) a service inventory.
To prevent many concurrent event subscribers from concurrently accessing the same resource, dedicated consumers are introduced concerned with the data enrichment of the received events.
Once an event is received and data is enriched, these dedicated consumers must republish the enriched data structure, effectively becoming a new event source.
Any new consumers that need similar event data must subscribe to the republished and enriched event instead of the original event. This way, because only one consumer (or a limited number of consumers) needs access to the same underlying resource, the number of concurrent access attempts to the resource is reduced to a minimum.
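The difference in back-end load can be illustrated with a minimal in-memory sketch. Everything in it is an assumption made for illustration: the EventBus and CustomerBackend classes, the topic names call.completed and call.completed.enriched, and the synchronous delivery model are hypothetical and do not come from any particular messaging product.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Handler = Callable[[Event], None]


class EventBus:
    """Minimal synchronous in-memory publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        for handler in list(self._subscribers[topic]):
            handler(event)


class CustomerBackend:
    """Stand-in for the shared back end; counts how often it is accessed."""

    def __init__(self) -> None:
        self.calls = 0

    def lookup(self, customer_id: str) -> Dict[str, str]:
        self.calls += 1
        return {"customer_id": customer_id, "segment": "prepaid"}


def run_without_pattern(consumer_count: int) -> int:
    """Every consumer subscribes to the original event and enriches it itself."""
    bus, backend = EventBus(), CustomerBackend()
    for _ in range(consumer_count):
        bus.subscribe("call.completed", lambda e: backend.lookup(e["customer_id"]))
    bus.publish("call.completed", {"customer_id": "c-1"})
    return backend.calls


def run_with_pattern(consumer_count: int) -> int:
    """One dedicated consumer enriches the event and republishes it."""
    bus, backend = EventBus(), CustomerBackend()
    received: List[Event] = []

    def enrich_and_republish(event: Event) -> None:
        enriched = {**event, "customer": backend.lookup(event["customer_id"])}
        bus.publish("call.completed.enriched", enriched)

    bus.subscribe("call.completed", enrich_and_republish)
    for _ in range(consumer_count):
        bus.subscribe("call.completed.enriched", received.append)
    bus.publish("call.completed", {"customer_id": "c-1"})
    assert len(received) == consumer_count  # every consumer still gets the data
    return backend.calls
```

In this sketch, five direct subscribers cause five back-end lookups for one event, whereas five subscribers to the republished event cause a single lookup.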
By applying this pattern a layered event architecture is created which can effectively increase autonomy of the system and services, because the amount of concurrent access to the underlying infrastructure (services, messages, network, storage resources etc) is reduced.
Figure 2 – Event handling is delegated to dedicated consumers which each have different data enrichment requirements.
Dedicated event consumers are introduced, each with a specific data enrichment requirement. Once these have enriched the event data, new events are sent with the enriched data. If multiple data sets must be added to an event, then a series of events/consumers can be cascaded to provide the ultimate receiver with appropriate data. This effectively creates a layered event architecture.
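A cascade of two enrichment layers might look like the following sketch. The add_enricher helper, the topic naming scheme with "+" suffixes, and the customer and tariff values are all hypothetical choices made for this example, not part of the pattern itself.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]

# topic name -> handlers subscribed to that topic
subscribers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)


def publish(topic: str, event: Event) -> None:
    for handler in list(subscribers[topic]):
        handler(event)


def add_enricher(source: str, target: str, field: str,
                 fetch: Callable[[Event], Any]) -> None:
    """Subscribe to `source`, add one field, and republish on `target`."""
    def handler(event: Event) -> None:
        publish(target, {**event, field: fetch(event)})
    subscribers[source].append(handler)


# Layer 1 adds customer data; layer 2 adds tariff data on top of layer 1.
add_enricher("order.created", "order.created+customer", "customer",
             lambda e: {"id": e["customer_id"], "segment": "prepaid"})
add_enricher("order.created+customer", "order.created+customer+tariff", "tariff",
             lambda e: "basic" if e["customer"]["segment"] == "prepaid" else "premium")

# The ultimate receiver subscribes only to the fully enriched event.
final_events: List[Event] = []
subscribers["order.created+customer+tariff"].append(final_events.append)

publish("order.created", {"order_id": 7, "customer_id": "c-1"})
```

A consumer that only needs customer data would subscribe to the intermediate topic, while the second enricher reuses the first layer's result instead of looking up the customer again.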
Figure 3 – Instead of all consumers subscribing to the same event (1), service consumers with similar data enrichment requirements use that event indirectly. Every consumer with specific data enrichment functionality is delegated this specific task and has a requirement to republish the enriched event (1a, 1b, 1ac). Any consumer which needs similar data must use the republished event instead of the initial event.
This approach can only be applied if all of the event consumers have high availability as well as high autonomy. In addition, all consumers must have similar non-functional requirements, i.e. event delivery assurances (these do not exist unless explicitly designed for), to overcome potential availability issues of event consumers.
Furthermore, it is important to understand that the order of received events is not guaranteed, and the actual order will be affected more and more as additional consumers are used to handle the delegated event data enrichment. At some point, data enrichment performance can differ dramatically between requests for the same event type, as autonomy can be too low for consistently stable processing performance.
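Where a consumer does need the original order, one possible approach is a resequencer that buffers republished events until the gaps before them are filled. The sketch below assumes that the original publisher stamps every event with a consecutive sequence number; that assumption, and the Resequencer class itself, are illustrative and not prescribed by the pattern.

```python
from typing import Any, Dict, List


class Resequencer:
    """Releases events strictly in sequence-number order."""

    def __init__(self, first_seq: int = 1) -> None:
        self._next = first_seq
        self._pending: Dict[int, Any] = {}

    def accept(self, seq: int, event: Any) -> List[Any]:
        """Buffer the event and return every event that can now be released."""
        self._pending[seq] = event
        released: List[Any] = []
        while self._next in self._pending:
            released.append(self._pending.pop(self._next))
            self._next += 1
        return released


# Enriched events arrive out of order because event 1 took longer to enrich.
resequencer = Resequencer()
delivered: List[str] = []
for seq in (2, 3, 1, 4):
    delivered.extend(resequencer.accept(seq, f"event-{seq}"))
```

Events 2 and 3 are held back until event 1 arrives, which shows the trade-off: order is restored, but at the cost of added latency and reduced concurrency.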
To overcome reliability issues, events can be placed in an intermediate queue or similar message store (apply the asynchronous queuing pattern) to make sure that no events get lost, even if an event subscriber is temporarily unavailable. A similar result can be achieved by using a standard like WS-ReliableExchange.
As more and more events are layered on top of each other, the discoverability of appropriate event sources decreases, because more and more different events are being published with similar data. This affects the interpretability of any discovered events, which can significantly impact overall discoverability. It becomes harder to find out whether a given data requirement is already met by another event publisher.
Appointing a centralized event consumer is in fact a form of logic centralization in an event infrastructure/architecture.
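The queuing idea can be sketched as a buffer placed in front of a single subscriber. The QueuedSubscription class and its available flag are hypothetical, and the queue here lives in memory only; a real message store would persist events so that they also survive a restart.

```python
from collections import deque
from typing import Any, Callable, Deque, List


class QueuedSubscription:
    """Buffers events for one subscriber so none are lost while it is unavailable."""

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._handler = handler
        self._queue: Deque[Any] = deque()
        self.available = True

    def deliver(self, event: Any) -> None:
        """Called by the publisher; the event is always queued first."""
        self._queue.append(event)
        self.drain()

    def drain(self) -> None:
        """Hand over queued events in arrival order while the subscriber is up."""
        while self.available and self._queue:
            self._handler(self._queue[0])  # if this raises, the event stays queued
            self._queue.popleft()          # remove only after successful handling


processed: List[str] = []
subscription = QueuedSubscription(processed.append)

subscription.deliver("e1")
subscription.available = False   # subscriber goes down
subscription.deliver("e2")
subscription.deliver("e3")
processed_while_down = list(processed)

subscription.available = True    # subscriber recovers
subscription.drain()
```

While the subscriber is down, only the first event has been processed; after recovery, the two buffered events are handed over in their original arrival order.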
Figure 4 – Overall Service Autonomy is increased as less underlying resource access is necessary, effectively preventing the underlying resources from becoming a bottleneck. Discoverability is impacted as more and more similarly looking events are being generated. Reliability of the solution can be improved by introducing asynchronous queuing to overcome any subscriber availability issues. The application of this pattern is a form of logic centralization.
Case Study Example
TelCo is a telecommunications company which started 15 years ago in a very competitive market. At the time, TelCo was the second operator to start mobile communications in that country. The company thrived initially because it had chosen a strategic SOA approach with what was, for that time, a state-of-the-art application of an event-driven architecture.
The big benefit of this approach was that new functionality could be plugged in easily and non-intrusively, as events could be used by many new subscribers without affecting the code and functionality of existing subscribers.
As a basis for their efforts they had built a proprietary event management system with lots of specialized functionality. TelCo focused on mobile communications and, because this market was very competitive, the lead time for implementing and releasing changes was crucial to TelCo. Whoever had the best proposition would have the biggest customer growth, and every operator in the country would try to have an even better proposition by offering new "key" features in quick succession. TelCo found itself in a race to show more and more innovations quickly, to follow market developments and to try to stay in the lead.
After a short time (within the first two years of existence) there were so many new advancements, with releases sometimes more than twice per week, that the system began to slow down. More and more events would bring the system into a state where handling a single event would occupy the system significantly, and sometimes some of the underlying resources would be significantly overloaded.
As a result of an internal analysis, the system architects concluded the entire system was overloaded and there was a problem in the event processing. As such the architects decided to make three significant changes:
- split the system into partitions to spread the load
- enhance the capacity of certain underlying resources and back-ends
- replace their bespoke event management system with a COTS product of a commercial vendor, who had indicated that their system would be able to process more than 50x the amount of events TelCo presently has with ease
After a couple of months their system was partitioned and their own bespoke event management system was replaced by the vendor product. The situation overall had improved, but not significantly. Upgrading the back-end systems had helped a bit as well, but in the end the architects could foresee that similar problems would occur less than 8 months down the road if no drastic changes were made.
The system architects had asked the designers and developers to build in extra logging which slowed down the system even more but it allowed the architects to dig deeper into the problematic system areas. What the system architects found is that a single event would cause an avalanche of back-end calls to the same critical back-ends supporting the organization.
As the company was still growing they realized that the next problematic system load would be reached a lot sooner than initially estimated, perhaps even within the next 4-6 months. Either radical changes would be made immediately or TelCo would be forced to slow down on their marketing and sales, something the company could not afford.
The system architects made the following fundamental change: event consumers were split into three logical categories:
- (1) event consumers that require a specific event sequence
- (2a) event consumers that can live with out-of-order execution of received events and need guaranteed event delivery
- (2b) event consumers that can live with out-of-order execution of received events and do not need guaranteed event delivery (i.e. events can be skipped and nothing serious would break)
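The classification above can be expressed as a small decision function. The sketch assumes that the labels 1, 2a and 2b used later in the case study map to "needs ordering", "out-of-order with guaranteed delivery" and "out-of-order without guaranteed delivery" respectively, and that a consumer requiring ordering falls into category 1 regardless of its delivery needs. The consumer names are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerProfile:
    name: str
    needs_ordering: bool
    needs_guaranteed_delivery: bool


def classify(profile: ConsumerProfile) -> str:
    """Map a consumer to category 1, 2a or 2b (assumed label meanings)."""
    if profile.needs_ordering:
        return "1"
    return "2a" if profile.needs_guaranteed_delivery else "2b"


# Hypothetical consumers, one per category.
consumers = [
    ConsumerProfile("billing", needs_ordering=True, needs_guaranteed_delivery=True),
    ConsumerProfile("fraud-check", needs_ordering=False, needs_guaranteed_delivery=True),
    ConsumerProfile("usage-dashboard", needs_ordering=False, needs_guaranteed_delivery=False),
]
counts = Counter(classify(c) for c in consumers)
```

Counting consumers per category in this way gives the kind of breakdown the architects used to decide which part of the infrastructure needed ordered processing.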
An analysis showed that, fortunately, fewer than 15% of the event consumers fell into the first category. This meant that the majority of event consumers would not need the special infrastructure required for maintaining the order of events. The remaining consumers (85%) were segmented as follows: 3% did not need delivery assurances, while the other 82% did.
Figure 5 – Classification of events after the analysis by architects.
This was actually good news to the architects as this meant that the really problematic areas (15% of the events) could be isolated in a relatively small area.
The first category (1) was split in the infrastructure and a new segment of infrastructure was introduced for the remaining event processing. Because of this split, in case a problematic event load would ever occur, the effects of this would remain somewhat isolated from the rest of the infrastructure.
Because only 15% of the events required significant extra hardware, a lot of the hardware purchased recently could be used to build the new segment of infrastructure.
Of the remaining 85%, only a very small subset would not need guaranteed event delivery and the amount of event processing in that area (2b) was so small that it made no sense to further split up the infrastructure for 2a/2b.
The second significant change the architects made is that they analyzed all the existing and planned event consumers to see whether they could find similar processing requirements in multiple event consumers. Once the analysis was complete, they found out that in fact many of the event consumers of the same event would use the same services and resources to enrich the data.
It was decided that for the events classified as 2a/2b it made sense to appoint dedicated event consumers which would be solely responsible for enriching the event data. This meant a thorough redesign of the event architecture, but the benefit would definitely outweigh the cost.
Due to the increased amount of reuse, the build part of the redesigned event architecture was delivered 3 weeks ahead of schedule. Because the number of software assets had decreased, the testing effort was also drastically reduced, and the test results came in 1 week early.
During load testing it was observed that, despite all the structural changes, a few events would still cause problematic load. Fortunately the architects found that a selective caching strategy would help overcome most of the remaining problem areas. Three event consumers were given the ability to cache retrieved event data in a central cache for 20 minutes after it was retrieved. Because many events revolve around the same subscribers and accounts within a relatively short period of time, this approach solved most of the problems.
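A time-limited cache of this kind could be sketched as follows. The TtlCache class is hypothetical and is not TelCo's bespoke caching framework, which the text does not describe; only the 20-minute lifetime is taken from the case study. The clock is injectable so that expiry can be demonstrated without waiting.

```python
import time
from typing import Any, Callable, Dict, Hashable, List, Tuple


class TtlCache:
    """Caches loaded values for a fixed time after they were retrieved."""

    def __init__(self, ttl_seconds: float = 20 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, load: Callable[[], Any]) -> Any:
        """Return a fresh cached value, or call `load` and cache its result."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = load()
        self._entries[key] = (now, value)
        return value


# Demonstration with a fake clock (seconds) and a counted back-end lookup.
fake_now = [0.0]
backend_calls: List[str] = []


def load_account() -> Dict[str, str]:
    backend_calls.append("lookup")
    return {"account": "a-1"}


cache = TtlCache(clock=lambda: fake_now[0])
cache.get_or_load("a-1", load_account)   # miss: goes to the back end
cache.get_or_load("a-1", load_account)   # hit: served from the cache
calls_within_ttl = len(backend_calls)

fake_now[0] = 20 * 60                    # 20 minutes later the entry has expired
cache.get_or_load("a-1", load_account)   # miss again
calls_after_expiry = len(backend_calls)
```

Repeated events about the same account within the 20-minute window trigger only one back-end lookup, which matches the observation that many events revolve around the same subscribers and accounts in a short time.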
Because the enterprise service bus TelCo had purchased shortly after start-up supports message delivery assurances, no further development was required for guaranteed message delivery and message in-order delivery.
The caching would reuse the bespoke caching framework built by TelCo several years ago, but because of the benefits they had of the delivery assurance framework that came out-of-the-box, the architects decided that the caching framework would be re-evaluated in the near future.
TelCo architects were happy with the new approach and changed the reference architecture to accommodate the new decisions, so that subsequent projects would be aware of the new event handling infrastructure.
Additionally, in the early phases of the software lifecycle, criteria were defined and introduced into the governance documentation to require classifying events and event consumers up-front into one of the defined categories.