Below are some distributed design patterns that I have used in the past and found very useful.
I have included links to the Azure website, which contains excellent, detailed explanations of these design patterns.
Event Sourcing: The idea behind event sourcing is that you track the state of the world by keeping an initial start state and applying each time-based event as a delta on top of it; replaying the deltas yields the current state. The pattern tames the complexity of managing a time-based view of the world in a consistent manner.
In my early days as a programmer, one of the systems I built was a market data system for credit curves. It applied market quotes to credit curves to build an average market credit curve from the quotes. Users could also correct prior quotes (e.g., change a quote from 2 hrs ago or enter quotes manually). These back-dated changes made knowledge-date management of the curves very tricky (i.e., how the curve looked 1 hr ago vs. how it looks now). I was applying quotes directly to a materialized curve because the survival probability construction of the credit curve was numerically very compute intensive (it used a Newton-Raphson solver). Whenever a user changed a quote from 2 hrs ago, I had to reconstruct every representation of the curve for each quote received after the change.
The system was becoming hard to support because of the edge cases around managing time-based state, so I modified the design to separate the data (quotes) from the curve construction logic. The current curve levels were constructed by aggregating quotes from the start of the day on every tick, and the curve was then built to validate the numerical consistency of each quote as a qualification criterion (quotes that failed to calibrate were dropped from future aggregation).
By separating the data (quotes) from the curve construction logic, the knowledge-date representation of curves became much simpler. For example, to see how the curve looked 1 hr ago, I could take the quotes received up to 1 hr ago, aggregate them, and calibrate the curve using bookmarked referential data from an hour ago.
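As a rough sketch of that design in Python (the quote fields, names, and numbers here are made up for illustration, not taken from the original system): the quotes form an append-only event log, and the state at any knowledge date is obtained by replaying events up to that time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

# Hypothetical quote event; field names are illustrative only.
@dataclass(frozen=True)
class QuoteEvent:
    received_at: datetime   # when the quote arrived (knowledge time)
    tenor: str              # e.g. "5Y"
    spread_bps: float       # quoted credit spread in basis points

class QuoteStore:
    """Append-only event log: the source of truth is the quotes, not the curve."""

    def __init__(self) -> None:
        self._events: List[QuoteEvent] = []

    def append(self, event: QuoteEvent) -> None:
        self._events.append(event)

    def replay(self, as_of: datetime) -> Dict[str, float]:
        """Rebuild the per-tenor average quote as it was known at `as_of`
        by replaying every event received up to that time."""
        sums: Dict[str, float] = {}
        counts: Dict[str, int] = {}
        for e in self._events:
            if e.received_at <= as_of:
                sums[e.tenor] = sums.get(e.tenor, 0.0) + e.spread_bps
                counts[e.tenor] = counts.get(e.tenor, 0) + 1
        return {t: sums[t] / counts[t] for t in sums}

if __name__ == "__main__":
    store = QuoteStore()
    store.append(QuoteEvent(datetime(2023, 1, 2, 9, 0), "5Y", 120.0))
    store.append(QuoteEvent(datetime(2023, 1, 2, 10, 0), "5Y", 124.0))
    # The state "one hour ago" is just a replay with an earlier cut-off.
    print(store.replay(datetime(2023, 1, 2, 9, 30)))   # {'5Y': 120.0}
    print(store.replay(datetime(2023, 1, 2, 11, 0)))   # {'5Y': 122.0}
```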
Queue-based load leveling: This is a simple yet very practical design pattern: work is distributed across workers via a queue, and any worker that is free picks work off the queue. I have built implementations of this pattern where workers are assigned a maximum number of work units they can execute in parallel and pick work off the queue in a round-robin fashion. The problem we run into with this design is that work is typically not uniformly distributed; some work units have high resource usage while others have low. An approach that has worked for me is to build a health monitor inside the worker that checks the worker's health based on available memory and the compute currently in use; when the load on a worker exceeds a certain threshold, it disconnects itself from the queue so that other workers can pick up the work, and it reconnects once it is healthy again.
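Here is a minimal sketch of that health-check idea, assuming an in-process queue and threads; the `healthy()` function is a stand-in (a real worker would inspect process memory or in-flight compute), and an unhealthy worker simply stops pulling from the queue until it recovers.

```python
import queue
import random
import threading
import time

work_queue: "queue.Queue[int]" = queue.Queue()

def healthy() -> bool:
    """Stand-in health check. A real implementation might look at process
    memory or the number of in-flight tasks; here it is randomized."""
    return random.random() > 0.3   # ~70% of checks report healthy

def worker(name: str) -> None:
    while True:
        # While unhealthy, the worker stops pulling from the queue so that
        # other, healthier workers pick up the work; it reconnects when healthy.
        if not healthy():
            time.sleep(0.05)
            continue
        try:
            item = work_queue.get(timeout=0.5)
        except queue.Empty:
            return                 # queue drained, worker exits
        print(f"{name} processed item {item}")
        work_queue.task_done()

if __name__ == "__main__":
    for i in range(20):
        work_queue.put(i)
    threads = [threading.Thread(target=worker, args=(f"worker-{n}",)) for n in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```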
Priority Queue design pattern: The pattern suggests using a priority queue where higher-priority messages are processed before lower-priority messages, or a design that separates the high-priority queue from the low-priority queue while dedicating more resources to the high-priority queue. The approach is similar to job scheduling in operating systems, which gives high-priority jobs more resources while ensuring that low-priority jobs don't wait forever. Dividing work across separate high- and low-priority queues enables dynamic elasticity, where compute resources can be assigned based on the lengths of the high- and low-priority queues. See the Facebook engineering blog on their distributed priority queue here: https://engineering.fb.com/2021/02/22/production-engineering/foqs-scaling-a-distributed-priority-queue/
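As a minimal in-process illustration, the sketch below uses Python's `queue.PriorityQueue`; in a distributed setup these would be separate high- and low-priority queues on a message broker with more consumers attached to the high-priority one. The task names are made up.

```python
import queue

# Lower number = higher priority; items are (priority, task-name) tuples.
pq: "queue.PriorityQueue[tuple[int, str]]" = queue.PriorityQueue()

pq.put((1, "nightly-report"))   # low priority
pq.put((0, "cancel-order"))     # high priority
pq.put((0, "price-update"))     # high priority

# Messages come off the queue in priority order, not arrival order.
while not pq.empty():
    priority, task = pq.get()
    print(f"running priority={priority} task={task}")
```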
Competing consumers pattern: The competing consumers pattern simply states that workers compete, based on availability, to execute the work. This can be achieved by starting a pool of workers that subscribe to the work queue; whichever worker is free attempts to pick up the work. The pool of workers can be sized dynamically based on queue depth. The pattern does impose the requirement that work units are independent and can run in parallel; this is a desirable property to support in an application by designing it to be stateless.
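A minimal sketch of competing consumers using a thread pool and a shared in-memory queue; in production the queue would be an external broker, and the pool size could be driven by queue depth. The pool size and message counts here are illustrative.

```python
import queue
import threading
import time

work_queue: "queue.Queue[int]" = queue.Queue()

def consumer(name: str) -> None:
    # Each consumer competes for the next available message; whichever worker
    # is free takes it. Work items must be independent for this to be safe.
    while True:
        try:
            item = work_queue.get(timeout=0.2)
        except queue.Empty:
            return                 # nothing left to compete for
        time.sleep(0.01)           # simulate stateless, independent work
        print(f"{name} handled message {item}")
        work_queue.task_done()

if __name__ == "__main__":
    for i in range(12):
        work_queue.put(i)
    # Pool size could be derived from queue depth; fixed at 4 for simplicity.
    pool = [threading.Thread(target=consumer, args=(f"consumer-{n}",)) for n in range(4)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
```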
Pipes and filters: The pattern states that a large unit of work should be split into multiple independent sub-tasks (filters) that can be executed in parallel across nodes. Often work can be designed so that it breaks down into groups of independent tasks that can be executed asynchronously and combined later. This pattern enables exactly that and thus helps optimize execution by using hardware resources efficiently.
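A small sketch of the idea using Python generators as filters; the stages (`parse`, `drop_outliers`, `scale`) are made-up examples, and in a distributed deployment each filter could run on its own node connected by a message channel rather than an in-process generator.

```python
from typing import Iterable, Iterator

# Each filter is a small, independent stage that consumes an input stream and
# produces an output stream; the stages are composed into a pipeline ("pipes").
def parse(lines: Iterable[str]) -> Iterator[float]:
    for line in lines:
        yield float(line.strip())

def drop_outliers(values: Iterable[float], limit: float) -> Iterator[float]:
    for v in values:
        if abs(v) <= limit:
            yield v

def scale(values: Iterable[float], factor: float) -> Iterator[float]:
    for v in values:
        yield v * factor

if __name__ == "__main__":
    raw = ["1.0", "250.0", "3.5", "-2.0"]
    pipeline = scale(drop_outliers(parse(raw), limit=100.0), factor=10.0)
    print(list(pipeline))   # [10.0, 35.0, -20.0]
```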
Claim check: This pattern suggests that large messages such as videos and images should not be transmitted on a message bus; instead, they should be written to a persistent store and a link to the stored object placed on the message queue. This is an extremely important pattern: while message buses such as IBM MQ technically support large messages up to several MB, the performance of MQ degrades substantially under a high load of such large messages. Even writing these messages in their raw form is expensive; what I have done in the past is serialize the large object with a messaging format such as Avro and compress the binary object before writing it to disk.
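A minimal sketch of the claim-check flow, using in-memory dictionaries as stand-ins for the blob store and the message bus, and JSON plus gzip instead of Avro to keep it dependency-free.

```python
import gzip
import json
import uuid
from typing import Dict, List

# Stand-ins: real systems would use a blob store (S3, Azure Blob Storage, a shared
# file system) and a message broker (IBM MQ, Kafka) instead of these dictionaries.
blob_store: Dict[str, bytes] = {}
message_bus: List[dict] = []

def publish_large_payload(payload: dict) -> None:
    # Serialize and compress the large object, persist it, and put only a
    # small "claim check" (the key) on the message bus.
    body = gzip.compress(json.dumps(payload).encode("utf-8"))
    key = str(uuid.uuid4())
    blob_store[key] = body
    message_bus.append({"claim_check": key, "size_bytes": len(body)})

def consume_next() -> dict:
    envelope = message_bus.pop(0)
    body = gzip.decompress(blob_store[envelope["claim_check"]])
    return json.loads(body)

if __name__ == "__main__":
    publish_large_payload({"frames": list(range(10_000))})   # stand-in for a big object
    print(len(consume_next()["frames"]))                     # 10000
```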
Transient faults: This pattern acknowledges that cloud infrastructure is prone to transient errors due to varying load conditions and infrastructure issues stemming from commodity hardware. It suggests that processes should be designed with faults in mind and handle transient faults with interval-based retries.
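A minimal retry sketch with exponential backoff and jitter; `flaky_call` and `TransientError` are hypothetical stand-ins for a remote call and the retryable errors it can raise.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable fault (timeout, throttling, dropped connection)."""

def flaky_call() -> str:
    # Hypothetical remote call that fails transiently most of the time.
    if random.random() < 0.7:
        raise TransientError("temporary failure")
    return "ok"

def call_with_retries(max_attempts: int = 5, base_delay: float = 0.1) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            return flaky_call()
        except TransientError:
            if attempt == max_attempts:
                raise               # give up after the final attempt
            # Exponential backoff with jitter between attempts.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)
    raise RuntimeError("unreachable")

if __name__ == "__main__":
    print(call_with_retries())
```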
Compute resource consolidation: This pattern suggests that tasks which are computationally expensive to run in separate computation units should be grouped together and run as a single consolidated task in one computation unit. The idea is that running tasks in separate computation units increases the time spent on IO and coordination, which slows execution and wastes resources. This consideration plays a critical role in performance. In many financial services there is a source service that provides financial data and offers some form of computation on the data. The design typically brings the data from the service into the client application and performs the computation inside the client process via a library provided by the service. The reason is that many repeated compute calls are typically needed, and they incur a large IO penalty if they are made out of process to an external service. A common example is a service that maintains holidays for various countries and provides date functions that return the business days between two dates based on a combination of holiday calendars. Such a holiday service could provide the data over REST but execute the computation locally in the client process.
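A small sketch of the holiday-service example: the holiday data is fetched once (here from a hard-coded stand-in for a REST call) and the business-day computation then runs locally in the client process, avoiding a round trip per date query. The calendar names and dates are illustrative.

```python
from datetime import date, timedelta
from typing import Iterable, Set

def fetch_holidays(calendars: Iterable[str]) -> Set[date]:
    """Stand-in for a one-time REST call; returns the merged holiday set."""
    static_data = {
        "US": {date(2024, 7, 4), date(2024, 12, 25)},
        "UK": {date(2024, 12, 25), date(2024, 12, 26)},
    }
    merged: Set[date] = set()
    for cal in calendars:
        merged |= static_data.get(cal, set())
    return merged

def business_days_between(start: date, end: date, holidays: Set[date]) -> int:
    """Local computation in the client process: count weekdays that are not holidays."""
    days = 0
    current = start
    while current < end:
        if current.weekday() < 5 and current not in holidays:
            days += 1
        current += timedelta(days=1)
    return days

if __name__ == "__main__":
    holidays = fetch_holidays(["US", "UK"])          # one IO call, then reused
    print(business_days_between(date(2024, 12, 23), date(2024, 12, 31), holidays))  # 4
```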
Compensating transaction: When distributed tasks run independently, it is possible for the system to get into a state that needs to be rolled back to reach a consistent state if any of the tasks fail. This is typically complicated to do correctly, and the need for a compensating transaction should be avoided in the design if possible. However, in distributed applications a compensating transaction is sometimes necessary. In those cases, the application can write a transaction log that enables it to roll back to a consistent state in case of a failure. This is similar to what distributed databases do when a transaction needs to be rolled back.
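A minimal sketch of the idea: each completed step registers an undo action in a log, and on failure the log is replayed in reverse. The step names (`reserve_funds`, `book_trade`, and their undo actions) are hypothetical.

```python
from typing import Callable, List, Tuple

def run_with_compensation(steps: List[Tuple[Callable[[], None], Callable[[], None]]]) -> None:
    """Run (action, compensate) pairs; on failure, undo completed steps in reverse order."""
    compensation_log: List[Callable[[], None]] = []
    try:
        for action, compensate in steps:
            action()
            compensation_log.append(compensate)
    except Exception:
        for undo in reversed(compensation_log):
            undo()
        raise

if __name__ == "__main__":
    def reserve_funds() -> None: print("funds reserved")
    def release_funds() -> None: print("funds released")
    def book_trade() -> None: raise RuntimeError("booking service unavailable")
    def cancel_trade() -> None: print("trade cancelled")

    try:
        run_with_compensation([(reserve_funds, release_funds), (book_trade, cancel_trade)])
    except RuntimeError as err:
        print(f"rolled back after failure: {err}")
```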
External configuration store: This pattern simply states that configuration should be kept separate from the deployed code. This is a critical design approach that makes it easy to change configuration without having to modify the code.
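A tiny sketch of reading configuration from outside the code: defaults are overridden by an external file and environment variables. The variable names (`APP_CONFIG_FILE`, `APP_MAX_WORKERS`) are made up for illustration.

```python
import json
import os

# Defaults live in code only as a fallback; real values come from outside the deployment.
DEFAULTS = {"queue_name": "work", "max_workers": 4}

def load_config() -> dict:
    config = dict(DEFAULTS)
    path = os.environ.get("APP_CONFIG_FILE")     # hypothetical external config file
    if path and os.path.exists(path):
        with open(path) as fh:
            config.update(json.load(fh))
    if "APP_MAX_WORKERS" in os.environ:          # per-setting override
        config["max_workers"] = int(os.environ["APP_MAX_WORKERS"])
    return config

if __name__ == "__main__":
    print(load_config())
```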
Anti-patterns
Below are some patterns which should be avoided.
Chatty IO: Suggests avoiding multiple repeated IO requests. Ways to solve it include making a single batched, aggregated request or using an in-process cache.
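A small sketch contrasting a chatty per-item fetch with a single batched request; the price data and the round-trip counter are stand-ins for a remote service.

```python
from typing import Dict, Iterable, List

# Stand-in "remote" data source; each function call represents one round trip.
PRICES = {"AAPL": 190.0, "MSFT": 410.0, "GOOG": 140.0}
round_trips = 0

def fetch_price(symbol: str) -> float:
    global round_trips
    round_trips += 1                     # one round trip per symbol (chatty)
    return PRICES[symbol]

def fetch_prices(symbols: Iterable[str]) -> Dict[str, float]:
    global round_trips
    round_trips += 1                     # one aggregated round trip (batched)
    return {s: PRICES[s] for s in symbols}

if __name__ == "__main__":
    symbols: List[str] = ["AAPL", "MSFT", "GOOG"]
    chatty = {s: fetch_price(s) for s in symbols}     # 3 round trips
    batched = fetch_prices(symbols)                   # 1 round trip
    print(chatty == batched, "total round trips:", round_trips)   # True, 4
```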
Busy Front End: The pattern suggests designing a minimalist front end where resource-intensive tasks are done in the back end. This reduces complexity in the application and helps enable high performance, since more compute resources can be assigned to the tasks in the back end.
No Caching: Proper cache design is critically important for good performance. Fetching the same data repeatedly from an external service leads to slow performance. On the other hand, caching everything in application memory may lead to memory pressure and frequent GC stalls. A cache should therefore ideally keep frequently used data in an in-process cache, backed by an external in-memory database, and finally by a physical disk-based database.
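A minimal read-through sketch of such a tiered cache, with dictionaries standing in for the in-process cache, the external in-memory store (e.g., Redis), and the disk-based database; key names and values are illustrative.

```python
from typing import Dict, Optional

process_cache: Dict[str, str] = {}                          # fastest tier: local memory
external_cache: Dict[str, str] = {}                         # shared tier: external in-memory store
database: Dict[str, str] = {"curve:USD": "serialized-curve-data"}  # slowest tier: disk-based DB

def get(key: str) -> Optional[str]:
    if key in process_cache:
        return process_cache[key]
    if key in external_cache:
        value = external_cache[key]
        process_cache[key] = value            # promote to the local tier
        return value
    value = database.get(key)
    if value is not None:
        external_cache[key] = value           # populate both cache tiers on a miss
        process_cache[key] = value
    return value

if __name__ == "__main__":
    print(get("curve:USD"))    # populated from the database, then cached
    print(get("curve:USD"))    # now served from the in-process cache
```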
Reference list
Cloud design best practices: Here is a link to the best practices for distributed system design from Azure.
Cloud design patterns: Here is a list of distributed design patterns from Azure.
Cloud architecture styles: Please see the various cloud architecture styles here.