Egeria: Scalability Analysis

Egeria is an open-source project that provides open metadata governance, allowing easy sharing of metadata across systems. In our previous 3 essays we have covered Egeria’s vision, architecture, and the code quality. We have seen examples of how ING and IBM use Egeria to share metadata across different services using Egeria as a middleman 1. We have also seen that Egeria invests in being open-source, adaptable, extendable, and lightweight to run for both reaching a high adoption rate and being a flexible platform2. While reaching these goals Egeria has managed to keep its technical debt manageable3. In this essay, we investigate how well Egeria can be scaled, including an analysis of the performance and the effects of their implemented design choices.

We start by discussing the scalability of Egeria, what has been done by the developers to make it scalable, and what the potential bottlenecks are. We will then inspect common worries for scalability applied to Egeria. Finally, we will end with an empirical performance analysis on Egeria of the execution time of common operations under different loads.

Scalability in Egeria

Egeria is used as a governance tool for metadata collections stored on one or more servers4. Likewise, Egeria is designed to be a distributed system as seen in the figure below. For such a distributed system, the common bottlenecks regarding scalability are different than for a centralized system 5. The performance of the CPU, memory or storage is of lesser concern since you can add extra resources more easily when needed. In a distributed setting the bottlenecks that occur are more often related to the design of the infrastructure and how the different components communicate with each other. To this end, we will further discuss these elements in the next section.

Figure: Figure 1: Egeria distributed system overview. (http://egeria-project.org/guides/planning/guide/#platforms-and-servers)

Egeria Scalable Design

In this section, we will discuss the design choices of the Egeria development team to support scalability.

One of these solutions is the possibility of running Egeria functionality across multiple servers. As can be seen in the figure below, a functionality of Egeria (OMAG Server) can either run solo (A), with other functionalities (B) or across multiple physical locations (C) by using Kubernetes 67.

Figure: Figure 2: Egeria functionalies distribution (https://egeria-project.org/concepts/omag-server-platform/)

Another design choice that can benefit scalability is caching8. Egeria allows for the usage of cashing but does not enforce it. The local access service can choose to cache the resulting data from a query to a non-local database. The local Egeria API decides what to cache depending on past and current traffic. This mechanism allows Egeria to increase request speed for frequent data and reduce load. Caching works well for read-only data as it is stored locally, while data writes require the edited data to be sent over the network. As Egeria is built on top of existing databases and is less commonly used to modify the data, this is well suited. Performance could degrade in larger systems where many writes happen.

Egeria Scalability Challenges

In this section, we discuss some of the common challenges in scalability and how this is reflected in Egeria. We also present some of the existing scalability issues in Egeria.

We have reviewed the common challenges regarding scalability defined by ACM9 and Google Cloud10. We have applied these challenges to Egeria where applicable, analysing whether Egeria can manage these challenges.

First of all, a scalable system needs to be fault-tolerant as with the increase of users and connected components, the chance of failure increases. Egeria is a distributed system. If one of the servers in a cohort fails the remaining systems will keep functioning as expected, increasing tolerance against failures.

Second, it is important to have a redundancy in your data to ensure data can be recovered in case of failure. Egeria does not (automatically) back-up data from third party databases. Servers joining the network can quickly be seeded with read-only copies of the available data which allows them to then service requests. If a database crashes its content might still be available to read from those cases. However, no one can edit the data that originated on that database as none of the other cohort members owns the data.

Another valuable mechanic for scalable systems is the possibility to enable and disable functionality through feature flags. This can be useful if components might fail or maintenance is needed on one of the components. Egeria is a distributed system with many possible configurations. The distributed nature allows for components to work independently and the configurations can be used to modify the used components and the features they use.

For a scalable system, it is important to minimize the amount of human intervention needed to keep the system running. Egeria does this by being self-configuring and highly automatable. They provide pre-built components which ease the integration. Moreover, multiple diagnostic tools monitor the system.

To that end, a scalable system needs to have proper logging of the events occurring in the system. Egeria also has an extensive and extensible logging tool called the audit log framework audit log framework.. This framework provides the ability to route audit messages to multiple destinations where they can be stored or processed automatically. The second option is extremely important for continuous operations. The audit log is extensible to allow the routing of messages to different destinations for reviewing and processing.

Not all challenges are related to the software functionality, as one challenge might be the communication within the Egeria development team. To make sure that the design decisions made are respected and developers are not working on conflicting solutions it is important to have good communication within the team. The core development team has strong internal and external communications to the rest of the open-source community and their userbase. They hold weekly development and community meetings open for everyone to join. The development meetings cover the current plans and progress regarding the code in detail, while the community meetings are there for questions and discussions about the system in general. Think about how things work, and why they work the way they do. They also make use of Slack and GitHub for other communications.

An important aspect of keeping a project maintainable and understanding decisions made in the past it’s important to have good documentation. Egeria’s documentation is detailed and educational. They not only provide code documentation, but also multiple examples scenarios, test setups, a learning environment, and other resources to make understanding and using the system as smooth as possible. One issue with Egerias documentation is that it is spread between two platforms, namely GitHub and their website. Both seem to contain roughly the same information with some links (and maybe thus also other information) being outdated on GitHub. These two sources are not linked and might in the futures cause problems as it’s difficult to keep both sources up to date and accurate.

Lastly, a project needs to keep the future in mind, keeping quality high and avoiding technical debt. Egeria provides reporting on the code quality over the past two years. As can be seen from the figure below, The technical debt has been kept relatively low and constant which is a sign of a scalable system, as can be seen in the figure below.

Figure: Figure 3: Technical debt over time Egeria (https://sonarcloud.io/summary/overall?id=odpi_egeria)

Overall we can see that the architecture was built with scalability in mind, in addition, it tries to make the setup process and the maintenance as easy as possible for organizations. Therefore, even for smaller organizations, Egeria is not a system that requires a huge amount of expertise to maintain. Nonetheless, some issues could limit scalability.

One of the issues that were discussed by the main developer during a developer meeting we attended is the fact that there could potentially be some problems with concurrent writes, while the developers did not expect this issue to occur frequently, mainly due to the low number of expected writes, it is a potentially critical point of failure that gets more problematic as the number of servers in a cohort increases.

Performance Analysis

This section explains how communication within Egeria is implemented. Additionally, we explore the performance of one example connector and compare its performance to another connector. These performance measurements are gathered using the performance workbench, which is an Egeria tool.

For the performance of Egeria, it is important to see how the different components communicate. In Egeria, this is defined by its connectors, and Egeria distinguishes between three types of connectors11. Some connectors support the exchange and maintenance of metadata. Then there are connectors for Egeria’s runtime operations, like event bus connectors, registry stores, log destinations and more. Lastly, some connectors provide access to digital resources. The first two connector types are the most important for Egeria’s scalability as they define how Egeria communicates with connected third-party systems and how Egeria’s internal components cooperate. These components can form the bottleneck of the distributed system for scalability.

Whether Egeria can scale well with third party technologies is very dependent on the type of connector used. For example, when a third party technology is added, the system designers can choose between an integration connector that is either inbound and/or outbound; synchronous, polling or event-driven12.

The runtime connectors of Egeria provide a communication point for the services offered in a specific environment 13. Most of these current connectors relate to either persistent storage or the connection to distributed services. Examples of these services are security, network consistency, audit logs and data topics.

The benefit of using these connectors for both internal and external communication between components is that the connector can be fitted to specific projects.

While a performance analysis on a large scale setup focussing on the communication between Egeria instances would be interesting it is not feasible for us to set up such a configuration. Instead, a performance benchmark from Egeria is used.

For our analysis of the XTDB connector performance, we made use of the open metadata conformance test suite. This test suite will perform common operations on the connector using different dataset sizes and measure the performance. This is then used to analyse the scalability of the connector. For the XTDB connector the results of running the test suite are as follows:

Figure: Figure 4: Performance analysis of XTDB connector

From these results, we can see four instances where performance severely degrades when having a bigger dataset. These are the search methods for entities and relationship and their histories. This performance degradation has also been identified by the Egeria team who have mentioned that looking into optimizations for larger datasets should be the next focus for performance improvements of this connector14.

Apart from identifying which groups of functions have the largest performance decrease with larger datasets this test suite can also be used to compare different connectors with each other. This has been used by the Egeria team to compare the performance of XTDB against JanusGraph, which are both used as graph databases. With these tests, they noticed that XTDB still outperforms JanusGraph in almost every single method even at 8 times the volume of metadata15. The difference in connector performance shows that careful consideration should be made when designing these connectors, scaled to the project needs.

Conclusion

While some components of Egeria do not scale well under increasing loads, the identified architectural challenges for the project to be scalable seem well covered. Despite the presence of some issues, for example, concurrent writes, overall Egeria appears to be a well scalable project.

References


  1. https://desosa2022.netlify.app/projects/egeria/posts/egeria-product-vision-essay-1/ ↩︎

  2. https://desosa2022.netlify.app/projects/egeria/posts/egeria_essay_2_architecture/ ↩︎

  3. https://desosa2022.netlify.app/projects/egeria/posts/egeria-quality-and-evolution/ ↩︎

  4. http://egeria-project.org/guides/planning/guide/#platforms-and-servers ↩︎

  5. https://www.oreilly.com/library/view/effective-multi-tenant-distributed/9781492042839/ch07.html ↩︎

  6. https://egeria-project.org/concepts/omag-server-platform/ ↩︎

  7. https://egeria-project.org/guides/operations/kubernetes/k8s/ ↩︎

  8. https://youtu.be/n-Xm8_WIyBM?t=1168 ↩︎

  9. https://queue.acm.org/detail.cfm?id=2512489 ↩︎

  10. https://cloud.google.com/architecture/scalable-and-resilient-apps ↩︎

  11. http://egeria-project.org/connectors/ ↩︎

  12. http://egeria-project.org/connectors/#integration-connectors ↩︎

  13. http://egeria-project.org/connectors/#runtime-connectors ↩︎

  14. https://egeria-project.org/connectors/repository/xtdb/performance/ ↩︎

  15. https://egeria-project.org/connectors/repository/xtdb/performance/#xtdb-vs-janusgraph ↩︎