Software Requirement Specification (SRS) Document

Elixier Management Tool Development Project


1. Introduction

1.1 Overview

This document covers functional requirements, non-functional requirements, system architecture, and design constraints for the Elixier Management Tool app.

1.2 Purpose

The purpose of this document is to provide a detailed Software Requirements Specification (SRS) for the development of the Elixier Management Tool application. This application will provide users with a GUI to manage the Elixier data analytics platform.

1.3 Scope

The Elixier Management Tool will be a web-based solution that helps the platform provider manage multiple tenants, helps tenants manage their data platform across multiple Kubernetes clusters, and helps tenant admins manage the users of the platform.


2. System Overview

Elixier is an open-source data analytics platform. It consists of a collection of services (such as Airflow, Superset, and OpenSearch) running within a Kubernetes cluster and configured to work with each other, so that tenants can store their data, perform ETL on it, and analyze it.

Currently, services within this data platform are configured and deployed using Helm charts. The configuration files must be edited manually, and the helm command must be run from a terminal to deploy to the Kubernetes cluster.

The Elixier Management Tool app will provide a GUI so that the management tasks described above can be performed through a web-based interface.

2.1 System Goals

The goal is to have a web-based graphical management tool that simplifies the following:

  • Server monitoring
  • Automation of the installation, management, upgrade, and uninstallation of server-side software
  • Automation and centralization of software configuration management
  • Automation of configuration management for software with many interconnected components
  • A self-service portal for customers to deploy software in their own environment
  • Extension with new service deployment templates for any new software that needs to be deployed

2.2 Success Criteria

The system will be considered successful when the following criteria are met:

  • Multiple tenants are able to use the software to manage their own infrastructure.
  • Users are able to deploy, configure, and manage services easily.
  • Developers are able to create a new service template plugin and make it available to users with minimal friction, and users are able to use it.
  • The system can be used by both on-premises and cloud clients.

3. Functional Requirements

3.1 General Requirements

  • The system shall use an asynchronous I/O pattern.
  • The system backend shall be written in Python, using the FastAPI framework as a base.
  • The system frontend shall be written using React.
  • The system shall use OpenID Connect (OIDC) and Security Assertion Markup Language (SAML) for the Single Sign-On (SSO) feature, so that users can access services on the data platform.
  • The system shall integrate with Kubernetes via the API server to deploy services.
  • The system shall integrate with Authentik to handle user identity management.
  • The system shall integrate with RabbitMQ for asynchronous message brokering between components.
  • The system shall be installable on-premises, to manage hosts and clusters behind a secured network.
  • The system backend shall be designed with a preference for a monolithic approach, to minimize complexity.
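
As a rough illustration of the asynchronous I/O pattern named above, the sketch below polls several hosts concurrently with Python's `asyncio`. The `poll_host` and `poll_all` coroutines and the host names are hypothetical, and `asyncio.sleep` stands in for a real network call.

```python
import asyncio


async def poll_host(name: str, delay: float) -> dict:
    """Hypothetical stand-in for a non-blocking call to a managed host."""
    await asyncio.sleep(delay)  # simulates network latency without blocking
    return {"host": name, "status": "ok"}


async def poll_all(hosts: dict[str, float]) -> list[dict]:
    """Poll every host concurrently; results keep the input order."""
    tasks = [poll_host(name, delay) for name, delay in hosts.items()]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    results = asyncio.run(poll_all({"host-a": 0.02, "host-b": 0.01}))
    print(results)
```

Because the calls overlap, the total wait is roughly that of the slowest host rather than the sum of all of them.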

3.1 Authentication Management

  • The system shall have user, group, and role management.
  • The system shall have 3 groups of users: global admin, tenant admin, and tenant user. A global admin must be able to manage all tenants, while a tenant admin is only able to manage the tenant they belong to.
  • The system shall support multiple tenants, and each tenant may add one or more clusters.
  • The system shall support API key authentication to the backend.
  • The system shall provide audit logs of key actions performed against hosts, clusters, and tenants.
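
The scoping rule for the three user groups can be sketched as a small check. This is a minimal illustration, not the intended data model: the `Role`, `User`, and `can_manage_tenant` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    GLOBAL_ADMIN = "global_admin"
    TENANT_ADMIN = "tenant_admin"
    TENANT_USER = "tenant_user"


@dataclass(frozen=True)
class User:
    name: str
    role: Role
    tenant_id: Optional[str] = None  # assumed to be None for global admins


def can_manage_tenant(user: User, tenant_id: str) -> bool:
    """Global admins manage every tenant; tenant admins only their own."""
    if user.role is Role.GLOBAL_ADMIN:
        return True
    if user.role is Role.TENANT_ADMIN:
        return user.tenant_id == tenant_id
    return False  # tenant users have no tenant management rights
```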

3.2 Authorization Management

  • The system shall authorize access to cluster components through the management tool.
  • The system shall use OIDC claims or SAML attributes to identify user role membership.
  • The system shall allow tenants to use their own SAML/OIDC provider.
  • The system shall log authorization requests to an audit log.
  • The system shall provide a list of the datasets in the cluster components through the management tool.
  • The system shall allow tagging of datasets through the management tool.
  • The system shall allow management of tags through the management tool.
  • The system shall manage authorization rules for datasets through the management tool, using both resource-based ACLs and tag-based ACLs.
  • The system shall provide a list of the permissions that can be granted on cluster components through the management tool.
  • The system shall allow assignment of permission grants through the management tool.
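
One way to combine resource-based and tag-based ACLs is to treat each grant as targeting either a dataset name or a tag. The sketch below assumes allow-only rules with no explicit deny; the `Dataset`, `Grant`, and `is_allowed` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    tags: frozenset = frozenset()


@dataclass(frozen=True)
class Grant:
    role: str
    permission: str     # e.g. "read" or "write"
    resource: str = ""  # resource-based rule: exact dataset name
    tag: str = ""       # tag-based rule: any dataset carrying this tag


def is_allowed(roles: set, permission: str, dataset: Dataset, grants: list) -> bool:
    """Allow access if any grant for one of the roles matches by name or by tag."""
    for grant in grants:
        if grant.role not in roles or grant.permission != permission:
            continue
        if grant.resource and grant.resource == dataset.name:
            return True
        if grant.tag and grant.tag in dataset.tags:
            return True
    return False  # default deny
```

A tag-based grant keeps working when new datasets receive the same tag, whereas a resource-based grant has to be added per dataset.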

3.3 Tenant Management

  • The system shall allow global admins to create, update, and delete tenants.
  • The system shall allow tenant admins to manage tenant-specific settings, including branding, default permissions, and notification preferences.
  • The system shall provide a mechanism for tenant onboarding, including guided setup wizards for new tenants.
  • The system shall enable tenant admins to define custom quotas for resources such as CPU, memory, and storage within their clusters.
  • The system shall maintain isolation between tenants to ensure no data leakage or unauthorized access occurs across tenants.
  • The system shall allow tenant admins to view and manage all clusters and hosts associated with their tenant.
  • The system shall enable tenant admins to create and manage user accounts specific to their tenant, including assigning roles and permissions.
  • The system shall provide tenant-specific dashboards that display key metrics such as resource utilization, service uptime, and user activity.
  • The system shall allow tenant admins to customize alerts and notifications for their specific tenant environment.
  • The system shall enable tenants to associate billing or usage information for their resources, integrating with external billing systems if necessary.
  • The system shall support tenant-level backup and restore features for critical resources such as configurations, datasets, and services.
  • The system shall allow tenants to view detailed audit logs of actions performed by their users, including timestamps and IP addresses.
  • The system shall provide mechanisms to export and import tenant configurations for migration or backup purposes.
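
The custom resource quotas in the list above could be enforced with a check along these lines before a deployment is accepted. This is a sketch under assumed units (millicores, MiB, GiB); the `Resources` and `fits_quota` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_millicores: int
    memory_mib: int
    storage_gib: int


def fits_quota(quota: Resources, used: Resources, requested: Resources) -> bool:
    """Return True only if the request stays within every quota dimension."""
    return (
        used.cpu_millicores + requested.cpu_millicores <= quota.cpu_millicores
        and used.memory_mib + requested.memory_mib <= quota.memory_mib
        and used.storage_gib + requested.storage_gib <= quota.storage_gib
    )
```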

3.3 Host Management

  • The system shall help tenants add hosts to be managed and monitored through the management tool. Hosts are owned by tenants.
  • The system shall monitor and collect resource consumption statistics of hosts and visualize them in the management tool, including CPU, RAM, network, process list, systemd logs, service logs, and error logs.
  • The system shall be able to manage and modify host service configuration and deploy the configuration to the hosts.
  • The system shall provide versioning of configuration for roll-back.
  • The system shall be able to roll back service configuration to a version selected either by version name or by date.
  • The system shall manage hosts through agents that connect to the management tool over a secured HTTPS connection.
  • The system shall provide a dashboard to monitor the hosts being managed.
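
Roll-back by version name or by date can be sketched with a small in-memory history. The interpretation of "by date" as "the latest version saved at or before that moment" is an assumption, as are the `ConfigVersion` and `ConfigHistory` names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ConfigVersion:
    name: str
    created_at: datetime
    content: str


class ConfigHistory:
    """Append-only list of configuration versions for one service."""

    def __init__(self) -> None:
        self._versions: list[ConfigVersion] = []

    def save(self, name: str, content: str, created_at: datetime) -> ConfigVersion:
        version = ConfigVersion(name, created_at, content)
        self._versions.append(version)
        return version

    def by_name(self, name: str) -> Optional[ConfigVersion]:
        """Most recently saved version with the given name, if any."""
        for version in reversed(self._versions):
            if version.name == name:
                return version
        return None

    def as_of(self, moment: datetime) -> Optional[ConfigVersion]:
        """Latest version created at or before the given moment, if any."""
        candidates = [v for v in self._versions if v.created_at <= moment]
        return max(candidates, key=lambda v: v.created_at, default=None)
```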

3.4 Kubernetes Cluster Management

  • The system shall be able to help tenants deploy a Kubernetes cluster to hosts that it manages.
  • The system shall use k0s to deploy new Kubernetes clusters.
  • The system shall work with new or existing CNCF-certified Kubernetes clusters, added by providing a cluster config file.
  • The system shall be able to monitor Kubernetes clusters on at least the following metrics: host resource consumption, host resource request/limit consumption, and the number of containers, pods, deployments, StatefulSets, PVCs, and PVs.
  • The system shall be able to deploy Kubernetes components by uploading a Kubernetes YAML file.
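
Request/limit consumption is usually computed from Kubernetes quantity strings such as `500m` (CPU) or `128Mi` (memory). The sketch below parses a common subset of those suffixes and derives a percentage; it does not cover every form Kubernetes accepts (for example exponent notation), and the `parse_quantity` and `utilization_percent` names are hypothetical.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes: binary (Ki..Ti) and decimal (m, k..T).
_SUFFIXES = {
    "Ki": Decimal(2**10), "Mi": Decimal(2**20),
    "Gi": Decimal(2**30), "Ti": Decimal(2**40),
    "m": Decimal("0.001"), "k": Decimal(10**3),
    "M": Decimal(10**6), "G": Decimal(10**9), "T": Decimal(10**12),
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity string to base units (cores or bytes)."""
    # Check two-character suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def utilization_percent(used: str, limit: str) -> float:
    """Share of the limit that is currently consumed, in percent."""
    return float(parse_quantity(used) / parse_quantity(limit) * 100)
```

For example, `utilization_percent("250m", "1")` evaluates to `25.0`, since 250 millicores is a quarter of one core.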

3.5 Service Management

  • The system shall provide a list of the services that can be deployed on a Kubernetes cluster.
  • The system shall allow tenant users to deploy services to the clusters. A service may be deployed as a single instance or as multiple instances on a cluster.
  • The system shall provide a dashboard for each service that lists the following: shortcut URLs to the key user interfaces of the deployed service, key service logs, key audit logs, the components that are up or down, and the resource consumption of the service.
  • The system shall provide alerts when a service encounters issues such as going down, running short of resources, or responding slowly.
  • The system shall provide a mechanism to update service configuration and deploy the configuration update. Configuration may be a ConfigMap, a Secret, or a config file stored in a PV. After a configuration update, it must be possible to restart the service either automatically or after user confirmation.
  • The system shall provide versioning of configuration for roll-back.
  • The system shall be able to roll back service configuration to a version selected either by version name or by date.
  • The system shall have a plugin mechanism for registering new service templates. A new service template shall provide at least the following hooks: install, uninstall, restart, metric collection, configuration update, get service URLs, upload file to service, download file from service, and list files in service. A new service template shall be implemented as a Python class, loaded dynamically from a folder.
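
A plugin mechanism of this kind could be built on `abc` and `importlib`. The sketch below is one possible shape, not a fixed design: the `ServiceTemplate` base class and `load_templates` function are hypothetical, only three of the listed hooks are shown, and injecting the base class into each plugin module's namespace is an assumption made so that plugin files need no import path to the host application.

```python
import importlib.util
import inspect
from abc import ABC, abstractmethod
from pathlib import Path


class ServiceTemplate(ABC):
    """Hypothetical base class; only a subset of the required hooks is shown."""

    name: str = "unnamed"

    @abstractmethod
    def install(self, config: dict) -> None: ...

    @abstractmethod
    def uninstall(self) -> None: ...

    @abstractmethod
    def get_service_urls(self) -> list[str]: ...


def load_templates(folder: Path, base: type = ServiceTemplate) -> dict[str, type]:
    """Import every *.py file in the folder and collect concrete subclasses."""
    found: dict[str, type] = {}
    for path in sorted(folder.glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        # Assumption: expose the base class as a global inside the plugin file.
        setattr(module, base.__name__, base)
        spec.loader.exec_module(module)
        for _, cls in inspect.getmembers(module, inspect.isclass):
            # Skip the abstract base itself and any incomplete templates.
            if issubclass(cls, base) and not inspect.isabstract(cls):
                found[cls.name] = cls
    return found
```

A template that omits one of the abstract hooks stays abstract and is skipped by the loader, which gives an early signal that the plugin is incomplete.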

3.6 Secret Management

  • The system shall provide a mechanism to manage secrets and store them encrypted in the database.
  • The system shall provide a mechanism to generate secrets using a random generator.
  • The system shall support secret versioning, allowing rollback to a previous version of a secret.
  • The system shall encrypt all secrets in transit and at rest using industry-standard encryption protocols.
  • The system shall provide a role-based access mechanism to manage and retrieve secrets securely.
  • The system shall integrate with Kubernetes Secrets to synchronize secrets into clusters.
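
Secret generation and versioning can be sketched with Python's standard `secrets` module. Encryption at rest is deliberately left out here, so values are held in plain memory for illustration only; the `generate_secret` and `SecretHistory` names are hypothetical.

```python
import secrets
import string


def generate_secret(length: int = 32) -> str:
    """Generate a random alphanumeric secret from a cryptographic RNG."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


class SecretHistory:
    """Keeps every version of one secret; the last entry is the current one."""

    def __init__(self, initial: str) -> None:
        self._versions = [initial]

    @property
    def current(self) -> str:
        return self._versions[-1]

    def rotate(self, new_value: str) -> int:
        """Store a new value and return its 1-based version number."""
        self._versions.append(new_value)
        return len(self._versions)

    def rollback(self, version: int) -> str:
        """Re-publish an earlier version as the newest one, keeping history."""
        if not 1 <= version <= len(self._versions):
            raise ValueError(f"unknown version: {version}")
        self._versions.append(self._versions[version - 1])
        return self.current
```

Rolling back by appending, rather than truncating, keeps the full history available for audit.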

3.6 Object Storage Bucket Management

  • The system shall provide a mechanism to manage files in S3 buckets. An S3 bucket may reside in AWS or MinIO.
  • The system shall provide a mechanism to manage files in Azure Blob Storage.
  • The system shall provide a mechanism to manage access keys to buckets.
  • The system shall support tagging and categorization of files within buckets for better organization.
  • The system shall allow uploading, downloading, and deleting files through the management tool's GUI.
  • The system shall display file metadata such as size, type, upload date, and owner.

3.6 PostgreSQL Service Management

  • The system shall provide a mechanism to deploy a PostgreSQL database cluster, as a service template plugin.
  • The system shall provide a dashboard to monitor PostgreSQL databases, including performance metrics like query times, CPU usage, and memory consumption.
  • The system shall allow tenant users to create and delete PostgreSQL instances within their clusters.
  • The system shall provide automated backups for PostgreSQL databases with user-configurable backup intervals.
  • The system shall allow restoration of databases from backups, either fully or partially.
  • The system shall provide a mechanism for managing PostgreSQL configuration files and applying updates dynamically.

3.6 Trino Service Management

  • The system shall provide a mechanism to deploy a Trino query engine cluster, including in a High Availability setup, as a service template plugin.
  • The system shall provide a GUI to manage Trino configurations, including catalog and connector settings.
  • The system shall allow monitoring of query execution, including status, duration, and resource usage.
  • The system shall support user and role-based access to Trino clusters, integrating with the system's identity management solution.
  • The system shall provide audit logs of executed queries and configuration changes for compliance.
  • The system shall provide a data catalog GUI that lists the datasets in Trino.
  • The system shall provide a GUI for managing access control in Trino through tag-based policies.

3.7 Superset Service Management

  • The system shall provide a mechanism to deploy an Apache Superset cluster, including in a High Availability setup, as a service template plugin.
  • The system shall allow users to create and manage Superset dashboards and visualizations through the GUI.
  • The system shall monitor Superset performance, including response times and resource usage.
  • The system shall enable integration of Superset with various data sources, including PostgreSQL and Trino.
  • The system shall provide backup and restoration mechanisms for Superset configurations and metadata.
  • The system shall support scaling of Superset worker nodes dynamically based on workload.

3.8 Airflow Service Management

  • The system shall provide a mechanism to deploy an Apache Airflow cluster, including in a High Availability setup, as a service template plugin.
  • The system shall allow tenant users to create and manage Directed Acyclic Graphs (DAGs) for workflows.
  • The system shall provide real-time monitoring of DAG execution, including status, logs, and resource usage.
  • The system shall support scaling of Airflow worker nodes dynamically based on workload.
  • The system shall integrate with the system's secret management module to securely pass sensitive data to Airflow workflows.
  • The system shall provide mechanisms for managing Airflow configuration and environment variables through the GUI.

4. Non-Functional Requirements

4.1 Usability

  • The system shall provide a user-friendly, intuitive interface that is easy to navigate.

4.2 Security

  • The system shall implement strong encryption for data in transit (SSL/TLS).
  • The system shall implement role-based access control (RBAC) to ensure data security and privacy.

4.3 Reliability

  • The system shall have appropriate failover mechanisms in place to ensure high uptime, preferably through Raft consensus.
  • The system shall be able to handle network outages and resume data synchronization once the network is restored.
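
Resuming synchronization after a network outage is often handled with retries and exponential backoff. The sketch below assumes the outage surfaces as a `ConnectionError`; the `sync_with_retry` function, its parameters, and the `sync_once` callable are hypothetical.

```python
import asyncio


async def sync_with_retry(sync_once, max_attempts: int = 5,
                          base_delay: float = 0.5, max_delay: float = 30.0):
    """Call the sync coroutine until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await sync_once()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Double the wait after each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(delay)
```

With the defaults, the waits between attempts are 0.5, 1, 2, and 4 seconds.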

4.4 Performance

  • The system shall load the dashboard in under 5 seconds for standard users.
  • Data queries should return results within 3 seconds for queries with a dataset of up to 1 GB.

4.5 Compatibility

  • The system shall be compatible with the latest versions of Chrome, Firefox, and Edge browsers.
  • The system shall be built using responsive UI and work across devices (desktops, tablets, and smartphones).

6. System Architecture

6.1 Overview

The system will be built using an architecture with the following components:

  • Frontend: React-based and Tailwind-based frameworks, for example Next.js and Flowbite.
  • Backend: FastAPI for REST API endpoints, WebSocket, and service template plugins.
  • Database: PostgreSQL for structured data storage.
  • Authentication: Authentik with OpenID Connect (OIDC) for secure login and Single Sign-On (SSO).
  • Message Broker: RabbitMQ for asynchronous message brokering and queuing.

The backend shall prefer a monolithic architecture to minimize complexity.

6.2 Process Flow

The following is the high-level process flow of the system:

  1. The global admin adds a new tenant and adds a user as admin of that tenant.
  2. The tenant admin configures the tenant account with organizational metadata.
  3. For a tenant with an existing Kubernetes cluster, the tenant admin adds the cluster by providing its kubeconfig file.
  4. For a tenant that does not have a Kubernetes cluster, the tenant admin sets up a new one by adding hosts and then creating a cluster on those hosts.
  5. The tenant admin selects the services to be deployed on the cluster and configures their credentials and resource quotas.
  6. The tenant admin deploys the selected services on the cluster.
  7. The tenant admin adds users to the tenant.
  8. A tenant user logs in and accesses the installed services on a cluster to perform data processing and analytics activities.
  9. The tenant admin monitors the health of clusters and services, and troubleshoots any problems that arise.
  10. The tenant admin performs regular maintenance and security updates on the services, clusters, and host operating systems.

7. Design Constraints

  • The system shall be web-based and designed to be responsive for both desktop and mobile views.
  • The system shall support future integration of additional Kubernetes service deployments with minimal effort.
  • The system shall adhere to modern web security standards, including secure data transmission and secure API design.

8. Appendices

8.1 Glossary

  • Elixier: An open-source data analytics platform.
  • Service: A component of the data platform, such as Airflow, Superset, OpenSearch, or Git.
  • Platform Provider: A business entity that provides hosting service to install, configure and host this management tool, and optionally install, configure, or host the Kubernetes cluster for the data platform.
  • Tenant: A company or organization that uses this management tool to manage their data platform.
  • Dashboard: A user interface that provides an overview of data platform tenants, clusters, services and users.

8.2 Definitions, Acronyms, and Abbreviations

  • API: Application Programming Interface
  • UI: User Interface
  • GUI: Graphical User Interface
  • UX: User Experience
  • SLA: Service Level Agreement
  • JSON: JavaScript Object Notation
  • DB: Database
  • ETL: Extract, Transform, Load
  • SaaS: Software as a Service

8.3 References

  • OpenAPI Specification v3