GPU AI Architecture
Last updated
The GPU AI Portal's architecture is a multi-layered structure in which each layer has a distinct role, and the layers work in tandem to deliver a seamless, secure, and efficient user experience. It is built on modern technologies for scalability, reliability, and robustness.
Frontend Layer
This layer is the visual gateway for users. It comprises the Public website, Customers area, and GPU providers area (Workers). The design is intuitive and user-centric, ensuring easy navigation and interaction.
Mainly used Tech Stack: ReactJS, Tailwind, web3.js, zustand.
Security Layer
A pivotal layer ensuring the system's integrity and safety. It encompasses a Firewall for network protection, an Authentication Service for user validation, and a Logging Service for tracking activities.
Mainly used Tech Stack: Firewall (pfSense, iptables), Authentication (OAuth, JWT), Logging Service (ELK Stack, Graylog).
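As an illustration of the JWT piece of the Authentication Service, the sketch below signs and verifies an HS256 token using only the standard library. The claim names and secret are hypothetical, and a production service would use a vetted library such as PyJWT rather than hand-rolled code.

```python
import base64
import hashlib
import hmac
import json

def _b64url_decode(data: str) -> bytes:
    # JWTs use unpadded base64url; restore padding before decoding.
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))

def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def sign_jwt_hs256(payload: dict, secret: bytes) -> str:
    """Produce a compact HS256 JWT: header.payload.signature."""
    header_b64 = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"},
                                           separators=(",", ":")).encode())
    payload_b64 = _b64url_encode(json.dumps(payload,
                                            separators=(",", ":")).encode())
    signing_input = f"{header_b64}.{payload_b64}".encode()
    sig = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header_b64}.{payload_b64}.{_b64url_encode(sig)}"

def verify_jwt_hs256(token: str, secret: bytes) -> dict:
    """Verify an HS256 JWT's signature and return its payload, or raise ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(payload_b64))
```

A verified payload would then identify the customer or worker making the request; a real deployment would also check registered claims such as expiry.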
API Layer
Serving as the communication bridge, this layer has multiple facets: Public API for the website, Private APIs for Workers/GPU Providers and Customers, and Internal APIs for Cluster Management, Analytics, and Monitoring/Reporting.
Mainly used Tech Stack: FastAPI, Python, GraphQL, RESTful services, gunicorn, solana.
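To illustrate the shape of this layer, here is a minimal, hypothetical route registry in the spirit of FastAPI's decorator style. The paths and handlers are invented for the sketch and are not the portal's actual endpoints.

```python
from typing import Callable, Dict

# Hypothetical route table illustrating the public/private API split.
ROUTES: Dict[str, Callable[[], dict]] = {}

def route(path: str):
    """Register a handler for a path, echoing FastAPI's decorator style."""
    def decorator(fn: Callable[[], dict]) -> Callable[[], dict]:
        ROUTES[path] = fn
        return fn
    return decorator

@route("/public/status")
def public_status() -> dict:
    # Public API: unauthenticated status for the website.
    return {"service": "gpu-ai-portal", "status": "ok"}

@route("/private/workers/heartbeat")
def worker_heartbeat() -> dict:
    # Private API: workers report liveness here (auth omitted in this sketch).
    return {"ack": True}

def dispatch(path: str) -> dict:
    """Look up and invoke the handler for a path."""
    handler = ROUTES.get(path)
    if handler is None:
        return {"error": 404}
    return handler()
```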
Backend Layer
The system's powerhouse. It manages Providers (Workers), Cluster/GPU operations, Customer interactions, Fault Monitoring, Analytics, Billing/Usage Monitoring, and Autoscaling.
Mainly used Tech Stack: FastAPI, Python, Node.js, Flask, solana, IO-SDK (a fork of Ray 2.3.0), Pandas.
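One backend duty listed above is Autoscaling. The function below sketches a plausible scale-out/scale-in decision based on queue depth and GPU utilization; the thresholds and heuristic are illustrative assumptions, not the portal's actual policy.

```python
def autoscale_decision(queued_tasks: int, busy_gpus: int, total_gpus: int,
                       target_utilization: float = 0.8) -> int:
    """Return how many GPUs to add (positive) or release (negative).

    The queue-based heuristic and the 0.8 target are illustrative only.
    """
    if total_gpus == 0:
        return queued_tasks  # bootstrap: one GPU per queued task
    utilization = busy_gpus / total_gpus
    if queued_tasks > 0 and utilization >= target_utilization:
        # Backlog plus high utilization: scale out by the backlog size.
        return queued_tasks
    if queued_tasks == 0 and utilization < target_utilization / 2:
        # No backlog and mostly idle: release the idle GPUs.
        return -(total_gpus - busy_gpus)
    return 0
```

In the real system this decision would feed back into Cluster/GPU operations and Billing/Usage Monitoring, since every scaling action changes what customers are charged for.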
Database Layer
The data repository of the system. It uses Main storage for structured data and Caching for temporary, frequently accessed data.
Mainly used Tech Stack: Postgres (Main storage), Redis (Caching).
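The Postgres-plus-Redis split above typically follows the cache-aside pattern: read from the cache when the entry is fresh, otherwise load from main storage and populate the cache. The sketch below models the pattern with in-memory dicts standing in for Redis and Postgres, so the structure, not the client libraries, is what it demonstrates.

```python
import time
from typing import Any, Callable, Dict, Tuple

# In-memory stand-ins for Redis (cache) and Postgres (main storage).
CACHE: Dict[str, Tuple[float, Any]] = {}          # key -> (stored_at, value)
DATABASE: Dict[str, Any] = {"user:1": {"name": "alice", "plan": "pro"}}

def get_with_cache(key: str, ttl_seconds: float = 60.0,
                   load: Callable[[str], Any] = DATABASE.get) -> Any:
    """Cache-aside read: serve from cache if fresh, else load and populate."""
    entry = CACHE.get(key)
    now = time.monotonic()
    if entry is not None and now - entry[0] < ttl_seconds:
        return entry[1]            # cache hit, still within TTL
    value = load(key)              # cache miss: go to main storage
    CACHE[key] = (now, value)      # populate cache for later readers
    return value
```

With a real Redis client the TTL would be delegated to the store (e.g. an expiring SET), and the dict lookups would become network calls.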
Message Broker/Task Layer
This layer orchestrates asynchronous communications and task management, ensuring smooth data flow and efficient task execution.
Mainly used Tech Stack: RabbitMQ (Message Broker), Celery (Task Management).
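The broker/worker pattern this layer implements can be sketched in miniature with the standard library: a producer enqueues tasks, a worker thread consumes them, and a sentinel shuts the worker down. Here `queue.Queue` stands in for RabbitMQ and the worker function for a Celery task; the task payloads are invented for the example.

```python
import queue
import threading

task_queue: queue.Queue = queue.Queue()
results: list = []

def worker() -> None:
    """Consume tasks until the None sentinel arrives (a Celery worker in miniature)."""
    while True:
        task = task_queue.get()
        if task is None:
            task_queue.task_done()
            break
        # "Execute" the task: here, just double the payload.
        results.append({"task_id": task["id"], "output": task["payload"] * 2})
        task_queue.task_done()

def run_tasks(payloads: list) -> list:
    """Enqueue payloads, run one worker thread to completion, return results."""
    thread = threading.Thread(target=worker)
    thread.start()
    for i, payload in enumerate(payloads):
        task_queue.put({"id": i, "payload": payload})
    task_queue.put(None)  # sentinel: tell the worker to stop
    thread.join()
    return list(results)
```

The real system gains what this sketch omits: durable queues, acknowledgements, retries, and many workers spread across machines.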
Infrastructure Layer
The foundational layer. It houses the GPU Pool with hardware from our verified partners. Orchestration tools manage deployments, while Execution/ML Tasks handle computations and machine learning operations. Additionally, it provides Data Storage solutions. GPU performance is monitored using Nvidia-smi or NVIDIA DCGM.
Mainly used Tech Stack:
GPU/CPU Pool
Orchestration: Kubernetes, Prefect, Apache Airflow
Execution/ML Tasks: Ray, Ludwig, Pytorch, Keras, TensorFlow, Pandas
Data Storage: Amazon S3, Hadoop HDFS
Containerization: Docker
Monitoring: Grafana, Datadog, Prometheus, NVIDIA DCGM
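As a small example of the monitoring side, nvidia-smi can emit machine-readable metrics via `nvidia-smi --query-gpu=name,utilization.gpu,memory.used --format=csv`; the sketch below parses that CSV shape. The sample text is illustrative — in production it would come from running the command on each worker.

```python
import csv
import io

def parse_gpu_metrics(csv_text: str) -> list:
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into row dicts."""
    reader = csv.reader(io.StringIO(csv_text.strip()))
    header = [col.strip() for col in next(reader)]
    return [dict(zip(header, (cell.strip() for cell in row))) for row in reader]

# Illustrative sample mirroring nvidia-smi's CSV output format.
SAMPLE = """\
name, utilization.gpu [%], memory.used [MiB]
NVIDIA A100-SXM4-80GB, 87 %, 40960 MiB
NVIDIA A100-SXM4-80GB, 12 %, 2048 MiB
"""
```

Parsed rows like these are what a collector would ship to Prometheus or Datadog for dashboards and alerting.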