ArtemiS3

Problem and Requirements

Problem

In recent years, USGS and NASA have been shifting their vast planetary datasets into the cloud. Currently they use Amazon Web Services (AWS) Simple Storage Service (S3) to host these datasets.

The current workflow used by USGS has main three steps. First they collect the mission data from NASA, next they store it on the cloud using AWS S3 Buckets, and lastly when the data is needed it is accessed via a simple S3 file browser.

This last step comes with a few problems, stemming from both the vast size of the dataset they are working with, which is in the range of multiple petabytes, and the current tool that they are using, which is a simple file explore like you might have on your PC. In order to find specific pieces of data with this system, they must manually root through thousands of folders and related files, which can massively slow down the workflow.

There is no full-text or metadata search option for the files.
Users must manually sift through large repositories.
Limited file visualization, they cannot view file previews, summaries, or the files themselves from the browser.
Security risks could be going undetected without the proper audit tools.

Requirements

Requirements Acquisition

In order to formulate and finalize our list of project requirements, we followed a four step process to ensure that our client was happy with our project. First we held our initial meeting with our client to discuss the problem the team was facing, view a demonstration of their workflow, and discuss some initial ideas for how to solve the problem. Next, we used what we learned from this meeting to draft our initial requirements document where we outlined the broad and high priority requirements. We then iterated on this document following our client's feedback while introducing more detailed and non-functional requirements to the draft. We then completed our finalized requirements document and presented it to our client for approval.

Domain-Level Requirements

Efficient Data Discovery: Fast, accurate search over millions of S3 files via metadata and content indexing.
Intuitive User Experience: Clear, organized UI suitable for novice and expert users.
Collaboration & Tagging: Save searches, tag files, and share insights.
Scalable & Reliable Architecture: Large-scale indexing, backups, and recovery.
Security & Privacy: AWS-aligned read-only access and optional policy audits.
Open-Source & Reusable: Containerized, deployable system for public S3 data managers.
Optional Visualization: Map-based visualization for spatial metadata.

Functional Requirements

Must Have (MVP requirements)

File Retrieval: Backend processes search queries and returns results efficiently.
Metadata/Text Indexing: Automated extraction of S3 metadata and text into Meilisearch.
Search Page: Main UI with search bar, filters, and results view.
Interactive Results: Real-time metadata previews and filtering (Svelte + Tailwind).
Filtering & Sorting: Advanced search refinement via UI + backend API.
Index Refresh: Update search index when new S3 files are added.

Should Have

Directory-Style Browsing: File/folder explorer for search results or full bucket.
File Downloads: Download files directly from the UI.
Search Snapshot Caching: Automatic Meilisearch backups.
User Tagging: Tag and annotate files for organization and collaboration.

Could Have

User Accounts: Personal settings and separate indexes.
Search History: Recent searches for quick retrieval.
Geo-spatial Visualization: In-browser map previews of geo-spatial files.
Security Analysis: Bucket policy and ACL auditing.
Hot Path Caching: Speed enhancement for popular queries.

Performance Requirements

Pending UI: Show loading indicators during long operations.
Response Time: Search results under 3 seconds for 10,000 objects.
Indexing Speed: ≥1,000 objects/min during indexing cycles.
Uptime: 99% availability under normal conditions.
Security Compliance: HTTPS-only traffic, AWS read-only IAM policies.
Scalability: Performance maintained under high user load.
Usability: Users must be able to complete a search within 2 minutes of entry.

Environmental Requirements

AWS Hosting: Deployed on AWS EC2 via Docker Compose.
Open Source: Fully open-source and publicly available on GitHub.
Tech Stack: Python + FastAPI, Svelte + Tailwind, PostgreSQL, Meilisearch.
CI/CD: Automated GitHub Actions pipeline.
Monitoring: Logs and metrics integrated with AWS CloudWatch.
Browser Accessibility: Works across all major modern browsers.