Problem
The current workflow used by USGS has three main steps. First, they collect mission data from NASA; next, they store it in the cloud using Amazon Web Services (AWS) S3 buckets; and lastly, when the data is needed, it is accessed through a simple S3 file browser.
This last step comes with a few problems, most stemming from the vast size of the dataset (multiple petabytes). To find specific pieces of data, users must comb through thousands of folders and files, which significantly slows down the workflow.
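For context, S3 has no true folder hierarchy; a simple S3 file browser typically emulates one by listing keys against a delimiter. The following is a minimal Boto3 sketch of that access pattern, where the bucket and prefix names are hypothetical placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Emulate one level of a folder view: keys sharing the prefix up to the
# next "/" are grouped into CommonPrefixes; the rest appear as "files".
# "usgs-mission-data" and "missions/" are placeholder names.
response = s3.list_objects_v2(
    Bucket="usgs-mission-data",
    Prefix="missions/",
    Delimiter="/",
)

for folder in response.get("CommonPrefixes", []):
    print("folder:", folder["Prefix"])
for obj in response.get("Contents", []):
    print("file:", obj["Key"], obj["Size"], "bytes")
```

Each such call returns at most 1,000 keys, so browsing a multi-petabyte archive this way means issuing long chains of sequential requests.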
Some specific limitations that they have mentioned are:
- There is no full-text or metadata search option for the files.
- Users must manually sift through large repositories (see the sketch after this list).
- Limited file visualization: the only indication of a file's contents is its name and data type.
- Security risks could go undetected without proper audit tools.
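Without any index, the closest the current setup comes to search is a client-side scan over every key. This sketch shows that naive approach and why it scales poorly; the bucket name is again a placeholder:

```python
import boto3

s3 = boto3.client("s3")

def find_keys(bucket: str, needle: str):
    """Naive 'search': page through every key and substring-match its name.

    This is O(total objects) per query, which is exactly why manual
    sifting is slow at petabyte scale.
    """
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if needle.lower() in obj["Key"].lower():
                yield obj["Key"]

for key in find_keys("usgs-mission-data", "landsat"):
    print(key)
```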
Requirements
As the project progresses, we plan to hold regular meetings with our client to validate and refine the requirements. Our high-level goals include full-text and metadata search and the ability to tag and filter files; further goals include advanced file visualization and security auditing.
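As a rough illustration of the first two goals, the sketch below indexes a few hypothetical file records in Meilisearch and runs a full-text query narrowed by a tag filter. The index name, document fields, server address, and key are assumptions made for the example (using the official Python client), not agreed requirements:

```python
import meilisearch

# Assumes a local Meilisearch instance; the URL and key are placeholders.
client = meilisearch.Client("http://localhost:7700", "masterKey")
index = client.index("files")

# Tags must be declared filterable before they can appear in a filter.
settings_task = index.update_filterable_attributes(["tags"])
index.wait_for_task(settings_task.task_uid)

# Hypothetical records: one document per S3 object, with user-supplied tags.
docs_task = index.add_documents([
    {"id": "1", "key": "missions/landsat9/scene_001.tif", "tags": ["raster", "level-1"]},
    {"id": "2", "key": "missions/landsat9/readme.txt", "tags": ["docs"]},
])

# Indexing is asynchronous; wait for the task to finish before querying.
index.wait_for_task(docs_task.task_uid)

# Full-text search across fields, narrowed to documents tagged "raster".
results = index.search("scene", {"filter": "tags = raster"})
print([hit["key"] for hit in results["hits"]])
```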
We will begin with a technical investigation in which we will explore technologies such as Boto3, Meilisearch, and React. We will also survey existing S3 search tools to learn their strengths and limitations, and assess how each approach scales to a multi-petabyte dataset.
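To make the investigation concrete, one possible shape for the Boto3 and Meilisearch pieces is a small indexing pipeline that walks a bucket and pushes object metadata into the search index. This is a sketch under assumed names, not a committed design:

```python
import boto3
import meilisearch

s3 = boto3.client("s3")
index = meilisearch.Client("http://localhost:7700", "masterKey").index("files")

def index_bucket(bucket: str, batch_size: int = 1000):
    """Walk every object in the bucket and push its metadata to Meilisearch.

    Batching keeps the number of indexing requests manageable at scale.
    """
    paginator = s3.get_paginator("list_objects_v2")
    batch = []
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            batch.append({
                # Hypothetical id scheme: the object's ETag, quotes stripped.
                "id": obj["ETag"].strip('"'),
                "key": obj["Key"],
                "size": obj["Size"],
                "last_modified": obj["LastModified"].isoformat(),
            })
            if len(batch) >= batch_size:
                index.add_documents(batch)
                batch = []
    if batch:
        index.add_documents(batch)

index_bucket("usgs-mission-data")  # hypothetical bucket name
```

Because Meilisearch replaces documents that share an id, re-running the pipeline after new mission data arrives simply refreshes the affected records.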