Project Description


Our project was initially sponsored by Western Digital, but due to unforeseen circumstances, we lost their support. We are grateful to our former client, Rajpal Singh, for providing the original concept for this project.

Modern solid-state drives (SSDs) suffer from "silent errors": a portion of a drive fails without warning, causing lost files, corrupted data, and dead drives. This is a serious problem for companies like Amazon, which runs AWS services on tens of thousands of SSDs. Drive manufacturers are looking for a way to detect silent errors and prevent them from affecting industry customers, but their current approach is too inefficient to make meaningful progress.

To make this process more efficient, we were tasked with developing an Observability, Analytics, and Insight platform for SSD manufacturers. Our platform probes the drive of the system running the program, collects data, stores it, and sends it to a visualization dashboard for analysis. This allows clients to easily collect and analyze performance metrics that reveal silent errors.

Our platform makes kernel-level data easy to collect, so engineers who are not kernel experts can work on the issue. The project was structured as our former client envisioned, with both a back-end and a front-end team. The back-end team is responsible for collecting, analyzing, compiling, and delivering the data to the front end. The front-end team then unpacks, organizes, and distributes the data among the appropriate databases before sending it to the visualization dashboard.

This platform will make detecting silent errors feasible, improving SSDs and saving manufacturers time and money. Once the program is completed and refined, it can be marketed to companies and tailored to their specific needs. The initial concept for this project was provided by our sponsor in the form of a PDF: Capstone project proposal.

High-Level Requirements

Key Requirement Thresholds

Functional Requirements

For more information about these requirements, please visit the Documentation Tab and access the Requirements Document.

Expected Milestone Progression



The Gantt chart above outlines the timeline for our project milestones. Initially, we focused on data collection to populate the rest of the program with dummy data. Next, we developed the sanitizer/formatter and database in tandem, using example outputs. Once these were complete, we built the visualization dashboard, which is fed data from the database. Throughout this process, we conducted ongoing testing to ensure that all functionalities were refined before integration into the main system. Completion of these milestones is crucial for achieving the Minimum Viable Product (MVP), and any stretch goals should be deferred until then.


Resources

Solution


We developed a data collection and pipeline tool that presents its results in a user-friendly display. Data collection uses eBPF, C++, and Python, with Python responsible for compiling and delivering the data in JSON/CSV format. After delivery, the data is loaded into a master MySQL database that houses device information and additional tables derived from the eBPF collection. A Flask server then queries the database and feeds the results directly into a Svelte app, which uses D3 to create relevant graphs for the client.
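As a rough illustration of the compile-and-deliver step, the sketch below serializes a few latency records to both JSON and CSV using only the Python standard library. The record fields (`device`, `latency_us`, `io_count`), the sample values, and the function names are hypothetical; the project's actual output schema may differ.

```python
import csv
import io
import json

# Hypothetical records, shaped like a latency histogram from a probe.
SAMPLE_RECORDS = [
    {"device": "nvme0n1", "latency_us": 64, "io_count": 12},
    {"device": "nvme0n1", "latency_us": 128, "io_count": 3},
]


def to_json(records):
    """Serialize a list of record dicts to a JSON string."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize a list of record dicts to CSV text with a header row."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_json(SAMPLE_RECORDS))
    print(to_csv(SAMPLE_RECORDS))
```

Producing both formats from the same list of dictionaries keeps the collector independent of whichever format the downstream stage expects.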


Technologies


Tech Demo

The video walkthrough of our program is divided into five parts, each representing a step in our solution:

  1. Probing and Collection: The program collects data from various BT files (Block RQ Complete + Error, Bio Latency, Bio Pattern) along with device-specific data. This data is stored in CSV or text files.
  2. Data Sanitization/Formatting: The CSV and Text files are cleaned and formatted to be easily readable and interpretable by both humans and machines. The cleaned data is then stored back into CSV files.
  3. Database Upload: The cleaned data from each file is uploaded to the appropriate tables in a MySQL database. The data related to the device being probed is also uploaded and linked with the collected data to maintain their association.
  4. Data Querying and Populating: The data from the database is queried and sent to the visualization dashboard. This is facilitated by a Flask server, which establishes the connection between the database and the dashboard.
  5. Visualization Dashboard: The queried data is displayed on the dashboard using a Svelte app and D3 designs. This allows the user to explore the data based on the BT files used.
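
Steps 2 through 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's code: the standard-library `sqlite3` module stands in for MySQL so the example runs without a server, and the table names, column names, and sample data are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw collector output: stray spaces, a blank line, a bad row.
RAW_CSV = """device, latency_us, io_count
nvme0n1, 64, 12

nvme0n1, 128, 3
nvme0n1, n/a, 7
"""


def sanitize(raw_text):
    """Step 2: trim whitespace and drop blank or non-numeric rows."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [name.strip() for name in next(reader)]
    rows = []
    for fields in reader:
        fields = [field.strip() for field in fields]
        if len(fields) != len(header):
            continue  # blank or malformed line
        record = dict(zip(header, fields))
        try:
            record["latency_us"] = int(record["latency_us"])
            record["io_count"] = int(record["io_count"])
        except ValueError:
            continue  # non-numeric measurement
        rows.append(record)
    return rows


def upload(conn, rows):
    """Step 3: store devices and link each sample to its device by id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS device "
        "(id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bio_latency "
        "(device_id INTEGER REFERENCES device(id), "
        "latency_us INTEGER, io_count INTEGER)"
    )
    for row in rows:
        conn.execute(
            "INSERT OR IGNORE INTO device (name) VALUES (?)", (row["device"],)
        )
        device_id = conn.execute(
            "SELECT id FROM device WHERE name = ?", (row["device"],)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO bio_latency VALUES (?, ?, ?)",
            (device_id, row["latency_us"], row["io_count"]),
        )
    conn.commit()


def query_latency(conn, device_name):
    """Step 4: fetch one device's histogram in a dashboard-friendly shape."""
    cursor = conn.execute(
        "SELECT b.latency_us, b.io_count FROM bio_latency AS b "
        "JOIN device AS d ON d.id = b.device_id "
        "WHERE d.name = ? ORDER BY b.latency_us",
        (device_name,),
    )
    return [{"latency_us": lat, "io_count": cnt} for lat, cnt in cursor]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    upload(connection, sanitize(RAW_CSV))
    print(query_latency(connection, "nvme0n1"))
```

In a setup like the one described, a Flask route would call something like `query_latency` and return the list as JSON for the Svelte app to plot. Keeping the device table separate and joining on `device_id` is one way to preserve the association between a drive and its samples.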

Codebase


Here is the codebase for the Capstone Project. This includes our Minimum Viable Product, website, and deprecated AWS code: GitHub