Smart Cloud Shield

About Our Project

We have been tasked with designing a product that pre-emptively categorizes data stored by IBM Spectrum Protect as belonging in hot or cold storage, to avoid unnecessary overhead costs. If we succeed, data will be categorized correctly and our sponsor will avoid those extra costs. We envision an end-to-end application that uses a big data tool to ingest file metadata, extracts features from that metadata, and passes the features to a machine learning framework. The framework trains a model that can then be used to select an appropriate storage tier for a given piece of data.
Original proposal
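
To make the envisioned flow (ingest metadata, extract features, train a model, classify a tier) more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the FileMetadata fields, the chosen features, the toy training data, and the nearest-centroid classifier, which stands in for whatever machine learning framework the project actually adopts. It is not the project's implementation and does not reflect the real IBM Spectrum Protect metadata schema.

```python
from dataclasses import dataclass
from math import log10


@dataclass
class FileMetadata:
    # Hypothetical metadata fields; the real schema may differ.
    size_bytes: int
    days_since_last_access: float
    accesses_last_30_days: int


def extract_features(meta):
    """Turn one metadata record into a numeric feature vector."""
    return [
        log10(meta.size_bytes + 1),        # compress the wide range of file sizes
        float(meta.days_since_last_access),
        float(meta.accesses_last_30_days),
    ]


def train_centroids(samples):
    """Compute the mean feature vector for each label ("hot" or "cold")."""
    sums, counts = {}, {}
    for meta, label in samples:
        feats = extract_features(meta)
        if label not in sums:
            sums[label] = [0.0] * len(feats)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feats)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(meta, centroids):
    """Return the label whose centroid is closest to this record's features."""
    feats = extract_features(meta)

    def squared_distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(feats, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Toy labeled examples, invented purely for illustration.
training = [
    (FileMetadata(10_000, 1, 40), "hot"),
    (FileMetadata(250_000, 3, 25), "hot"),
    (FileMetadata(5_000_000, 200, 0), "cold"),
    (FileMetadata(80_000_000, 365, 1), "cold"),
]
model = train_centroids(training)

new_file = FileMetadata(size_bytes=50_000, days_since_last_access=2, accesses_last_30_days=30)
print(classify(new_file, model))
```

In a real pipeline the features would likely need scaling so that no single one dominates the distance, and the model would be trained on observed access histories rather than hand-written examples.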

Why take this approach?

As it stands, many of the storage tiers at IBM Spectrum Protect hold data that doesn't belong in them. When data that belongs in cold storage sits in hot storage, customers pay for a service they aren't taking advantage of, and this is costly for IBM as well. When data that belongs in hot storage sits in cold storage, it is frequently retrieved from the cold tier, which can be costly and can make for a poor user experience: customers who need the data quickly have to wait a long time on every access. Our pipeline should accurately assign data to an appropriate storage tier on ingestion, removing the need for policies and eliminating the time data spends in the wrong tier.