To harness and manage data, IT invests in data management tools and methods to import, cleanse, and store it. A central question in this work is how the data will be stored: the better IT can match storage to the type of data it's dealing with, the better it can manage that data.
With the advent of unstructured big data, which now accounts for about 80% of all managed enterprise data, a new wave of data repositories has come into use that don't always rely on a data warehouse. These new repositories evolved because the way organizations use data has changed: there has been a move away from structured data with clean, fixed record lengths toward unstructured data without fixed record lengths.
Here is a breakdown of the data storage options in common use today:
1. Hierarchical and relational databases
Databases on mature enterprise platforms such as mainframes continue to operate with hierarchical and relational database structures that are mature, robust, and proprietary. These databases work exceptionally well. They are backed by an army of software utilities that ensure data integrity, security, monitoring and access.
Enterprise CIOs keep these databases because the databases are proven and best-in-class. On the other hand, it takes highly skilled staff to run these databases and IT budgets must support those salaries.
For the most part, proprietary databases contain structured System of Record data, but they are also used in Big Data analysis since many of the keys and vectors in Big Data for analysis come from System of Record systems.
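To make that concrete, a relational System of Record is typically queried for the structured keys and attributes that downstream Big Data analysis joins on. The sketch below uses Python's built-in sqlite3 with a hypothetical customer table; the schema, data, and threshold are illustrative only, not taken from any particular enterprise system.

```python
import sqlite3

# Hypothetical System of Record table: customer master records with
# fixed, structured fields. Schema and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, "
    "name TEXT, region TEXT, lifetime_value REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Acme Corp", "EMEA", 125000.0),
        (2, "Globex", "APAC", 98000.0),
        (3, "Initech", "EMEA", 45000.0),
    ],
)

# Extract the keys and attributes a Big Data analysis might join on:
# here, customer IDs for high-value EMEA accounts.
rows = conn.execute(
    "SELECT customer_id, region FROM customers "
    "WHERE region = 'EMEA' AND lifetime_value > 50000"
).fetchall()
print(rows)  # → [(1, 'EMEA')]
```

The same pattern scales up on a mainframe-class relational database: the structured query yields the clean keys that anchor analysis of messier, unstructured data elsewhere.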
2. Data lakes
Data lakes are different. A data lake's purpose is to store, secure, and make accessible aggregated combinations of structured and unstructured data tailored to a specific business area. One example is a marketing and customer demographics data lake used by marketing to develop a targeted product marketing campaign. Another is a medical information system that combines patient visit records and documentation with patient MRIs, X-rays, and CT scans.
The data lake is a self-contained data store, smaller than a hierarchical database, that is fed by data inflows. Those inflows can come from a hierarchical database, from an external data source such as social media, or from internal, unstructured data sources such as image and video files.
The intent is to make the data lake available to a specific user community and to regularly refresh the data lake from its incoming data feeders to ensure the data remains current and relevant. CIOs task their organizations with ensuring the right data practices are in place for each IT-backed data lake.
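The refresh cycle described above can be sketched as a small job that replaces each lake partition with the latest inflow. This is a minimal, in-memory illustration with hypothetical feed names; a real data lake would sit on object storage (e.g. S3 or HDFS) and be refreshed by a scheduled pipeline.

```python
# A minimal sketch of a data lake refresh cycle, assuming hypothetical
# "crm" and "social" feeds. A dict stands in for the lake's storage layer.
lake = {}  # partition name -> latest records

def refresh(lake, partition, feed_records):
    """Replace a partition with its latest inflow so data stays current."""
    lake[partition] = {"records": list(feed_records)}

# Structured inflow (e.g. extracted from a System of Record database)
crm_feed = [{"customer_id": 1, "segment": "enterprise"}]
# Unstructured inflow (e.g. pointers to social-media posts or media files)
social_feed = [{"customer_id": 1, "post": "Loving the new product!"}]

refresh(lake, "crm", crm_feed)
refresh(lake, "social", social_feed)
print(sorted(lake))  # → ['crm', 'social']
```

Running `refresh` on a schedule for each feeder is what keeps the lake's data current and relevant for its user community.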
3. Data streams
While data lakes are stagnant pools of data that need to be regularly refreshed by inflows of incoming new data, data streams are the exact opposite. That’s because the data in a data stream is constantly moving, so it never gets old.
A good example is the IoT (Internet of Things) data coming in from surveillance cameras, robots, industrial plants, drones, etc. Aside from storing snapshot-in-time activity logs relevant to system monitoring, debugging, and security, most stream data is ephemeral. It doesn’t need to be stored in a data repository for the long term, but it does require fast, point-to-point data transport for the business operations it supports, and IT must budget for it.
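The ephemeral handling described above can be sketched as follows: each event is acted on in flight and then discarded, with only a small bounded window retained as a snapshot-in-time log for monitoring and debugging. The sensor names and temperature threshold are hypothetical.

```python
from collections import deque

def process_stream(events, window_size=3):
    """Handle stream events in flight; retain only a bounded snapshot log."""
    recent = deque(maxlen=window_size)  # snapshot log, bounded in size
    alerts = []
    for event in events:
        recent.append(event)          # keep only the last few events
        if event["temp_c"] > 90:      # act on the data as it moves
            alerts.append(event["sensor"])
        # the event itself is never written to long-term storage
    return alerts, list(recent)

# Hypothetical IoT readings arriving from cameras, robots, and drones
readings = [
    {"sensor": "cam-1", "temp_c": 40},
    {"sensor": "robot-7", "temp_c": 95},
    {"sensor": "drone-2", "temp_c": 38},
    {"sensor": "plant-3", "temp_c": 41},
]
alerts, snapshot = process_stream(readings)
print(alerts)         # → ['robot-7']
print(len(snapshot))  # → 3
```

The bounded deque is the point: storage stays constant no matter how much data flows through, which is why the budget goes to transport rather than to a repository.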
4. Data oceans
Data oceans are accumulations of vast, unexplored and unprocessed data flowing in and out across the enterprise. Companies store this data because they think they might need it in the future. Unfortunately, there is also a high risk that the data will never be used.
Because data ocean data has never been cleaned or processed, it is heavily polluted and unlikely to support high-quality analysis. As the data ocean continues to expand, storage costs more money and becomes more difficult to manage. The key to managing this data is determining how long you want to keep it. If it's a treasure trove of emails, you might want to store them for legal discovery purposes, should the company ever be involved in litigation. If it's a bunch of IoT jitter or junk from old test systems, it's best to discard it. In any case, clear IT policies and practices should be in place to manage oceans of data.
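A retention policy like the one described above can be sketched as a simple lookup from data category to retention window. The categories and periods here are hypothetical placeholders; real values would come from legal and compliance requirements.

```python
# Hypothetical retention windows per data category, in days.
RETENTION_DAYS = {
    "email": 365 * 7,   # keep for potential legal discovery
    "iot_jitter": 0,    # discard immediately
    "test_output": 30,  # keep briefly, then purge
}

def should_keep(category, age_days):
    """Return True if an object is still within its retention window."""
    return age_days <= RETENTION_DAYS.get(category, 0)

print(should_keep("email", 400))       # → True
print(should_keep("iot_jitter", 1))    # → False
print(should_keep("test_output", 45))  # → False
```

Defaulting unknown categories to zero retention is a deliberate choice here: data that nobody has classified is the data most likely to sit unused while it drives up storage costs.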
What to read next:
New storage trends promise to help companies navigate a data avalanche
Storage shouldn’t be treated like an unloved part of IT
Data Fabrics: Six Top Use Cases