Data proliferation refers to the rapid growth of data, often resulting in a large amount of replicated and low-quality data. This can be costly to manage and may pose compliance and operational risks to an organization. Such data often has to be analyzed just to understand its structure, sources, and uses, yet it may ultimately have little value to the organization and can be difficult to discard. The following are illustrative examples of data proliferation.
Customer Data
It is common for multiple systems in an organization to maintain customer data. Such data is often out of sync between systems, with no clear single source of truth. This can cause operational failures such as sending a bill to the wrong address.
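A minimal sketch of how such mismatches might be detected, assuming each system can export a mapping of customer ID to mailing address. The names `billing`, `crm`, `normalize` and `find_address_conflicts` are hypothetical and used only for illustration; real systems would need far more careful address matching.

```python
def normalize(address):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(address.lower().split())


def find_address_conflicts(billing, crm):
    """Return customers present in both systems whose addresses disagree.

    Both arguments are assumed to be dicts of customer ID -> address string.
    """
    conflicts = {}
    # Only customers known to both systems can be compared.
    for customer_id in billing.keys() & crm.keys():
        if normalize(billing[customer_id]) != normalize(crm[customer_id]):
            conflicts[customer_id] = (billing[customer_id], crm[customer_id])
    return conflicts


# Hypothetical exports from two systems.
billing = {"c1": "12 Oak St", "c2": "9 Elm Ave", "c3": "4 Pine Rd"}
crm = {"c1": "12 oak st", "c2": "77 Birch Ln", "c4": "1 Main St"}

conflicts = find_address_conflicts(billing, crm)
print(conflicts)  # {'c2': ('9 Elm Ave', '77 Birch Ln')}
```

A check like this only reports that two systems disagree. Deciding which system holds the correct address still requires a designated source of truth.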
Documents
Knowledge workers tend to create a lot of documents that get checked into a document management system. In many cases, such documents go completely unused over time but are retained as a precaution.
Communication
Communications such as emails can accumulate at a rate of hundreds per employee per day. Most communications lose their value almost immediately but are often retained for an extended period of time.
Backups
Backups of data, documents and communications often need to be retained in case something important was deleted from the source systems. If someone deletes a critical email, the only copy may be in a backup from a particular day last year. As such, backups are commonly stored for long periods of time. This can consume considerable resources despite the fact that backups are rarely used.
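One common way to limit the resources that backups consume is to thin out older ones. The following is a minimal sketch, assuming one backup per day identified by its date. The function name `backups_to_keep` and the parameters `daily_days` and `max_days` are hypothetical and their defaults are arbitrary.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, daily_days=30, max_days=365):
    """Keep every recent backup, plus the earliest backup of each older month.

    Backups older than max_days (or dated in the future) are dropped.
    """
    keep = set()
    first_of_month = {}
    for d in sorted(backup_dates):
        age = (today - d).days
        if age < 0 or age > max_days:
            continue
        if age <= daily_days:
            keep.add(d)  # recent: keep every daily backup
        else:
            # older: remember only the earliest backup seen for that month
            first_of_month.setdefault((d.year, d.month), d)
    keep.update(first_of_month.values())
    return sorted(keep)


# Hypothetical example: daily backups from 2024-01-01 through 2024-06-30.
today = date(2024, 6, 30)
all_backups = [date(2024, 1, 1) + timedelta(days=i) for i in range(182)]
kept = backups_to_keep(all_backups, today)
print(len(all_backups), len(kept))  # 182 36
```

Thinning of this kind trades storage for risk. A critical email that existed only in the backup from one particular day would be lost if that day's backup is not among those kept.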
Transactional Data
Transactional data such as market trades and website purchases can grow extremely quickly. Transactional data is often viewed as valuable for historical research. For example, it is common to look at patterns in stock trades going back decades.
Social Data
Social data is data that people share on a public or private social network. It is often viewed as valuable for purposes such as market research and machine learning.
Sensors & Machines
Data generated by machines and sensors. Sensors have become so cheap that they can be embedded in everyday objects in great numbers. Such data may generally be less valuable than human-generated data. For example, video of a train tunnel or data from a tire pressure sensor isn’t interesting for long. Nevertheless, sensor data is potentially a gigantic source of data that is far larger than all other sources combined.