Understanding Data Classification in Big Data
Chapter 1: Data Classification in Big Data
In this module, we delve into the classification of data within the realm of Big Data. Traditional computing systems typically handle structured data, organized neatly in tables with distinct rows, columns, and fields for processing. However, the emergence of Big Data has introduced a myriad of data sources, each possessing unique types and formats, complicating their management.
We categorize data into three primary types:
- Structured Data
- Semi-Structured Data
- Unstructured Data
The primary goal of Big Data applications is to analyze data originating from various sources and differing types. For instance, a single application may integrate structured database data, satellite GIS information, files from Word and Excel, social media interactions, and machine sensor outputs. This amalgamation of data can yield valuable "insights," which may lead to problem-solving, answering queries, or potentially creating a new data-driven product.
Section 1.1: Overview of Data Types
There are three main classifications of data: Structured, Semi-Structured, and Unstructured. Here’s a brief overview of each:
Structured Data
This type of data is arranged in rows and columns, typically found within relational databases. It is highly efficient for retrieval and processing, with tools like spreadsheets and SQL designed for interaction with this data.
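As a minimal illustration (not part of the original lesson), the sketch below uses Python's built-in sqlite3 module to create and query a small relational table; the table and its columns are invented for the example.

```python
import sqlite3

# Create an in-memory relational database with a hypothetical "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 1200.0), ("South", 850.5), ("North", 430.0)],
)

# Because the structure is known in advance, retrieval is a declarative query.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
):
    print(region, total)

conn.close()
```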
Semi-Structured Data
Semi-structured data does not conform to a fixed relational schema, but it carries organizational markers such as tags or key-value pairs. It usually comes from web sources in formats like XML and JSON and requires a parsing step to uncover its structure.
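To make that parsing step concrete, here is a small sketch in Python; the JSON record and its keys are invented, but they show how semi-structured data labels itself without enforcing a schema.

```python
import json

# A hypothetical JSON record, e.g. returned by a web API: the keys describe
# the data, but no fixed schema is enforced ahead of time.
raw = '{"user": "alice", "tags": ["big-data", "json"], "profile": {"age": 30}}'

record = json.loads(raw)          # parsing step: discover the structure
print(record["user"])             # alice
print(record["profile"]["age"])   # 30 -- nesting depth can vary per record
```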
Unstructured Data
This category includes diverse formats such as videos, audio recordings, emails, and general text documents (like blog posts). Data from social media apps, such as WhatsApp messages, also falls under this classification. Analyzing unstructured data requires a pre-processing step that extracts structure or features from the raw content.
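A minimal sketch of such a pre-processing step, assuming a simple bag-of-words approach (the input text is invented for the example):

```python
import re
from collections import Counter

# Hypothetical unstructured input: free text from a blog post or chat message.
text = "Big Data is everywhere. Big Data needs pre-processing!"

# A typical pre-processing step: normalize case, strip punctuation, tokenize.
tokens = re.findall(r"[a-z]+", text.lower())

# Only after this step can the text be analyzed, e.g. by word frequency.
print(Counter(tokens).most_common(3))
```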
Beyond these three types, data can also be classified by its state: in motion or at rest.
Section 1.2: Data in Motion vs. Data at Rest
Data in Motion
This term refers to "streaming" data actively moving across networks. For example, "live streaming" refers to real-time video broadcast over the internet. Processing this type of data can be challenging and costly, yet when leveraged correctly it can provide real-time insights crucial for business problem-solving. Data in motion is typically safeguarded with encryption in transit, such as TLS.
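To illustrate the idea, the sketch below simulates a stream of sensor events with a Python generator and processes each event as it arrives; the event format and alert threshold are hypothetical.

```python
import random
import time

def sensor_stream(n_events):
    """Simulate data in motion: events arriving one at a time over a network."""
    for _ in range(n_events):
        yield {"ts": time.time(), "value": random.uniform(0.0, 100.0)}

# Streaming processing: each event is handled as it arrives, keeping only a
# small running state instead of storing the full data set first.
count, total = 0, 0.0
for event in sensor_stream(1000):
    count += 1
    total += event["value"]
    if event["value"] > 99.0:  # hypothetical real-time alert threshold
        print(f"alert: spike {event['value']:.1f} at {event['ts']:.0f}")

print(f"running average over {count} events: {total / count:.2f}")
```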
Data at Rest
In contrast, data at rest is stored securely in a stable location and is not in transit. After reaching its destination, such data may receive additional security layers, such as encryption and password protection. This data is essential as it chronicles the company's history and supports its operations.
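As one possible illustration of encrypting data at rest, the sketch below assumes the third-party cryptography package (pip install cryptography); the file name and record contents are invented.

```python
# Requires the third-party "cryptography" package (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice, kept in a key-management system
fernet = Fernet(key)

# Encrypt before writing to disk, so the stored data is protected at rest.
ciphertext = fernet.encrypt(b"quarterly revenue: 1.2M")
with open("record.bin", "wb") as fh:
    fh.write(ciphertext)

# Reading it back requires the key; the stored bytes alone are unreadable.
with open("record.bin", "rb") as fh:
    print(fernet.decrypt(fh.read()))
```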
Small Data
The term "Small Data" refers to manageable data quantities sufficient for decision-making. Unlike Big Data, which focuses on volume, Small Data emphasizes quality, providing ready-to-use, cleaned data suitable for departmental analysis. Examples include data generated from ERP or CRM systems. Both Big Data and Small Data serve as complementary solutions, beneficial in addressing common challenges.
Curiosities:
- Approximately 80% of global data is unstructured, prompting organizations to explore its insights through Big Data technologies.
- Structured data is commonly stored in "Data Warehouses," which carry substantial maintenance costs and require specialized IT professionals.
- The advent of Big Data introduced the "Data Lake" concept, a repository for storing unstructured data from various sources, including social media.
- XML (eXtensible Markup Language) is widely used on the web for electronic forms.
- JSON (JavaScript Object Notation) facilitates data exchange between applications.
Chapter 2: The Explosion of New Data
Data generation is accelerating exponentially across various types and formats. Emerging fields such as genomic analysis, the Internet of Things (IoT), space research, geographic information systems (GIS), and autonomous vehicles are significantly contributing to the growth of data for Big Data applications.
For instance, an autonomous vehicle generates data at an astounding rate: radars produce roughly 10 to 100 KB per second, sonars 10 to 100 KB per second, GPS around 50 KB per second, cameras 20 to 40 MB per second, and LIDAR 3D scanners 10 to 70 MB per second. Collectively, an autonomous vehicle can generate up to 4,000 GB (4 terabytes) of data in a single day, equivalent to the data generated by approximately 3,000 individuals.
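As a rough sanity check, the figures above can be combined in a back-of-envelope calculation, assuming the sensor rates are per second and a hypothetical eight hours of daily operation:

```python
# Back-of-envelope check of the daily figure, using the upper bounds of the
# sensor rates above and an assumed 8 hours of operation per day.
rates_mb_per_s = {
    "radar": 0.1,     # ~100 KB/s (upper bound)
    "sonar": 0.1,     # ~100 KB/s (upper bound)
    "gps": 0.05,      # ~50 KB/s
    "cameras": 40.0,  # ~40 MB/s (upper bound)
    "lidar": 70.0,    # ~70 MB/s (upper bound)
}

seconds = 8 * 3600
total_gb = sum(rates_mb_per_s.values()) * seconds / 1024
print(f"~{total_gb:,.0f} GB/day")  # ~3,100 GB/day, consistent with the 4 TB figure
```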
Curiosities:
- It is estimated that by 2025, there will be two billion sequenced genomes, requiring over 40 exabytes of storage and involving 10,000 trillion hours of processing.
- As of 2020, a typical user generates about 1.5GB of data daily, while an autonomous vehicle produces 4TB, an aircraft 5TB, and a factory could generate up to 1PB each day.
- NASA's Jet Propulsion Laboratory (JPL) faces challenges in archiving and accessing approximately 700 terabytes of imagery from space every day, equating to two days' worth of internet data traffic.
- Daily, humanity generates about 2.5 quintillion bytes of data, which can be numerically represented as 2,500,000,000,000,000,000 bytes.
Chapter 3: Understanding Data Files
You might have composed a document in Word, saving it as a "file" on your disk for future editing. Files are fundamental to computer processing and exist in various formats (text, executable, messages, databases, graphics, images). They are organized within directories, accessible via the operating system's file manager, and can be manipulated on local disks or cloud-based storage.
In the context of Big Data, files are distributed across different nodes (computers, disks) within a network, interconnected via local or wide-area networks. This reflects a distributed parallel processing system, allowing tasks to be executed on the nearest node or shared across multiple nodes.
Regardless of its classification (structured, semi-structured, or unstructured), data is ultimately stored in files distributed throughout the network.
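The toy sketch below illustrates the idea of block partitioning and replication; the node names and tiny block size are invented for demonstration (HDFS, for instance, defaults to 128 MB blocks).

```python
import itertools

# A toy model of how a distributed file system splits a file into fixed-size
# blocks and replicates each block on several nodes.
BLOCK_SIZE = 4          # bytes, unrealistically small for demonstration
REPLICATION = 2
NODES = ["node-1", "node-2", "node-3"]

def place_blocks(data: bytes):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    node_cycle = itertools.cycle(NODES)
    placement = {}
    for idx, block in enumerate(blocks):
        # Each block lands on REPLICATION distinct nodes for fault tolerance.
        placement[idx] = (block, [next(node_cycle) for _ in range(REPLICATION)])
    return placement

for idx, (block, nodes) in place_blocks(b"hello big data world").items():
    print(f"block {idx}: {block!r} -> {nodes}")
```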
Curiosities:
- Local processing occurs on individual computers, whereas distributed processing takes place simultaneously across multiple network nodes.
- Local files are managed by operating systems like Windows, macOS, and Linux, while distributed files are handled by Big Data systems such as Hadoop.
- Key characteristics of Distributed Files include: (a) Files are partitioned into blocks, (b) Files are replicated for security, (c) Files and data are scalable to accommodate large volumes, (d) Files are fault-tolerant.
For further reading, return to the Course Overview and select the link for your next lesson.