Let's talk about data with data

2021-06-04

The future is here, in plain sight for all to see. Technological growth is exponential, and so is the volume of data being generated. However, it is estimated that less than 20% of this data is analyzed. This article is about data, so it seems appropriate to start by defining what data is.

90% of the world's data was created in the last two years and is predicted to grow by 40% per year.1

We can say that a datum is the representation of a variable (qualitative or quantitative) with an assigned value. Simply put, it is only a value until it is contextualized and analyzed, which transforms it into information. And information is power: power to make a decision, power to better understand the present and, why not, the future. With this power comes great responsibility, a topic known as Data Ethics that we will discuss in a future post.

Every day, we create approximately 2.5 quintillion bytes of data.2

A quintillion is a 1 followed by 18 zeros, so that is roughly 2.5 exabytes per day, and as we saw, the volume keeps growing. Contrary to what one might think, not all data is stored in tables or spreadsheets (such as Excel). It is everywhere, hidden in emails, images, audio and video. Interestingly, data does not always have to be organized or tabulated before it can be examined and measured to obtain results that serve a project's objectives.
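The size of that daily figure is easy to check with a few lines of arithmetic. The sketch below assumes the short-scale (English) definition of a quintillion, 10^18, and the SI definition of an exabyte; the variable names are illustrative only.

```python
# Short-scale (English) naming: one quintillion is 10**18.
QUINTILLION = 10**18

# The figure quoted above: roughly 2.5 quintillion bytes created per day.
# Integer arithmetic is used here to avoid floating-point rounding.
bytes_per_day = 25 * QUINTILLION // 10

# Count the zeros in one quintillion written out in full.
zeros_in_quintillion = str(QUINTILLION).count("0")

# Express the daily volume in exabytes (1 EB = 10**18 bytes, SI definition).
exabytes_per_day = bytes_per_day / 10**18

print(zeros_in_quintillion)
print(exabytes_per_day)
```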

Next, we will look at the different types of data, since knowing and understanding them is important when building solutions.

A widely used classification divides data into structured, semi-structured and unstructured. Structured data is usually stored in tabular form: delimited text files, spreadsheets or relational databases. Each table has a heading for every column, which allows it to be identified. In such tables, each row corresponds to a record, for example a customer, and the columns represent attributes of those customers, such as monthly income, age or date of birth.
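To make the row-and-column idea concrete, here is a minimal sketch using Python's standard `csv` module. The customer table, its column names and its values are invented for illustration and do not come from any real dataset.

```python
import csv
import io

# Hypothetical customer table: the header row names the attributes,
# and each following row is one record (one customer).
raw_table = """customer_id,age,monthly_income,date_of_birth
1,34,2500,1987-02-11
2,51,4100,1970-06-04
"""

# DictReader uses the header row as keys, so each record becomes a dict.
records = list(csv.DictReader(io.StringIO(raw_table)))

# Because every record shares the same columns, attributes can be queried by name.
ages = [int(r["age"]) for r in records]
average_income = sum(int(r["monthly_income"]) for r in records) / len(records)

print(ages)
print(average_income)
```

The point of the sketch is that the structure itself (fixed columns, one record per row) is what makes questions such as "what is the average income?" answerable in one line.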

It is estimated that by 2025, 80% of data will be unstructured, leaving only 20% structured.3

Semi-structured data does not follow a strict structural framework, but it does have some distinguishable properties. It typically consists of content organized by fields or topics, where the content inside each field is free text with no structure of its own.
Emails, for example, are semi-structured by sender, recipient, subject, date, and so on. They can offer companies a wealth of data mining opportunities: analyzing customer feedback, checking that customer service is working properly and helping to build marketing materials.
Another example is social networking platforms such as Facebook, which organize information by user, friends, groups, marketplace, etc., while the comments and text contained in these categories are not structured.
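The email case can be sketched with Python's standard `email` package: the headers are the structured part, and the body is free text. The message below, including its addresses and wording, is a made-up example.

```python
from email import message_from_string

# Hypothetical raw message: headers first, then a blank line, then the body.
raw_email = """From: customer@example.com
To: support@example.com
Subject: Late delivery
Date: Fri, 04 Jun 2021 10:15:00 +0000

Hi, my order arrived a week late and the box was damaged.
"""

msg = message_from_string(raw_email)

# The headers behave like named fields, much like columns in a table.
structured_part = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
    "date": msg["Date"],
}

# The body is plain free text with no schema of its own.
unstructured_part = msg.get_payload()

print(structured_part)
print(unstructured_part)
```

Filtering by sender or date is straightforward here, whereas extracting something like customer sentiment from the body would need text analysis.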

In 2020, 306.4 billion emails were sent per day.4

Finally, unstructured data has no well-defined structure: videos, images, audio, etc. In mid-2020, Instagram introduced automatically generated captions for videos on Instagram TV (IGTV). The feature uses Artificial Intelligence to process the videos, makes the application easier to use and was designed to help users with hearing impairments.

In 2016, it was estimated that an average of 95 million photos were shared per day on Instagram.5

In conclusion, there will be more and more data, and only a small portion of it is being analyzed. This is one of the reasons why we are so passionate about a discipline like data science. Doing what we love, we are dedicated to making the data speak for itself and so providing valuable information to the decision makers: you.

Published by

Wais
