The big data webinar is part of the Dies Natalis webinar series of the Faculty of Economics and Business, Universitas Brawijaya. This first webinar introduces the field of big data and is presented by Ardi Imawan, S.Kom., M.Sc., from DOT Indonesia.
“Data is the new science. Big data holds the answer” — Pat Gelsinger
Nowadays we often hear terms such as Industry 4.0, Artificial Intelligence (AI), and big data. Big data may be today’s marketing buzzword, but in the next decade, as digital innovation accelerates, the word “big” is likely to become irrelevant because large-scale use of data will be the norm. In its place, more specific terms such as Data Science, Data Engineering, and Data Processing have started to appear. Big data can be used to analyze the problems we face, whether they are business, social, or economic problems. Data is often called the new oil, a comparison that emphasizes how valuable it has become; the ability to make use of big data is essential in modern times.
Statistics show that the big data market grew steadily from 2009 to 2015 and then spiked in 2020. This trend is supported by increasing internet accessibility: nowadays almost everyone is connected to the internet and uses online services, which means that more data is being collected.
Then, what makes big data “big”? Big data has five main characteristics that differentiate it from traditional data:
· Veracity: the quality and origin of the data, or how accurate the data is.
· Volume: big data is generated in such vast amounts that, unlike traditional data, it cannot be contained within a single piece of hardware.
· Variety: unlike traditional data, big data is highly varied, both in its sources and in its data types.
· Velocity: the speed at which big data is generated. It is produced extremely quickly and continuously, because new data is created whenever someone uses the internet.
· Value: the usefulness of the data once it is analyzed.
Big data has been applied in several fields, namely:
· Digital marketing optimization
· Data exploration and discovery
· Fraud detection and prevention
· Social network and relationship analysis
· Machine-generated data analytics
· Data retention
Example of Big Data Analysis in Real Life
One example that we often encounter in real life is the set of recommendations that appears when we open YouTube. By using the historical data of the videos we like and watch, YouTube is able to determine which videos to recommend to match our preferences. This is commonly called the YouTube Algorithm, which analyzes viewers’ habits and recommends videos that suit their interests.
YouTube, as part of Google, also personalizes the advertisements it shows us. For example, when we search for a product in Google Search, the advertisements on YouTube will show products related to the ones we are interested in; if we like games, YouTube will present advertisements about games.
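As a toy illustration of recommending from watch history, the sketch below counts which categories a viewer has watched most and suggests unwatched videos from those categories first. This is not YouTube's actual algorithm, which is far more complex; the function name `recommend`, the video titles, and the categories are all hypothetical.

```python
from collections import Counter

def recommend(history, catalog, k=2):
    """Suggest up to k unwatched videos, favoring the viewer's most-watched categories."""
    # Count how many watched videos fall into each category.
    category_counts = Counter(catalog[v] for v in history if v in catalog)
    watched = set(history)
    candidates = [v for v in catalog if v not in watched]
    # Most-watched category first; ties are broken alphabetically by title.
    candidates.sort(key=lambda v: (-category_counts[catalog[v]], v))
    return candidates[:k]

# Hypothetical catalog mapping video titles to categories.
catalog = {
    "speedrun_highlights": "games",
    "indie_game_review": "games",
    "pasta_recipe": "cooking",
    "budget_laptop_review": "tech",
    "boss_fight_guide": "games",
}
history = ["speedrun_highlights", "indie_game_review", "pasta_recipe"]

print(recommend(history, catalog))
```

Because the viewer in this example has mostly watched game videos, the unwatched game video is ranked ahead of the tech video.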
Data sources are divided into two types: structured data and unstructured data.
Structured data resembles the traditional data we often encounter in the form of tables. It can be displayed in rows and columns and stored in relational databases. The basic data types are mostly numbers, dates, and strings. It is relatively small, so it requires less storage and is easier to manage and protect. It is estimated that structured data makes up 20% of enterprise data.
Unstructured data is data that cannot be displayed in rows and columns or stored in relational databases. Its data types vary widely and include documents, images, videos, e-mails, and so on. Because of its variety and volume, it requires more storage capacity and is more difficult to manage and protect. It is estimated that unstructured data makes up 80% of enterprise data.
The volume of unstructured data has spiked in recent years, and this growth is predicted to continue in the coming years. Despite its usefulness, unstructured data creates another problem: storage and computation.
In the figure above, scaling up means adding resources to an existing system to achieve the desired computational and storage performance. Scaling out, on the other hand, means replicating the system in several different places to distribute the storage and computation load.
Solving the storage and computation problem in-house is expensive for a company that has to manage such large volumes of unstructured data. This is where cloud computing companies come in: a cloud computing provider rents out its infrastructure so that companies can store their data and run their computations in the cloud, without needing to scale their own infrastructure.
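A minimal sketch of the scale-out idea, assuming records are spread across machines by hashing a key: each record is assigned to one of several nodes so that no single node has to hold everything. The function names, the `user-…` keys, and the hashing scheme are illustrative assumptions; real distributed systems use more elaborate placement strategies.

```python
import hashlib

def node_for(key, num_nodes):
    """Pick a node for a key with a stable hash, so the same key always lands on the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def distribute(keys, num_nodes):
    """Group keys by the node that would store them."""
    nodes = [[] for _ in range(num_nodes)]
    for key in keys:
        nodes[node_for(key, num_nodes)].append(key)
    return nodes

# Hypothetical record keys.
keys = [f"user-{i}" for i in range(1000)]

# With one node, everything sits on a single machine; with four, the load is shared.
for n in (1, 4):
    sizes = [len(node) for node in distribute(keys, n)]
    print(n, "node(s):", sizes)
```

Scaling up would correspond to making the single node larger, while scaling out corresponds to raising `num_nodes`.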
There are several features of cloud computing:
· Pay as you go
Major cloud computing providers include Google, Amazon Web Services, Microsoft Azure, and Alibaba Cloud.
Data Analysis Process
Big data analysis follows an ordered sequence of steps, starting with data preparation.
The data preparation schema above shows the steps that data goes through before it can be used in analysis. It starts with gathering the data. By doing exploratory data analysis (EDA), we can then discover the patterns in the data. After that, we cleanse the data so that we can focus on its meaningful features. Finally, the data is transformed and enriched before it is placed in storage.
Let us start with data gathering.
One method that can be used to gather data is web scraping.
Web scraping is a method of collecting data from a website by extracting it from the HTML documents that make up the site. To understand web scraping, it helps to understand HTML first.
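The preparation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample orders, the column names, the threshold for a "large" order, and the use of JSON as the storage format are all hypothetical and not taken from the webinar.

```python
import csv
import io
import json

# Hypothetical raw data with a stray space, missing amounts, and a duplicate row.
RAW = """order_id,city,amount
1,Malang,120000
2,malang ,85000
3,Surabaya,
3,Surabaya,
4,Jakarta,230000
"""

def gather(text):
    """Gathering: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def explore(rows):
    """EDA: summarize how many rows exist and how many lack an amount."""
    missing = sum(1 for r in rows if not r["amount"])
    return {"rows": len(rows), "missing_amount": missing}

def cleanse(rows):
    """Cleansing: drop rows without an amount and repeated order IDs."""
    seen, clean = set(), []
    for r in rows:
        if not r["amount"] or r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        clean.append(r)
    return clean

def transform(rows):
    """Transformation: convert text to numbers and normalize city names."""
    return [
        {"order_id": int(r["order_id"]),
         "city": r["city"].strip().title(),
         "amount": int(r["amount"])}
        for r in rows
    ]

def enrich(rows, threshold=100000):
    """Enrichment: add a derived flag for large orders (threshold is an assumption)."""
    return [{**r, "large_order": r["amount"] >= threshold} for r in rows]

def store(rows):
    """Storage: serialize to JSON text; a real pipeline would write to a database or file."""
    return json.dumps(rows, indent=2)

rows = gather(RAW)
print(explore(rows))
prepared = enrich(transform(cleanse(rows)))
print(store(prepared))
```

Keeping each step as its own function mirrors the schema: the output of one stage is the input of the next.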
Introduction to HTML
HyperText Markup Language, commonly known as HTML, is the markup that makes up the content of a website. HTML can be likened to the skeleton of a website because it provides its basic structure. The structure of an HTML document consists mainly of parent, branch, and root elements.
An HTML document has branches, often referred to as children. These usually include <head>, closed by </head>, which holds information about the page such as its title, and <body> </body>, which holds the main content of the website. The body has children of its own, such as h1 and p, which denote a level-1 heading and a paragraph respectively. Content can also be structured as a table, using <table> for the table itself, <tr> for each row, and <td> for each cell.
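To make the parent and child relationships concrete, the sketch below parses a small HTML document with Python's built-in `html.parser` module and prints each tag indented by its depth. The sample page and the class name `OutlineParser` are made up for illustration, and the depth tracking assumes every tag is properly closed, as it is in this sample.

```python
from html.parser import HTMLParser

# A small, hypothetical page with a head, a body, and a one-row table.
PAGE = """<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Big Data</h1>
    <p>An introduction.</p>
    <table>
      <tr><td>Volume</td><td>Velocity</td></tr>
    </table>
  </body>
</html>"""

class OutlineParser(HTMLParser):
    """Record each opening tag together with how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

parser = OutlineParser()
parser.feed(PAGE)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

In the printed outline, `html` sits at the root, `head` and `body` are its children, and the `td` cells are nested inside `tr`, which is nested inside `table`.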
Web scraping is a delicate practice. A scraper can send a large number of requests to a website, which can slow the site down because the server has to respond to many requests within a narrow timeframe. This is why some websites forbid users from web scraping. To learn web scraping safely, we can use a website called Fake Python, which is made specifically to help people practice their web scraping skills.
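One way to reduce the load a scraper puts on a site is to check the site's robots.txt rules and pause between requests. The webinar does not mention robots.txt, so that part is an addition. The sketch below uses Python's standard `urllib.robotparser` module and is only an outline under assumptions: the robots.txt content, the `example.com` URLs, the `practice-bot` user agent, and the `crawl` helper are hypothetical, and the `fetch` function is a stand-in that makes no real network request.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline instead of downloaded.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_rules(robots_text):
    """Parse robots.txt text into a rules object."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules

def crawl(rules, urls, fetch, user_agent="practice-bot",
          default_delay=1.0, sleep=time.sleep):
    """Fetch only the allowed URLs, pausing between requests."""
    # Use the site's requested delay if it declares one, else a default.
    delay = rules.crawl_delay(user_agent) or default_delay
    results = []
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # skip pages the site asks crawlers to avoid
        results.append(fetch(url))
        sleep(delay)
    return results

rules = build_rules(ROBOTS_TXT)
urls = [
    "https://example.com/jobs/1",
    "https://example.com/private/admin",
    "https://example.com/jobs/2",
]

# For demonstration, record the waits instead of actually sleeping.
waits = []
results = crawl(rules, urls, fetch=lambda u: f"fetched {u}", sleep=waits.append)
print(results)
print("pauses between requests:", waits)
```

In real use, `fetch` would be replaced by an actual HTTP request and the default `sleep=time.sleep` would be kept, so the scraper genuinely waits between requests.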