Why big data is so important to data science

What is big data?

Big data is the term for the ever-increasing amount of data collected for analysis. Every day, vast amounts of unsorted data are drawn from various apps and social media, and all of it requires processing before it can be used.

Creating data sets for such a volume of data is more complex than creating those used in traditional data sorting. This is because the value of the data needs defining; without a definition it is just a lot of detail with no real meaning. Although the term only came into everyday usage relatively recently, big data has its roots in the database technology of the 1960s and 1970s, including the development of the relational database. It was the exponential rise in the amount and speed of data being gathered through sites like Facebook and YouTube that created the drive for big data analytics amongst tech companies. The ‘Three Vs’ model characterises big data by volume, variety, and velocity (with veracity and variability sometimes added as fourth and fifth Vs). Hadoop appeared in 2005, offering an open-source framework for storing and analysing big data. NoSQL databases, designed for data without a rigidly defined structure, also rose in stature around this time. Since then, big data has been a major focus of data science.

What is big data analytics?

Big data analytics is the process of sorting and examining data to uncover valuable insights. Before we had the technology to sort through huge data sets using artificial intelligence, this was a far slower and more laborious task. The depth of insight we can now gain through data mining is thanks to machine learning, including deep learning. Data management is much more streamlined now, but it still needs data analysts to define inputs and make sense of outputs. Advances like Natural Language Processing (NLP) may offer the next leap for data analytics. NLP allows machines to simulate the human ability to understand language, which means they can read content and interpret whole sentences rather than simply scanning for keywords and phrases.

In 2016, Cisco estimated annual internet traffic had, for the first time, surpassed one zettabyte (1000⁷, or 1,000,000,000,000,000,000,000 bytes) of data. Big data analysis can involve data sets reaching into terabytes (1000⁴ bytes) and petabytes (1000⁵ bytes). Organisations store these huge amounts of data in what are known as data lakes and data warehouses. Data warehouses store structured data, made up of data points that relate to one another, which has been filtered for a specific purpose. They support fast SQL (Structured Query Language) queries, which stakeholders can use for things like operational reporting. Data lakes contain raw data drawn from apps, social media, and Internet of Things devices, which still awaits definition and cataloguing before it can be analysed.
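
The contrast between a data lake and a data warehouse can be sketched on a very small scale. The example below is only an illustration, not how a production system is built: the event records, the `sales` table, and the column names are all made up, and an in-memory SQLite database stands in for a real warehouse.

```python
import json
import sqlite3

# "Data lake": raw events kept exactly as they arrived (hypothetical JSON lines).
raw_events = [
    '{"source": "app", "customer": "a1", "amount": 12.5}',
    '{"source": "social", "customer": "b2", "text": "great service"}',
    '{"source": "app", "customer": "a1", "amount": 7.5}',
]

# "Data warehouse": a structured table filtered for one purpose (sales reporting).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
for line in raw_events:
    event = json.loads(line)
    if "amount" in event:  # keep only the records relevant to sales
        conn.execute(
            "INSERT INTO sales VALUES (?, ?)",
            (event["customer"], event["amount"]),
        )

# A SQL query of the kind a stakeholder might use for operational reporting.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer"
).fetchall()
print(totals)
```

The raw list keeps everything, including the social media comment that has no obvious structure yet, while the table holds only the cleaned, defined records that a query can answer quickly.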

The flow of usable data usually involves capture, pre-processing, storage, retrieval, post-processing, analysis, and visualisation. Data visualisation is important because people tend to grasp concepts more quickly through representations like graphs, diagrams, and tables.
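
As a rough sketch of that flow, the stages can be walked through in a few lines of Python. Everything here is assumed for illustration: the device readings are invented, a plain list stands in for a real data store, the post-processing stage is left out for brevity, and the "visualisation" is just a text bar chart.

```python
from collections import Counter

# Capture: hypothetical raw readings, some of them messy.
captured = [" Mobile", "desktop", "mobile ", "", "Tablet", "MOBILE", "desktop"]

# Pre-processing: trim whitespace, normalise case, drop empty records.
cleaned = [value.strip().lower() for value in captured if value.strip()]

# Storage and retrieval: a list stands in for a real data store here.
store = list(cleaned)
retrieved = store

# Analysis: count how often each device type appears.
counts = Counter(retrieved)

# Visualisation: a text bar chart, most common device first.
chart_lines = [f"{device:<8} {'#' * n} {n}" for device, n in counts.most_common()]
print("\n".join(chart_lines))
```

Even in this toy form, the bar chart makes the most common device obvious at a glance, which is the point of the visualisation stage.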

What is Spark in big data?

Spark is a leading big data platform for large-scale data processing, with built-in support for SQL queries, stream processing, and machine learning. Like Hadoop before it, Spark is a data processing framework, but it works faster and allows stream processing (or near-real-time processing) as opposed to just batch processing. Spark processes data in memory where possible, which can make it up to 100 times faster than Hadoop’s MapReduce for some workloads. Whereas Hadoop is written mainly in Java, Spark is written mainly in Scala and offers APIs in Scala, Java, Python, and R. Spark programs typically need fewer lines of code than the equivalent MapReduce jobs, which speeds up development.
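
The difference between batch and stream processing can be illustrated without Spark itself. The sketch below is plain Python and does not use Spark’s API; the function names and the sample readings are hypothetical, chosen only to show the idea.

```python
def batch_total(records):
    """Batch processing: wait for the full data set, then compute once."""
    return sum(records)


def stream_totals(records):
    """Stream processing: update and emit the result as each record arrives."""
    running = 0
    for value in records:
        running += value
        yield running


readings = [3, 5, 2, 10]
print(batch_total(readings))          # one answer, available only at the end
print(list(stream_totals(readings)))  # an up-to-date answer after every record
```

A batch job gives a single result once all the data is in, whereas a streaming job keeps its result current as new data arrives, which is what makes it suited to live dashboards and alerts.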

Both Hadoop and Spark are open-source projects maintained by the Apache Software Foundation; Spark began at the University of California, Berkeley’s AMPLab and was later donated to Apache. Using the two in tandem can lead to the best results – Spark for speed and Hadoop for security, amongst other capabilities.

How is big data used?

Big data is important because it provides business value that can help companies lead in their sector – it gives a competitive advantage when used correctly.

Increasingly, big data is being used across a wide range of sectors including e-commerce, healthcare, and media and entertainment. Everyday uses include eBay drawing on a customer’s purchase history to target them with relevant discounts and offers. eBay is an online retailer, so its use of big data is not new. Yet, within the retail sphere, McKinsey & Company estimates that up to 30% of retailers’ pricing decisions fail to deliver the best price. On average, what feels like a small price increase of just 1% translates to an impressive 8.7% increase in operating profits (assuming no loss in volume). By not using big data technologies for price analysis and optimisation, retailers are missing out on these kinds of profits from a relatively small adjustment.
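
The arithmetic behind that pricing figure is worth a quick look. The sketch below assumes that sales volume and costs stay the same, so the whole price rise flows through to profit. The operating margin of about 11.5% is an assumption chosen here because it reproduces the quoted 8.7%; it is not a figure from McKinsey, and the function name is illustrative.

```python
def profit_uplift(operating_margin, price_increase):
    """Relative rise in operating profit from a price rise,
    assuming sales volume and costs stay unchanged."""
    revenue = 100.0                      # arbitrary baseline revenue
    profit = revenue * operating_margin  # profit before the price change
    costs = revenue - profit             # costs are held constant
    new_profit = revenue * (1 + price_increase) - costs
    return (new_profit - profit) / profit


# Assumed margin of 11.5% (illustrative only, not from the source).
uplift = profit_uplift(operating_margin=0.115, price_increase=0.01)
print(f"{uplift:.1%}")
```

Because profit is a thin slice of revenue, a small change in price is a much larger change relative to profit, which is why the thinner the margin, the bigger the effect of getting the price right.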

In healthcare, apps on mobile devices and fitness trackers can track movement, sleep, diet, and hormones, creating new data sources. All this personal data is fed into big data analysis for further insights into behaviours and habits related to health. Big data can also enable huge strides in some of healthcare’s biggest challenges, like treating cancer. During his time as President of the United States, Barack Obama set up the Cancer Moonshot program. Pooling data from genetically sequenced cancer tissue samples is key to its aim of investigating, learning about, and perhaps finding a cure for cancer. Some of the unexpected results of using these types of data include the finding that the antidepressant desipramine may help treat certain types of lung cancer.

Within the home, energy consumption can be managed more efficiently with the predictive analytics that a smart meter can provide. Smart meters are potentially part of a larger Internet of Things (IoT) – an interconnected system of objects embedded with sensors and software that feed data back and forth. This data is specifically referred to as sensor data. As more ‘Things’ become connected to one another, in theory, the IoT can optimise everything from shopping to travel. Some buildings are designed to be smart ecosystems, where devices throughout are connected and feeding back data to make a more efficient environment. This is already seen in offices where data collection helps manage lighting, heating, storage, meeting room scheduling, and parking.

Which companies use big data?

Jeff Bezos, the founder of Amazon, became one of the richest people in the world by making sure big data was core to the Amazon business model from the start. Through this early investment in machine learning, Amazon has come to dominate the market by getting its prices right for the company and the customer, and by managing its supply chains in the leanest way possible.

Netflix, the popular streaming service, takes a successful big data approach to content curation. It uses algorithms to suggest films and shows you might like to watch based on your viewing history, and to understand which film productions the company should fund. Once a humble DVD-rental service, Netflix led the field with 35 nominations at the 2021 Academy Awards. In 2020, Netflix briefly overtook Disney as the world’s most valuable media company.
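
To give a feel for how suggestions can be drawn from viewing history, here is a deliberately simple sketch. It is not Netflix’s algorithm, which is far more sophisticated and not described in this article; the viewers, titles, and the `recommend` function are all hypothetical. The idea is that titles watched by people with similar histories score highest.

```python
from collections import Counter

# Hypothetical viewing histories: viewer -> set of titles watched.
histories = {
    "ana": {"Drama A", "Thriller B", "Comedy C"},
    "ben": {"Drama A", "Thriller B", "Documentary D"},
    "cara": {"Comedy C", "Documentary D"},
    "dev": {"Drama A", "Thriller B"},
}


def recommend(viewer, histories, top_n=2):
    """Suggest unseen titles watched by viewers with overlapping histories."""
    seen = histories[viewer]
    scores = Counter()
    for other, titles in histories.items():
        if other == viewer:
            continue
        overlap = len(seen & titles)  # shared titles act as a similarity weight
        for title in titles - seen:
            scores[title] += overlap
    return [title for title, score in scores.most_common(top_n) if score > 0]


suggestions = recommend("dev", histories)
print(suggestions)
```

Here the viewer "dev" shares two titles with both "ana" and "ben", so the titles those two have watched and "dev" has not are the ones suggested.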

These are just some of the many examples of harnessing the value of big data across entertainment, energy, insurance, finance, and telecommunications.

How to become a big data engineer

With so much potential for big data in business, there is great interest in professionals like big data engineers and data scientists who can guide an organisation with its data strategy.

Gaining a master’s that focuses on data science is the perfect first step to a career in data science. Find out more about getting started in this field with University of York’s MSc Computer Science with Data Analytics. You don’t need a background in computer science and the course is 100% online so you can fit it around your current commitments.