The use of statistics in data science
Statistics is the study of data. Considered a mathematical science, it involves collecting, organising, and analysing data with the intent of deriving meaning that can then be acted upon. Our everyday use of the internet and of apps across our phones, laptops, and fitness trackers has created an explosion of information that can be grouped into data sets and mined for insights through statistical analysis. Add to this an estimated 5.6 billion Google searches a day, and big data analytics becomes big business.
Although we may hear the phrase data analytics more often than statistics nowadays, for data scientists, data analysis is underpinned by knowledge of statistical methods. Machine learning automates much of the methodology that statisticians would traditionally apply by hand, but a foundational understanding of the basics of statistics still informs strategy in exercises like hypothesis testing. Statistics also contributes to technologies such as data mining, speech recognition, vision and image analysis, data compression, artificial intelligence, and network and traffic modelling.
When analysing data, probability is one of the most widely used statistical tools. Being able to predict the likelihood of something happening matters in numerous scenarios, from understanding how a self-driving car should react in a collision to recognising the signs of an upcoming stock market crash. A common use of probability in predictive modelling is weather forecasting, a practice that has been refined continually since it emerged in the 19th century. For data-driven companies like Spotify or Netflix, probability can help predict what kind of music you might like to listen to or which film you might enjoy watching next.
Aside from our preferences in entertainment, recent research has focused on the ability to predict seemingly unpredictable events such as a pandemic, an earthquake, or an asteroid strike. Because of their rarity, these events have historically been difficult to study through the lens of statistical inference – the sample sizes are so small that variance estimates blow up and become unreliable. However, “black swan theory” could help us navigate unstable conditions in sectors like finance, insurance, healthcare, and agriculture by anticipating when a rare but high-impact event might occur.
Black swan theory was developed by Nassim Nicholas Taleb, a critic of the widespread use of the normal distribution model in financial engineering. In finance, the coefficient of variation is often used to assess the volatility and risk of an investment, which may appeal more to someone watching for a black swan. In computer science, though, normal distributions, standard deviation, and z-scores can all be useful for deriving meaning and supporting predictions.
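As a minimal sketch of how these quantities are calculated in practice, the example below computes the standard deviation, coefficient of variation, and z-scores of a small set of hypothetical daily returns (the figures are purely illustrative):

```python
import numpy as np

# Hypothetical daily returns (%) for an asset -- illustrative values only
returns = np.array([0.4, -1.2, 0.8, 2.1, -0.3, 1.5, -0.9, 0.6])

mean = returns.mean()
std = returns.std(ddof=1)  # sample standard deviation

# Coefficient of variation: volatility relative to the average return
cv = std / mean

# z-scores: how many standard deviations each observation lies from the mean
z_scores = (returns - mean) / std

print(f"mean={mean:.3f}, std={std:.3f}, CV={cv:.2f}")
print("z-scores:", np.round(z_scores, 2))
```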
Some computer science-based methods that overlap with elements of statistical principles include:
- Time series, ARMA (autoregressive moving average) processes, correlograms
- Survival models
- Markov processes
- Spatial and cluster processes
- Bayesian statistics
- Some statistical distributions
- Goodness-of-fit techniques
- Experimental design
- Analysis of variance (ANOVA)
- A/B and multivariate testing (see the sketch after this list)
- Random variables
- Simulation using Markov chain Monte Carlo (MCMC) methods
- Imputation techniques
- Cross validation
- Rank statistics, percentiles, outlier detection
- Sampling
- Statistical significance
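To make a couple of these items concrete, here is a minimal sketch of an A/B test assessed for statistical significance with a two-proportion z-test, using SciPy and entirely made-up conversion figures:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical A/B test: conversions out of visitors for two page variants
conversions = np.array([120, 150])   # variant A, variant B
visitors = np.array([2400, 2450])

p_a, p_b = conversions / visitors
p_pool = conversions.sum() / visitors.sum()  # pooled rate under the null hypothesis
se = np.sqrt(p_pool * (1 - p_pool) * (1 / visitors[0] + 1 / visitors[1]))

z = (p_b - p_a) / se           # two-proportion z statistic
p_value = 2 * norm.sf(abs(z))  # two-sided p-value

print(f"A: {p_a:.2%}  B: {p_b:.2%}  z = {z:.2f}  p = {p_value:.4f}")
```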
While statisticians tend to build theory into solving problems of uncertainty from the outset, computer scientists tend to focus on acquiring data to solve real-world problems.
As an example, descriptive statistics aims to quantitatively describe or summarise a sample rather than use the data to learn about the population that the sample represents. A computer scientist might find this approach reductive but could, at the same time, learn from its clearer consideration of objectives. Equally, a statistician’s experience of working on regression and classification could inform the creation of neural networks. Statisticians and computer scientists both benefit from working together to get the most out of their complementary skills.
Statistical modelling, such as regression, is often used in creating data visualisations. Regression analysis is typically used to determine the strength of predictors, forecast trends, and estimate effects, all of which can be represented in graphs. Simple linear regression relates two variables (X and Y) with a straight line; nonlinear regression relates two variables whose relationship is represented by a curve. In data analysis, scatter plots are often used to show various forms of regression. Matplotlib allows you to build scatter plots in Python, and Plotly allows you to construct interactive versions.
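As a small illustration, and assuming synthetic data since no dataset is given here, a simple linear regression can be fitted with NumPy’s least-squares polyfit and drawn as a scatter plot in Matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: Y roughly linear in X with added noise (illustrative only)
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 4 + rng.normal(0, 3, size=x.size)

# Fit a straight line (degree-1 polynomial) by least squares
slope, intercept = np.polyfit(x, y, 1)

plt.scatter(x, y, label="observations")
plt.plot(x, slope * x + intercept, color="red",
         label=f"fit: y = {slope:.2f}x + {intercept:.2f}")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()
```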
Traditionally, statistical analysis has been key to helping us understand demographics through a census – a survey through which the citizens of a country offer up information about themselves and their households. From the United Kingdom, where we have the Office for National Statistics, to New Zealand, where the equivalent public service department is called StatsNZ, these official statistics allow governments to calculate figures such as gross domestic product (GDP). In contrast, Bhutan famously measures Gross National Happiness (GNH).
This mass data collection, mandatory for every household in the UK and traceable back to the Domesday Book in England, could be said to hold the origins of statistics as a scientific field. But it wasn’t until the early 19th century that census data was really used statistically to offer insights into populations, economies, and moral behaviour. It’s why statisticians still refer to any aggregate of objects, events, or observations as the population and use formulae such as the population mean, even when the data set in question has nothing to do with the citizens of a country.
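For reference, the population mean is simply the average taken over every member of the population, written here for N observations:

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
```

where x_1, …, x_N are the N values in the population; applying the same formula to a sample drawn from that population gives the sample mean instead.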
Coronavirus has been monitored closely through statistics since the pandemic began in early 2020. The chi-square test is a statistical method often used in understanding disease because it compares two categorical variables in a contingency table to see whether they are related. This can show, for example, which existing health conditions are associated with more life-threatening cases of Covid-19.
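A minimal sketch of such a test in Python, using SciPy’s chi2_contingency and an entirely made-up 2x2 contingency table (pre-existing condition versus severe outcome), might look like this:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = pre-existing condition (yes / no),
# columns = severe Covid-19 outcome (yes / no). Figures are illustrative only.
table = np.array([[90, 310],
                  [60, 740]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")
# A small p-value suggests the two variables are not independent
```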
Observational studies have also been used to understand the effectiveness of vaccines six months after a second dose, and these studies have shown that effectiveness wanes. Even more ground-breaking initiatives seek to use the technology that most of us hold in our hands every day to support data analysis. The project EAR asks members of the public to use their mobile phones to record the sound of their coughs, breathing, and voices for analysis. Listening to breathing and coughs for indications of illness is not new – it’s what doctors have practised with stethoscopes for decades. What is new is the use of machine learning and artificial intelligence to pick up on what the human ear might miss. There are not yet enough large data sets of the sort needed to train machine learning algorithms for this project, but as the number of audio files grows, there will hopefully be valuable data and statistical information to share with the world.
A career that’s more than just a statistic
Studying data science could make you one of the most in-demand specialists in the job market. Data scientists and data analysts have skills that are consistently valued across different sectors, whether you desire a career purely in tech or want to work in finance, healthcare, climate change research, or space exploration.
Take the first step to upgrading your career options and find out more about starting a part-time, 100% online MSc Computer Science with Data Analytics today.