Data science Archives - University of York

How do algorithms work?

May 23, 2022 / in Articles, Data science

Much of what we do in our day-to-day lives comprises an algorithm: a sequence of step-by-step instructions geared to garner results. In the digital sphere, algorithms are everywhere. They’re the key component of any computer program, built into operating systems to ensure our devices adhere to the correct commands and deliver the right results on request.

An algorithm is a coded formula written into software that, when triggered, prompts the tech to take relevant action to solve a problem. Computer algorithms work via input and output. When data is entered, the system analyses the information given and executes the correct commands to produce the desired result. For example, a search algorithm responds to our search query by working to retrieve the relevant information stored within the data structure.

There are three constructs to an algorithm.

Linear sequence: The algorithm progresses through tasks or statements, one after the other.
Conditional: The algorithm makes a decision between two courses of action, based on the conditions set, for example: if X is equal to 10 then do Y, otherwise do Z.
Loop: The algorithm is made up of a sequence of statements that are repeated a number of times.
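
As a minimal sketch of the three constructs working together, the Python function below runs its statements in sequence, loops over a list and makes a conditional decision for each value. The function name and the "do Y"/"do Z" labels are illustrative assumptions that echo the example above.

```python
def describe_numbers(values):
    """Label each value using a sequence, a loop and a conditional."""
    labels = []                       # sequence: statements run one after another
    for x in values:                  # loop: the body repeats once per value
        if x == 10:                   # conditional: pick one of two courses of action
            labels.append("do Y")
        else:
            labels.append("do Z")
    return labels


print(describe_numbers([3, 10, 7]))
```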

The purpose of an algorithm is to reduce human error and to arrive at a reliable solution, time and time again, as quickly and efficiently as possible. This is useful for tech users, but essential for data scientists, developers, analysts and statisticians, whose work relies on the extraction, organisation and application of complex data sets.

Types of algorithm

Brute force algorithm

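Direct and straight to the point, the brute force algorithm is the simplest and most widely applicable approach: it tries every candidate solution in turn and eliminates the incorrect ones by trial and error. The trade-off is speed, because the number of candidates can grow very quickly with the size of the problem.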
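
A small illustration of the brute force idea, assuming a made-up task of finding two numbers in a list that add up to a target. The function name is hypothetical; the sketch simply tries every pair until one works.

```python
from itertools import combinations


def find_pair_brute_force(numbers, target):
    """Try every pair in turn and return the first one that sums to target."""
    for a, b in combinations(numbers, 2):
        if a + b == target:
            return a, b
    return None                       # every candidate was tried and rejected


print(find_pair_brute_force([4, 9, 1, 6], 10))
```

Every pair is checked, so the work grows roughly with the square of the list length.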

Recursive algorithm

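Recursive algorithms solve a problem by calling themselves on smaller versions of the same problem, repeating the same steps until they reach a simple base case that can be solved directly.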
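
A classic illustration, offered here as a sketch rather than anything specific to this article, is the factorial function: each call hands a smaller number to itself until it reaches the base case.

```python
def factorial(n):
    """Return n! for a non-negative integer n, computed recursively."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n <= 1:                        # base case: solved directly
        return 1
    return n * factorial(n - 1)       # same steps on a smaller problem


print(factorial(5))
```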

Backtracking algorithm

Using a combination of the brute force and recursive approaches, a backtracking algorithm builds candidate solutions incrementally. As the name suggests, when a roadblock is reached, the algorithm retraces or ‘undoes’ its last step and pursues other pathways until a satisfactory result is reached.
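
The sketch below shows backtracking on an assumed example problem: picking numbers from a list of positive integers so that they add up to a target. The function and variable names are hypothetical. Note how a choice is appended, explored and then popped again when it leads to a dead end.

```python
def subset_sum(numbers, target):
    """Return one subset of positive numbers that adds up to target, or None."""
    chosen = []

    def extend(start, remaining):
        if remaining == 0:
            return True                       # satisfactory result reached
        for i in range(start, len(numbers)):
            if numbers[i] > remaining:
                continue                      # roadblock: this choice overshoots
            chosen.append(numbers[i])         # take a step
            if extend(i + 1, remaining - numbers[i]):
                return True
            chosen.pop()                      # undo the step, try another path
        return False

    return list(chosen) if extend(0, target) else None


print(subset_sum([8, 6, 7, 5, 3], 16))
```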

Greedy algorithm

All about getting more juice for the squeeze, greedy algorithms make the choice that looks best at each step, without reconsidering earlier choices. They typically act on the most obvious and immediate information in minimum time, enabling devices to sort through data quickly and efficiently, although the result is not guaranteed to be the best overall solution for every problem. This approach is great for organising complex workflows, schedules or events programmes, for example.
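
To make the scheduling use case concrete, here is a minimal sketch of greedy interval scheduling. The event names and times are invented for illustration. The rule is to always take the event that finishes earliest and does not clash with what has already been chosen; for this particular problem, that simple rule is known to fit in as many events as possible.

```python
def select_events(events):
    """Pick non-overlapping events greedily, earliest finishing time first."""
    schedule = []
    last_end = float("-inf")
    for name, start, end in sorted(events, key=lambda e: e[2]):
        if start >= last_end:         # compatible with everything chosen so far
            schedule.append(name)
            last_end = end
    return schedule


# Hypothetical programme: (name, start hour, end hour)
events = [("keynote", 9, 11), ("workshop", 10, 12),
          ("lunch", 12, 13), ("panel", 11, 14)]
print(select_events(events))
```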

Dynamic programming algorithm

A dynamic programming algorithm remembers the solutions to subproblems it has already solved, and uses this information to arrive at new results. Applicable to more complex problems, the algorithm solves multiple smaller, overlapping subproblems first, storing the solutions for future reference.
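
As a hedged sketch, the example below applies dynamic programming to an assumed coin-change problem: finding the fewest coins that make a given amount. The answer for each smaller amount is stored in a list and reused when working out larger amounts.

```python
def min_coins(coins, amount):
    """Fewest coins needed to make amount, or None if it cannot be made."""
    unreachable = float("inf")
    best = [0] + [unreachable] * amount       # best[a] = fewest coins for amount a
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a and best[a - c] + 1 < best[a]:
                best[a] = best[a - c] + 1     # reuse the stored answer for a - c
    return None if best[amount] == unreachable else best[amount]


print(min_coins([1, 3, 4], 6))
```

With coins of 1, 3 and 4, always taking the largest coin first would use three coins (4 + 1 + 1) to make 6, whereas the stored subproblem answers reveal that two coins (3 + 3) are enough.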

Divide and conquer algorithm

Similar to dynamic programming, this algorithm divides the problem into smaller parts, although here the parts are independent of one another. When the subproblems are solved, their solutions are combined to produce a final result.
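
Merge sort is a well-known example of divide and conquer, sketched here for illustration: the list is split in half, each half is sorted separately, and the two sorted halves are merged.

```python
def merge_sort(items):
    """Return a new sorted list using divide and conquer."""
    if len(items) <= 1:
        return list(items)                    # nothing left to divide
    mid = len(items) // 2
    left = merge_sort(items[:mid])            # solve each half independently
    right = merge_sort(items[mid:])
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):   # combine the two solutions
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sort([5, 2, 9, 1, 5, 6]))
```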

Are algorithms artificial intelligence?

Algorithms define the process of decision-making, whereas artificial intelligence uses data to actually make a decision.

If a computer algorithm is simply a strand of coded instructions for completing a task or solving a problem, artificial intelligence is more of a complex web, comprising groups of algorithms and advancing this automation even more. Continuously learning from the accumulated data, artificial intelligence is able to improve, modify and create further algorithms to produce other unique solutions and strengthen the result. The output is not defined, as with algorithms, but designated. In this way, artificial intelligence enables machines to mimic the complex problem-solving abilities of the human mind.

Artificial intelligence algorithms are what determine your Netflix recommendations and recognise your friends in Facebook photos. They are also called learning algorithms, and typically fall into three types: supervised learning, unsupervised learning and reinforcement learning.

Supervised learning algorithms

In this instance, programmers feed labelled training data (or ‘structured’ data sets) into the computer, pairing each input with the correct answer. The system learns to recognise the relational patterns and deduce the right results automatically for new inputs, based on previous outcomes.
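
One of the simplest supervised methods is the nearest-neighbour classifier, sketched below in plain Python. The training examples and labels are invented for illustration: each new point receives the label of the closest example whose correct answer is already known.

```python
def nearest_neighbour(training_data, point):
    """Predict the label of point from the closest labelled example."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training_data,
                   key=lambda example: squared_distance(example[0], point))
    return label


# Hypothetical labelled data: (features, correct answer)
training_data = [((1.0, 1.0), "small"), ((1.5, 2.0), "small"),
                 ((8.0, 9.0), "large"), ((9.0, 8.5), "large")]
print(nearest_neighbour(training_data, (2.0, 1.5)))
```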

Unsupervised learning algorithms

This is where machine learning starts to speak for itself. A computer is trained with unlabeled (or ‘raw’) input data, and learns to mine for rules, detect patterns and summarise and group data points to help better describe the data to users. The algorithm is used to derive meaningful insights from the data, even if the human expert doesn’t know what they’re looking for.
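
As a rough sketch of grouping unlabelled data, the function below runs a basic k-means loop with two groups on one-dimensional values. The data, the fixed choice of two groups and the starting centres (the smallest and largest values) are simplifying assumptions.

```python
def two_means(values, iterations=10):
    """Split 1-D values into two groups with a basic k-means loop (k = 2)."""
    centres = [min(values), max(values)]      # simple deterministic start
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:                      # assign each value to its nearest centre
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        # move each centre to the mean of its group (keep it if the group is empty)
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return groups


low, high = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 10.1])
print(low, high)
```

No labels are supplied at any point; the two groups emerge from the values alone.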

Reinforcement learning algorithms

This branch of algorithm learns from interactions with the environment, utilising these observations to take actions that either maximise the reward or minimise the risk. Reinforcement learning algorithms allow machines to automatically determine the ideal behaviour within a specific context, in order to maximise their performance.
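
A very small sketch of this reward-driven loop is the epsilon-greedy "bandit" agent below. Everything here is an assumption made for illustration: the environment is just a list of fixed rewards, one per action, and the agent mostly repeats the action it currently believes is best while occasionally exploring another.

```python
import random


def run_bandit(rewards, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a toy bandit with one fixed reward per action."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)          # learned value of each action
    pulls = [0] * len(rewards)

    def act(arm):
        reward = rewards[arm]                 # feedback from the environment
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]  # running mean

    for arm in range(len(rewards)):           # try every action once
        act(arm)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(rewards))                     # explore
        else:
            arm = max(range(len(rewards)), key=lambda a: estimates[a])  # exploit
        act(arm)
    return estimates, pulls


estimates, pulls = run_bandit([0.2, 1.0])
print(estimates, pulls)
```

Real reinforcement learning problems have noisy rewards and changing states; the fixed rewards here only keep the sketch short and repeatable.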

Artificial intelligence algorithms in action

From artificial intelligence powered smartphone apps to autonomous vehicles, artificial intelligence is embedded into our digital reality in a multitude of big and small ways.

Facial recognition software is what enables you to log in to your device in the first place, while apps such as Google Maps and Uber analyse location-based data to map routes, calculate journey times and fares and predict traffic incidents.

From targeted ads to personalised shopping, artificial intelligence algorithms are working to optimise our online experiences, while future applications will see the installation of self-driving cars and artificial intelligence autopilots.

Unmask the secrets of data science

Data is being collected at unprecedented speed and scale, becoming an ever-increasing part of modern life. While ‘big data’ is big business, it is of little use without big insight. The skills required to develop such insight are in short supply, and the expertise needed to extract information and value from today’s data couldn’t be more in demand.

Study the University of York’s 100% online MSc in Computer Science with Data Analytics and enhance your skills in computational thinking, problem-solving and software development, while advancing your knowledge of machine learning, data analytics, data mining and text analysis. Armed with this sought-after specialist knowledge, you’ll graduate from the course with an abundance of career prospects in this lucrative field.

Data architecture: the digital backbone of a business

April 26, 2022 / in Articles, Data science

We are each the sum of our parts, and, in our modern technological age, that includes data. Our search queries, clicking compulsions, subscription patterns and online shopping habits – even the evidence collected from wearable fitness tech – feed into our digital footprint. And, wherever we choose to venture on our online quests, we are constantly being tracked.

Experts claim that we create 2.5 quintillion bytes of data per day with our shared use of digital devices. With the big data analytics market slated to reach a value of $103 billion by 2027, there are no signs of data storage slowing down.

But it’s less about acquisition than application and integration, with poor data quality accounting for a cost of $3.1 trillion per year against the US economy according to market research firm IDC. While device-driven data may be fairly easy to organise and catalogue, human-driven data is more complex, existing in various formats and reliant on much more developed tools for adequate processing. Around 95% of companies can attest that their inability to understand and manage unstructured data is holding them back.

Effective data collection should be conceptual, logical, intentional and secure, and with numerous facets of business intelligence relying on consumer marketplace information, the data processed needs to be refined, relevant, meaningful, easily accessible and up-to-date. Evidently, an airtight infrastructure of many moving parts is needed.

That’s where data architecture comes into the equation.

What is data architecture?

As the term would imply, data architecture is a framework or model of rules, policies and standards that dictate how data is collected, processed, arranged, secured and stored within a database or data system.

It’s an important data management tool that lays an essential foundation for an organisation’s data strategy, acting as a blueprint of how data assets are acquired, the systems this data flows through and how this data is being used.

Companies employ data architecture to dictate and facilitate the mining of key data sets that can help inform business needs, decisions and direction. Essentially, when collected, cleaned and analysed, the data catalogues acquired through the data architecture framework allow key stakeholders to better understand their users, clients or consumers and make data-driven decisions to capitalise on business.

For example, e-commerce companies such as Amazon might specifically monitor online marketing analytics (such as buyer personas and product purchases) to personalise customer journeys and boost sales. On the other hand, finance companies collect big data (such as voice recognition and facial detection) to enhance online security measures.

When data becomes the lifeblood of a company’s potential reach, engagement and impact, having functional and adaptable data architecture can mean the difference between an agile, informed and future-proofed organisation and one that is constantly playing catch-up.

Building blocks: key components of data architecture

We can better visualise data architecture by addressing some of the key components, which act like the building blocks of this infrastructure.

Artificial intelligence (AI) and machine learning models (ML)

Data architecture relies on strong IT solutions. AI and machine learning models are technologies designed to make calculated decisions, and within data architecture they can automate tasks such as data collection and labelling.

Data pipelines

Data architecture is built upon data pipelines, which encompass the entire data moving process, from collection through to data storage, analysis and delivery. This component is essential to the smooth-running of any business. Data pipelines also establish how the data is processed (that is, through a data stream or batch-processing) and the end-point of where the data is moved to (such as a data lake or application).
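
The stages described above can be sketched as three small functions run as a batch. The sample rows, the field names and the list standing in for a data lake are all assumptions made for illustration.

```python
def collect():
    """Stand-in source; a real pipeline might read from files, APIs or devices."""
    return [" alice,34 ", "bob,not-a-number", "Carol,29"]


def clean(raw_rows):
    """Parse each row and drop the ones that cannot be parsed."""
    records = []
    for row in raw_rows:
        name, _, age = row.strip().partition(",")
        if age.isdigit():
            records.append({"name": name.title(), "age": int(age)})
    return records


def store(records, destination):
    """Deliver records to the end-point (a list standing in for a data lake)."""
    destination.extend(records)
    return destination


data_lake = []
store(clean(collect()), data_lake)
print(data_lake)
```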

Data streaming

In addition to data pipelines, the architecture may also employ data streaming. These are data flows that feed from a consistent source to a designated destination, to be processed and analysed in near real-time (such as media/video streaming and real-time analytics).
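
By contrast with batch processing, a stream is handled one item at a time as it arrives. The sketch below uses Python generators to mimic this: a made-up sensor feed yields readings, and a running average is updated after each one without holding the whole stream in memory.

```python
def sensor_stream():
    """Stand-in for a continuous source; yields one reading at a time."""
    for reading in [21.0, 21.5, 23.0, 22.5]:
        yield reading


def running_average(stream):
    """Emit an updated average after every reading."""
    total = 0.0
    count = 0
    for value in stream:
        total += value
        count += 1
        yield total / count


averages = list(running_average(sensor_stream()))
print(averages)
```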

APIs (Application Programming Interfaces)

A method of communication between a requester and a host (usually accessible through an IP address), which can increase the usability and exposure of a service.

Cloud storage

A networked computing model, which allows either public or private access to programs, apps and data via the internet.

Kubernetes

A container or microservice platform that orchestrates computing, networking, and storage infrastructure workloads.

Setting the standard: Key principles of effective data architecture

As we’ve learned, data architecture is a model that sets the standards and rules that pertain to data collection. According to Simplilearn, effective data architecture consists of the following core principles.

Validate all data at point of entry: data architecture should be designed to flag and correct errors as soon as possible.
Strive for consistency: shared data assets should use common vocabulary to help users collaborate and maintain control of data governance.
Everything should be documented: all parts of the data process should be documented, to keep data visible and standardised across an organisation.
Avoid data duplication and movement: this reduces cost, improves data freshness and optimises data agility.
Users need adequate access to data.
Security and access controls are essential.
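
The first principle in the list above, validating data at the point of entry, might look something like the following sketch. The field names and rules are hypothetical; the point is that problems are flagged before a record travels any further.

```python
def validate_record(record):
    """Return a list of problems found in one incoming record (empty if valid)."""
    errors = []
    email = record.get("email", "")
    if "@" not in email:
        errors.append("email: missing '@'")
    age = record.get("age")
    if not isinstance(age, int) or not 0 <= age <= 120:
        errors.append("age: expected a whole number between 0 and 120")
    return errors


print(validate_record({"email": "a@example.com", "age": 30}))
print(validate_record({"email": "nope", "age": -1}))
```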

The implementation and upkeep of data architecture is facilitated by the data architect, a data management professional who provides the critical link between business needs and wider technological requirements.

How is data architecture used?

Data architecture facilitates complex data collection that enables organisations to deepen their understanding of their sector marketplace and their own end-user experience. Companies also use these frameworks to translate their business needs into data and system requirements, which helps them prepare strategically for growth and transformation.

The more any business understands their audience’s behaviours, the more nimble they can become in adapting to ever-evolving client needs. Big data can be used to improve upon customer service, cultivate brand loyalty, and ensure companies are marketing to the right people.

And, it’s not all about pushing products. In terms of real-world impact, a shifting relationship to quality data could improve upon patient-centric healthcare, for example.

Take a dive into big data

Broaden your knowledge of all facets of data science when you enrol on the University of York’s 100% online MSc Computer Science with Data Analytics.

Get to grips with data mining, big data, text analysis, software development and programming, arming you with robust theoretical knowledge to step into the data sector.

Embracing the technological revolution: launching a career in computer programming

April 14, 2022 / in Articles, Data science

With our modern, globalised world so heavily reliant on data and technology, it is now almost impossible to comprehend the impact their absence would have on our lives. The prevalence of data and technology is advancing at an unprecedented speed and scale, fundamentally transforming the ways in which we live and work.

Supporting our increasingly automated lives and lifestyles through data collection, information analysis and knowledge sharing – in an effort to continuously advance and innovate upon existing processes and structures – is of strategic importance.

The UK digital skills shortage

The UK suffers from a critical digital skills shortage. Reports from a number of sources – including the latest report from the Department for Digital, Culture, Media and Sport (DCMS) – reveal that:

almost 20% of UK companies have a skills vacancy, with 14.1% reporting a lack of digital know-how
66% of digital leaders in the UK are unable to keep up with changes due to a lack of talent
the UK tech industry is facing its greatest shortages in cybersecurity, data architecture, and big data and data analysis
only 11% of tech leaders believe the UK is currently capable of competing on a global scale
data analysis is the fastest-growing skills cluster in tech, set to expand by 33% over the next five years
80% of digital leaders feel retention is more difficult post-pandemic due to shifting employee priorities

Evidently, there is a stark need for individuals with the skills and fundamentals necessary to harness technology’s potential, using it to guide, improve and provide insights into today’s global business environments. Millions are being invested to encourage more people to train for roles which require skills such as coding, data analytics, artificial intelligence (AI) and cybersecurity.

Digital skills are considered vital to post-pandemic economic recovery; in competitive, crowded marketplaces, evidence and data are key to guiding decision-making and business efforts. For those considering a career in computer science – whether in big data, web development, application development, programming, or any number of other fields – there has never been a better time to get involved.

Computer programming as a career

Depending on the role, industry and specialism, programmers can expect to undertake a wide-ranging array of tasks. For example:

designing, developing and testing software
debugging, to ensure that operating systems meet industry standards and are secure, reliable and perform as required
integrating systems and software
working alongside designers and other stakeholders to plan software engineering efforts
training end-users
analysing algorithms
scripting and writing code in different languages

The applications for this specialist skill set are vast – and the skills are required in almost every industry and sector. Individuals can work across, for example, websites and web applications, mobile and tablet applications, data structures and video games. Most of us will be familiar with the global, household names of Microsoft, Google and IBM – titans of the computing and technology industry. However, the technological skills and expertise gained from a computer science degree can open doors to careers in any number of businesses and sectors.

Potential career paths and roles could include:

computer programmer
software application developer
front-end/back-end web developer
computer systems engineer
database administrator
computer systems analyst
software quality assurance engineer
business intelligence analyst
network system administrator
data analyst

It’s a lucrative business. The current average salary for a programmer in the UK is £57,500 – a figure that can be well-exceeded with further experience and specialisation. It’s also a career with longevity; while computer programming is of paramount importance today, as the data and digital landscape continues to evolve, it’s only going to be even more important in the future.

What skills are needed as a computer programmer?

In the role of a programmer, it’s essential to combine creativity with the more technical and analytical elements of information systems. It’s a skilled discipline which requires artistry, science, mathematics and logic.

Indeed lists a number of the more common skills required by computer programmers:

Proficiency with programming languages and syntax: While JavaScript is currently the most commonly used programming language, there are also many others, including Python, HTML/CSS, SQL, C++, Java, and PHP. Learning at least two computer programming languages will help to boost employability. Most programmers choose their area of computing specialism and then focus on the most appropriate language for that field.
Learning concepts and applying them to other problems: Take the example of CSS, where styles that are applied to a top-level webpage are then cascaded to other elements on this page. By understanding how programming concepts can be translated elsewhere, multiple issues can be resolved more efficiently.
Solid knowledge of mathematics: For the most part, programming relies on an understanding of mathematics that goes beyond the basics. Possessing solid working knowledge of arithmetic and algebra underpins many aspects of programming proficiency.
Problem-solving abilities: Code is often written and developed in order to create a solution to a problem. As such, having the capabilities to identify and solve problems in the most efficient way possible is a key skill for those working in programming.
Communication and written skills: Explaining how certain processes and results are produced – for example, to stakeholders who may have limited or no programming and technical knowledge – is often a necessary part of the role. The ability to coherently communicate work is vital.

For those interested in developing their skill set, there is a wealth of interactive, online courses and certifications to get started. Typical entry requirements for programming roles include an undergraduate/bachelor’s degree.

Launch a new, fulfilling career in information technology and programming

Kickstart your career in the computing sector with the University of York’s online MSc Computer Science with Data Analytics programme – designed for those without a background in computer science.

This flexible course offers you in-depth knowledge and skills – including data mining and analytics, software development, machine learning and computational thinking – which will help you to excel in a wide variety of technological careers. You’ll also become proficient in a number of programming languages, all available to start from beginner’s level. Your studies will be supported by our experts, and you’ll graduate with a wide array of practical, specialist tools and know-how – ready to capitalise on the current skills shortage.

Artificial intelligence and its impact on everyday life

March 21, 2022 / in Articles, Data science

In recent years, artificial intelligence (AI) has woven itself into our daily lives in ways we may not even be aware of. It has become so pervasive that many remain unaware of both its impact and our reliance upon it.

From morning to night, going about our everyday routines, AI technology drives much of what we do. When we wake, many of us reach for our mobile phone or laptop to start our day. Doing so has become automatic, and integral to how we function in terms of our decision-making, planning and information-seeking.

Once we’ve switched on our devices, we instantly plug into AI functionality such as:

face ID and image recognition
emails
apps
social media
Google search
digital voice assistants like Apple’s Siri and Amazon’s Alexa
online banking
driving aids – route mapping, traffic updates, weather conditions
shopping
leisure downtime – such as Netflix and Amazon for films and programmes

AI touches every aspect of our personal and professional online lives today. Global communication and interconnectivity in business is, and continues to be, a hugely important area. Capitalising on artificial intelligence and data science is essential, and its potential growth trajectory is limitless.

Whilst AI is accepted as almost commonplace, what exactly is it and how did it originate?

What is artificial intelligence?

AI is the intelligence demonstrated by machines, as opposed to the natural intelligence displayed by both animals and humans.

The human brain is the most complex organ, controlling all functions of the body and interpreting information from the outside world. Its neural networks comprise approximately 86 billion neurons, all woven together by an estimated 100 trillion synapses. Even now, neuroscientists are yet to unravel and understand many of its ramifications and capabilities.

The human being is constantly evolving and learning; this mirrors how AI functions at its core. Human intelligence, creativity, knowledge, experience and innovation are the drivers for expansion in current, and future, machine intelligence technologies.

When was artificial intelligence invented?

During the Second World War, work by Alan Turing at Bletchley Park on code-breaking German messages heralded a seminal scientific turning point. His groundbreaking work helped develop some of the basics of computer science.

In 1950, Turing asked whether machines could think for themselves. This radical idea, together with the growing implications of machine learning in problem solving, led to many breakthroughs in the field. Research explored the fundamental possibilities of whether machines could be directed and instructed to:

think
understand
learn
apply their own ‘intelligence’ in solving problems like humans.

Computer and cognitive scientists, such as Marvin Minsky and John McCarthy, recognised this potential in the 1950s. Their research, which built on Turing’s, fuelled exponential growth in this area. Attendees at a 1956 workshop, held at Dartmouth College, USA, laid the foundations for what we now consider the field of AI. Dartmouth is recognised as one of the world’s most prestigious academic research universities, and many of those present became artificial intelligence leaders and innovators over the coming decades.

In testimony to his groundbreaking research, the Turing Test – in its updated form – is still applied to today’s AI research, and is used to gauge the measure of success of AI development and projects.

This infographic detailing the history of AI offers a useful snapshot of these main events.

How does artificial intelligence work?

AI is built upon acquiring vast amounts of data. This data can then be manipulated to determine knowledge, patterns and insights. The aim is to create and build upon all these blocks, applying the results to new and unfamiliar scenarios.

Such technology relies on advanced machine learning algorithms and extremely high-level programming, datasets, databases and computer architecture. The success of specific tasks is, amongst other things, down to computational thinking, software engineering and a focus on problem solving.

Artificial intelligence comes in many forms, ranging from simple tools like chatbots in customer services applications, through to complex machine learning systems for huge business organisations. The field is vast, incorporating technologies such as:

Machine Learning (ML). Using algorithms and statistical models, ML refers to computer systems which are able to learn and adapt without following explicit instructions. ML draws inferences from patterns in data, and is split into three main types: supervised, unsupervised and reinforcement learning.
Narrow AI. This is integral to modern computer systems, referring to those which have been taught, or have learned, to undertake specific tasks without being explicitly programmed to do so. Examples of narrow AI include: virtual assistants on mobile phones, such as Siri on Apple iPhone and Google Assistant on Android; and recommendation engines which make suggestions based on search or buying history.
Artificial General Intelligence (AGI). At times, the worlds of science fiction and reality appear to blur. Hypothetically, AGI – exemplified by the robots in programmes such as Westworld, The Matrix, and Star Trek – has come to represent the ability of intelligent machines which understand and learn any task or process usually undertaken by a human being.
Strong AI. This term is often used interchangeably with AGI. However, some artificial intelligence academics and researchers believe it should apply only once machines achieve sentience or consciousness.
Natural Language Processing (NLP). This is a challenging area of AI within computer science, as it requires enormous amounts of data. Expert systems and data interpretation are required to teach intelligent machines how to understand the way in which humans write and speak. NLP applications are increasingly used, for example, within healthcare and call centre settings.
DeepMind. As major technology organisations seek to capture the machine learning market, they are developing cloud services to tap into sectors such as leisure and recreation. For example, Google’s DeepMind has created a computer programme, AlphaGo, to play the board game Go, whereas IBM’s Watson is a supercomputer which famously took part in a televised Jeopardy! challenge. Using NLP, Watson answered questions with identifiable speech recognition and response, causing a stir in public awareness regarding the potential future of AI.

Artificial intelligence career prospects

Automation, data science and the use of AI will only continue to expand. Forecasts for the data analytics industry up to 2023 predict exponential expansion in the big data gathering sector. In The Global Big Data Analytics Forecast to 2023, Frost and Sullivan project growth at 29.7%, worth a staggering $40.6 billion.

As such, there exists much as-yet-untapped potential, with growing career prospects. Many top employers seek professionals with the skills, expertise and knowledge to propel their organisational aims forward. Career pathways may include:

Robotics and self-driving/autonomous cars (such as Waymo, Nissan, Renault)
Healthcare (for instance, multiple applications in genetic sequencing research, treating tumours, and developing tools to speed up diagnoses including Alzheimer’s disease)
Academia (leading universities in AI research include MIT, Stanford, Harvard and Cambridge)
Retail (Amazon Go shops and other innovative shopping options)
Banking
Finance

What is certain is that with every technological shift, new jobs and careers will be created to replace those lost.

Gain the qualifications to succeed in the data science and artificial intelligence sector

Are you ready to take your next step towards a challenging, and professionally rewarding, career?

The University of York’s online MSc Computer Science with Data Analytics programme will give you the theoretical and practical knowledge needed to succeed in this growing field.

What is data visualisation?

February 22, 2022 / in Articles, Data science

Data visualisation, sometimes abbreviated to dataviz, is a step in the data science process. Once data has been collected, processed, and modelled, it must be visualised for patterns, trends, and conclusions to be identified from large data sets.

Used interchangeably with the terms ‘information graphics’, ‘information visualisation’ and ‘statistical graphs’, data visualisation translates raw data into a visual element. This could be in a variety of ways, including charts, graphs, or maps.

The use of big data is on the rise, and many businesses across all sectors use data to drive efficient decision making in their operations. As the use of data continues to grow in popularity, so too does the need to be able to clearly communicate data findings to stakeholders across a company.

The importance of effective data visualisation

When data is presented to us in a spreadsheet or in its raw form, it can be hard to draw quick conclusions without spending time and patience on a deep dive into the numbers to understand results. However, when information is presented to us visually, we can quickly see trends and outliers.

A visual representation of data allows us to internalise it, and be able to understand the story that the numbers tell us. This is why data visualisation is important in business – the visual art communicates clearly, grabs our interest quickly, and tells us what we need to know instantly.

In order for data visualisation to work effectively, the data and the visual must work in tandem. Rather than choosing a stimulating visual which fails to convey the right message, or a plain graph which doesn’t show the full extent of the data findings, a balance must be found.

Every data analysis is unique, and so a one-size-fits-all approach doesn’t work for data visualisation. Choosing the right visual method to communicate a particular dataset is important.

Choosing the right data visualisation method

There are many different types of data visualisation methods. So, there is something to suit every type of data. While your knowledge of some of these methods may span back to your school days, there may be some which you are yet to encounter.

There are also many different data visualisation tools available, with free options including Google Charts and Tableau Public.

Examples of data visualisation methods:

Charts: data is represented by symbols – such as bars in a bar chart, lines in a line chart, or slices in a pie chart.
Tables: data is held in a table format within a database, consisting of columns and rows – this format is seen most commonly in Microsoft Excel sheets.
Graphs: diagrams which show the relation between two variable quantities which are measured along two axes (usually x-axis and y-axis) at right angles.
Maps: used most often to display location data, advancements in technology mean that maps are often digital and interactive which offers more valuable context of the data.
Infographics: a visual representation of information, infographics can include a variety of elements including images, icons, texts and charts which conveys more than one key piece of information quickly and clearly.
Dashboards: graphical user interfaces which provide at-a-glance views of key performance indicators relevant to a particular objective or business process.
Scatter plots: represent values for two different numerical variables by using dots to indicate the values of each individual data point on a graph with a horizontal and vertical axis.
Bubble charts: an extension of scatter plots which displays three dimensions of data – two values in their dot placement, and a third value through its size.
Histograms: a graphical representation which looks similar to a bar graph but condenses large data sets by grouping data points into logical ranges.
Heat maps: show the magnitude of a phenomenon as a variation of two colour dimensions which gives cues on how the phenomenon is clustered or varied over physical space.
Treemaps: use nested figures – typically rectangles – to display large amounts of hierarchical data.
Gantt charts: a type of bar chart which illustrates a project schedule, showing the dependency relationships between activities and current schedule status.
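To make one of the methods above concrete, the grouping step behind a histogram can be sketched in a few lines of Python. This is a minimal illustration rather than a charting tool: the function name `histogram_bins`, the bin width of 10 and the sample ages are all assumptions made for the example.

```python
from collections import Counter


def histogram_bins(values, bin_width):
    """Group numeric values into equal-width ranges and count each range."""
    # Each value is mapped to the lower edge of the range it falls into.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return {(start, start + bin_width): counts[start] for start in sorted(counts)}


# Hypothetical ages, used only for illustration.
ages = [21, 25, 29, 32, 34, 35, 41, 47, 52, 58]
bins = histogram_bins(ages, 10)

# Print a simple text histogram: one '#' per data point in each range.
for (low, high), count in bins.items():
    print(f"{low}-{high - 1}: {'#' * count}")
```

Ten individual values are condensed into four ranges, which is the property that lets a histogram summarise large data sets at a glance.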

Data visualisation and the Covid-19 pandemic

The Covid-19 outbreak was unprecedented in our lifetimes. Because of the scale of the virus, its impact on our daily lives, and the abruptness of the change, public health messages and evolving information on the situation were often communicated through data visualisation.

Being able to visually see the effects of Covid-19 enabled us to try to make sense of a situation we weren’t prepared for.

As Eye Magazine outlines in the article ‘The pandemic that launched a thousand visualisations’: ‘Covid-19 has generated a growth in information design and an opportunity to compare different ways of visualising data’.

The Johns Hopkins University (JHU) Covid-19 Dashboard included key statistics alongside a bubble map to indicate the spread of the virus. A diagram from the Imperial College London Covid-19 Response Team was influential in communicating the need to ‘flatten the curve’. Line graphs from the Financial Times created visual representations of how values such as case numbers by country changed from the start of the outbreak to the present day.

On top of this, data scientists within the NHS digital team built their capabilities in data and analytics, business intelligence, and data dashboards quickly to evaluate the rates of shielded patients, e-Referrals, and Covid-19 testing across the UK.

The use of data visualisation during the pandemic is a case study which will likely hold a place in history. Not only did these visualisations capture new data as it emerged and translate it for the rest of the world, they will also live on as scientists continue to make sense of the outbreak and the prevention of it happening again.

Make your mark with data visualisation

If you have ambitions to become a data analyst who could play an important role in influencing decision making within a business, an online MSc Computer Science with Data Analytics will give you the skills you need to take a step into this exciting industry.

This University of York Masters programme is studied part-time around your current commitments, and you’ll gain the knowledge you need to succeed. Skilled data analytics professionals are in high demand as big data continues to boom, and we’ll prepare you for a successful future.

What is data mining?

January 26, 2022/in About York, Data science /

Data has been used to identify sequential patterns, correlations and trends throughout history, but it was only in the 1990s that the term ‘data mining’ was coined. As digital technologies grow and evolve at unprecedented speed, so too do the methods and materials used to collect and interpret data.

As data mining – also known as knowledge discovery in data (KDD) – can be used to accurately predict certain outcomes, it is no surprise that more businesses than ever now use this practice in their day-to-day operations. This business intelligence practice isn’t only limited to tech giants like Google and Amazon. Data mining is used across all sectors, including retailers, banks, manufacturers, and telecommunications providers to name a few.

By using automated data analysis, businesses build an understanding of how their customers interact with their services or the popularity of products. It can also be used to keep track of the economy, risk and competition. Using this branch of data science in business can create stability and give a clear steer in managerial decision-making, as well as generating large financial gains and growth.

The foundation of data mining is underpinned by three scientific disciplines:

statistics – the numeric study of data relationships
artificial intelligence – software and/or machines which display human-like intelligence
machine learning – algorithms which can learn to make predictions through data

How data mining is used in business

Different volumes of data are captured from multiple different teams within a business. Data warehousing across an organisation is an efficient way to have all data stored centrally. Data mining uses the information in centrally stored databases to glean insights on the past, present and future of an organisation’s operations and output.

By using data, marketing campaigns can be optimised to improve segmentation, cross-sell offers, and target customers more directly, thereby increasing return on investment (ROI). Data collected in a marketing campaign can also be used by sales teams, and data mining can provide useful information on the customers most likely to convert into sales for a more efficient process.

Data mining techniques can also be used to reduce costs across a business’s operational functions by identifying glitches in processes and aiding more thoughtful decision making.

Primarily used in banking and financial institutions, data mining can be used for fraud detection as data anomalies can highlight risks quickly.

How does data mining work?

The first step of effective data mining is data collection. Many businesses have a data warehouse where a large collection of business data is stored and used to help make effective business decisions.

When undertaking data mining projects, the Cross-Industry Standard Process for Data Mining (CRISP-DM) is a flexible six-phase workflow which is frequently used as a guideline for the data mining process.

The CRISP-DM phases are:

business understanding – identifying project objectives and scope, and uncovering a question or problem that data mining can answer or solve
data understanding – collecting the raw data relevant to the question, which often comes from multiple sources and may include both structured and unstructured data, and initial exploratory analysis to select the subset of data for analysis and modeling
data preparation – identifying the dimensions and variables to explore, and preparing the final data set for model creation
modeling – selecting the appropriate modeling technique for the data set
evaluation – testing and measuring the model on its success at answering the question or solving the problem outlined in phase one, and editing the model or the question whilst assessing the progress to ensure it’s on the right track
deployment – deploying the model into the real world once it is accurate and reliable through a well thought out roll-out plan

Data mining modeling techniques

There are three main data mining modeling techniques in use today.

Descriptive modeling

This data mining technique uncovers shared similarities or groupings in historical data to determine answers to set questions on successes or failures. Within this technique, methods can include:

clustering – which groups similar records together as part of data mining applications
anomaly detection – which identifies any outliers
association rule learning – which detects relationships between records within the data set
principal component analysis – which detects relationships between variables
affinity grouping – which groups the data of individuals together through their common interests or similar goals
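As a rough illustration of association rule learning from the list above, the sketch below computes two commonly used measures, support and confidence, for a rule of the form "customers who buy X also buy Y". The function names and the shopping baskets are hypothetical and chosen only for the example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given that `antecedent` is present."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets, used only for illustration.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
]

rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here bread and butter appear together in half of the baskets, and two of the three baskets containing bread also contain butter. Real data mining tools apply the same idea across very large numbers of records.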

Predictive modeling

Predictive analytics can be used to predict events and outcomes in the future and this technique can help uncover insights relating to any business. Within this technique, methods can include:

regression – which measures the strength of a relationship between one dependent variable and a series of independent variables
neural networks – which are computer programs that detect patterns, make predictions, and learn from them (a neural network with three or more layers is considered deep learning)
decision trees – which are tree-shaped diagrams with each branch representing different probable outcomes
support vector machines – which are supervised learning models with associated learning algorithms

Prescriptive modeling

This model uses a combination of techniques and tools and applies them against input from many different small and large data sets including historical data, real-time data, big data, and text mining which filters and transforms unstructured data from the web, social media, comment fields, books, email, PDFs, audio and other text sources. Within this technique, methods can include:

prescriptive analysis plus rules – which develops if/then rules from patterns and predicts outcomes
marketing optimisation – which simulates the most successful media mix for marketing campaigns to create the highest ROI

Does data mining require coding?

There are a wide range of data mining software and tools available, from open source programming languages such as R and Python to familiar tools like Excel.

As programming languages are a key part of manipulating, analysing and visualising data, data miners need an understanding of these languages to be able to use data mining tools efficiently for knowledge discovery in databases.

Become an integral part of the modern workplace

Data scientists are in high demand across businesses in a range of sectors, so by studying a specialised degree you will be setting yourself up for a successful future in this ever-evolving field.

Learn how to solve business problems, develop key skills in data management and database systems, and further your understanding of data mining algorithms on the University of York’s 100% online MSc Computer Science with Data Analytics.

Studying part-time and around your own commitments, you can continue to earn as you learn and apply your learning to your current role.

The use of statistics in data science

December 20, 2021/in Articles, Data science /

Statistics is the study of data. It’s considered a mathematical science and it involves the collecting, organising, and analysing of data with the intent of deriving meaning, which can then be actioned. Our everyday usage of the internet and apps across our phones, laptops, and fitness trackers has created an explosion of information that can be grouped into data sets and offer insights through statistical analysis. Add to this the 5.6 billion searches a day made on Google alone, and big data analytics is big business.

Although we may hear the phrase data analytics more than we hear reference to statistics nowadays, for data scientists, data analysis is underpinned by knowledge of statistical methods. Machine learning takes out a lot of the statistical methodology that statisticians would usually use. However, a foundational understanding of some basics in statistics supports strategy in exercises like hypothesis testing. Statistics contribute to technologies like data mining, speech recognition, vision and image analysis, data compression, artificial intelligence, and network and traffic modelling.

When analysing data, probability is one of the most used statistical testing criteria. Being able to predict the likelihood of something happening is important in numerous scenarios, from understanding how a self-driving car should react in a collision to recognising the signs of an upcoming stock market crash. A common use of probability in predictive modelling is forecasting the weather, a practice which has been refined since it first arose in the 19th century. For data-driven companies like Spotify or Netflix, probability can help predict what kind of music you might like to listen to or what film you might enjoy watching next.

Aside from our preferences in entertainment, research has recently been focused on the ability to predict seemingly unpredictable events such as a pandemic, an earthquake, or an asteroid strike. Because of their rarity, these events have historically been difficult to study through the lens of statistical inference – the sample size can be so small that the variance is pushed towards infinity. However, “black swan theory” could help us navigate unstable conditions in sectors like finance, insurance, healthcare, or agriculture, by knowing when a rare but high-impact event is likely to occur.

The black swan theory was developed by Nassim Nicholas Taleb, who is a critic of the widespread use of the normal distribution model in financial engineering. In finance, the coefficient of variation is often used in investment to assess volatility and risk, which may appeal more to someone looking for a black swan. In computer science though, normal distributions, standard deviation, and z-scores can all be useful to derive meaning and support predictions.
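Two of the measures mentioned here are simple to compute. A z-score expresses how many standard deviations a value lies from the mean, and the coefficient of variation is the standard deviation divided by the mean. The sketch below uses Python's built-in statistics module and the population standard deviation; the sample values are made up for illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Distance of each value from the mean, in population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def coefficient_of_variation(values):
    """Population standard deviation relative to the mean."""
    return pstdev(values) / mean(values)


# Hypothetical observations, used only for illustration.
observations = [2, 4, 4, 4, 5, 5, 7, 9]

scores = z_scores(observations)
cv = coefficient_of_variation(observations)
print(scores)
print(cv)
```

For this sample the mean is 5 and the population standard deviation is 2, so the largest value, 9, sits two standard deviations above the mean.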

Some computer science-based methods that overlap with elements of statistical principles include:

Time series, ARMA (autoregressive moving average) processes, correlograms
Survival models
Markov processes
Spatial and cluster processes
Bayesian statistics
Some statistical distributions
Goodness-of-fit techniques
Experimental design
Analysis of variance (ANOVA)
A/B and multivariate testing
Random variables
Simulation using Markov Chain Monte-Carlo methods
Imputation techniques
Cross validation
Rank statistics, percentiles, outlier detection
Sampling
Statistical significance

While statisticians tend to incorporate theory from the outset into solving problems of uncertainty, computer scientists tend to focus on the acquisition of data to solve real-world problems.

As an example, descriptive statistics aims to quantitatively describe or summarise a sample rather than use the data to learn about the population that the data sample represents. A computer scientist may perhaps find this approach to be reductive, but, at the same time, could learn from the clearer consideration of objectives. Equally, a statistician’s experience of working on regression and classification could potentially inform the creation of neural networks. Both statisticians and computer scientists can benefit from working together in order to get the most out of their complementary skills.

In creating data visualisations, statistical modelling, such as regression models, is often used. Regression analysis is typically used in determining the strength of predictors, trend forecasting, and forecasting an effect, which can be represented in graphs. Simple linear regression relates two variables (X and Y) with a straight line. Nonlinear regression relates two variables in a nonlinear relationship, represented by a curve. In data analysis, scatter plots are often used to show various forms of regression. Matplotlib allows you to build scatter plots using Python; Plotly will allow the construction of an interactive version.
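As a minimal sketch of simple linear regression, the straight line relating X and Y can be fitted with ordinary least squares using nothing beyond plain Python. The function name `fit_line` and the data points are assumptions made for this example; in practice the fitted line would usually be drawn over a scatter plot of the same points.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the shared 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.1f}x + {intercept:.1f}")
```

Because the sample points lie exactly on a line, the fit recovers a slope of 2 and an intercept of 1. With noisy real-world data the same calculation gives the line that minimises the squared vertical distances to the points.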

Traditionally, statistical analysis has been key in helping us understand demographics through a census – a survey through which citizens of a country offer up information about themselves and their households. From the United Kingdom, where we have the Office for National Statistics, to New Zealand, where the equivalent public service department is called StatsNZ, these official statistics allow governments to calculate data such as gross domestic product (GDP). In contrast, Bhutan famously measures Gross National Happiness (GNH).

This mass data collection, mandatory upon every household in the UK, which goes back to the Domesday Book in England, could be said to hold the origins of statistics as a scientific field. But it wasn’t until the early 19th century that the census was really used statistically to offer insights into populations, economies, and moral actions. It’s why statisticians still refer to an aggregate of objects, events or observations as the population and use formulae like the population mean, which doesn’t have to refer to a dataset that represents citizens of a country.

Coronavirus has been consistently monitored through statistics since the pandemic began in early 2020. The chi-square test is a statistical method often used in understanding disease because it allows the comparison of two variables in a contingency table to see if they are related. This can show which existing health issues could cause a more life-threatening case of Covid-19, for example.
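The chi-square statistic for a contingency table can be sketched as follows. Each observed count is compared with the count expected if the two variables were unrelated, and the scaled squared differences are summed. The table below is entirely hypothetical and does not represent real Covid-19 data.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two patient groups, columns are two outcomes.
observed_counts = [
    [30, 10],
    [20, 40],
]

statistic = chi_square_statistic(observed_counts)
print(round(statistic, 2))
```

A 2 x 2 table has one degree of freedom, for which the critical value at the 5% significance level is 3.841. A statistic above that value suggests the two variables are related, although it says nothing about cause.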

Observational studies have also been used to understand the effectiveness of vaccines six months after a second dose. These studies have shown that effectiveness wanes. Even more ground-breaking initiatives are seeking to use the technology that most of us hold in our hands every day to support data analysis. The project EAR asks members of the public to use their mobile phones to record the sound of their coughs, breathing, and voices for analysis. Listening to the breath and coughs to catch an indication of illness is not new – it’s what doctors have practised with stethoscopes for decades. What is new is the use of machine learning and artificial intelligence to pick up on what the human ear might miss. There are currently not enough large data sets of the sort needed to train machine learning algorithms for this project. However, as the number of audio files increases, there will hopefully be valuable data and statistical information to share with the world.

A career that’s more than just a statistic

Studying data science could make you one of the most in-demand specialists in the job market. Data scientists and data analysts have skills that are consistently valued across different sectors, whether you desire a career purely in tech or want to work in finance, healthcare, climate change research, or space exploration.

Take the first step to upgrading your career options and find out more about starting a part-time, 100% online MSc Computer Science with Data Analytics today.

What is data science?

October 12, 2021/in Articles, Data science /

Data science includes the fields of statistics, scientific methods, artificial intelligence (AI) and data analysis. Every day, huge amounts of data are collected from our usage of the internet, our phones, the Internet of Things, and other objects embedded with sensors that provide data sources. This mass of information can be used by data scientists to create algorithms for data mining and data analysis from machine learning.

Once machines are conversant in what they’re looking for, they can potentially create their own algorithms from looking at raw data. This is called deep learning, which is a subset of machine learning. It usually requires some initial supervised learning techniques, for example, allowing the machine to scan through labelled datasets created by data scientists. However, because the machine will be powered by a neural network of at least three layers, its thinking simulates that of the human brain, and it can start noticing patterns beyond its specific training.

What’s the difference between data science and computer science?

Computer science and data science are sometimes used interchangeably. Pure computer science is focused on software, hardware and offering advances in the capacity of what computers can do with data collection. Data science is more interdisciplinary in scope, and involves aspects of computer science, mathematics, statistics and knowledge of specific fields of enquiry. Computer scientists use programming languages like Java, Python, and Scala. Data analysts are likely to have basic knowledge of SQL – the standard language for communicating with databases – as well as potentially R or Python for statistical programming.

Data analytics is concerned with telling a compelling story based on data that’s been sorted by machine learning algorithms. Although data analysts are expected to have some programming skills, their role is more concerned with interpreting and presenting clear and easily understandable data visualisations. This could be data that supports an argument, or data that proves an assumption wrong.

Data engineers are part of the data analytics team, but they work further up the pipeline (or lifecycle as it’s sometimes known) overseeing and monitoring data retrieval, as well as storage and distribution. Hadoop is the most used framework for storing and processing big data. The Hadoop distributed file system (HDFS) means that data can be split and saved across clusters of servers. This is economical and easily scalable as data grows. The MapReduce functional programming model adds speed to the equation. MapReduce performs parallel processing across datasets rather than sequential processing, which significantly speeds things up.
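The MapReduce idea can be illustrated with a word count written in plain Python. This sketch runs in a single process and does not use Hadoop; it only mirrors the shape of the model, in which a map step emits key-value pairs, the pairs are grouped by key, and a reduce step combines each group. The function names and documents are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into a single total."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


# Hypothetical documents; on a cluster each could be mapped on a different server.
documents = ["big data is big", "data is everywhere"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each document is mapped independently of the others, the map step is the part that can be spread across many machines and run in parallel.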

Why data science is important

Data science is most commonly used for predictive analysis. This helps with forecasting and decision-making in a wide spectrum of areas from weather to cybersecurity, risk assessment and FinTech. Statistical analysis helps businesses make decisions with confidence in an increasingly unpredictable world. It also offers up insight into broader trends or helps zero-in on a particular consumer segment, which can give businesses a competitive advantage. Big names like McKinsey and Nielsen use data to report on larger sector-wide trends and provide analysis on the effects of geopolitical and socio-economic events. Many organisations pay good money for these reports so that they can plan and stay ahead of the curve.

In the 21st century, AI and big data are revolutionising whole industries such as insurance, financial services and logistics. In healthcare, big data enables faster identification of high-risk patients, more effective interventions, and closer monitoring. Public transport networks can function more economically and sustainably thanks to data analysis. As the climate crisis increases the frequency of extreme weather, improved forecasting can help to mitigate the worst of the damage.

Data science is the fastest growing job area on LinkedIn and is predicted to create 11.5 million jobs by 2026 according to the US Bureau of Labor Statistics. Many leading tech-based companies like LinkedIn, Facebook, Netflix and Deliveroo rely heavily on data science and are driving demand for analysts.

How to learn data science

Data science tutorials can be found all over the internet and you can get a reasonable understanding of how it works from these, as well as certification – for example, from Microsoft on Azure. However, for professionals, a qualification like an MSc Data Science or a postgraduate degree in an associated subject area like an MSc Computer Science with Data Analytics is highly valued by employers. This can be studied full-time or part-time while you gain work experience in the area you wish to specialise in. Academia can only take you so far in understanding the theory but working hands-on in the world of data science will help you in the practice of this subject, honing your skills. It’s not one of the prerequisites for taking on a role, but it will help you stand out from the crowd in a competitive job market.

Data science is a burgeoning field that can complement most of the social sciences and there is an increasing demand for expertise in this area. Data scientists can come from a wide variety of backgrounds such as the fields of psychology, sociology, economics, and political science, because data and statistics are valuable and applicable to all these areas.

A score of 6.5 in IELTS (the International English Language Testing System) is one of the entry requirements for a degree in data science or computer science. This is because English is considered a first language in data science internationally, but also because natural language processing works off the English language as a primary reference point when programming in Python.

Ready to discover more about the world of data?

If you’ve reached a point in your career where you want to specialise in data analytics, now is the time to explore an online MSc Computer Science with Data Analytics from the University of York. Offering knowledge in machine learning, data analytics, data mining and text analysis, you’ll also create your own data analytics project in an area of interest.

Find out more about the six start dates throughout the academic year and plan your future.

What you need to know about blockchain

September 15, 2021/in About York, Artificial intelligence, Data science /

Blockchain technology is best known for its role in fintech and making cryptocurrency a reality, but what is it?

Blockchain is a database that stores information in a string of blocks rather than in tables, and which can be decentralised by being made public. Bitcoin, one of the most talked about and unpredictable cryptocurrencies, uses blockchain as does Ether, the currency of Ethereum.

Although cryptocurrencies have been linked with criminal activity, blockchain’s mechanism of storing data with time stamps offers transparency and traceability. Although central banks and financial institutions have been wary of the lack of regulation, retailers are increasingly accepting Bitcoin transactions. It’s said that Bitcoin founder, Satoshi Nakamoto, created the cryptocurrency as a response to the 2008 financial crash. It was a way of circumventing financial institutions by saving and transferring digital currency in a peer-to-peer network without the involvement of a central authority.

Ethereum is a blockchain network that helped shift the focus away from cryptocurrencies when it opened in 2015 by offering general purpose blockchain that can be used in different ways. In a white paper written in 2013, the founder of Ethereum, Vitalik Buterin, wrote about the need for application development beyond the blockchain technology of Bitcoin, that would lead to attachment to real-world assets such as stocks and property. Ethereum blockchain has also provided the ability to create and exchange non-fungible tokens (NFTs). NFTs are mainly known as digital artworks but can also be digital assets, such as viral video clips, gifs, music, or avatars. They’re attractive because once bought, the owner has exclusive rights to the content. They also protect the intellectual property of the artist by being tamper-proof.

There has recently been a lot of hype around NFTs because the piece Everydays: The First 5000 Days by digital artist Beeple (Mike Winkelmann) sold for a record-breaking $69,346,250 at auction. That’s the equivalent of 42,329 Ether, which was what Vignesh Sundaresan, owner of Metapurse, used to purchase the piece that combines 5,000 images created and collated over 13 years. NFTs may seem like a new technology but they’ve actually been around since 2014.

IOTA is the first cryptocurrency to make possible free micro-transactions between Internet of Things (IoT) objects. While Ethereum moved the focus away from cryptocurrency, IOTA is looking to move cryptocurrency beyond blockchain. By using a Directed Acyclic Graph called the Tangle, IOTA eliminates any need for miners, allows for near-infinite scaling, and removes fees entirely.

How blockchain works

Blockchain applications are many and varied including the decentralisation of financial services, healthcare, internet browsing, real estate, government, voting, music, art, and video games. Blockchain solutions are increasingly utilised across industries, for example, to provide transparency in the supply chain, or in lowering administrative overheads with smart contracts.

But how does it actually work? Blockchain uses several technologies including distributed ledger technology, digital signatures, distributed networks and encryption methods to link the blocks of the ledger for record-keeping. Information is collected in groups which make up the blocks. The blocks have certain capacities which, once filled, become chained to the previously filled block. This creates a timeline because each block is given a timestamp which cannot be overwritten.
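The chaining described above can be sketched with Python's hashlib module. This is a simplified illustration and not how any particular blockchain is implemented: there is no network, consensus or digital signature, and the block fields, function names and sample transactions are assumptions made for the example.

```python
import hashlib
import json


def block_hash(block):
    """SHA-256 fingerprint of a block's full contents."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def add_block(chain, data, timestamp):
    """Append a block that records the hash of the block before it."""
    previous_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({
        "index": len(chain),
        "timestamp": timestamp,
        "data": data,
        "previous_hash": previous_hash,
    })
    return chain


# Hypothetical records and Unix timestamps, used only for illustration.
chain = []
add_block(chain, ["first block"], 1_600_000_000)
add_block(chain, ["alice pays bob 5"], 1_600_000_600)
add_block(chain, ["bob pays carol 2"], 1_600_001_200)

for block in chain:
    print(block["index"], block["timestamp"], block["previous_hash"][:12])
```

Each block stores the hash of its predecessor, timestamp included, so the blocks form an ordered timeline in which every entry depends on everything that came before it.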

The benefits of blockchain are seen not just in cryptocurrencies but in legal contracts and stock inventories as well as in the sourcing of products such as coffee beans. There are notoriously many steps between coffee leaving the farm where it was grown and reaching your coffee cup. Because of the complexity of the coffee market, coffee farmers often only receive a fraction of what the end-product is worth. Consumers also increasingly want to know where their coffee has come from and that the farmer received a fair price. Initially used as an effective way to cut out the various middlemen and streamline operations, blockchain is now being used as an added reassurance for supermarket customers. In 2020, Farmer Connect partnered with Folger’s coffee in using the IBM blockchain platform to connect producers with customers. A simple QR code helps consumers see how the coffee they hold in their hand was brought to the shelf. Walmart is another big name providing one of many case studies for offering transparency with blockchain by using distributed ledger software called Hyperledger Fabric.

Are blockchains hackable?

In theory, blockchains are hackable. However, the time and resources – including a vast network of computers – needed to achieve a successful hack are beyond the average hacker. Even if a hacker did manage to simultaneously control and alter 51% of the copies of the blockchain in order to gain control of the ledger and make their own copy the majority copy, each block would then have different timestamps and hash codes (the outputs of a cryptographic hash algorithm). The deliberate design of blockchain – using decentralisation, consensus, and cryptography – makes it impossible to alter the chain without it being noticed by others and irreversibly changing the data along the whole chain.
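Why an alteration gets noticed can be shown with a small hash chain. In this sketch each block stores a hash computed from its own data and the previous block's hash, so changing an old record makes the stored hash stop matching. It is a toy model under stated assumptions: the function names and records are hypothetical, and decentralisation and consensus are left out entirely.

```python
import hashlib


def digest(data, previous_hash):
    """Hash of a block's data combined with the previous block's hash."""
    return hashlib.sha256(f"{previous_hash}|{data}".encode("utf-8")).hexdigest()


def build_chain(records):
    """Create a list of blocks, each linked to the one before it."""
    chain, previous_hash = [], "0" * 64
    for data in records:
        current = digest(data, previous_hash)
        chain.append({"data": data, "previous_hash": previous_hash, "hash": current})
        previous_hash = current
    return chain


def first_invalid_block(chain):
    """Return the index of the first block that fails verification, or None."""
    previous_hash = "0" * 64
    for index, block in enumerate(chain):
        recomputed = digest(block["data"], previous_hash)
        if block["previous_hash"] != previous_hash or block["hash"] != recomputed:
            return index
        previous_hash = block["hash"]
    return None


# Hypothetical ledger entries, used only for illustration.
ledger = build_chain(["a pays b 5", "b pays c 2", "c pays d 1"])
before_tampering = first_invalid_block(ledger)

ledger[1]["data"] = "b pays c 200"  # alter an old record
after_tampering = first_invalid_block(ledger)
print(before_tampering, after_tampering)
```

To hide the change, an attacker would have to recompute the hash of the altered block and of every block after it, and then persuade the rest of the network to accept the rewritten copy.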

Blockchain is not invulnerable to cybersecurity attacks such as phishing and ransomware, but it is currently one of the most secure forms of data storage. Permissioned blockchains add an additional access control layer, so that only identifiable users can perform certain actions. These blockchains are different to both public blockchains and private blockchains.

Are blockchains good investments?

Currencies like Bitcoin and Ether are proving to be good investments both in the short-term and the long-term; NFTs are slightly different though. A good way to think about NFTs is as collector’s items in digital form. Like anything that’s collectable, it’s best to buy something because you truly admire it rather than because it’s valuable, especially in the volatile cryptocurrency ecosystem. It’s also worth bearing in mind that the values of NFTs are based entirely on what someone is prepared to pay rather than any history of worth – demand drives price.

Anyone can start investing but as most digital assets like NFTs can only be bought with cryptocurrency, you’ll need to purchase some, which you can easily do with a credit card on any of the crypto platforms. You will also need a digital wallet in which to store your cryptocurrency and assets. You’ll be issued with a public key, which works like an email address when sending and receiving funds, and a private key, which is like a password that unlocks your virtual vault. Your public key is generated from your private key, which makes them a pair and adds to the security of your wallet. Some digital wallets like Coinbase also serve as crypto bank accounts for savings. Although banks occasionally freeze accounts with relation to Bitcoin transactions, they are becoming more accustomed to cryptocurrencies. Investment banks such as JP Morgan and Barclays even show interest in the asset class despite the New York attorney general declaring “Play by the rules or we will shut you down” in March 2021.

Are blockchain transactions traceable?

In a blockchain, each full node (a computer participating in the network) has a complete record of the data that has been stored on the blockchain since it began. So, for example, the data held on the Bitcoin blockchain is the entire history of Bitcoin transactions. If one node presents an error in its data, the thousands of other nodes provide a reference point so that it can correct itself. This architecture means that no single node in the network has the power to alter the information held within it. It also means that the record of transactions in each block of Bitcoin’s blockchain is irreversible, and that any Bitcoins taken by a hacker can be traced through the transactions that appear in the wake of the hack.

Blockchain explorers allow anyone to see transactions happening in real-time.

Learn more about cryptocurrencies and blockchain

Whether you’re interested in improving cybersecurity or becoming a blockchain developer, looking for enhanced expertise in data science or artificial intelligence, specialist online Master’s degrees from University of York cover some of the hottest topics in these areas.

Discover more and get a step ahead with the MSc Computer Science with Data Analytics or the MSc Computer Science with Artificial Intelligence.

What is machine learning?

August 27, 2021/in Articles, Artificial intelligence, Data science /

Machine learning is considered to be a branch of both artificial intelligence (AI) and computer science. It uses algorithms to replicate the way that humans learn but can also analyse vast amounts of data in a short amount of time.

Machine learning algorithms are usually written to look for recurring themes (pattern recognition) and spot anomalies, which can help computers make predictions with more accuracy. This kind of predictive modelling can range from something as basic as a chatbot anticipating what your question may be about to something as complex as a self-driving car knowing when to make an emergency stop.

Arthur Samuel, an IBM employee, is credited with coining the term “machine learning” in his 1959 research paper, “Some Studies in Machine Learning Using the Game of Checkers”. It’s amazing to think that machine learning models were being studied as early as 1959 given that computers now contribute to society in important areas as diverse as healthcare and fraud detection.

Is machine learning AI?

Machine learning represents just a section of AI capabilities. There are three closely related areas of interest within AI: machine learning, deep learning, and artificial neural networks. Deep learning is a field within machine learning, and it is built on artificial neural networks with many layers. Traditionally, machine learning is very structured and requires more human intervention in order for the machine to start learning via supervised learning algorithms. Training data is chosen by data scientists to help the machine determine the features it needs to look for within labelled datasets. Validation datasets are then used to evaluate the model, and tune its settings, while it is being developed. Lastly, test datasets, kept aside until the end, are used to give an unbiased evaluation of the final model.
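The split into training, validation and test data can be sketched in a few lines. The 70/15/15 proportions, the function name split_dataset and the toy “cat”/“dog” labels below are assumptions chosen for illustration; real projects pick proportions to suit the amount of data available.

```python
import random


def split_dataset(rows, train_fraction=0.7, validation_fraction=0.15, seed=0):
    """Shuffle labelled rows and split them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    train_end = round(len(shuffled) * train_fraction)
    validation_end = train_end + round(len(shuffled) * validation_fraction)
    return shuffled[:train_end], shuffled[train_end:validation_end], shuffled[validation_end:]


# Toy labelled data: each row is (feature, label)
labelled_rows = [(x, "cat" if x % 2 == 0 else "dog") for x in range(100)]

train, validation, test = split_dataset(labelled_rows)
print(len(train), len(validation), len(test))  # 70 15 15
```

The important design point is that the three sets do not overlap: the model never sees the test rows until its final evaluation.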

Unsupervised learning also needs training data, but the data points are unlabelled. The machine begins by looking at unstructured or unlabelled data and becomes familiar with what it is looking for (for example, cat faces). This then starts to inform the algorithm, and in turn helps sort through new data as it comes in. Once the machine begins this feedback loop to refine information, it can more accurately identify images (computer vision) and even carry out natural language processing. It’s this kind of deep learning that also gives us features like speech recognition.

Currently, machines can tell whether what they’re listening to or reading was spoken or written by humans. The question is, could machines then write and speak in a way that is human? There have already been experiments to explore this, including a computer writing music as though it were Bach.

Semi-supervised learning is another learning technique that combines a small amount of labelled data with a large amount of unlabelled data. This technique helps the machine to improve its learning accuracy.

As well as supervised and unsupervised learning (or a combination of the two), reinforcement learning is used to train a machine to make a sequence of decisions with many factors and variables involved, but no labelling. The machine learns by following a gaming model in which there are penalties for wrong decisions and rewards for correct decisions. This is the kind of learning carried out to provide the technology for self-driving cars.
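The reward-and-penalty idea can be sketched with tabular Q-learning, one common reinforcement learning method (swapped in here purely as an illustration, and far simpler than anything used in a self-driving car). The corridor world, the reward values and every name and parameter below are assumptions: the agent starts somewhere between positions 1 and 4, is penalised for stepping onto position 0 and rewarded for reaching position 5.

```python
import random

ACTIONS = (-1, +1)                  # step left or step right
PENALTY_STATE, GOAL_STATE = 0, 5    # the two ends of the corridor


def train(episodes=2000, learning_rate=0.5, discount=0.9, epsilon=0.2, seed=0):
    """Learn action values for the corridor by trial and error."""
    rng = random.Random(seed)
    # q[state][action] is the current estimate of long-term reward
    q = {state: {action: 0.0 for action in ACTIONS}
         for state in range(PENALTY_STATE, GOAL_STATE + 1)}
    for _ in range(episodes):
        state = rng.randint(1, 4)                   # start somewhere inside the corridor
        for _ in range(100):                        # cap the length of an episode
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)        # explore: try a random action
            else:
                best = max(q[state].values())       # exploit: best known action, ties broken at random
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state = state + action
            if next_state == GOAL_STATE:
                reward, done = 1.0, True            # reward for reaching the goal
            elif next_state == PENALTY_STATE:
                reward, done = -1.0, True           # penalty for the wrong end
            else:
                reward, done = 0.0, False
            future = 0.0 if done else max(q[next_state].values())
            # Move the estimate towards the reward plus the discounted future value
            q[state][action] += learning_rate * (reward + discount * future - q[state][action])
            if done:
                break
            state = next_state
    return q


q_table = train()
policy = {state: max(ACTIONS, key=lambda a: q_table[state][a]) for state in range(1, 5)}
print(policy)  # the preferred step (-1 or +1) for each position
```

No position is ever labelled with the “right” move. The agent only receives rewards and penalties, and the preferred direction emerges from them.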

Is clustering machine learning?

Clustering, also known as cluster analysis, is a form of unsupervised machine learning. This is when the machine is left to its own devices to discover what it perceives as natural groupings, or clusters. Clustering is helpful in data analysis to learn more about the problem domain or understand arising patterns, for example, customer segmentation. In the past, segmentation was done manually and helped construct classification structures such as the phylogenetic tree, a tree diagram that shows how all species on Earth are interconnected. From this example alone, we can see how what we now call big data could take years for humans to sort and compile. AI can manage this kind of data mining in a much quicker time frame and spot things that we may not, thereby helping us to understand the world around us. Real-world use cases include clustering DNA patterns in genetics studies, and finding anomalies in fraud detection.

Clusters can overlap, where data points belong to multiple clusters. This is called soft or fuzzy clustering. In other cases, the data points in clusters are exclusive: they can exist only in one cluster (also known as hard clustering). K-means clustering is an exclusive clustering method where data points are placed into K groups. K is the number of clusters, and therefore the number of centroids (cluster centres), and it is chosen before the algorithm runs; the algorithm then allocates each data point to the nearest centroid. The “means” in K-means refers to the average of the data points in a cluster, which is how each centroid is calculated. A larger K value produces many smaller groups, whereas a small K value produces larger, broader groups of data.
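A minimal sketch of K-means on two-dimensional points shows the two repeating steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. The function name k_means, the fixed number of iterations and the sample points are assumptions made for illustration; library implementations add better initialisation and stopping rules.

```python
import random


def k_means(points, k, iterations=20, seed=0):
    """Cluster 2-D points into k groups and return (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (hard clustering)
        assignments = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: each centroid moves to the mean of the points assigned to it
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print(assignments)  # the first three points share one label, the last three the other
```

Running the same data with a larger k would split these two broad groups into smaller ones.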

Other unsupervised machine learning methods include hierarchical clustering, probabilistic clustering (including the Gaussian Mixture Model), association rules, and dimensionality reduction.

Principal component analysis is an example of dimensionality reduction: reducing a larger set of variables in the input data to a smaller set while retaining as much of the variance as possible. It is also a useful method for the visualisation of high-dimensional data because it ranks principal components according to how much they contribute to patterns in the data. Although more variables can be helpful for more accurate results, too many can lead to overfitting, which is when the machine starts picking up on noise or granular detail from its training data set.
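For data with just two variables, principal component analysis can be worked through by hand, which makes the idea easier to see. The sketch below uses the closed-form eigenvalues of a 2x2 covariance matrix; the function name pca_2d and the sample points are assumptions, and real analyses with many variables would rely on a numerical library instead.

```python
import math


def pca_2d(points):
    """Return (direction, projected, explained) for the first principal component of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]]
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix: the variance along each principal component
    centre = (var_x + var_y) / 2
    spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    largest, smallest = centre + spread, centre - spread
    # Direction of the first principal component (the axis of greatest variance)
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    direction = (math.cos(angle), math.sin(angle))
    # Project each centred point onto that direction: two variables become one
    projected = [(x - mean_x) * direction[0] + (y - mean_y) * direction[1] for x, y in points]
    explained = largest / (largest + smallest)  # share of total variance that is kept
    return direction, projected, explained


points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
direction, projected, explained = pca_2d(points)
print(direction)   # roughly the diagonal, since y rises with x
print(explained)   # close to 1: one component keeps almost all of the variance
```

Here the two variables move together so closely that a single component describes the data almost as well as both.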

The most common use of association rules is for recommendation engines on sites like Amazon, Netflix, LinkedIn, and Spotify to offer you products, films, jobs, or music similar to those that you have already browsed. The Apriori algorithm is the most commonly used for this function.
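Association rules are usually judged by two measures: support (how often items appear together) and confidence (how often the rule holds when its first part is present). The sketch below computes both for a handful of made-up shopping baskets. It is not the Apriori algorithm itself, only the measures that algorithm builds on, and the basket contents and function names are assumptions.

```python
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def count(itemset):
    """Number of baskets that contain every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


print(support({"bread", "milk"}))        # 0.6
print(confidence({"bread"}, {"milk"}))   # 0.75
```

A recommendation engine could read the second figure as “three in four shoppers who bought bread also bought milk”.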

How does machine learning work?

Machine learning starts with an algorithm for predictive modelling, either self-learnt or programmed, that leads to automation. Data science is the means through which we discover the problems that need solving and how those problems can be expressed through a readable algorithm. Supervised machine learning is used for either classification or regression problems.

On a basic level, classification predicts a discrete class label and regression predicts a continuous quantity. There can be an overlap in the two in that a classification algorithm can also predict a continuous value. However, the continuous value will be in the form of a probability for a class label. We often see algorithms that can be utilised for both classification and regression with minor modification in deep neural networks.

Linear regression is when the output is predicted to be continuous with a constant slope. This can help predict values within a continuous range such as sales and price rather than trying to classify them into categories. Logistic regression can be confusing because it is actually used for classification problems. The algorithm estimates the probability that an input belongs to a particular class, and that probability is then used to decide the class label.
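The difference between the two can be sketched in a few lines. The first function fits a straight line by ordinary least squares and predicts a continuous value; the second is the logistic function, which turns any score into a probability between 0 and 1 that can then be thresholded into a class label. The data, the function names and the 0.5 threshold are assumptions for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def logistic(score):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-score))


# Regression: predict a continuous quantity
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)        # 2.0 1.0
print(slope * 5 + intercept)   # 11.0, the predicted value for x = 5

# Classification: turn a score into a probability, then into a class label
probability = logistic(0.0)
print(probability)                             # 0.5
print("spam" if logistic(2.0) >= 0.5 else "not spam")  # spam
```

In a real logistic regression the score itself would be a weighted sum of the input features, with the weights learnt from labelled training data.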

The Support Vector Machine (SVM) is a fast and much-used algorithm that can be used for both classification and regression problems but is most commonly used in classification. The algorithm is favoured because it can analyse and classify even when there is a limited amount of data available. It groups data into classes even when the classes are not immediately clear because it can map the data into a higher-dimensional space and use a hyperplane rather than a line to separate it. SVMs can be used for functions like helping your mailbox to detect spam.

How to learn machine learning

With an online MSc Computer Science with Data Analytics or an online MSc Computer Science with Artificial Intelligence from University of York, you’ll get an introduction to machine learning systems and how they are transforming the data science landscape.

From big data to how artificial neurons work, you’ll understand the fundamentals of this exciting area of technological advances. Find out more and secure your place on one of our cutting-edge master’s courses.

Everything you need to know about data analytics

July 15, 2021/in Articles, Data science /

Data analytics is a key component of most business operations, from marketing to supply chain. But what does data analytics mean, and why are so many organisations utilising it for business growth and success?

What is data analytics?

Data analytics is all about studying data – and increasingly big data – to uncover patterns and trends through analysis that leads to insight and predictability. Data analytics emerged from mathematics, statistics and computer programming before becoming a field in its own right. It’s related to data science and it’s a skill that is highly desirable and in demand.

We live in a world full of data gleaned from our various devices, which track our habits in order to understand and predict behaviours as well as help decision-making. Algorithms are created based upon the patterns that arise from our usage. Data can be extracted from almost any activity, whether it’s tracking sleep patterns or measuring traffic flow through a city. All you need are defined metrics. Although much of data extraction is automated, the role of data analysts is to define subsets, look at the data and make sense of it, thereby providing insight that can improve everyday life.

Why is data analytics important?

Data analytics is particularly important in providing business intelligence that helps with problem-solving across organisations. This is known as business analytics, and it’s become a key skill and requirement for many companies in making business decisions. Data mining, statistical modelling, and machine learning are all major elements of predictive analytics which uses historical data. Rather than simply looking at what happened in the past, businesses can get a good idea of what will happen in the future through analysis and modelling of different types of data. This can then help them assess risk and opportunity when planning ahead.

In healthcare, for example, data analytics helps streamline operations and reduce wait times, so patients are seen more quickly. During the pandemic, data analysis has been crucial in analysing figures related to the rate of infection, which then helps in identifying hotspots, and forecasting either an increase or decrease in infections.

Becoming qualified as a data analyst can lead to work in almost any sector. Data analysis is essential for managing global supply chains and for planning in banking, insurance, healthcare, retail and telecommunications.

The difference between data analytics and data analysis

Although it may seem like data analytics and data analysis are the same, they are understood slightly differently. Data analytics is an overarching term that defines the practice, while data analysis is just a section of the entire process. Once data sets have been prepared, usually using machines to speed up the sorting of unstructured data, data analysts use techniques such as data cleansing, data transforming and data modelling to build insightful statistical information. This is then used to help improve and optimise everyday processes with data analytics as a whole.

What is machine learning?

Machine learning – a form of artificial intelligence – is a method of data analysis that uses automation for analytical model building. Once the machine has learnt to identify patterns through algorithms, it can make informed decisions without the need for human input. Machine learning helps speed up data analysis considerably, but this relies on data and parameters being accurate and unbiased, something that still needs human intervention and moderation. It’s a current area of interest because the way that data analysis progresses and supports us is reliant on a more diverse representation amongst data analysts.

Currently, most automated machine learning is based on simple, straightforward problems. More complex problems still require at least two people to work on them, so artificial intelligence is not going to take over any time soon. Human consciousness is still a mystery to us, but it is what makes the human brain’s ability to analyse unique.

What are data analytics tools?

There are a number of tools that help with analysis and overall analytics, and many businesses utilise at least some of them for their day-to-day operations. Here are some of the more popular ones, which you may have heard of:

Microsoft Excel is one of the most well-known and useful tools for tabular data.
Tableau is business intelligence software that helps to make data analysis fast and easy by connecting to data sources such as Excel spreadsheets.
Python is a programming language used by data analysts and developers which makes it easy to collaborate on machine learning and data visualisation amongst other things.
SQL (Structured Query Language) is a domain-specific language used to query and manage data held in relational databases.
Hadoop is an open-source framework, built around a distributed file system, that can store and process large volumes of data.

Analysts also use databases that provide storage for data which is relational (SQL) and non-relational (NoSQL). Learning about all of these tools and becoming fluent in how to use them is necessary to become a data analyst.
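To give a flavour of how two of these tools work together, the sketch below uses Python’s built-in sqlite3 module to run a SQL query against a small in-memory relational database. The sales table, its columns and its figures are made up for illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (region TEXT, amount REAL)")
connection.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 60.0), ("South", 40.0), ("East", 50.0)],
)

# Total sales per region, largest first
rows = connection.execute(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()
print(rows)  # [('North', 180.0), ('South', 120.0), ('East', 50.0)]

connection.close()
```

The query describes what result is wanted (totals grouped by region) and leaves the database to work out how to produce it.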

How to get into data analytics

Working in data analytics requires a head for numbers and statistical techniques. But it also requires the ability to spot problems that need solving and the understanding of the criteria needed for data measurement and analysis to provide the solutions.

You need to become familiar with the wide range of methods used by analysts such as regression analysis (investigating the relationship between variables), Monte Carlo simulation (frequently used for risk analysis) and cluster analysis (grouping related data points). In a way, you are telling a story through statistical data so you need to be a good interpreter of data and communicator of your findings. You will also need patience because, in order to start your investigations, it’s important to have good quality data. This is where the human eye is needed to spot things like coding errors and to transform data into something meaningful.
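As an example of one of these methods, a Monte Carlo simulation estimates risk by running a scenario many times with randomly drawn inputs and counting how often a given outcome occurs. The sketch below estimates the chance that a project goes over budget; the cost categories, their distributions, the budget and the function name are all assumptions invented for illustration.

```python
import random


def overrun_probability(trials=100_000, budget=100.0, seed=0):
    """Estimate the probability that total project cost exceeds the budget."""
    rng = random.Random(seed)  # seeded so the estimate is repeatable
    overruns = 0
    for _ in range(trials):
        labour = rng.gauss(50, 5)         # uncertain cost, normally distributed
        materials = rng.gauss(30, 4)      # uncertain cost, normally distributed
        shipping = rng.uniform(10, 20)    # anywhere between 10 and 20
        if labour + materials + shipping > budget:
            overruns += 1
    return overruns / trials


estimate = overrun_probability()
print(f"Estimated chance of exceeding the budget: {estimate:.1%}")
```

With these assumed inputs the estimate comes out at roughly one in four, and running more trials narrows the uncertainty around that figure.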

Studying for an MSc Computer Science with Data Analytics online

You can become a data analyst with the postgraduate course, MSc Computer Science with Data Analytics from the University of York. The course is 100% online with six starts per year so you can study anywhere, any time.

You can also pay per module with topics covered such as Big Data Analytics, Data Mining and Text Analysis, and Artificial Intelligence and Operating Systems. Once you’ve completed the learning modules you can embark on an Individual Research Project in a field of your choice.

Take the next step in your career by mastering the science of data analytics.
