I, like many others, am constantly frustrated by the sensationalist use of terminology in data science. Twitter user @xaprb neatly summed up my own feelings when he tweeted:
When you're fundraising, it's AI [Artificial Intelligence].
When you're hiring, it's ML [Machine Learning].
When you're implementing, it's linear regression.
When you're debugging, it's print.
As someone with a background in operations research, I'm also stung by the following joke.
Q: What's the difference between an operations researcher and a data scientist?
A: About $50k per annum.
Data science does, at times, feel like it's putting shiny new labels on dusty old concepts. And, I think, this results in some confusion for those who are new to the field.
A rose by any other name...
The following terms often appear to be synonymous.
- data analytics
- data science
- big data
- machine learning
- artificial intelligence (AI)
Strangely enough, I rarely hear "statistics" or "operations research" mentioned in the same context. But that's for another blog post...
French mathematician Henri Poincaré said that mathematics is the art of giving the same name to different things. He might well have made the same observation of data science were he alive today.
Do labels matter? Andrew, Andy or Drew--it's still just me.
In the case of data science, however, I think they do matter. The five areas mentioned above cover a huge amount of conceptual ground and organizations need to know what capabilities best serve their needs and how to recruit people with the appropriate skills.
If you recruit experts in logic-based AI and set them to work on your D3 data visualization projects, the results are probably going to be suboptimal. Clarity is always welcome.
I also find that people new to the area--e.g., those in search of training--usually crave some understanding of the different areas/terms. It's difficult to know what to learn if you can't identify it. I've seen people who wanted to learn how to create charts struggling with support vector machines because they followed the hyped terms.
Let me be clear. I can't offer a definitive definition of these terms. I can't even ask you to agree with my definitions of them. I can only start shouting in an already crowded bazaar. However, it's a conversation that keeps needing to be had. So, let's begin.
What is data analytics?
Data analytics is the most common use of data in organizations. It's widely and regularly used in day-to-day operations. It's used to produce things like
- daily or quarterly sales figures
- revenue by department
- inventory reports
Data analysts often extract data from relational databases (like SQL Server and Oracle) and present it as reports and corporate dashboards.
They will often produce "one-off" charts for presentations or to inform particular business decisions.
Data analytics is generally about understanding what has happened, or is currently happening, in the organization.
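To make this concrete, here is a minimal sketch of a "revenue by department" report. It uses Python's built-in sqlite3 module as a stand-in for a corporate database such as SQL Server or Oracle; the sales table, its columns and the figures are all invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a corporate relational database.
# The table name, columns and figures are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (department TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Hardware", "2024-01-05", 1200.0),
        ("Hardware", "2024-01-06", 800.0),
        ("Software", "2024-01-05", 450.0),
        ("Software", "2024-01-07", 550.0),
        ("Services", "2024-01-06", 300.0),
    ],
)


def revenue_by_department(connection):
    """Return (department, total revenue) rows, largest revenue first."""
    query = """
        SELECT department, SUM(amount) AS revenue
        FROM sales
        GROUP BY department
        ORDER BY revenue DESC
    """
    return connection.execute(query).fetchall()


report = revenue_by_department(conn)
for department, revenue in report:
    print(f"{department:<10} {revenue:>10.2f}")
```

The query only summarizes what has already happened. No inference or prediction is involved, which is what sets this kind of work apart from the areas discussed later.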
What skills are desirable for a data analyst?
Data analysts should ideally be able to
- Query and manipulate databases using SQL (Structured Query Language)
- Use dashboard tools and understand how to design effective organizational dashboards. Stephen Few's "Information Dashboard Design" book is essential reading if you wish to know more about this topic.
- Utilise statistical tools to ensure that poor data, such as biased samples, doesn't filter through to decision-makers
- Produce effective, clear charts that inform, rather than confuse, decision-makers
- Take a wider interest in the business so that they can suggest data that might be useful
- Provide a bridge between management and DBAs (database administrators)
What is data science?
Data science is somewhat of an umbrella term. However, if we attempt to distinguish it from the other terms above, it's largely about making inferences from data. The data scientist is often attempting to create new knowledge from existing data--e.g., by producing predictions.
Data scientists look to uncover patterns in the data. This usually leads to more assumptions than are required in data analytics, and data scientists have to get used to most of their explorations ending up down blind alleys.
While data science projects do make use of well-structured relational data, they commonly involve the use of disparate, messy, unstructured data--such as customer feedback comments, or third-party datasets. Quality control becomes a big issue when working with such data.
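As a small illustration of "creating new knowledge from existing data", the sketch below fits a straight line to a handful of invented observations and uses it to predict an unseen value. It is plain ordinary least squares written out by hand. The variable names and figures are assumptions for the example, and a real project would also need to check how far the prediction can be trusted.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented figures: monthly advertising spend versus units sold.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [11.0, 15.0, 15.0, 19.0, 20.0]

slope, intercept = fit_line(ad_spend, units_sold)

# Predict sales for a level of spend that was never observed.
predicted = intercept + slope * 6.0
print(f"slope={slope:.2f}, intercept={intercept:.2f}, predicted={predicted:.2f}")
```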
What skills are desirable for a data scientist?
Data scientists should ideally be able to
- Use relatively advanced statistical tools and methods
- Program in at least one data science language (e.g., Python, R)
- Extract and manipulate data from diverse data sources--e.g., relational and non-relational databases, spreadsheets, JSON documents
- "Munge" data to transform it between different formats
- Use machine learning methods, such as clustering and random forests
- Produce custom visualisations using tools like ggplot2, Matplotlib or D3.js
- Explain complex, statistical analyses to non-technical decision-makers
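Munging is easier to show than to describe. The sketch below takes nested JSON records, flattens and cleans them, and writes them out as CSV using only the Python standard library. The feedback records and field names are invented for the example.

```python
import csv
import io
import json

# Invented customer feedback records, as they might arrive from a JSON source.
raw_json = """
[
  {"customer": {"id": 17, "region": "EU"}, "rating": "4", "comment": " Fast delivery "},
  {"customer": {"id": 23, "region": "US"}, "rating": null, "comment": "Too expensive"}
]
"""


def flatten(record):
    """Flatten one nested record into a flat row with cleaned values."""
    rating = record.get("rating")
    return {
        "customer_id": record["customer"]["id"],
        "region": record["customer"]["region"],
        # Ratings arrive as strings or null; convert to int, leave missing blank.
        "rating": int(rating) if rating is not None else "",
        "comment": record.get("comment", "").strip(),
    }


rows = [flatten(r) for r in json.loads(raw_json)]

# Write the flat rows to CSV (held in memory here rather than in a file).
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["customer_id", "region", "rating", "comment"]
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Most of the effort goes into small decisions, such as what to do with a missing rating. This is where the quality control issues mentioned earlier tend to surface.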
What is big data?
Big data generally means working with data that is too large to be processed using standard (e.g., workstation or single-server) tools. The boundaries are constantly shifting. Apparently, it's now possible to spin up a machine with 1 TB of RAM in Azure--so a single machine can handle fairly hefty loads.
Big data is partially an enabling technology for data analytics and data science. It provides the data that those areas require to sustain them. Big data platforms may be used to manage data that isn't destined for more detailed analysis, such as logs stored for regulatory reasons.
Spark is generally the preferred big data platform for data scientists. It has a powerful machine learning library (MLlib) that makes it easy to perform analyses on massive data sets.
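The idea behind platforms like Spark is to split the data into partitions, process each partition independently, and then merge the partial results. The sketch below imitates that pattern in a single Python process. It is not Spark's API. The partitions stand in for cluster nodes, and the log lines and function names are invented for illustration.

```python
from collections import Counter
from functools import reduce


def partition(items, n_parts):
    """Split items into n_parts chunks (stand-ins for nodes in a cluster)."""
    return [items[i::n_parts] for i in range(n_parts)]


def map_partition(lines):
    """'Map' step: count log levels within a single partition."""
    return Counter(line.split()[0] for line in lines)


def merge(left, right):
    """'Reduce' step: combine two partial counts."""
    return left + right


# Invented log lines; in practice these would be spread across many machines.
log_lines = [
    "ERROR disk full",
    "INFO job started",
    "INFO job finished",
    "WARN slow response",
    "ERROR timeout",
    "INFO job started",
]

partials = [map_partition(p) for p in partition(log_lines, 3)]
totals = reduce(merge, partials, Counter())
print(totals)
```

Because each partition is counted without reference to the others, the map step could run on separate machines. Only the small partial counts need to be brought together at the end.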
What skills are desirable for a big data specialist?
Big data specialists should ideally be able to
- Operate and manage clusters of networked computers
- Maintain high availability of the cluster
- Understand cybersecurity issues and how they relate to securing a cluster of computers and the vulnerability of the big data platforms used in their organisations
- Program using enterprise languages, such as Java and Scala
- Tune their chosen big data platforms to ensure and maintain performance
What is machine learning?
Machine learning can probably be considered a subset of the tasks undertaken by a data scientist. However, as machine learning is a large and complex area, it is likely that a general data scientist won't have a deep knowledge of machine learning techniques and tools.
Organizations that wish to make significant use of machine learning, and have relatively novel requirements, may turn to experts in a particular branch of machine learning (e.g., deep learning).
Designing and tuning machine learning systems to get the best from them can require significant specialist experience--experience that a more general data scientist doesn't have the time to gain and/or maintain. The tooling around many machine learning approaches (e.g., TensorFlow for deep learning) can take time to master.
Machine learning is generally used to make predictions based on complex data sets. It's often the case that the problem domain is poorly understood and machine learning is deployed to try to uncover patterns that can be exploited to form predictions in previously uncharted areas/scenarios.
What skills are desirable for a machine learning specialist?
Machine learning specialists should ideally be able to
- Provide deep expertise in one or more modeling techniques/tools
- Understand the statistical basis of the algorithms they use
- Write programs in popular machine learning languages, such as Python
- Exploit fast processing technologies such as GPUs and FPGAs
- Tune model "hyperparameters" to improve predictive accuracy
- Identify where machine learning tools will be ineffective
- Work closely with data scientists to ensure machine learning technology delivers results for the organization
- Advise on how models can be moved from research to production
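Hyperparameter tuning, mentioned in the list above, can be sketched in a few lines. The example below tries several values of k for a hand-written k-nearest-neighbours classifier and keeps the one that scores best on a held-out validation set. The data points, labels and candidate values are invented, and a specialist would normally reach for an established library and cross-validation rather than this bare-bones version.

```python
from collections import Counter


def knn_predict(train, point, k):
    """Classify point by majority vote among its k nearest training examples."""
    by_distance = sorted(
        train,
        key=lambda ex: (ex[0][0] - point[0]) ** 2 + (ex[0][1] - point[1]) ** 2,
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of validation examples classified correctly for a given k."""
    correct = sum(knn_predict(train, x, k) == y for x, y in validation)
    return correct / len(validation)


# Invented 2-D points with labels; one point in each cluster is deliberately noisy.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.8, 1.1), "a"), ((3.0, 3.0), "a"),
    ((3.0, 3.2), "b"), ((3.2, 2.9), "b"), ((2.9, 3.1), "b"), ((1.1, 1.2), "b"),
]
validation = [
    ((1.0, 0.9), "a"), ((0.9, 1.0), "a"),
    ((3.1, 3.0), "b"), ((3.0, 3.1), "b"),
]

# k is the hyperparameter: it is chosen by us, not learned from the data.
candidates = [1, 3, 5]
scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```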
What is AI?
Artificial intelligence (AI) has been around since the 1950s. Enthusiasm for it has waxed and waned, but it's currently experiencing a renaissance--largely through its contribution to machine learning.
Technologies pioneered by AI researchers that are now being used extensively by organizations include
- Deep learning (which grew out of earlier work on neural networks)
- Natural language processing (NLP)--used in conversational user interfaces
- Image processing, as used in products such as self-driving cars
Cheap computer processing and storage have transformed old AI techniques into practical technologies in recent years.
AI research provides the foundation for many of the capabilities discussed previously. It is mostly performed in universities or research institutes. Successful ideas are then picked up and transferred to operational use.
What skills are desirable for an AI researcher?
The skills required by an AI researcher are completely determined by the area of their research. By its very nature, research is highly specialized and attracts deep niche experts.
At the end of the day, it doesn't matter where you draw the boundaries between these terms--or even what terms you use. But, it does matter that you have some terms, with clear, discrete definitions--and that you use them consistently in your organization.
As in all human endeavors, language matters. And it's especially important to attempt to be clear in areas where there is considerable pre-existing confusion.
Take a look at this Data Science Infographic to get more information and an understanding of the difference and importance of the five topics discussed.