A few years ago, most data science job openings listed a PhD, or at least a master’s degree, in mathematics, statistics, or a similar subject as a primary requirement.
That has changed in the last couple of years. Machine learning libraries that abstract away the complexity of the underlying algorithms have become widespread, and companies have realized that applying machine learning to business problems requires skills that are not usually acquired through academic training alone. Companies are now hiring data scientists based on their ability to do applied data science rather than research.
Applied data science that delivers value to businesses in the shortest amount of time requires a very hands-on skill set. As more companies move their data and machine learning solutions to the cloud, it’s becoming increasingly important for data scientists to be aware of the new tools and technologies involved.
Also, the days when data scientists worked exclusively on modeling, using data collected by data engineers and then handing the model off to a team of software engineers to put into production, are largely over, especially outside of tech giants like Amazon, Facebook and Google.
To maximize business value, data scientists must understand all phases of the model development life cycle. It is important to have at least a working knowledge of data pipeline development, data analysis, machine learning, mathematics, statistics, data engineering, cloud computing and software development. This means that in 2021 the “generalist” data scientist will be preferred in most enterprises.
“The bigger the picture, the more unique is the potential contribution of a person. Our biggest strength is the exact opposite of narrow specialization. It is the capacity for broad integration.” (David Epstein)
This article doesn’t cover absolutely everything you need to become a data scientist in 2021, but it does reveal the key skills, both new and old, that will become the most important for every successful data scientist in the near future.
1. Python 3
Python 3 has now firmly become the default version of the language for most applications: Python 2 reached its official end of life on January 1, 2020, and most major libraries have since dropped support for it. Important: if you are learning Python for data science now, take a course that works with version 3.
You will need a good understanding of the basic syntax of the language and how to write functions, loops, and modules. You will also need to know both object-oriented and functional programming in Python, and be able to develop, execute, and debug programs.
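As a rough illustration of the basics listed above, the sketch below combines a small class, a function built on a loop, and a functional-style pipeline. The `Measurement` class, the `mean` helper, and the sample values are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Measurement:
    """A single labelled reading (hypothetical example type)."""
    label: str
    value: float

    def scaled(self, factor: float) -> "Measurement":
        # Return a new object instead of mutating this one.
        return Measurement(self.label, self.value * factor)


def mean(values):
    """Arithmetic mean of a non-empty sequence, written with an explicit loop."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    if count == 0:
        raise ValueError("mean() requires at least one value")
    return total / count


readings = [Measurement("a", 1.0), Measurement("b", 2.0), Measurement("c", 3.0)]

# Functional style: map and reduce over values that are never modified in place.
doubled = list(map(lambda m: m.scaled(2), readings))
total = reduce(lambda acc, m: acc + m.value, doubled, 0.0)

average = mean([m.value for m in readings])
print(average, total)
```

The object-oriented part keeps data and behavior together, while the `map` and `reduce` calls show the same data being transformed without side effects. Which style fits best usually depends on the task.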
2. Pandas
Pandas is still the number one Python library for data processing and analysis. In 2021, knowing Pandas will remain one of the most important skills for a data scientist.
Data is at the heart of any data science project, and Pandas is the tool that lets you clean, process, and extract useful information from it. Most machine learning libraries these days also commonly use Pandas DataFrames as standard input.
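A minimal sketch of what cleaning, processing, and extracting information can look like in practice. The column names and values are made up for illustration; only standard Pandas calls (`dropna`, `drop_duplicates`, `assign`, `to_numeric`, `groupby`) are used.

```python
import pandas as pd

# Hypothetical raw data with a missing customer, a duplicate row,
# and an amount that cannot be parsed as a number.
raw = pd.DataFrame(
    {
        "customer": ["ann", "bob", "bob", "cat", None],
        "amount": ["10.5", "20", "20", "n/a", "7"],
    }
)

clean = (
    raw.dropna(subset=["customer"])   # drop rows with no customer
    .drop_duplicates()                # remove exact duplicate rows
    .assign(amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"))
    .dropna(subset=["amount"])        # drop rows whose amount failed to parse
)

# Extract a simple summary: total amount per customer.
totals = clean.groupby("customer")["amount"].sum()
print(totals)
```

Chaining the steps keeps each transformation visible and leaves the original `raw` frame untouched, which makes it easier to re-run or debug the cleaning logic.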
3. SQL and NoSQL
SQL has been around since the 1970s but is still one of the most important skills for data scientists. The vast majority of businesses use relational databases as a repository for analytical data, and SQL is the tool you use to retrieve that data.
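To make this concrete, here is a small sketch that runs an aggregate query from Python using the standard library `sqlite3` module. The `orders` table and its rows are hypothetical, and an in-memory SQLite database stands in for whatever relational database a company actually uses.

```python
import sqlite3

# An in-memory database, used here only as a stand-in for a real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ann", 10.5), ("bob", 20.0), ("bob", 5.0)],
)

# Total spend per customer, largest first.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
conn.close()

print(rows)
```

The same `SELECT ... GROUP BY` pattern carries over to most relational databases, although connection details and SQL dialects differ between systems.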
NoSQL databases do not store data as relational tables. Instead, data is stored as key-value pairs, wide columns or graphs. Examples of NoSQL databases include Google Cloud Bigtable and Amazon DynamoDB.
As the amount of data collected by companies increases and unstructured data is increasingly used in machine learning models, organizations are turning to NoSQL databases as a complement or alternative to a traditional data warehouse. This trend is likely to continue in 2021, and as a data scientist, it’s important to get at least a basic understanding of how to interact with data in this form.
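The key-value access pattern can be sketched without any database at all. In the example below a plain Python dictionary stands in for the store, and `put` and `get` are hypothetical helpers, not the API of Bigtable, DynamoDB, or any other product. The point is only that items are looked up by key and need not share a fixed schema.

```python
import json

# A plain dict stands in for a key-value store (assumption for illustration).
store = {}


def put(key, item):
    """Serialize the item and store it under the given key."""
    store[key] = json.dumps(item)


def get(key):
    """Return the item stored under key, or None if the key is absent."""
    raw = store.get(key)
    return None if raw is None else json.loads(raw)


# Two items with different attributes: no shared table schema is required.
put("user#1", {"name": "Ann", "tags": ["premium"]})
put("user#2", {"name": "Bob", "last_login": "2020-12-01"})

ann = get("user#1")
bob = get("user#2")
missing = get("user#3")
print(ann, bob, missing)
```

Real NoSQL services add persistence, partitioning, and their own client libraries on top of this idea, so the provider documentation is the place to look for actual calls.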
4. Cloud
According to one survey conducted in January of this year, 88% of organizations at that time were using some form of cloud infrastructure. The impact of COVID-19 has likely accelerated this process even further.
The use of the cloud in other areas of business usually goes hand in hand with cloud solutions for data storage, analytics and machine learning. Major cloud service providers such as Google Cloud Platform, Amazon Web Services, and Microsoft Azure are rapidly developing tools to train, deploy, and maintain machine learning models.
If you work as a data scientist in 2021 and beyond, it is highly likely that you will be dealing with data stored in a cloud-based database such as Google BigQuery and developing cloud-based machine learning models. Experience and skills in this area are likely to be in high demand.
5. Airflow
Apache Airflow, an open source workflow management tool, is rapidly being adopted by many companies to manage ETL processes and machine learning pipelines. Many big tech companies like Google and Slack use it, and Google even built their Cloud Composer around this tool.
Airflow is increasingly mentioned as a desirable skill for data scientists in job postings. As mentioned at the beginning of this article, it will become more important for data scientists to be able to create and manage their own data pipelines for analytics and machine learning. The growing popularity of Airflow is likely to continue, at least in the short term, and as an open source tool, it should definitely be explored by every aspiring data scientist.
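The core idea behind a workflow manager is that a pipeline is a directed acyclic graph of tasks, each run only after the tasks it depends on. The sketch below illustrates that idea in plain Python with the standard library `graphlib` module (Python 3.9+). It deliberately does not use the Airflow API, and the three task functions are hypothetical placeholders.

```python
from graphlib import TopologicalSorter

log = []  # records the order in which tasks actually ran


def extract():
    log.append("extract")


def transform():
    log.append("transform")


def load():
    log.append("load")


tasks = {"extract": extract, "transform": transform, "load": load}

# Map each task to the set of tasks that must finish before it starts.
depends_on = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# Run every task in an order that respects the dependencies.
for name in TopologicalSorter(depends_on).static_order():
    tasks[name]()

print(log)
```

A tool like Airflow builds scheduling, retries, logging, and monitoring on top of this kind of dependency graph, which is why it is worth learning the real tool rather than hand-rolling one.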
6. Software engineering
Data science code is often messy: it is not always thoroughly tested, and it does not always follow style conventions. This is fine for initial data exploration and quick analysis, but when it comes to putting machine learning models into production, the data scientist will need a good understanding of software development principles.
If you’re planning to work as a data scientist, chances are you’ll either be putting models into production yourself, or at least actively participating in the process. Therefore, it is important to master the following skills:
- Coding conventions such as PEP 8, the Python style guide
- Unit testing
- Version control with Git and a hosting service such as GitHub
- Dependencies and virtual environments
- Containers like Docker
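As a small illustration of the unit testing item in the list above, the sketch below tests a hypothetical `normalize` helper with the standard library `unittest` module. The function and the test cases are made up for this example.

```python
import unittest


def normalize(values):
    """Scale a non-empty list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class NormalizeTest(unittest.TestCase):
    def test_scales_to_unit_range(self):
        self.assertEqual(normalize([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input(self):
        self.assertEqual(normalize([3, 3]), [0.0, 0.0])

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            normalize([])


# Load and run the tests explicitly so the script keeps running afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

In a real project the tests would normally live in their own file and be run by a test runner as part of version control and continuous integration, rather than inline as shown here.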
This article has highlighted the major trends in data scientist skills. These insights were drawn from reviewing current data science job postings, the author’s own experience as a data scientist, and reading articles about future trends in the field.
This is not a complete list, and it definitely takes a lot more skill and experience to become a successful data scientist. But over the next year, these are likely to be the most important skills to focus on.
Thank you for reading!