Andreea Munteanu
on 13 June 2024
Data science is one of the most exciting fields of our time. With its utility across industries of all kinds, it’s easy to see why it has been rated as one of the top 20 fastest-growing occupations in the US, according to the Bureau of Labor Statistics. However, entering this fast-growing space isn’t easy: newcomers face significant challenges in setting up their environments, dealing with package dependencies, and accessing compute resources. Given these obstacles, it’s easy to see why a talent shortage persists in the data science field, and why overcoming these challenges is vital for teams and companies.
This blog will walk you through the most common challenges that data science newcomers face, review popular data science platforms, and take a look at the bigger picture of how open source is used in data science. With these insights, you will be able to more easily choose the right tools and options to simplify your work and focus on upskilling in the data science field.
Is it easy to get started in data science?
Data science is a rewarding career, but starting out as a newcomer can be challenging. Here are the most common obstacles that new data scientists face when starting their careers:
- Time spent on tooling: Data scientists often spend more time configuring and fixing their tools than building models. Between tool selection, integration and package dependencies, practitioners constantly have to make sure their system keeps working. An out-of-the-box solution seems the most obvious answer, but tools that integrate seamlessly and can be deployed within minutes are an equally viable option.
- Configurations: Whether it’s GPU configuration or managing package dependencies, data scientists have to work through tedious tasks before they can get started. A 2023 report from Anaconda found that approximately a quarter of commercial data scientists report being blocked by package dependency management or access to compute resources.
- Learning curve: Something new comes up every other day in this field, which often feels overwhelming for newcomers who are under pressure to quickly upskill in many different areas at once, from programming to development tool maintenance. Data scientists are constantly upskilling through a number of channels, often on their own: according to the latest Stack Overflow Developer Survey, a majority of developers upskill using online courses, blogs and technical documentation. This shows that data scientists need the time and space to focus on the skills they are actually trying to acquire, rather than on preparing the environment they need before they can start learning.
- Initial cost: Data science can be costly; newcomers want to lower their initial investment before committing to data science as a long-term career path. Open source tooling has been a great way to save on set-up costs: it enables future data scientists and ML engineers to get started at no cost and build on projects that are already available.
As you can see, new data scientists typically face a rough start. However, the good news is that once they are on track, it gets easier and easier every day.
How to choose a data science platform
As I mentioned before, it seems that a new tool, framework or library for data science or machine learning is launched every other day. This can be overwhelming. How do you actually choose from this wide variety of options?
Before we get into the weeds of tools, let’s take a moment to look at the main capabilities and key considerations that a data science platform should have:
- Exploratory data analysis: Being able to perform initial exploratory data analysis is crucial, especially for people looking to use a data science tool on a workstation. It enables them to focus on the initial stages of the machine learning lifecycle: understanding the data set, getting some data visualisations and doing initial data preprocessing (see the short sketch after this list).
- Machine learning lifecycle: The main purpose of any professional or enthusiast in this space is to build models. Therefore, they need tools that cover multiple parts of the machine learning lifecycle, enabling them to build and store models, and to track and reproduce experiments, so that model development is straightforward from the earliest stages.
- Popular tools: For any beginner, the scale of adoption of their chosen tools can make or break their experience. When a tool is used by more people, it typically has better awareness and documentation of bugs, challenges and workarounds. In the open source space, the community provides extensive support and guidance, enabling professionals from different areas to benefit from continuous improvements, fixes and workarounds for popular tools and platforms.
- Ease of use: Everyone wants tools that are easy to use. The main objective of a data scientist is not endless tinkering with tools, so having an intuitive platform that accelerates project delivery and reduces the learning curve is vital for their work.
- Scalability: While many AI projects start small, every data scientist should also have a long-term vision and consider scalability. This allows the platform to grow as the project matures, without data scientists needing to upskill in yet another set of tools.
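To illustrate the first capability, here is a minimal exploratory data analysis sketch in Python using pandas and Matplotlib. The file name and column name are hypothetical placeholders; the calls themselves are standard library usage.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load a hypothetical dataset (file name and columns are placeholders)
df = pd.read_csv("data.csv")

# First look at the data: shape, types, missing values and summary statistics
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.describe())

# Simple preprocessing: drop rows with missing values
df = df.dropna()

# Quick visualisation of one numeric column's distribution
df["feature_1"].hist(bins=30)
plt.title("Distribution of feature_1")
plt.show()
```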
Join our webinar to learn more about data science tools
Register now
Now that we have an idea of what to look for in a data science tool or platform, let’s take a closer look at the popular options that data scientists use.
Returning to the preference for open source, we should look at the whole stack and how open source tooling can accelerate the entire process. Linux has pioneered the open source space, with Ubuntu being the most widely adopted distribution. It has a powerful command line that data scientists and machine learning engineers enjoy using, and it simplifies their operational tasks. Furthermore, open source has a lot more to offer that can enhance someone’s journey in data science. Python is a great example: it’s the preferred programming language in data science, and many of its libraries, such as Pandas, NumPy, PyTorch and TensorFlow, have been widely adopted in countless data science projects.
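As a flavour of how these libraries fit together, the sketch below combines NumPy, pandas and PyTorch to train a tiny regression model on synthetic data. It is an illustration of the stack rather than a realistic workflow, and all names and values are made up.

```python
import numpy as np
import pandas as pd
import torch
import torch.nn as nn

# Generate a small synthetic dataset with NumPy and wrap it in a pandas DataFrame
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
df = pd.DataFrame(X, columns=["x1", "x2", "x3"]).assign(target=y)

# Convert to PyTorch tensors
features = torch.tensor(df[["x1", "x2", "x3"]].values, dtype=torch.float32)
target = torch.tensor(df["target"].values, dtype=torch.float32).unsqueeze(1)

# A single linear layer is enough for this toy regression problem
model = nn.Linear(3, 1)
optimiser = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimiser.zero_grad()
    loss = loss_fn(model(features), target)
    loss.backward()
    optimiser.step()

print(f"Final training loss: {loss.item():.4f}")
```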
But how do you actually build the models? In the Stack Overflow report mentioned above, Jupyter Notebook is listed as one of the top technologies used in data science. It is a powerful tool for many data science and machine learning tasks, including cleaning data, building ML pipelines and training models. In the same space, MLflow, which is used for experiment tracking and model registry, hit 10 million users over a year ago, reflecting how quickly open source is being adopted in this field. Such a platform is often deployed on a workstation with a GPU, which must also be configured. NVIDIA, for example, provides a GPU operator that streamlines this experience for cloud-native applications.
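As a brief sketch of what experiment tracking looks like in practice, the snippet below logs parameters and metrics with MLflow’s standard tracking API. The experiment name and the logged values are invented for illustration.

```python
import mlflow

# Group runs under an illustrative experiment name (placeholder)
mlflow.set_experiment("customer-churn-baseline")

with mlflow.start_run():
    # Log the hyperparameters used for this run
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("epochs", 100)

    # ... train the model here ...

    # Log the resulting metrics so the run can be compared and reproduced later
    mlflow.log_metric("train_loss", 0.012)
    mlflow.log_metric("val_accuracy", 0.91)
```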
These are just some examples of the tools one can use. Once they are selected, data scientists need to integrate them into a cohesive solution. Whenever they are deployed, they pull in a series of packages with their own dependencies and versioning constraints. Users need to coordinate this effort to keep the platform functioning well, including through upgrades and updates that might otherwise break it.
Considering data scientists’ initial challenges, they should look for tools that address most of them at the lowest possible cost. The Data Science Stack (DSS) is a solution provided by Canonical that brings together leading open source tools covering part of the machine learning lifecycle, enabling users to develop, optimise and store models without hefty start-up costs, time-consuming setup or difficult configurations.
What is Data Science Stack (DSS)?
Data Science Stack (DSS) is an out-of-the-box solution for data scientists and machine learning engineers, published by Canonical. It is a ready-made environment for ML enthusiasts that enables them to develop and optimise models without spending time on the necessary tooling. It is designed to run on any Ubuntu AI workstation, maximising the GPU’s capability and simplifying its usage. Are you curious?
DSS includes leading open source tools, such as Jupyter Notebook and MLflow, fully integrated. By default, it ships two of the most widely adopted ML images, PyTorch and TensorFlow. They can be deployed using an intuitive command line interface (CLI), after which the tools’ UIs can be accessed to dive straight into data science.
Beyond giving access to an ML solution, DSS also takes care of package dependencies, ensuring that all the tools, libraries and frameworks work seamlessly together and are compatible with the machine’s hardware. In addition, DSS simplifies GPU configuration by including the GPU operator and all the benefits that come with it.
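As an example of what this means in practice, once a notebook is running you can verify from Python that the GPU is visible to the ML frameworks. This is a generic PyTorch check, not a DSS-specific API.

```python
import torch

# Check whether the CUDA-enabled GPU configured by the stack is visible to PyTorch
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; falling back to CPU")
```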
Try Canonical’s Data Science Stack
DSS is available in beta, and we invite data scientists, machine learning engineers and AI enthusiasts to share their feedback with us. You can easily deploy it on your Ubuntu machine, tell us about your experience, and benefit from the ongoing community feedback.
Join our webinar
If you would like to learn more about data science tools, join our webinar on [date]. Together with Michal Hucko, we will talk about:
- Key considerations when selecting data science tools
- Challenges of the data science landscape
- Data science with open source tooling
- Demo of the DSS