Friday, February 23, 2024

[Data science 2022] What is the best language for data science?


Table of Contents

  • The answer to this question is more subtle than it seems
  • Comparing Programming Languages and Technologies as Notations
  • Optimize first prototyping time
  • Factors Affecting Prototyping Speed
  • Ethical Concerns in Rapid Prototyping
  • Conclusion

The answer to this question is more subtle than it seems

In a recent data science group discussion, I saw data scientists debating whether PyTorch or TensorFlow is the better machine learning framework. I found it interesting because I’ve heard other versions of this argument many times before: Python or R, MATLAB or Mathematica, Windows or Linux.

Over the years I’ve learned a lot of different programming languages, but I’ve come to realize that the question shouldn’t be which language or framework is best, but which one is best for the task at hand.

So what is the best language to use for prototyping machine learning? Ask five data scientists this question and you might get five different answers. In answering this question, you need to think about what language you are most comfortable with and what libraries and models are available in that language so that you can prototype machine learning as quickly as possible.

Comparing Programming Languages and Technologies as Notations

We often think of programming languages as a form of technology. Viewed that way, all programming languages are equivalent, because they accomplish the same high-level task of producing machine code. But for a given task, one language may get the job done faster, and another may be able to express a more complex task that the first cannot handle as easily. When viewed as a form of technology, languages are logically equivalent. But when viewed as a notation, one language may be better at the task than another.

A mathematical analogy (to consider notation-based comparisons) is the notion of computing the derivative of a function. To do this task, mathematicians need to use some form of notation (Fig. 1).

The right side of Figure 1 shows the dot notation developed by Sir Isaac Newton. The left side shows the notation developed around the same time by Gottfried Wilhelm Leibniz. If we think of these notations as technologies, they are logically the same, since both can be used to compute derivatives.

However, when considered as notations, the two are not necessarily equal. For some tasks, one notation may be better. For example, Leibniz’s notation is much more intuitive than Newton’s notation for solving differential equations through separation of variables, integration by parts, and so on. Therefore, Leibniz’s notation can be considered more powerful than Newton’s notation for solving differential equations (Fig. 2).

If these two forms of mathematical notation have different levels of power, measured by how useful they are, are there also levels of power associated with programming languages? Of course there are!
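To make the comparison concrete, here is a short worked example of separation of variables in Leibniz’s notation. The equation dy/dx = ky (with constant k and y ≠ 0) is chosen purely for illustration and does not come from the original figures.

```latex
\begin{align*}
\frac{dy}{dx} &= k\,y \\
\frac{dy}{y} &= k\,dx
  && \text{treat } dy \text{ and } dx \text{ as separable symbols} \\
\int \frac{dy}{y} &= \int k\,dx \\
\ln\lvert y\rvert &= kx + C \\
y &= A\,e^{kx}, \qquad A = \pm e^{C}
\end{align*}
```

In Newton’s dot notation the same equation reads \dot{y} = ky. The result is identical, but there is no dx on the page to move across the equals sign, so the separation step is less visually obvious.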

Optimize first prototyping time

Any discussion of which programming language is the best always comes down to the language’s computational performance. For example, Peter Xie wrote an article about Python being 45 times slower than C. Does the fact that C is faster than Python mean that C is better? Not necessarily.

I’ve shown some tricks to speed up Python code (in my online course), but I think people are missing a really important point. A developer’s or data scientist’s time is much more expensive than a machine’s processing time.

People often get caught up in trying to fully optimize code on the first or second iteration. I think that instead of optimizing machine time first, you should write proof-of-concept code and optimize the time it takes to deliver value to your customers.

Python may be slower to compute than C, but compared to C, it takes fewer lines of code to create a machine learning model in Python, and ultimately takes less time to code. Small things like dynamic typing instead of static typing make a big difference in speed when prototyping.

However, many of the standard languages used by data scientists have features and abstractions similar to Python’s, so Python isn’t always the right language for the job. When choosing a programming language, libraries actually seem to be more important than the core language.

Like me, you probably have a lot of experience with natural language processing (NLP).

In my experience, Python has the most powerful libraries for NLP. scikit-learn, NLTK, spaCy, gensim, textacy, and many other libraries enable data scientists to quickly clean, preprocess, featurize, and model text data. However, when trying to do data profiling and time series analysis, the packages available for R prove to be more powerful.

That said, when I need to use more sophisticated time-series techniques like recurrent neural networks, I often come back to Python because Python’s libraries excel at deep learning.

When you start using complex machine learning architectures like deep learning, the time it takes to train a model usually increases rapidly. However, as computing power continues to improve, learning time is becoming less of an issue.

In fact, I would argue that the main reason we are building more complex machine learning models is because more computing power has become available, and organizations are racing to take advantage of it all.

When choosing the language or library to use in your final project, I recommend choosing the one that optimizes the time to build the first prototype. When you learn to code in a language, you actually start thinking in that language. I know this myself.

I code mostly in Python now, but when I decide how to solve a problem, I think about how to solve it in Python. When I look back on my career learning a new language, I realize that the language I was using at the time made all the difference in how I solved problems.

In July 2020, Peter Xie, a self-described Python and AI enthusiast, posted an article on Medium titled “How slow is Python compared to C?”
To measure the processing speed of Python and C, he wrote a loop that starts a counter at 0 and repeatedly adds 1, then counted how many times the loop could run in one second.
By this measurement, Python looped 10 million times per second, while C looped 450 million times. From this result, he concluded that Python is 45 times slower than C.
One of the authors of this article, Leah Simpson, teaches a data science course on Skillshare, an online learning platform offering courses on creative, hobby, and technology skills.
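The loop measurement described in the note on Xie’s article can be approximated with a few lines of Python. This is a minimal sketch and not Xie’s actual benchmark code: the function name `count_loop_iterations` and the `batch` parameter are hypothetical, and the clock is checked once per batch so that the timing call does not dominate the loop.

```python
import time


def count_loop_iterations(duration_s=1.0, batch=10_000):
    """Count how many `count += 1` steps complete in roughly `duration_s` seconds.

    The deadline is checked once per batch, so the run may overshoot the
    deadline by up to one batch.
    """
    count = 0
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        for _ in range(batch):
            count += 1
    return count


if __name__ == "__main__":
    iterations = count_loop_iterations(1.0)
    print(f"{iterations:,} iterations in about one second")
```

The number printed depends heavily on the machine and the Python version, so treat it as a rough indication rather than a figure to compare directly with the numbers quoted above.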

Factors Affecting Prototyping Speed

One of the main factors affecting prototyping speed is the time it takes to build the model architecture. I have leaned towards libraries like scikit-learn for building model architectures.

This is because (with scikit-learn) a lot of work is not required to build the model architecture, so the initial model can be built quickly. I usually start with the simplest model and work my way up the complexity of the model.
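The “start simple, then add complexity” workflow can be sketched with scikit-learn as follows. The synthetic dataset, the list of candidate models, and the `TARGET_ACCURACY` threshold are all assumptions made for illustration; a real project would substitute its own data and accuracy requirement.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real project data (assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidates ordered from simplest to most complex.
candidates = [
    ("majority-class baseline", DummyClassifier(strategy="most_frequent")),
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]

TARGET_ACCURACY = 0.85  # hypothetical requirement, not from the article

scores = {}
for name, model in candidates:
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out data
    print(f"{name}: accuracy = {scores[name]:.3f}")
    if scores[name] >= TARGET_ACCURACY:
        # Stop at the first (simplest) model that meets the requirement.
        break
```

Because every estimator shares the same `fit` and `score` interface, adding or swapping a candidate is a one-line change, which is a large part of why the first prototype comes together quickly.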

It is only after exhausting simple models that I start building more complex deep learning models. At that point, I look for abstractions that let me build model architectures as quickly as possible, using tools like Keras and TensorFlow.

Another factor that affects the speed of prototyping is the amount of data required to train the initial model. In many cases, you can start with simple model development that doesn’t require much data.

However, a simple model may not meet the accuracy requirements. So we have to use more complex models, which usually require more data. In this case, it may take a month to acquire and create more data, which ultimately slows down the speed of prototype development.

You might try to use transfer learning techniques or pre-trained models to shortcut the process of acquiring or creating more data. While these techniques and models may allow for faster prototyping, they can introduce serious ethical issues into prototyping.

Ethical Concerns in Rapid Prototyping

As data scientists, we have a responsibility to ensure that our models do not yield biased results for protected populations.

Amazon learned this lesson the hard way. If you’re using a pre-trained model, you may not have access to the dataset the model was trained on. Even if you have the training data, you may not know how it was cleaned, preprocessed, and sampled for training and testing.

Each one of these steps may fail to remove the bias inherent in the dataset. Worse, it may introduce or even amplify biases in the dataset (during the pre-training stage). I can only speak for myself, but I’m not confident that a pre-trained model can be debiased in the process of transfer learning. So I’m pretty cautious about using this technique. However, when I am convinced that the risks associated with a business use case are limited, I rely on websites like Hugging Face to decide which library to use.

When using pre-trained models, the power of the available machine learning models effectively replaces the power of libraries and programming languages in dictating what I use. As a result, I routinely do state-of-the-art data science work with PyTorch, whether that involves transfer learning or directly using models built by other organizations without transfer learning.

In October 2018, Reuters reported that in 2014, Amazon’s machine learning research team built a model to automatically rate job applicants to the company from one to five stars. But in 2015, it was discovered that the model was not rating candidates in a gender-neutral way and was underrating women.
The bias is thought to have arisen because the model learned the patterns between resumes and hiring decisions over the previous 10 years or more, and from them learned that men were more likely to be hired.

As a result of learning these biases, a phrase like “captain of the women’s chess team” in a resume lowered the rating. Moreover, the fact that many men were hired in the past does not show that men are better suited to technical positions, because some of the women who were not hired may have gone on to become excellent engineers.
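A disparity like the one in the Amazon case can be surfaced with a very simple check: compare the rate at which the model gives a positive result to each group. The sketch below is one basic screening step under stated assumptions, not a complete fairness audit. The function names `selection_rates` and `rate_ratio` are hypothetical, and the toy predictions and group labels are invented for illustration.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Return the fraction of positive predictions (value 1) for each group."""
    selected = defaultdict(int)
    total = defaultdict(int)
    for prediction, group in zip(predictions, groups):
        total[group] += 1
        selected[group] += int(prediction == 1)
    return {group: selected[group] / total[group] for group in total}


def rate_ratio(rates):
    """Ratio of the lowest selection rate to the highest (1.0 means parity)."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group received any positive prediction
    return min(rates.values()) / highest


# Invented toy data: 1 = model recommends the candidate, 0 = it does not.
predictions = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
groups = ["m", "m", "m", "m", "m", "f", "f", "f", "f", "f"]

rates = selection_rates(predictions, groups)
ratio = rate_ratio(rates)
print(rates)
print(f"lowest-to-highest selection rate ratio: {ratio:.2f}")
```

A ratio far below 1.0 does not prove the model is unfair on its own, but it is a signal to investigate the training data and preprocessing before relying on the model.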

(*Translation Note 4) Hugging Face is a startup that publishes and provides various models and datasets, advocating the democratization of machine learning.

Conclusion

In the end, the recurring battles over languages, libraries, and models are irrelevant. No single programming language is best for data science. The best programming language is the one in which you can prototype the fastest and deliver value to your customers and end users.

While prototyping, don’t get hung up on optimizing your code right away. Instead, try to optimize the data scientist’s speed, that is, the speed of prototyping. Once you have successfully demonstrated that your model solves the problem at hand, you can use a profiler to identify where the bottlenecks are in your code.

You can optimize your machine’s processing time this way, but you can’t use a profiler until you’ve written the prototype code. Try not to put the cart before the horse! Identify the type of problem you are trying to solve and choose the toolset that helps you build a solution the fastest.
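As a rough illustration of profiling a finished prototype, the sketch below uses Python’s standard `cProfile` and `pstats` modules. The functions `slow_feature`, `fast_step`, and `prototype` are hypothetical stand-ins for real prototype code, and `profile_report` is a small helper written for this example.

```python
import cProfile
import io
import pstats


def slow_feature(n):
    """Deliberately wasteful step standing in for an unoptimized hotspot."""
    total = 0
    for i in range(n):
        total += sum(range(i % 100))
    return total


def fast_step(n):
    """Cheap step that should barely register in the profile."""
    return n * 2


def prototype(n):
    """Stand-in for working prototype code that is now worth optimizing."""
    return slow_feature(n) + fast_step(n)


def profile_report(func, *args, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_report(prototype, 20_000)
print(report)
```

In the printed table, the rows with the largest cumulative time point at the code worth optimizing first; here that should be `slow_feature` rather than `fast_step`.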
