Tools and Data to Experiment with Machine Learning

In this post, I write about tools, web platforms, and data for experimenting with Machine Learning.

Libraries and APIs

With a focus on Python libraries, I want to mention scikit-learn, TensorFlow by Google, PyTorch by Facebook, and Keras (an API), which are among the most mature tools providing Machine Learning algorithms. Keras is an API that makes other libraries such as TensorFlow or Aesara (formerly Theano) easier to use. PyTorch is also more user-friendly than TensorFlow. Although TensorFlow seems to be more complicated to use, it is more mature and has a larger community and better support.

The open-source Python library scikit-learn provides a comprehensive selection of machine learning techniques (regression, classification, clustering), along with feature selection, metrics, preprocessing, and other functionality. At the moment, scikit-learn lacks deep learning functionality; however, we can use TensorFlow with the Scikit Flow wrapper to create neural networks in the scikit-learn style.
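As a minimal sketch of how preprocessing, a classifier, and a metric fit together in scikit-learn, the example below trains a model on the Iris dataset that ships with the library. The dataset, the choice of logistic regression, and the split parameters are my assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small dataset bundled with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain preprocessing and classification into a single estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Score the held-out samples with a metric from sklearn.metrics.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Because the scaler and the classifier live in one pipeline, the scaling statistics are learned from the training split only, which avoids leaking information from the test split.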

The XGBoost library is another option for applications requiring multicore parallelism. XGBoost stands for Extreme Gradient Boosting; it uses boosted trees to build regression, classification, ranking, and other predictive models.
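To make the idea of boosted trees more concrete, here is a hand-written sketch of gradient boosting for regression, built from scikit-learn decision trees. It is not XGBoost's API or its actual algorithm, which adds regularisation, second-order gradient information, and parallel tree construction. The function names, the synthetic data, and the hyperparameters are all my own assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Fit a simple gradient-boosted ensemble for squared-error regression."""
    base = float(np.mean(y))              # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction        # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)            # each tree corrects the current errors
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, learning_rate, trees


def predict_boosted_trees(model, X):
    """Sum the constant prediction and the scaled output of every tree."""
    base, learning_rate, trees = model
    return base + learning_rate * sum(tree.predict(X) for tree in trees)


# Synthetic data (hypothetical): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

model = fit_boosted_trees(X, y)
baseline_mse = float(np.mean((y - np.mean(y)) ** 2))
boosted_mse = float(np.mean((y - predict_boosted_trees(model, X)) ** 2))
print(f"Baseline MSE: {baseline_mse:.3f}, boosted MSE: {boosted_mse:.3f}")
```

Both errors are measured on the training data, so the comparison only shows that each added tree reduces the remaining error, not how well the model generalises.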

Platforms

To learn how Machine Learning works in practice alongside a worldwide community, I recommend kaggle.com as one of the first steps in ML experimentation. Kaggle has loads of posts, discussions, shared code, datasets, and competitions on different topics, from simple regression to working with visual media. Kaggle also has courses in Python, Machine Learning, data manipulation and visualisation, SQL, and others.

Openml.org is a Machine Learning platform created for sharing machine learning experiments and datasets to facilitate reproducibility in research. OpenML contributors also provide an API that works with ML libraries such as scikit-learn [1].
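One route to OpenML data that I know of is scikit-learn's own fetch_openml function; whether this is the same API that [1] describes is my assumption. The sketch below wraps it in a small helper. The helper name and the "iris" dataset are hypothetical choices, and as_frame=True assumes that pandas is installed.

```python
from sklearn.datasets import fetch_openml


def load_openml_dataset(name, version=1):
    """Download a dataset from OpenML and return (features, target)."""
    # The first call needs network access; scikit-learn caches the result locally.
    return fetch_openml(name=name, version=version, return_X_y=True, as_frame=True)


# Example call, left commented out because it downloads data:
# X, y = load_openml_dataset("iris")
# print(X.shape)
# print(y.value_counts())
```

Pinning the version keeps an experiment reproducible, since several versions of a dataset can exist under the same name on OpenML.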

Datasets

It is essential to mention that datasets on these websites are already preprocessed and ready for experimentation. For real-life training purposes, it is good to try collecting your own data, for instance, through web scraping or publicly available APIs. For example, the Twitter streaming API is a good source of real-life data. I have shared the tweet collection code in my GitHub repository if you would like to experiment with Twitter data.
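As a small illustration of the web scraping route, the sketch below extracts links and their text from an HTML page using only the Python standard library. The page content is a made-up string standing in for a real HTTP response, and the class name is my own; this is not the tweet collection code mentioned above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page content; in practice this would come from an HTTP response.
SAMPLE_HTML = """
<ul>
  <li><a href="/posts/1">First post</a></li>
  <li><a href="/posts/2">Second <b>post</b></a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

Data collected this way usually still needs cleaning before it is usable for training, which is exactly the part that the preprocessed datasets hide.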
