Elena's AI Blog

Tools and Data to Experiment with Machine Learning

19 Oct 2021 / 3 minutes to read

Elena Daehnhardt


Midjourney AI-generated art


Introduction

In this post, I write about tools, web platforms, and data for experimenting with Machine Learning.

Libraries and APIs

With a focus on Python libraries, I want to mention scikit-learn (scikit-learn.org), TensorFlow by Google, PyTorch by Facebook, and Keras (API), the most mature tools providing Machine Learning algorithms. Keras is an API that makes it easier to use other libraries such as TensorFlow or Aesara (formerly Theano). PyTorch is also more user-friendly than TensorFlow. Although TensorFlow seems more complicated to use, it is more mature and has a larger community and better support.

The open-source Python library scikit-learn provides a comprehensive selection of machine learning techniques (regression, classification, clustering), feature selection, metrics, preprocessing, and other functionality. At the moment, scikit-learn lacks deep learning functionality; however, we can use TensorFlow with the Scikit Flow wrapper to create neural networks in the scikit-learn style.
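To make the list of scikit-learn features more concrete, here is a minimal sketch that chains preprocessing, feature selection, a classifier, and a metric. The Iris dataset, the logistic regression model, and the choice of keeping two features are my assumptions for illustration only; any other estimator or dataset would follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, chosen only for illustration (no download needed).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing -> feature selection -> classification in one pipeline.
model = make_pipeline(
    StandardScaler(),             # scale each feature to zero mean, unit variance
    SelectKBest(f_classif, k=2),  # keep the two most informative features
    LogisticRegression(max_iter=200),
)
model.fit(X_train, y_train)

# Evaluate on the held-out part of the data.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the steps in a pipeline means the scaler and the feature selector are fitted on the training split only, which avoids leaking information from the test split.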

The XGBoost library is another option for applications requiring multicore parallelism. XGBoost stands for Extreme Gradient Boosting; it uses boosted trees to build regression, classification, ranking, and other predictive models.

Platforms

To learn how Machine Learning works in practice alongside a worldwide community, I recommend kaggle.com as one of the first steps in ML experimentation. Kaggle has loads of posts, discussions, shared code, datasets, and competitions on different topics, from simple regression to working with visual media. Kaggle also has courses in Python, Machine Learning, data manipulation and visualisation, SQL, and others.

OpenML (openml.org) is a Machine Learning platform created for sharing machine learning experiments and datasets to facilitate reproducibility in research. OpenML contributors also provide a Python API that works with ML libraries such as scikit-learn [1].
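One simple way to pull an OpenML dataset into Python is sketched below. Note that it uses scikit-learn's own fetch_openml helper rather than the OpenML-Python package described in [1]. The helper name load_openml_dataset is hypothetical, and the function needs network access and pandas when it is called.

```python
from sklearn.datasets import fetch_openml


def load_openml_dataset(name: str, version: int = 1):
    """Download a dataset from openml.org and return (features, target).

    Hypothetical helper for illustration; calling it requires network access.
    """
    bunch = fetch_openml(name=name, version=version, as_frame=True)
    return bunch.data, bunch.target
```

For example, X, y = load_openml_dataset("iris") should return the features as a pandas DataFrame and the target as a pandas Series, ready to be passed to a scikit-learn estimator.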

Datasets

It is essential to mention that datasets on these websites are already preprocessed and ready for experiments. For real-life training purposes, it is good to try collecting your own data, for instance, by web scraping or by using publicly available APIs. For example, the Twitter streaming API is a good source of real-life data. I have shared the tweet collection code in my GitHub repository if you would like to experiment with Twitter data.

kaggle.com

openml.org
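Data you collect yourself usually needs cleaning before it is as convenient as the preprocessed datasets above. The following is a rough sketch of that step using only the Python standard library. The sample records are made up for illustration, and the cleaning rules (dropping URLs and mentions, lowercasing) are my assumptions, not the code from my repository.

```python
import re

# Hypothetical raw records, standing in for text collected from a public API.
raw_records = [
    "Loving the new #MachineLearning course! https://example.com/ml @kaggle",
    "   Data   cleaning takes TIME, says @someone   ",
    "",
]

URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove URLs and @mentions, drop '#' signs, collapse whitespace, lowercase."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    text = text.replace("#", "")
    text = WHITESPACE_PATTERN.sub(" ", text).strip()
    return text.lower()


cleaned = [clean_text(record) for record in raw_records]
cleaned = [text for text in cleaned if text]  # drop records left empty after cleaning
print(cleaned)
```

Which rules are appropriate depends on the task; for sentiment analysis, for instance, hashtags and capitalisation may carry useful signal and should perhaps be kept.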

Did you like this post? Please let me know if you have any comments or suggestions.


References

[1] Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Mueller, Joaquin Vanschoren, Frank Hutter. OpenML-Python: an extensible Python API for OpenML. arXiv:1911.02490 [cs.LG], 2019

About Elena

Elena, a PhD in Computer Science, simplifies AI concepts and helps you use machine learning.

Citation
Elena Daehnhardt. (2021) 'Tools and Data to Experiment with Machine Learning', daehnhardt.com, 19 October 2021. Available at: https://daehnhardt.com/blog/2021/10/19/edaehn-ml-datasets/