Elena's AI Blog

Tools and Data to Experiment with Machine Learning

19 Oct 2021 (updated: 29 Dec 2025) / 4 minutes to read

Elena Daehnhardt




If you click an affiliate link and subsequently make a purchase, I will earn a small commission at no additional cost (you pay nothing extra). This is important for promoting tools I like and supporting my blogging.

I thoroughly check the affiliated products' functionality and use them myself to ensure high-quality content for my readers. Thank you very much for motivating me to write.



TL;DR:
  • Start ML experiments with Kaggle datasets and courses. Use scikit-learn for traditional ML, TensorFlow/PyTorch for deep learning. OpenML provides reproducible datasets—choose tools by task complexity.


Machine Learning Tools, Platforms, and Datasets: Getting Started

This post covers the essential Python libraries, web platforms, and public datasets for experimenting with Machine Learning.

Machine Learning Libraries and APIs: scikit-learn, TensorFlow, PyTorch, XGBoost

The core Python libraries for Machine Learning are scikit-learn (traditional ML), TensorFlow by Google, PyTorch by Facebook/Meta, and Keras (a high-level neural network API). These are among the most mature tools that implement Machine Learning algorithms. Keras offers a simpler interface on top of lower-level libraries: it originally ran on backends such as TensorFlow and Theano (Aesara is a later fork of Theano), and since Keras 3 it supports TensorFlow, JAX, and PyTorch. PyTorch is often considered more user-friendly than TensorFlow because its models are written and debugged as ordinary Python code. TensorFlow, in turn, was released earlier and has a long-established community and tooling for deploying models.

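scikit-learn is a Python open-source library that provides regression, classification, clustering, feature selection, metrics, and preprocessing functionality. scikit-learn includes only a basic multi-layer perceptron and is not designed for deep learning. The Scikit Flow wrapper once offered a scikit-learn-style API for building neural networks with TensorFlow, but it was merged into TensorFlow and later removed in TensorFlow 2. Today, wrappers such as SciKeras (for Keras) or skorch (for PyTorch) serve the same purpose.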
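To make the scikit-learn workflow concrete, here is a minimal sketch that combines preprocessing, classification, and a metric. The choice of the built-in Iris dataset, logistic regression, and a 25% test split is my assumption for illustration, not a recommendation for any particular task.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset: 150 iris flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for evaluation; stratify keeps the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Chain preprocessing and the classifier so that scaling is fitted
# on the training rows only and then reused for the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the model in one pipeline is a design choice that avoids leaking test data into preprocessing. The same pattern works when you swap in another scikit-learn estimator.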

XGBoost (Extreme Gradient Boosting) is a gradient boosting library that parallelises tree construction across CPU cores. It uses boosted decision trees to build regression, classification, ranking, and other predictive models, and it is a popular choice for structured tabular data competitions on Kaggle.
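The sketch below shows how boosted trees can be trained on a small tabular dataset through XGBoost's scikit-learn-style interface. The dataset (scikit-learn's built-in breast cancer data) and the hyperparameter values are assumptions chosen for illustration. If the separate xgboost package is not installed, the code falls back to scikit-learn's own gradient boosting, which is a stand-in and not XGBoost itself.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

try:
    # Requires the separate `xgboost` package (pip install xgboost).
    from xgboost import XGBClassifier

    model = XGBClassifier(
        n_estimators=100,   # number of boosted trees
        max_depth=3,        # maximum depth of each tree
        learning_rate=0.1,  # shrinkage applied to each tree's contribution
        n_jobs=2,           # CPU threads used for training
        random_state=42,
    )
except ImportError:
    # Stand-in only: scikit-learn's gradient boosting, NOT XGBoost.
    from sklearn.ensemble import GradientBoostingClassifier

    model = GradientBoostingClassifier(
        n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
    )

# Binary classification on 30 numeric features.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"{type(model).__name__} test accuracy: {accuracy:.2f}")
```

Increasing n_jobs lets XGBoost use more cores, which matters mostly on larger datasets than this one.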

Machine Learning Platforms: Kaggle and OpenML for Experimentation

Kaggle is one of the largest public Machine Learning competition and dataset platforms, and I recommend it as a first step for hands-on ML experimentation.

Kaggle has posts, discussions, shared code, datasets, and competitions on topics from simple regression to computer vision and NLP. Kaggle also has courses in Python, Machine Learning, data manipulation and visualisation, SQL, and others.

OpenML is a Machine Learning platform for sharing experiments and datasets to facilitate reproducibility in research. OpenML contributors provide a Python API, OpenML-Python, that connects the platform to ML libraries such as scikit-learn [1]. OpenML is widely used as a reference platform for reproducible ML benchmarks in academic research.
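As a rough sketch of how an OpenML dataset can be pulled into Python, the code below uses scikit-learn's built-in fetch_openml function rather than the OpenML-Python package cited in [1]. The helper names load_openml_dataset and summarize are hypothetical and exist only for this example. Pinning the dataset version is what keeps the experiment reproducible.

```python
import pandas as pd
from sklearn.datasets import fetch_openml


def load_openml_dataset(name: str, version: int = 1):
    """Download a dataset from openml.org by name and version (needs network)."""
    bunch = fetch_openml(name=name, version=version, as_frame=True)
    return bunch.data, bunch.target


def summarize(features: pd.DataFrame, target: pd.Series) -> dict:
    """Return basic facts worth recording alongside an experiment."""
    return {
        "n_rows": len(features),
        "n_features": features.shape[1],
        "class_counts": target.value_counts().to_dict(),
    }


# Usage (downloads data, so it is left commented out):
# X, y = load_openml_dataset("iris", version=1)
# print(summarize(X, y))
```

Recording the dataset name, version, and a short summary next to your results makes it easier for others to repeat the experiment on exactly the same data.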

Public Machine Learning Datasets: Sources and APIs

Note that datasets on these websites are often already preprocessed and ready for experiments. For more realistic practice, it is good to try collecting your own data, for instance by web scraping or by using publicly available APIs. For example, the Twitter streaming API has been a good source of real-life data, although access terms have changed since Twitter became X, so check the current API conditions first. I have shared the tweet collection code in my GitHub repository if you would like to experiment with Twitter data.
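Collecting your own data from a public API usually comes down to two steps: download JSON records, then keep the fields you need in a tabular file. The sketch below uses only the Python standard library. The function names and the example URL are hypothetical, and this is not the tweet collection code from my repository.

```python
import csv
import io
import json
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 10.0):
    """Download and decode a JSON document from a public API (needs network)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def records_to_csv(records: list[dict], fields: list[str]) -> str:
    """Keep only the chosen fields and serialise the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for record in records:
        # Missing keys become empty cells, which keeps gaps visible later.
        writer.writerow({field: record.get(field, "") for field in fields})
    return buffer.getvalue()


# Usage with a hypothetical endpoint that returns a JSON list of objects:
# records = fetch_json("https://api.example.com/posts")
# print(records_to_csv(records, ["id", "text"]))
```

Unlike the curated datasets above, data collected this way still needs cleaning, so expect missing fields and inconsistent values.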

kaggle.com

openml.org

Did you like this post? Please let me know if you have any comments or suggestions.


References

[1] Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Mueller, Joaquin Vanschoren, Frank Hutter. OpenML-Python: an extensible Python API for OpenML. arXiv:1911.02490 [cs.LG], 2019


About Elena

Elena, a PhD in Computer Science, simplifies AI concepts and helps you use machine learning.

Citation
Elena Daehnhardt. (2021) 'Tools and Data to Experiment with Machine Learning', daehnhardt.com, 19 October 2021. Available at: https://daehnhardt.com/blog/2021/10/19/edaehn-ml-datasets/