In this blog post, we look at dabl, an automated machine learning library for Python whose name stands for Data Analysis Baseline Library. It lets you perform quick exploratory data analysis and also build machine learning models rapidly. If you like pandas-profiling, you will probably like dabl too, because it works in a similarly simple way: it needs only minimal code to visualize data and build a machine learning model.

Prerequisites:

DABL: dabl is an open-source library created by Andreas Mueller. It makes supervised machine learning more accessible for beginners and reduces the boilerplate for common machine learning tasks. dabl takes inspiration from scikit-learn and auto-sklearn.

dabl can be installed using pip. The library depends on a recent version of scikit-learn, so if you don’t already have one, you will need to install or upgrade it.

Use the following command to install DABL.

pip install dabl
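To confirm that the installation worked, you can check which versions ended up in your environment. The following is a minimal sketch that uses only the Python standard library; the helper name installed_version is made up for this post and is not part of dabl.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of dabl and of its main dependency, scikit-learn.
for name in ("dabl", "scikit-learn"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If either line reports "not installed", run the pip command again before continuing.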

Exploratory Data analysis

from dabl import plot
from dabl.datasets import load_ames
import matplotlib.pyplot as plt

# load the Ames housing dataset
# returns a pandas DataFrame
data = load_ames()

# pass the data frame and the name of the target column
plot(data, 'SalePrice')
plt.show()

Let’s go through the code line by line to understand what is happening:

  1. First, we import the plot function from dabl.
  2. Second, we import the Ames housing dataset loader, load_ames, from dabl.datasets. dabl ships other datasets that you can load in the same way:
    1. load_adult() – the adult census dataset.
    2. load_titanic() – the Titanic dataset.
  3. We import matplotlib.pyplot as plt.
  4. Next, we assign the Ames housing data returned by load_ames() to the data variable.
  5. Then we call plot with two arguments: the data and the name of the target column, which is SalePrice.
  6. Lastly, we call plt.show() from matplotlib to display the figures.

Output:

Exploratory Data analysis with our own dataset

from dabl import plot
import pandas as pd
import matplotlib.pyplot as plt
# load your own csv file
df = pd.read_csv("data.csv")
# pass the data frame and the name of the target column
plot(df, 'Data_value')
plt.show()

Output:

Custom plotting

dabl.plot() gives a quick overview of the data, but it does not guarantee that you see every facet of the problem for your specific dataset. It provides high-level insight into which features might be important and how they relate to the target, and lets you decide whether you need to investigate further or build custom plots for your specific needs.

# import the required libraries
import dabl
import pandas as pd
import matplotlib.pyplot as plt

# load your own csv file
df = pd.read_csv("data.csv")
df
# pass the data frame and the name of the target column
dabl.plot(df, target_col="Period")

plt.show()

Output:

Data cleaning with dabl

The first step in any data analysis is to make the data clean and readable. dabl helps with this by detecting the type of each column and applying appropriate conversions. It also tries to detect potential data quality issues. The goal of cleaning in dabl is to get the data clean enough to create useful visualizations and models.

dabl provides a function called dabl.clean for cleaning the data.

# import the required libraries
import dabl
import pandas as pd

# use the dabl.clean function to clean the data
data = pd.read_csv("data.csv")
# [::10] keeps every tenth row of the cleaned frame for a shorter display
data_clean = dabl.clean(data)[::10]
data_clean

# you can also provide hints for the data type conversion
data_clean = dabl.clean(data, type_hints={"Period": "continuous"})
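To make the idea of type detection and type hints more concrete, here is a toy heuristic in plain Python. It is not dabl's algorithm: the function names guess_column_type and guess_types, the max_categories threshold, the type labels, and the Status column are all assumptions made up for illustration. It only shows why a numeric column with few distinct values, such as Period, may need an explicit hint.

```python
def _is_number(value):
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def guess_column_type(values, max_categories=3):
    """Toy heuristic (NOT dabl's algorithm): label one column of raw values."""
    # Treat None and empty strings as missing values.
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "empty"
    distinct = set(present)
    if len(distinct) == 1:
        return "constant"
    if all(_is_number(v) for v in present):
        # Numeric columns with few distinct values look like categories.
        return "continuous" if len(distinct) > max_categories else "categorical"
    return "categorical" if len(distinct) <= max_categories else "free_text"


def guess_types(columns, type_hints=None, max_categories=3):
    """Guess a type per column; explicit hints override the heuristic."""
    type_hints = type_hints or {}
    return {
        name: type_hints.get(name, guess_column_type(values, max_categories))
        for name, values in columns.items()
    }


# Hypothetical sample data; "Status" is an invented column.
columns = {
    "Period": ["2019", "2020", "2021", "2019", "2020"],
    "Data_value": ["1.5", "2.25", "", "3.0", "4.75"],
    "Status": ["F", "F", "F", "F", "F"],
}
print(guess_types(columns))
print(guess_types(columns, type_hints={"Period": "continuous"}))
```

Without a hint, the toy heuristic labels Period as categorical because it has only three distinct values. The hint overrides that guess, which is the same role type_hints plays in the dabl.clean call above.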

Model building with dabl

dabl aims to simplify training machine learning models, so that programmers spend less time writing boilerplate and more time actually using the models. Its interface is more straightforward than that of many other machine learning libraries, which are usually more complex. dabl is still a new library and provides only basic machine learning capabilities compared to other libraries. However, its simplicity makes it a good candidate for introducing people with little or no previous machine learning experience to the field.

import pandas as pd
from dabl import SimpleClassifier

# load your own CSV file
df = pd.read_csv("data.csv")

# build the model with SimpleClassifier: pass the data frame and the name of the target column
ec = SimpleClassifier(random_state=0).fit(df, target_col="Series_title_1")

Output:

As you can see, it finished building the model in just a few seconds and with good accuracy. However, dabl currently offers only a small selection of learning algorithms, and SimpleClassifier does not work on every dataset, for example on regression targets (dabl has a separate SimpleRegressor for those). dabl is a relatively new library, so it will take some time to get better.
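Whether an accuracy counts as good depends on how balanced the classes are, so it helps to compare the reported score with the accuracy of always predicting the most frequent class. The sketch below computes that reference value in plain Python. The function name majority_class_accuracy and the sample labels are made up for illustration and are not part of dabl.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    # most_common(1) returns [(label, count)] for the most frequent label.
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


# Hypothetical target column: nine "no" rows and one "yes" row.
labels = ["no"] * 9 + ["yes"]
print(majority_class_accuracy(labels))  # 0.9
```

On such an imbalanced target, a model that reports 90% accuracy is no better than this trivial baseline.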

Limitations of DABL:

The current implementation of dabl does not deal with text data, time-series data, or neural network models. Image, audio, and video data are also out of scope. However, dabl promises to provide these features in the future, along with new ones such as enhanced model building, explainable model building, ready-made visualization, type detection, automatic preprocessing, and many more. If you want more information, such as the full API list and the limitations, you can read the official docs.

Final Words

One of the problems with the current data analysis ecosystem is the lack of standardization. Each package has a different way of doing things, which makes it difficult to get started. dabl attempts to address this problem by providing a familiar set of tools to the data analysis community. With dabl you can easily import, manipulate, and export a wide variety of data. We hope you enjoyed our article about dabl, the Data Analysis Baseline Library. If you have any questions or comments on the project, please let us know by visiting our GitHub repository. Thank you for reading, and we hope you find dabl useful.