Exploring Data for Machine Learning

Imagine embarking on a journey through an expansive ocean of data, where untold stories, patterns, and insights wait to be discovered. Welcome to the world of data exploration in machine learning (ML). In this chapter, I encourage you to put on your analytical lenses as we delve deep into the heart of your data, armed with powerful techniques and heuristics, to uncover its secrets. Beneath the surface of raw numbers and statistics lies a treasure trove of patterns that, once revealed, can transform your data into a valuable asset. The journey begins with exploratory data analysis (EDA), a crucial phase in which we unravel the mysteries of the data, laying the foundation for automated labeling and, ultimately, smarter and more accurate ML models. In this age of generative AI, preparing quality training data is essential to fine-tuning domain-specific large language models (LLMs); fine-tuning involves curating additional domain-specific labeled data for training publicly available LLMs. So, fasten your seatbelt for a captivating voyage into the art and science of data exploration for data labeling.

First, let’s start with the question: What is data exploration? It is the initial phase of data analysis, where raw data is examined, visualized, and summarized to uncover patterns, trends, and insights. It serves as a crucial step in understanding the nature of the data before applying advanced analytics or ML techniques.

In this chapter, we will explore tabular data using various Python libraries and packages, including Pandas, NumPy, and Seaborn. We will also plot bar charts and histograms to visualize the data and uncover relationships between features, which is useful for labeling data. We will explore the Income dataset located in this book’s GitHub repository (linked in the Technical requirements section). A good understanding of the data is necessary to define business rules, identify matching patterns, and, subsequently, label the data using Python labeling functions.
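To give you a feel for this workflow, here is a minimal sketch of loading and visualizing the dataset with Pandas and Seaborn. The file name and the column names (age, education, income) are assumptions for illustration; check the repository for the actual ones:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file name; use the CSV from the book's GitHub repository
df = pd.read_csv('income.csv')

# Inspect the first rows and the column types
print(df.head())
print(df.dtypes)

# Histogram of a numeric feature (assumes an 'age' column)
sns.histplot(data=df, x='age', bins=30)
plt.show()

# Bar chart of a categorical feature split by the target
# (assumes 'education' and 'income' columns)
sns.countplot(data=df, x='education', hue='income')
plt.xticks(rotation=45)
plt.show()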

By the end of this chapter, we will be able to generate summary statistics for a given dataset, derive aggregates of the features for each target group, perform univariate and bivariate analyses of those features, and create a profiling report using the ydata-profiling library.
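As a preview, the following sketch shows how these steps might look. It assumes the same hypothetical file and column names as the previous example, and that ydata-profiling has been installed (pip install ydata-profiling):

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv('income.csv')  # hypothetical file name

# Summary statistics for the numeric columns
print(df.describe())

# Aggregates of selected features for each target group
# (assumes 'income' is the target and 'age' is a feature)
print(df.groupby('income')['age'].agg(['mean', 'median', 'std']))

# Generate an HTML profiling report for the whole dataset
profile = ProfileReport(df, title='Income Dataset Profile')
profile.to_file('income_profile.html')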

We’re going to cover the following main topics:

  • EDA and data labeling
  • Summary statistics and data aggregates with Pandas
  • Data visualization with Seaborn for univariate and bivariate analysis
  • Profiling data using the ydata-profiling library
  • Unlocking insights from data with OpenAI and LangChain

Technical requirements

Before running the notebook in this chapter, you need to install one of the following Python IDEs or software tools:

  • Anaconda Navigator: Download and install the open source Anaconda Navigator from the following URL:

https://docs.anaconda.com/navigator/install/#system-requirements

  • Jupyter Notebook: Download and install Jupyter Notebook:

https://jupyter.org/install

  • We can also use online Python editors such as Google Colab (https://colab.research.google.com/) or Replit (https://replit.com/)

The Python source code and the entire notebook created in this chapter are available in this book’s GitHub repository:

https://github.com/PacktPublishing/Data-Labeling-in-Machine-Learning-with-Python

You also need to create an Azure account and add an OpenAI resource for working with generative AI. To sign up for a free Azure subscription, visit https://azure.microsoft.com/free. To request access to the Azure OpenAI service, visit https://aka.ms/oaiapply.

Once you have provisioned the Azure OpenAI service, deploy an LLM (either GPT-3.5-Turbo or GPT-4) from Azure OpenAI Studio. Then copy the keys and endpoint for your OpenAI resource from Azure OpenAI Studio and set the following environment variables:
os.environ['AZURE_OPENAI_KEY'] = 'your_api_key'
os.environ['AZURE_OPENAI_ENDPOINT'] = 'your_azure_openai_endpoint'

Your endpoint should look like this: https://YOUR_RESOURCE_NAME.openai.azure.com/.
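To verify the setup, here is a minimal sketch using the openai Python package (version 1.x); the deployment name and api_version are placeholders, so substitute the values from your own Azure OpenAI deployment:

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ['AZURE_OPENAI_KEY'],
    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT'],
    api_version='2024-02-01',  # assumption: use the version matching your deployment
)

# 'gpt-35-turbo' is a hypothetical deployment name; use the one you
# chose when deploying the model in Azure OpenAI Studio
response = client.chat.completions.create(
    model='gpt-35-turbo',
    messages=[{'role': 'user', 'content': 'Say hello'}],
)
print(response.choices[0].message.content)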
