mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-02 18:50:17 -05:00
Created subfolders within images/ based on filetype
Better organization for the future, e.g., to build a PDF, since images need to be pulled from the right filetype for quality rendering. The subfolders are not used yet, but they will be useful later, and the reorganization doesn't hurt; it only makes the "code" cleaner.
@@ -1,6 +1,6 @@
# Data Engineering
Data is the lifeblood of AI systems. Without good data, even the most advanced machine learning algorithms will fail. In this section, we will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering encompasses the processes of collecting, storing, processing, and managing data for training machine learning models.
@@ -38,7 +38,7 @@ We begin by discussing data collection: Where do we source data, and how do we g
In many domains of machine learning, while sophisticated algorithms take center stage, the fundamental importance of data quality is often overlooked. This neglect gives rise to [“Data Cascades”](https://research.google/pubs/pub49953/) (see @fig-cascades) — events where lapses in data quality compound, leading to negative downstream consequences such as flawed predictions, project terminations, and even potential harm to communities.
{#fig-cascades}
Although many ML professionals recognize the importance of data, numerous practitioners still report encountering these cascades. This highlights a systemic issue: while the allure of developing advanced models persists, data work remains underappreciated.
@@ -46,7 +46,7 @@ Take, for example, Keyword Spotting (KWS) (see @fig-keywords). KWS serves as a p
It is important to appreciate that these keyword spotting technologies are not isolated; they integrate seamlessly into larger systems, processing signals continuously while managing low power consumption. These systems extend beyond simple keyword recognition, evolving to facilitate diverse sound detections, such as the breaking of glass. This evolution is geared towards creating intelligent devices capable of understanding and responding to a myriad of vocal commands, heralding a future where even household appliances can be controlled through voice interactions.
{#fig-keywords}
Building a reliable KWS model is not a straightforward task. It demands a deep understanding of the deployment scenario, encompassing where and how these devices will operate. For instance, a KWS model's effectiveness is not just about recognizing a word; it's about discerning it among various accents and background noises, whether in a bustling cafe or amid the blaring sound of a television in a living room or a kitchen where these devices are commonly found. It's about ensuring that a whispered "Alexa" in the dead of night or a shouted "Ok Google" in a noisy marketplace are both recognized with equal precision.
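One common way to prepare a KWS model for such varied acoustic conditions is to augment training clips with background noise at a controlled signal-to-noise ratio. Below is a minimal sketch of that idea in NumPy; the function name and interface are illustrative, not from any particular KWS toolkit.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix background noise into a speech clip at a target signal-to-noise ratio."""
    # Tile or trim the noise recording to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent noise clips
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Augmenting each keyword utterance with cafe, television, and street noise at several SNRs is a cheap way to expose the model to the deployment conditions described above.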
@@ -125,7 +125,7 @@ While platforms like Kaggle and UCI Machine Learning Repository are invaluable r
In addition, these datasets may suffer from bias, validity, and reproducibility issues, and awareness of these problems has grown in recent years. Furthermore, using the same dataset to train different models can create misalignment, where the models do not accurately reflect the real world (see @fig-misalignment).
{#fig-misalignment}
### Web Scraping
@@ -149,7 +149,7 @@ While web scraping can be a scalable method to amass large training datasets for
Web scraping can yield inconsistent or inaccurate data. For example, the photo in @fig-traffic-light shows up when you search 'traffic light' on Google images. It is an image from 1914 that shows outdated traffic lights, which are also barely discernible because of the image's poor quality.
![Traffic lights in 1914 (Google Images)](images/jpg/1914_traffic.jpeg){#fig-traffic-light}
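Simple quality filters can catch many such cases before they pollute a training set. The sketch below drops scraped images that are too low-resolution or too old; the record fields (`url`, `width`, `height`, `year`) and thresholds are illustrative assumptions, not from any particular scraping library.

```python
def filter_scraped_images(records, min_width=224, min_height=224, min_year=1990):
    """Drop scraped image records that are too small or too old for training.

    Each record is a dict like {"url": ..., "width": ..., "height": ..., "year": ...};
    these field names are hypothetical.
    """
    kept = []
    for r in records:
        if r.get("width", 0) < min_width or r.get("height", 0) < min_height:
            continue  # too low-resolution to be useful
        year = r.get("year")
        if year is not None and year < min_year:
            continue  # likely outdated content, like the 1914 traffic light
        kept.append(r)
    return kept
```

In practice such metadata filters would be combined with content-based checks (e.g., a classifier scoring whether the image actually depicts a modern traffic light).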
### Crowdsourcing
@@ -185,7 +185,7 @@ Many embedded use-cases deal with unique situations, such as manufacturing plant
While synthetic data offers numerous advantages, it is essential to use it judiciously. Care must be taken to ensure that the generated data accurately represents the underlying real-world distributions and does not introduce unintended biases.
![Synthetic data](images/jpg/synthetic_data.jpg){#fig-synthetic-data}
## Data Storage
@@ -224,7 +224,7 @@ Data governance utilizes three integrative approaches: planning and control, org
* **The risk-based approach**, intensified by AI advancements, focuses on identifying and managing the inherent risks in data and algorithms. It addresses AI-specific issues through regular assessments and proactive risk-management strategies, allowing both incidental and preventive actions to mitigate undesired algorithmic impacts.
![Data governance](images/jpg/data_governance.jpg){#fig-governance}
Some examples of data governance across different sectors include:
@@ -254,7 +254,7 @@ This format enables decoding the features frame-by-frame for keyword matching. S
Data processing refers to the steps involved in transforming raw data into a format suitable for feeding into machine learning algorithms. It is a crucial stage in any ML workflow, yet it is often overlooked; without proper data processing, ML models are unlikely to achieve optimal performance. Data preparation is commonly estimated to account for 60-80% of a data scientist's work. @fig-data-engineering shows a breakdown of a data scientist's time allocation, highlighting the significant portion spent on data cleaning and organizing.
{#fig-data-engineering}
Proper data cleaning is a crucial step that directly impacts model performance. Real-world data is often dirty: it contains errors, missing values, noise, anomalies, and inconsistencies. Data cleaning involves detecting and fixing these issues to prepare high-quality data for modeling. By carefully selecting appropriate techniques, data scientists can improve model accuracy, reduce overfitting, and enable algorithms to learn more robust patterns. Overall, thoughtful data processing allows machine learning systems to better uncover insights and make predictions from real-world data.
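As a concrete illustration, a minimal cleaning pass over tabular sensor data might deduplicate rows, impute missing values, and clip extreme outliers. The sketch below uses pandas; the percentile thresholds are illustrative assumptions and would need tuning per dataset.

```python
import pandas as pd

def clean_sensor_data(df: pd.DataFrame) -> pd.DataFrame:
    """A minimal cleaning pass: drop duplicates, fill missing values, clip outliers."""
    df = df.drop_duplicates()
    for col in df.select_dtypes("number").columns:
        # Fill missing numeric readings with the column median (robust to outliers).
        df[col] = df[col].fillna(df[col].median())
        # Clip extreme values to the 1st/99th percentiles.
        lo, hi = df[col].quantile([0.01, 0.99])
        df[col] = df[col].clip(lo, hi)
    return df
```

Each of these choices (median vs. mean imputation, clipping vs. dropping outliers) trades off information loss against robustness, and should be validated against held-out data.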
@@ -266,7 +266,7 @@ Data often comes from diverse sources and can be unstructured or semi-structured
Data validation serves a broader role than simply enforcing standards, such as preventing temperature values from falling below absolute zero. These issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings; such transients are not uncommon. It is therefore imperative to catch data errors early, before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation, contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.
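The absolute-zero check and mean imputation mentioned above can be sketched in a few lines; the function below is a simplified illustration, not a production validation framework.

```python
ABSOLUTE_ZERO_C = -273.15

def validate_temperatures(readings):
    """Flag physically impossible readings, then impute them with the mean of the valid ones.

    Illustrates catching sensor transients early, before they propagate downstream.
    """
    valid = [r for r in readings if r is not None and r > ABSOLUTE_ZERO_C]
    if not valid:
        raise ValueError("no valid readings to impute from")
    mean = sum(valid) / len(valid)
    return [r if (r is not None and r > ABSOLUTE_ZERO_C) else mean
            for r in readings]
```

A real pipeline would also log which readings were imputed, so downstream consumers can audit the substitutions.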
{#fig-data-engineering-kws2}
Let’s take a look at an example of a data processing pipeline (see @fig-data-engineering-kws2). In the context of TinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of a data processing pipeline: a systematic and automated workflow for data transformation, storage, and processing. By streamlining the data flow from raw data to usable datasets, data pipelines enhance productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and growing collection of audio recordings of spoken words in 50 different languages, which are collectively spoken by over 5 billion people. The dataset is intended for academic study and business uses in areas like keyword identification and speech-based search, and it is openly licensed under Creative Commons Attribution 4.0 for broad usage.
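At its core, such a pipeline is an ordered sequence of transformation stages that each item flows through. The toy sketch below models that structure with plain Python functions; it is a conceptual illustration, not MSWC's actual implementation.

```python
def run_pipeline(raw_items, stages):
    """Run each item through an ordered list of transformation stages.

    Each stage is a function; items rejected by a stage (stage returns None)
    are dropped, mirroring quality filters in a real data pipeline.
    """
    processed = []
    for item in raw_items:
        for stage in stages:
            item = stage(item)
            if item is None:
                break  # item rejected; skip remaining stages
        else:
            processed.append(item)
    return processed
```

For an audio corpus, the stages might be decoding, resampling, silence trimming, and feature extraction; here we can exercise the skeleton with simple string transcripts.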
@@ -284,7 +284,7 @@ Data labeling is an important step in creating high-quality training datasets fo
Labels capture information about key tasks or concepts. Common label types (see @fig-labels) include binary classification, bounding boxes, segmentation masks, transcripts, captions, etc. The choice of label format depends on the use case and resource constraints, as more detailed labels require greater effort to collect (@Johnson-Roberson_Barto_Mehta_Sridhar_Rosaen_Vasudevan_2017).
{#fig-labels}
Unless focused on self-supervised learning, a dataset will likely provide labels addressing one or more tasks of interest. Dataset creators must consider what information labels should capture and how they can practically obtain the necessary labels given their unique resource constraints. Creators must first decide what type(s) of content labels should capture. For example, a creator interested in car detection would want to label cars in their dataset, but they might also consider whether to simultaneously collect labels for other tasks the dataset could be used for in the future, such as pedestrian detection.
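To make the label formats concrete, here are hypothetical label records for three of the types mentioned above; the field names and schema are illustrative, not from any specific dataset standard.

```python
# Classification: a single label per image.
binary_label = {"image": "img_001.jpg", "label": "car"}

# Object detection: one bounding box per object, as (x, y, width, height) in pixels.
bbox_label = {
    "image": "img_001.jpg",
    "objects": [
        {"class": "car", "bbox": (48, 130, 220, 110)},
        {"class": "pedestrian", "bbox": (300, 90, 40, 120)},
    ],
}

# Speech: a transcript per audio clip, as used in keyword spotting corpora.
transcript_label = {"audio": "clip_007.wav", "transcript": "ok google"}
```

Note how the detection record already carries both car and pedestrian annotations, reflecting the multi-task consideration above, at the cost of more annotation effort per image.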
@@ -345,7 +345,7 @@ and therefore enabling reproducibility.
With data version control in place, we can track changes as shown in @fig-data-version-ctrl, reproduce previous results by reverting to older versions, and collaborate safely by branching off and isolating changes.
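The core mechanism behind most data version control tools is content hashing: a dataset's version id is derived from its files' contents, so any change produces a new id. The sketch below is a simplified illustration of that idea (tools like DVC do this over the filesystem and much more); the `{name: bytes}` interface is an assumption for clarity.

```python
import hashlib

def dataset_version(files: dict) -> str:
    """Compute a version id from file names and contents ({name: bytes}).

    If any file's name or content changes, the id changes, so older dataset
    states can be identified exactly and reverted to.
    """
    digest = hashlib.sha256()
    for name in sorted(files):  # sort so the id is independent of insertion order
        digest.update(name.encode())
        digest.update(files[name])
    return digest.hexdigest()
```

Storing this id alongside each experiment's results is what makes "which data trained this model?" answerable later.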
{#fig-data-version-ctrl}
**Popular Data Version Control Systems**
@@ -374,7 +374,7 @@ By providing clear, detailed documentation, creators can help developers underst
@fig-data-card shows an example of a data card for a computer vision (CV) dataset. It includes basic information about the dataset and instructions on how the dataset should and should not be used, including known biases.
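A minimal data card can be represented as structured metadata shipped with the dataset. The dictionary below follows the spirit of such cards, but the dataset name and exact schema are illustrative assumptions, not a formal standard.

```python
# A hypothetical data card as a plain dictionary; every field value is illustrative.
data_card = {
    "name": "example-street-scenes",
    "description": "Dashcam images collected for object detection research.",
    "intended_uses": ["car detection", "pedestrian detection"],
    "out_of_scope_uses": ["surveillance", "identity recognition"],
    "known_biases": ["collected mostly in daytime, urban locations"],
    "license": "CC BY 4.0",
}
```

Keeping the card machine-readable (e.g., serialized as JSON or YAML next to the data) lets tooling surface the known biases and out-of-scope uses automatically.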
{#fig-data-card}
Keeping track of data provenance—essentially the origins and the journey of each data point through the data pipeline—is not merely a good practice but an essential requirement for data quality. Data provenance contributes significantly to the transparency of machine learning systems. Transparent systems make it easier to scrutinize data points, enabling better identification and rectification of errors, biases, or inconsistencies. For instance, if a ML model trained on medical data is underperforming in particular areas, tracing back the data provenance can help identify whether the issue is with the data collection methods, the demographic groups represented in the data, or other factors. This level of transparency doesn’t just help in debugging the system but also plays a crucial role in enhancing the overall data quality. By improving the reliability and credibility of the dataset, data provenance also enhances the model’s performance and its acceptability among end-users.
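In its simplest form, tracking provenance means appending one record per processing step for every data point. The sketch below illustrates the idea; the field names and in-memory log are a simplified assumption (a real system would persist this in a database or alongside the dataset).

```python
import datetime

def record_provenance(sample_id, source, transform, log):
    """Append one provenance entry per processing step for a data point.

    With such a log, a poorly performing subset of the data can be traced
    back to its collection source and the transformations it went through.
    """
    log.setdefault(sample_id, []).append({
        "source": source,          # e.g., which sensor, site, or annotator
        "transform": transform,    # e.g., which pipeline stage was applied
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return log
```

Querying the log by source or transform is then enough to answer questions like "did all the mislabeled clips come from the same collection site?".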