A Simple Reading List on Data Management & AI


A simple reading list on AI data management.

Two motherhood statements to rule all meetings.

I’ve sat through many meetings where the same two motherhood statements inevitably come up.

First: “Garbage in, garbage out.”

Every time the topic of data comes up, someone invokes this mantra as if it were the last word on data management for AI. But what does garbage actually look like?

Second: “Data is our competitive edge.”

Usually said with some conviction, as if having more data (even garbage) automatically translates to competitive advantage. But what makes it a competitive edge?

Both statements sound catchy. Perhaps it’s just me, but I usually don’t leave the meeting any the wiser.

Data management wasn’t something I had to think much about during my PhD. I only needed two or three good datasets that fit on my laptop, and I used them to death. Any data quality issues would surface naturally by the 100th model training run. Data lineage was a non-issue. Other issues, such as representativeness, would be picked up by reviewers.

And so when I started digging into the literature, I was pleasantly surprised. There are plenty of good papers on data for AI.

So here’s a reading list that shows there is a lot more to data management than these motherhood statements.

Note: I have used open-access links from arXiv as far as possible.


The Foundations: What “Garbage” and “Edge” Actually Mean

These papers address the fundamentals: the problems that existed before GenAI made everything more complicated, from quality and drift to transparency and ops, as well as what it means for data to be an edge.

1. “Data Collection and Quality Challenges in Deep Learning: A Survey” - Whang et al., The VLDB Journal (2023) This paper provides a systematic taxonomy to support data-centric AI, from collection (acquisition via discovery, augmentation, and generation; labeling via semi-supervised learning, active learning, and weak supervision) to quality assurance (validation, cleaning, and sanitization). It points out that perfect data cleaning is impossible, so robust training also needs to be part of the solution. 📄 arXiv:2112.06409

2. “Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift” - Rabanser, Günnemann & Lipton, NeurIPS (2019) This paper focuses on how dataset shift leads to failures. It proposes a pipeline that combines dimensionality reduction with statistical testing to detect changes. Other interesting points: simple statistical tests can outperform more complex ones; training a separate model to detect what changed in the dataset works; and not all data changes are harmful. 📄 arXiv:1810.11953
3. “Datasheets for Datasets” - Gebru et al., Communications of the ACM (2021) This paper proposes the dataset equivalent of model cards: datasheets. The aim is for every dataset to be accompanied by a datasheet describing its operating characteristics, and the paper outlines a workflow and set of questions covering the entire dataset lifecycle. 📄 arXiv:1803.09010
4. “DMOps: Data Management Operation and Recipes” - Choi & Park, ICML (2023) This paper proposes a standardized 12-step “recipe” covering the full lifecycle: from establishing business goals and securing raw data to schema design, human annotation, and final delivery. It also outlines a verification process that goes beyond internal consistency to include “external factor verification” and “model verification” (using models to detect errors in a human-in-the-loop cycle). 📄 arXiv:2301.01228

5. “Data-centric Artificial Intelligence: A Survey” - Zha et al. (2023), arXiv preprint A fairly useful framework for what the paper calls “Data-Centric AI”, which changes the focus from models as the edge to data as the edge. It organizes the data lifecycle into “Training Data Development” (data collection, labeling, preparation, reduction, augmentation), “Inference Data Development” (creating granular in- and out-of-distribution evaluation sets and engineering prompts), and “Data Maintenance” (ensuring reliability via data understanding, quality assurance). If you read one paper to understand the landscape, this is it. 📄 arXiv:2303.10158
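To make the “Failing Loudly” recipe concrete: the dimensionality-reduction-plus-statistical-testing pipeline can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors’ code: it pairs PCA with per-component two-sample Kolmogorov–Smirnov tests and a Bonferroni correction, and all function and variable names are my own.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA

def detect_shift(source, target, n_components=5, alpha=0.05):
    """Reduce both samples with a PCA fitted on the source, then run a
    two-sample KS test per component with a Bonferroni correction."""
    pca = PCA(n_components=n_components).fit(source)
    s, t = pca.transform(source), pca.transform(target)
    p_values = [ks_2samp(s[:, i], t[:, i]).pvalue for i in range(n_components)]
    # Flag a shift if any component rejects at the corrected level
    return min(p_values) < alpha / n_components, p_values

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(1000, 20))   # "training" data
shifted = rng.normal(0.5, 1.0, size=(1000, 20))  # mean-shifted "production" data

flagged, _ = detect_shift(source, shifted)
print(flagged)  # a large mean shift is reliably flagged → True
```

The appeal of this design is that it is model-agnostic: nothing about it depends on the downstream task, which is why simple tests like this make a sensible first monitoring layer.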

The New Frontier: GenAI and Agentic AI

Everything above still applies. But GenAI and agents introduce problems that didn’t exist before.

1. “A Survey of LLM × DATA: From Data Management to Data Science” - Zhou et al. (2025), arXiv preprint This survey covers the relationship between LLMs and data: how to collect, construct, and use data for language models. It also traces the shift from curated datasets to web-scale unstructured corpora. 📄 arXiv:2505.18458

2. “Scaling Trends for Data Poisoning in LLMs” - Bowen et al., AAAI (2025) Brings attention to a counterintuitive vulnerability: larger LLMs are more susceptible to data poisoning, not less. They learn harmful behaviors from minimal exposure more quickly than smaller models do. 📄 arXiv:2408.02946

3. “Data Quality Challenges in Retrieval-Augmented Generation” - Müller et al., ICIS (2025) Identifies 15 data quality dimensions across four RAG stages, and shows how issues transform and propagate through the pipeline: data that looks clean at one stage can turn into real garbage further down the road. Too many excellent figures to include here. Take a look at the paper! 📄 arXiv:2510.00552
4. “Episodic Memory in AI Agents Poses Risks that Should be Studied and Mitigated” - DeChant, SaTML (2025) Examines how episodic memory in agents (storing and retrieving records of their own actions) introduces new data management challenges: retention policies, privacy risks, unintended learning. Your agent’s “garbage” might be experiences it created itself. 📄 arXiv:2501.11739
5. “What’s the Next Frontier for Data-Centric AI? Data Savvy Agents” - Seedat et al. (2025), arXiv preprint Argues that autonomous agents could do with data-savvy capabilities: proactive acquisition, context-aware processing, test data synthesis, and continual adaptation. 📄 arXiv:2511.01015
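To make “retention policies” for agent memory concrete, here is a minimal, hypothetical sketch of an episodic store in which sensitive episodes expire faster than ordinary ones. Nothing here comes from the papers above; the class, method, and parameter names (`EpisodicMemory`, `default_ttl`, `sensitive_ttl`) are invented for illustration.

```python
import time
from dataclasses import dataclass

@dataclass
class Episode:
    """One record of something the agent did or observed."""
    content: str
    timestamp: float
    sensitive: bool = False  # e.g. contains user PII

class EpisodicMemory:
    """Hypothetical episodic store with a simple TTL retention policy:
    sensitive episodes expire quickly, ordinary ones more slowly."""

    def __init__(self, default_ttl=86_400.0, sensitive_ttl=3_600.0):
        self.default_ttl = default_ttl        # 24 hours, in seconds
        self.sensitive_ttl = sensitive_ttl    # 1 hour, in seconds
        self._episodes: list[Episode] = []

    def remember(self, content, sensitive=False, now=None):
        now = time.time() if now is None else now
        self._episodes.append(Episode(content, now, sensitive))

    def prune(self, now=None):
        """Drop episodes that have outlived their TTL."""
        now = time.time() if now is None else now
        def alive(ep):
            ttl = self.sensitive_ttl if ep.sensitive else self.default_ttl
            return now - ep.timestamp < ttl
        self._episodes = [ep for ep in self._episodes if alive(ep)]

    def recall(self, now=None):
        self.prune(now)
        return [ep.content for ep in self._episodes]

mem = EpisodicMemory()
mem.remember("user asked about flights", now=0.0)
mem.remember("user shared a passport number", sensitive=True, now=0.0)
print(mem.recall(now=7_200.0))  # → ['user asked about flights']
```

Even this toy version surfaces the design questions the paper raises: who classifies an episode as sensitive, and whether pruning should happen on write, on read, or on a schedule.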

“Which kind of garbage? Which kind of edge?”

There’s a lot more to data in AI once you go beyond these motherhood statements.

Next time someone says either one, maybe just ask: what do you mean?

Any must-reads in this area that you would recommend?

#AIDataManagement #DataCentricAI #AIRiskManagement #DataQuality #AIReadingList