Structured Data: Turning isolated files into reusable knowledge
Structured data often determines whether research data stays usable or degrades into noise over time. In this post I am (somewhat officially, at least in spirit) declaring July Structured Data Month, because this topic deserves far more attention than it usually gets in everyday practice.
When well-structured files are still unusable
Imagine a measurement file containing a matrix of numbers. The file itself may be neatly arranged, but it stands alone. The row and column information is somewhere else, the experimental groups are not linked, experimental design and batch information are missing, the processing steps are unclear, and nobody immediately knows where the raw data is stored or whom to contact.
This is an important distinction: structural formatting is not the same as structural meaning. A matrix of numbers without metadata (that is, without clear information on what the rows represent, what the columns measure, and what the experimental conditions were) can easily become dark data: data that exists, but is difficult or impossible to find, understand, access, or reuse. A file can be neatly formatted and still be scientifically ambiguous. For data to become truly reusable, the relevant information needs to be structured together, and people need to know that the dataset exists. This is what structured data actually means in a modern, data-driven organization.
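To make the distinction concrete, here is a minimal Python sketch of a matrix that travels together with its context. The class name `AnnotatedMatrix`, its fields, and the example identifiers are all hypothetical and chosen only for illustration. A real project would more likely rely on an established metadata standard or data format.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedMatrix:
    """A numeric matrix bundled with the context needed to interpret it."""
    values: list[list[float]]
    row_ids: list[str]             # what each row represents (hypothetical IDs)
    column_ids: list[str]          # what each column measures (hypothetical IDs)
    sample_groups: dict[str, str]  # column ID -> experimental group
    raw_data_location: str = ""
    contact: str = ""

    def problems(self) -> list[str]:
        """Return human-readable gaps; an empty list means none were found."""
        issues = []
        if len(self.row_ids) != len(self.values):
            issues.append("row_ids do not match the number of rows")
        if any(len(row) != len(self.column_ids) for row in self.values):
            issues.append("column_ids do not match the number of columns")
        missing = [c for c in self.column_ids if c not in self.sample_groups]
        if missing:
            issues.append(f"no experimental group for: {', '.join(missing)}")
        if not self.raw_data_location:
            issues.append("raw data location is missing")
        if not self.contact:
            issues.append("contact person is missing")
        return issues


# A neatly formatted matrix that is still missing most of its context.
bare = AnnotatedMatrix(
    values=[[1.2, 3.4], [5.6, 7.8]],
    row_ids=["gene_a", "gene_b"],
    column_ids=["s1", "s2"],
    sample_groups={"s1": "control"},
)
for issue in bare.problems():
    print(issue)
```

The numbers in `bare` are perfectly well formatted, yet the check still reports three gaps. That is the difference between structural formatting and structural meaning.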
From files to knowledge: what structured data actually means
Structured data is important because it turns isolated files into reusable knowledge. If data is well described, findable, accessible, and connected to its context, we can ask better questions. We can compare experiments, integrate datasets, detect patterns, reproduce analyses, and generate new hypotheses. This is also the logic behind the FAIR principles, which argue that scientific data should be Findable, Accessible, Interoperable, and Reusable, with a particular emphasis on making data usable not only by people, but also by machines.
The organizational reality: everyone is cooking their own soup
In companies, including pharma and enterprise R&D, this is especially relevant. Data is often generated in different departments, for different projects, with different tools, naming conventions, storage locations, and access rules. In addition, there are often limited incentives to make data reusable for people outside the team that originally generated it. Each team may be “cooking its own soup”. A team may build a spreadsheet that works perfectly for its weekly meeting. Locally, this can be efficient. But if the spreadsheet lacks standardized schemas, shared identifiers, or centralized metadata, it may be completely invisible and useless to a data scientist three doors down who is trying to train a predictive model. This is the classic trap of local optimization and global sub-optimization: something works well for one team, but limits value across the organization.
If nobody knows that a dataset exists, where it is stored, what it contains, or whom to contact, then the data cannot easily contribute to new analyses, new decisions, or new discoveries.
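As a rough illustration of what centralized metadata could look like, the sketch below keeps one catalog record per dataset. Each record answers the four questions above: that the dataset exists, where it is stored, what it contains, and whom to contact. The field names, storage paths, and addresses are invented for this example and do not refer to any real system.

```python
# A hypothetical in-memory catalog; in practice this would live in a shared
# service or database rather than in a Python list.
CATALOG = [
    {
        "dataset_id": "DS-0001",
        "title": "Compound solubility screen",
        "location": "s3://example-bucket/solubility/",
        "description": "Solubility measurements for an example compound library",
        "contact": "assay-team@example.org",
        "keywords": ["solubility", "assay", "compound"],
    },
    {
        "dataset_id": "DS-0002",
        "title": "Cell viability time course",
        "location": "s3://example-bucket/viability/",
        "description": "Viability readouts across example treatment groups",
        "contact": "cell-bio@example.org",
        "keywords": ["viability", "assay", "time course"],
    },
]


def find_datasets(catalog: list[dict], keyword: str) -> list[dict]:
    """Return every catalog entry tagged with the given keyword (case-insensitive)."""
    wanted = keyword.lower()
    return [
        entry for entry in catalog
        if wanted in [k.lower() for k in entry["keywords"]]
    ]


for entry in find_datasets(CATALOG, "Solubility"):
    print(entry["dataset_id"], entry["location"], entry["contact"])
```

Even a lookup this simple lets someone outside the original team discover a dataset and its owner, which a spreadsheet on a team drive does not.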
Why this matters for AI and machine learning
This also creates a direct bridge to AI and machine learning. AI systems do not become useful simply because the model is large. They become useful when the data they learn from, retrieve from, or reason over is high quality, well documented, and fit for purpose. In machine learning, data problems can cascade through the whole system, causing errors that are difficult to detect later (Sambasivan 2021). The “data work” may be less glamorous than model building, but it is often the critical step.
For example, if an AI assistant uses Retrieval-Augmented Generation (RAG) to help scientists find past experimental insights, it cannot reason reliably over an isolated spreadsheet. It needs the metadata envelope around the data: sample metadata, definitions of variables, links to protocols, information about data quality, and ideally a documented history of how the data was processed. Whether we train or fine-tune a model or build a retrieval system around it, duplicated, inconsistent, biased, mislabeled, or poorly documented data can directly affect performance and reliability. Work on language model training data has shown that deduplication can reduce memorized output and improve model behavior, which is a practical reminder that data preparation is not an administrative detail, but part of model quality itself (Lee 2022).
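To show what even basic deduplication involves, here is a small Python sketch that drops records whose text is identical after lowercasing and collapsing whitespace. This is a deliberately simple exact-match approach and not the method used in the cited work. The record layout and the example notes are invented for illustration.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase the text and collapse all runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record for each distinct normalized text."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(normalize(record["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


# Hypothetical lab notes; note-2 differs from note-1 only in case and spacing.
records = [
    {"id": "note-1", "text": "Compound X precipitated at pH 7."},
    {"id": "note-2", "text": "compound x   precipitated at pH 7."},
    {"id": "note-3", "text": "Compound Y remained soluble at pH 7."},
]
print([r["id"] for r in deduplicate(records)])  # ['note-1', 'note-3']
```

Exact matching like this misses near-duplicates such as paraphrased notes, so it is best read as a starting point for cleaning a retrieval or training corpus.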
Conclusion
Structured data is not an implementation detail but a prerequisite for reuse. Without consistent structure and metadata, data remains tied to its original context and loses value over time. Declaring July as Structured Data Month is a small reminder that improving how we structure and describe data is not optional overhead, but a practical step toward making research and data work more interoperable and sustainable.