What makes Data Data?

  • Writer: Ismael K.G.
  • Aug 21, 2020
  • 11 min read

Updated: Aug 28, 2020


An abstract painting, different tones of blue emerge from the bottom-right, mixing into green on the left, where yellow draws a fine line between the blue and a red top-half of the image
Image by Raheel Shakeel from Pixabay

Data is a term that is thrown about a lot these days. Whether in the context of Data Science, Big Data and AI, or data-driven business solutions, data-backed policy-making and so on, data is always there, lurking, ready to pounce and make an assertion seem more insightful. This “insight,” I propose, is fundamental to data. In other words, the nature of data is its having meaning to people. There are no data out there in the world, unperturbed by human interaction, so to speak. The data I am speaking of is a data that necessitates such human interaction and realises its nature by becoming widespread and understood.


This post is structured as follows. First, I briefly speak of the etymology and grammar of the word data. I then briefly explain why data needn’t conjure up ontological complications. After this, I explore four factors which I suggest grant data its insightful, meaningful or informative nature. These are interpretation, materialisation, context and metadata. Underlying these discussions is the assumption that data is meaningful by nature; or, rather, what there is in the world becomes meaningful through data.


On Etymology and Grammar


Data finds its roots in Latin, where it is the plural of datum, meaning "gift." Datum, in turn, is also the past participle of the Latin verb dare ("to give"), in which sense datum means "given." Linking back to the noun "gift," datum means "that which is given." It is rather romantic to think of data as "that which is given," as if immutable and unquestionable.

Whilst it is etymologically clear that data is the plural form of datum (e.g.: "having more than one datum means having data" or "the data are in different formats"), it is often used as an abstract noun in singular (e.g.: "data is important for decision-making"). In neither case (plural nor singular) is data countable in English (e.g., it would be odd to say "we have collected five data," although it may be technically correct in terms of having collected one datum, another datum, another and two more). The plural-data/singular-data may actually get some people's blood boiling (see this fun comic strip and all the many comments). For clarity, though, I will speak of data points as describing the individual data referred to by plural-data; whereas dataset shall refer to the conjunction of data points referred to by singular-data.


Metaphysical Data


So far, we have simply noted the ambiguity that might arise for newspaper and journal editors, who see “data” written in plural and singular, and even used to refer to different things. A more fundamental question is what these things are – what is the stuff that data points refer to? One may take an approach to the world that assumes this to be objective and true and for it to contain some ultimate fact of the matter, some true and immutable data. However, this definition of data – as referring to what there is in the world – sidesteps what we may ordinarily call data points and datasets because these are clearly developed through human interaction with the world and for the purpose of their analysis.


Where data is taken as some signifier for the ontological foundations of reality, I suggest we speak of brute data. Brute data, in existing separate from us, is not data in the meaningful sense I grant it. As an example of brute data, imagine an alien civilisation on a planet billions of light years away. This civilisation and its peoples exist and therefore constitute some form of brute data; but we do not know of them, so they are of no use to our epistemic endeavours. Data, by this analysis, carries epistemic import: it is meaningful insofar as it helps humanity better understand the world around us. This last sentence constitutes the most important assumption of this post.


What makes Data Data?


Fixing data’s definition as playing a role in our understanding of the universe grants the concept a function and clarifies what it is for it to be meaningful: data points are meaningful because they convey knowledge or serve as building blocks for the development of knowledge. But what might make a data point or dataset meaningful? I will suggest four factors can respond to this question: interpretation, materialisation, context and metadata. I also suggest these factors be considered as individually sufficient conditions for data to be (meaningful, insightful, informative) data, but not necessary. This allows for other factors to be identified beyond this post and clarifies that this list does not intend to be exhaustive.


Interpreting Data


I would like to begin this section by considering a fictitious survey designed for the purpose of the Definition-Obsession Amongst UK Postgraduates research project. This survey is encountered by my totally fictitious therapist, who suggests we run through the questionnaire as it is relevant to me. The responses will also be recorded and forwarded to the researcher conducting the study. This results in 20 minutes of my half-hour slot with my therapist being spent on the survey, but they are compassionate and adjust their tone as we go through the questionnaire, pausing for me to share more, offering water for me to regain my composure and so on (definitions are a very sensitive subject for me). We make it through the final question, the therapist turns the recording device off and jots down some notes in their usual notebook. We spend the next five minutes discussing my emotions at that time and the therapist offers some actionable advice in the final five minutes. A helpful visit as always! But what was I conveying when answering the survey questions?


When I speak [...], I convey information

When I speak with my (totally fictitious) therapist, I convey information (I like to think I generally do this when I communicate). I do not consider this to be data, but my therapist finds my explaining of my emotions to be of use. This is true insofar as it helps them adapt to the circumstances and guide me through them (my fictitious therapist is great, by the way). But they also take notes and revisit past sessions to provide better informed advice. Focusing on one session in particular where I remember fictitiously describing a fictitious childhood trauma and fictitiously repressing my tears as I fictitiously dug my clenched fist into the side of the fictitious blue chaise longue, my head pushing back against the somewhat uncomfortable frame, despite its luxuriously soft velvety fabric... I remember my fictitious therapist being moved by my tale. It took courage for me to get it off my chest, but also patience on their part to listen, understand and empathise. The information I was conveying was meaningful in that very moment. No doubt it would be brought up again after they scribbled notes about it, but it was information that was useful – it was data that fed immediately into my therapist's reactions, questions and advice.


The information was not useful in itself – it is not informative if I speak with the walls – but it becomes useful once interpreted and processed by my fictitious therapist. But this meaningfulness may also reside in the eye of the beholder. My therapist finds my cries more insightful than, say, the stranger who walks past me as I cry in the street. Once again, there is a function that makes data what it is: meaning. To this effect (and relating with the metaphysical discussion above), data is human-centric. It takes a human's work not only to retrieve data points from the world, but also to confer meaning on them through interpretation. People make data.


Materialising Data


Returning to my fictitious sessions with my very fictitious therapist. I mentioned that they take notes in a notebook during our encounters. Recall that there was even a recording device for the time we conducted the Definition-Obsession Amongst UK Postgraduates survey. The purpose of the note-taking is for my therapist to recall past encounters and offer more robust advice as we continue our sessions. The purpose of the recording device was for my responses to the survey to feed into the wider project. In this case, the information I convey is not immediately important to the researcher at hand, but only becomes meaningful thanks to its being recorded.


This is the second factor that may help data realise its "meaningful nature." There is a larger issue at play here, which we might see more easily in the recording of my utterances than in, say, the counting of iris flowers. Counting irises will also constitute data for the right person (such as a perfume maker), but we can always go out to count irises – we could even have a graph with historical numbers of irises in a given field. My utterances with my therapist, however, cannot constitute useful information to anybody outside the room – partly because of doctor-patient confidentiality, partly because my therapist's notebook never leaves that room. Any conversation in the street or at a café doesn't constitute data either. Utterances are ephemeral, short-lived, momentary vocal realisations of our thoughts. But our speech acts do not constitute data per se. As dependent on language as society may be, utterances generally only come to form a part of the collective mind if crystallised or given some tangible form, such as text or a voice recording. The meaningfulness of materialised data no longer depends on only one person's interpretation. Utterances are limited in who can process them – materialised data points are accessible to more people. Consider my utterances in the case of the researcher who receives the recording of my responses to their questionnaire – these are only of use given that the researcher can analyse them and listen to them at different times. There is one more thing the researcher needs to see the value in the data my therapist provides, though: context.


Contextualising Data


I am here referring to context in two senses: the context of a data point within a dataset, and the context of a data point as it interrelates with brute data and other datasets. The first sense of context is what the Definition-Obsession Amongst UK Postgraduates researcher is particularly interested in when analysing my responses to their survey. To this effect, my utterances are of little use to somebody who wants to understand Definition-Obsession Amongst UK Postgraduates because I am just one person – I do not provide a meaningful enough sample for there to be much use in a research project that wishes to understand Definition-Obsession Amongst UK Postgraduates. It is the aggregate analysis of all the responses the researcher receives that will provide useful insights.


For context in the second sense – relating with data points beyond the dataset in question – we can return to my therapist. As described earlier on, my therapist collates the information I provide in a little notebook. The resulting notes, which go across sessions, allow my therapist to revise what we have previously discussed and to uncover unclear passages that require more information for proper scrutiny. What my therapist wants is more context. Indeed, if they ask "so how do you feel today?" and I respond "okay," they will wait for me to elaborate (it really bothers them when I do that, fictitiously). Thanks to their training and experience, they can also identify things I choose to ignore in my responses, or things I do not deem relevant but which are. In this way, my responses become more meaningful even to myself when given context. My speaking quietly, it might turn out, is because of a lack of self-confidence, which is apparent because of so many other aspects of my life (another very fictitious example), and so on.


Metadata for Data


We have so far seen that interpretation can make a dataset meaningful on its own, that the materialisation of data points can make them insightful to others, and that context can help data points become self-realised either by relating them to other data points within their dataset or other data points beyond what was initially collected. But what is this metadata thing I mentioned? Well, imagine something terrible happens – imagine I steal my therapist's notebook.* I am aware there are notes in it about other patients and I want to read about them. I carry the notebook under my jacket secretively (as if the people in the street would know I'd stolen it!). I get home and rush to my room, where I open the notebook and start reading what I find to be the most illegible handwriting I have ever seen! No matter, I can at least read the dates, which are always in the top right-hand corner, and which allow me to find a date I know I'd visited on. Under this date, I see there had been three sessions: one with KGI, one with IK and one with IKG – surely that last one is me! Ismael Kherroubi Garcia! I begin to read: "DOA: acdbc..." WHAT! I continue below: "distr w/ blg." WHAT!


Here's the thing: my therapist has been doing this for a long time. They always amaze me with their memory, and, as it turns out, they use the most unintelligible combination of abbreviations and acronyms. So, what my therapist writes amounts to informative data points for them but absolute gibberish to me. Why? Because only they have access to the tools needed to decipher the code, which lives in their mind: metadata.


Metadata is information about a dataset and its data points. It describes what some materialised data points refer to and what units or formats their values are in. Metadata are not what a researcher (or my therapist) is interested in, but they are necessary to make sense of the data being analysed. For example, the recording of the temperature of a fridge throughout the day might be in Fahrenheit – this fact would be metadata. Other metadata might relate to the method of temperature-recording, such as the fact that temperature is recorded to the second decimal place, in twenty-minute intervals and by a thermometer of such and such technical specifications. Ultimately, without these metadata, it would be difficult (if not impossible) to explain the dataset and render it meaningful.
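
For those who like to see ideas in code, here is a minimal Python sketch of the fridge example. Everything in it is an assumption made for illustration: the readings are invented, and the names `Metadata`, `describe` and `to_celsius` are hypothetical rather than taken from any library. The point is only that the bare list of numbers says nothing until the metadata sits next to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    """Hypothetical description of how a list of readings was produced."""
    quantity: str          # what the data points refer to
    unit: str              # the unit the values are expressed in
    interval_minutes: int  # time between consecutive readings
    decimal_places: int    # precision of each recorded value


# Invented readings: on their own, these are just four numbers.
readings = [37.40, 38.12, 39.05, 37.76]

# The metadata from the fridge example: Fahrenheit, every twenty
# minutes, recorded to the second decimal place.
fridge_metadata = Metadata(
    quantity="fridge temperature",
    unit="fahrenheit",
    interval_minutes=20,
    decimal_places=2,
)


def describe(values, meta):
    """Render the dataset in words, which is only possible given metadata."""
    elapsed = (len(values) - 1) * meta.interval_minutes
    return (
        f"{len(values)} {meta.quantity} readings in {meta.unit}, "
        f"spanning {elapsed} minutes"
    )


def to_celsius(values, meta):
    """Convert readings to Celsius, refusing to guess the original unit."""
    if meta.unit != "fahrenheit":
        raise ValueError(f"cannot convert from unit: {meta.unit}")
    return [round((v - 32) * 5 / 9, meta.decimal_places) for v in values]


print(describe(readings, fridge_metadata))
print(to_celsius(readings, fridge_metadata))
```

Note that `to_celsius` refuses to run without knowing the unit: the conversion, like any interpretation of the readings, depends on the metadata and not on the numbers alone.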


Concluding


We have seen that four factors might help data become meaningful: interpretation, materialisation, context and metadata. Each of these may be subject to further scrutiny and I welcome your comments below. On interpretation, it is possible that my argument renders any distinction between data and information inscrutable, and this may be so, as it seems that the purpose of data – in becoming insightful – is to convey information. Materialisation also raises questions insofar as the "tangible" form that data points take may come in many shapes, which may relate differently with their surroundings – further inquiry is needed into what these forms and relations look like. This links with a question one may raise about all four factors: are they actually distinct? Do metadata not provide context, and does one’s interpretation not also grant meaning through one’s personal circumstances – one’s own context? Might metadata not be necessary for interpretation if we take it to refer to the conceptual schemes and language an individual uses for interpretation? And is materialisation sufficient to make some dataset intelligible, or does it simply render information not unintelligible so that it may wait in the wings for its eventual realisation?


Three wider philosophical questions seem to be at play here too. Firstly, the “meaningfulness” of data, whilst an apparently essential part of the concept, remains ill-defined. Secondly, the purely epistemic and seemingly non-ontological sense in which data is described draws a line precisely where there may not be one, especially after mentioning brute data. Thirdly, whatever informative, meaningful and insightful mean, it seems to be that, given this essential part of data, the more people informed by some data points or datasets, the better data realises its core function. This is a curious conclusion regarding the ontology of data.


The above two paragraphs describe a few limitations of this analysis to hopefully incite further discussion around what data means. As we come to grips with the explosion of information available to us as private citizens, as well as to corporations and governments around the world, and as we face a technical skills gap in an increasingly technologically-driven workplace, conceptual clarity is increasingly necessary so that we can extract useful meaning from data.


*Just to clarify, I did return the notebook to my therapist. We had a long discussion about good versus evil and agreed I should never post about having a fictitious therapist again, as this hurt their feelings.

In writing this post, I encountered some useful analysis on the grammar of data by Jeff Aronson (2018)

I also found the metadata section of Wilkinson's (2005) Data chapter (in The Grammar of Graphics) helpful; the overall chapter may also satisfy those of you who are more technically-inclined

And thank you to @VictoriaCarr_ for reviewing this post and testing the intuitions behind my strange assertions!
