Curious Extracts: Morvram’s blog on data, technology, society, and the world.

Data Representation

2022-07-18T00:00:00+00:00

Sebastian-Coleman describes data representation as "a set of rules for recording data items." Data cannot be recorded haphazardly - in order to be used meaningfully, it has to be ordered properly. Data representation, as one aspect of the structure of data, is therefore important for maintaining the quality and usefulness of data. This is reinforced by the classification by Redman which Sebastian-Coleman points to: data representation encompasses qualities such as interpretability, portability, the precision and flexibility of format, the ability to represent null values, efficient use of storage, and representational consistency. All of these qualities - rules and constraints on how we store data - help to ensure that we are creating *useful* data, not simply increasing the *volume* of data we have access to.

Data is collected for specific purposes, to represent a meaning beyond the actual bits of data themselves. This is its 'semiotic function' - "using data means interpreting data's meaning" because data is inherently representational and is not 'the thing itself' that we are drawing conclusions about. Data, and our ability to interpret data, provides us with information about something that exists outside of the data itself.

This concept of data representation is part of the broader subject of 'data management', an important consideration in the data science world and a subject that deals with the implications of data ownership. I am currently studying this subject in my night classes and might have a few things to say about it as I go through my research.

Since I'm interested in the ethical use of data, data governance and data ownership as a research area raise a number of topics I might be interested in pursuing further. One example brought up by the paper "A Relational Theory of Data Governance," by Salomé Viljoen, is the use of data from Amazon Ring cameras by police officers. This is an interesting example - there are downstream effects of data ownership which the user may not intend, but which have to be considered from an ethical standpoint and which must factor into the decisions of any organization that wishes to claim ethical standing. It is easy to make a claim of ethical standing in data usage without considering the downstream effects. Some people, for instance, will claim that Facebook's use of data is ethical because data collected without your permission is anonymized - even though this ignores both the ease of data re-identification and the fact that breaches of Facebook's data warehouses can happen and have happened.

SOURCES

Sebastian-Coleman, Laura. Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework. Amsterdam: Elsevier, 2013.

Sebastian-Coleman, Laura. Meeting the Challenges of Data Quality Management. San Diego: Elsevier Science & Technology, 2022.

Viljoen, Salomé. “A Relational Theory of Data Governance.” The Yale Law Journal, 2021, 82.

Inform 7

2022-06-30T00:00:00+00:00

Inform 7 is an engine for creating interactive fiction that relies on an interpreter format. Early games such as Zork, as well as my parents' strange personal favorite Leather Goddesses of Phobos (fun fact: the latter game was referenced in the movie adaptation of Andy Weir's novel "The Martian"), use an interpreter style - that is, they are text adventures where the player plays the game by entering commands and hoping that the command they enter is understood by the game. There is a standard vocabulary of commands that's used to create these games, and they have a few big advantages over choose-your-own-adventure formats such as what you can create in Twine: characters that move from place to place, for example, and coherent movement systems built into the game itself. Inform 7 comes pre-loaded with a lot of these features already assumed and then lets you build on top of them.

I have decided to attempt to create a game in Inform 7, using as its premise an idea that I had for an RPG campaign but never got around to running properly. Due to its nature as a locked-building mystery, I think it would work well in the form of an Inform 7 game. As I work on this, over time, I'll be publishing further posts about what I have learned about the Inform 7 system, and my thoughts about interactive fiction design in general. I think this might be an interesting project to undertake, whether or not I finish it, and perhaps you'll agree. Who's to say.

Feature Engineering

2022-04-24T00:00:00+00:00

“Feature engineering” is an integral part of the data analysis process. It consists of representing the underlying data objects in ways that are machine-readable and can be understood by an algorithm. While this is a subject I’ve worked with in my job and studied in classes, I haven’t until recently taken a formal, direct approach to learning the theory behind this specific facet of my work. I acquired the book “Feature Engineering for Machine Learning and Data Analytics” in a Humble Bundle quite a long time ago and have been starting to go through it. The book is focused on the theory and doesn’t have any specific exercises to go through, but I have been supplementing it by doing exercises I find on the internet using R code in RMarkdown.

For instance, chapter 2 of the book focuses on the feature engineering part of text mining, and this example may make an effective illustration of what I mean when I talk about feature engineering as distinct from other parts of data analytics. Text mining is about deriving useful information from unstructured text data using a machine. So it is text mining when social media companies go through all of your tweets and use that to build a profile of your likes and dislikes, your opinions, your interests, the kinds of people you are friends with, whether or not you’re a potentially dangerous enemy of the state who must be watched by intelligence officers… you know, the usual normal things. Humans produce enormous amounts of text data every day and much of it is available on the internet (though a large proportion of that remains un-indexed by search engines, hidden away in private websites where no spider may venture alive!), but that data isn’t easy for computers to understand in the form it comes into the world. That’s what makes it unstructured data, after all - without a mediating representation of the data, the computer can’t make heads or tails of it, even though such rich information about the world is contained within.

Before the algorithm can understand the text, it has to be transformed into a format the computer understands. There are many techniques for computing features (“features” being the characteristics derived from underlying data objects which a computer can use for its algorithmic analyses) from text data, but I don’t need to get into that just yet. The point is that there is a transformation process which precedes analysis. Of course, I was already familiar with this process from my work, particularly the arduous process of data cleaning and harmonizing data from multiple sources. But the process for preparing text is in practice very different due to starting with completely unstructured data. I’ve practiced in R using the text of Jane Austen’s novels and attempting to build computer-analyzable data out of it, and there are many fiddly aspects which are just as challenging as aspects of the data cleaning and harmonization I do. It’s often frustrating trying to get the data to cooperate, because an R function or other aspect of the cleaning process does not behave the way you want it to or think that it should!
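To make that transformation step a little more concrete, here is a minimal sketch of one very simple way of turning text into features: a bag-of-words count table. This is not the book's method, and it is not the R code I have been practicing with; it's a Python illustration, and the function names (`tokenize`, `bag_of_words`) and the two tiny sample documents are made up for this post. The first sample is the opening words of Pride and Prejudice, the second is invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Turn raw strings into a shared vocabulary and a count matrix.

    Each row of the matrix is one document; each column is the number of
    times the corresponding vocabulary word appears in that document.
    """
    token_lists = [tokenize(doc) for doc in documents]
    # The vocabulary is every distinct token, sorted so columns are stable.
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    matrix = []
    for toks in token_lists:
        counts = Counter(toks)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical sample data, for illustration only.
documents = [
    "It is a truth universally acknowledged",
    "a truth is a truth",
]
vocabulary, matrix = bag_of_words(documents)

print(vocabulary)
for row in matrix:
    print(row)
```

The rows of numbers are something an algorithm can actually work with, which the raw sentences were not. Real text preparation has a lot more fiddly steps than this (punctuation, stop words, stemming, and so on), which is where most of the frustration I mentioned comes from.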

REFERENCES

Dong, Guozhu, and Huan Liu, eds. Feature Engineering for Machine Learning and Data Analytics. First edition. Chapman & Hall/CRC Data Mining & Knowledge Discovery Series, no. 44. Boca Raton: CRC Press/Taylor & Francis Group, 2018.

First demo blog post!

2022-04-16T00:00:00+00:00

Ignore this post, I’m just testing the repository. For some reason the post isn’t showing up properly, and I find this very irritating. Let’s go!