Feature Engineering

“Feature engineering” is an integral part of the data analysis process: it consists of representing the underlying data objects in forms that are machine-readable and that an algorithm can work with. While this is a subject I’ve worked with in my job and studied in classes, I hadn’t until recently taken a formal, direct approach to learning the theory behind this specific facet of my work. I acquired the book “Feature Engineering for Machine Learning and Data Analytics” in a Humble Bundle quite a long time ago and have started working through it. The book focuses on theory and doesn’t include any specific exercises, so I have been supplementing it with exercises I find on the internet, written in R code in RMarkdown.

For instance, chapter 2 of the book focuses on the feature engineering part of text mining, and this example may make an effective illustration of what I mean when I talk about feature engineering as distinct from other parts of data analytics. Text mining is about deriving useful information from unstructured text data using a machine. So it is text mining when social media companies go through all of your tweets and use them to build a profile of your likes and dislikes, your opinions, your interests, the kinds of people you are friends with, whether or not you’re a potentially dangerous enemy of the state who must be watched by intelligence officers… you know, the usual normal things. Humans produce enormous amounts of text data every day and much of it is available on the internet (though a large proportion of that remains unindexed by search engines, hidden away in private websites where no spider may venture alive!), but that data isn’t easy for computers to understand in the form in which it comes into the world. That’s what makes it unstructured data, after all - without a mediating representation of the data, the computer can’t make heads or tails of it, even though such rich information about the world is contained within.

Before an algorithm can analyze the text, the text has to be transformed into a format the computer understands. There are many techniques for computing features (“features” being the characteristics derived from underlying data objects which a computer can use for its algorithmic analyses) from text data, but I don’t need to get into that just yet. The point is that there is a transformation process which precedes analysis. Of course, I was already familiar with this process from my work, particularly the arduous process of cleaning data and harmonizing data from multiple sources. But the process of preparing text is in practice very different, because it starts from completely unstructured data. I’ve practiced in R using the text of Jane Austen’s novels, attempting to build computer-analyzable data out of it, and there are many fiddly aspects which are just as challenging as the data cleaning and harmonization I do at work. It’s often frustrating trying to get the data to cooperate, because an R function or some other part of the cleaning process does not behave the way you want it to or think that it should!
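
To make that transformation step concrete, here is a minimal sketch of one very simple way to turn text into features: a bag-of-words count matrix, where each row is a document and each column is a word. This is only an illustration under my own assumptions, not a method taken from the book. The helper names `tokenize` and `bag_of_words` are hypothetical, the tokenizer is deliberately crude (lowercase, letters and apostrophes only), and the two sample sentences were chosen just for the demo (the first is the opening line of Pride and Prejudice). The sketch is in Python using only the standard library; the exercises mentioned above were done in R.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct token across all documents, in sorted order.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for missing words, so every row has one entry per vocabulary word.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


# Sample sentences chosen for illustration only.
documents = [
    "It is a truth universally acknowledged, that a single man in possession "
    "of a good fortune, must be in want of a wife.",
    "A single man of large fortune; four or five thousand a year.",
]

vocabulary, matrix = bag_of_words(documents)

# Print one line per word, followed by its count in each document.
for word, *row in zip(vocabulary, *matrix):
    print(f"{word:<14}{row}")
```

Each sentence has become a row of numbers that an algorithm can compare, cluster, or feed into a model, which is the sense in which the unstructured text now has a "mediating representation". Even this toy version shows where the fiddly decisions come from: how to treat punctuation, capitalization, and apostrophes all change the resulting features.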

REFERENCES

Dong, Guozhu, and Huan Liu, eds. Feature Engineering for Machine Learning and Data Analytics. First edition. Chapman & Hall/CRC Data Mining & Knowledge Discovery Series, no. 44. Boca Raton: CRC Press/Taylor & Francis Group, 2018.

Written on April 24, 2022