Understanding Data 01 – Why is it important?

Machine learning is not all about mathematical models and algorithms. In fact, it can be described as a combination of the following three components.

  1. Data
  2. Feature Engineering
  3. Model

Each of these plays a vital role in developing an efficient and effective machine learning system. Let’s see why.

Imagine the model as a smart student, and we give him a book to study (here the book represents the data plus the features). If the content of the book is wrong, or if it is right but written in a language the student can’t understand, then even a smart student will have a hard time learning.

Similarly, even if we have a mathematical model which is super smart, if our data or our features (how we represent that data to the model) are wrong, then the algorithm will most probably fail.

The same goes the other way: if the book is correct but the student is not capable, he will not learn properly either. And even with correct data and a smart algorithm, if our representation of that data (how we feed it to the algorithm) is wrong, the system will still fail. Therefore, to have an efficient and effective machine learning system, we have to pay attention to all three aspects above.

Now one might be wondering why we need feature engineering (whatever that is). We will discuss feature engineering in detail in the coming articles, but for now let’s see why it matters.

At first we might think: okay, I have the data, I have the model, why not feed the data in directly?

Well, in a handful of special situations we can do that. But in a real-world situation things are a bit messy. Here are a few of the things that can happen:

  1. some parts of data can be missing
  2. some parts of data might be corrupted
  3. some of the data might be duplicated
  4. etc.
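
As a rough illustration of the problems listed above, here is a minimal sketch in plain Python that counts missing, corrupted, and duplicated records in a small in-memory data set. The field names (`user_id`, `video_id`, `watch_seconds`) and the rule that a negative or non-numeric watch time counts as "corrupted" are assumptions made up for this example, not part of any real system.

```python
# Hypothetical record layout, invented for this illustration.
REQUIRED_FIELDS = ("user_id", "video_id", "watch_seconds")


def audit_records(records):
    """Return (clean_records, report) for a list of dict records."""
    report = {"missing": 0, "corrupted": 0, "duplicated": 0}
    seen = set()
    clean = []
    for rec in records:
        # Missing: a required field is absent or set to None.
        if any(rec.get(field) is None for field in REQUIRED_FIELDS):
            report["missing"] += 1
            continue
        # Corrupted (assumed rule): watch time must be a non-negative number.
        watch = rec["watch_seconds"]
        if not isinstance(watch, (int, float)) or watch < 0:
            report["corrupted"] += 1
            continue
        # Duplicated: the same (user, video, watch time) was already seen.
        key = (rec["user_id"], rec["video_id"], watch)
        if key in seen:
            report["duplicated"] += 1
            continue
        seen.add(key)
        clean.append(rec)
    return clean, report


records = [
    {"user_id": 1, "video_id": "a", "watch_seconds": 120},
    {"user_id": 2, "video_id": "b", "watch_seconds": None},  # missing value
    {"user_id": 3, "video_id": "c", "watch_seconds": -5},    # corrupted value
    {"user_id": 1, "video_id": "a", "watch_seconds": 120},   # duplicate of the first
]
clean, report = audit_records(records)
print(report)
```

Only the first record survives the audit. In a real project the definition of "corrupted" depends entirely on the data, so the check above would need to be replaced with rules that fit it.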

We also have to consider the amount of data. Say we are building a machine learning prediction system for a YouTube-like site, which predicts what a particular user will be interested in. In such a case we will have petabytes of data or more (trillions of records of the videos clicked by users from all around the world). Such a large volume of data will include details such as, say, the user’s middle name, which is in the data but has almost nothing to do with what he will click when he is on YouTube. Such information is useless from the perspective of the machine learning algorithm. If we feed it such data, it might try to find relationships between the videos watched and the middle name, which would be spurious at best. Also, with such a huge amount of data, an algorithm might take weeks or even months to learn, even on hardware-accelerated computers.

Thus feature engineering is necessary in order to pick out the relevant parts of the data, drop the rest, and feed only what matters to the algorithm. This will make the system more efficient and also much more accurate.
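
The middle-name example can be sketched in a few lines of plain Python. This is only a toy illustration of dropping an irrelevant attribute before the data reaches a model. The field names and the choice of which fields count as "relevant" are assumptions made for this example, and real feature engineering involves much more than selecting columns.

```python
# Fields assumed (for this illustration only) to be useful to the model.
RELEVANT_FIELDS = ("user_id", "video_category", "watch_seconds")


def select_features(record, fields=RELEVANT_FIELDS):
    """Keep only the fields we decided are relevant; drop everything else."""
    return {name: record[name] for name in fields}


raw = {
    "user_id": 42,
    "middle_name": "Lee",  # present in the data, irrelevant to the prediction
    "video_category": "music",
    "watch_seconds": 215,
}
features = select_features(raw)
print(features)
```

The model then never sees `middle_name`, so it cannot waste time looking for a pattern in it.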

So we will look into the aspects of data and feature engineering step by step in the coming articles.

Next up: Understanding Data 02 – Attributes and Values

