Statistics in the field of Data Science

Queenie Pamatian
3 min read · Oct 2, 2020


According to Investopedia, statistics is a form of mathematical analysis that makes use of quantified models, representations, and synopses for a given set of data, whether experimental or real-life.

To me, personally, statistics is just another intimidating mathematical subject, yet we encounter it almost every day and can put it to use in many aspects of our day-to-day lives.

As an aspiring data scientist, I have no choice but to embrace statistics, since it is deeply intertwined with my field of interest. A data scientist will definitely be working with data, perhaps as a unicorn who sees the project through from data collection to the very end, or perhaps just providing insights and analysis on an already clean data set. And what does a data scientist do with that data? They have to assess which data and which specific variables to use, analyze the data, identify patterns and trends, and apply models. In essence, data scientists use statistics in many steps of their work. In data analysis, they might use a simple mean, median, or mode, or resort to more involved statistical tools such as tests of correlation. They use statistics to generate insights from the data that they have. Sufficient knowledge of statistics is really necessary to become a data scientist.
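To make that concrete, here is a minimal Python sketch of the kind of statistics a data scientist might reach for first. The little table of website visits and sales is purely made up for illustration:

```python
import pandas as pd
from scipy import stats

# Hypothetical numbers: daily website visits and sales (purely illustrative)
df = pd.DataFrame({
    "visits": [120, 150, 90, 200, 170, 130, 160],
    "sales":  [12,  18,  8,  25,  20,  14,  17],
})

# Simple descriptive statistics
print("Mean visits:  ", df["visits"].mean())
print("Median visits:", df["visits"].median())
print("Mode of sales:", df["sales"].mode().iloc[0])

# A test of correlation (Pearson) between visits and sales
r, p_value = stats.pearsonr(df["visits"], df["sales"])
print(f"Pearson r = {r:.2f}, p-value = {p_value:.3f}")
```

Even these few lines already generate a small insight: whether the two columns move together, and how strongly.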

There are many statistical techniques that can help a data scientist. For instance, linear regression is very useful for predicting the next set of data points, so it can help a data scientist form recommendations. Sampling methods are basic but very much needed as well. A data scientist may one day encounter a very large dataset, and if they do not have direct access to it, the time spent on extraction and transfer would be wasted, so adequate sampling is the next best thing.
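As a rough sketch of both ideas, assuming a hypothetical table with ad_spend and revenue columns (the names and numbers are made up), one could sample a small fraction of the rows and fit a simple least-squares line on that sample:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical "huge" dataset: ad spend and the revenue it generated
big = pd.DataFrame({"ad_spend": rng.uniform(0, 100, 100_000)})
big["revenue"] = 3 * big["ad_spend"] + rng.normal(0, 10, len(big))

# Simple random sampling: work with 1% of the rows instead of all of them
sample = big.sample(frac=0.01, random_state=42)

# Fit a linear regression (ordinary least squares) on the sample
slope, intercept = np.polyfit(sample["ad_spend"], sample["revenue"], deg=1)
print(f"revenue ~ {slope:.2f} * ad_spend + {intercept:.2f}")

# Use the fitted line to predict a value we have not seen yet
print("Predicted revenue at ad_spend = 120:", slope * 120 + intercept)
```

The fitted line is what turns into a recommendation: plug in a value you care about and read off the prediction.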

Let’s say you are given multiple data sets, each with over a hundred thousand rows and ten columns. What do you do?

Of course, the wise answer is to explore and clean the data (other than praying that your laptop can handle the load). Exploring the data means identifying which data you actually need, and cleaning the data means removing whatever is not remotely relevant to the study; removing the null values is an option as well.
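A minimal pandas sketch of that explore-and-clean step might look like the following. The file name transactions.csv and the column names are assumptions for illustration only:

```python
import pandas as pd

# Assumed file name; in practice this is whatever raw data you were handed
df = pd.read_csv("transactions.csv")

# Explore: get a feel for the shape, column types, and missing values
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Clean: keep only the columns relevant to the study (hypothetical names)...
relevant = ["customer_id", "amount", "date"]
df = df[relevant]

# ...and, if it suits the analysis, drop the rows with null values
df = df.dropna()
```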

If you do not explore and clean the data, you will not be able to see the bigger picture, because the data you do not need acts as noise. You can spend time coding and making a model work for one variable, only to find there was a better option you would have known about had you made time to explore the data. Admittedly, exploring and cleaning the data takes a lot of time (our mentor even said that this process takes weeks, only to do the actual data science for two hours), but it is a process you cannot skip. If you do skip it, you may spend days trying to process a very heavy file (which takes even longer to load), only to find it futile because the data is dirty and some of the variables are not consistent with each other. Moreover, holding a large amount of data, especially if it is personal, poses a security risk, so it is better to explore and clean the data before you apply that data science magic.
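As a follow-up to the sketch above, and under the same assumed file and column names, a few quick checks can surface variables that are not consistent with each other and reduce the risk of carrying personal data around:

```python
import pandas as pd

# Same assumed file and column names as the earlier sketch
df = pd.read_csv("transactions.csv")

# Inconsistent variables often show up as object dtypes where you
# expected numbers, or as odd values in the summary statistics
print(df.dtypes)

# Coerce a supposedly numeric column; non-standard entries become NaN
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Drop personally identifiable columns early to reduce the security risk
df = df.drop(columns=["email", "phone_number"], errors="ignore")
```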
