If you’re new to Analytics, you will run into a long list of topics to explore: Reports, Dashboards, Business Intelligence, Data Visualization, Data Analytics, Big Data, AI, Machine Learning and Deep Learning. The list can be overwhelming for a newbie who is just beginning the journey.
I wanted to check how five of these buzzwords currently rank against each other in search interest: “Business Intelligence”, “Data Analytics”, “Big Data”, “Machine Learning”, “Deep Learning”.
I used my favorite tool, Google Trends, for this. I looked at worldwide data for the last 5 years, with Google search queries as the source.
I inferred the following from the above user-searched data:
Big Data has stayed at the top of users’ minds for quite a long time, since 2012. However, Machine Learning has been soaring since 2015, and it could overtake Big Data within a year as the “hottest” skill set for any aspiring Analytics professional.
Deep Learning is an emerging space! It will likely gain more momentum over the next year. It’s essential to learn Machine Learning concepts before moving on to Deep Learning.
Needless to say, the Data Analytics field is also growing moderately. For beginners, this could be the best place to start.
The BI space is starting to lose users’ attention, thanks to self-service BI portals (which automate the building of reports and dashboards) and Advanced Analytics.
I found a few more interesting insights when I drilled down by industry:
Data Analytics is still the hot topic for Internet & Telecom
Big Data for Health, Government, Finance, Sports and Travel, to name a few
BI for Business & Industrial
Machine Learning for Science
Interest by region shows that China is keen on Machine Learning and Japan on Deep Learning. Overall, Big Data is still the hot topic across the world for the time being. Based on the above graphs, it’s quite evident that Machine Learning will turn out to be the top skill for any Analytics professional to have in his/her kitty.
You can go through this Forbes article to understand the differences between Machine Learning and Deep Learning at a high level.
Please let me know what you think will be the hottest topic in the Analytics spectrum.
The Google Docs Explore feature was first introduced last year, and it helps a good deal while drafting documents, spreadsheets and slides. In this article, I’d like to share a few things about how to use the “Explore” feature on a spreadsheet. It uses machine learning to understand natural-language text queries and delivers the outcome instantly!
Isn’t it cool that you can just type a question in natural language and get the top, bottom and basic statistics in a matter of seconds?! Needless to say, it also suggests ways to refine the queries further 🙂
For illustration, here’s a simple example:
I’ve created sample data depicting a students’ scorecard. It comprises student names, subjects and the corresponding scores. If you click on the headers and open the Explore option at the bottom-right corner of the Google Spreadsheet screen, a panel pops up and recommends queries that you might be interested in.
Some of the questions can be:
a) Top Score
b) Least Score or Bottom Score
c) Unique Subjects
d) Average of Score by Subject
e) Histogram of Score
f) Bar Chart of Average Score by Subject
You can start exploring the available fields this way. I believe it can be a handy tool if you use Google Spreadsheet for your projects. It also lets you customize the generated formula, as in the screenshot below. The syntax looks pretty much like SQL.
You can also use this feature as part of your exploratory data analysis. And here’s another classic example for you:
If you’re a teacher or the head of a school, you might be surprised to see the distribution of Average Score vs. Subject below. The tool gives us the insight that Physics has the lowest average score among the subjects. This is just the starting point for going deeper into the data and uncovering patterns. One key decision from this inference could be to refine your teaching methods for Physics.
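For readers who prefer code, the same “Average of Score by Subject” question can be sketched in a few lines of Python. This is only an illustration of the calculation, not what Explore runs internally, and the student names, subjects and scores are made up rather than taken from my spreadsheet.

```python
from collections import defaultdict

# Hypothetical scorecard rows: (student, subject, score).
scorecard = [
    ("Asha", "Maths", 82), ("Asha", "Physics", 58), ("Asha", "English", 75),
    ("Ravi", "Maths", 90), ("Ravi", "Physics", 64), ("Ravi", "English", 70),
    ("Meena", "Maths", 71), ("Meena", "Physics", 49), ("Meena", "English", 88),
]

def average_by_subject(rows):
    """Return {subject: average score} for (student, subject, score) rows."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [sum of scores, count]
    for _student, subject, score in rows:
        totals[subject][0] += score
        totals[subject][1] += 1
    return {subject: total / count for subject, (total, count) in totals.items()}

averages = average_by_subject(scorecard)
# The subject with the smallest average is the one to look into first.
lowest_subject = min(averages, key=averages.get)
print(averages)
print("Lowest average:", lowest_subject)  # Lowest average: Physics
```

With this made-up data, Physics comes out lowest, which mirrors the kind of insight the chart gives.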
Go ahead and explore Google’s Explore feature, and let me know how it helps you!
Data analysis of any sort requires cleaning and formatting the data.
Microsoft Excel is predominantly used for that. The data could come from multiple upstream systems! It’s highly unlikely that you would get the data ready-made for further processing.
Let’s take a hypothetical example:
A fashion e-commerce startup wants to identify the top 3 cities in a specific country that have returned the most products to its retailers. The company might then want to scrutinize the problems faced by its customers, and take key decisions to minimize returns or strengthen the returns policy to prevent the resulting losses.
The returns team of that company maintains one relevant field named “Address”. In the Excel sheet, extracting the City/State/Pincode from the Address would be a manual and repetitive task. Of course, one can use a combination of formulas such as MID and FIND to extract what we want to an extent. Well, there’s a better way in Microsoft Excel 2013 and later versions.
It’s called “Flash Fill”, designed by Dr. Sumit Gulwani, a Microsoft researcher. The algorithm learns by example (a technique known as programming by example): it discovers the pattern from a couple of sample values and populates the remaining data using what it has learned! This is a great time saver in many cases. I’ll highlight an example below.
Using the available Address, we can now extract County/City/State/Pincode using the Flash Fill feature.
Create a new field/variable and name it. I created “County” for my requirement.
I typed the first three records manually: Orleans, Livingston, Gloucester.
Then I selected these three cells and dragged them down to the last record. You can see below that it just repeated the three words over and over.
At the bottom of this screenshot, you can see a small button that appears and offers a few more options.
Click “Flash Fill” and see the magic for yourself :). It has identified that I want to extract only the County from the Address field. You can similarly extract other key info such as State and Pincode.
In certain cases, Flash Fill pops up automatically and suggests values while you type the sample data, as shown below.
You can also apply Flash Fill to format numbers such as telephone numbers and Social Security Numbers, to name a few.
A couple of tips:
If it fails to identify the pattern in your case, teach it by typing a few more examples for “Flash Fill” to learn from. Usually, I type 2 or 3 examples and the algorithm picks up the rest of the data from there.
In the above example, a separator (a comma) set apart the county, state and pincode in the Address field, which made the job much easier for “Flash Fill”. If your data is messier, you can iterate a few more times to clean it as you wish.
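If you ever need the same comma-based extraction outside Excel, here is a minimal Python sketch. It is plain string splitting, not the Flash Fill algorithm, and the address layout (street, county, state, pincode) and the sample rows are assumptions made for illustration.

```python
def split_address(address):
    """Split 'street, county, state, pincode' into a dict of named parts."""
    parts = [part.strip() for part in address.split(",")]
    if len(parts) != 4:
        # Rows that don't match the assumed layout need manual cleaning.
        return None
    street, county, state, pincode = parts
    return {"street": street, "county": county, "state": state, "pincode": pincode}

# Hypothetical sample rows; the last one has no separators at all.
addresses = [
    "6649 N Blue Gum St, Orleans, LA, 70116",
    "4 B Blue Ridge Blvd, Livingston, MI, 48116",
    "missing separators 12345",
]
parsed = [split_address(a) for a in addresses]
counties = [p["county"] for p in parsed if p is not None]
print(counties)  # ['Orleans', 'Livingston']
```

Rows that come back as `None` are the ones where the separator is missing, which is exactly where a few more manual examples or cleaning passes are needed.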
The purpose of forecasting or prediction is to make an informed decision today based on the historical information that we have.
I often wonder what the differences are between forecasting and prediction (or predictive analytics). Are you confused too? Please read on…
Apart from being buzzwords among business people, are there any differences in their definitions and usage? Based on a few articles I went through on the web, these are my findings:
Forecasting is a generic term used by professionals across various disciplines. It applies at a high level. For example: what could be the total sales for a particular product line in the next quarter? It uses time series data. Another classic example is the weather forecast for next week. Here, time acts as the independent variable.
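To make the sales example concrete, here is a minimal Python sketch of a very naive forecast: next quarter’s sales are taken as the average of the last few quarters. The sales figures and the 4-quarter window are made-up assumptions, and a real forecast would also account for trend and seasonality.

```python
def moving_average_forecast(history, window=4):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or len(history) < window:
        raise ValueError("need at least `window` observations")
    recent = history[-window:]
    return sum(recent) / window

# Hypothetical quarterly sales (in units), oldest first.
quarterly_sales = [120, 135, 150, 160, 155, 170, 180, 175]
forecast = moving_average_forecast(quarterly_sales, window=4)
print(forecast)  # 170.0
```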
Predictive analytics is a term that has become hugely popular in the analytics space. It can be done at a much more granular level. For example, a credit card company would be interested in predicting which customers might default during the New Year festive season. Beyond that, predictive analytics helps us understand the relationship between variables, using methods such as regression.
Another interesting perspective on the difference, which I read, is as follows:
Forecasting is all about numbers: again, think of the total sales example I mentioned above. Prediction is more about behavior: on the Amazon website, you must have seen the recommendation engine. Although it involves more Machine Learning concepts (it improves and learns by itself), it recommends, say, books based on what I had purchased earlier on the same website. It categorizes based on the genres and authors I would be interested in reading. In fact, I discovered some amazing books thanks to Amazon’s intelligent recommendation engine!
What’s your take on Forecasting vs. Predictive Analytics?
If you’ve gone through the list of data science skills, you might wonder where to begin this journey. I’ve been pondering the same question for quite some time!
I discovered a timeline chart in the form of an infographic released by Kunal Jain of AnalyticsVidhya fame. It lists the things one can learn step by step, month by month. Although it was prepared in Jan 2016, I think it still looks relevant, and we can use it as a goal plan. Welcome on board! Let’s do this!!
The emphasis should be on understanding the concepts and techniques rather than merely knowing how to program them. I think tools and programming languages may change over time. What doesn’t change are the basics, which apply to any domain or field.
There are plenty of ways to learn these. I think that for a complex subject like this, MOOCs (Massive Open Online Courses) are the best way, followed by peer group learning, YouTube videos/blog tutorials and e-books/textbooks.
How are you preparing yourself to become a data scientist?
Do you have a fair idea about data science? I hope so. How about its skill sets?
For newbies, Data Science is the field at the intersection of Statistics, Mathematics, Programming/Technology and Business. Using a combination of all these components, insights and models can be drawn from data. The term is often credited to William S. Cleveland, a Professor of Statistics at Purdue University, who used it in 2001.
I read at least 5 articles on the web today to understand the nuances of this role. Little did I know before my research that so many things are attached to the buzzword “Data Science”.
Broadly speaking, the skill sets required to be a Data Scientist (as many companies call the folks who work in the Data Science domain) fall under the following areas:
Math & Modelling
We can sub-classify them further; for example, Programming covers R, SAS, Python, SQL and NoSQL, to name a few.
Business2Community features an article written by Bob Hayes. He came up with a questionnaire listing 25 data science skills, captured the responses, analyzed them and ranked the top 10 skills, using “Intermediate” proficiency as the criterion.
The top 10 Data Science skills in general are:
S – Communication (87% possess skill)
T – Managing Structured data (75%)
M – Math (71%)
B – Project management (71%)
S – Data Mining and Viz Tools (71%)
S – Science/Scientific Method (65%)
S – Data Management (65%)
B – Product design and development (59%)
S – Statistics and statistical modeling (59%)
B – Business development (53%)
It’s interesting to note that Communication ranks first among the skills! Catering to internal or external customers, data scientists talk to business functions such as Marketing, HR, Operations, Finance etc. I think it makes sense: what’s the point of working hard to develop a model if you can’t convey the results in terms of the business needs? Guess what? “Data Presentation” was one of the top 10 skills of 2016 published by LinkedIn.
Bob also charted the top skill sets by job role. This is another interesting perspective.
A Researcher can focus more on statistics; a Business Manager on communication and project management; a Developer mostly on the programming aspects, and so on.
That said, it’s very tough to learn all the skill sets of a typical data scientist in a single stretch. Depending on who you want to become, the above list should be helpful. You can prioritize, narrow down the list and start learning one skill at a time! If you’re already good at statistical concepts, try learning how to program the techniques in the “R” programming language. This way, I think one can steadily build up the data science skill sets.
Please remember that there’s no one-size-fits-all approach! If your buddy is good at programming because of a formal education in a software discipline and is moving faster along the learning curve, it’s perfectly okay for you to go at your own pace, depending on your comfort level. At the core of data science, you can be really good at one skill set, know the basics of the others, and eventually reach an intermediate level in them.
My focus will be on statistics to begin with. What is yours right now?
Analytics is transitioning from a niche used by only a few companies and business functions to the mainstream. It has penetrated so widely because it can be applied usefully across all the business groups of an organization.
It has paved the way for new roles in companies, such as Data Analyst and Data Scientist. Don’t be surprised if you find a role such as CAO (Chief Analytics Officer) in a company 🙂
I was researching this a few weeks ago and found that its application has proliferated across all major domains.
Measuring sales force effectiveness is key to forecasting how many products a company could sell in a given period of time. Forecasting sales in a region at the level of individual sales folks is essential to check whether the company can meet the target assigned to that geography. KPI metrics are a useful aid in this process.
A simple time series forecast can bring out the pattern in the historical data. Using this, you can move sales folks between product lines or regions to achieve the targets within a certain time frame. For example, if your focus is on acquiring more B2B merchants, you can deploy more sales people in the B2B sales team rather than B2C.
Analytics here exists to understand customers’ behavior and offer customized services ahead of your competitors.
In this space, I can think of at least 3 companies that have offered me unmatched, personalized services. a) Amazon, for recommending the best products based on my previous orders and browsing history. I’ve yet to discover any other service that recommends books as well as Amazon’s recommendation engine does! b) Uber, for customized coupons that re-engage me, as I’ve mostly been using Ola of late (perhaps because of the availability of cabs). c) Domino’s Pizza, for attractive coupons based on my purchase history in its mobile app.
The Life Time Value (LTV) of a customer is a key metric for these companies, and they tailor their product offerings to improve it.
Supply Chain analytics is a specialized area of focus especially for e-commerce companies.
Just think of it this way: you ordered a book worth 150 bucks and you’re not there to receive it when delivery is attempted. The logistics company tries for 3 days, so the company pays for delivery to be arranged 3 times in 3 days. This cost is not sustainable, and the ROI is negligible or sometimes even negative! Based on such data, DTDC introduced “reverse logistics”, which means the logistics company ties up with a nearby mom & pop store so that you can pick up the order at your convenience.
Of course, employees are the biggest assets of any organization! I can’t agree more on this!!
There are analytical models that can identify the factors influencing employees to leave, predict which employees are likely to leave, predict which employee is best suited to a given position (fit mapping), and help re-engage employees through training and fun programs, with the ROI measured afterwards.
Risk and fraud prevention is crucial for individuals and corporations. Financial institutions, be it a bank or a credit card company, employ techniques that can flag unusual transactions and raise alerts in real time.
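To give a feel for what “flagging an unusual transaction” can mean, here is a toy Python sketch. It uses a simple z-score rule, which is my own assumption for illustration: a new amount is flagged when it lies more than 3 standard deviations away from the customer’s past amounts. Real fraud systems are far more sophisticated, and the amounts below are made up.

```python
from statistics import mean, stdev

def is_unusual(past_amounts, new_amount, threshold=3.0):
    """Flag new_amount if it is more than `threshold` standard deviations
    away from the mean of the customer's past amounts."""
    if len(past_amounts) < 2:
        return False  # not enough history to judge
    mu = mean(past_amounts)
    sigma = stdev(past_amounts)
    if sigma == 0:
        # All past amounts are identical, so any different amount stands out.
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold

# Hypothetical past card transactions for one customer.
past = [42.0, 55.5, 38.0, 61.0, 47.5, 52.0]
print(is_unusual(past, 50.0))   # False
print(is_unusual(past, 900.0))  # True
```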
Hope this article gave you a heads-up on how analytics is being used across major business functions.
I’ve worked with various Software Development Life Cycle models such as Waterfall and Agile Scrum.
In a typical software project, I see there are 6 steps involved:
1) Requirements elicitation (gathering),
5) Deployment and
6) Maintenance (Support)
These best practices keep us focused on the deliverables and help us keep a tab on the timeline. The same thought process can be applied to analytics projects as well. This article gives you a perspective on the steps involved in an analytics project.
Let’s take a simple example to go over the steps in detail.
This example is created for illustration purposes.
Step 1) Unstructured to Structured Problem Statement
The Government of a country wants to cut out the intermediaries while distributing benefits to poor people. The pain point for the Government is that its beneficiary schemes are not reaching poor people accurately or to a sufficient degree. It wants to tackle this problem with a better approach.
For this purpose, we can use the poverty line to measure the poverty of a family. By the international standard, if a family lives on less than, say, $1.90 per person per day, it is categorized as living Below the Poverty Line (BPL). To validate whether such an experiment would be useful, the committee is thinking of setting this up as a pilot project in an identified district of a state.
The aim of this project is to identify 5,000 BPL families in the district and offer each of them a smart card. The Government could use the card to transfer money directly to the beneficiaries on a monthly or quarterly basis, eliminating the middlemen.
Step 2) Data Collection
This is a very challenging step of the project. Remember “garbage in, garbage out”? If the quality of your data is bad, then the model or outcome you intend to produce will be erroneous. 60% to 70% of the time invested in an analytics project can go into this stage!
In general, the data can be external or internal. Assume that the Government does not have any census data capturing each family’s monthly income to determine whether it belongs to BPL or not.
In this case, a survey comprising simple questions should be rolled out to collect answers from each family.
What kind of questions should each family be asked?
If you too are thinking of asking directly about monthly income, think twice. I had exactly the same viewpoint. However, it won’t work, because the family might understate its income or might not have a steady daily wage coming in.
Thus, the questions should somehow cover income without touching upon it directly. Some of the parameters to consider are: family size, head of the family (M/F), access to clean water, sanitation, land & house (own/rent), vehicle, education, skill sets, occupation, weekly or monthly expenses on staple food items etc.
At the end of this exercise, you will have framed the questions. Think of it as a model and assign a score (weight) to each parameter. Using the accumulated score, you can determine whether a family is BPL or not.
Step 3) Data Processing
The agents who ask the questions and capture data on behalf of the Government or the analytics consultant should be trained. At times, data is collected or stored in an incorrect format, which makes it very difficult to analyse. Also, there should not be any missing data!
Coding and formatting the data can happen at this stage. For instance, if the head of the family is female, the likelihood of being BPL is relatively higher than when a male is the head of the family. Similarly, if the family owns land, a house or a heavy vehicle like a tractor, it’s a sign that it might not fall under BPL. Hence, we can code land ownership as 1 for owning land and 0 otherwise.
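The coding step can be sketched in Python as below. The field names (`head_of_family`, `owns_land`, `owns_tractor`, `family_size`) and the way answers are written ("F", "Yes", "No") are hypothetical; a real survey would define its own.

```python
def code_family(raw):
    """Convert one family's raw survey answers into numeric codes."""
    return {
        # 1 if the head of the family is female, 0 otherwise
        "female_head": 1 if raw["head_of_family"].strip().upper() == "F" else 0,
        # 1 if the family owns land, 0 otherwise
        "owns_land": 1 if raw["owns_land"].strip().lower() == "yes" else 0,
        # 1 if the family owns a tractor, 0 otherwise
        "owns_tractor": 1 if raw["owns_tractor"].strip().lower() == "yes" else 0,
        # Family size arrives as text and is stored as a number
        "family_size": int(raw["family_size"]),
    }

# Hypothetical raw answers with inconsistent spacing and casing.
raw_answers = {
    "head_of_family": " f",
    "owns_land": "No",
    "owns_tractor": "no",
    "family_size": "6",
}
coded = code_family(raw_answers)
print(coded)
```

Stripping spaces and normalizing the case before coding takes care of some of the formatting problems that creep in during collection.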
Step 4) Data Analysis
Using the weights assigned earlier, each family can be assessed with the scorecard. Based on this score, we can classify the family as BPL or not. This is one way of solving the problem.
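Here is a minimal Python sketch of such a scorecard. The parameter names, the weights and the cut-off of 3 are all hypothetical values picked for illustration; in a real project they would come out of the survey design.

```python
# Hypothetical weights: positive values point towards BPL, negative away from it.
WEIGHTS = {
    "female_head": 2,
    "no_clean_water": 2,
    "no_sanitation": 2,
    "owns_land": -3,
    "owns_tractor": -4,
}
BPL_CUTOFF = 3  # assumed cut-off: a total at or above this counts as BPL

def bpl_score(coded_answers):
    """Add up the weights of every parameter coded as 1 for this family."""
    return sum(weight * coded_answers.get(name, 0) for name, weight in WEIGHTS.items())

def is_bpl(coded_answers):
    """Classify a family as BPL when its total score reaches the cut-off."""
    return bpl_score(coded_answers) >= BPL_CUTOFF

# Two hypothetical families, already coded as 1/0 (missing keys count as 0).
family_a = {"female_head": 1, "no_clean_water": 1, "no_sanitation": 1}
family_b = {"no_sanitation": 1, "owns_land": 1, "owns_tractor": 1}
print(bpl_score(family_a), is_bpl(family_a))  # 6 True
print(bpl_score(family_b), is_bpl(family_b))  # -5 False
```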
Alternatively, a regression model, one of the most widely used statistical techniques, can be built from training data. The training data would be, say, 10,000 records, each representing a family already labelled as BPL or not. In this data, Is_BPL (containing Yes/No) acts as the dependent variable, and the other variables discussed in step 2 act as independent variables.
Regression would fit an equation. The surveyed data of each family would be fed into this mathematical expression, and the outcome of the model would be Yes/No (1/0), indicating whether the family falls under BPL or not. Because the outcome is Yes/No, logistic regression is the usual choice here.
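As a rough sketch of this idea, the Python below fits a logistic regression (my assumption for the kind of regression, since the outcome is Yes/No) with plain gradient descent, using no external libraries. The eight training rows, the three features and the learning settings are made up for illustration; a real model would use thousands of surveyed records and a proper statistics package.

```python
import math

def sigmoid(z):
    """Squash any number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, learning_rate=0.5, epochs=5000):
    """Fit weights and a bias by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # error = predicted probability minus the true label
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias

def predict_is_bpl(weights, bias, x):
    """Return 1 (BPL) if the fitted probability is at least 0.5, else 0."""
    probability = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
    return 1 if probability >= 0.5 else 0

# Hypothetical coded training rows: [female_head, owns_land, owns_vehicle]
training_rows = [
    [1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 0, 1],
    [0, 1, 1], [0, 1, 0], [1, 1, 1], [0, 1, 1],
]
training_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # Is_BPL: 1 = Yes, 0 = No

weights, bias = train_logistic(training_rows, training_labels)
print(predict_is_bpl(weights, bias, [1, 0, 0]))  # a family like the first training row
```

The fitted `weights` and `bias` are the “equation”; feeding a new family’s coded answers into `predict_is_bpl` gives the 1/0 outcome.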
Step 5) Data Interpretation
If the model is well developed in a programming language such as R or SAS, it will simply output whether the family is BPL or not when we input the details.
At the granular level, the job is made easier! Also note that the accuracy of this type of project will practically never be 100%, since it’s very difficult to implement in a country with a huge population. An accuracy of 70%-80% would still be good enough for the model to be delivered to the concerned authorities.
Step 6) Call to Action
The last but crucial step of this project is the call to action. At this stage, the Government has a database of the district’s families and whether each is BPL or not. It can pick the first 5,000 of them and offer them the smart cards to use for Government welfare schemes.
In a nutshell, these are the steps involved in any analytics project: structuring the problem statement, data collection, data processing, data analysis, data interpretation and call to action.
I came across an article written by Vincent on the Data Science Central blog. Hope you enjoy reading this one, too!
You may be an employee, a consultant or a CXO of a company. How do you view the $16 bn Analytics industry?
Broadly speaking, Analytics services can be offered by companies in two modes:
Boutique (also called Niche Analytics companies) Model
The companies that fall under this category have deep expertise in Analytics. They can offer services in almost all major domains, such as Banking, Finance, Insurance, Manufacturing and Government, to name a few.
They have the right set of people, trained in the domains, tools & techniques required for analytics projects. Major clients approach these companies (or vice versa) for help in analyzing their data. It typically works the way IT outsourcing does.
A few companies in India that operate under this model are
MNCs with big-ticket investments set up their own in-house Analytics division to cater to their tailored needs. For instance, a credit card company can set up its own analytics team to build a system that detects fraudulent transactions on its network and raises alerts.
These companies store highly confidential data, mostly in the BFSI space, and hire consultants to set up the division. The team then grows based on the ROI and the type of projects it takes up.
This really got me thinking. Why is Analytics a buzzword these days?
According to many industry experts I’ve listened to, this is not a brand new process as such. However, industries are now adopting it at scale.
Why now? Because of the overwhelming amount of data being accumulated, thanks to Social Media, Search Engines and especially my favorite, User Generated Content (UGC). With billions of searches on Google, millions of photos uploaded to Facebook and a million rides on Uber, these cutting-edge software companies now store torrents of data and make informed decisions from them!
Just visualize the volume of data being generated every minute, the variety of data (text, multimedia, rich content) created and shared, and the velocity at which it arrives (historical and real-time data). These are the 3 Vs of Big Data! Companies want to make sense of it to reduce costs and improve revenue, profits and customer satisfaction.
The transition in major companies is evident: from Business Reporting using KPI metrics, to Business Intelligence and Data Visualization with Dashboards, to Descriptive Analytics, to Predictive Analytics, to Prescriptive Analytics.
I’ll write up my understanding of these in separate articles.