Friday, June 26, 2015

The ‘data’ Jump off: Drilling down in Tableau dashboards using Action filters

Tableau never ceases to impress me. As a newbie still learning the ropes, I am always discovering a new thing that blows my mind. During our monthly data review meeting, I realized that users were always asking for further details concerning the performance of a certain indicator. When displaying data from regions and comparing the performance of regions, questions are always being asked- show us the sites contributing to that performance or how has the region been performing over the years or the previous months? Presenting data in Tableau has made a big difference for me and. I have always wondered why PowerPoint does not provide links to data for drilling down to data easily without leaving PowerPoint or the presentation. Thinking of this; I remembered how I can implement this drill down functionality in SSRS with sub reports and sub report parameters. I didn’t know how to do it in Tableau. I turned to Mr. Google. Googling led me to a number of links including the Tableau knowledge base article here. But the most helpful information was what I found

Thursday, March 26, 2015

SQL SUM values with NULLS, COALESCE and you will see different results

I thought I have been writing many SQL queries to fully understand SQL. I was wrong. While generating data from our SQL Server database today, I wrote a certain query to get me total of new positive HIV clients at PMTCT per county. This is the query that I wrote.
SELECT County,[Reporting Month], sum([HV0201]) as 'Women tested at ANC', sum([HV0204])as 'Total women tested', sum([HV0205]) as 'Known Positive's at Entry', sum([HV0206]) as 'HIV+ at ANC', sum([HV0207]) as 'HIV+ at L&D', sum([HV0208]) as 'HIV+ at Postnatal', (sum([HV0206]))+(sum([HV0207]))+(sum([HV0208])) as 'Total New Positives', sum([HV0209]) as 'Total Positive', sum([HV0210]) as 'Total with known status', sum([HV0213])as 'NVP only Prophylaxis', sum([HV0214])as 'AZT+SdNVP Prophylaxis', sum([HV0215])as 'Interrupted HAART', sum([HV0216]) as 'HAART', sum([HV0217]) as 'Total Maternal Prophylaxis', sum([HV0220]) as 'Assessed for Eligibility', sum([HV0224]) as 'Infants tested', sum([HV0229]) as 'Infants HIV positive', sum([HV0244]) as 'Total Infants issued with Infant Prophylaxis' from vw_PMTCTData_RetainedSites group by County,[Reporting Month]
I was querying a View that I had created in SQL Server. I obtained the results and proceeded to paste them (with headers) to an Excel spread sheet to share with the relevant person. This is how the sheet looked like.
My summation returns NULL when any one of the three Indicators - 'HIV+ at ANC','HIV+ at L&D' and 'HIV+ at Postnatal' is NULL. This is not what I wanted. You get to know the Total women who are positive by adding the Women who knew their status as HIV positive when they got pregnant with the new women who have discovered that they are HIV positive after being tested at either the ANC clinic(when still pregnant), during Labour and delivery or at testing positive at the Postnatal care clinic. These are called the 'New positives' So from my results, the results for Congo and Yukon counties are wrong since for Congo, the value for HIV + at LD was NULL while for Yukon, the value for HIV+ at Postnatal was NULL. To sort this out, you use the COALESCE function. The new T-SQL statement will look like this
SELECT County,[Reporting Month], sum([HV0201]) as 'Women tested at ANC', sum([HV0204])as 'Total women tested', sum([HV0205]) as 'Known Positive's at Entry', sum([HV0206]) as 'HIV+ at ANC', sum([HV0207]) as 'HIV+ at L&D', sum([HV0208]) as 'HIV+ at Postnatal', (sum([HV0206]))+(sum([HV0207]))+(sum([HV0208])) as 'Total New HIV+', COALESCE(sum([HV0206]),0)+COALESCE(sum([HV0207]),0)+COALESCE(sum([HV0208]),0) 'Total New HIV+_COALESCE', sum([HV0209]) as 'Total Positive', sum([HV0210]) as 'Total with known status', sum([HV0213])as 'NVP only Prophylaxis', sum([HV0214])as 'AZT+SdNVP Prophylaxis', sum([HV0215])as 'Interrupted HAART', sum([HV0216]) as 'HAART', sum([HV0217]) as 'Total Maternal Prophylaxis', sum([HV0220]) as 'Assessed for Eligibility', sum([HV0224]) as 'Infants tested', sum([HV0229]) as 'Infants HIV+', sum([HV0244]) as 'Total Infants issued with Infant Prophylaxis' from vw_PMTCTData_RetainedSites group by County,[Reporting Month]
I pasted my new result set into Excel with the results of the two summations side by side.
The summation using the ordinary SUM ( shown in Yellow)shows the Total New HIV positive for Congo and Yukon regions as NULL. The summation using the COALESCE function (shown in green color) shows the Total New HIV positive for the two regions as 15 and 27 respectively. This shows that using bare SUM in T-SQL can return you what you don't expect, least to say the wrong answer when one of the variables in the SUM has nulls in it. To avoid this, you should always use the COALSCE command in SQL Summations. This more so applies to databases where strict controls were not effected and NULL values end up getting stored. I have always said that there is a different between a NULL and a 0. That itself is a post for another day

Thursday, December 11, 2014

My first Tableau dashboard

After a lot googling and putting data into the right format, I was able to craft my first Tableau dashboard.  

Wednesday, December 3, 2014

My first Visualization with Tableau

It has been a good week for me. I have pushed myself to learn Tableau within the shortest time and get things working.

I was asked to a site level data analysis for specific HIV care and treatment Indicators. The data was from an Access database application. I first had to split the database to get to the back end of the database. The second and major headache was to get the data in the right format that Tableau will accept. I used Excel to transpose the data to a format that Tableau accepts.

I loaded the data and started getting my hands dirty with tableau. Eventually, I was able to come up with this 

Tuesday, December 2, 2014

A not so gentle introduction to Tableau

It is my second week at my new work place. I am liking the place.It has been a hectic two weeks. The Introductions, orientations and all.

And now comes the time to settle down...and then the data, and of course the request that comes with it. OK, that is not a big deal, let me see the data. And what do you use here for data analysis?

"Tableau"

Whaaaaaat? What is Tableau? I ask. A tablet? a book?

"Tableau is a data analysis software", I am informed. "And that is what we use here for data analysis. That is what you will be using".

I slump on my desk, just like my son usually does when he is told to do something he does not like. He picked it from Sponge bob square pants. Well, that said, there are no pants here,I realize that I don't have all the time to whine, I'd better get myself started with this Tableau monster. So first, I head to Mr. google and google what this Tableau.

I find out that Tableau is a BI software more so like others BI tools I have used elsewhere: Actuate, SAP Crystal Reports, SQL Server BI stack, Pentaho, Jasper Reports and of course the master - Excel. I come to also realize that it is highly regarded and is among the top BI tools as per this Gartner report. It is placed on top, ahead of many "Big" players in the data analytics field. They are placed above Qlik View, SAP, IBM, Micro strategy and even Microsoft itself. (not that I feel they are the best, but I have been using Microsoft BI software for some time now. You can download the report here

Ok, so i started to have some respect to this Software and the company. I spend my better part of the afternoon hearing more about it and eventually downloaded the free version - Tableau Public before I could get my licensed version. The license can cause an arm and a leg, especially for Africa and Middle east countries.
,


Wednesday, November 26, 2014

Excel sparklines; a trending wonder in cell

Microsoft Excel never ceases to amaze. There are many tools in excel that the novice information user can be empowered to use and make sense of their data. From Excel 2010, Microsoft introduced a tiny little wonder called sparklines. Sparklines are tiny charts  that can be embedded on a  worksheet cell to provide visual representation of data. Sparklines provides trends visualization on a single cell from a series of values, such as sales over a period of time.

Creating sparklines are pretty easy. You just select the cells with the data that you want to display the trend, then from the Excel menu, choose Insert. Under the Sparkline ribbon, choose the type of sparkline that you want to use to visualize your data. There are three options; line, column and win-loss chart. The line is pretty much straight forward and easy to understand. You then specify the cell that you want your sparkline display. The best practice is to show the visualization next to the data - the last cell after the data on the same row.

You can format your sparklines as per your taste. You can modify the size of the trend line or the color. But what i like the most is to format the sparkline to assist the user to visualize data. This can be achieved by adding points of interest in the sparklines. Under the show tab, you can format your sparkline to show the highests point, the lowest, negative points, the first or the last point. You can add tick marks that show trends over the periods in your dataset. To format your sparkline, select the sparkline to reveal the Design Tab.


For what I do, I created sparklines to help program managers to see the trends of new HIV infections from the districts that they work in; in addition to several HIV care and treatment indicators that needed to be tracked over time. As shown below.

From this visualization, one can see at a glance that an Indicator that is cumulative (incremental) has a problem if the sparkline is not showing a continued upward trend as shown below.


This is another reason why Excel is usually called the king of Business Intelligence

Wednesday, November 13, 2013

Programmer to Statistician. The Intro

I remember when I first took a job as a Data Manager. We had a rigorous two day interview that was by any means NOT mean for either a programmer or a statistician. The requirements ranged from developing databases, data collection systems for mobile and desktops, analyzing data using SAS, SPSS, STATA and/or EPI Info. I had applied for the job coz i felt I could do the databases and programming part, and try basic SPSS data analysis since I am a Computer Science graduate.

I got shortlisted for the interview and came the day of the interview, made my way to the organization dressed in Khaki trousers. We had been told day one will be purely practical and if you make it through,  day 2 will be the oral part. We where given desktops loaded with Visual Studio 2005, VB 6, Java and Netbeans IDE, STATA 10, SPSS 17 and SAS ver 9.1, and of course Epi Info. There where three parts of the practical Interview - Part 1 was to create a data dictionary for a research study for HIV in Adolescents. We where given a 2 page questionnaire to create the data dictionary with. Part 1 B was to create a database based on the data dictionary using either Access, SQL Server or FoxPro. Part II of the database was to create a data collection tool or application that will save data to the database that you had created in Part 1 B. You where asked to place validation checks such that string data wont be saved in places where numeric data was expected as well as create skip patterns as per the questionnaire -say if respondent is male, then you wont ask them the question on When they started menstruating.

In the last part of the interview, we where given a data set  that had fictitious data from the said study- collected by the questionnaire we had just designed. It was in excel and we where asked to pull it into a statistical package of our choice. We where then asked to perform data manipulations/data management tasks like identifying duplicate entries, identifying missing values, invalid values in some variables and knocking of duplicates. Lastly we where asked do some descriptive statistics and save the output in a report or make a presentation on power point to present the following day.

Basically, it was a long day for all of us. We left their offices at 6:30.

In the mix were both statisticians, mathematicians and us,the computer science grads. As you can see, neither of the teams had an upper hand since if you where a stats guy, you will have an easy time with the stats packages, data manipulation and analysis but you will have a rough time to program an application in VB 6, .NET or Java. Similarly, the Computer science guy may be good in the development side and creating a well normalized database but when it comes to the statistics, he will do very little indeed!

I have realized that there is a thin line between the two jobs. Actually, they really inter marry a lot. Gone are the days when an organization will hire a different programmer and a different data analyst, they would rather hire a cross breed of the two who can do both tasks well.

Either way, I got the job. And for the next three years, I learned a lot as a cross breed. From being humbled  -  I did not know what an observation was nor a variable was. I will be sharing more on how people in these two fields can bridge the gap between them.