Find data problems with math and visualizations
One of the first things you want to do during the initial exploration of a new data set is to detect any outliers. If you find some, ask whether they are in fact valid data; they could indicate an issue with the measuring system. If the data is valid, then experiment with removing the extreme values to make the differences among the middle values more obvious. What happens to the averages if you remove this handful of values? With Python we can quickly find out whether outliers are making it difficult to see patterns.
Pseudocoelomates are a classification of animal that I thought would make a good example of excluding an outlier for clarity. In the first graph, you can see that the species count of Nematodes is so high that we have trouble seeing the differences between the lower values.
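To make the "what happens to the averages" question concrete, here is a minimal sketch using only the standard library. The readings are made-up numbers, and the 1.5 × IQR fence is just one common rule of thumb that I am assuming for illustration; the notebook may flag outliers differently.

```python
from statistics import mean, median, quantiles

# Hypothetical readings: six ordinary values and one extreme one.
READINGS = [12, 15, 11, 14, 13, 16, 950]

def split_by_iqr(values, k=1.5):
    """Split values into (kept, dropped) using the k * IQR fence rule."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    kept = [v for v in values if low <= v <= high]
    dropped = [v for v in values if v < low or v > high]
    return kept, dropped

kept, dropped = split_by_iqr(READINGS)
print("dropped:", dropped)
print("mean with / without:", mean(READINGS), mean(kept))
print("median with / without:", median(READINGS), median(kept))
```

The mean moves a lot when the extreme value is dropped while the median barely changes, which is a quick hint that a handful of values is dominating the picture.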
In the second graph, we can see that Kinorhyncha is orders of magnitude smaller than Acanthocephala. Yet here we can see that it has a significantly higher count when compared to its two lower neighbors.
If you follow the link to my Jupyter notebook, you can see where I experiment with the graphs. First I used the default vertical orientation for the bar graph.
Considering the length of these obscure Latin names, it was just simpler to flip the graph horizontally. Otherwise I would have had to tinker with the angles and sizes of the fonts to make the vertical version readable to the end user.
It was also nice to have both graphs in one file instead of two. This matters because the intent of the first graph is to show that Nematodes are very diverse compared to their categorical neighbors, so much so that on its own it would not be clear to the reader how the other groups compare with each other. The other groups have interesting relationships, and now the reader should have an easy time switching between your text and visualization.
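A rough sketch of the two-panel, horizontal layout described above might look like the following. It assumes matplotlib is installed; the species counts are placeholders I made up (they are not the textbook figures), "Phylum A" and "Phylum B" are stand-in names, and the helper names are hypothetical rather than taken from the notebook.

```python
# Placeholder counts for illustration only, NOT the values from the source table.
COUNTS = {
    "Nematoda": 12000,
    "Acanthocephala": 500,
    "Kinorhyncha": 100,
    "Phylum A": 10,
    "Phylum B": 20,
}

def split_outlier(counts, outlier):
    """Return (full, trimmed) dicts sorted by count; trimmed omits the outlier."""
    full = dict(sorted(counts.items(), key=lambda item: item[1]))
    trimmed = {name: n for name, n in full.items() if name != outlier}
    return full, trimmed

def plot_side_by_side(counts, outlier, path):
    """Save one image holding both horizontal bar charts, side by side."""
    # Imported here so split_outlier still works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render straight to a file, no window needed
    import matplotlib.pyplot as plt

    full, trimmed = split_outlier(counts, outlier)
    fig, (ax_all, ax_rest) = plt.subplots(1, 2, figsize=(10, 4))
    # barh puts the long names on the y-axis, where they read without rotation.
    ax_all.barh(list(full), list(full.values()))
    ax_all.set_title("All groups")
    ax_rest.barh(list(trimmed), list(trimmed.values()))
    ax_rest.set_title(f"Without {outlier}")
    for ax in (ax_all, ax_rest):
        ax.set_xlabel("Species count")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
```

Calling plot_side_by_side(COUNTS, "Nematoda", "pseudocoelomates.png") would write both panels into a single image file; the file name is only an example.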
The notebook contains another example that experiments with donation amounts. Again you can see that one high-level donor throws off the mean. Most values were less than $100, though, so we use some binning techniques to zoom further into this range and get a clearer graph of regression-type data.
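As a small illustration of the binning idea, the sketch below counts donations under $100 in $20-wide bins. The amounts are invented for this example, and the bin width and the cutoff are my own assumptions; the notebook may well use a different library or different bins.

```python
from collections import Counter
from statistics import mean

# Hypothetical donation amounts in dollars, with one very large donor.
DONATIONS = [5, 10, 10, 25, 25, 40, 50, 75, 90, 5000]

def bin_counts(amounts, width, upper):
    """Count amounts below `upper` in bins `width` dollars wide, keyed by bin start."""
    counts = Counter(int(a // width) * width for a in amounts if a < upper)
    # Include empty bins so the result covers the whole range evenly.
    return {start: counts.get(start, 0) for start in range(0, upper, width)}

small = [a for a in DONATIONS if a < 100]
print("mean of all donations:", mean(DONATIONS))
print("mean under $100:", round(mean(small), 2))
print("bins:", bin_counts(DONATIONS, width=20, upper=100))
```

The dictionary of bin counts can be handed directly to a bar chart, which gives the zoomed-in view of the sub-$100 range.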
I also include an example of how to use Jupyter to write a file that can be launched from the command line without needing Jupyter. Maybe you want your app to fetch fresh data for you in the morning; this would be the next step toward making that happen. The file can easily be triggered by cron on a daily schedule, for example.
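Here is a sketch of what such a standalone file could contain. In a notebook, one way to save a cell to disk is to put the %%writefile cell magic followed by a file name on the cell's first line; whether the notebook does it that way is my assumption. The file name fetch_data.py, the one-number-per-line data format, and the function names are all hypothetical.

```python
import sys
from statistics import mean, median
from urllib.request import urlopen

def fetch_text(url):
    """Download the resource at `url` and return it as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")

def summarize(text):
    """Parse one number per line (blank lines ignored) and return basic statistics."""
    values = [float(line) for line in text.splitlines() if line.strip()]
    return {"count": len(values), "mean": mean(values), "median": median(values)}

def main(argv=None):
    """Fetch the URL given on the command line and print a summary."""
    args = sys.argv[1:] if argv is None else argv
    if len(args) != 1:
        print("usage: fetch_data.py URL")
        return
    print(summarize(fetch_text(args[0])))

if __name__ == "__main__":
    main()
```

A crontab entry along the lines of `0 7 * * * /usr/bin/python3 /path/to/fetch_data.py URL` would then run it every morning at 7:00; the paths and the URL are placeholders to fill in.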
Further explorations: This data set is super small, so I dropped the values into a text editor to generate the lists and dictionary values for this example. What happens when you change the object from something tiny to a huge data file that you pipe in? To create the ‘without’ Nematode set, you would need to create an iterator that cycles through the first set and either copies only the values you want or drops the unwanted ones.
Once you have the data captured, you can see that we quickly get a professional image ready to send to your team. Follow my GitLab link to explore some more examples of similar topics.
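For the larger-file case, a generator is one way to build that iterator without loading everything into memory. This is only a sketch: the "name,count" line format and the function names are assumptions I made for the example.

```python
def parse(lines):
    """Turn 'name,count' text lines into (name, int) pairs, one at a time."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            name, count = line.split(",")
            yield name, int(count)

def without(rows, unwanted):
    """Yield rows lazily, dropping any whose name matches `unwanted`."""
    for name, count in rows:
        if name != unwanted:
            yield name, count

# Tiny in-memory stand-in for a piped file; the counts are placeholders.
sample = ["Nematoda,12000\n", "Kinorhyncha,100\n", "\n", "Acanthocephala,500\n"]
for name, count in without(parse(sample), "Nematoda"):
    print(name, count)
```

Because both functions are generators, passing sys.stdin in place of the sample list would process a piped file one line at a time instead of holding it all in memory.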
Synopsis of the Phyla of Metazoa
Source: Zoology, by Dorit, Walker, and Barnes
ISBN 0-03-030504-7, Chapter 23, p. 555