{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Data Visualization\n",
    "\n",
    "- tools:\n",
    "    - `seaborn` - generating plots\n",
    "    - `pandas` - wrangling data\n",
    "    - `matplotlib` - fine-tuning plots\n",
    "- plotting\n",
    "    - quantitative data\n",
    "    - categorical data\n",
    "- customizing visualizations\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/COGS108/Lectures-Fa22/blob/main/03_Ethics/03_03_dataviz.ipynb)\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "For more information on this topic, check out: (1) Jake VanderPlas' <a href=\"https://github.com/jakevdp/PythonDataScienceHandbook\" class=\"alert-link\">Python Data Science Handbook</a> and (2) Berkeley's <a href=\"https://www.textbook.ds100.org/ch/06/viz_intro.html\" class=\"alert-link\">Data 100 Textbook</a>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "A good data visualization can help you:\n",
    "- identify anomalies in your data\n",
    "- better understand your own data\n",
    "- communicate your findings\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Quick Introduction\n",
    "95%+ of plots fall into just a few types:\n",
    "- single variable \n",
    "    - continuous\n",
    "    - discrete\n",
    "- discrete vs discrete\n",
    "- discrete vs continuous\n",
    "- continuous vs continuous\n",
    "\n",
    "## Basic Visualizations\n",
    "\n",
    "- histograms\n",
    "- densityplots\n",
    "- scatterplot\n",
    "- barplot\n",
    "    - grouped barplot\n",
    "    - stacked barplot\n",
    "- boxplot (and related things like violinplots, etc)\n",
    "- line plot\n",
    "\n",
    "## Variable types : Plots\n",
    "- statistical/distribution of quantitative variable\n",
    "    - single variable\n",
    "        - histogram\n",
    "        - densityplot\n",
    "    - single variable x categorical variable\n",
    "        - boxplot\n",
    "- count data \n",
    "    - count data x categorical variable\n",
    "        - barplot\n",
    "    - count data x 2 categorical variables\n",
    "        - grouped bar plot\n",
    "        - stacked bar plot\n",
    "- Directly view quantitative variables\n",
    "    - one variable x time\n",
    "        - line plot\n",
    "    - one variable x time x categorical variable\n",
    "        - multiple lines on the same plot\n",
    "    - two (or maybe 3) quantitative variables \n",
    "        - scatter plot\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/histogram.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/densityplot.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/scatterplot.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/barplot.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/grouped_barplot.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/stacked_barplot.png)\n",
    "\n",
    "Source: [Storytelling with Data (Nussbaumer Knaflic)](http://www.storytellingwithdata.com/books)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/boxplot.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/line_plot.png) \n",
    "\n",
    "Source: [Storytelling with Data (Nussbaumer Knaflic)](http://www.storytellingwithdata.com/books)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "####  Question #1  \n",
    "\n",
    "You want to visualize how many people in your dataset prefer chocolate chip cookies and how many prefer oatmeal raisin cookies.\n",
    "\n",
    "**What type of visualization would be most appropriate?**\n",
    "\n",
    "- A) histogram\n",
    "- B) scatterplot\n",
    "- C) barplot\n",
    "- D) boxplot\n",
    "- E) line plot\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "####  Question #2\n",
    "\n",
    "You're interested in visualizing how many servings of milk an individual drinks each day among those who prefer chocolate chip cookies and those who prefer oatmeal raisin cookies.\n",
    "\n",
    "**What type of visualization would be most appropriate?**\n",
    "\n",
    "- A) histogram\n",
    "- B) scatterplot\n",
    "- C) barplot\n",
    "- D) boxplot\n",
    "- E) line plot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "####  Question #3\n",
    "\n",
    "You're interested in visualizing how many servings of milk an individual drinks each year over the course of their life.\n",
    "\n",
    "**What type of visualization would be most appropriate?**\n",
    "\n",
    "- A) histogram\n",
    "- B) scatterplot\n",
    "- C) barplot\n",
    "- D) boxplot\n",
    "- E) line plot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Plotting in Python: Getting Started\n",
    "\n",
    "First we'll import the libraries we'll use for plotting. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# import libraries for working with data\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# import seaborn\n",
    "import seaborn as sns\n",
    "\n",
    "# import matplotlib\n",
    "import matplotlib.pyplot as plt # typical way of importing matplotlib\n",
    "import matplotlib as mpl # this import is used less frequently\n",
    "\n",
    "# improve resolution\n",
    "# comment out this line if it causes an error on your machine/screen\n",
    "%config InlineBackend.figure_format ='retina'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# `seaborn` \n",
    "\n",
    "`seaborn` is a great place to get started when generating plots that don't look awful."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Class Data\n",
    "With the libraries we need imported, the first dataset we'll use today is data from the COGS 108 class survey from the Spring of 2019."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv('https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/data/df_for_viz.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "df.head()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Wrangling that's been done:\n",
    "- removed lots of identifying information\n",
    "- standardized gender & job\n",
    "- separated out programming responses"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Quantitative Variables\n",
    "\n",
    "- histograms\n",
    "- densityplots\n",
    "- scatterplots\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Histograms and Densityplots\n",
    "\n",
    "__Histograms__ & __Densityplots__ are helpful for visualizing information about a _single quantitative variable_.\n",
    "\n",
    "We can use seaborn's `histplot` function. (Versions of `seaborn` before 0.11 used `distplot` instead.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# set plotting size parameter\n",
    "plt.rcParams['figure.figsize'] = (17, 7) #default plot size to output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.set_theme(context='notebook',style='white',font_scale=2,rc={'axes.spines.right': False,'axes.spines.top': False} )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# histogram\n",
    "#`distplot` in older versions of `seaborn`\n",
    "sns.histplot(df['statistics'], bins=10, kde=False);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "One thing to note about histograms is that the number of bins displayed plays a large role in what the viewer takes away from the visualization."
   ]
  },
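  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "A minimal sketch of how the bin count changes the picture, using synthetic data rather than the class survey. The random values, the seed, and the three bin counts are assumptions chosen only for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "# illustrative sketch: synthetic data, not the class survey\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# hypothetical sample: 500 draws from a normal distribution\n",
    "rng = np.random.default_rng(0)\n",
    "values = rng.normal(loc=5, scale=2, size=500)\n",
    "\n",
    "# same data, three bin counts, side by side\n",
    "fig, axes = plt.subplots(1, 3, sharex=True)\n",
    "for ax, n_bins in zip(axes, [5, 20, 80]):\n",
    "    sns.histplot(x=values, bins=n_bins, color='#686868', ax=ax)\n",
    "    ax.set_title(f'{n_bins} bins')"
   ]
  },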
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# `distplot` in older versions of `seaborn`\n",
    "# histogram only: `kde` defaults to False in `histplot`\n",
    "sns.histplot(df['statistics'], bins=20);\n",
    "\n",
    "# Alternative approach using pandas\n",
    "# df['statistics'].hist(bins=20);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "This doesn't follow \"visualization best practices.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Visualization Best Practices\n",
    "\n",
    "- Choose the right type of visualization\n",
    "- Be mindful when choosing colors\n",
    "- Label your axes\n",
    "- Make text big enough\n",
    "- Keep it simple\n",
    "- Less is more: \n",
    "    - Aim to improve your data:ink ratio\n",
    "    - Everything on the page should serve a purpose. If it doesn't, remove it.\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Best Practices: Example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "![](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/original.png)\n",
    "\n",
    "Source: [Storytelling with Data (Nussbaumer Knaflic)](http://www.storytellingwithdata.com/books)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "Ideas:\n",
    "\n",
    "- Pros:\n",
    "    - consistent colors from left to right\n",
    "    - values provided for each slice\n",
    "    - overall picture\n",
    "- Cons:\n",
    "    - text size\n",
    "    - legend not ideal\n",
    "    - colors are not intuitive\n",
    "    - pie chart not ideal b/c of # of categories\n",
    "\n",
    "Suggestions: \n",
    "- different visualization - stacked barplot?\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "#### Clicker Question #4\n",
    "\n",
    "Consider some positive and some negative aspects of this visualization. Click in when you have finished thinking.\n",
    "\n",
    "- A) I have some ideas!\n",
    "- B) I've got no ideas.\n",
    "- C) I'm not sure what I'm supposed to be thinking about."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "![](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/improvement1.png)\n",
    "\n",
    "Source: [Storytelling with Data (Nussbaumer Knaflic)](http://www.storytellingwithdata.com/books)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "![](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/improvement2.png)\n",
    "\n",
    "Source: [Storytelling with Data (Nussbaumer Knaflic)](http://www.storytellingwithdata.com/books)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "![](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/improvement3.png)\n",
    "\n",
    "Source: [Storytelling with Data (Nussbaumer Knaflic)](http://www.storytellingwithdata.com/books)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "\n",
    "### Less is more\n",
    "\n",
    "The *less is more* approach suggests that we should probably get rid of this background color now and remove the gridlines. We'll use the _less is more_ approach as we work through the other types of visualizations.\n",
    "\n",
    "Let's improve that now for our original plot..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# `distplot` in older versions of `seaborn`\n",
    "# change color to dark grey\n",
    "ax = sns.histplot(df['statistics'], kde=False, \n",
    "                  bins=10, color='#686868')\n",
    "\n",
    "# remove the top and right lines\n",
    "sns.despine()\n",
    "\n",
    "# add title and axis labels (modify x-axis label)\n",
    "ax.set_title('Most COGS108 students are moderately comfortable with statistics')\n",
    "ax.set_ylabel('Count')\n",
    "ax.set_xlabel('How comfortable are you with statistics?');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# kdeplot to only display the densityplot\n",
    "ax = sns.kdeplot(df['programming'], color='#686868')\n",
    "\n",
    "# remove the top and right lines\n",
    "sns.despine()\n",
    "\n",
    "# add title and axis labels (modify x-axis label)\n",
    "ax.set_title('Most COGS108 students are pretty comfortable with programming')\n",
    "ax.set_ylabel('Density')  # a KDE's y-axis is a density, not a count\n",
    "ax.set_xlabel('How comfortable are you with programming?');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Scatterplots\n",
    "\n",
    "Scatterplots can help visualize the relationship between __two quantitative variables__."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "sns.scatterplot(x='programming', y='statistics', data=df, \n",
    "                # alpha=0.1 # comment this in and out\n",
    "               );\n",
    "\n",
    "# alternative with pandas\n",
    "# df.plot.scatter('programming', 'statistics');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# jitter points to see relationship, try different levels of it\n",
    "sns.lmplot(x='programming', y='statistics', data=df,\n",
    "           fit_reg=False, height=6, aspect=2,\n",
    "          x_jitter=.15, y_jitter=.15);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fit a linear model, showing the line of best fit \n",
    "# and also 95% confidence interval on the fit\n",
    "sns.lmplot(x='programming', y='statistics', data=df,\n",
    "           fit_reg=True, height=6, aspect=2,\n",
    "          x_jitter=.20, y_jitter=.20);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "####  Question #5\n",
    "\n",
    "What can we say about the relationship between students' comfortability with programming and statistics?\n",
    "\n",
    "- A) Students who are more comfortable programming are more comfortable with statistics\n",
    "- B) Students who are more comfortable programming are less comfortable with statistics\n",
    "- C) There is little relationship between students' comfort level with programming and statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Scatterplots (by a categorical variable)\n",
    "\n",
    "When you want to plot two numeric variables but want to get some insight about a *third* categorical variable, you can color the points on the plot by the categorical variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# control color palette\n",
    "# `Series.append` was removed in pandas 2.0, so combine the columns with `pd.concat`\n",
    "unique = pd.concat([df[\"lecture_attendance\"], df[\"gender\"]]).unique()\n",
    "\n",
    "# map each category to a color; reserve black for \"Total\"\n",
    "palette = dict(zip(unique, sns.color_palette()))\n",
    "palette.update({\"Total\":\"k\"})\n",
    "print(palette)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# control color palette\n",
    "unique = pd.concat([df[\"lecture_attendance\"], df[\"gender\"]]).unique()\n",
    "palette = dict(zip(unique, sns.color_palette()))\n",
    "palette.update({\"Total\":\"k\"})\n",
    "\n",
    "# color points by gender\n",
    "sns.lmplot(x='programming', y='statistics', data=df, hue='gender',\n",
    "           fit_reg=True, height=6, aspect=2, \n",
    "           x_jitter=.5, y_jitter=.5,\n",
    "           palette=palette);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "\n",
    "sns.lmplot(x='programming', y='statistics', data=df, hue='lecture_attendance',\n",
    "           fit_reg=True, height=6, aspect=2, \n",
    "           x_jitter=.5, y_jitter=.5,\n",
    "           palette=palette);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "#### Clicker Question #6\n",
    "\n",
    "What can we say about the relationship between students' comfortability with programming and statistics and gender? And, how easy is this to determine?\n",
    "\n",
    "- A) Females and Other/Prefer not to say tend to be more comfortable with programming; easy to determine\n",
    "- B) Females and Other/Prefer not to say tend to be more comfortable with programming; difficult to determine\n",
    "- C) Males tend to be more comfortable with programming; easy to determine\n",
    "- D) Males tend to be more comfortable with programming; difficult to determine\n",
    "- E) I'm super lost."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "We don't get a _ton_ more information here, but we may see a slight shift toward higher programming comfortability among males relative to females. To better understand this, a boxplot would be helpful. (We'll look at this shortly.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Categorical Variables\n",
    "\n",
    "- barplots\n",
    "- grouped barplots\n",
    "- stacked barplots"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Barplots"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "In `seaborn` there are two types of bar charts:\n",
    "1. `countplot` - counts the number of times each category appears in a column\n",
    "2. `barplot` - groups the dataframe by a categorical column and sets the height of each bar to the average of a numerical column within each group (This is usually not the right way to visualize quantitative data, so we're not covering it in this class.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# generate default barplot\n",
    "sns.countplot(x='lecture_attendance', \n",
    "              data=df);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "ax = sns.countplot(x='lecture_attendance', \n",
    "                   data=df, color = '#686868')\n",
    "\n",
    "# add title and axis labels (modify x-axis label)\n",
    "ax.set_title('Most COGS108 students prefer to attend lecture')\n",
    "ax.set_ylabel('Count')\n",
    "ax.set_xlabel('Lecture Attendance Preference')\n",
    "# set tick labels\n",
    "ax.set_xticklabels((\"attend\", \"not attend\"));"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "ax = sns.countplot(x='gender', data=df, color='#686868')\n",
    "\n",
    "# add title and axis labels (modify x-axis label)\n",
    "ax.set_title('There are more males than females in COGS108')\n",
    "ax.set_ylabel('Count') \n",
    "ax.set_xlabel('Gender');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "It's often a good idea to order axes from largest to smallest for categorical data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "ax = sns.countplot(x='gender', data=df, color = '#686868',\n",
    "             order=['male', 'female', 'other or prefer not to say'])\n",
    "\n",
    "# add title and axis labels (modify x-axis label)\n",
    "ax.set_title('Male is the most prevalent gender in COGS108.')\n",
    "ax.set_ylabel('Count')\n",
    "ax.set_xlabel('Gender');"
   ]
  },
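  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Rather than typing the order by hand, one option is to derive it from the counts. A minimal sketch, assuming a small made-up `toy` dataframe (not the class data):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "# illustrative sketch: `toy` and its `pet` column are hypothetical\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "\n",
    "toy = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', 'dog', 'cat']})\n",
    "\n",
    "# value_counts() sorts from most to least frequent by default\n",
    "order = toy['pet'].value_counts().index.tolist()\n",
    "\n",
    "# pass the computed order so bars run from largest to smallest\n",
    "ax = sns.countplot(x='pet', data=toy, color='#686868', order=order)\n",
    "ax.set_ylabel('Count');"
   ]
  },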
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# warning: not seaborn\n",
    "# pandas approach\n",
    "# proportion of the class familiar with each programming language\n",
    "a = df.iloc[:,5:11].sum()/len(df)\n",
    "a = a.sort_values(axis=0, ascending=False)\n",
    "a.plot.bar(color='#686868', rot=0);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Grouped Barplots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# same color palette as defined earlier\n",
    "# generate grouped barplot by specifying hue\n",
    "ax = sns.countplot(x='lecture_attendance', hue='gender',\n",
    "                   data=df, palette=palette)\n",
    "\n",
    "# add title and axis labels (modify x-axis label)\n",
    "ax.set_title('Most COGS108 students prefer to attend lecture')\n",
    "ax.set_ylabel('Count')\n",
    "ax.set_xlabel('Lecture Attendance Preference')\n",
    "ax.set_xticklabels(('attend', 'not attend'));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Because we have different numbers of males and females, comparing counts is not all that helpful... \n",
    "\n",
    "We need proportions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Stacked Barplots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# warning: this is not seaborn\n",
    "df2 = df.groupby([ 'lecture_attendance','gender'])['lecture_attendance'].count().unstack('gender').fillna(0)\n",
    "sub_df2 = np.transpose(df2.div(df2.sum()))\n",
    "\n",
    "# generate plot\n",
    "ax = sub_df2.plot(kind='bar', stacked=True, rot=0,\n",
    "                  title='Lecture Attendance does not appear to differ by gender')\n",
    "\n",
    "# customize plot\n",
    "ax.legend(('not attend','attend'), loc='center left', bbox_to_anchor=(1.0, 0.5))\n",
    "ax.set_ylabel(\"Proportion of students\");"
   ]
  },
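  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "One alternative way to get per-group proportions is `pd.crosstab` with `normalize='index'`, which divides each row by its own total. The sketch below uses a made-up `toy` dataframe that borrows the column names from the class data; the values are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "# illustrative sketch: hypothetical values, not the class survey\n",
    "import pandas as pd\n",
    "\n",
    "toy = pd.DataFrame({\n",
    "    'gender': ['female', 'female', 'male', 'male', 'male', 'male'],\n",
    "    'lecture_attendance': ['attend', 'not attend', 'attend',\n",
    "                           'attend', 'attend', 'not attend'],\n",
    "})\n",
    "\n",
    "# rows = gender, columns = attendance; each row sums to 1\n",
    "props = pd.crosstab(toy['gender'], toy['lecture_attendance'],\n",
    "                    normalize='index')\n",
    "\n",
    "# one stacked bar per gender\n",
    "ax = props.plot(kind='bar', stacked=True, rot=0)\n",
    "ax.set_ylabel('Proportion of students');"
   ]
  },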
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# More plots\n",
    "\n",
    "- boxplots (quantitative + categorical)\n",
    "- lineplots (quantitative over time)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Boxplots\n",
    "\n",
    "By default, the box spans the 25th to the 75th percentile. The line through the middle of the box represents the median. \"Whiskers\" extend to show the range for the rest of the data, excluding outliers. Outliers are marked as individual points outside of the whiskers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# generate boxplots\n",
    "sns.boxplot(y='statistics', x='gender', data=df);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "## Outlier determination\n",
    "\n",
    "Outliers show up as individual points on boxplots. But, we don't see any on this boxplot. Let's see why..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# determine the 25th and 75th percentiles\n",
    "lower, upper = np.percentile(df['statistics'], [25, 75])\n",
    "lower, upper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# calculate IQR\n",
    "iqr = upper - lower\n",
    "iqr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Typically, the inter-quartile range (IQR) is used to determine which values get marked as outliers. The IQR is: 75th percentile - 25th percentile. Values more than 1.5 x IQR above the 75th percentile or below the 25th percentile are marked as outliers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# calculate lower cutoff\n",
    "# values below this are outliers \n",
    "lower_cutoff = lower - 1.5 * iqr\n",
    "\n",
    "# calculate upper cutoff\n",
    "# values above this are outliers \n",
    "upper_cutoff = upper + 1.5 * iqr\n",
    "\n",
    "lower_cutoff, upper_cutoff"
   ]
  },
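  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "The cutoff rule above can be wrapped in a small helper that returns the flagged values. This is a sketch on made-up numbers; `iqr_outliers` and `toy` are hypothetical names, not part of the lecture's code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "# illustrative sketch: hypothetical helper and data\n",
    "import numpy as np\n",
    "\n",
    "def iqr_outliers(values, k=1.5):\n",
    "    \"\"\"Return the values lying more than k * IQR outside the quartiles.\"\"\"\n",
    "    values = np.asarray(values, dtype=float)\n",
    "    q1, q3 = np.percentile(values, [25, 75])\n",
    "    iqr = q3 - q1\n",
    "    # flag anything below the lower cutoff or above the upper cutoff\n",
    "    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)\n",
    "    return values[mask]\n",
    "\n",
    "toy = [4, 5, 5, 6, 6, 6, 7, 7, 8, 20]\n",
    "iqr_outliers(toy)  # only the value 20 falls outside the cutoffs"
   ]
  },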
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Boxplots really shine when you want to look at the range of typical values for a quantitative variable, _broken down by a separate categorical variable_."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# generate boxplots\n",
    "# start with seaborn's default colors\n",
    "ax = sns.boxplot(x='gender', y='statistics', data=df)\n",
    "\n",
    "ax.set_title('Gender not related to comfort with statistics')\n",
    "ax.set_ylabel('Comfort with Statistics')\n",
    "ax.set_xlabel('Gender');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# generate boxplots\n",
    "# we can make sure the colors match what we used earlier for the same groups\n",
    "ax = sns.boxplot(x='gender', y='statistics', data=df, palette=palette)\n",
    "\n",
    "ax.set_title('Gender not related to comfort with statistics')\n",
    "ax.set_ylabel('Comfort with Statistics')\n",
    "ax.set_xlabel('Gender');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "Much better! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Histograms (by a categorical variable)\n",
    "\n",
    "The same data plotted as a histogram are not so easily interpretable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# `distplot` in older versions of `seaborn`\n",
    "sns.histplot(df.loc[df['gender'] == 'female', 'statistics'], kde=True, color=\"red\")\n",
    "sns.histplot(df.loc[df['gender'] == 'male', 'statistics'], kde=True, color=\"purple\")\n",
    "sns.histplot(df.loc[df['gender'] == 'other or prefer not to say', 'statistics'], kde=True);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Customization: `births` data\n",
    "\n",
    "Now that we're getting the hang of this, let's see how complicated things can get. We'll return to using a line chart to look at birth patterns over time. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# get the data\n",
    "births = pd.read_csv('https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/data/births.csv')\n",
    "births.head()\n",
    "births.year.max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "from datetime import datetime\n",
    "\n",
    "# calculate values & wrangle\n",
    "quartiles = np.percentile(births['births'], [25, 50, 75])\n",
    "mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])\n",
    "births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')\n",
    "\n",
    "births['day'] = births['day'].astype(int)\n",
    "\n",
    "births.index = pd.to_datetime(10000 * births.year +\n",
    "                              100 * births.month +\n",
    "                              births.day, format='%Y%m%d')\n",
    "births_by_date = births.pivot_table('births',\n",
    "                                    [births.index.month, births.index.day])\n",
    "births_by_date.index = [datetime(2012, month, day)\n",
    "                        for (month, day) in births_by_date.index]\n",
    "\n",
    "\n",
    "# plot the thing\n",
    "fig, ax = plt.subplots(figsize=(22, 5))\n",
    "births_by_date.plot(ax = ax)\n",
    "ax.get_legend().remove()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "What are all those dips? Well, let's annotate the plot to get a better sense of what's going on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# plot the thing\n",
    "fig, ax = plt.subplots(figsize=(22, 7))\n",
    "births_by_date.plot(ax=ax)\n",
    "ax.get_legend().remove();\n",
    "\n",
    "# define style\n",
    "style = dict(size=16, color='gray')\n",
    "\n",
    "# add annotation\n",
    "ax.text('2012-1-1', 3950, \"New Year's Day\", **style)\n",
    "ax.text('2012-7-4', 4250, \"Independence Day\", ha='center', **style)\n",
    "ax.text('2012-9-4', 4850, \"Labor Day\", ha='center', **style)\n",
    "ax.text('2012-10-31', 4600, \"Halloween\", ha='right', **style)\n",
    "ax.text('2012-11-25', 4450, \"Thanksgiving\", ha='center', **style)\n",
    "ax.text('2012-12-25', 3850, \"Christmas \", ha='right', **style)\n",
    "\n",
    "# label the axes\n",
    "ax.set(title='USA births by day of year (1969-1988)',\n",
    "       ylabel='average daily births')\n",
    "\n",
    "# format the x axis with centered month labels\n",
    "ax.xaxis.set_major_locator(mpl.dates.MonthLocator())\n",
    "ax.xaxis.set_minor_locator(mpl.dates.MonthLocator(bymonthday=15))\n",
    "ax.xaxis.set_major_formatter(plt.NullFormatter())\n",
    "ax.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%h'));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Annotation directly on plots can help explain the plot to viewers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# Saving Plots\n",
    "\n",
    "While we're using a Jupyter notebook right now, you won't always be. So, you'll need to know how to save figures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "# save fig to plots directory\n",
    "# this will only work if you have \n",
    "# a plots directory in your working directory\n",
    "fig.savefig('images/my_figure.png',dpi=300)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Note that the file format is inferred from the extension you specify in the filename. \n",
    "\n",
    "To see which file types are supported:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "fig.canvas.get_supported_filetypes()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# Viewing Saved Plots\n",
    "\n",
    "Once a plot is saved, it may be helpful to view it through IPython or your notebook. To do so, you'd use the following:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Can import with Markdown formatting... (or with HTML in a markdown cell)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "![dates figure](https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/my_figure.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "# to see contents of a saved image\n",
    "from IPython.display import Image\n",
    "Image('https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/my_figure.png')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  },
  "rise": {
   "scroll": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
