Lecture 3.3

1. Data Collection

!pip install beautifulsoup4

1.1 Requests

import requests
URL = "https://csc380.beingenfa.com/Syllabus/Key_Info.html"
r = requests.get(URL)
r.status_code
200
r.text
'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head>\n\n<meta charset="utf-8">\n<meta name="generator" content="quarto-1.3.361">\n\n<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">\n\n\n<title>CSC 380 – key_info</title>\n<style>\ncode{white-space: pre-wrap;}\nspan.smallcaps{font-variant: small-caps;}\ndiv.columns{display: flex; gap: min(4vw, 1.5em);}\ndiv.column{flex: auto; overflow-x: auto;}\ndiv.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}\nul.task-list{list-style: none;}\nul.task-list li input[type="checkbox"] {\n  width: 0.8em;\n  margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */ \n  vertical-align: middle;\n}\n</style>\n\n\n<script src="../site_libs/quarto-nav/quarto-nav.js"></script>\n<script src="../site_libs/quarto-nav/headroom.min.js"></script>\n<script src="../site_libs/clipboard/clipboard.min.js"></script>\n<script src="../site_libs/quarto-search/autocomplete.umd.js"></script>\n<script src="../site_libs/quarto-search/fuse.min.js"></script>\n<script src="../site_libs/quarto-search/quarto-search.js"></script>\n<meta name="quarto:offset" content="../">\n<script src="../site_libs/quarto-html/quarto.js"></script>\n<script src="../site_libs/quarto-html/popper.min.js"></script>\n<script src="../site_libs/quarto-html/tippy.umd.min.js"></script>\n<script src="../site_libs/quarto-html/anchor.min.js"></script>\n<link href="../site_libs/quarto-html/tippy.css" rel="stylesheet">\n<link href="../site_libs/quarto-html/quarto-syntax-highlighting.css" rel="stylesheet" id="quarto-text-highlighting-styles">\n<script src="../site_libs/bootstrap/bootstrap.min.js"></script>\n<link href="../site_libs/bootstrap/bootstrap-icons.css" rel="stylesheet">\n<link href="../site_libs/bootstrap/bootstrap.min.css" rel="stylesheet" id="quarto-bootstrap" data-mode="light">\n<script id="quarto-search-options" type="application/json">{\n  
"location": "sidebar",\n  "copy-button": false,\n  "collapse-after": 3,\n  "panel-placement": "start",\n  "type": "textbox",\n  "limit": 20,\n  "language": {\n    "search-no-results-text": "No results",\n    "search-matching-documents-text": "matching documents",\n    "search-copy-link-title": "Copy link to search",\n    "search-hide-matches-text": "Hide additional matches",\n    "search-more-match-text": "more match in this document",\n    "search-more-matches-text": "more matches in this document",\n    "search-clear-button-title": "Clear",\n    "search-detached-cancel-button-title": "Cancel",\n    "search-submit-button-title": "Submit",\n    "search-label": "Search"\n  }\n}</script>\n\n\n<link rel="stylesheet" href="../styles.css">\n</head>\n\n<body class="nav-sidebar docked">\n\n<div id="quarto-search-results"></div>\n  <header id="quarto-header" class="headroom fixed-top">\n  <nav class="quarto-secondary-nav">\n    <div class="container-fluid d-flex">\n      <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar,#quarto-sidebar-glass" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }">\n        <i class="bi bi-layout-text-sidebar-reverse"></i>\n      </button>\n      <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../Syllabus/Key_Info.html">Syllabus</a></li><li class="breadcrumb-item"><a href="../Syllabus/Key_Info.html">Key Info</a></li></ol></nav>\n      <a class="flex-grow-1" role="button" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar,#quarto-sidebar-glass" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }">      \n      </a>\n      <button type="button" class="btn quarto-search-button" 
aria-label="" onclick="window.quartoOpenSearch();">\n        <i class="bi bi-search"></i>\n      </button>\n    </div>\n  </nav>\n</header>\n<!-- content -->\n<div id="quarto-content" class="quarto-container page-columns page-rows-contents page-layout-article">\n<!-- sidebar -->\n  <nav id="quarto-sidebar" class="sidebar collapse collapse-horizontal sidebar-navigation docked overflow-auto">\n    <div class="pt-lg-2 mt-2 text-left sidebar-header">\n    <div class="sidebar-title mb-0 py-0">\n      <a href="../">CSC 380</a> \n    </div>\n      </div>\n        <div class="mt-2 flex-shrink-0 align-items-center">\n        <div class="sidebar-search">\n        <div id="quarto-search" class="" title="Search"></div>\n        </div>\n        </div>\n    <div class="sidebar-menu-container"> \n    <ul class="list-unstyled mt-1">\n        <li class="sidebar-item sidebar-item-section">\n      <div class="sidebar-item-container"> \n            <a class="sidebar-item-text sidebar-link text-start" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar-section-1" aria-expanded="true">\n <span class="menu-text">Course Content</span></a>\n          <a class="sidebar-item-toggle text-start" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar-section-1" aria-expanded="true" aria-label="Toggle section">\n            <i class="bi bi-chevron-right ms-2"></i>\n          </a> \n      </div>\n      <ul id="quarto-sidebar-section-1" class="collapse list-unstyled sidebar-section depth1 show">  \n          <li class="sidebar-item">\n  <div class="sidebar-item-container"> \n  <a href="../Course_Content/Week_1/home.html" class="sidebar-item-text sidebar-link">\n <span class="menu-text">Week 1</span></a>\n  </div>\n</li>\n          <li class="sidebar-item">\n  <div class="sidebar-item-container"> \n  <a href="../Course_Content/Week_2/home.html" class="sidebar-item-text sidebar-link">\n <span class="menu-text">Week 2</span></a>\n  </div>\n</li>\n          <li class="sidebar-item">\n  <div 
class="sidebar-item-container"> \n  <a href="../Course_Content/Week_3/home.html" class="sidebar-item-text sidebar-link">\n <span class="menu-text">Week 3</span></a>\n  </div>\n</li>\n          <li class="sidebar-item">\n  <div class="sidebar-item-container"> \n  <a href="../Course_Content/Week_4/home.html" class="sidebar-item-text sidebar-link">\n <span class="menu-text">Week 4</span></a>\n  </div>\n</li>\n      </ul>\n  </li>\n        <li class="sidebar-item sidebar-item-section">\n      <div class="sidebar-item-container"> \n            <a class="sidebar-item-text sidebar-link text-start" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar-section-2" aria-expanded="true">\n <span class="menu-text">Homework</span></a>\n          <a class="sidebar-item-toggle text-start" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar-section-2" aria-expanded="true" aria-label="Toggle section">\n            <i class="bi bi-chevron-right ms-2"></i>\n          </a> \n      </div>\n      <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show">  \n          <li class="sidebar-item">\n  <div class="sidebar-item-container"> \n  <a href="../Homework/HW1.html" class="sidebar-item-text sidebar-link">\n <span class="menu-text">HW1: Probability</span></a>\n  </div>\n</li>\n      </ul>\n  </li>\n        <li class="sidebar-item sidebar-item-section">\n      <div class="sidebar-item-container"> \n            <a class="sidebar-item-text sidebar-link text-start" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar-section-3" aria-expanded="true">\n <span class="menu-text">Ethics Discussions</span></a>\n          <a class="sidebar-item-toggle text-start" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar-section-3" aria-expanded="true" aria-label="Toggle section">\n            <i class="bi bi-chevron-right ms-2"></i>\n          </a> \n      </div>\n      <ul id="quarto-sidebar-section-3" class="collapse list-unstyled sidebar-section 
depth1 show">  \n          <li class="sidebar-item">\n  <div class="sidebar-item-container"> \n  <a href="../Ethics/Week_2.html" class="sidebar-item-text sidebar-link">\n <span class="menu-text">W2: Political Content</span></a>\n  </div>\n</li>\n          <li class="sidebar-item">\n  <div class="sidebar-item-container"> \n  <a href="../Ethics/Week_3.html" class="sidebar-item-text sidebar-link">\n <span class="menu-text">W3: Creative Work</span></a>\n  </div>\n</li>\n          <li class="sidebar-item">\n  <div class="sidebar-item-container"> \n  <a href="../Ethics/Week_4.html" class="sidebar-item-text sidebar-link">\n <span class="menu-text">W4: Mental Health Support</span></a>\n  </div>\n</li>\n      </ul>\n  </li>\n        <li class="sidebar-item sidebar-item-section">\n      <div class="sidebar-item-container"> \n            <a class="sidebar-item-text sidebar-link text-start" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar-section-4" aria-expanded="true">\n <span class="menu-text">Syllabus</span></a>\n          <a class="sidebar-item-toggle text-start" data-bs-toggle="collapse" data-bs-target="#quarto-sidebar-section-4" aria-expanded="true" aria-label="Toggle section">\n            <i class="bi bi-chevron-right ms-2"></i>\n          </a> \n      </div>\n      <ul id="quarto-sidebar-section-4" class="collapse list-unstyled sidebar-section depth1 show">  \n          <li class="sidebar-item">\n  <div class="sidebar-item-container"> \n  <a href="../Syllabus/Key_Info.html" class="sidebar-item-text sidebar-link active">\n <span class="menu-text">Key Info</span></a>\n  </div>\n</li>\n          <li class="sidebar-item">\n  <div class="sidebar-item-container"> \n  <a href="../Syllabus/Syllabus.html" class="sidebar-item-text sidebar-link">\n <span class="menu-text">Official Syllabus</span></a>\n  </div>\n</li>\n      </ul>\n  </li>\n    </ul>\n    </div>\n</nav>\n<div id="quarto-sidebar-glass" data-bs-toggle="collapse" 
data-bs-target="#quarto-sidebar,#quarto-sidebar-glass"></div>\n<!-- margin-sidebar -->\n    <div id="quarto-margin-sidebar" class="sidebar margin-sidebar">\n        <nav id="TOC" role="doc-toc" class="toc-active">\n    <h2 id="toc-title">On this page</h2>\n   \n  <ul>\n  <li><a href="#syllabus-key-info" id="toc-syllabus-key-info" class="nav-link active" data-scroll-target="#syllabus-key-info">Syllabus (Key Info)</a>\n  <ul class="collapse">\n  <li><a href="#description" id="toc-description" class="nav-link" data-scroll-target="#description">Description</a></li>\n  <li><a href="#course-objective" id="toc-course-objective" class="nav-link" data-scroll-target="#course-objective">Course Objective</a></li>\n  <li><a href="#expected-learning-outcomes" id="toc-expected-learning-outcomes" class="nav-link" data-scroll-target="#expected-learning-outcomes">Expected Learning Outcomes</a></li>\n  </ul></li>\n  </ul>\n</nav>\n    </div>\n<!-- main -->\n<main class="content" id="quarto-document-content">\n\n\n\n<section id="syllabus-key-info" class="level1">\n<h1>Syllabus (Key Info)</h1>\n<section id="description" class="level2">\n<h2 class="anchored" data-anchor-id="description">Description</h2>\n<p>The course introduces students to the principles of data science, which are essential for computer scientists to make effective decisions in their professional careers. In today’s data-driven world, a wide range of computer science sub-disciplines heavily rely on data collection, analysis, and interpretation. With the pervasive presence of artificial intelligence (AI) in our lives, understanding the basics of how these systems work is becoming increasingly important. 
Additionally, it covers the basics of artificial intelligence (AI) systems and examines practical use cases, current news, and ethical considerations through readings and discussions.</p>\n</section>\n<section id="course-objective" class="level2">\n<h2 class="anchored" data-anchor-id="course-objective">Course Objective</h2>\n<p>Course Objectives</p>\n<p>This course aims to introduce students to the principles and techniques of data science, enabling them to make effective decisions in their computer science careers. During this course, the student will,</p>\n<ul class="task-list">\n<li><p><input type="checkbox">Understand the fundamental concepts and principles of data science, including data collection, preprocessing, analysis, and interpretation.</p></li>\n<li><p><input type="checkbox">Apply data analysis and visualization techniques to derive insights from diverse datasets.</p></li>\n<li><p><input type="checkbox">Gain familiarity with machine learning algorithms and their practical applications.</p></li>\n<li><p><input type="checkbox">Develop proficiency in using data science tools and programming languages.</p></li>\n<li><p><input type="checkbox">Engage in critical thinking and problem-solving through project-based assignments.</p></li>\n<li><p><input type="checkbox">Explore the ethical considerations associated with data-driven decision-making.</p></li>\n<li><p><input type="checkbox">Stay informed about current trends and developments in data science and artificial intelligence.</p></li>\n</ul>\n</section>\n<section id="expected-learning-outcomes" class="level2">\n<h2 class="anchored" data-anchor-id="expected-learning-outcomes">Expected Learning Outcomes</h2>\n<p>A student who successfully completes this course will be able to:</p>\n<ul class="task-list">\n<li><p><input type="checkbox">Explain the difference between different measures of centrality and variability (means vs.&nbsp;medians, variance vs.&nbsp;interquartile range, etc.)</p></li>\n<li><p><input 
type="checkbox">Convert a raw data source into a version appropriate for downstream analysis using Python.</p></li>\n<li><p><input type="checkbox">Write appropriate visualizations for different sources and types of data.</p></li>\n<li><p><input type="checkbox">Explain why we seek to build machine learning models that generalize rather than memorize their input.</p></li>\n<li><p><input type="checkbox">Explain the different uses for training, validation, and testing datasets</p></li>\n<li><p><input type="checkbox">Select the appropriate evaluation measure for the dataset and task being solved</p></li>\n<li><p><input type="checkbox">Articulate the difference between supervised and unsupervised machine learning, as well as select the appropriate methodology for a given problem</p></li>\n<li><p><input type="checkbox">Demonstrate awareness of bias and ethics in data science.</p></li>\n</ul>\n\n\n</section>\n</section>\n\n</main> <!-- /main -->\n<script id="quarto-html-after-body" type="application/javascript">\nwindow.document.addEventListener("DOMContentLoaded", function (event) {\n  const toggleBodyColorMode = (bsSheetEl) => {\n    const mode = bsSheetEl.getAttribute("data-mode");\n    const bodyEl = window.document.querySelector("body");\n    if (mode === "dark") {\n      bodyEl.classList.add("quarto-dark");\n      bodyEl.classList.remove("quarto-light");\n    } else {\n      bodyEl.classList.add("quarto-light");\n      bodyEl.classList.remove("quarto-dark");\n    }\n  }\n  const toggleBodyColorPrimary = () => {\n    const bsSheetEl = window.document.querySelector("link#quarto-bootstrap");\n    if (bsSheetEl) {\n      toggleBodyColorMode(bsSheetEl);\n    }\n  }\n  toggleBodyColorPrimary();  \n  const icon = "\ue9cb";\n  const anchorJS = new window.AnchorJS();\n  anchorJS.options = {\n    placement: \'right\',\n    icon: icon\n  };\n  anchorJS.add(\'.anchored\');\n  const isCodeAnnotation = (el) => {\n    for (const clz of el.classList) {\n      if 
(clz.startsWith(\'code-annotation-\')) {                     \n        return true;\n      }\n    }\n    return false;\n  }\n  const clipboard = new window.ClipboardJS(\'.code-copy-button\', {\n    text: function(trigger) {\n      const codeEl = trigger.previousElementSibling.cloneNode(true);\n      for (const childEl of codeEl.children) {\n        if (isCodeAnnotation(childEl)) {\n          childEl.remove();\n        }\n      }\n      return codeEl.innerText;\n    }\n  });\n  clipboard.on(\'success\', function(e) {\n    // button target\n    const button = e.trigger;\n    // don\'t keep focus\n    button.blur();\n    // flash "checked"\n    button.classList.add(\'code-copy-button-checked\');\n    var currentTitle = button.getAttribute("title");\n    button.setAttribute("title", "Copied!");\n    let tooltip;\n    if (window.bootstrap) {\n      button.setAttribute("data-bs-toggle", "tooltip");\n      button.setAttribute("data-bs-placement", "left");\n      button.setAttribute("data-bs-title", "Copied!");\n      tooltip = new bootstrap.Tooltip(button, \n        { trigger: "manual", \n          customClass: "code-copy-button-tooltip",\n          offset: [0, -8]});\n      tooltip.show();    \n    }\n    setTimeout(function() {\n      if (tooltip) {\n        tooltip.hide();\n        button.removeAttribute("data-bs-title");\n        button.removeAttribute("data-bs-toggle");\n        button.removeAttribute("data-bs-placement");\n      }\n      button.setAttribute("title", currentTitle);\n      button.classList.remove(\'code-copy-button-checked\');\n    }, 1000);\n    // clear code selection\n    e.clearSelection();\n  });\n  function tippyHover(el, contentFn) {\n    const config = {\n      allowHTML: true,\n      content: contentFn,\n      maxWidth: 500,\n      delay: 100,\n      arrow: false,\n      appendTo: function(el) {\n          return el.parentElement;\n      },\n      interactive: true,\n      interactiveBorder: 10,\n      theme: \'quarto\',\n      placement: 
\'bottom-start\'\n    };\n    window.tippy(el, config); \n  }\n  const noterefs = window.document.querySelectorAll(\'a[role="doc-noteref"]\');\n  for (var i=0; i<noterefs.length; i++) {\n    const ref = noterefs[i];\n    tippyHover(ref, function() {\n      // use id or data attribute instead here\n      let href = ref.getAttribute(\'data-footnote-href\') || ref.getAttribute(\'href\');\n      try { href = new URL(href).hash; } catch {}\n      const id = href.replace(/^#\\/?/, "");\n      const note = window.document.getElementById(id);\n      return note.innerHTML;\n    });\n  }\n      let selectedAnnoteEl;\n      const selectorForAnnotation = ( cell, annotation) => {\n        let cellAttr = \'data-code-cell="\' + cell + \'"\';\n        let lineAttr = \'data-code-annotation="\' +  annotation + \'"\';\n        const selector = \'span[\' + cellAttr + \'][\' + lineAttr + \']\';\n        return selector;\n      }\n      const selectCodeLines = (annoteEl) => {\n        const doc = window.document;\n        const targetCell = annoteEl.getAttribute("data-target-cell");\n        const targetAnnotation = annoteEl.getAttribute("data-target-annotation");\n        const annoteSpan = window.document.querySelector(selectorForAnnotation(targetCell, targetAnnotation));\n        const lines = annoteSpan.getAttribute("data-code-lines").split(",");\n        const lineIds = lines.map((line) => {\n          return targetCell + "-" + line;\n        })\n        let top = null;\n        let height = null;\n        let parent = null;\n        if (lineIds.length > 0) {\n            //compute the position of the single el (top and bottom and make a div)\n            const el = window.document.getElementById(lineIds[0]);\n            top = el.offsetTop;\n            height = el.offsetHeight;\n            parent = el.parentElement.parentElement;\n          if (lineIds.length > 1) {\n            const lastEl = window.document.getElementById(lineIds[lineIds.length - 1]);\n            const bottom 
= lastEl.offsetTop + lastEl.offsetHeight;\n            height = bottom - top;\n          }\n          if (top !== null && height !== null && parent !== null) {\n            // cook up a div (if necessary) and position it \n            let div = window.document.getElementById("code-annotation-line-highlight");\n            if (div === null) {\n              div = window.document.createElement("div");\n              div.setAttribute("id", "code-annotation-line-highlight");\n              div.style.position = \'absolute\';\n              parent.appendChild(div);\n            }\n            div.style.top = top - 2 + "px";\n            div.style.height = height + 4 + "px";\n            let gutterDiv = window.document.getElementById("code-annotation-line-highlight-gutter");\n            if (gutterDiv === null) {\n              gutterDiv = window.document.createElement("div");\n              gutterDiv.setAttribute("id", "code-annotation-line-highlight-gutter");\n              gutterDiv.style.position = \'absolute\';\n              const codeCell = window.document.getElementById(targetCell);\n              const gutter = codeCell.querySelector(\'.code-annotation-gutter\');\n              gutter.appendChild(gutterDiv);\n            }\n            gutterDiv.style.top = top - 2 + "px";\n            gutterDiv.style.height = height + 4 + "px";\n          }\n          selectedAnnoteEl = annoteEl;\n        }\n      };\n      const unselectCodeLines = () => {\n        const elementsIds = ["code-annotation-line-highlight", "code-annotation-line-highlight-gutter"];\n        elementsIds.forEach((elId) => {\n          const div = window.document.getElementById(elId);\n          if (div) {\n            div.remove();\n          }\n        });\n        selectedAnnoteEl = undefined;\n      };\n      // Attach click handler to the DT\n      const annoteDls = window.document.querySelectorAll(\'dt[data-target-cell]\');\n      for (const annoteDlNode of annoteDls) {\n        
annoteDlNode.addEventListener(\'click\', (event) => {\n          const clickedEl = event.target;\n          if (clickedEl !== selectedAnnoteEl) {\n            unselectCodeLines();\n            const activeEl = window.document.querySelector(\'dt[data-target-cell].code-annotation-active\');\n            if (activeEl) {\n              activeEl.classList.remove(\'code-annotation-active\');\n            }\n            selectCodeLines(clickedEl);\n            clickedEl.classList.add(\'code-annotation-active\');\n          } else {\n            // Unselect the line\n            unselectCodeLines();\n            clickedEl.classList.remove(\'code-annotation-active\');\n          }\n        });\n      }\n  const findCites = (el) => {\n    const parentEl = el.parentElement;\n    if (parentEl) {\n      const cites = parentEl.dataset.cites;\n      if (cites) {\n        return {\n          el,\n          cites: cites.split(\' \')\n        };\n      } else {\n        return findCites(el.parentElement)\n      }\n    } else {\n      return undefined;\n    }\n  };\n  var bibliorefs = window.document.querySelectorAll(\'a[role="doc-biblioref"]\');\n  for (var i=0; i<bibliorefs.length; i++) {\n    const ref = bibliorefs[i];\n    const citeInfo = findCites(ref);\n    if (citeInfo) {\n      tippyHover(citeInfo.el, function() {\n        var popup = window.document.createElement(\'div\');\n        citeInfo.cites.forEach(function(cite) {\n          var citeDiv = window.document.createElement(\'div\');\n          citeDiv.classList.add(\'hanging-indent\');\n          citeDiv.classList.add(\'csl-entry\');\n          var biblioDiv = window.document.getElementById(\'ref-\' + cite);\n          if (biblioDiv) {\n            citeDiv.innerHTML = biblioDiv.innerHTML;\n          }\n          popup.appendChild(citeDiv);\n        });\n        return popup.innerHTML;\n      });\n    }\n  }\n});\n</script>\n</div> <!-- /content -->\n\n\n\n</body></html>'
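Checking `r.status_code` by eye works in a notebook, but a script usually wants the request to fail loudly. The sketch below wraps the `requests.get` call shown above; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the lecture.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and the default timeout are hypothetical choices
    made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so an error page is never parsed as if it were the real content.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(URL)` would then return the same string as `r.text`, or raise if the server responds with an error.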

1.2 BeautifulSoup

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify()) #Output cleared for web
print(soup.get_text()) #Output cleared for web
# List the heading tags to search for
heading_tags = ["h1", "h2", "h3"]
# find_all accepts a list of tag names and returns matches in document order
for tag in soup.find_all(heading_tags):
    print(tag.text.strip())
On this page
Syllabus (Key Info)
Description
Course Objective
Expected Learning Outcomes
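The same `find_all` pattern extends to other tags. As a minimal sketch, the block below pulls link text and targets out of a small stand-in HTML string (an assumption made so the example runs without network access; the variable names are illustrative).

```python
from bs4 import BeautifulSoup

# A small stand-in document, loosely modelled on the sidebar of the page above.
sample_html = """
<html><body>
  <h1>Syllabus (Key Info)</h1>
  <ul>
    <li><a href="../Syllabus/Key_Info.html">Key Info</a></li>
    <li><a href="../Syllabus/Syllabus.html">Official Syllabus</a></li>
  </ul>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
links = [
    (a.get_text(strip=True), a["href"])
    for a in sample_soup.find_all("a", href=True)
]
for text, href in links:
    print(text, "->", href)
```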

2. Data Processing

2.1 NumPy

NumPy adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays.

import numpy as np

ndarray object: an n-dimensional array of homogeneous data types, with many operations being performed in compiled code for performance

  • Fixed Size
  • Same type of data
  • Much more efficient mathematical operations than built-in data types such as lists.
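The properties listed above can be seen in a few lines. This is a small sketch with illustrative variable names: one vectorized expression replaces a Python loop, and mixing ints with a float forces every element to a single common dtype.

```python
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

# With a list, doubling every element needs an explicit loop or comprehension.
doubled_list = [v * 2 for v in values]
# With an ndarray, the same operation is one expression run in compiled code.
doubled_arr = arr * 2

# Homogeneous data: the ints are upcast so all elements share one dtype.
mixed = np.array([1, 2, 3.5])

print(doubled_list)
print(doubled_arr)
print(mixed.dtype)
```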

numpy.dtype

  • intc (same as a C integer) and intp (used for indexing)
  • int8, int16, int32, int64
  • uint8, uint16, uint32, uint64
  • float16, float32, float64
  • complex64, complex128
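The number in each dtype name is its width in bits, which determines both memory use and the range of values it can hold. A minimal sketch (variable names are illustrative):

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int8)
large = small.astype(np.int64)  # astype returns a converted copy

# itemsize is bytes per element; nbytes is the total for the whole array.
print(small.dtype, small.itemsize, small.nbytes)
print(large.dtype, large.itemsize, large.nbytes)

# np.iinfo reports the representable range of an integer dtype.
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)
```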

Create a numpy array

  • Conversion from other Python structures (e.g., lists, tuples)
  • Built-in NumPy array creation (e.g., arange, ones, zeros, etc.)
  • Reading arrays from a file.
np.array([2,3,1,0])
array([2, 3, 1, 0])
np.zeros((5, 5)) #np.zeros(shape)
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])
np.ones((6, 2))#np.ones(shape)
array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])
np.arange(15,5,-1) #Like range function in python
array([15, 14, 13, 12, 11, 10,  9,  8,  7,  6])
#Return evenly spaced numbers over a specified interval.
np.linspace(0, 100, 5) # numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)
array([  0.,  25.,  50.,  75., 100.])
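The third route listed above, reading an array from a file, is not shown in the cells here. One possible sketch uses `np.loadtxt`; an `io.StringIO` buffer stands in for a file on disk (an assumption made to keep the example self-contained), and a path string could be passed instead.

```python
import io

import numpy as np

# In-memory text standing in for the contents of a CSV file.
csv_text = "1,2,3\n4,5,6\n"

# loadtxt parses delimited text into an array; the default dtype is float.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",")

print(data)
print(data.shape, data.dtype)
```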
np.random.random() #Keeps changing
0.7026417115321625
random_obj = np.random.default_rng(seed=None) #default_rng is the recommended constructor for the random number class
random_obj.random() #changes if you do not give a seed
0.21864992820211548
random_obj = np.random.default_rng(seed=42) #default_rng is the recommended constructor for the random number class
random_obj.random()
0.7739560485559633
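The point of passing a seed is reproducibility: two generators built with the same seed produce the same sequence. A short sketch of that behaviour (generator names are illustrative):

```python
import numpy as np

rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

# Draw three values from each generator.
draws_a = rng_a.random(3)
draws_b = rng_b.random(3)

# Identical seeds give identical draws, so this prints True.
same = np.array_equal(draws_a, draws_b)
print(same)
```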
print("Original:\n",np.arange(9))
print("After using reshape:\n",np.arange(9).reshape(3,3))
Original:
 [0 1 2 3 4 5 6 7 8]
After using reshape:
 [[0 1 2]
 [3 4 5]
 [6 7 8]]
x = np.arange(2,10)
print(x)
x[-1]
[2 3 4 5 6 7 8 9]
9
x.shape = (1,3)
print(x)
x[-1] # next slide
ValueError: cannot reshape array of size 8 into shape (1,3)
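The error above occurs because a new shape must account for every element: the product of its dimensions has to equal the array's size (here 8). The sketch below checks a few candidate shapes and also shows `-1`, which asks NumPy to infer one dimension; the variable names are illustrative.

```python
import numpy as np

values = np.arange(2, 10)  # 8 elements

# A shape is valid only if the product of its dimensions equals values.size.
for shape in [(1, 3), (2, 4), (4, 2), (8, 1)]:
    fits = int(np.prod(shape)) == values.size
    print(shape, "ok" if fits else "size mismatch")

# -1 lets NumPy work out the missing dimension from the array size.
inferred = values.reshape(2, -1)
print(inferred.shape)
```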
x.shape = (2,4)
print("Array:\n",x,"\n")
print("x[-1]: ",x[-1])
print("x[1,3]: ", x[1,3])
Array:
 [[2 3 4 5]
 [6 7 8 9]] 

x[-1]:  [6 7 8 9]
x[1,3]:  9
a = np.arange(1,11)
b = np.arange(12,22)
a+b
array([13, 15, 17, 19, 21, 23, 25, 27, 29, 31])
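Addition is applied element by element, and the other arithmetic operators behave the same way. As a brief sketch, `*` multiplies matching elements (it is not a matrix product), and a scalar operand is broadcast across the whole array:

```python
import numpy as np

a = np.arange(1, 11)
b = np.arange(12, 22)

# Element-wise product: a[i] * b[i] for every position i.
product = a * b
# The scalar 100 is broadcast, i.e. added to every element of a.
shifted = a + 100

print(product)
print(shifted)
```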
a = np.arange(1,11).reshape(2,5)
b = np.arange(12,22).reshape(5,2)
result = np.dot(a,b) # Matrix product of a (2x5) and b (5x2), not an element-wise product
result
array([[260, 275],
       [660, 700]])
result.transpose()
array([[260, 660],
       [275, 700]])
np.linalg.inv(result) # Inverse of the 2x2 product matrix
array([[ 1.4 , -0.55],
       [-1.32,  0.52]])
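One way to sanity-check an inverse is to multiply it back against the original matrix and compare with the identity. The sketch below rebuilds the same `a`, `b`, and `result` as above; it uses the `@` operator, which for 2-D arrays gives the same matrix product as `np.dot`.

```python
import numpy as np

a = np.arange(1, 11).reshape(2, 5)
b = np.arange(12, 22).reshape(5, 2)
result = a @ b  # matrix product, equivalent to np.dot(a, b) for 2-D arrays

inverse = np.linalg.inv(result)
identity_check = result @ inverse

# Round-off means the product is only approximately the identity,
# so compare with a tolerance rather than with ==.
is_identity = np.allclose(identity_check, np.eye(2))
print(is_identity)
```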

2.2 SciPy

  • Built on NumPy
  • Provides various tools and functions for solving common problems in scientific computing.

Examples:

  • Fourier transforms (scipy.fft; the older scipy.fftpack is now considered legacy)
  • Multidimensional image processing (scipy.ndimage)
  • Spatial data structures and algorithms (scipy.spatial)
  • ...
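As one small illustration of how SciPy builds on NumPy arrays, the sketch below uses `scipy.spatial.distance.cdist` to compute pairwise Euclidean distances. The three points are made-up values chosen so the distances are easy to check by hand.

```python
import numpy as np
from scipy.spatial import distance

# Three 2-D points lying on a line through the origin (illustrative values).
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# cdist returns the matrix of distances between every pair of rows;
# the default metric is Euclidean.
pairwise = distance.cdist(points, points)
print(pairwise)
```

Entry `[i, j]` of the result is the distance between point `i` and point `j`, so the diagonal is zero and the matrix is symmetric.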

2.3 Pandas (continued)

import pandas as pd
WORLD_DATA_PATH = "spotify-top-50/data/spotify-streaming-top-50-usa.csv"
world_df = pd.read_csv(WORLD_DATA_PATH)
world_df.sample()
           date  position           song         artist  popularity  duration_ms  album_type  total_tracks  release_date  is_explicit                                    album_cover_url
269  2023-05-23        20  Wasted On You  Morgan Wallen          86       178520       album            30    2021-01-08        False  https://i.scdn.co/image/ab67616d0000b2737d6813...

Q: What time range does the dataset cover?

world_df['date'].dtype
dtype('O')
type(world_df['date'][0])
str
# Convert the date column from str to a datetime type

world_df['date'] = pd.to_datetime(world_df['date'])
# Q: Over what time range does this dataset record the top 50?
# Assume that it records every day

world_df['date'].max(), world_df['date'].min()
(Timestamp('2023-06-27 00:00:00'), Timestamp('2023-05-18 00:00:00'))
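Once the column holds datetimes, subtracting the minimum from the maximum gives the length of the range directly. The sketch below uses a three-row toy frame (an assumption, so it runs without the lecture's CSV) whose first and last dates match the output above.

```python
import pandas as pd

# Toy stand-in for the date column of the Spotify data.
toy_df = pd.DataFrame({"date": ["2023-05-18", "2023-05-19", "2023-06-27"]})
toy_df["date"] = pd.to_datetime(toy_df["date"])

# Subtracting two Timestamps yields a Timedelta.
span = toy_df["date"].max() - toy_df["date"].min()
# Add 1 to count both the first and the last day.
n_days = span.days + 1

print(span)
print(n_days)
```

For these endpoints the span is 40 days, which is 41 calendar days when both ends are counted.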