Lab 11: Working with Data
NetID:
Link to published notebook:
A Data Science Workflow
A data science project needs a flexible, modular, iterative and multiparadigm workflow.
Setting up Questions
The first stage of the workflow is framing questions. To draw useful conclusions from the data, you need to start with the right questions.
What Can You Learn from the Data?
Data Wrangling
Exploratory Data Analysis
Analyzing Data: Machine Learning can Help
Communicating Results
Part 1: EDA or Exploratory Data Analysis
In this lab, we will have a look at a couple of simple examples of working with data.
Background
What is an Abalone?
What does the dataset contain?
How can you go about doing Exploratory Data Analysis (EDA)?
Code for Exploratory Data Analysis
Load the Data
Problem 1: How many rows and columns are there?
List some basic information about the dataset, such as the number of rows and columns, the column names, and the data type of each column (String, Integer, or Real).
Code solution
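One possible starting point is sketched below. It assumes that abaloneData is a Dataset whose rows are associations keyed by column name, as the later queries in this lab suggest. The name rows is introduced here only for illustration.
In[]:=
(* Convert the Dataset to a plain list of associations, one per sample *)
rows=Normal[abaloneData];
(* Number of rows *)
Length[rows]
(* Column names, read from the first row *)
Keys[First[rows]]
(* Number of columns *)
Length[Keys[First[rows]]]
(* Data type of each column, judged from the first row only *)
Head/@First[rows]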
Problem 2: Calculate descriptive statistics
Problem 3: Visual exploration using scatter plots
Problem 4: Visual exploration using histograms
The following code selects the male samples and plots a histogram of their number of “Rings”:
In[]:=
abaloneData[Select[#Sex=="M"&]/*Histogram,"Rings"]
Out[]=
Modify the code above to create a histogram of the number of rings of the female samples. What is the most frequently occurring value for the number of rings and how many samples do you see for this value?
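A histogram groups values into bins, so exact counts can be hard to read off the plot. As a cross-check, the sketch below tallies the ring values directly. It is shown for the male samples, and the names maleRings and ringCounts are illustrative, not part of the original lab.
In[]:=
(* Extract the "Rings" values of the male samples as a plain list *)
maleRings=Normal[abaloneData[Select[#Sex=="M"&],"Rings"]];
(* Count how many samples have each ring value *)
ringCounts=Counts[maleRings];
(* Ring value with the largest count, together with that count *)
TakeLargest[ringCounts,1]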
Problem 5: Non-graphical exploration - find the correlation of the features.
The following code shows the correlation between the numerical features:
In[]:=
numericalColumns={"Length","Diameter","Height","WholeWeight","ShuckedWeight","VisceraWeight","ShellWeight","Rings"};
In[]:=
Correlation[Values/@Normal@abaloneData[All,numericalColumns]]//TableForm
Out[]//TableForm=
1. | 0.986812 | 0.827554 | 0.925261 | 0.897914 | 0.903018 | 0.897706 | 0.55672 |
0.986812 | 1. | 0.833684 | 0.925452 | 0.893162 | 0.899724 | 0.90533 | 0.57466 |
0.827554 | 0.833684 | 1. | 0.819221 | 0.774972 | 0.798319 | 0.817338 | 0.557467 |
0.925261 | 0.925452 | 0.819221 | 1. | 0.969405 | 0.966375 | 0.955355 | 0.54039 |
0.897914 | 0.893162 | 0.774972 | 0.969405 | 1. | 0.931961 | 0.882617 | 0.420884 |
0.903018 | 0.899724 | 0.798319 | 0.966375 | 0.931961 | 1. | 0.907656 | 0.503819 |
0.897706 | 0.90533 | 0.817338 | 0.955355 | 0.882617 | 0.907656 | 1. | 0.627574 |
0.55672 | 0.57466 | 0.557467 | 0.54039 | 0.420884 | 0.503819 | 0.627574 | 1. |
Here are the names of the numerical features:
In[]:=
numericalColumns
Which two features are the most correlated, and which two are the least correlated?
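The correlation table is easier to read when its rows and columns are labeled. A minimal sketch, assuming numericalColumns is defined as above; the name corr is introduced here for illustration:
In[]:=
(* Same correlation matrix as before, stored in a variable *)
corr=Correlation[Values/@Normal@abaloneData[All,numericalColumns]];
(* Label both the rows and the columns with the feature names *)
TableForm[corr,TableHeadings->{numericalColumns,numericalColumns}]
Each entry is the correlation between the feature named in its row and the feature named in its column. The diagonal entries are always 1, so ignore them when looking for the most correlated pair.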
Part 2: EDA of a Webpage of Your Choice
In this section we will perform EDA on a webpage of your choice to see how we can quickly get quantitative and visual information about this page.
Explore a webpage of your choice
Set the URL for the page you want to explore (as an example, we use a page about abalones):
In[]:=
url="https://www.marinebio.net/marinescience/06future/abintro.htm";
Import the text from the webpage:
In[]:=
pageText=Import[url];
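Before analyzing the text, it can help to confirm that the import worked. The sketch below assumes the import returned a string; if the page could not be fetched, Import typically returns $Failed instead, and the first check will show that.
In[]:=
(* Should be String if the import succeeded *)
Head[pageText]
(* Total number of characters on the page *)
StringLength[pageText]
(* Preview of the first 300 characters, or fewer if the page is shorter *)
StringTake[pageText,UpTo[300]]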
Problem 6: What is the page talking about?
Create a word cloud from the page text:
In[]:=
WordCloud[pageText]
What are the most frequently used words on this page?
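A word cloud shows relative frequency only by font size. To see actual counts, one option is to tally the words directly, as sketched below. Lowercasing the text and removing stopwords are choices made here for readability, and the name wordCounts is illustrative.
In[]:=
(* Lowercase the text, drop common stopwords such as "the" and "of", then count each word *)
wordCounts=WordCounts[DeleteStopwords[ToLowerCase[pageText]]];
(* The ten most frequent words with their counts *)
TakeLargest[wordCounts,10]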
Problem 7: Analyze the text
The following code finds all the sentences on the page:
In[]:=
sentences=TextSentences[pageText];
Number of sentences found on the page:
In[]:=
Length[sentences]
Create a histogram of the length of the sentences:
In[]:=
Histogram[WordCount/@sentences]
In[]:=
ReverseSort[WordCount/@sentences]
In[]:=
Commonest[WordCount/@sentences]
How many sentences did you find?
What is the shortest, longest and most common length of a sentence on the page?
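The shortest and longest sentence lengths can also be computed directly rather than read off the sorted list. A minimal sketch, assuming sentences is defined as above; the name lengths is introduced for illustration:
In[]:=
(* Number of words in each sentence *)
lengths=WordCount/@sentences;
(* Shortest and longest sentence lengths, returned as {min, max} *)
MinMax[lengths]
(* Most common sentence length, or lengths if there is a tie *)
Commonest[lengths]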
Part 3: Yet Another Survey
Thanks for being a fantastic class.
Please fill out this short survey so I can improve this course: https://wolfr.am/ECE101SP24
Submitting your work
1. Publish your notebook
1.1. From the cloud notebook, click on “Publish” at the top right corner.
1.2. From the desktop notebook, use the menu option File -> Publish to Cloud
2. Copy the published link
3. Add it to the top of the notebook, below your netID
4. Print to PDF
5. Upload to Gradescope
6. Just to be sure, maybe ping your TA Sattwik on Slack that you have submitted.