Lab 11: Working with Data

NetID:
Link to published notebook:

A Data Science Workflow

A data science project needs a flexible, modular, iterative and multiparadigm workflow.

Setting up Questions

The first stage of the workflow is framing questions. To draw useful conclusions from the data, you need to start with the right questions.

What Can You Learn from the Data?


Data Wrangling


Exploratory Data Analysis


Analyzing Data: Machine Learning can Help


Communicating Results


Part 1: EDA or Exploratory Data Analysis

In this lab, we will look at a couple of simple examples of working with data.

Background

What is an Abalone?


What does the dataset contain?


How can you go about doing Exploratory Data Analysis (EDA)?


Code for Exploratory Data Analysis

Load the Data


Problem 1: How many rows and columns are there?

List some basic information about the dataset, such as the number of rows and columns, the column names, and the data type of each column (String, Integer, or Real).
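As a rough sketch of how this kind of summary can be gathered, the code below works on a tiny stand-in dataset rather than the lab data. The name `sample` and all of its values are hypothetical, and the type check looks only at the first row, which assumes every column holds a single type.

```wolfram
(* Hypothetical three-row stand-in for abaloneData; the values are made up *)
sample = Dataset[{
   <|"Sex" -> "M", "Length" -> 0.455, "Rings" -> 15|>,
   <|"Sex" -> "F", "Length" -> 0.53, "Rings" -> 9|>,
   <|"Sex" -> "I", "Length" -> 0.33, "Rings" -> 7|>}];

rows = Normal[sample];  (* plain list of associations, one per row *)
dims = {Length[rows], Length[First[rows]]}  (* {row count, column count} *)
columnNames = Keys[First[rows]]
columnTypes = Head /@ First[rows]  (* head of each value in the first row *)
```

The same three expressions should carry over to a larger dataset once it has been converted with `Normal`, provided it is also stored as a list of associations.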

Code solution


Problem 2: Calculate descriptive statistics


Problem 3: Visual exploration using scatter plots


Problem 4: Visual exploration using histograms

The following code selects the male samples and plots a histogram of their number of “Rings”:
In[]:=
abaloneData[Select[#Sex=="M"&]/*Histogram,"Rings"]
Out[]=
Modify the code above to create a histogram of the number of rings of the female samples. What is the most frequently occurring value for the number of rings and how many samples do you see for this value?
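Reading the tallest bar off a histogram can be cross-checked numerically. The sketch below uses a short made-up list, `rings`, instead of the lab data. With the real dataset, a list of values could probably be obtained with something like `Normal[abaloneData[All,"Rings"]]`, but that is an assumption about how the data is stored.

```wolfram
rings = {9, 10, 9, 7, 10, 9, 8};  (* made-up ring counts, not taken from the abalone data *)
mode = First[Commonest[rings]]  (* most frequently occurring value *)
modeCount = Count[rings, mode]  (* number of samples with that value *)
```

`Commonest` returns a list because several values can tie for most frequent; `First` keeps only one of them here.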

Problem 5: Non-graphical exploration - find the correlation of the features.

The following code shows the correlation between the numerical features:
In[]:=
numericalColumns={"Length","Diameter","Height","WholeWeight","ShuckedWeight","VisceraWeight","ShellWeight","Rings"};
In[]:=
Correlation[Values/@Normal@abaloneData[All,numericalColumns]]//TableForm
Out[]//TableForm=
1.        0.986812  0.827554  0.925261  0.897914  0.903018  0.897706  0.55672
0.986812  1.        0.833684  0.925452  0.893162  0.899724  0.90533   0.57466
0.827554  0.833684  1.        0.819221  0.774972  0.798319  0.817338  0.557467
0.925261  0.925452  0.819221  1.        0.969405  0.966375  0.955355  0.54039
0.897914  0.893162  0.774972  0.969405  1.        0.931961  0.882617  0.420884
0.903018  0.899724  0.798319  0.966375  0.931961  1.        0.907656  0.503819
0.897706  0.90533   0.817338  0.955355  0.882617  0.907656  1.        0.627574
0.55672   0.57466   0.557467  0.54039   0.420884  0.503819  0.627574  1.
Here are the names of the numerical features:
In[]:=
numericalColumns
Which two features are the most correlated and which ones are the least correlated?
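Scanning a correlation matrix by eye gets error-prone as the number of features grows, so it can help to rank the off-diagonal entries programmatically. The sketch below does this for a small made-up matrix. The names `names`, `data`, `pairs`, `scores` and `ranked` are illustrative and not part of the lab notebook.

```wolfram
names = {"a", "b", "c"};  (* hypothetical feature names *)
data = {{1., 2.1, 5.}, {2., 3.9, 3.}, {3., 6.2, 4.}, {4., 8.1, 1.}};  (* made-up rows *)
corr = Correlation[data];  (* 3 x 3 correlation matrix of the columns *)

(* Each unordered pair of column indices once; this skips the diagonal of 1s *)
pairs = Subsets[Range[Length[names]], {2}];
scores = AssociationMap[Extract[corr, #] &, pairs];  (* index pair -> coefficient *)
ranked = Sort[KeyMap[names[[#]] &, scores]];  (* keyed by name pair, ascending by coefficient *)

leastCorrelated = First[Keys[ranked]]
mostCorrelated = Last[Keys[ranked]]
```

This ranks by the signed coefficient. If strongly negative correlations should count as strong, the coefficients could be wrapped in `Abs` before sorting.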

Part 2: EDA of a Webpage of Your Choice

In this section we will perform EDA on a webpage of your choice to see how we can quickly get quantitative and visual information about this page.

Explore a webpage of your choice

Set the URL for the page you want to explore (this example uses a page about abalones):
In[]:=
url="https://www.marinebio.net/marinescience/06future/abintro.htm";
Import the text from the webpage:
In[]:=
pageText=Import[url];

Problem 6: What is the page talking about?

Create a word cloud from the page text:
In[]:=
WordCloud[pageText]
What are the most frequently used words on this page?
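A word cloud shows relative frequency by font size, and the exact counts behind it can be computed directly. The sketch below runs on a short made-up string, `text`, rather than the imported page. Dropping stopwords and lowercasing are choices made for this example, not something the lab prescribes.

```wolfram
text = "Abalone are sea snails. Abalone shells are prized. Divers collect abalone.";  (* made-up text *)

(* Split into words, normalize case, and drop common words such as "are" *)
words = DeleteStopwords[ToLowerCase[TextWords[text]]];
wordCounts = ReverseSort[Counts[words]];  (* word -> count, most frequent first *)
topWords = Take[wordCounts, UpTo[3]]
```

Replacing `text` with the imported page text should give the counts for a real page, assuming the import returned a plain string.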

Problem 7: Analyze the text

The following code finds all the sentences on the page:
In[]:=
sentences=TextSentences[pageText];
Number of sentences found on the page:
In[]:=
Length[sentences]
Create a histogram of the length of the sentences:
In[]:=
Histogram[WordCount/@sentences]
In[]:=
ReverseSort[WordCount/@sentences]
In[]:=
Commonest[WordCount/@sentences]
How many sentences did you find?
What are the shortest, longest, and most common sentence lengths on the page?
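The sorted list of word counts answers these questions, but the extremes can also be extracted directly. The sketch below uses a made-up three-sentence string, and the names `demoText`, `demoSentences` and `lengths` are hypothetical, chosen to avoid clashing with the variables defined above.

```wolfram
demoText = "Abalone are sea snails. They cling to rocks. Their shells are prized for an iridescent lining.";  (* made-up text *)
demoSentences = TextSentences[demoText];
lengths = WordCount /@ demoSentences;  (* words per sentence *)

MinMax[lengths]  (* {shortest, longest} sentence length *)
Commonest[lengths]  (* most common length or lengths *)
First[MaximalBy[demoSentences, WordCount]]  (* one of the longest sentences *)
```

`MaximalBy` returns every sentence that ties for the maximum, so `First` picks just one of them.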

Part 3: Yet Another Survey

Thanks for being a fantastic class.
Please fill out this short survey so I can improve this course: https://wolfr.am/ECE101SP24

Submitting your work

1. Publish your notebook
    1.1. From the cloud notebook, click on “Publish” at the top right corner.
    1.2. From the desktop notebook, use the menu option File -> Publish to Cloud
2. Copy the published link
3. Add it to the top of the notebook, below your netID
4. Print to PDF
5. Upload to Gradescope
6. Just to be sure, maybe ping your TA Sattwik on Slack that you have submitted.