
Question

Solve in Python.

The dataset can be downloaded using this link: https://file.io/h3XG1C7pEz2v

You are working as a data scientist and you have received data on house prices in the Boston region.
The data set contains the following variables:
• crim: per capita crime rate by town
• zn: proportion of residential land zoned for lots over 25,000 sq.ft.
• indus: proportion of non-retail business acres per town
• chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
• nox: nitric oxides concentration
• rm: average number of rooms per dwelling
• age: proportion of owner-occupied units built prior to 1940
• dis: weighted distances to five Boston employment centers
• rad: index of accessibility to radial highways
• tax: full-value property-tax rate per $10,000
• ptratio: pupil-teacher ratio by town
• b: 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
• lstat: % lower status of the population
• medv: median value of owner-occupied homes in $1000s
Given this information:
1. Download the dataset boston.csv and open it as a pandas DataFrame.
2. Using 'medv' as the response variable and per capita crime rate by town (crim), proportion of owner-occupied units built prior to 1940 (age), and nitric oxides
concentration (nox) as predictors, fit a linear model (OLS) and a k-nearest neighbours model (using the 5 nearest neighbours). Which one has better prediction
properties using k-fold cross-validation (k=5)? Explain why.
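
One possible way to approach both tasks is sketched below with pandas and scikit-learn. It is a sketch under assumptions, not a definitive solution: the helper names `load_boston` and `compare_models` are hypothetical, the CSV is assumed to use the column names from the variable list above (`crim`, `age`, `nox`, `medv`), and mean squared error is assumed as the measure of prediction quality. Standardising the predictors before the k-nearest neighbours fit and shuffling the folds with a fixed seed are additions the question does not ask for; they are included because the three predictors are on very different scales and the row order of the file is unknown.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Column names are assumed to match the variable list in the question.
PREDICTORS = ["crim", "age", "nox"]
RESPONSE = "medv"


def load_boston(path="boston.csv"):
    """Task 1: read the downloaded CSV into a pandas DataFrame."""
    return pd.read_csv(path)


def compare_models(df, n_splits=5, seed=0):
    """Task 2: mean k-fold cross-validated MSE for OLS and 5-nearest neighbours."""
    X = df[PREDICTORS].to_numpy()
    y = df[RESPONSE].to_numpy()

    # Shuffled folds with a fixed seed (an assumption, for reproducibility).
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

    models = {
        "ols": LinearRegression(),
        # Scaling is our addition: k-NN distances are otherwise dominated
        # by whichever predictor has the largest numeric range.
        "knn5": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    }

    results = {}
    for name, model in models.items():
        # scikit-learn reports negated MSE so that larger is better; flip the sign.
        scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
        results[name] = float(-scores.mean())
    return results
```

With the file in the working directory, `compare_models(load_boston("boston.csv"))` returns a dictionary with one cross-validated MSE per model. The model with the lower value predicts better out of sample on these three predictors, and the "explain why" part can then be argued from the gap: OLS imposes a linear relationship, while k-nearest neighbours can follow non-linear structure at the cost of higher variance.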