Scenario Questions and Answers
Question 1 (1 pt)
Imagine you are asked to use machine learning to build a model that can predict the energy bills of a house. You are given data drawn from a city-wide survey describing variables including the house’s street, house number, age, floor area, heating type (gas/oil/electric) and window type (single/double glazed). You have this data for many houses along with a variable recording actual heating energy costs.
Of the variables mentioned above, say which you might use as inputs to a machine learning model.

age, floor area, heating type and window type

street, house number, age, floor area, heating type and window type (i.e., all)

floor area, heating type

house number, age, floor area, heating type

Question 2 (1 pt)
A reminder of the previous scenario:
Imagine you are asked to use machine learning to build a model that can predict the energy bills of a house. You are given data drawn from a city-wide survey describing variables including the house’s street, house number, age, floor area, heating type (gas/oil/electric) and window type (single/double glazed). You have this data for many houses along with a variable recording actual heating energy costs.
How might you treat the values of the variables? Are they ordinal, nominal, continuous numeric or discrete numeric? Which of the following would be a suitable classification of the variables (a) street, (b) house number, (c) age, (d) floor area, (e) heating type, (f) window type?

(a) nominal, (b) nominal, (c) nominal, (d) continuous numeric, (e) ordinal, (f) ordinal

(a) nominal, (b) nominal, (c) continuous numeric, (d) continuous numeric, (e) nominal, (f) ordinal

(a) nominal, (b) ordinal, (c) continuous numeric, (d) discrete numeric, (e) nominal, (f) ordinal

(a) nominal, (b) ordinal, (c) continuous numeric, (d) continuous numeric, (e) discrete numeric, (f) ordinal

Question 3 (1 pt)
A spam classification system is being developed, and is currently undergoing testing. We want the system to prefer labelling a message as not-spam rather than potentially having an important message hidden from the recipient. Four models have been tested, and the confusion matrices for these are below. Which model would you recommend?

                Predicted:
Actual:         Spam    Not spam
Spam            1996    4
Not spam        55      1945

                Predicted:
Actual:         Spam    Not spam
Spam            1773    227
Not spam        232     1768

                Predicted:
Actual:         Spam    Not spam
Spam            1973    27
Not spam        32      1968

                Predicted:
Actual:         Spam    Not spam
Spam            1663    337
Not spam        2       1998
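
One way to compare the four candidates is to count, for each matrix, how often a genuine message was labelled as spam. The sketch below does this in Python, taking the matrices in the order they are listed above. The names "model 1" to "model 4" and the helper functions are illustrative assumptions, not part of the question.

```python
# Confusion matrices in the order listed above. Each tuple holds:
# (actual spam predicted spam, actual spam predicted not spam,
#  actual not spam predicted spam, actual not spam predicted not spam)
matrices = {
    "model 1": (1996, 4, 55, 1945),
    "model 2": (1773, 227, 232, 1768),
    "model 3": (1973, 27, 32, 1968),
    "model 4": (1663, 337, 2, 1998),
}

def hidden_message_rate(tp, fn, fp, tn):
    """Fraction of genuine (not spam) messages that were labelled as spam."""
    return fp / (fp + tn)

def accuracy(tp, fn, fp, tn):
    """Fraction of all messages that were labelled correctly."""
    return (tp + tn) / (tp + fn + fp + tn)

rates = {name: hidden_message_rate(*m) for name, m in matrices.items()}
accuracies = {name: accuracy(*m) for name, m in matrices.items()}

for name in matrices:
    print(f"{name}: hidden-message rate {rates[name]:.4f}, "
          f"accuracy {accuracies[name]:.4f}")
```

Overall accuracy and the hidden-message rate can rank the models differently, which is why the stated preference in the question matters.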

Question 4 (1 pt)
The following plot shows the performance of a model in terms of the cost function on training data and validation data as the training process runs. What is the best term to describe the point in the training process marked by a C?

Generalised model

Underfitted

Outlier

Overfitted

Question 5 (1 pt)
A hotel chain wants to organise all of their customers into different groups so that each group contains similar customers. They want to be able to target packages such as inclusive meals, or room service discounts, to groups of similar customers, in the hope that they will visit the hotels more often. Ideally they would like to have 4-6 such groups. However, they do not know what group any individual currently belongs to. They do have data describing previous business with the customers: numeric variables like number of visits and length of stay; and nominal variables like whether or not they made use of the hotel gym and swimming pool.
Which of these best describes the task?

Without class labels, it cannot be done with any machine learning technique

This is a classification task, so logistic regression should be used

This is a clustering task that could be approached with techniques like k-means

This is a clustering task that is suitable for decision trees

Question 6 (1 pt)
You have been asked to build a linear regression model to predict the yield of a crop, for which data collection is very expensive and time-consuming. You have measurements from 200 trials growing the crop in different locations, and split this data into 160 training points and 40 validation points. You train a model on the training data using ordinary least squares, where the model achieves an R-squared value of 0.95, and test it on the validation data, where you get an R-squared of 0.56. Which of the following should you do?

R-squared of 0.95 is excellent: we can just use this model.

The model seems to have overfit the training data. We need to collect more data to do any better.

The model seems to have overfit the training data. Try rebuilding the model using ridge regression or lasso.

Ordinary least squares has a random element to its learning, so re-run the algorithm to see if it did better the second time.
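
For readers unfamiliar with ridge regression, the sketch below shows how its penalty term differs from ordinary least squares in the simplest possible case: one feature, no intercept. The data values and the function name `ridge_slope` are made up for illustration and are unrelated to the crop data in the question.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimising sum((y - w*x)**2) + lam * w**2.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam).
    With lam = 0 this is the ordinary least squares slope.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Illustrative (made-up) data
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

ols = ridge_slope(xs, ys, lam=0.0)     # no penalty: ordinary least squares
ridge = ridge_slope(xs, ys, lam=10.0)  # penalty shrinks the slope towards zero

print(f"OLS slope:   {ols:.4f}")
print(f"Ridge slope: {ridge:.4f}")
```

The penalty strength (`lam` here) is a hyperparameter that would normally be chosen using validation data.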

Question 7 (1 pt)
You have been asked to build a model that will predict whether a potato plant has a particular disease based on several measurements (plant height, leaf width, number of flowers, whether or not there are spots on the leaves). Which of the following algorithms would not be suitable for this task?

Logistic regression

kNN

kernel density estimation

Decision Tree

Question 8 (1 pt)
Look at this illustration of a decision tree that was trained on data from a fitness app. The output variable is the quality of sleep a person reports at the end of a day, and the input variables are average heart rate, whether or not they went outside, the number of coffees consumed, and the number of steps taken. The classes are Good or Bad.

What is the probability of a good night’s sleep given by the tree following a day on which a person had an average heart rate of 60, stayed inside, and drank 3 cups of coffee?

0.4

0.6

0.9

0.7

Question 9 (1 pt)
Which of the following is not a valid hyperparameter that you can tune when training a multi-layer perceptron?

Learning rate

Random State

Number of layers

Activation function

Question 10 (1 pt)
Which type of data would be best suited to a convolutional neural network?

Numbers stored in comma separated files

Photographs

Clusters of map coordinates

Database tables

Question 11 (1 pt)
Feature selection is an important part of the data mining process, whereby we choose the features that are most helpful for building a model, and ignore the others. There are several approaches to selecting the features. Which of the following is described by this statement:
“The features are analysed and their statistical correlation with the target variable is measured. The most highly correlated features are selected and used to construct the model, and the rest of the features are ignored.”

Embedded feature selection

LASSO

Wrapper feature selection

Filter feature selection
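
The procedure in the quoted statement, ranking features by their correlation with the target and keeping the top few, can be sketched in plain Python as follows. The feature names and values are made up for illustration, and `select_by_correlation` is a hypothetical helper, not a library function.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def select_by_correlation(features, target, k):
    """Return the names of the k features most correlated with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Illustrative (made-up) data: three candidate features and a target
features = {
    "floor_area": [50, 80, 120, 150, 200],
    "age": [10, 40, 25, 60, 5],
    "house_number": [3, 17, 8, 42, 21],
}
target = [400, 620, 900, 1150, 1500]

selected = select_by_correlation(features, target, k=1)
print(selected)
```

Note that the ranking happens before any model is trained: the model only ever sees the selected columns.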

Question 12 (1 pt)
Which of the following sentences poses a classic problem for an NLP model using the bag-of-words approach?

“This is the best car I have ever driven.”

“This product was not so bad after all.”

“This is the funniest movie of the year.”
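
As a reminder of what the bag-of-words approach keeps and discards, the sketch below counts words with `collections.Counter`. The tokenizer is deliberately simplistic and the two example sentences are made up for illustration; they are not taken from the options above.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split it into words, and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Two illustrative sentences with opposite meanings
a = bag_of_words("The film was good, not bad.")
b = bag_of_words("The film was bad, not good.")

print(a)
print(a == b)  # word order is lost, so the two bags are identical
```

Because only counts survive, any meaning carried by word order or by the scope of a word such as "not" is invisible to a model built on this representation.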

Question 13 (1 pt)
A sample of text reviewing local restaurants includes words like “eat”, “eating”, “eatery”, “eaten”. Which of the following methods is not directly applicable for text that includes a lot of similar words like this?

Lemmatisation

Bag-of-words

Stemming

Word embedding

Question 14 (1 pt)
A predictive model was used to estimate the probability of needing to pay customer insurance claims for a car accident. To get a good spread of data, claim rates were taken from three groups of customers: (1) regular commuters, who travel 10,000-20,000 miles per year; (2) professional drivers, such as taxi or delivery drivers; and (3) customers who make frequent use of public transport and drive less than 5,000 miles per year.
Data sets from the three groups of people above were rebalanced to be equal in size, then combined into one data set. This was used to train a regression model, which initially seemed to be a good fit to the complete data set. In testing, the model seemed to massively over-estimate claim rates among those in group (3), driving infrequently. What error in the modelling process could have caused this?

This was a clustering problem, not a regression problem.

This is a problem of aggregation: very different types of customers probably have different distributions of claims so will likely need separate models.

More groups of customers should have been identified before the aggregation.

The data had a large number of missing values which should have been corrected.

Question 15 (1 pt)
You have been asked to visualise the number of students registered to different computing courses. The raw data looks like this:

Course             Students
Programming        24
Databases          29
User Interfaces    41
Web Design         47
AI                 44

Which of the following visualisations should you choose to best represent this data?