FINAL EXAM – Big Data Analytics and Database Design



[Frequent Itemset Mining and Association Rules]

  • [10 marks] Compute the frequent itemsets for the baskets below using the A-Priori algorithm. Assume an itemset is frequent when its support is at least 3 (s ≥ 3). Provide the details of your computation and all the necessary steps.
    a) Mug, Pen, TV
    b) Glass, Mug, Monitor, TV
    c) Mug, Monitor, Laptop
    d) Glass, Laptop, Monitor
    e) Mug, Pen, Laptop, Monitor, TV
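
The A-Priori passes over the five baskets above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a model answer: the names `apriori`, `BASKETS` and `frequent_itemsets` are introduced here for illustration, and the threshold is read as "support of at least 3".

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {frozenset(itemset): support} for every itemset with support >= min_support."""
    baskets = [frozenset(b) for b in baskets]

    # Pass 1: count single items.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate k-itemsets: every (k-1)-subset must already be frequent.
        items = sorted(set().union(*frequent))
        candidates = [
            frozenset(c)
            for c in combinations(items, k)
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        ]
        # Pass k: count only the surviving candidates.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


BASKETS = [
    {"Mug", "Pen", "TV"},
    {"Glass", "Mug", "Monitor", "TV"},
    {"Mug", "Monitor", "Laptop"},
    {"Glass", "Laptop", "Monitor"},
    {"Mug", "Pen", "Laptop", "Monitor", "TV"},
]

frequent_itemsets = apriori(BASKETS, min_support=3)
for itemset, support in sorted(
    frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
):
    print(sorted(itemset), support)
```

With these baskets the sketch reports four frequent items (Mug, Monitor, TV, Laptop), three frequent pairs, and no frequent triple, because every candidate triple contains at least one infrequent pair.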
  • [10 marks] Describe the Multihash algorithm for finding frequent pairs, supported by a diagram. What is stored in main memory in Pass 1 and in Pass 2?
[Finding Similar Items]
  • [10 marks] Compute the signature matrix in a single pass over the data, using the two provided hash functions. Provide all intermediate computations.
    Note: the values of the hash functions are given to you for simplicity.

  • [10 marks] Answer the following questions.
    o Based on the signature matrix that you computed in the previous question, estimate the Jaccard similarity between the following pairs of sets:
    i. Sets S2 and S5.
    ii. Sets S3 and S4.
    o What is the actual Jaccard similarity between these pairs of sets?
[Clustering]
  • [10 marks] Apply hierarchical clustering to the following data in a 2-dimensional Euclidean space. Assume the stopping point is k = 2 (k is the number of clusters). Provide all intermediate computations. Draw the dendrogram.
    Note: use Manhattan Distance. Dist( (x1, x2), (y1, y2) ) = |x1 – y1| + |x2 – y2|
    For example, Dist( (2, 6), (4, 8) ) = |2 – 4| + |6 – 8| = 2 + 2 = 4
    o (2, 6)
    o (4, 8)
    o (2, 4)
    o (0, 0)
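
The merging procedure for the four points above can be sketched as follows. The question does not say how the distance between two clusters is measured, so this sketch assumes single linkage (the smallest Manhattan distance between any two members); the function names `manhattan`, `single_link` and `agglomerate` are illustrative only.

```python
def manhattan(p, q):
    """Manhattan distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def single_link(c1, c2):
    """ASSUMPTION: cluster distance = closest pair of members (single linkage)."""
    return min(manhattan(p, q) for p in c1 for q in c2)


def agglomerate(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    merges = []  # (cluster_a, cluster_b, distance), in merge order
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


POINTS = [(2, 6), (4, 8), (2, 4), (0, 0)]
clusters, merges = agglomerate(POINTS, k=2)
for a, b, d in merges:
    print("merge", a, "+", b, "at distance", d)
print("final clusters:", clusters)
```

Under this assumption the first merge joins (2, 6) and (2, 4) at distance 2, and the second adds (4, 8) at distance 4, leaving (0, 0) in a cluster of its own. The merge distances are the heights at which the dendrogram branches join.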
[Large-Scale Machine Learning]
  • [10 marks] What is the difference between supervised and unsupervised learning? List 3 methods for supervised machine learning.
  • [10 marks] Consider the following training dataset for detecting spam emails. +1 means the email is spam and -1 means it is not.
    Apply the Winnow algorithm with θ = 6.
    Assume the raising factor is 2 and the lowering factor is ½.
    Find the weight vector w. Visit each training email exactly once (this is the stopping criterion). Show all the intermediate steps.
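
The Winnow update rule described above can be sketched as follows. The training table from the exam is not reproduced in this text, so the rows in `HYPOTHETICAL_EMAILS` are made up purely to exercise the code and are not the exam data. The sketch also assumes that all weights start at 1 and that a dot product exactly equal to θ counts as a misclassification for either label; both are conventions assumed here, not stated in the question.

```python
def winnow(examples, theta, raise_factor=2.0, lower_factor=0.5):
    """One pass of Winnow over (x, y) pairs with binary features and y in {+1, -1}."""
    n = len(examples[0][0])
    w = [1.0] * n  # ASSUMPTION: every weight starts at 1
    history = []   # (x, y, dot product before update, weights after update)
    for x, y in examples:
        dot = sum(wi * xi for wi, xi in zip(w, x))
        if y == +1 and dot <= theta:
            # Spam scored too low: raise the weights of the features that are present.
            w = [wi * raise_factor if xi == 1 else wi for wi, xi in zip(w, x)]
        elif y == -1 and dot >= theta:
            # Non-spam scored too high: lower the weights of the features that are present.
            w = [wi * lower_factor if xi == 1 else wi for wi, xi in zip(w, x)]
        history.append((x, y, dot, list(w)))
    return w, history


# HYPOTHETICAL rows (not the exam's dataset): four binary features per email.
HYPOTHETICAL_EMAILS = [
    ((1, 1, 0, 1), +1),
    ((0, 1, 1, 0), -1),
    ((1, 1, 1, 1), -1),
]

w, history = winnow(HYPOTHETICAL_EMAILS, theta=6)
for x, y, dot, weights in history:
    print(x, y, dot, weights)
```

Each row of `history` corresponds to one intermediate step: the email visited, its label, the dot product computed with the weights at that time, and the weights after any update.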

  • [10 marks] Using the following decision tree, we want to know whether a person buys insurance or not. The leaf nodes of the decision tree correspond to the “buy insurance?” feature.

    Determine whether the following three customers will buy insurance or not, based on the above decision tree.

  • [10 marks] Given the following sample of the Web graph:
    o Compute the transition matrix M.
    o Compute only the first step of PageRank (start from initial rank vector r0 and compute r1).
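
Building M and taking one PageRank step can be sketched as follows. The Web graph from the exam is not reproduced in this text, so `HYPOTHETICAL_LINKS` is a made-up three-page graph with no dead ends, used only for illustration. The sketch assumes the column-stochastic convention (column j spreads the rank of page j evenly over its out-links) and a uniform initial vector r0; neither is confirmed by the question text.

```python
from fractions import Fraction


def transition_matrix(links, nodes):
    """M[i][j] = 1/outdegree(j) if page j links to page i, else 0."""
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:
            M[index[dst]][index[src]] = Fraction(1, len(targets))
    return M


def pagerank_step(M, r):
    """One power-iteration step: r_next = M r."""
    n = len(r)
    return [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]


# HYPOTHETICAL graph (not the exam's figure): a -> b, a -> c, b -> c, c -> a.
NODES = ["a", "b", "c"]
HYPOTHETICAL_LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

M = transition_matrix(HYPOTHETICAL_LINKS, NODES)
r0 = [Fraction(1, len(NODES))] * len(NODES)  # ASSUMPTION: uniform start vector
r1 = pagerank_step(M, r0)
print(r1)
```

Exact fractions are used so that each entry of r1 can be checked by hand; the entries of r1 still sum to 1 because every column of M sums to 1 in a graph without dead ends.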

  • [10 marks] How does PageRank fix the problems of dead ends and spider traps?
  • [10 marks] Consider the following graph. Assume we are interested in performing topic-specific PageRank. Assume the set of pages related to our topic is S = {a, c} and β = 0.8. Compute the updated transition matrix A. Show all the steps.
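
One common way to write the topic-specific matrix is A[i][j] = β·M[i][j] + (1 − β)/|S| when page i is in S, and A[i][j] = β·M[i][j] otherwise. The sketch below applies that formula with S = {a, c} and β = 0.8 from the question. The graph itself is not reproduced in this text, so `HYPOTHETICAL_M` is a made-up column-stochastic matrix for three pages, and the name `topic_specific_matrix` is illustrative.

```python
from fractions import Fraction


def topic_specific_matrix(M, nodes, topic_set, beta):
    """A = beta * M, plus a teleport share (1 - beta)/|S| added to the rows of topic pages."""
    share = (1 - beta) / len(topic_set)
    n = len(nodes)
    return [
        [beta * M[i][j] + (share if nodes[i] in topic_set else 0) for j in range(n)]
        for i in range(n)
    ]


NODES = ["a", "b", "c"]
half = Fraction(1, 2)
# HYPOTHETICAL transition matrix (not the exam's graph); every column sums to 1.
HYPOTHETICAL_M = [
    [0, 0, 1],
    [half, 0, 0],
    [half, 1, 0],
]

A = topic_specific_matrix(HYPOTHETICAL_M, NODES, {"a", "c"}, Fraction(4, 5))
for row in A:
    print(row)
```

Only the rows for a and c receive the teleport share of 0.2 / 2 = 0.1, while row b is simply 0.8 times the corresponding row of M. Each column of A still sums to 1, which is a quick sanity check on a hand computation.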
[Mining Data Streams]
  • [10 marks] Suppose that in the DGIM algorithm we start with the buckets presented below.
    What will the buckets look like after the following bits enter the stream?
    First a 1 enters, then a 0, then a 1, and finally another 1.
    Show all intermediate steps.
    Note that there must always be either one or two buckets of each size (number of 1s).

  • [10 marks] Assume we want to use a Bloom filter to filter email addresses, and that we want to keep only the following two addresses:
    • jack@gmail.com
    • sarah@yahoo.com
    Assume we use two hash functions to build the Bloom filter bit array. The values of the hash functions for these emails are given below:
    • Hash function 1:
    o h1(jack@gmail.com) = 3
    o h1(sarah@yahoo.com) = 7
    • Hash function 2:
    o h2(jack@gmail.com) = 8
    o h2(sarah@yahoo.com) = 3
    Assume the size of the bit array is 10 and its first index is 1 and its last index is 10.
    First, build the Bloom filter bit array.
    Second, determine if the following email addresses will pass the Bloom filter or not. Justify your answer. Below, you can find the values of both hash functions for each of the input emails.
    a) hello@cnn.com
    o h1(hello@cnn.com) = 3
    o h2(hello@cnn.com) = 3
    b) spam@money.com
    o h1(spam@money.com) = 8
    o h2(spam@money.com) = 7
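
The construction and the two lookups above can be sketched as follows. The hash values are taken directly from the question; the helper names `build_bloom` and `passes` are illustrative, and index 0 of the list is left unused so that positions 1 to 10 match the question's 1-based indexing.

```python
def build_bloom(size, hashed_keys):
    """Set the bit at every hash position of every allowed key (1-based positions)."""
    bits = [0] * (size + 1)  # index 0 is unused padding
    for positions in hashed_keys.values():
        for pos in positions:
            bits[pos] = 1
    return bits


def passes(bits, positions):
    """An input passes only if every one of its hash positions is set."""
    return all(bits[pos] == 1 for pos in positions)


ALLOWED = {
    "jack@gmail.com": (3, 8),   # (h1, h2) as given in the question
    "sarah@yahoo.com": (7, 3),
}
QUERIES = {
    "hello@cnn.com": (3, 3),
    "spam@money.com": (8, 7),
}

bits = build_bloom(10, ALLOWED)
print("bit array (positions 1..10):", bits[1:])
results = {email: passes(bits, pos) for email, pos in QUERIES.items()}
for email, ok in results.items():
    print(email, "passes" if ok else "is rejected")
```

Bits 3, 7 and 8 end up set, so both test addresses pass the filter even though neither was inserted. Both are false positives, which a Bloom filter permits; it never produces false negatives.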
  • [10 marks] Prove that the Reservoir Sampling algorithm has the following property. Recall that the algorithm maintains a sample S of size s from the stream.
    o After n elements, the sample contains each element seen so far with probability s/n
    Note: here are the steps of the Reservoir Sampling algorithm:
    o Store all the first s elements of the stream to S
    o Suppose we have seen n-1 elements, and now the nth element arrives (n > s)
    o With probability s/n, keep the nth element, else discard it
    o If we picked the nth element, then it replaces one of the s elements in the sample S, picked uniformly at random
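
The steps listed above can be sketched in Python, together with an empirical check of the s/n property. This is a simulation, not a proof; the names `reservoir_sample` and `inclusion_frequencies`, the seed, and the values n = 10, s = 3 and 20000 trials are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def reservoir_sample(stream, s, rng):
    """Maintain a sample of size s over the stream, following the listed steps."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)              # store the first s elements
        elif rng.random() < s / n:           # keep the nth element with probability s/n
            sample[rng.randrange(s)] = item  # it replaces a uniformly chosen element
    return sample


def inclusion_frequencies(n, s, trials, seed=0):
    """Fraction of trials in which each of the n stream elements ends up in the sample."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(trials):
        hits.update(reservoir_sample(range(n), s, rng))
    return [hits[i] / trials for i in range(n)]


freqs = inclusion_frequencies(n=10, s=3, trials=20000)
print([round(f, 3) for f in freqs])  # each value should be close to s/n = 0.3
```

The simulation mirrors the inductive step of the proof: an element that is in the sample after n − 1 elements (probability s/(n − 1)) survives the arrival of the nth element with probability 1 − (s/n)(1/s) = (n − 1)/n, and the product of the two is s/n.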
