PRACTICE MIDTERM-Big Data Analytics and Database Design

[Frequent Itemset Mining and Association Rules]

[10 marks] Compute the frequent itemsets for the baskets below using the A-Priori algorithm. Assume the support threshold is s = 3, i.e., an itemset is frequent if it appears in at least 3 baskets. Show all the steps of your computation.
a) Bread, Coke, Milk, Pepsi
b) Coke, Diaper, Milk
c) Beer, Bread, Diaper, Milk
d) Beer, Bread, Fanta, Diaper
e) Beer, Coke, Diaper, Milk
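For checking a hand computation, the level-wise idea behind A-Priori can be sketched in Python as below. This is only an illustrative sketch, not a model answer: the function name `apriori` and the in-memory representation of baskets as Python sets are assumptions made here, and basket e) is read as Beer, Coke, Diaper, Milk.

```python
from itertools import combinations

# The five baskets from the question, one set per basket.
baskets = [
    {"Bread", "Coke", "Milk", "Pepsi"},
    {"Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Beer", "Bread", "Fanta", "Diaper"},
    {"Beer", "Coke", "Diaper", "Milk"},
]

def apriori(baskets, s):
    """Return {itemset: support} for every itemset contained in >= s baskets."""
    frequent = {}
    # Level 1 candidates: every single item that occurs in some basket.
    candidates = {frozenset([item]) for b in baskets for item in b}
    k = 1
    while candidates:
        # Count how many baskets contain each candidate itemset.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= s}
        frequent.update(level)
        k += 1
        # Next candidates: size-k sets whose (k-1)-subsets are all frequent.
        items = sorted({i for c in level for i in c})
        candidates = {
            frozenset(combo)
            for combo in combinations(items, k)
            if all(frozenset(sub) in level for sub in combinations(combo, k - 1))
        }
    return frequent

result = apriori(baskets, 3)
```

Sorting `result` by itemset size reproduces the pass-by-pass tables expected in a written answer: item counts first, then the counts of candidate pairs built from frequent items only.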
[10 marks] Using the baskets of question 1, what are the confidence and interest of the following association rules?
i. {Milk} → {Coke}
ii. {Diaper, Milk} → {Beer}
Which one would you rank higher (i.e., is more interesting)? Justify your answer.
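A small Python sketch for checking the two rules is given below. It assumes the common textbook definitions: confidence(I → j) = support(I ∪ {j}) / support(I), and interest(I → j) = confidence(I → j) minus the fraction of all baskets that contain j. If the course uses a different definition of interest, adjust accordingly. The helper names are chosen here for illustration.

```python
BASKETS = [
    {"Bread", "Coke", "Milk", "Pepsi"},
    {"Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Beer", "Bread", "Fanta", "Diaper"},
    {"Beer", "Coke", "Diaper", "Milk"},
]

def support(itemset, baskets=BASKETS):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for b in baskets if set(itemset) <= b)

def confidence(lhs, rhs, baskets=BASKETS):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

def interest(lhs, rhs, baskets=BASKETS):
    """Confidence minus the fraction of baskets containing rhs (assumed definition)."""
    return confidence(lhs, rhs, baskets) - support(rhs, baskets) / len(baskets)

rule_i = (confidence({"Milk"}, {"Coke"}), interest({"Milk"}, {"Coke"}))
rule_ii = (confidence({"Diaper", "Milk"}, {"Beer"}), interest({"Diaper", "Milk"}, {"Beer"}))
```

Under these definitions rule i has confidence 3/4 and interest of about 0.15, while rule ii has confidence 2/3 and interest of about 0.07.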
[10 marks] Provide a description of the PCY algorithm, supported by a diagram. What is stored in main memory in Pass 1 and in Pass 2? (Assume we are looking for frequent pairs of items.)
[10 marks] Describe how and why the bitmap is used in the PCY algorithm.
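The two passes of PCY for frequent pairs can be sketched as follows. This is a toy, in-memory illustration under stated assumptions: the bucket hash function, the number of buckets (`n_buckets=11`), and the function name `pcy_pairs` are all made up here, and a real implementation would pack the bitmap into bits rather than a Python list of booleans.

```python
from collections import Counter
from itertools import combinations

def pcy_pairs(baskets, s, n_buckets=11):
    """Return {pair: support} for frequent pairs, using a PCY-style bucket filter."""
    def bucket(pair):
        # Toy deterministic hash of a pair of item names (an assumption).
        a, b = pair
        return (sum(map(ord, a)) * 31 + sum(map(ord, b))) % n_buckets

    # Pass 1: count single items, and hash every pair to a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[bucket(pair)] += 1

    # Between passes: keep the frequent items and compress the bucket
    # counters into a bitmap (True = bucket count reached the threshold).
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in bucket_counts]

    # Pass 2: count a pair only if both items are frequent AND the pair
    # hashes to a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            if bitmap[bucket(pair)]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}

baskets = [
    {"Bread", "Coke", "Milk", "Pepsi"},
    {"Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Beer", "Bread", "Fanta", "Diaper"},
    {"Beer", "Coke", "Diaper", "Milk"},
]
frequent_pairs = pcy_pairs(baskets, 3)
```

The bitmap never discards a truly frequent pair, because a bucket's count is at least the count of any pair hashed to it. It only reduces the number of candidate pairs that must be counted in Pass 2.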
[Finding Similar Items]
[10 marks] Answer the following questions.
a) Provide the definition of Jaccard Similarity (formula).
b) What is the Jaccard similarity of the two sets S and T in the following figure?
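As a reminder of the definition, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal Python sketch follows. The example sets are hypothetical and are not the sets S and T from the exam figure.

```python
def jaccard(s, t):
    """|S ∩ T| / |S ∪ T| for two Python sets."""
    union = s | t
    if not union:
        # Convention chosen here (an assumption): two empty sets are identical.
        return 1.0
    return len(s & t) / len(union)

# Hypothetical example sets, for illustration only.
S = {1, 2, 3, 4}
T = {3, 4, 5, 6, 7}
sim = jaccard(S, T)  # intersection has 2 elements, union has 7
```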

[10 marks] Compute the signature matrix in a single pass over the rows, using the two provided hash functions.
Note: the values of the hash functions are given to you for simplicity.

[10 marks] Answer the following questions.
a) Based on the signature matrix that you computed in the previous question, estimate the Jaccard similarity of sets S1 and S4.
b) What is the actual Jaccard similarity between S1 and S4?
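The single-pass minhash computation and the two similarity measures can be sketched in Python as below. The characteristic matrix and the hash functions of this exam are not reproduced in the text, so the data here (four sets over rows 0 to 4, with h1(x) = (x + 1) mod 5 and h2(x) = (3x + 1) mod 5) is an assumed illustration and may differ from the exam's data.

```python
def minhash_signatures(sets, n_rows, hash_funcs):
    """One pass over the rows; sig[i][c] is the min of hash i over the rows in set c."""
    inf = float("inf")
    sig = [[inf] * len(sets) for _ in hash_funcs]
    for r in range(n_rows):
        hashes = [h(r) for h in hash_funcs]
        for c, s in enumerate(sets):
            if r in s:  # row r has a 1 in column c
                for i, hv in enumerate(hashes):
                    if hv < sig[i][c]:
                        sig[i][c] = hv
    return sig

def estimated_similarity(sig, a, b):
    """Fraction of signature rows in which columns a and b agree."""
    return sum(1 for row in sig if row[a] == row[b]) / len(sig)

def jaccard(s, t):
    return len(s & t) / len(s | t)

# Assumed illustrative data: each set lists the rows where its column has a 1.
S1, S2, S3, S4 = {0, 3}, {2}, {1, 3, 4}, {0, 2, 3}
h1 = lambda x: (x + 1) % 5
h2 = lambda x: (3 * x + 1) % 5

sig = minhash_signatures([S1, S2, S3, S4], 5, [h1, h2])
est = estimated_similarity(sig, 0, 3)  # S1 vs S4, from signatures
act = jaccard(S1, S4)                  # S1 vs S4, from the sets themselves
```

With only two hash functions the estimate is coarse: for this assumed data the signatures of S1 and S4 agree in both rows, so the estimate is 1, while the actual Jaccard similarity is 2/3.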
[Clustering]
[10 marks] What is the advantage of the BFR algorithm over k-means for clustering?
[10 marks] Define centroid and clustroid. Provide supporting examples of computing a centroid and a clustroid.
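One way to illustrate the difference is the Python sketch below. The centroid is the coordinate-wise average of the points and need not be a data point. The clustroid is an actual data point, chosen here as the one with the smallest sum of distances to the other points. That criterion is one common choice and is an assumption here (alternatives include the smallest maximum distance or the smallest sum of squared distances). The sample points are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of the points (may not be one of the points)."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def clustroid(points, dist=math.dist):
    """The data point minimising the sum of distances to all other points."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

# Hypothetical cluster in the plane, for illustration only.
cluster = [(1, 2), (2, 1), (3, 3), (9, 8)]
c = centroid(cluster)   # average of the four points
m = clustroid(cluster)  # the most "central" actual point
```

For this cluster the centroid is (3.75, 3.5), which is not a member of the cluster, while the clustroid is the point (3, 3).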
[10 marks] Apply hierarchical clustering to the following data in a 2-dimensional Euclidean space. Assume the stopping point is k = 2 (k is the number of clusters). Provide all intermediate computations.
Note: use Manhattan Distance. Dist( (x1, x2), (y1, y2) ) = |x1 – y1| + |x2 – y2|
For example, Dist( (4, 10), (3, 8) ) = |4 – 3| + |10 – 8| = 1 + 2 = 3
a) (4, 10)
b) (3, 8)
c) (6, 10)
d) (6, 8)
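The merging procedure can be sketched in Python as below. The question does not state how the distance between two clusters is measured, so this sketch assumes clusters are represented by their centroids and compared with the Manhattan distance. It also assumes ties are broken by taking the first closest pair in list order. Both are assumptions, and a different choice can lead to a different final clustering.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def centroid(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def hierarchical(points, k):
    """Agglomerative clustering; returns (final clusters, list of merges)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = manhattan(centroid(clusters[i]), centroid(clusters[j]))
                # Strict '<' keeps the first pair found when distances tie.
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

points = [(4, 10), (3, 8), (6, 10), (6, 8)]  # a, b, c, d
clusters, merges = hierarchical(points, 2)
```

Ties occur at both merge steps for this data (for example, a-c and c-d are both at distance 2), so a written answer should state its tie-breaking rule explicitly.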
