[Jul-2023] Verified Databricks Exam Dumps with Associate-Developer-Apache-Spark Exam Study Guide [Q76-Q91]


QUESTION 76
The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.__1__(__2__).select(__3__, __4__)

 
 
 
 
 

QUESTION 77
In which order should the code blocks shown below be run to create a DataFrame that shows the mean of column predError of DataFrame transactionsDf per storeId and productId, where productId is either 2 or 3, with the result sorted in ascending order by column storeId and rows with nulls in that column left out?
DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
1. .mean("predError")
2. .groupBy("storeId")
3. .orderBy("storeId")
4. transactionsDf.filter(transactionsDf.storeId.isNotNull())
5. .pivot("productId", [2, 3])

 
 
 
 
 

QUESTION 78
The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error.
Code block:

 
 
 
 
 
 

QUESTION 79
Which of the following code blocks returns a new DataFrame in which column attributes of DataFrame itemsDf is renamed to feature0 and column supplier to feature1?

 
 
 
 
 

QUESTION 80
Which of the following code blocks performs an inner join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively, excluding columns value and storeId from DataFrame transactionsDf and column attributes from DataFrame itemsDf?

 
 
 
 
 

QUESTION 81
Which of the following describes the role of tasks in the Spark execution hierarchy?

 
 
 
 
 

QUESTION 82
The code block displayed below contains an error. The code block should return a new DataFrame that only contains rows from DataFrame transactionsDf in which the value in column predError is at least 5. Find the error.
Code block:
transactionsDf.where("col(predError) >= 5")

 
 
 
 
 

QUESTION 83
The code block displayed below contains an error. The code block should write DataFrame transactionsDf as a parquet file to location filePath after partitioning it on column storeId. Find the error.
Code block:
transactionsDf.write.partitionOn("storeId").parquet(filePath)

 
 
 
 
 

QUESTION 84
Which of the following code blocks efficiently converts DataFrame transactionsDf from 12 into 24 partitions?

 
 
 
 
 

QUESTION 85
Which of the following code blocks removes all rows in the 6-column DataFrame transactionsDf that have missing data in at least 3 columns?

 
 
 
 
 

QUESTION 86
Which of the following options describes the responsibility of the executors in Spark?

 
 
 
 
 

QUESTION 87
Which of the following code blocks displays the 10 rows with the smallest values of column value in DataFrame transactionsDf in a nicely formatted way?

 
 
 
 
 

QUESTION 88
Which of the following code blocks returns a one-column DataFrame of all values in column supplier of DataFrame itemsDf that do not contain the letter X? In the DataFrame, every value should only be listed once.
Sample of DataFrame itemsDf:
+------+--------------------+--------------------+-------------------+
|itemId|            itemName|          attributes|           supplier|
+------+--------------------+--------------------+-------------------+
|     1|Thick Coat for Wa...|[blue, winter, cozy]|Sports Company Inc.|
|     2|Elegant Outdoors ...|[red, summer, fre...|              YetiX|
|     3|   Outdoors Backpack|[green, summer, t...|Sports Company Inc.|
+------+--------------------+--------------------+-------------------+

 
 
 
 
 

QUESTION 89
Which of the following statements about lazy evaluation is incorrect?

 
 
 
 
 

QUESTION 90
The code block shown below should add column transactionDateForm to DataFrame transactionsDf. The column should express the unix-format timestamps in column transactionDate as string type like Apr 26 (Sunday). Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, from_unixtime(__3__, __4__))

 
 
 
 
 

QUESTION 91
Which of the following code blocks returns a copy of DataFrame transactionsDf that only includes columns transactionId, storeId, productId and f?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+

 
 
 
 
 
