
Best way to read multiple paths in Spark with Scala

Sharing a clean way to read multiple paths in Spark using Scala

While working with Apache Spark, we often encounter a scenario in which we have to read multiple paths in a single job. The paths may point to S3, GCS, HDFS, or the local file system. A common habit is to collect all the paths that need to be read and then arrange them in a text editor into the format Spark requires.

Spark accepts multiple paths as read input in the following format:

val df = spark.read.parquet("/path/file1", "/path/file2", "/path/file3")

This is the content of the file /home/test/temp.txt:
"hdfs://tmp/dir1/file1.parquet","hdfs://tmp/dir2/file2.parquet","hdfs://tmp/dir3/file3.parquet","hdfs://tmp/dir4/file4.parquet"

The above code works when the exact paths are passed as separate double-quoted, comma-separated arguments. The catch is that if you pass the content of temp.txt as a single string, the code throws an error. Below is a snippet that reads the file and passes its content, the multiple double-quoted, comma-separated paths, to Spark as one string.
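The reason is that DataFrameReader.parquet takes a varargs parameter, def parquet(paths: String*), so each path must arrive as its own argument. A single string containing commas and quotation marks is treated as one literal path, which is why the URI parser rejects it. A minimal sketch of the difference, using two of the paths from temp.txt:

// works: each path is a separate varargs argument
val ok = spark.read.parquet(
  "hdfs://tmp/dir1/file1.parquet",
  "hdfs://tmp/dir2/file2.parquet")

// fails: the whole line, quotes and commas included,
// is interpreted as a single (invalid) path
val bad = spark.read.parquet(
  "\"hdfs://tmp/dir1/file1.parquet\",\"hdfs://tmp/dir2/file2.parquet\"")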

// running the code in spark-shell; it will throw an illegal character exception
scala>  import scala.io.Source
import scala.io.Source

scala>  val bufferedSource = Source.fromFile("/home/test/temp.txt")
bufferedSource: scala.io.BufferedSource = non-empty iterator

scala>  for( line <- bufferedSource.getLines ){
    |          spark.read.parquet(line).count()
    |                }

The above code throws the error shown below.

23/01/29 12:36:13 WARN streaming.FileStreamSink: Error while looking for metadata directory.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: "hdfs://tmp/dir1/file1.parquet%22,%22hdfs://tmp/dir2/file2.parquet%22,%22hdfs://tmp/dir3/file3.parquet%22,%22hdfs://tmp/dir4/file4.parquet%22
  at org.apache.hadoop.fs.Path.initialize(Path.java:259)
  at org.apache.hadoop.fs.Path.<init>(Path.java:217)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:560)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:559)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:355)
  at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:559)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:243)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:231)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:668)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:652)
  at $anonfun$1.apply(<console>:29)
  at $anonfun$1.apply(<console>:27)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  ... 52 elided
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 0: "hdfs://tmp/dir1/file1.parquet%22,%22hdfs://tmp/dir2/file2.parquet%22,%22hdfs://tmp/dir3/file3.parquet%22,%22hdfs://tmp/dir4/file4.parquet%22
  at java.net.URI$Parser.fail(URI.java:2848)
  at java.net.URI$Parser.checkChars(URI.java:3021)
  at java.net.URI$Parser.checkChar(URI.java:3031)
  at java.net.URI$Parser.parse(URI.java:3047)
  at java.net.URI.<init>(URI.java:746)
  at org.apache.hadoop.fs.Path.initialize(Path.java:256)
  ... 70 more

In most cases, the code takes multiple paths from a file or a database table as input for the Spark job.

The code below takes multiple paths as input in Spark, reading them from the file temp.txt. The same approach works when the paths come from a database table, as shown in the sketch after this block.

// import needed to read the local file
import scala.io.Source

// read the text file containing the quoted, comma-separated paths
val bufferedSource = Source.fromFile("/home/test/temp.txt")
// iterate over the lines of the file one by one
for (line <- bufferedSource.getLines) {
  // strip the double quotes from the line
  val cleanedLine = line.replace("\"", "")
  // split on commas to build a list containing all paths
  val listOfPaths = cleanedLine.split(",").map(_.trim).toList
  // pass the list of paths to Spark as varargs
  val batchDF = spark.read.parquet(listOfPaths: _*)
}
// release the file handle
bufferedSource.close()
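For completeness, here is a minimal sketch of the database variant, assuming a JDBC-accessible table paths_table with a single string column path; the table name, column name, and connection details are hypothetical:

// a minimal sketch: read the paths from a database table via JDBC
// paths_table, its column "path", and the connection URL are assumed names
val pathsDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/metadata") // assumed URL
  .option("dbtable", "paths_table")                      // assumed table
  .option("user", "user")
  .option("password", "password")
  .load()

// collect the path strings to the driver and pass them as varargs
val listOfPaths = pathsDF.select("path").collect().map(_.getString(0)).toList
val batchDF = spark.read.parquet(listOfPaths: _*)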

Conclusion

With this approach, no manual decoration of paths in a text editor is required. This should make it easy to pass multiple paths to Spark as input. Hope this helps.

The MtroView team is giving its best and working hard to provide content that helps you understand things better in simple terms. We are heartily open to suggestions for improving our content to help more people. Please write to us in a comment or send an email to mtroview@gmail.com.
