Partitioning CSV files creates extra log files

Rakesh Kumar 45 Reputation points
2023-11-30T07:14:37.42+00:00

I am partitioning CSV files and storing them in Azure Data Lake. The destination contains:

_committed_138917450370135985

_started_138917450370135985

_SUCCESS

part-00000-tid-138917450370135985-822eee2b-508b-46ea-9ed6-c426f350d05c-223-1-c000.csv

I only want a single file, named table.csv.

I don't want the _committed, _started, or _SUCCESS files.

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

2 answers

  1. PRADEEPCHEEKATLA 91,321 Reputation points Moderator
    2023-12-01T05:12:28.5033333+00:00

    @Rakesh Kumar - Thanks for the question and using MS Q&A platform.

    This is expected behaviour: any Spark job that writes output creates these files.

    When DBIO transactional commit is enabled, metadata files starting with _started_ and _committed_ will accompany the data files created by Apache Spark jobs. Generally you shouldn't alter these files directly; rather, use the VACUUM command to clean them up.
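    For instance, a minimal hedged example of that cleanup (the output path here is hypothetical; the VACUUM-on-a-path form follows the DBIO docs referenced below):

    # Hypothetical output path; removes leftover transactional files older
    # than the retention window (syntax per the DBIO transactional writes docs).
    spark.sql("VACUUM '/mnt/tmp/export_csv' RETAIN 168 HOURS")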

    A combination of the three properties below will disable writing all the transactional files that start with "_".

    We can disable the transaction logs of Spark file writes using:

    spark.sql.sources.commitProtocolClass = org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol

    This disables the _committed_* and _started_* files, but the _SUCCESS, _common_metadata, and _metadata files will still be generated.

    We can disable the _common_metadata and _metadata files using:

    parquet.enable.summary-metadata=false

    We can also disable the _SUCCESS file using:

    mapreduce.fileoutputcommitter.marksuccessfuljobs=false
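    Putting the three together, here is a minimal PySpark sketch (assuming an active spark session and an existing DataFrame df; whether these properties fully suppress the files can vary by Databricks runtime version):

    # Sketch: apply all three properties before writing (runtime-dependent behaviour)
    spark.conf.set(
        "spark.sql.sources.commitProtocolClass",
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
    )
    spark.conf.set("parquet.enable.summary-metadata", "false")
    spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

    # Illustrative write; the output directory is hypothetical
    df.coalesce(1).write.mode("overwrite").option("header", "true").csv("/mnt/tmp/export_csv")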

    For more details, refer to "Transactional Writes to Cloud Storage with DBIO", "Stop Azure Databricks auto creating files", and "How do I prevent _success and _committed files in my write output?".

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "was this answer helpful".

    1 person found this answer helpful.

  2. PRADEEPCHEEKATLA 91,321 Reputation points Moderator
    2025-09-25T23:28:53.22+00:00

    Rakesh Kumar - Thanks for the question and using MS Q&A platform.

    When exporting a Delta table from ADLS Gen2 to a CSV file using DataFrame.coalesce(1) or DataFrame.repartition(1) in Azure Databricks, Spark automatically creates additional files such as:

    • _SUCCESS – indicates the job completed successfully.
    • _committed_* / _started_* – used internally by Spark’s commit protocol to ensure atomic and fault-tolerant writes.

    Note: In practice these files generally cannot be suppressed via Spark configuration on current runtimes; they are part of the standard write behavior. The recommended solution is to post-process the output using dbutils.fs to isolate and move the actual CSV file.

    To avoid exposing these extra files to downstream systems or users, the best practice is to:

    1. Write the CSV to a temporary directory
    2. Identify the actual CSV file (e.g., part-00000-*.csv)
    3. Use dbutils.fs.cp() or dbutils.fs.mv() to move or copy only the CSV file to the final destination
    4. Optionally delete the temporary directory

    Here is the sample code:

    # Step 1: Write to a temporary location
    df.coalesce(1).write.mode("overwrite").option("header", "true").csv("/mnt/tmp/export_csv")
    
    # Step 2: Identify the CSV file
    files = dbutils.fs.ls("/mnt/tmp/export_csv")
    csv_file = [f.path for f in files if f.path.endswith(".csv")][0]  # assumes exactly one part file, guaranteed by coalesce(1)
    
    # Step 3: Move the CSV file to final location
    dbutils.fs.mv(csv_file, "/mnt/final/export/data.csv")
    
    # Step 4: Clean up temporary directory
    dbutils.fs.rm("/mnt/tmp/export_csv", recurse=True)
    

    Here is a quick demo 😊:

    This will indeed write:

    • One CSV file (e.g., part-00000-*.csv)
    • Along with _SUCCESS and possibly _committed_* files

    from pyspark.sql import Row
    
    # Create a sample DataFrame
    data = [Row(id=1, name="Alice"), Row(id=2, name="Bob")]
    df = spark.createDataFrame(data)
    
    # Write the DataFrame to the specified path as CSV in a single partition
    df.coalesce(1).write.mode("overwrite").csv("/Volumes/XXXXX/XXXX/XXX/XXX/One")
    

    (Screenshot: directory listing showing the single part-00000-*.csv file together with the _SUCCESS and commit-marker files.)

    ✅ To isolate the CSV file and remove the extras, here's the complementary PySpark code:

    
    from pyspark.sql import Row
    
    # Create a sample DataFrame
    data = [Row(id=1, name="Alice"), Row(id=2, name="Bob")]
    df = spark.createDataFrame(data)
    
    # Define paths
    temp_path = "/Volumes/XXXXX/XXXX/XXX/XXX/Two/tmp"
    final_path = "/Volumes/XXXXX/XXXX/XXX/XXX/Two/data.csv"
    
    # Step 1: Write to temp directory
    df.coalesce(1).write.mode("overwrite").option("header", "true").csv(temp_path)
    
    # Step 2: Identify the actual CSV file
    files = dbutils.fs.ls(temp_path)
    csv_file = [f.path for f in files if f.path.endswith(".csv")][0]  # assumes exactly one part file, guaranteed by coalesce(1)
    
    # Step 3: Move the CSV file to final destination
    dbutils.fs.mv(csv_file, final_path)
    
    # Step 4: Clean up the temporary directory
    dbutils.fs.rm(temp_path, recurse=True)
    
    # Display files in folder
    
    display(dbutils.fs.ls('/Volumes/XXXXX/XXXX/XXX/XXX/Two'))
    

    (Screenshot: final folder containing only data.csv.)

    Hope this information is helpful. Please feel free to reach out if you have any further questions or need additional assistance. If this response addresses your query, kindly consider clicking 'Upvote' and selecting 'Accept Answer', as this may help other community members who come across this thread.


    To stay informed about the latest updates and insights on Azure Databricks, data engineering, and Data & AI innovations, follow me on LinkedIn.

