Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Delta Lake lets you update the schema of a table. The following types of changes are supported:
- Adding new columns (at arbitrary positions)
- Reordering existing columns
- Renaming existing columns
You can make these changes explicitly using DDL or implicitly using DML.
Important
An update to a Delta table schema is an operation that conflicts with all concurrent Delta write operations.
When you update a Delta table schema, streams that read from that table terminate. If you want the stream to continue you must restart it. For recommended methods, see Production considerations for Structured Streaming.
Explicitly update schema to add columns
ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name], ...)
By default, nullability is true.
To add a column to a nested field, use:
ALTER TABLE table_name ADD COLUMNS (col_name.nested_col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name], ...)
For example, if the schema before running ALTER TABLE boxes ADD COLUMNS (colB.nested STRING AFTER field1) is:
- root
| - colA
| - colB
| +-field1
| +-field2
the schema after is:
- root
| - colA
| - colB
| +-field1
| +-nested
| +-field2
Note
Adding nested columns is supported only for structs. Arrays and maps are not supported.
Explicitly update schema to change column comment or ordering
ALTER TABLE table_name ALTER [COLUMN] col_name (COMMENT col_comment | FIRST | AFTER colA_name)
To change a column in a nested field, use:
ALTER TABLE table_name ALTER [COLUMN] col_name.nested_col_name (COMMENT col_comment | FIRST | AFTER colA_name)
For example, if the schema before running ALTER TABLE boxes ALTER COLUMN colB.field2 FIRST is:
- root
| - colA
| - colB
| +-field1
| +-field2
the schema after is:
- root
| - colA
| - colB
| +-field2
| +-field1
Explicitly update schema to replace columns
ALTER TABLE table_name REPLACE COLUMNS (col_name1 col_type1 [COMMENT col_comment1], ...)
For example, when running the following DDL:
ALTER TABLE boxes REPLACE COLUMNS (colC STRING, colB STRUCT<field2:STRING, nested:STRING, field1:STRING>, colA STRING)
if the schema before is:
- root
| - colA
| - colB
| +-field1
| +-field2
the schema after is:
- root
| - colC
| - colB
| +-field2
| +-nested
| +-field1
| - colA
Explicitly update schema to rename columns
Note
This feature is available in Databricks Runtime 10.4 LTS and above.
To rename columns without rewriting any of the columns' existing data, you must enable column mapping for the table. See Rename and drop columns with Delta Lake column mapping.
To rename a column:
ALTER TABLE table_name RENAME COLUMN old_col_name TO new_col_name
To rename a nested field:
ALTER TABLE table_name RENAME COLUMN col_name.old_nested_field TO new_nested_field
For example, when you run the following command:
ALTER TABLE boxes RENAME COLUMN colB.field1 TO field001
If the schema before is:
- root
| - colA
| - colB
| +-field1
| +-field2
Then the schema after is:
- root
| - colA
| - colB
| +-field001
| +-field2
See Rename and drop columns with Delta Lake column mapping.
Explicitly update schema to drop columns
Note
This feature is available in Databricks Runtime 11.3 LTS and above.
To drop columns as a metadata-only operation without rewriting any data files, you must enable column mapping for the table. See Rename and drop columns with Delta Lake column mapping.
Important
Dropping a column from metadata does not delete the underlying data for the column in files. To purge the dropped column data, you can use REORG TABLE to rewrite files. You can then use VACUUM to physically delete the files that contain the dropped column data.
To drop a column:
ALTER TABLE table_name DROP COLUMN col_name
To drop multiple columns:
ALTER TABLE table_name DROP COLUMNS (col_name_1, col_name_2)
Explicitly update schema to change column type or name
You can change a column's type or name or drop a column by rewriting the table. To do this, use the overwriteSchema option.
The following example shows changing a column type:
(spark.read.table(...)
  .withColumn("birthDate", col("birthDate").cast("date"))
  .write
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .saveAsTable(...)
)
The following example shows changing a column name:
(spark.read.table(...)
  .withColumnRenamed("dateOfBirth", "birthDate")
  .write
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .saveAsTable(...)
)
Enable schema evolution
You can enable schema evolution by doing one of the following:
- Set the .option("mergeSchema", "true")to a Spark DataFramewriteorwriteStreamoperation. See Enable schema evolution for writes to add new columns.
- Use MERGE WITH SCHEMA EVOLUTIONsyntax. See Schema evolution syntax for merge.
- Set the Spark conf spark.databricks.delta.schema.autoMerge.enabledtotruefor the current SparkSession.
Databricks recommends enabling schema evolution for each write operation rather than setting a Spark conf.
When you use options or syntax to enable schema evolution in a write operation, this takes precedence over the Spark conf.
Note
There is no schema evolution clause for INSERT INTO statements.
Enable schema evolution for writes to add new columns
Columns that are present in the source query but missing from the target table are automatically added as part of a write transaction when schema evolution is enabled. See Enable schema evolution.
Case is preserved when appending a new column. New columns are added to the end of the table schema. If the additional columns are in a struct, they are appended to the end of the struct in the target table.
The following example demonstrates using the mergeSchema option with Auto Loader. See What is Auto Loader?.
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
  .load("<path-to-source-data>")
  .writeStream
  .option("mergeSchema", "true")
  .option("checkpointLocation", "<path-to-checkpoint>")
  .trigger(availableNow=True)
  .toTable("table_name")
)
The following example demonstrates using the mergeSchema option with a batch write operation:
(spark.read
  .table(source_table)
  .write
  .option("mergeSchema", "true")
  .mode("append")
  .saveAsTable("table_name")
)
Automatic schema evolution for Delta Lake merge
Schema evolution allows you to resolve schema mismatches between the target and source table in merge. It handles the following two cases:
- A column exists in the source table but not the target table. - That column will be added to the target schema, and its values will be populated from the corresponding column in the source. - This only applies when the column name and structure in the merge source exactly match the target assignment. 
- The new column must be directly assigned from the source, without expressions, renaming, or transformations. 
 - These examples allow schema evolution: - UPDATE SET target.newcol = source.newcol -- The column newcol will be added to target. UPDATE SET target.somestruct.newfield = source.somestruct.newfield -- The field newfield will be added in struct column somestruct in target. UPDATE SET * -- Any columns and nested fields in source that don't exist in target will be added to target.- These examples do not trigger schema evolution: - UPDATE SET target.newcol = source.someothercol UPDATE SET target.newcol = source.x + source.y UPDATE SET target.newcol = source.output.newcol
- A column exists in the target table but not the source table. - The target schema is not changed. These columns: - Are left unchanged for - UPDATE SET *.
- Are set to - NULLfor- INSERT *.
- May still be explicitly modified if assigned in the action clause. 
 - For example: - UPDATE SET * -- The target columns that are not in the source are left unchanged. INSERT * -- The target columns that are not in the source are set to NULL. UPDATE SET target.onlyintarget = 5 -- The target column is explicitly updated. UPDATE SET target.onlyintarget = source.someothercol -- The target column is explicitly updated from some other source column.
You must manually enable automatic schema evolution. See Enable schema evolution.
Note
In Databricks Runtime 12.2 LTS and above, columns and struct fields present in the source table can be specified by name in insert or update actions. In Databricks Runtime 11.3 LTS and below, only INSERT * or UPDATE SET * actions can be used for schema evolution with merge.
In Databricks Runtime 13.3 LTS and above, you can use schema evolution with structs nested inside maps, such as map<int, struct<a: int, b: int>>.
Schema evolution syntax for merge
In Databricks Runtime 15.4 LTS and above, you can specify schema evolution in a merge statement using SQL or Delta table APIs:
SQL
MERGE WITH SCHEMA EVOLUTION INTO target
USING source
ON source.key = target.key
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
WHEN NOT MATCHED BY SOURCE THEN
  DELETE
Python
from delta.tables import *
(targetTable
  .merge(sourceDF, "source.key = target.key")
  .withSchemaEvolution()
  .whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .whenNotMatchedBySourceDelete()
  .execute()
)
Scala
import io.delta.tables._
targetTable
  .merge(sourceDF, "source.key = target.key")
  .withSchemaEvolution()
  .whenMatched()
  .updateAll()
  .whenNotMatched()
  .insertAll()
  .whenNotMatchedBySource()
  .delete()
  .execute()
Example operations of merge with schema evolution
Here are a few examples of the effects of merge operation with and without schema evolution.
| Columns | Query (in SQL) | Behavior without schema evolution (default) | Behavior with schema evolution | 
|---|---|---|---|
| Target columns: key, valueSource columns: key, value, new_value | MERGE INTO target_table tUSING source_table sON t.key = s.keyWHEN MATCHED  THEN UPDATE SET *WHEN NOT MATCHED  THEN INSERT * | The table schema remains unchanged; only columns key,valueare updated/inserted. | The table schema is changed to (key, value, new_value). Existing records with matches are updated with thevalueandnew_valuein the source. New rows are inserted with the schema(key, value, new_value). | 
| Target columns: key, old_valueSource columns: key, new_value | MERGE INTO target_table tUSING source_table sON t.key = s.keyWHEN MATCHED  THEN UPDATE SET *WHEN NOT MATCHED  THEN INSERT * | UPDATEandINSERTactions throw an error because the target columnold_valueis not in the source. | The table schema is changed to (key, old_value, new_value). Existing records with matches are updated with thenew_valuein the source leavingold_valueunchanged. New records are inserted with the specifiedkey,new_value, andNULLfor theold_value. | 
| Target columns: key, old_valueSource columns: key, new_value | MERGE INTO target_table tUSING source_table sON t.key = s.keyWHEN MATCHED  THEN UPDATE SET new_value = s.new_value | UPDATEthrows an error because columnnew_valuedoes not exist in the target table. | The table schema is changed to (key, old_value, new_value). Existing records with matches are updated with thenew_valuein the source leavingold_valueunchanged, and unmatched records haveNULLentered fornew_value. See note (1). | 
| Target columns: key, old_valueSource columns: key, new_value | MERGE INTO target_table tUSING source_table sON t.key = s.keyWHEN NOT MATCHED  THEN INSERT (key, new_value) VALUES (s.key, s.new_value) | INSERTthrows an error because columnnew_valuedoes not exist in the target table. | The table schema is changed to (key, old_value, new_value). New records are inserted with the specifiedkey,new_value, andNULLfor theold_value. Existing records haveNULLentered fornew_valueleavingold_valueunchanged. See note (1). | 
(1) This behavior is available in Databricks Runtime 12.2 LTS and above; Databricks Runtime 11.3 LTS and below error in this condition.
Exclude columns with Delta Lake merge
In Databricks Runtime 12.2 LTS and above, you can use EXCEPT clauses in merge conditions to explicitly exclude columns. The behavior of the EXCEPT keyword varies depending on whether or not schema evolution is enabled.
With schema evolution disabled, the EXCEPT keyword applies to the list of columns in the target table and allows excluding columns from UPDATE or INSERT actions. Excluded columns are set to null.
With schema evolution enabled, the EXCEPT keyword applies to the list of columns in the source table and allows excluding columns from schema evolution. A new column in the source that is not present in the target is not added to the target schema if it is listed in the EXCEPT clause. Excluded columns that are already present in the target are set to null.
The following examples demonstrate this syntax:
| Columns | Query (in SQL) | Behavior without schema evolution (default) | Behavior with schema evolution | 
|---|---|---|---|
| Target columns: id, title, last_updatedSource columns: id, title, review, last_updated | MERGE INTO target tUSING source sON t.id = s.idWHEN MATCHED  THEN UPDATE SET last_updated = current_date()WHEN NOT MATCHED  THEN INSERT * EXCEPT (last_updated) | Matched rows are updated by setting the last_updatedfield to the current date. New rows are inserted using values foridandtitle. The excluded fieldlast_updatedis set tonull. The fieldreviewis ignored because it is not in the target. | Matched rows are updated by setting the last_updatedfield to the current date. Schema is evolved to add the fieldreview. New rows are inserted using all source fields exceptlast_updatedwhich is set tonull. | 
| Target columns: id, title, last_updatedSource columns: id, title, review, internal_count | MERGE INTO target tUSING source sON t.id = s.idWHEN MATCHED  THEN UPDATE SET last_updated = current_date()WHEN NOT MATCHED  THEN INSERT * EXCEPT (last_updated, internal_count) | INSERTthrows an error because columninternal_countdoes not exist in the target table. | Matched rows are updated by setting the last_updatedfield to the current date. Thereviewfield is added to the target table, but theinternal_countfield is ignored. New rows inserted havelast_updatedset tonull. | 
Dealing with NullType columns in schema updates
Because Parquet doesn't support NullType, NullType columns are dropped from the DataFrame when writing into Delta tables, but are still stored in the schema. When a different data type is received for that column, Delta Lake merges the schema to the new data type. If Delta Lake receives a NullType for an existing column, the old schema is retained and the new column is dropped during the write.
NullType in streaming is not supported. Since you must set schemas when using streaming this should be very rare. NullType is also not accepted for complex types such as ArrayType and MapType.
Replace table schema
By default, overwriting the data in a table does not overwrite the schema. When overwriting a table using mode("overwrite") without replaceWhere, you may still want to overwrite the schema of the data being written. You replace the schema and partitioning of the table by setting the overwriteSchema option to true:
df.write.option("overwriteSchema", "true")
Important
You cannot specify overwriteSchema as true when using dynamic partition overwrite.