Facing issue while extracting the offset value from Event Hubs in Databricks

Ravi Sai Mahsiva 20 Reputation points
2025-10-16T14:11:49.0533333+00:00

Hi Team,

Could you please help us with code to extract the offset value from Event Hubs? I have attached a screenshot of the Event Hub details. We are using the code below, and I have even tried passing the offset column in the schema with matching case. In the final result (also attached), the offset column contains the sequence number instead. Can you please tell us how to extract the offset column value in Databricks using Spark?

[Screenshots attached: Event Hub details and query output]


from pyspark.sql.types import StructType, StructField, StringType, TimestampType

def get_geofence_event_schema():
    # Schema applied to the JSON event body. A field is only populated
    # when the payload actually contains a matching key.
    return StructType([
        StructField("id", StringType(), True),
        StructField("type", StringType(), True),
        StructField("source", StringType(), True),
        StructField("specversion", StringType(), True),
        StructField("time", TimestampType(), True),
        StructField("datacontenttype", StringType(), True),
        StructField("pubsubname", StringType(), True),
        StructField("topic", StringType(), True),
        StructField("traceid", StringType(), True),
        StructField("traceparent", StringType(), True),
        StructField("tracestate", StringType(), True),
        StructField("Offset", StringType(), True),
        StructField("data", StringType(), True)
    ])


from pyspark.sql.functions import col, from_json

df_batch = (
    spark.read
        .format("kafka")
        .options(**event_hub_config)
        .load()
        .limit(10)
        .withColumn(
            "geofenceevent",
            from_json(col("value").cast("string"), get_geofence_event_schema()),
        )
)

display(df_batch)

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Answer accepted by question author
  1. Manoj Kumar Boyini 330 Reputation points Microsoft External Staff Moderator
    2025-10-16T16:07:33.41+00:00

    Hi Ravi Sai Mahsiva,

    Welcome to Microsoft Q&A Platform. Thank you for reaching out & hope you are doing well. 

    I had a look at your code and the output, and I can see what’s happening here.

    The offset value you see in your Databricks output (like 58) is coming from the Kafka-compatible interface that Event Hubs uses. This is a Kafka-style offset, which just tracks the message position inside each partition. It’s not the same as the Event Hubs offset that you see in the Azure portal (the large number like 55834574848).

    Right now, in your code, you’re reading Event Hubs using the .format("kafka") option, which is why you only get the Kafka offset. If you want to get the actual Event Hubs offset, that value is stored inside the system properties of each event — it’s not part of your message payload.

    You can read that offset by switching to the Event Hubs connector (azure-eventhubs-spark) instead of the Kafka one. That connector exposes the Event Hubs offset directly as a metadata column on the DataFrame:

    from pyspark.sql.functions import col

    # Requires the azure-eventhubs-spark connector library on the cluster.
    event_hub_df = (
        spark.read
            .format("eventhubs")
            .options(**event_hub_config)
            .load()
    )

    # The connector surfaces Event Hubs metadata (offset, sequenceNumber,
    # enqueuedTime, ...) as top-level columns; "offset" is the same large
    # value you see in the Azure portal.
    df = event_hub_df.select(
        col("body").cast("string").alias("value"),
        col("offset").alias("eventhub_offset")
    )

    display(df)

    Also, in your current schema, you added an "Offset" field inside the JSON structure, but that field doesn't actually exist in your event data, which is why it always shows as null.

    In short:

    - The small offset you see now is from Kafka, not Event Hubs.
    - The large offset in the Azure portal is an Event Hubs system property.
    - To get that value, use the Event Hubs connector and read the offset column it exposes.
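    To illustrate the null point above with a quick plain-Python sketch (the payload below is hypothetical sample data, not your actual event):

```python
import json

# Hypothetical geofence event body; the real fields come from your
# producer. Note that the payload has no "Offset" key at all.
sample_body = json.dumps({
    "id": "evt-001",
    "type": "geofence.entered",
    "source": "demo",
})

parsed = json.loads(sample_body)

# from_json behaves the same way: a schema field with no matching JSON
# key comes back as null, which is why the "Offset" column was empty --
# the number that did appear came from Kafka metadata, not the body.
print("Offset" in parsed)  # prints False
```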

    Kindly let us know if the above helps or if you need further assistance with this issue.

    If the answer is helpful, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".

    Thanks,
    Manoj
