Change Data Capture

Hello Guys,

Hope you are all doing well!

I am working on the change data capture functionality (version 11.7 and above) and have been able to do the basic setup with the tracking, policy, and cdc_table-name tables. The CDC documentation mentions that "multi-threading" is available in CDC when it is compared with OpenEdge Auditing.

I tried to find where multi-threading comes into play in CDC but not in OpenEdge Auditing, since CDC is essentially about storing before/after values in the CDC/shadow tables.

Please share your inputs/suggestions.

Regards,
 
Hello Rob Fitzpatrick,

Thanks for your reply!

Actually, I am now not able to find the multi-threading information in the CDC documentation, but I read somewhere that the performance of CDC is much better than that of OpenEdge Auditing because it uses multi-threading.

Is there any difference in how data is written to the auditing tables (_aud-audit-data, _aud-audit-data-value) versus the CDC tables (cdc_tracking_table, cdc_table-name) with respect to multi-threading? That is, does any parallel processing (multi-threading) happen when data changes (database CUD operations) are written to their respective CDC tables?

Please share your inputs on the same.

Regards,
 

Cringer

ProgressTalk.com Moderator
Staff member
I suspect you're mixing up CDC vs Auditing and 11.7 vs 12.2. There have been a lot of changes, including multi-threaded database processes, in 12.
 

TomBascom

Curmudgeon
I’m pretty sure that, if you ever heard anything like that, the source of those statements was somewhat confused.

At a very high level there are some similarities between auditing and cdc. At least one commercial product exists that uses auditing for ETL purposes. But there are also some significant differences:

- Auditing (usually) depends on proutil to move data to a reporting database and to purge stale data
- CDC requires you to write your own 4gl programs to export the changes and to purge any stale data

Perhaps the confusion arises from the ability to provide your own 4gl code to manage CDC. In a sense that means that, yes, you could "multi-thread" that process (not really "threads" per se but rather "processes"). But you could, in theory (see the mention of a commercial product above), do that with auditing as well.
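Tom's idea of running several export processes in parallel can be sketched in Python (a loose analogy only; a real CDC export would be written in ABL, and every table name and record below is invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for per-table CDC change tables.
change_tables = {
    "cdc_customer": [{"op": "create", "custnum": 1}, {"op": "update", "custnum": 2}],
    "cdc_order":    [{"op": "delete", "ordernum": 7}],
    "cdc_item":     [{"op": "update", "itemnum": 42}],
}

def export_and_purge(table_name):
    """Export all pending changes for one table, then purge them.
    Each table gets its own worker, mirroring the idea of running
    one export process per change table."""
    exported = list(change_tables[table_name])   # the "export" step
    change_tables[table_name].clear()            # the "purge" step
    return table_name, len(exported)

# One worker per change table: the "multi-process" export Tom describes.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(export_and_purge, change_tables))

print(results)  # {'cdc_customer': 2, 'cdc_order': 1, 'cdc_item': 1}
```

The parallelism here is across tables, not within the database's own write path, which is exactly the distinction being drawn in this thread.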

As for performance gains? The two technologies are not *usually* used for the same purpose so making that kind of comparison is not very helpful. In the cases where there might be some overlap I wouldn't bet very much against proutil. The proutil utility can cheat in ways that the 4gl cannot.

It is more reasonable to speak of performance gains when comparing CDC to the old "replication trigger"-based Pro2. At least that is apples to apples in terms of the intended use of the software. Is the CDC version of Pro2 faster than the old trigger-based code? That is harder to answer. I have not personally tested it. And if you were going to test it you should be aware that almost every install of the trigger-based stuff is unique. So what you see at site A may have little or no bearing on the results at site B.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
I suspect what you may have read is that there are some high-level similarities between OE Auditing and CDC, in that both can be configured to give you information about changes to tables, and that CDC is a more lightweight solution. I think there is some truth to that, if you just want information about table changes, although don't be tempted to view them as equivalent features. If you need a robust, non-repudiable audit trail, only OE Auditing provides that.

As Tom points out, these features typically only constitute the front half of a complete solution for a particular business need. Typically OE Auditing facilitates building an audit application that allows an auditor to view an audit trail of application activity. But once the data is written to the audit tables, it should be dumped from the audit tables in the production environment (via proutil -C auditarchive) and loaded into an audit archive database where it would then be read by an audit-specific application.

The typical use case for CDC would be capturing information about changes to certain tables in a database in individual table-specific change tables, as well as overall changes in a change-tracking table. These can be queried by some process that polls them for changes and then acts on them in whatever manner is desired (e.g. feeding changes to a flat file, third-party application, API, Kafka stream, whatever is appropriate, and then deleting those change records).
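A minimal sketch of that polling pattern in Python (hypothetical names and data throughout; a real implementation would read the cdc_tracking_table and the per-table change tables via ABL or SQL, and the sink could be a flat file, Kafka producer, API call, etc.):

```python
import json

# Hypothetical stand-ins for the CDC tracking table and one change table.
tracking_table = [
    {"seq": 1, "table": "customer"},
    {"seq": 2, "table": "customer"},
]
change_table = {
    1: {"op": "update", "custnum": 10, "before": {"balance": 100}, "after": {"balance": 150}},
    2: {"op": "create", "custnum": 11, "before": None, "after": {"balance": 0}},
}

def poll_once(sink):
    """Read pending change records in tracking order, hand each one to a
    sink (flat file, Kafka, third-party API, ...), then delete it."""
    processed = 0
    for entry in sorted(tracking_table, key=lambda e: e["seq"]):
        record = change_table.pop(entry["seq"])   # consume the change record
        sink.append(json.dumps({"table": entry["table"], **record}))
        processed += 1
    tracking_table.clear()   # purge the tracking rows we have consumed
    return processed

out = []
print(poll_once(out))  # 2
```

The essential shape is read-act-delete: the change tables are a queue, and it is the polling process's job to drain them.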

For the purpose of direct comparison, I suppose it is worth discussing the initial writes that happen in the production database. With CDC, writes are governed by per-table CDC policies that determine which tables are monitored for changes and what level of verbosity is required for each. That could range from something simple, like just tracking that a change has occurred, to more detailed information like the type of change and the list of changed fields and their before and after values. All of this write activity happens server-side; it is done by remote servers or by self-service ABL clients. In that sense, yes the writes are multi-threaded as there can be many such processes that are concurrently writing change-table records.

The OE Auditing feature has existed much longer (10.1A, versus 11.7 for CDC). Its operation is governed by audit events and audit policies. Broadly, there are three sets of event types that can be audited: CUD events (table creates/updates/deletes), system events (database logins, database configuration changes, etc.), and application events (where the audit data and audit context are provided by the application code running in the client). OEA has a somewhat ungainly data-encoding method for field values and it may involve writes to several different tables to record a single event. Importantly, application events are passed to the server from the client, so they can add to client/server traffic volume. As with CDC, event data will be written by servers or by self-service clients.

All that said, while it's fine to have an academic interest in how things work under the covers, I think your use of the features or add-ons should be governed by your business needs and not by whether they are multi-threaded. In general, multi-threading isn't a feature; it is a means to an end. "Multi-threaded" shouldn't be viewed as synonymous with "better" or with "runs faster".
 
Thanks a lot @Rob Fitzpatrick for this valuable information; this is what I was looking for:

"For the purpose of direct comparison, I suppose it is worth discussing the initial writes that happen in the production database. With CDC, writes are governed by per-table CDC policies that determine which tables are monitored for changes and what level of verbosity is required for each. That could range from something simple, like just tracking that a change has occurred, to more detailed information like the type of change and the list of changed fields and their before and after values. All of this write activity happens server-side; it is done by remote servers or by self-service ABL clients. In that sense, yes the writes are multi-threaded as there can be many such processes that are concurrently writing change-table records."

And this multi-threading can't be done in auditing because of its Application data model where application events are passed to the server from the client.

Regards,
 
Thanks for your reply @TomBascom .

So it means the statement below applies to auditing as well:

"All of this write activity happens server-side; it is done by remote servers or by self-service ABL clients. In that sense, yes the writes are multi-threaded as there can be many such processes that are concurrently writing change-table records."

Please suggest.
 

TomBascom

Curmudgeon
Write activity is write activity.

CDC writes to more tables.

That doesn't make it any more or less "multi-threaded".

Rob's comment about "in that sense" is important.

"Concurrent" is a bit of a misnomer as well. Writes are *very* carefully ordered (they have to be for rollback and recovery) and even when done across many tables there is really just a single "thread" of writing going on.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
And this multi-threading can't be done in auditing because of its Application data model
Please re-read my response carefully; that's not what I said.

OEA audits three different classes of events, and one of them (application events) originates client-side, so it adds more client-side traffic. That is additional overhead for the client/server connections. The servers still write the data.

But depending on the implementation, that might not matter at all. It is up to the implementer to decide which audit event types they care about and wish to create audit policies for. And some users may choose not to use application events at all, either because they require additional design and coding or just because they do not need to audit these events.

Please note that both CDC and OEA are very complicated features with a lot of configurable options and the devil is in the details in terms of their performance impact in a specific implementation. It really can't be analyzed meaningfully in a forum post.

And again, in my opinion, perceived performance differences shouldn't be the reason for choosing to use one feature over the other.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
"Concurrent" is a bit of a misnomer as well. Writes are *very* carefully ordered (they have to be for rollback and recovery) and even when done across many tables there is really just a single "thread" of writing going on.

Tom's point is well taken: all database writes, whether to application or system data structures (e.g. tables, indexes, sequences), happen first in the before-image subsystem, then in the after-image subsystem (if enabled), and then in the database (via the buffer pool).

BI writes are first done in the BI buffers and then later synchronized with the BI storage area. Specifically, BI writes are done in a single BI buffer at a time, called the current output buffer. While many different processes can attempt to write BI notes in a short span of time (and thus appear, macroscopically, to be writing simultaneously), in fact they are taking turns. In order for a process to write to the current output buffer, it must acquire a lock on the BIB latch so it has temporary (perhaps microseconds, or less) exclusive access to that buffer so it can do consistent reads and writes. Once it has finished its BI writes it releases the BIB latch and some other process may acquire it to write BI notes.

The process is similar for the AI buffers and files and the AIB latch.
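The latch behaviour described above can be modelled in Python (a loose analogy only: the real BIB latch is an in-memory synchronization primitive inside the database engine, not a Python lock, and these writer "processes" are just threads):

```python
import threading

bib_latch = threading.Lock()   # stands in for the BIB latch
current_output_buffer = []     # stands in for the current BI output buffer

def write_bi_notes(writer_id, note_count):
    """Each 'process' must hold the latch to append to the shared buffer,
    so writes that look simultaneous are actually taken in turns."""
    for n in range(note_count):
        with bib_latch:        # acquire exclusive access, then release
            current_output_buffer.append((writer_id, n))

threads = [threading.Thread(target=write_bi_notes, args=(w, 100)) for w in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Eight concurrent writers, but one well-ordered serialized stream of
# buffer writes: no notes are lost or interleaved mid-write.
print(len(current_output_buffer))  # 800
```

This illustrates the "single thread of writing" point: many processes can be active, but the latch ensures the BI note stream itself is written one note at a time.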

I believe this explains Tom's last comment. But he will correct me if it didn't, and I'll learn something. ;)
 