Working with Databases

This blog is intended to provide the knowledge on database management in talend. We are going to explore the concepts such as Session Management, table and database management, using hashmaps key tables, and a lot more.

Database Session Management

The developer can manage how and when data is committed to a database using DATABASE SESSIONS. This recipe demonstrates how to accomplish this in Talend. Editing, duplicating, and adding a task to a database connection are just a few of the management options accessible.

These management choices are detailed in the sections below. 

Getting Ready

Open the job jo_cook_ch07_0060_databaseSession.

On closer inspection, you'll notice that the job is set up to commit after each record is written.

Become a Talend Certified professional by learning Talend Certification Course from hkrtrainings ! 

How to do it…

The following are the steps to be taken:

  • Execute the job. You'll see that the records are being added to the database quite slowly.
  • Kill the job. The records have been added to the database table testSession, which can be seen by inspecting the database table testSession.
  • Drag tMysqlConnection from the metadata panel and tMysqlCommit from the palette (this isn't available from the Repository panel) and connect them as follows:

Working with Databases

  • Open tMysqlOutput and select the Use an existing connection option. You'll notice that all of the connection data has been masked.
  • Before all ten records are processed, run the job and kill it. The table has no data in it, as you can tell if you look at it closely.
  • Run the job and wait for it to complete. The data has now been written to the database table.

How does it work?

The tMysqlConnection component connects to MySQL and starts a session. When you choose to Use an existing connection in an output component, the component is added to the session, assuring that the records written require explicit commit. The commit component closes the session and commits the writes to the database.

Executions

  • The job's first run demonstrates how each record is committed in an atomic transaction.
  • When a connection is attached to a record, the second execution reveals that records output without a commit will not be added to the database.
  • All of the records are committed as a single transaction in the final execution. 

Multiple Outputs

  • In order for a transaction to incorporate several tables, additional output components can be added to it. For instance, if the writer for an order item fails, we may want to stop the writing of the customer and order.
  • By assuring that all three output tables use the same connection, we can ensure that they are either committed as a single transaction or that no more dependent rows are written if any of the others fail.

[Related Articles: Talend Vs Informatica]

Remember the Commit

A common beginner's mistake is to believe that the connection just eliminates the need for manual component setup, which is correct, but it also initiates a database session. As a result, if you don't include a commit component, no data will be written to the database.

Committing but not Closing

A session is automatically closed when the commit component is set. If you want to commit while keeping a session open, the tMysqlCommit component includes a setting that allows you to keep the session open after the commit.

Passing a Session to a Child Job

A common real-world requirement is to pass a value from the parent Job to the child Job.

Environment

This procedure was written with:

  • Talend Open Studio for DI 5.0-r72978
  • JDK version: Sun JDK build 1.6.0_26-b03
  • Operating system: Windows XP SP3

The suggested procedure was created in the following environment.

  • Data Integration releases: 4.2.3, 5.0, 5.0.2, 5.1.1

Getting Ready

Open the Job

jo_cook_ch07_0070_databaseSessionParent

It is identical to the completed version from the previous recipe, but with a child job in place of the main process. On closer inspection, the kid job should have a connection set up, and it should be the same connection as the parent job.

Talend Training

  • Master Your Craft
  • Lifetime LMS & Faculty Access
  • 24/7 online expert support
  • Real-world & Project Based Learning

How do you achieve it?

The following are the steps to consider.:

  • Execute the job. When you look at the database table testSessionChild, you'll notice that no records have been added.
  • In the parent job, open tMysqlConnection.
  • Please check the box. Set the Shared DB Connection Name to "cookbook" when using or registering a shared DB Connection, as seen in the screenshot:

  • Carry out the identical procedure for the child’s job's connection.
  • Execute the job. When you look at the database table testSessionChild now, you'll notice that the records have been added.

How does it work?

The parent job's tMysqlConnection component makes a connection and starts a session, while the child job's tMysqlConnection component does the same. The issue with this scenario is that they both create separate sessions, resulting in no entries being committed to the database when the parent is executed, despite the fact that the child has written records.

When the parent of the "cookbook" defines a shared connection, the session information becomes available as a session in the child if we choose to use it, which we do in this case by using the shared connection registered by the parent.

Explore Talend Sample Resumes! Download & Edit, Get Noticed by Top Employers!

As a result, the parent and child connections are now shared in the same session, and when the command is sent, the records contributed by the child are likewise committed.

Inserting, Updating, and Deleting data using various fields and keys

Many apps will write to/delete from the same table in a variety of ways, with different fields serving as keys and updating different fields at different times. This recipe demonstrates how to accomplish this without having to construct new schemas every time.

Getting ready

Open the job jo_cook_ch07_0080_fieldSelection.

How do you achieve it?

The following are the steps to be taken:

  • Change the field Action on data from Insert to Insert or update within the tMysqlOutput component.
  • Now select Advanced settings.
  • To reveal the Field Options, check the box labeled "Use field options." All of the fields should be Insertable and Updatable, as you can see.
  • In the Updatable column, uncheck createdDate and createdBy.
  • In the column Insertable, uncheck updatedDate and updatedBy.
  • Finally, check for the column id in the Update Key column. As seen in the following screenshot, your Field options should now appear like this:

  • Execute the job and examine the table. The updatedDate and updatedBy fields are both blank, indicating that the entry has been created.
  • Remove the values for insertedDate and insertedBy from tFixedFlowInput.
  • To parseDate("yyyy-MM-dd","2012-05-22"), add a value to updatedDate.
  • Add the value "ZZ" to the updatedBy property.
  • Change customerName with "testCustomerNewName."
  • Execute the job and examine the table. You'll see that the name has been altered, as well as the updatedDate and updatedBy columns.

How does it work?

First, we choose whether to insert or update the insert technique. This allows us to use the same component to write to both new and existing records.

The first operation is an insert, which populates the createdDate and createdBy columns for the table's record while leaving the updatedDate and updatedBy columns empty.

Any subsequent write to this database for the supplied key is an update, therefore this will leave the createdDate fields blank and populate the updatedDate and updatedBy columns, as well as the new value of customerName.

There’s more…

This is a wonderful way to ensure that tasks follow pre-defined schemas, which encourages reusability and traceability, and it also enables us to update rows using a different set of key values based on the data we have.

Updating

Since Talend would be producing a SQL WHERE clause under the covers, any key can be used to conduct an update, not only the table's primary key. Simply select the fields you want to use as the column's key in the Field choices box, then click the Update key.

Deleting

You'll also see a column for defining the deletion key in the list of fields. The same technique applies to deleting rows as it does for updating them; however, in the Advanced settings tab, the column Deletion key should be used instead, and the Action on data in the Basic settings set to Delete.

Subscribe to our youtube channel to get new updates..!

Individual Rejections and Errors Are Captured

Many database applications need that a log of rejections and errors be kept in order to identify erroneous records and make manual corrections. This recipe demonstrates how to use the reject row facility to collect mistakes for particular rows.

Getting ready

Open the job jo_cook_ch07_0090_rejectsAndErrors.

How do you do it?

The following are the measures to take:

  • Create a tMysqlOutput using the customer table copied from metadata.
  • Change the table's name to "customer_reject_test," and the table's action to drop the table if it exists, then construct.
  • The only Row choice available when right-clicking the tMysqlOutput component is Main.
  • Execute the job. There are errors in the console, and the table is empty.
  • Select Advanced settings from the tMysqlOutput menu.
  • When you right-click the tMysqlOutput component, a flow labelled Rejects appears.
  • Send this flow to a tLogRow component with the Vertical Mode configured.
  • Execute your job. A rejection row with a duplicate key problem will now be printed in the console.
  • Apart from the rejected entry, you should also notice that the remainder of the input rows have been written to the database.

How does it work?

  • The tMysqlOutput inserts in bulk mode, known as extended insert, by default. This enables a single SQL statement to include many rows.
  • Our data contains a problematic row, which means that when we try to insert it, it, along with the other rows in the same group, is rejected.
  • When batch insert techniques are disabled, the tMysqlOutput component allows for the creation of a Reject flow. It provides a copy of the input record as well as a cause for rejection, same like other Talend Reject flows.
  • Thus, We can only see the Main row when we initially look at the row output from tMysqlOutput. We were then able to connect a rejected flow after turning off the Extend Insert option.
  • With the bulk insert option disabled, we can now reject individual rows, resulting in a successful second job execution, with valid records written to the table and rejects collected and reported in the console.

There’s more…

There are a few other things to consider:

Die from an Error

In order for rejects to be recorded using this approach, the option to Die on Error must also be deactivated in addition to any bulk insert methods.

Efficiency

The ability to reject rows without cancelling the task is really valuable and makes code and error management much easier, but it comes with a cost: the rows must be added one at a time.

As inserting rows one at a time is inefficient compared to employing bulk insertion methods or setting block sizes, you may find that there is a trade-off between loading rates and error reporting and management in some databases.

Error Handling

Individual rows being rejected makes troubleshooting considerably easier because the error for a single reject row is a one-to-one match.

Using batch insert techniques, such as the MySQL extended insert method, on the other hand, will return one error but reject the entire batch of rows, which is both bad and good.

As a result, when one row fails, the entire batch is rejected, resulting in a situation where we have both good and bad records in a batch of rejects, forcing us to develop more complex techniques of correcting them for a single reject.

Table and Database Management

This simple script demonstrates how to run table-related and database management commands.

Getting ready

Create a new job jo_cook_ch07_0100_tableManagement

How to do it…

The following are the measures to take:

  • Select the component tMysqlRow and drag the cookbookDB connection into the canvas.
  • Add the following code to the Query area:
    "
    CREATE TABLE `test_mysqlrow` ( `id` int(11) NOT NULL,
    `data` varchar(45) DEFAULT NULL, PRIMARY KEY (`id`)
    )
    "
  • When you run the job, you'll notice that the table testMysqlRow is created.

How does it work?

The database equivalent of both tJava and tJavaRow, the tMysqlRow component can be used in a flow like tJavaRow or standalone like tJava.

However, like tJava, the tMysqlRow component is most usually used independently, like in this case, when creating a table.

There’s more…

This basic example explains how to utilise tMysqlRow in a single, isolated situation. It's typically used before processing to build temporary tables or drop constraints or indexes before bulk loading, and it's also used after processing to remove temporary tables and restore constraints or indexes.

Talend Training

Weekday / Weekend Batches

Handling Surrogate Tables Parent and Child Keys

Surrogate keys would be used by many application databases to uniquely identify rows in a table. As a result, after writing a record, it's common to need to capture the surrogate key so that any connected child elements can use the parent's surrogate as a foreign key. This recipe demonstrates one approach for constructing surrogate keys in a relation and then goes over a few more.

Getting ready

Open the job jo_cook_ch07_0110_surrogateKeys

How do you do it?

The following are the measures to take:

  • Add the following query to the globalCustomer component of the tMysqlInput component:
    "
    SELECT COALESCE(MAX(customerId),0) FROM globalCustomer
    "
  • Add the following code to tJavaRow 1:
    globalMap.put("maxCustomerId",input_row.maxCustomerId);
    System.out.println("Max customer id = " 
    + globalMap.get("maxCustomerId"));
  • Add the following code to the Expression field for customerId in tMap_1:
    Numeric.sequence("customer",((Integer)
    globalMap.get("maxCustomerId"))+1,1)
  • For the globalCustomer lookup, open tMap_2 and add customerURN and source as join keys.
  • Add the following code into the Expression field for the orderId:
    Numeric.sequence("order",((Integer)
    globalMap.get("maxOrderId"))+1,1)
  • The tMap_2 should now resemble the following image:

  • Execute your job. You'll notice that the customer and order records have been added, that the surrogate keys have been generated appropriately, and that the foreign key references between customer and order have been correctly created.

How does it work?

The following two stages are used to create unique keys:

The first stage is to capture the highest value for the key in the customer3NF and order3NF tables, which is done in tMysqlInput. These values are then saved in the globalMap variables to be used later.

The second stage is to create a sequence that starts at the maximum value plus one and uses the sequence's output as the surrogate key for each input in the customer row now that we know the highest key value in the customer3NF table.

We then write the customer data to the customer3NF table, read the order data, and connect the order data to the customer table using the natural key (customerURN and source) in tMap to extract the customer surrogate to use as the foreign key for the order data.

We also build the order surrogate and write the completed order row to the database using tMap.

There’s more…

We can blitz a table at a time with the way shown, which is particularly efficient for batch processing or data migrations. There is one caveat: we must ensure that no other task or process is simultaneously writing to the same table using the same mechanism. If we can't be certain, we should try a different approach.

Using HashMap Key Table for Adding Efficiency

By avoiding re-reading the globalCustomer table, this approach can be made more efficient. It's advisable to transfer the surrogate key (created) and natural keys (source and customerURN) into a hashMap if you have enough memory. This hashMap may then be utilised as a lookup, avoiding the need to reload the full customer table from the database rather than reading data already in memory.

Ranges

Prior to writing, identify a range of values as a little variation on the preceding procedure. This could be accomplished by inserting a temporary marker record with a key equal to the current maximum value plus the number of records to be written plus one into the table (or a given range plus one). This means that a second process will begin loading at the top of the range, avoiding any potential for overlap. But don't forget to erase the marker record at the conclusion of the range's writing.

Sequences

Sequences are database objects that each time they are read, they return a unique value. The procedure simply adds a lookup to the main flow that selects the next sequence value for the table and uses the resulting lookup to produce surrogate keys.

This approach is suitable for both real-time and batch updates; however, sequences are not supported by all database versions.

It is slightly less efficient than the Talend produced sequence approach, but it has the benefit of being usable by multiple processes at once. As a result, in a real-time/web service-based context, it is the optimal way to use.

Keys that Automatically Increment

When a record is written to a table, certain databases allow fields to produce a new one-up number automatically. We don't need to generate a key if the key field is set to auto increment; we only need to write the record to a table and the database will handle the rest.

This strategy, however, has a drawback in that we must re-read the table using the natural key in order to locate the database-generated surrogate.

The Component LastInsertId
  • The tMysqlLastInsertedId component in the flow is a somewhat more efficient alternative to re-reading with the natural key.
  • This component will add a field to the existing schema and populate it with the most recently written record automatically.
  • Since this component would return the connection's most recent value, only write one table per database session.
  • Also, keep in mind that this feature is only available for a limited number of databases.
Procedure for Automatic Incrementing

The last option on the list is to utilise an auto-increment key field but write the record to the database using a stored procedure. After that, the process can write the record and capture the most recent value recorded. This has the benefit of working with most databases, but the drawback of requiring the maintenance of a stored procedure/function.

Conclusion:

We hope this blog is very helpful in learning the different procedures to work with the databases with Talend.

Find our upcoming Talend Training Online Classes

  • Batch starts on 1st Dec 2022, Weekday batch

  • Batch starts on 5th Dec 2022, Weekday batch

  • Batch starts on 9th Dec 2022, Fast Track batch

Global Promotional Image
 

Categories

Request for more information

Gayathri
Gayathri
Research Analyst
As a senior Technical Content Writer for HKR Trainings, Gayathri has a good comprehension of the present technical innovations, which incorporates perspectives like Business Intelligence and Analytics. She conveys advanced technical ideas precisely and vividly, as conceivable to the target group, guaranteeing that the content is available to clients. She writes qualitative content in the field of Data Warehousing & ETL, Big Data Analytics, and ERP Tools. Connect me on LinkedIn.