[Progress News] [Progress OpenEdge ABL] Leveraging the MarkLogic Virtual Views for Ad-Hoc Analytics

  • Thread starter Thread starter Drew Wanczowski
  • Start date Start date
Status
Not open for further replies.
D

Drew Wanczowski

Guest
During a roundtable on data culture in analytics management I moderated, a recurring question raised by data architects and IT leaders was: How can we manage report overload and avoid creating a new report for every request or question?

Because of their immediate and not-so-well-defined nature, one-off, ad-hoc questions can feel like fire drills. For analysts, it means less time for long-term projects and preparing and analyzing data that is not always ready for a BI tool. For data engineers, it means exposing certain information in tables and views to BI tools so analysts can construct queries over those tables.

There is usually nothing quick about these ad-hoc requests. The process of exploring, prototyping, and validating data to answer loosely defined questions is time-consuming and resource-intensive. Additionally, introducing unstructured and semi-structured data adds a layer of complexity.

To lessen the load on data teams, organizations can improve how they handle ad-hoc analysis. While fostering data literacy helps, not every user can turn raw data into an explanation. Even with broad data access, analysts and engineers still play a crucial role in surfacing and interpreting information.

Progress MarkLogic Server 12 has introduced Virtual Views to help drive quicker development cycles and reduce wait times to the business-critical data, while optimizing your system’s resources.

Building Relational Views Over Unstructured Data​


Whether you are recording information from a scientific experiment, results from a safety test or transactions from a financial system, users need to consume it in a way that works for them. Systems often expose this data in feeds of JSON or XML. While these formats are great for record retention or APIs, analysts and SMEs are used to running reports using BI tools and data science utilities that expect rows and columns.

The MarkLogic platform has always supported analysts with secure access to complete, governed enterprise information. Information for reports and analysis is integrated from unstructured documents and a relational lens is applied over it to make it ready for SQL-based intelligence tools.

Through the MarkLogic Server Template Drive Extraction feature, engineers can configure a mapping of these document structures to views containing rows and columns. This differs from ETL (Extract-Transform-Load) since you are not creating a copy of the record. This data is projected into these views in a transactionally consistent manner. You insert the record once and it is available as a document (JSON/XML), search entry and view entry. This reduces the efforts and need to synchronize data across systems. It also ensures that all your consuming methods are consistent.

To help deliver those views faster, the Virtual Views functionality in MarkLogic Server 12.0 allows you to generate a TDE on the fly, without reindexing all data in the database.

Materialized and Virtual Views​


Prior to MarkLogic Server 12.0, all views defined by TDE have been materialized. This creates a unique index in the system. While records are processed by the system, these views will be updated as part of a transaction. This indexing process ensures that every field that has been defined is extracted and saved. This enables robust analytics on top of the MarkLogic platform. However, utilizing materialized views does add to indexing time as well as storage.

Not all use cases require such an optimized index pattern. The Virtual Views feature leverages the MarkLogic Universal Index. As part of our record processing, the structure and content are indexed for search. This allows us to build rich and robust end-user applications using search grammar and faceted aggregations. Like traditional relational databases, you can run queries on these views and only enable a materialized index if you need to. This makes your database footprint smaller and allows for more flexible ad-hoc querying for analytical workloads.


The Setup​

Let’s see Virtual TDEs in action. You will first need to install MarkLogic Server 12.0. Follow the setup instructions and create a user for demonstration purposes.

We will be using MarkLogic Flux to load sample data into the default Documents database. The following command will take an aggregate JSON set and load it as documents into the database. You can also load data from a variety of formats.


Code:
./bin/flux import-aggregate-json-files \
    --json-lines \
    --path viscosity.json \
    --connection-string "user:password@localhost:8000" \
    --permissions rest-reader,read,rest-writer,update \
    --collections http://example.com/tests/viscosity \
    --uri-template "/data/viscosity/{sample_guid}.json"

Verify that documents have been loaded by logging into the MarkLogic Query Console and clicking explore on the Documents database. You can also run a search query to see that the records made it into the system.

Code:
'use strict';

const op = require('/MarkLogic/optic');

op.fromSearchDocs(cts.collectionQuery('http://example.com/tests/viscosity))
  .offsetLimit(0, 10)
  .result();

Defining Views​

We will now define our template against data we would like to project into views. Note we can configure the template to make all data materialized or virtual at the top level. Or we can specify individual fields to materialize for more efficient scanning and sorting.


The first step is to review the shape of your data and determine which fields we will be extracting. We have generated sample lab experiment tests. This document contains high-level metadata and measurement throughout the test.


Code:
{
    "sample_guid": "12b95849-f5b7-4398-9ab7-347f9f6d2b84",
    "sample_number": 4,
    "chemical_name": "chemical B",
    "test_date": "2024-10-17T11:15:12Z",
    "test_operator": "Dorothy Thayre",
    "test_method": "method 1",
    "measure": [
        {
            "viscosity": 6.47,
            "temperature": 9.71,
            "time_elapsed": 19.49
        },
        {
            "viscosity": 88.5,
            "temperature": 36.92,
            "time_elapsed": 74.2
        },
        {
            "viscosity": 8.14,
            "temperature": 56.73,
            "time_elapsed": 80.78
        },
        {
            "viscosity": 51.88,
            "temperature": 69.23,
            "time_elapsed": 62.55
        },
        {
            "viscosity": 75.02,
            "temperature": 35.98,
            "time_elapsed": 83.97
        }
    ]
}

In this template, we will extract all the firstlevel metadata into one view. The metadata view will be part of Viscosity schema and will be virtual. Note that viewVirtual is set to true at the view level. We can also materialize individual fields if we would like faster operations, such as sorting. Note the Date field with the virtual property set to false.

Code:
'use strict';
declareUpdate();
const tde = require("/MarkLogic/tde.xqy");

let template = xdmp.toJSON(
{
    "template": {
        "description": "Viscosity Test Template",
        "context": "/",
        "collections": [
            "http://example.com/tests/viscosity"
        ],
        "vars": [
            {
                "name": "guid",
                "val": "sample_guid"
            }
        ],
        "rows": [
            {
                "schemaName": "Viscosity",
                "viewName": "Metadata",
                "viewLayout": "sparse",
                "viewVirtual": true,
                "columns": [
                    {
                        "name": "GUID",
                        "scalarType": "string",
                        "val": "$guid",
                        "nullable": false,
                        "invalidValues": "ignore"
                    },
                    {
                        "name": "SampleNumber",
                        "scalarType": "string",
                        "val": "sample_number",
                        "nullable": false,
                        "invalidValues": "ignore"
                    },
                    {
                        "name": "Operator",
                        "scalarType": "string",
                        "val": "test_operator",
                        "nullable": false,
                        "invalidValues": "ignore"
                    },
                    {
                        "name": "Method",
                        "scalarType": "string",
                        "val": "test_method",
                        "nullable": false,
                        "invalidValues": "ignore"
                    },
                    {
                        "name": "Date",
                        "scalarType": "dateTime",
                        "val": "test_date",
                        "virtual": false,
                        "nullable": false,
                        "invalidValues": "ignore"
                    }
                ]
            }
        ]
    }
})


tde.templateInsert("/tde/viscosity.json", template, [xdmp.permission("rest-writer", "update"), xdmp.permission("rest-reader", "read")]);

You can now query these views directly in SQL in Query Console. As noted, the only indexing occurs against fields that you marked as virtual “false.”

Code:
SELECT * FROM Viscosity.Metadata
LIMIT 100
The Universal Index in MarkLogic Server automatically indexes XML elements and JSON properties during document ingestion, without needing a predefined schema. It supports fast and flexible querying by indexing full text, structure and values. The index treats structural and textual queries similarly, enabling efficient combined searches across different data types.


Virtual Views can leverage this index to build filters against the data. When creating an SQL query with a WHERE clause, the MarkLogic Server query engine will automatically map these constraints to the universal index. There is no additional work for you as a developer.

Code:
-- query

SELECT * FROM Viscosity.Metadata
WHERE Method = 'method 1'
LIMIT 100

In this case, the Method field is a String and will map over to an element value query or property value query.

Adding New Columns on the Fly​


As we proceed, we want to add new views and columns. This would typically require a reindex. However, having the view set to virtual, we can add these columns and rely on the data that has already been persisted in the Universal Index.

Code:
'use strict';
declareUpdate();
const tde = require("/MarkLogic/tde.xqy");

let template = xdmp.toJSON(
{
    "template": {
        "description": "Viscosity Test Template",
        "context": "/",
        "collections": [
            "http://example.com/tests/viscosity"
        ],
        "vars": [
            {
                "name": "guid",
                "val": "sample_guid"
            }
        ],
        "rows": [
            {
                "schemaName": "Viscosity",
                "viewName": "Metadata",
                "viewLayout": "sparse",
                "viewVirtual": true,
                "columns": [
                    {
                        "name": "GUID",
                        "scalarType": "string",
                        "val": "$guid",
                        "nullable": false,
                        "invalidValues": "ignore"
                    },
                    {
                        "name": "SampleNumber",
                        "scalarType": "string",
                        "val": "sample_number",
                        "nullable": false,
                        "invalidValues": "ignore"
                    },
                    {
                        "name": "Operator",
                        "scalarType": "string",
                        "val": "test_operator",
                        "nullable": false,
                        "invalidValues": "ignore"
                    },
                    {
                        "name": "Method",
                        "scalarType": "string",
                        "val": "test_method",
                        "nullable": false,
                        "invalidValues": "ignore"
                    },
                    {
                        "name": "Date",
                        "scalarType": "dateTime",
                        "val": "test_date",
                        "virtual": false,
                        "nullable": false,
                        "invalidValues": "ignore"
                    }
                ]
            }
        ],
        "templates": [
            {
                "context": "measure",
                "rows": [
                    {
                        "schemaName": "Viscosity",
                        "viewName": "Measure",
                        "viewLayout": "sparse",
                        "viewVirtual": true,
                        "columns": [
                            {
                                "name": "GUID",
                                "scalarType": "string",
                                "val": "$guid",
                                "nullable": false,
                                "invalidValues": "ignore"
                            },
                            {
                                "name": "Temperature",
                                "scalarType": "string",
                                "val": "temperature",
                                "nullable": false,
                                "invalidValues": "ignore"
                            },
                            {
                                "name": "Elapsed",
                                "scalarType": "string",
                                "val": "time_elapsed",
                                "nullable": false,
                                "invalidValues": "ignore"
                            },
                            {
                                "name": "Viscosity",
                                "scalarType": "string",
                                "val": "viscosity",
                                "nullable": false,
                                "invalidValues": "ignore"
                            }
                        ]
                    }
                ]
            }
        ]
    }
})


tde.templateInsert("/tde/viscosity.json", template, [xdmp.permission("rest-writer", "update"), xdmp.permission("rest-reader", "read")]);

These views can also be joined and sorted with ease. There is no need to wait for this data to reindex since it is already present in the Universal Index.

Code:
-- query

SELECT * FROM Viscosity.Metadata
INNER JOIN Viscosity.Measure
ON Metadata.GUID = Measure.GUID
ORDER BY Date
LIMIT 100

Connecting via ODBC​


In addition to the capabilities of Query Console, you can connect applications over ODBC. This enables BI tools such as Tableau or Microsoft Power BI to be leveraged against the dataset.

Relational Visualization of Unstructured Data in Tableau Generated with MarkLogic Virtual Views




Conclusion


With the innovative MarkLogic Virtual Views feature, you can now leverage the Universal Index as part of your analytical query workload. Enabling flexible, efficient and robust ad-hoc querying. This provides a smaller database footprint without compromising query flexibility. These new enhancements to Template Driven Extraction have numerous benefits:


Flexible Analytics: Run queries on demand and enable materialized indexes only when necessary.
Efficient Data Management: Seamlessly project data into rows and columns without creating new records.
Consistent Consumption: Ensure all consuming methods are transactionally consistent.
Optimized Performance: Reduced indexing time and storage needs with Virtual Views.

Before you do another “quick pull of the database” or create another dashboard no one is going to look at, try an MVP that will give you enough to sit with the stakeholders and understand the real question at hand.

Get started with MarkLogic Server 12 today.

Continue reading...
 
Status
Not open for further replies.
Back
Top