Data, Data, Everywhere By Edward Rodden, CIO, SugarCreek

Edward Rodden, CIO, SugarCreek | Monday, 06 February 2017, 10:16 IST

Manufacturers have been connecting devices for many years, but the pace is accelerating with the focus on IoT initiatives. One consequence of this proliferation of connected devices is a deluge of data, in many forms and in many places. For manufacturing entities, much of this data relates to, or is linked to, other data sets.

ERP, SCADA, and CRM systems, to name just a few, generate significant volumes of data that can and should be tied together for various purposes. Unstructured data from website traffic can also be relevant. In some companies a common database exists; in others it does not.

Edge processing, where relevant data is filtered at the edge and only that data is sent “home,” is growing as a means to reduce the traffic of unneeded data.
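As a rough illustration of the idea, the sketch below keeps only out-of-band readings and drops the rest before anything is transmitted. The sensor name, the reading format, and the temperature band are all hypothetical, chosen only to make the example concrete; a real edge device would apply whatever relevance rule the business defines.

```python
from typing import Iterable, Iterator


def filter_at_edge(readings: Iterable[dict], low: float, high: float) -> Iterator[dict]:
    """Yield only readings whose value falls outside the normal band [low, high]."""
    for reading in readings:
        if not (low <= reading["value"] <= high):
            yield reading


# Hypothetical sample data: four oven temperature readings, one abnormal.
readings = [
    {"sensor": "oven-1", "value": 180.0},
    {"sensor": "oven-1", "value": 181.5},
    {"sensor": "oven-1", "value": 240.2},
    {"sensor": "oven-1", "value": 179.8},
]

# Only the abnormal reading would be sent "home"; the other three stay at the edge.
to_send = list(filter_at_edge(readings, low=170.0, high=190.0))
```

In this sketch three of the four readings never leave the device, which is the traffic reduction the approach is after.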

Visualization tools have advanced significantly over the years, to the point that an IT skill set is no longer required to construct dashboards and the like. Getting the data to the point where a dashboard can be applied, however, has remained the purview of true IT practitioners, either because users lack the skill sets or as basic protection of the data from unskilled users.

Traditional data warehousing using ETL is still in play, but it is increasingly viewed as an old and inflexible approach requiring long project cycles and intense resourcing.

When common or compatible databases exist, views can be used to connect data and provide a composite view across databases. This, however, tends to be an iterative process in which IT people, who do not understand the business use of the data, go back and forth with users, who do not understand databases. Eventually, together, they arrive at the view the user wants.

Views can also be of limited use against production data, unless tools such as SQL Server Always On are used to maintain replicated copies to report against.
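A minimal sketch of such a composite view follows, using Python's built-in sqlite3 module with two in-memory databases standing in for separate ERP and CRM systems. The table names, column names, and sample rows are hypothetical. SQLite is used only because it runs anywhere; a production setup would more likely define the view on the database server, against a replicated copy.

```python
import sqlite3

# The main connection stands in for the ERP database.
conn = sqlite3.connect(":memory:")
# A second, separate in-memory database stands in for the CRM system.
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.orders (order_id INTEGER, customer_no TEXT, amount REAL)")
conn.execute("CREATE TABLE crm.customers (customer_no TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?, ?)",
    [(1, "C001", 250.0), (2, "C002", 90.0), (3, "C001", 40.0)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [("C001", "Acme Foods"), ("C002", "Bravo Deli")],
)

# In SQLite, only a TEMP view may reference tables in more than one database.
conn.execute(
    """
    CREATE TEMP VIEW customer_totals AS
    SELECT c.name AS name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.customer_no = o.customer_no
    GROUP BY c.name
    """
)

rows = conn.execute("SELECT name, total FROM customer_totals ORDER BY name").fetchall()
```

The join column here (`customer_no`) is exactly the kind of detail that IT and users typically have to settle through the back-and-forth described above.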

The myriad approaches to “Big Data” seem to be in almost constant flux, to the degree that it is nearly impossible to complete an analysis of these approaches without something new being introduced during the analysis.

I remember attending a conference within the last year and listening to a well-known speaker advocate graph databases as the next “big thing.” I messaged my team: “Does anyone know what a graph database is?” One knew a little about it, the others not much.

On a daily basis, my mailbox is flooded with emails hyping the next big data tool(s), from companies I have not heard of and with vague claims of superior functionality.

I read the available literature for at least a few hours each day, but it is hardly enough to keep up with trends and changes.

Additionally, there are ongoing changes in basic infrastructure, with virtualization spreading beyond servers into software-defined data centers and networks, and with microsegmentation and next-generation firewalls providing east-west security.

Infrastructure and data-based changes are blurring traditional IT roles. Server and network roles, traditionally separate, are beginning to blend functions. This will likely continue until it is hard to distinguish between the two.

The newer role of Data Scientist has emerged, but it is likely to splinter as people focus on gaining expertise with specific types of databases and tools.

IT groups in mid-size and smaller manufacturing companies tend to be small, with skill sets focused on infrastructure and the major systems in use. Meeting the changes described here can be very difficult for such groups, as it is hard to encompass all of the various skill sets in a small team.

Machine learning, often referred to as AI, is a very promising approach for a variety of reasons. In the context of databases, AI is probably a misleading term; I would refer to it as “Automated Data Mining” as opposed to self-learning.

Machine learning tools use mathematical algorithms to search for matching data, or links, in the content of multiple databases. As an example, the content of a column in one database containing customer numbers may match to a high degree the content of a column in a second database. This could lead to the presumption that both columns contain customer numbers and a join, or link, may be established on that basis.
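The column-matching idea can be sketched with a simple overlap score. This is a plain set-overlap heuristic, not a machine learning model, and it is not how any particular product works; the table names, column names, sample values, and the 0.7 threshold are all hypothetical.

```python
def overlap_ratio(left, right):
    """Share of distinct values in `left` that also appear in `right`."""
    left_set, right_set = set(left), set(right)
    if not left_set:
        return 0.0
    return len(left_set & right_set) / len(left_set)


def suggest_joins(table_a, table_b, threshold=0.8):
    """Return (col_a, col_b, ratio) for column pairs whose overlap meets the threshold."""
    suggestions = []
    for col_a, values_a in table_a.items():
        for col_b, values_b in table_b.items():
            ratio = overlap_ratio(values_a, values_b)
            if ratio >= threshold:
                suggestions.append((col_a, col_b, ratio))
    # Strongest candidates first.
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


# Hypothetical extracts from two systems that name the customer number differently.
erp_orders = {
    "cust_no": ["C001", "C002", "C003", "C001", "C004"],
    "plant": ["P1", "P2", "P1", "P1", "P2"],
}
crm_accounts = {
    "account_id": ["C001", "C002", "C003", "C005"],
    "region": ["Midwest", "South", "Midwest", "West"],
}

suggestions = suggest_joins(erp_orders, crm_accounts, threshold=0.7)
# Three of the four distinct cust_no values appear in account_id (ratio 0.75),
# so that pair is the only join candidate suggested.
```

A suggestion like this remains a presumption: a high overlap hints that two columns hold the same thing, but someone who knows the business still needs to confirm it.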

Tools of this type can also clean data to varying degrees, for example by recognizing that U.S.A. and USA are the same thing and normalizing the data accordingly.
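A minimal sketch of that kind of cleaning follows, assuming a known list of canonical spellings. The function names and sample values are hypothetical, and commercial tools generally use far more sophisticated matching than this.

```python
import re


def normalize_key(value: str) -> str:
    """Collapse case, punctuation, and whitespace so variants compare equal."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def normalize_column(values, canonical):
    """Map each value to its canonical spelling when the normalized keys match."""
    lookup = {normalize_key(c): c for c in canonical}
    # Values with no canonical match are left unchanged.
    return [lookup.get(normalize_key(v), v) for v in values]


countries = ["U.S.A.", "USA", "usa ", "Canada", "U S A"]
cleaned = normalize_column(countries, canonical=["USA", "Canada"])
# cleaned == ["USA", "USA", "USA", "Canada", "USA"]
```

Stripping punctuation and case handles variants like these, but it would not equate "USA" with "United States"; that requires a synonym list or a smarter matcher.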

In an ideal world, users without IT coding or database skill sets can use tools like these against replicated datasets to define relationships among multiple datasets, clean the data, and join it in such a way that visualization tools can be applied.

Achieving this would relieve both IT and users of the iterative back-and-forth of building views, leading to far more efficient processes in the future.