The term Data Scientist is a recent arrival in our pantheon of job descriptions and academic tracks. But are the people who hold the title actually Data Scientists, or a variant of Software Engineers?
If you scan the internet for data science, lots of tech blogs and websites show up. Python programming tips and statistical programming tips using various tool packages for AI/ML are what dominate — that is what data scientists are trained for. They are hired because they learned more math — statistics and modeling — and they know different programming languages and tools that fall into the AI/ML category.
Peering into what the role actually is, you do see that a lot of their time is spent wrestling with data — data sources, cleaning data, mapping data semantically, and so on. Statistically, these software engineers spend most of their time working on data rather than programming.1 Who cares, you wonder. Stuff has to get done.
Try to remember a time when all the data formats, rules, and, often, the data itself were embedded in each program. Every program or sub-system defined and refined the data uniquely.2
When we had to combine or roll up results, we added more integration programs to translate and transform data to feed the next set of programs. IT managers sat on a mountain of integration code that added little value, and each time there was a change, it took a committee to synchronize all the programs to make even simple changes.3 There were more programmers working on maintenance than on value-added new work.
The advent of data-centricity in systems design, middleware, object-oriented programming, great and inexpensive databases, EDI networks and supply chain application networks that do data translations, and so on, segmented the data from the code and significantly reduced the pain and chaos of change.
Then, along came AI/machine learning and the data scientist. Development in AI, unfortunately, has often become a new island in IT. A large part of data scientists’ work is extracting data from a variety of sources, an exercise in uniquely sourcing, understanding, formatting, defining (labeling), cataloging, and attributing the data.4
To add to the challenge, these newly minted data scientists work in IT but, often, not with IT. Everybody is busy doing their own assignments. Time spent blending traditional software developers and data scientists is often limited. Many software engineers I talk to are working with crushing deadlines, with scant attention being paid to tried and true software quality methods. Formal system software reviews mostly don’t happen. Data is often not incorporated into a common data dictionary for reference and re-use.
We have seen this movie before, and it is expensive to clean up.
And yet …
- We already have data people (database designers and administrators, and other data roles) who know all about data, its structure, the selection and design of data stores, and so on.
- We have the users who care about the content and can better judge the meaning, relevancy, validity, and value of the data.
So, can we avoid making the mistakes of the past?
What is Data Science?
Let’s take a quick look at the world of data.
Our world of data has gotten pretty complex. We not only have the digital enterprise data we all are so familiar with, like a customer record, or an EDI purchase order, but now we have vast troves of unstructured and streaming analog data to dissect, understand, store, and use in ways we might not grasp initially. Or it might be garbage.
Master data management (MDM) has been the backwater of the IT department for some time. Data work, such as designing and transforming a technical spec that is a PDF or even a fax into a digital format that can be an input into a CAD system, or translating languages, is a huge challenge. It is very underappreciated work, often done by very specialized people, software tools, or even data-entry service companies who are all quietly part of solutions you may use every day.
Users groan at the burgeoning data requirements of standards and regulations that require more data and adherence to formats in PIM systems, product tracking data, electronic customs forms, financial payment, or supplier information management (vendor company data, product data, pricing data, supplier sustainability/risk assessments), and the ever-expanding transportation/logistics data.
We can’t ignore this work because without the right data we cannot trade. Without the right data we cannot meet legal requirements.
In this exacting world of data, not fully filling out the forms (like electronic filing for customs or a complete EDI transmission)5 will stop the transaction, stalling sales and on-time fulfillment. We have to give the data people their due for all the underappreciated6 work they have been doing on our behalf.7
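To make the point about incomplete forms concrete, here is a minimal sketch of a completeness check that runs before a filing is transmitted. The field names and the `REQUIRED_FIELDS` list are hypothetical, chosen only for illustration; real customs and EDI schemas define their own mandatory elements.

```python
# Hypothetical required fields; real customs or EDI schemas differ.
REQUIRED_FIELDS = (
    "shipper",
    "consignee",
    "hs_code",
    "declared_value",
    "country_of_origin",
)


def missing_fields(filing: dict) -> list:
    """Return the required fields that are absent or blank in a filing."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = filing.get(field)
        # Treat None and empty/whitespace-only strings as "not filled out".
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


def can_transmit(filing: dict) -> bool:
    """A filing may be sent only when no required field is missing."""
    return not missing_fields(filing)


# Example filing with a blank HS code and no country of origin.
filing = {
    "shipper": "Acme Co.",
    "consignee": "Globex",
    "hs_code": "",
    "declared_value": 1250.0,
}

print(missing_fields(filing))
print(can_transmit(filing))
```

A check like this is the easy part to code. Knowing which fields are mandatory, and what counts as a valid value, is the data people's contribution.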
As more and more businesses are online, global, multi-industry, and so on, companies have also had to intake, analyze, and translate data from multiple sources: multiple/diverse data sets and data that might have a slightly different meaning for each application or enterprise. Hats off to all the beavers in the “data department,” the compliance department, those in data policy and governance, the EDI people, and others who fixed all this.
Without going into technical treatises, data experts and database experts (designers, administrators, etc.) organize digital data into fixed structures and further define the data by its size and other characteristics. They build the data dictionary.
However, the most important part of the data definition is its context. For example, a billing system that has to access a database for patient records will define you as a patient, whereas a billing system for standard goods will access a customer database and will define you as a customer. Same person, different context. How that data is acted upon — a customer can buy, consume, use, reuse, return, look for a discount; a patient has visits, tests and procedures — starts to form other data to define the interrelationships and rules about how that data can be changed, and who can change it.
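The patient/customer example can be sketched as a small, context-aware data dictionary. This is only an illustration under stated assumptions: the context names (`patient_billing`, `goods_billing`), the field sizes, and the `DictionaryEntry` structure are hypothetical and not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    """One data-dictionary definition of an entity within a given context."""
    role: str               # what the entity is called in this context
    data_type: str          # structural definition
    max_length: int         # size constraint (illustrative value)
    allowed_actions: tuple  # what can be done to or by the entity here


# The same real-world person is defined differently depending on context.
DATA_DICTIONARY = {
    ("person", "patient_billing"): DictionaryEntry(
        role="patient",
        data_type="string",
        max_length=64,
        allowed_actions=("visit", "test", "procedure"),
    ),
    ("person", "goods_billing"): DictionaryEntry(
        role="customer",
        data_type="string",
        max_length=64,
        allowed_actions=("buy", "use", "return", "discount"),
    ),
}


def role_of(entity: str, context: str) -> str:
    """Look up how an entity is defined in a given context."""
    return DATA_DICTIONARY[(entity, context)].role


def is_allowed(entity: str, context: str, action: str) -> bool:
    """Check whether an action makes sense for the entity in this context."""
    return action in DATA_DICTIONARY[(entity, context)].allowed_actions


print(role_of("person", "patient_billing"))
print(is_allowed("person", "goods_billing", "procedure"))
```

Keying the dictionary on both the entity and the context keeps "same person, different context" explicit, rather than buried in each program.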
There are parallel databases for products, purchase orders, suppliers, care providers and so on, and they all have to work together. The processes that interact, translate and reference them are relying on consistent data. This is all part of the structured data world. End-users are part of this, since they are the data’s owners and its customers. They know what is good vs. bad data.
Now, end-users are taking an interest in data due to the new data sources, for example, weather, social, pictures, video; movement of people or conveyances; the temperature of a latte, or the viscosity of the lubricant. We also have augmented reality, which introduces further dynamics into a world of interwoven IoT, 3D graphical data, and video data, to name a few other layers in this world of data.
The many potential data sources are interesting and complex to interpret. And they are all part of the unstructured world of analog data: strings, graphics, video, and streams and reams of data.
In order to understand this data and apply it, new techniques need to be used to interpret and transform it into a workable digital format so those fancy algorithms can get to work. AI/Machine Learning and lots of big data can be used for massive cosmological research8 or the “simple” scanning of text within a customer satisfaction survey. Searching text for consumer sentiment and product feedback9 has become an important aid to product and service improvements.
Tools like cognitive computing and deep learning, therefore, rely on the use of semantics and taxonomy10 to understand, translate, validate, and produce consistent and accurate, relevant data. This is the work of the data analyst. And we need this because the software alone is not always getting the “words” right.
If we follow this thought, we move from the data analyst to a database designer/administrator who thinks about what kind of database structure should be used to store unstructured data. What to retain? How best to access it? See Table One for a simple description of the roles and the skills needed for each.
Then there are all the data and the sources. The data for supply chain is exploding (see Figure 1).
It will require the skills of all the experts (business/supply chain analysts/scientists, data resource managers and AI software engineers and all the other people mentioned above) to figure it all out. This is very time-consuming. Hence, it would be better to organize it a bit and get the knowledgeable people involved, cooperating and invested in this new wave of how we work.
Figure 1 – The World of Data
Let’s look a bit more at the Data Scientist and the user, whom in this changing environment we have been calling the Supply Chain Scientist:
- Data Scientists use various analytical methods and AI/machine-learning technology to source and process data. But these tools initially don’t get the connection between all the references, concepts, and uses of language within text; nor can they aptly identify an object within a picture. This takes yet more tools and, mostly, manual brain time to sort out. And can we say that many techies don’t enjoy spending time on this kind of work? Or at least, they prefer to do what they do best: code! In fairness, they are not subject matter experts, whether it be products, the weather, the customer, or the needs of a demographic. So, it will take them a lot longer to sort through this.
- Supply Chain Scientists know the data well, and they will learn about the new sources and their potential value in new types of queries. Getting that source data and translating it, storing it and protecting it is the work of database designers and administrators. Using that data in various programs is the work of software engineers.
In addition, large organizations also have an architect who coordinates and oversees the technology selection and integration. Even in smaller organizations, the overall portfolio and its architectural mapping is an underlying concept that is generally practiced, formally or not.
Organizing the work well can make us a lot more productive and make the work a lot less frustrating for all. Long term, the organization will have a lot more leverage from the investment in establishing the right data architecture early in their AI/machine learning journey rather than letting things get out of control.
Skills and Roles
What to Know
This is the business analyst who is interested in customers, markets, and their environments. This person knows what issues impact supply chains and the sources that can be acquired to analyze those impacts. They intuitively know the semantics about the data, but the data is not systematized.
The data management expert systemizes the data about the data (the data dictionary). This person knows the systemic automation of data ontology and the tools and data-lake technology best suited to storing raw data from the potential, diverse sources.
Due to the diversity of data and sources and the applications that might use them, we need to restructure data and enhance its definition based on these new needs. That means we are moving to a world that is beyond the manual creation of relationships in database structures. Data lakes or data fabric (two of the terms you might see) use knowledge graphs, automating the ontology, linking the various types of data stores, processes (programs), and APIs as yet another layer in the information architecture.
Of course, we still have the traditional roles of EDI/API developers and users, as well as traditional software engineering roles.
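As a rough illustration of the knowledge-graph idea mentioned above, relationships between data types, data stores, and APIs can be held as subject-predicate-object triples and queried by pattern. This is a plain-Python sketch, not a real graph store; every entity and predicate name below is hypothetical. A production setup would more likely use the semantic standards listed in footnote 12.

```python
# Each triple is (subject, predicate, object). All names are illustrative only.
TRIPLES = {
    ("Customer", "stored_in", "crm_database"),
    ("Patient", "stored_in", "patient_records"),
    ("Customer", "same_person_as", "Patient"),
    ("PurchaseOrder", "references", "Customer"),
    ("crm_database", "exposed_by", "customer_api"),
}


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Where is each kind of data stored?
for s, p, o in match(predicate="stored_in"):
    print(f"{s} -> {o}")

# Which things point at the Customer definition?
print(match(obj="Customer"))
```

Because the links are themselves data, a new source or API becomes one more triple rather than a change to a fixed schema.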
Recommendations — Let’s Get These Roles Right!
OK, so what are we getting at here?
The short of it is this: there is a difference between being responsible for the meaning and management of data and its quality and the coding/automation for acquiring and processing the data.
Data management is the domain of data people, and the automating/coding that of the software engineer. These roles are very deep in terms of the time it takes, and the knowledge, methods, and technologies required to perform these activities.
As we mentioned in previous articles, in fact, it can take multiple types of technologies to move from discovering data sources to acquiring, cleaning, defining, and structuring the data before an analysis is even done.13
Our contention is that as organizations begin to use AI/machine learning, they will see how much time programmers spend on the data, long before they get to actually code/analyze anything. Additionally, isolating all that work with them may lead to further messes.
That approach doesn’t get other knowledgeable people involved, leading to time sinks, errors and resentment.
Change management, which the advent of AI/ML has necessitated, will not go well if people are not respected and given their due for all the knowledge they have acquired. It is also a poor use of an enterprise’s investment in the intellectual property they have built up.
Organize the Work!
These and other issues are easily solved by thinking about roles:
- Segment the roles between the business analyst/Supply Chain Scientist, the Data Management expert and the Data Scientist/Software Engineer.
- Turn to commercially curated data to support the processes of defining and cataloging data and providing APIs (as well as other services).
This is not a radical idea, nor is it new. It is mostly the way we have worked, until now. In large organizations, the data people have been different from the programmers. Still, everyone will need to learn a lot of new skills and technologies.
In our complex world of AI/machine learning, we need to see the specialties involved and the new tools and skills each profession needs to learn.
As we have previously stated, supply chain professionals can opt to use an application provider who has built a lot of AI/machine learning into their solution. Even if this is the chosen option, organizations still need to put a lot of thought into the organization, role, skills and collaborative work structure. Your business data will not be the public data. To interpret data for your own needs, you will need skills in-house, even with experienced external support.
We all have a lot to learn in the new world of AI. We will learn it together as a team — IT and the users. But we have to start now.
References / More Reading:
Artificial Intelligence / Machine Learning Collection
1 Writing programs is the job description and what the company thinks it’s paying them for. The desired result is automation.
2 Thus, a major cause of Y2K.
3 This process was made more difficult due to lack of documentation.
4 And many of those data scientists, who are actually software engineers, are not too happy with how they spend their day.
5 Read Business Transformed.
6 At a recent user conference, over 100 people attended a software release update session, but only two of us (one user and I) attended the product and data management session.
7 Part of the reason people use supply chain networks is the data standardization and translation services they provide.
8 For a really impressive example of big data, watch Seeing the Beginning of Time, a Thomas Lucas production.
9 In documents or on the web.
10 Taxonomy relates to the orderly classification of terms and phrases according to their presumed natural relationships and groups (just like you learned in high school biology about plants).
11 AI platforms such as TensorFlow, Infosys Nia, SAS, Microsoft Azure; programming languages such as Python, Lisp, SQL; predictive analytics and many more tools that might be in the portfolio.
12 Semantic tools: RDF, RDFS, OWL, SPARQL, SHACL, R2RML, JSON-LD, and PROV-O.
13 In time, the market will organize itself, and in so doing, will simplify the AI/machine-learning platforms and robust functional applications that incorporate AI/machine learning. A few companies are there already, which we mentioned in footnote 11. Alternatively, users can opt to work with the application providers to ensure the data is curated and the knowledge-broadening power of machine learning is applied to their relevant industry sector and business processes.