Introduction and Context:
In e-commerce marketplaces, product data must be enriched before items are listed on the site, because the information supplied by manufacturers, sellers, and merchants is often incomplete or inaccurate. In addition, that information reflects the manufacturer's or seller's perspective rather than the customer's, i.e. it is not written to aid a purchase decision.
To illustrate: a sports shoe comes in 10+ sizes and 12 colour combinations, with different variants for men and women, and the information for each combination has to be spelled out separately. Now imagine the site selling 10K different shoes from different brands with similar combinations. From a usage perspective, some shoes suit people with 'flat feet' and some don't. And that is just shoes; imagine the site selling millions of products across several categories. Clearly, product information has to be gathered, enriched for completeness and accuracy, and presented from a customer's perspective.
Given that an e-commerce site may handle tens to hundreds of millions of products, it does not make business sense to manually validate and curate product information; instead, ML/DL models are deployed at scale. Typically, different ML/DL models serve different purposes: classifying a product into the appropriate taxonomy category or sub-category, and, often, separate models for predicting attributes such as Colour or Size when information is missing or inaccurate. A combination of language models, including fastText, NER taggers, BiLSTMs, and BERT (an early large language model), is used for training and inference. With the advent of GPT and ChatGPT and the buzz around Gen AI and LLMs, the question companies face is not only how to make effective use of LLMs but also what the future holds for existing ML/DL models, notwithstanding the well-known adage "If it ain't broke, don't fix it." Some of the questions being asked: Will the ML and DL models be subsumed by LLMs? With the advent of 'foundation models' like LLMs, will there still be a need for humans to label data, and for content curators and tooling to support supervised learning? This article offers perspectives on the challenges and opportunities involved.
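To make the attribute-prediction task concrete, here is a minimal sketch of the kind of baseline that often sits in front of an ML model: a rule-based extractor that fills in a missing Colour attribute from the product title. The colour list and the example titles are hypothetical, not taken from any real catalogue; a trained classifier would handle the cases this baseline misses.

```python
import re

# Hypothetical colour vocabulary; a real system would use a much larger,
# taxonomy-specific list and fall back to an ML model for harder cases.
KNOWN_COLOURS = {"black", "white", "red", "blue", "navy", "grey", "green"}

def predict_colour(title):
    """Return the first known colour mentioned in the title, else None."""
    for token in re.findall(r"[a-z]+", title.lower()):
        if token in KNOWN_COLOURS:
            return token
    return None

print(predict_colour("Acme Running Shoe - Men's, Navy/White, Size 10"))  # navy
print(predict_colour("Acme Widget, Size 10"))  # None
```

Baselines like this are cheap to run at catalogue scale, which is why attribute models are usually layered: rules first, learned models for the long tail.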
Perspectives on Solutions:
Broadly speaking, machine learning (ML/DL/LLM) tasks fall into a couple of main categories: behaviour prediction, and content semantic understanding and generation. In both categories, models have evolved over multiple generations, from feature-based algorithms to deep architectures. Language models serve language-related tasks specifically; LLMs are more powerful versions of language models, but that does not mean they are designed for every ML task. A majority of ML tasks outside the language domain (propensity models for churn, recommendation, fraud detection, algorithmic trading, image recognition, etc.) will continue to be handled by models designed for those purposes, and industry demand for these tasks will certainly continue to grow. Like language models, these non-language models will evolve as well. What LLMs will likely replace are the language models with lower ROI; several narrow language models could also be consolidated into fewer LLMs. Clearly, it only makes sense to use a model when the return on investment outweighs the cost, and LLMs are expensive to train and run today, so companies have to figure out a strategy to deal with that.
There are, however, key privacy and compliance challenges in leveraging LLMs in the enterprise, including employees querying third-party services and inadvertently sharing confidential information. Given the privacy, compliance, cost, and complexity involved, it is quite likely that companies will deploy 'local' LLM services via transfer learning on open-source models such as Llama 2, Flan-T5, or Falcon, which serve as a foundational substrate and are further trained on company-specific data. In fact, large enterprises have the opportunity to leverage large volumes of data generated in-house and with the clients they work with. However, they commonly struggle to aggregate data across siloed datasets even within their own organization: a global enterprise may be unable to centrally aggregate data generated by branches in different jurisdictions, or may face barriers to sharing data between internal teams. Unstructured text data is also sensitive; it may contain company secrets or customer PII that needs to be carefully tracked and monitored for compliance. As regards learning, federated learning can be highly effective in addressing these data-collection challenges: with it, large enterprises can train and fine-tune LLMs across a series of siloed datasets. Federated learning has already been deployed at scale by leading-edge companies.
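The core mechanic of federated learning can be sketched in a few lines. In the FedAvg scheme, each silo trains on its own data and sends back only model parameters, which the server averages weighted by dataset size; raw data never leaves the silo. The weight vectors and silo sizes below are toy values for illustration only.

```python
# Minimal FedAvg sketch: average per-silo parameter vectors,
# weighted by how many examples each silo trained on.
def fed_avg(silo_weights, silo_sizes):
    total = sum(silo_sizes)
    dim = len(silo_weights[0])
    return [
        sum(w[i] * n for w, n in zip(silo_weights, silo_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical silos: one with 100 examples, one with 300.
silos = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]
print(fed_avg(silos, sizes))  # [2.5, 3.5]
```

In practice the "weights" are the parameters (or parameter deltas, e.g. LoRA adapters) of the model being fine-tuned, and rounds of local training and averaging repeat until convergence; the averaging step itself is exactly this size-weighted mean.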
Challenges and Opportunities:
With the emergence of the new era of LLMs, pretraining primarily involves unstructured, unsupervised Internet data. This shift has led to a perception that we have moved beyond the human-labelling era and can avoid the associated human effort, time, and money. The development is exciting and aligns with the longstanding goal of the weakly-, semi-, and self-supervised learning community. But there are challenges in fully moving away from the "human in the loop", and here is why. OpenAI has acknowledged the difficulty of "aligning" a GPT model so that it generates outputs that are helpful, harmless, and truthful. Human-generated data often contains dangerous, violence-inciting, and unethical content; since GPT models are trained on such data, these issues are not surprising and should be expected. To address them, GPT models employ a technique called reinforcement learning from human feedback (RLHF), whose fundamental idea is to fine-tune a pre-trained GPT model on a set of human-labelled preference data.
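The role of that human-labelled preference data can be made concrete with the pairwise loss commonly used to train an RLHF reward model: given a human label saying one response was preferred over another, the reward model is trained to minimise -log(sigmoid(r_chosen - r_rejected)), pushing it to score the preferred response higher. The scores below are illustrative numbers, not outputs of any real model.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected))."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# When the reward model already scores the human-preferred response higher,
# the loss is small; when it scores the rejected response higher, it is large.
print(round(preference_loss(2.0, 0.0), 4))  # ~0.1269
print(round(preference_loss(0.0, 2.0), 4))  # ~2.1269
```

Every training pair here comes from a human comparing two model outputs, which is why RLHF keeps humans firmly in the loop even for models pretrained on unlabelled data.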
The most effective use of LLMs relies on the quality of the prompts: a carefully designed prompt can unlock much of an LLM's power. For instance, few-shot prompting, providing the LLM with examples, has been shown to substantially improve answer quality, again requiring a human in the loop. LLMs also tend to be more confident than they should be, especially when their answers are wrong, uninformative, or hallucinated, which again calls for human intervention. And, as mentioned earlier, if companies build their own LLMs on an open-source substrate using transfer learning and customize them for specific domains, human-curated data is essential for fine-tuning. Fine-tuning involves training additional layers on top of the pre-trained model; when using this approach, it is important to consider and address potential disadvantages such as limited flexibility, bias, and data-privacy concerns.
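Few-shot prompting for the product-data use case might look like the sketch below: a handful of human-curated examples are prepended to the query so the LLM sees the expected output format. The example titles and the prompt wording are hypothetical, and the resulting string would be sent to whatever LLM API the company uses.

```python
# Human-curated demonstrations: each pairs a product title with the
# attribute answer in the exact format we want the model to emit.
EXAMPLES = [
    ("Acme Trail Runner, Men's, Navy/White, Size 10", "Colour: Navy/White"),
    ("Zephyr Court Sneaker, Women's, All Black, Size 7", "Colour: Black"),
]

def build_prompt(title):
    """Assemble a few-shot prompt ending where the model should answer."""
    lines = ["Extract the colour attribute from each product title."]
    for example_title, answer in EXAMPLES:
        lines.append(f"Title: {example_title}\n{answer}")
    lines.append(f"Title: {title}\nColour:")
    return "\n\n".join(lines)

print(build_prompt("Nimbus Road Shoe, Men's, Crimson Red, Size 9"))
```

The demonstrations themselves are human-labelled data, which is the point: even a "zero-training" prompting workflow quietly depends on curated examples.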
Conclusion and some food for thought:
In the time to come, both LLMs and non-language models will continue to evolve, and LLMs will pervade many use cases across multiple domains. LLMs are also likely to be used in conjunction with other ML and DL models, either downstream or upstream, to solve interesting use cases. Safeguarding data privacy and regulatory compliance will remain a challenge even with 'local' LLMs, since LLMs can themselves become a source of data leakage. It is quite likely that hybrid systems will emerge in which LLMs and human decision-makers evolve together.