Being A Data Analytics Swiss Army Knife - My Experience As An Intern In Deloitte France

I recently completed a 6-month internship in France at Deloitte, and I can confidently say that it was one of the most fulfilling, challenging, and important experiences for me in many ways.

Finding the internship

I began the internship search close to 8 months before my eventual start in June 2018. In fact, I first began the hunt while I was on my student exchange in Paris, France during Academic Year 17/18.

Finding an internship as a foreigner is honestly not easy. I used multiple methods to get my resume through the (digital) door, such as cold emails, LinkedIn listings, AngelList and Welcome to the Jungle. There are also other French job boards, such as JobiJoba and Cadremploi, but I personally did not use them much.

Another strategy that I adopted was to apply through the career portal of each CAC 40 (the French counterpart of the Dow Jones index) company that I was interested in. Many of these companies are very traditional in their hiring practices and do not cast their net widely by posting their listings to job boards.

In the end, after applying to over 250 different internship positions located in Paris, I had a callback rate of around 2 percent, with an eventual offer at Deloitte France. I do think that it is beneficial to apply to companies that have an international and diverse workforce, as they would be accustomed to the hiring process for overseas candidates.

Note: I do recommend taking a look at the list of companies participating in the French Tech Visa. There are many companies (mostly startups) of various sizes that are actively growing, and are open to hiring foreigners.

The Internship

My contract said Stagiaire Data Analyste (which translates to Data Analyst Intern). I accepted the offer from Deloitte France Forensic & Disputes, a department within Deloitte France Financial Advisory.

The department is split between general investigators and tech specialist investigators. The tech-related investigators are then further split into eDiscovery and Analytics. eDiscovery handles forensic data searching for clients, while the Analytics team deals with all forms of data services, from due diligence to Know Your Customer (KYC) services.

ElasticSearch

One of my first assignments was to evaluate the ElasticSearch search engine as an alternative to the Relativity software platform. ElasticSearch is part of the Elastic Stack, which comprises ElasticSearch, Kibana, and LogStash.

A quick description of each:

| Name | Description |
| --- | --- |
| ElasticSearch | Scalable search engine written in Java, exposing RESTful APIs for platform-agnostic interfacing. |
| Kibana | Visualizes the data held in ElasticSearch and reacts to real-time changes in the search index. |
| LogStash | Streams data from data sources to ElasticSearch or other data stores. |

I ended up dockerizing the entire Elastic Stack to experiment with and demo its capabilities. Although the use case was present and viable, shortening keyword search times by a factor of 20, there was some inertia in coming up with a plan to adopt the faster search engine. In the end, nothing came out of the research besides a dockerized proof-of-concept.
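
For readers curious about what "exposing RESTful APIs" means in practice, here is a minimal sketch of how a keyword search could be expressed as an ElasticSearch request body. The helper `build_keyword_query` and the field name `content` are assumptions made up for illustration; they are not taken from the actual proof-of-concept.

```python
import json


def build_keyword_query(keywords, field="content", size=10):
    """Build a search body matching documents that contain any keyword phrase.

    `field` is a hypothetical text field; adjust it to the real index mapping.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # "should" clauses behave like OR: a hit needs at least one phrase.
                "should": [{"match_phrase": {field: kw}} for kw in keywords],
                "minimum_should_match": 1,
            }
        },
    }


body = build_keyword_query(["invoice", "wire transfer"])
# This JSON would be sent as the body of a POST to /<index>/_search.
print(json.dumps(body, indent=2))
```

Because the request is plain JSON over HTTP, any language or tool that can send a POST request can run the same search, which is what makes the interface platform-agnostic.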

Due Diligence Machine Learning Model

When it comes to journal entry investigations, the traditional workflow was to extract the general ledger, and use Excel to manually slice and filter the data according to certain criteria that were "red flags". These "red flags" are usually built up from scratch in collaboration between the clients and the investigators, but there are usually some common red flags that pop up across each of these journal entry investigation projects.

As you can imagine, attempting to use Excel to slice and dice the data into a manageable chunk to select suspicious entries is extremely difficult. Many of these requirements change according to context (such as the country of incorporation of the clients), time (such as corporate restructuring), and scope of focus (for example, some clients are only interested in a particular keyword). This meant that there was very little workflow reuse between projects, with many of these projects being started from scratch each time.
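
To make the idea of "red flags" concrete, here is a minimal sketch of how such criteria could be written as small, swappable rules instead of Excel filters. The three rules, the keyword list and the entry fields are all hypothetical examples, not the criteria used on any real project.

```python
from datetime import date

# Hypothetical red-flag rules: each name maps to a test on one journal entry.
# Real projects would agree on these with the client, as described above.
RED_FLAGS = {
    "weekend_posting": lambda e: e["posted_on"].weekday() >= 5,
    "round_amount": lambda e: e["amount"] >= 1000 and e["amount"] % 1000 == 0,
    "keyword_in_description": lambda e: any(
        kw in e["description"].lower() for kw in ("gift", "cash", "adjustment")
    ),
}


def flag_entry(entry, rules=RED_FLAGS):
    """Return the sorted names of all rules that the entry triggers."""
    return sorted(name for name, rule in rules.items() if rule(entry))


entries = [
    {"posted_on": date(2018, 6, 2), "amount": 5000, "description": "Cash gift"},
    {"posted_on": date(2018, 6, 4), "amount": 123.45, "description": "Office rent"},
]
for entry in entries:
    print(entry["description"], "->", flag_entry(entry))
```

Keeping the rules in a dictionary means a new project can pass its own `rules` argument rather than rebuilding the whole workflow.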

Being the efficiency-oriented person that I am, I created Python scripts that would load and parse a General Ledger into a common format (regardless of which accounting system provider a client used), scrape relevant data that investigators would normally search for manually on Google, and pipe all of that data into a prediction model based on the AdaBoosted Extra Trees classifier. There was some rudimentary text scoring on the journal entry descriptions, but due to time constraints, I was not able to incorporate some NLP techniques that I knew of. The original datasets were based on past efforts to use analytical methods through Excel, where investigators had created a basic risk "score".
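
The "common format" step is the easiest part to illustrate. Below is a minimal sketch, assuming two made-up export layouts (`system_a` and `system_b`) with invented column names; real accounting exports differ, and this only covers the parsing step, not the scraping or the classifier.

```python
import csv
import io

# Hypothetical column names for two accounting-system exports.
COLUMN_MAPS = {
    "system_a": {"JournalDate": "date", "Libelle": "description", "Montant": "amount"},
    "system_b": {"posting_date": "date", "text": "description", "value": "amount"},
}


def normalise_ledger(csv_text, system):
    """Parse a CSV export and return its rows in one common schema."""
    mapping = COLUMN_MAPS[system]
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {common: raw[source].strip() for source, common in mapping.items()}
        # Accept a decimal comma, which is common in French exports.
        row["amount"] = float(row["amount"].replace(",", "."))
        rows.append(row)
    return rows


export_a = 'JournalDate,Libelle,Montant\n2018-06-01,Frais divers,"1250,50"\n'
export_b = "posting_date,text,value\n2018-06-01,Misc fees,1250.50\n"
print(normalise_ledger(export_a, "system_a"))
print(normalise_ledger(export_b, "system_b"))
```

Once every ledger looks the same, the downstream scoring and modelling code only has to be written once.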

The overall result of these efforts was a cut in the time required to produce a selection of medium and high risk entries from a dataset of sometimes hundreds of thousands of entries. Normally, a dataset of 300k to 500k entries would take around 3-4 days of gruelling Excel work, but the Python package could process them within 30 minutes for the exact same result. Investigators could then zoom in on highly suspicious entries much faster. We successfully used the package in a due diligence project in Morocco, which validated that the proof-of-concept worked.

The final hurdle of this project was to develop a solution that respected client confidentiality and could be used by investigators as well as sold to clients as an ongoing service. This was rather challenging, as I had to architect the application to work both completely offline and as an online solution. Due to resource constraints (i.e. I was the only skilled developer able to work on this), I decided to use the Flask framework for the server, and a browser-based user interface built with React. This allowed the application to be either bundled up as a single offline Windows binary with the help of PyInstaller, or split into a frontend (the static files generated by React) and a backend (the server in a Docker container).
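
One detail that makes this dual packaging workable is locating the built frontend files at runtime. Here is a minimal sketch of that lookup, assuming the React build is bundled under a folder named `static` and lives in `frontend/build` during development; both names and the helper `static_root` are hypothetical.

```python
import sys
from pathlib import Path


def static_root(dev_dir="frontend/build"):
    """Return the folder holding the built React files.

    PyInstaller sets `sys.frozen` on bundled apps and, for one-file builds,
    unpacks bundled data into the temporary folder `sys._MEIPASS`.
    The sub-folder name "static" and the default `dev_dir` are assumptions.
    """
    if getattr(sys, "frozen", False) and hasattr(sys, "_MEIPASS"):
        return Path(sys._MEIPASS) / "static"
    # Not bundled: fall back to the development (or container) location.
    return Path(dev_dir).resolve()


print(static_root())
```

The returned path could then be handed to the server, for example through Flask's `static_folder` argument, so the same code serves the UI in both the offline binary and the Docker deployment.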

Due to the time and resource constraints of the project, I was only able to ship the first version of the internal application, which met the user requirements and was well tested and documented. It was a great learning experience that made me pick up many different technologies and frameworks, such as the JavaScript testing framework Jest as a test runner, Puppeteer (a Node API for controlling headless Google Chrome) for UI testing, and many others.

Cryptocurrencies and KYC exploration

My manager and colleagues were interested in exploring the Initial Coin Offering and blockchain space, to evaluate whether it was viable to offer a KYC service for startups that were planning to go through that fundraising route. It was an interesting topic for me, and it led me to research a lot about the Ethereum blockchain and Ethereum Smart Contracts written in Solidity, and to understand how all the cogs in the machine worked. We also reached out to many existing firms in the space, exploring possible partnerships and evaluating our next move in relation to other sibling Deloitte firms.

Unfortunately, I was not able to get any hands-on work relating to the blockchain use case for trust-less situations.

Litigation Claims SaaS tool Proof Of Concept

Halfway through the internship, a new project requirement surfaced asking for a proof of concept for litigation claims monitoring. This was for an internal client, and the requirements did not seem too hard at first. Initially, their goal was to share these litigation claims with multiple parties, such as subsidiaries, sibling and holding companies. Their current process was to work in an Excel sheet, leading to a very error-prone workflow.

In choosing the tech stack for this project, I went with the Meteor framework, which allowed quick prototyping. However, because the requirements were ambiguous and kept expanding as time went on, we were not able to explore much of this POC.

Note: After exploring other languages and web frameworks, I can no longer whole-heartedly recommend the Meteor framework as a quick prototyping web framework.

Text Analysis on Annual Reports

A requirement trickled down to me to find out if we could develop a method to extract certain figures of interest from annual reports. This project was rather interesting, as it involved text extraction from PDFs, and using a pre-trained neural network to extract the figures from sentences.

The initial analysis ran all historical annual reports through a PDF parsing pipeline that leveraged the textract package, and then ranked sentences using a simple keyword scoring method to surface the sentences that we were interested in. Much time was spent just trying to get textract to work nicely with Docker and Windows, as textract only works on Unix and requires very specific libraries to work.
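
A simple keyword scoring method of this kind can be sketched in a few lines. The keywords, their weights and the sample text below are invented for illustration, and the sentence splitter is deliberately naive; the input is assumed to be plain text already extracted from a PDF.

```python
import re

# Hypothetical keywords and weights; the real list depends on the figures sought.
KEYWORD_WEIGHTS = {"revenue": 2.0, "net income": 3.0, "million": 1.0}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence, weights=KEYWORD_WEIGHTS):
    """Sum the weights of every keyword occurrence in the sentence."""
    lowered = sentence.lower()
    return sum(w * lowered.count(kw) for kw, w in weights.items())


def rank_sentences(text, top_n=3):
    """Return up to `top_n` (score, sentence) pairs that scored above zero."""
    scored = [(score_sentence(s), s) for s in split_sentences(text)]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]


report = (
    "The group opened two new offices. "
    "Revenue grew to 120 million euros. "
    "Net income reached 15 million euros, while revenue per employee rose."
)
for score, sentence in rank_sentences(report):
    print(score, sentence)
```

Only the top-ranked sentences then need to be passed on to a heavier model for extracting the actual figures.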

I personally would have loved to delve deeper into the project and build on the spaCy model to create a more accurate and intelligent model for extracting the figures from sentences, but unfortunately my last day came too quickly for me to work more on this project.

Documentation (and automated testing) for the team projects

When I arrived at the team, I realised that the team did not have any solid knowledge documentation built up. We had initially set up a SharePoint site for the department to experiment with moving some administrative functions to it, and we had initially used a common OneNote notebook for sharing notes, but there is an unmistakable inertia that develops in any team attempting to adopt a SharePoint site, due to the sheer complexity of the tool and the administrative work required to maintain a site. Sometimes, emails are just easier to manage and handle, especially if there is no strong need for building a "source of truth".

For technical documentation, there were a few challenges. One was the problem of hosting: we were not able to self-host any documentation without authentication being a problem. The only solution that I could come up with was writing a Python script that called the relevant makefile used by Sphinx to generate HTML files, and then opened the generated HTML file in a browser directly (to simulate the files being "hosted" somewhere). I settled on Sphinx (an rst-based generator) instead of MkDocs (a markdown-based generator) because of the auto-documentation feature that built the documentation directly from rst-format docstrings in the code.
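
A script along those lines could look roughly like the sketch below. It assumes the docs live in a folder called `docs` with the default `_build` output folder created by `sphinx-quickstart`; the function names are hypothetical, and on Windows the generated `make.bat` would need to be called instead of `make`.

```python
import subprocess
import webbrowser
from pathlib import Path


def built_index(docs_dir):
    """Path of the generated landing page, assuming the default `_build` folder."""
    return Path(docs_dir).resolve() / "_build" / "html" / "index.html"


def build_and_open(docs_dir="docs"):
    """Run `make html` in the docs folder, then open the result from disk."""
    # check=True raises CalledProcessError if the Sphinx build fails.
    subprocess.run(["make", "html"], cwd=docs_dir, check=True)
    index = built_index(docs_dir)
    # A file:// URI lets the browser read the page without any web server.
    webbrowser.open(index.as_uri())
    return index
```

Calling `build_and_open("docs")` rebuilds the pages and shows them locally, so nothing has to be hosted or put behind authentication.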

Note for the non-technical reader: docstrings are documentation text written in-line with the code, to allow the programmer to understand the code while looking at it.
reStructuredText (rst) and markdown are plain-text markup formats that can easily be converted into other formats.
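
As a small illustration of what an rst-format docstring looks like, here is a made-up function (`risk_band`, with hypothetical thresholds) documented with the `:param:` and `:returns:` fields that Sphinx can pick up.

```python
def risk_band(score, high=0.8, medium=0.5):
    """Translate a numeric risk score into a band.

    :param score: Risk score between 0 and 1.
    :param high: Scores at or above this value are "high".
    :param medium: Scores at or above this value (and below ``high``) are "medium".
    :returns: One of ``"high"``, ``"medium"`` or ``"low"``.
    """
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


# The docstring travels with the function, so tools can read it back.
print(risk_band.__doc__)
```

The text sits right next to the code it describes, and a documentation generator can turn the same text into a formatted reference page.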

As for JavaScript documentation, JSDoc was the simplest way to get started with inline code documentation, but there was no nice way to view the documentation besides looking at the code. I wasn't able to optimize this part of the documentation workflow, but I would have been interested in using this Sphinx plugin from Mozilla to create a better documentation reading experience.

I'm a firm believer in Test-Driven Development, and I attempted to incorporate testing practices into my workflow even if it slowed me down a little. There are many reasons why TDD is absolutely necessary for programmer sanity, but that's a rant for another time. By starting the projects from scratch, I was able to use my favourite tools for TDD. For the JavaScript side, I ended up using Jest, react-testing-library and Puppeteer for my tests. For the Python side, I was mostly using py.test to manage the test cases. I do see tests as a form of self-documentation of user experience, so no team should skimp on them.

In the end, I'm happy knowing that whoever does a deep dive into the code will not have to pull their hair out trying to understand it, and the many test cases that I've written should keep it stable enough for future improvements.

Hiring My Replacement

Nearer to the end of my internship, there came the problem of who would take over my work when I left. It was evident that there was a need for an engineering-oriented person in the team, who could turn theoretical solutions into something usable.

One day I was pulled into a meeting room with someone from Human Resources who was tasked with finding someone to replace me. Being involved in the hiring process was definitely eye-opening for me, and I was asked to help filter and select resumes for HR to pursue. I was also asked to look through hiring platforms for profiles that were interesting enough to pursue.

There are a few noticeable cultural differences between the way French people presented their resumes and those that I had seen in Singapore. There was a very high emphasis on educational qualifications, and I think this resonates very much with French culture, as evidenced by the high competitiveness for places in top educational institutes.

Resumes that hardly mentioned any projects or experience were passed over straight away, regardless of qualification level. People who were able to show a varied skill set and a thirst for knowledge really stood out to me, and I rated their resumes high for follow-up regardless of educational background. Versatility mattered in our situation because, skill level aside, a tech generalist is worth much more to a small team than a specialist. If we were to split all of my roles into individual job positions, we would have a front-end engineer, back-end engineer, data engineer, data scientist, and devops engineer. And maybe a data analyst too. The jackpot would be a skilled generalist, one who can context switch very easily and manage each aspect of the project pipeline without stumbling.

In the end, I was able to participate in a phone interview with a prospective hire, to see whether he would be a fit for the team's needs. Unfortunately, I left the firm before being able to have any meaningful knowledge transfer of the projects to any new hires.

Everything in French?

One question I got a lot when I returned home to Singapore was: Was all my work in French? Not really. There was definitely a large proportion of administrative things that were in French, such as communicating with HR, department meetings and sometimes team meetings. One of my proudest achievements was holding a complete conversation with a Deloitte IT support staff member entirely in French while talking about a SharePoint Active Directory syncing problem. Many of my interactions with IT support were in French as well, and explaining technical problems such as enabling virtualization in the BIOS was definitely tricky.

I do think that the entire 6 months was a huge benefit to my French speaking ability, and an internship like this is a great opportunity to immerse yourself in the language and improve your reading and speaking ability. If you are not at a C1 level, it definitely helps to be in an English-first professional setting.

As a whole...

This internship was a journey of personal and professional growth, with much greater technical challenges than my previous internship. It was both fulfilling and rewarding in so many ways, and I do think that anyone considering an overseas internship should definitely try it out for themselves, no matter how hard it is to get your foot in the door.

Nothing will happen unless you take the first leap.

© 2018-2019 Lee Tze Yiing. All rights reserved.