Book summary – The Signal and the Noise

Risk versus uncertainty

  • Risk can be mathematically modeled to yield a probability
  • Uncertainty cannot be mathematically modeled

Conditions for quality data

Why Google's search data is better than Facebook profile data

  • Subject feels she has privacy
  • Subject feels she is not judged
  • Subject sees a tangible benefit from being honest

The hedgehog versus the fox

  • The hedgehog approaches reality through a narrative/ideology while the fox thinks in terms of probabilities
  • The hedgehog goes very deep in an area while the fox employs multiple different models
  • The fox is a better forecaster than the hedgehog
  • The fox is more tolerant of uncertainty

Big data

  • More data does not automatically yield better results and predictions
  • Selecting the right kind of data from the abundance available is key
  • Prediction should start from intuition, and models should be kept simple
  • Qualitative data should be weighted and considered
  • Be aware of your own biases

Prediction

  • Similarity scores – clustering in Netflix and baseball
  • Be wary of confirmation biases
  • Be wary of overfitting with small sample sizes – Tokyo earthquakes and global warming
  • Correlation does not equal causation
  • Shorthand heuristics to reduce the computational space – for example, chess
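The similarity-score idea above (as used in Netflix-style recommendations and baseball player comparisons) can be sketched as cosine similarity between feature vectors. The players and stats below are invented purely for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical per-season batting stats: [home runs, batting average x 1000, walks]
players = {
    "A": [35, 280, 60],
    "B": [33, 275, 58],
    "C": [5, 310, 90],
}

# Rank the remaining players by similarity to player "A"
scores = {name: cosine_similarity(players["A"], stats)
          for name, stats in players.items() if name != "A"}
most_similar = max(scores, key=scores.get)
```

Clustering for comparables then amounts to grouping players (or viewers) whose pairwise similarity exceeds some threshold.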

Related references

  • Irrational Exuberance, Robert Shiller
  • Expert Political Judgment, Philip E. Tetlock
  • Future Shock, Alvin and Heidi Toffler
  • Principles of Forecasting, J. Scott Armstrong
  • Predicting the Unpredictable, Susan Hough

Insights from dinner with Josh

Grepsr is increasingly being used in the workplace by Quid.

Business people that don’t know how to code use Grepsr to pull data.

There is increasing demand for DIFFs to identify thematic trends. Themes are extracted from articles through the use of NLTK.

The proliferation of machine learning libraries and the maturing of the semantic web is democratizing the access to insights.

The legalization of online sports betting has opened fertile ground for this trend towards democratization.

NBA basketball predictive modeling should be done at the player level instead of the team level, as team-level data becomes too lossy.

Sportsbooks set their opening-line odds to encourage even bets on both sides. The closing-line odds are a weighted average of the bets (signals) from the crowd.
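The closing-line idea can be illustrated with a toy calculation: treat each bet as a vote for its implied probability and take the stake-weighted average. All numbers are invented:

```python
# Each tuple: (bettor's implied probability of a home win, stake in USD)
bets = [(0.55, 100), (0.60, 300), (0.48, 100)]

total_stake = sum(stake for _, stake in bets)

# Closing line as the stake-weighted average of the crowd's signals
closing_prob = sum(p * stake for p, stake in bets) / total_stake

# Equivalent decimal odds, ignoring the bookmaker's vig
closing_odds = 1 / closing_prob
```

Here the large USD 300 bet pulls the consensus toward 0.60, giving a closing probability of 0.566 rather than the simple average of the three opinions.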

3rd June 2019 – Post-effects of the 13th May 2019 US/China trade war on US markets

The following is a list of 19 companies, either in the tech sector or with market capitalization above USD 20 billion, that experienced a more than 10% drop in share price on 13th May 2019. Below is the breakdown in terms of price performance on 15th May 2019.
  • 2 companies (10.5% of sample) not related to trade war
  • 2 companies (10.5% of sample) fully recovered
  • 7 companies (36.8% of sample) at least 50% recovery
  • 10 companies (52.6% of sample) not recovered by at least 50%
Below is the breakdown in terms of price performance on 3rd June 2019.
  • 2 companies (10.5% of sample) not related to trade war
  • 2 companies (10.5% of sample) fully recovered
  • 15 companies (78.9% of sample) not recovered by at least 50%

Insights from Klaren’s birthday

Conversations with Yi (EverString)

The forthcoming trend for engineering

Machine learning is increasingly becoming commoditized, and DevOps is becoming more important. Demand for specialized services that encapsulate DevOps will continue to increase as demand for engineering tasks further outstrips the supply of engineers.

On lead generation market

Companies in the lead-generation space need scalable web crawlers. Outsourcing this helps offset the cost of retaining three in-house engineers.

The lead-generation space has consolidated: there were previously 120k such companies, and about 7k remain in operation. The majority of players generate leads by scraping LinkedIn.

The consumer space requires constant development of new features. The enterprise space is service-heavy: it requires not just lead generation but an entire channel-marketing service suite (physical mail, online advertising, email marketing).

Lead-gen customers are hard to retain because a list becomes less valuable once it has been used; 80% yearly churn is normal. One company reduced yearly churn to just 10% by cutting the second-year subscription from USD 800/yr to USD 200/yr, with a further discount to USD 100/yr if the customer is unhappy. The recurring service grabs fresh leads from the same data source.
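A quick back-of-the-envelope check on why the steep renewal discount can pay off, using the churn and price figures from the note. This is a simplified cohort model (churn applied once at each yearly renewal, no discounting of future cash):

```python
def three_year_revenue(year1_price, renewal_price, churn_rate, customers=100):
    """Expected revenue over 3 years for a cohort that churns at each renewal."""
    revenue = customers * year1_price
    surviving = customers
    for _ in range(2):  # two renewal cycles
        surviving *= (1 - churn_rate)
        revenue += surviving * renewal_price
    return revenue

# Keep USD 800/yr renewals but suffer 80% yearly churn
high_price = three_year_revenue(800, 800, 0.80)

# Drop renewals to USD 200/yr, cutting churn to 10%
low_price = three_year_revenue(800, 200, 0.10)
```

Under these assumptions the discounted-renewal scheme earns more over three years (114,200 vs 99,200 per 100 customers), because retention compounds.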

On Tele conference

Zoom's product team, compared with UberConference's, has developed a better understanding of the true conferencing needs of their users in various contexts. They have worked harder to ensure their product works seamlessly in the identified scenarios. A typical example is the ability to join a conference by pressing a single button on a mobile phone while driving, instead of having to type the usual 4-digit PIN.

Tesla autonomy

https://www.youtube.com/watch?v=tbgtGQIygZQ

The Mission

Building and optimizing the entire infrastructure (hardware and software) from the ground up, with autonomous self-driving as the mission.

Mission and decision making

Design decisions trade off functionality against cost to achieve the mission while keeping costs under control.

  • Lidar is not useful when cameras are available
  • Driving cars with HD mapping makes the entire operation brittle, since actual road conditions can change

Operating structure

  • Data Team
  • Hardware Team
  • Software Team

The Data model

  • Cars on the road are constantly collecting new data
  • New data is used to train and improve the neural network model
  • The improved model is constantly deployed back to the cars to improve self-driving
  • Real-world data provides visibility into long-tail scenarios that simulated data cannot; simulating long-tail scenarios is an intractable problem
  • Balancing between the data model and software:
    • A neural network is suitable for problems that are hard to solve by defining functions/heuristics
    • Simple heuristics are better handled through code in software
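The division of labor in the last bullet can be caricatured in code: perception-style questions go to a learned model, while rules with simple closed-form logic stay as plain heuristics. The model below is a stand-in stub, and the braking numbers are illustrative, not Tesla's:

```python
def model_confidence(image):
    # Stub standing in for a trained neural network's confidence output
    return 0.95 if "person" in image else 0.05

def learned_is_pedestrian(image):
    # "Is this a pedestrian?" is hard to express as hand-written rules,
    # so it goes to the learned model
    return model_confidence(image) > 0.9

def heuristic_should_brake(distance_m, speed_mps):
    # A simple physics rule is better written directly in code than learned
    stopping_distance = speed_mps ** 2 / (2 * 7.0)  # assume ~7 m/s^2 braking
    return distance_m < stopping_distance * 1.5     # 50% safety margin
```

The point is that each side handles what the other does badly: no one hand-writes pedestrian detection, and no one trains a network to apply the stopping-distance formula.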

Future revenue model

Robo-taxi that will disrupt the ride-sharing space.

  • Consumer car – USD0.60 / mile
  • Ride sharing – USD 2-3 / mile
  • Tesla Network – USD 0.18 / mile
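A quick sanity check on what these per-mile figures imply over a year. The annual mileage is an assumption for illustration; the ride-sharing figure uses the midpoint of the USD 2-3 range:

```python
miles_per_year = 15_000  # assumed annual mileage

consumer_car  = 0.60 * miles_per_year  # owning a consumer car
ride_sharing  = 2.50 * miles_per_year  # midpoint of USD 2-3 / mile
tesla_network = 0.18 * miles_per_year  # projected Tesla Network cost
```

At these rates the projected Tesla Network cost (USD 2,700/yr) undercuts even car ownership (USD 9,000/yr), which is the basis of the disruption claim.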

Main challenges:

  • Legal – need more data and processing time to get approved
  • Battery capacity
  • Social norms around robo-taxis

Insights from the week

From Connie (Edmodo)

  • The key to consulting is to organize data into high-level, mutually exclusive buckets that decision makers can digest easily

From Tim (Edmodo)

  • Kano model

From Val (Totango)

  • The company is concerned with increasing revenue and profitability, which will drive a higher valuation at a future exit

From Yip (ATT)

  • Analytics from Facebook page comments and Twitter hashtags
  • Need to balance customer-support demand against the cost of running the department:
    • Customer-support hotline
    • Routing comments from influencers that trigger negative sentiment directly to support staff
  • Business analysts read comments manually to extract qualitative needs and understand business needs
  • Data scientists explore the data but might not know the business needs
  • Business analysts have problems working with data scientists
  • Tools should help business analysts get directly at insights instead of going through data scientists
  • Build a model to predict call-support volume by category
  • Build a model to quantify feature demand levels
  • Correlation of weather and commodity prices

Insights on managing Big Data from meet up with Dean and Ved

From Dean (Reputation.com)

  • Enterprise sales as an acquisition strategy is feasible because revenue per account ranges in the millions of USD – e.g. USD 70 million
  • Once an auto company like Ford or GM signs up, they will start bringing their dealerships in
  • The infrastructure needs to be able to support the size of the data, which can run to billions of rows
  • Scaling the infrastructure to handle the ever-increasing data load becomes critical for the continued growth of a data company
  • The data product will appear broken when a user attempts to generate a report while the data is still being written into the database
  • The key challenge is that different solutions suit different operations
  • Types of data operation include
    • writing into the database
    • reading from the database
    • map reduce to generate custom views of the data to support different types of reporting for different departments in the client companies
  • Successful data companies will create different layers of data management solutions to cater to the different data needs
    • MongoDB
      • good for storing relatively unstructured data
      • querying is slow
      • writing is slow
      • good for performing map reduce
    • Elastic Search
      • good for custom querying for data
  • DevOps becomes a very important role
    • Migration of data between different systems can take weeks to complete
    • A bad map-reduce query in the code will start causing bottlenecks in reading and writing, causing the data product to fail
    • DevOps engineers familiar with the infrastructure might on occasion have to flush out all queries to reset the system
    • The key challenge is finding the bandwidth to flush out bad queries from the codebase
  • Mistakes in hindsight
    • In hindsight, lumping all the data from different companies into the same index on MongoDB does not scale very well
    • It might make better sense to create separate database clusters for different clients
  • Day-to-day operations
    • Hired a very large, 100-strong web-scraping company in India to make sure the web scrapers for customer reviews are constantly up
    • Clients occasionally provide data which an internal engineer (Austin) needs to look through before importing into the relevant database
  • Need to increase revenue volume to gear up for IPO
  • The Catholic church has 10 times more money than Apple and owns a lot of health care companies.
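The map-reduce step mentioned above (generating custom per-department views of the review data) can be sketched in plain Python. The review records and the grouping key are invented for illustration:

```python
from collections import defaultdict

reviews = [
    {"dealership": "Ford SF",    "rating": 5},
    {"dealership": "Ford SF",    "rating": 3},
    {"dealership": "GM Oakland", "rating": 4},
]

# Map: emit (key, value) pairs
mapped = [(r["dealership"], r["rating"]) for r in reviews]

# Shuffle: group values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: average rating per dealership, i.e. one "custom view"
avg_rating = {key: sum(vals) / len(vals) for key, vals in groups.items()}
```

A real MongoDB deployment would run the equivalent map and reduce functions server-side over billions of rows, but the shape of the computation is the same.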

From Dan (Dharma.AI), the classmate of Ved

  • The company currently has 15 customers
  • Customers prefer their solution over open-source software because they can scale the volume of data to be ingested, and the solution comes with an SLA
  • The company provides web, mobile and tablet solutions which client companies' staff can use in the field to collect demographic and research data in developing countries
  • The key challenge is balancing between building features for the platform and building features for specific verticals:
    • Fields differ between industries: the fields in a survey document for a healthcare company will be very different from those for an auto company
    • Fields differ across company sizes: the survey format for one company might differ from another in the same industry but of a different size
    • The interface required differs between companies
  • The original CEO was forced to leave the company; a new CEO was hired by the PE firm to increase revenue volume to gear up for an IPO

From Ved

  • As the number of layers in the hierarchy increases, it becomes increasingly challenging for management to keep up to date on the actual situation in the market
  • The entry of a large established competitor might sometimes serve as an opportunity to ride the wave
  • When Google decided to repackage Google Docs for Education, it was a perfect opportunity for Edmodo to integrate more tightly with Google and ride that trend rather than being left behind
  • Failure to ride the wave will result in significant loss of market share
  • It takes a lot of discipline to decide on just focusing on the core use case and constantly double down on it.
  • Knowing that a critical problem, which could potentially kill the company, exists versus successfully convincing everyone in the company that it is important to address it are two different things.

Insights from visit to far west fungi farm

On mushroom

  • Get wood chips from Petco
  • Use alder or aspen shavings
  • 6 inches to a foot at the bottom
  • Add 2 inches on top every few months
  • Use sundew, a carnivorous plant, to get rid of insects
  • Go to Blue Bottle Cafe for burlap sacks
  • Don't soak spawn for more than 12 hours at a time
  • If too dry, soak, then keep in the air for a day or two
  • Drop some clay in the water to detect chlorine in water used for soaking wood and spawn

Meetups

Useful resources

  • LibGen.IO – site where free books can be downloaded
  • sci-hub.tw – site where free research papers can be downloaded
  • www.bloodhorse.com/horse-racing – site for horse racing statistics
  • Far West Fungi Farm

Navigating the trough of sorrow

While reading through the success stories published on IndieHackers.com, it occurred to me that my project GetData.IO took longer than most to gain significant traction – a full 5 years, actually.

The beginning

I first stumbled upon this project back in December 2012 when I was trying to solve two other problems of my own.

In my first problem, I was trying to identify the best stocks to buy on the Singapore Stock Exchange. While browsing through the stocks listed on their website, I soon realized that most stock exchanges, as well as other financial websites, gear their data presentation towards quick buy and sell behaviors. If you were looking to get data for granular analysis based on historical company performance as opposed to stock price movements, it was like pulling teeth. Even then, important financial data I needed for decision-making purposes was spread across multiple websites. This first problem led me to write 2 web scrapers, one for SGX.com and the other for Yahoo Finance, to extract data-sets which I later combined to help me with my investment decision-making process.

Once I had happily parked my cash, I went back to working on my side project at the time: a travel portal which aggregated all the travel packages from tour agencies located in Southeast Asia. It was not long before I encountered my second problem… I had to write a bunch of web scrapers again to pull data from vendor sites which did not have APIs! Being forced to write my 3rd, 4th and maybe 5th web scraper within a single week led me to put all work on hold and step back to look at the bigger picture.

The insight

Being a web developer, and understanding how other web developers think, I quickly noticed the patterns that repeat themselves across webpage listings as well as nested webpages. This is especially true of naming conventions when it comes to CSS styling: developers tend to name their CSS classes the way they would name actual physical objects in the world.

I figured that if there existed a program-independent Semantic Query Language, it would provide the benefit of querying webpages as if they were database tables while providing a clean abstraction of the schema from the underlying technology. These two insights still hold true today, 6 years into the project.
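A minimal illustration of the insight, using only the standard library: extract every element carrying a given semantic CSS class, treating the page like a table column. The HTML snippet and the class name are invented, and a real semantic query engine would of course do far more:

```python
from html.parser import HTMLParser

class ClassExtractor(HTMLParser):
    """Collect the text of every tag that carries a given CSS class."""
    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.capturing = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.capturing = True
            self.results.append("")

    def handle_data(self, data):
        if self.capturing:
            self.results[-1] += data

    def handle_endtag(self, tag):
        self.capturing = False

# Developers name classes after real-world objects ("price", "title"),
# which is what makes a semantic query like this possible
html = '<ul><li class="price">$120</li><li class="price">$95</li></ul>'
parser = ClassExtractor("price")
parser.feed(html)
```

Querying for the class `price` here yields the column `["$120", "$95"]`, much like a `SELECT price FROM page`.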

The trough of sorrow

While the first 5 years depicted in the trend line above seem peaceful due to a lack of activity, it felt anything but peaceful. During this time, I was privately struggling with a bunch of challenges.

Team management mistakes and pre-mature scaling

First and foremost was team management. At the inception of the project, an ex-schoolmate from years ago approached me to ask if there was any project he could get involved in. Since I was working on this project, it was natural for me to invite him to join. We soon got ourselves into an incubator in Singapore called JFDI.

In hindsight, while the experience provided us with general knowledge and friends, it really felt like going through a whirlwind. The most important piece of knowledge I came across during the incubation period was a book recommendation: The Founder's Dilemmas. I wish I had read the book before I made all the mistakes I did.

There was a lot of hype (see the blip in mid-2013), tension and stress during that period between me and my ex-schoolmate. We went our separate ways shortly after the JFDI Demo Day, due to differences in vision of how the project should proceed. It was not long before I grew the team to a size of 6 and then disbanded it, realizing it was naive to scale the team before figuring out the monetization model.

Investor management mistakes

During this period of time, I also managed to commit a bunch of grave mistakes which I vow never to repeat again.

Mistake #1 was being too liberal with stock allocation. When we incorporated the company, I naively believed the team would stay intact in its then configuration all the way through to the end. The cliff before vesting began was only 3 months, with full vesting occurring in 2 years. When my ex-schoolmate departed, the cap table was in a total mess, with a huge chunk owned by a non-operator and none left for future employees without significant dilution of existing folks. This was the first serious red flag when it came to fundraising.

Mistake #2 was giving away too much of the company for too little, too early in the project, before achieving critical milestones. This was the second serious red flag that turned off would-be follow-on investors.

Mistake #3 was not realizing the difference in mindset between investors in Asia and those in Silicon Valley, and thereafter picking the wrong geographical location (a.k.a. network) to incubate the project. Incubating a project in the wrong network can be really detrimental to its future growth. Asian investors are inclined towards investing in applications with a clear path to monetization, while Silicon Valley investors are open to investing in deep technology whose path to monetization is not yet apparent. During the subsequent period, I saw two similar projects incubated and successfully launched via Y Combinator.

The way I fixed the three problems above was to acquire the funds I didn't yet have by taking up a day job, while relocating the project back to the Valley's network. I count my blessings for having friends who lent a helping hand when I was in a crunch.

Self-doubt

I remember the conversation with the head of the incubator two years into the project, during a visit back to Singapore, when he tried to convince me the project was going nowhere and that I should just throw in the towel. I managed to convince him, and more importantly myself, to give it a go for another 6 months, till the end of the year.

I remember the evenings and weekends alone in my room while not working on my day job. In between spurts of coding, I would browse the web or sit staring at the wall, trying to envision what product-market fit would look like. As Steve Jobs once mentioned in a lecture, it felt like pushing against a wall with no sign of progress or movement whatsoever. If anything, there was a lot of frustration, self-doubt and dejection. A few times, I felt like throwing in the towel and just giving up. For a period of 6 months in 2014, I actually stopped touching the code in total exasperation and just left the project running on auto-pilot, swearing never to look at it again.

The hiatus was not to last long though. A calling is just like a siren: even if somewhat faint at times, it calls out to you in the depths of night or while strolling along the serene beaches of California. It was not long before I was back on my MacBook, plowing through the project again with renewed vigor.

First signs of life

It was mid-2015, and the project was still not showing signs of any form of traction. I had by then stockpiled some cash from my day job and was starting to get interested in acquiring a piece of real estate, with the hope of generating some cashflow to bootstrap the project while freeing up my own time. It was during this period that I got introduced to my friend's roommate, who also happened to be interested in real estate.

We started meeting on weekends and using GetData.IO to gather real estate data for our investment purposes. We were gonna perform machine learning on real estate. The scope of the project was really demanding. It was during this period of dogfooding that I started understanding how users would use GetData.IO. It was also then that I realized how shitty and unsuited the infrastructure was for the kind and scale of data harvesting required by projects like ours. It catalyzed a full rewrite of the infrastructure over the course of the next two years, and brought the semantic query language to maturity.

Technical challenges

Similar to what Max Levchin mentioned in the book Founders at Work, during this period there was always this fear in the back of my mind that I would encounter technical challenges that would be unsolvable.

The site would occasionally go down as we scaled the volume of daily crawls. I would spend hours on weekends digging through the logs, attempting to reproduce the error so as to understand the root cause. The operation was like a (data) pipeline: scaling one section of the pipeline without addressing the sections further down would inevitably cause fissures and breakage. Some manual calculus in the head was always needed to figure out the best configuration to balance volume against costs.

The number 1 hardest problem I had to tackle during this period was caching and storage. As the volume of data increased, storage costs increased, and so did the wait time before data could be downloaded. This problem brought down the central database a few times.
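The flavor of caching that eases download wait times can be sketched as follows, assuming results can be keyed by a hash of the query. The cache location and TTL are illustrative, not the actual in-house solution:

```python
import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path("cache")  # illustrative location
TTL_SECONDS = 3600         # illustrative freshness window

def cached_fetch(query, fetch_fn):
    """Return a cached result for `query` if still fresh, else recompute and store."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(query.encode()).hexdigest()
    path = CACHE_DIR / (key + ".json")
    if path.exists() and time.time() - path.stat().st_mtime < TTL_SECONDS:
        return json.loads(path.read_text())
    result = fetch_fn(query)  # e.g. run the actual crawl
    path.write_text(json.dumps(result))
    return result
```

Repeated requests for the same query within the freshness window then skip the expensive fetch entirely, which is exactly the wait-time problem described above.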

After procrastinating for a while as the problem festered in mid-2016, I decided it was to be the number 1 priority. I spent a good 4 months going to big-data and artificial intelligence MeetUps in the Bay Area to check out the types of solutions available for the problem. While no suitable solution was found, the 4 months helped elicit corner cases of the problem which I had not previously thought of. I ended up building my own in-house solution.

Traction and Growth

An unforeseen side effect of solving the storage and caching problem was its effect on SEO. The effects would not be visible until mid-2017, when I started seeing an increased volume of organic traffic to the site. As load times dropped from more than a minute in some cases to less than 400 milliseconds, the volume of pages indexed by bots increased, accompanied by an increase in visitors and a reduction in bounce rates.

Continued education

It was in early 2016 that I came across an article by Paul Graham expounding the benefits of reading widely and deeply, which prompted me to pick up my hobby of reading again. A self-hack demonstrated by the same friend who helped relocate me to the Bay Area, which I pursued vehemently, got me reading up to 1.5 books a week. These are books I summarized on my personal blog for later reference. All of this learning developed my mental model of the world and greatly aided the way I tackled the project.

Edmodo's VP of engineering hammered in the importance of not boiling the ocean when attempting to solve a technical problem, and of always being judicious with the use of resources, during my time working as a tech lead under his wing. Another key lesson learned from him is that in some circumstances being liked and being effective do not go hand in hand. As the key decision maker, it is important to steadfastly practice the discipline of being effective.

Head of Design Tim and Lukas helped me appreciate the significance of UX, and how it ties to user psychology, during my time working with them.

Edmodo's CEO introduced us to mindfulness meditation in late 2016 to help us weather the turbulent times the company was going through then. It was rough. The practice, which I have kept up to this day, has helped keep my mind balanced while navigating the uncertainties of the path I am treading.

Edmodo's VP of product sent me on a course in late 2017 which helped consolidate all the knowledge I had acquired till then into a coherent whole. The knowledge gained has greatly accelerated the progress of GetData.IO. During the same period, he also introduced me to the Vipassana meditation practice, which, coincidentally, a large percentage of the management team practices.

One very significant paradigm shift I observed in myself during this period of continued education is in the relationship between myself and the project. It changed from an attitude of urgently needing to succeed at all costs to one of open curiosity and fascination, as one would approach an open-ended science project.

Moving forward

To date, I have started working full time on the project again. GetData.IO has the support of more than 1,500 community members worldwide. Our mission is to turn the Web into the fully functional Giant Graph Database of Human Knowledge. Financially, with the help of our community members, the project is now self-sustaining. I feel grateful for all the support and lessons gained during this 6-year journey, and I look forward to the road ahead as I continue along my path.

Reflections on decentralized file storage

Pursuing the GetData.IO concept to its logical conclusion, one sees a wall where three problems need to be solved to ensure no mega-silo is created as a by-product:

  • decentralized data storage for retrieval
  • decentralized data aggregation
  • economic viability

Reference

https://medium.com/fragments-network/state-of-decentralized-file-storage-mid-2018-27bd5664f3b7